Python - Count Occurrences in String

Key Insights

  • Python offers multiple methods to count substring occurrences, from basic count() to regex patterns for complex matching scenarios
  • Understanding the difference between overlapping and non-overlapping matches is critical for accurate counting in text processing applications
  • Performance differs between methods: count() is typically fastest for plain substrings, while regex trades some speed for flexibility

Using the Built-in count() Method

The count() method is the most straightforward approach for counting non-overlapping occurrences of a substring. It’s a string method that returns an integer representing how many times the substring appears.

text = "Python is awesome. Python is powerful. Python is everywhere."
substring = "Python"

occurrences = text.count(substring)
print(f"'{substring}' appears {occurrences} times")  # Output: 'Python' appears 3 times

# Case-sensitive counting
mixed_case = "Python python PYTHON PyThOn"
print(mixed_case.count("Python"))  # Output: 1
print(mixed_case.count("python"))  # Output: 1

The count() method accepts optional start and end parameters to limit the search range:

text = "abcabcabcabc"
print(text.count("abc"))           # Output: 4
print(text.count("abc", 0, 6))     # Output: 2 (searches text[0:6]; the end index is exclusive)
print(text.count("abc", 3))        # Output: 3 (search from index 3 to end)
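
The start and end arguments behave like slice bounds: the end index is exclusive and negative indices count from the end of the string. A minimal sketch of that equivalence, reusing the sample string above:

```python
text = "abcabcabcabc"

# count(sub, start, end) gives the same result as counting in text[start:end]
print(text.count("abc", 0, 6) == text[0:6].count("abc"))  # Output: True

# The end index is exclusive: text[0:5] is "abcab", which holds only one full "abc"
print(text.count("abc", 0, 5))   # Output: 1

# Negative indices count from the end, as in slicing: text[-6:] is "abcabc"
print(text.count("abc", -6))     # Output: 2
```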

Case-Insensitive Counting

For case-insensitive counting, convert both the text and search string to the same case:

text = "Python PYTHON python PyThOn"
substring = "python"

# Method 1: Convert to lowercase
count_lower = text.lower().count(substring.lower())
print(f"Case-insensitive count: {count_lower}")  # Output: 4

# Method 2: Using a helper function
def count_case_insensitive(text, substring):
    return text.lower().count(substring.lower())

result = count_case_insensitive("API api Api API", "api")
print(result)  # Output: 4
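
For text that may contain non-ASCII letters, lower() is not always enough, because some characters have no one-to-one lowercase form. The sketch below uses str.casefold(), which is intended for caseless matching; the helper name count_casefold is an illustrative assumption, not something from the examples above:

```python
def count_casefold(text, substring):
    # casefold() is a more aggressive form of lower() meant for caseless comparison
    return text.casefold().count(substring.casefold())

text = "Straße STRASSE strasse"

# lower() leaves "ß" unchanged, so "straße" does not match "strasse"
print(text.lower().count("strasse"))    # Output: 2

# casefold() folds "ß" to "ss", so all three spellings match
print(count_casefold(text, "strasse"))  # Output: 3
```

For plain ASCII input the two approaches give the same counts.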

Counting Individual Characters

Counting a single character works the same way as counting a substring. To tally every character in one pass, use collections.Counter:

text = "mississippi"

# Count specific character
print(text.count('s'))  # Output: 4
print(text.count('i'))  # Output: 4
print(text.count('p'))  # Output: 2

# Count all character occurrences using Counter
from collections import Counter

char_counts = Counter(text)
print(char_counts)  # Output: Counter({'i': 4, 's': 4, 'p': 2, 'm': 1})

# Get count for specific character
print(char_counts['s'])  # Output: 4
print(char_counts['x'])  # Output: 0 (returns 0 for missing keys)
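
Once the Counter exists, it can answer further questions without rescanning the string. A short sketch, assuming the same "mississippi" sample, that ranks characters and totals a group of them:

```python
from collections import Counter

text = "mississippi"
char_counts = Counter(text)

# The two most frequent characters; ties keep first-seen order
print(char_counts.most_common(2))  # Output: [('i', 4), ('s', 4)]

# Total occurrences of any character from a set, here the vowels
vowels = "aeiou"
vowel_total = sum(char_counts[ch] for ch in vowels)
print(vowel_total)  # Output: 4
```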

Handling Overlapping Matches

The count() method only counts non-overlapping occurrences. For overlapping matches, implement a custom solution:

def count_overlapping(text, substring):
    count = 0
    start = 0
    
    while True:
        pos = text.find(substring, start)
        if pos == -1:
            break
        count += 1
        start = pos + 1  # Move by 1 to find overlapping matches
    
    return count

text = "aaaa"
print(f"Non-overlapping 'aa': {text.count('aa')}")  # Output: 2
print(f"Overlapping 'aa': {count_overlapping(text, 'aa')}")  # Output: 3

# Another example
text = "abababab"
print(f"Non-overlapping 'aba': {text.count('aba')}")  # Output: 2
print(f"Overlapping 'aba': {count_overlapping(text, 'aba')}")  # Output: 3
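
An alternative sketch of the same idea tests every start position with str.startswith(). The helper name count_overlapping_simple is hypothetical, and rejecting an empty substring is a design choice made here, not a requirement of the approach above:

```python
def count_overlapping_simple(text, substring):
    # Hypothetical helper: check each possible start index for a match
    if not substring:
        raise ValueError("substring must not be empty")
    last_start = len(text) - len(substring)
    # startswith(sub, i) is True when sub occurs at index i; True counts as 1
    return sum(text.startswith(substring, i) for i in range(last_start + 1))

print(count_overlapping_simple("aaaa", "aa"))       # Output: 3
print(count_overlapping_simple("abababab", "aba"))  # Output: 3
print(count_overlapping_simple("abc", "abcd"))      # Output: 0
```

This version is shorter but checks every index in Python code, so the find()-based loop is likely the better choice for long strings with few matches.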

Using Regular Expressions for Pattern Matching

Regular expressions provide powerful pattern-based counting capabilities:

import re

text = "Contact us at support@example.com or sales@example.com"

# Count email addresses
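# Simplified pattern for illustration; note that "|" inside [...] is a literal pipe, so use [A-Za-z]
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'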
email_count = len(re.findall(email_pattern, text))
print(f"Email addresses found: {email_count}")  # Output: 2

# Count words starting with 's'
text = "she sells seashells by the seashore"
s_words = len(re.findall(r'\bs\w*', text, re.IGNORECASE))
print(f"Words starting with 's': {s_words}")  # Output: 4

# Count digits
text = "Order #12345 contains 3 items, total: $99.99"
digit_sequences = len(re.findall(r'\d+', text))
print(f"Digit sequences: {digit_sequences}")  # Output: 4 (12345, 3, 99, 99)

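For overlapping regex matches, wrap the pattern in a zero-width lookahead. Because the lookahead consumes no characters, the search can report a match at every starting position: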

import re

def count_overlapping_regex(text, pattern):
    return sum(1 for _ in re.finditer(f'(?={pattern})', text))

text = "aaaa"
pattern = "aa"
count = count_overlapping_regex(text, pattern)
print(f"Overlapping regex matches: {count}")  # Output: 3
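
The function above treats its argument as a regex, so a literal substring containing metacharacters such as "+" or "." needs escaping first. A minimal sketch using re.escape(); the helper name count_overlapping_literal is an assumption for illustration:

```python
import re

def count_overlapping_literal(text, substring):
    # Hypothetical helper: escape the substring so metacharacters match literally
    lookahead = f"(?={re.escape(substring)})"
    return sum(1 for _ in re.finditer(lookahead, text))

text = "1+1+1"

# Unescaped, "1+1" is a regex meaning "one or more 1s followed by a 1"
print(sum(1 for _ in re.finditer("(?=1+1)", text)))  # Output: 0

# Escaped, it matches the literal characters "1+1" at positions 0 and 2
print(count_overlapping_literal(text, "1+1"))        # Output: 2
```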

Counting Word Occurrences

Counting whole words requires handling word boundaries to avoid partial matches:

import re
from collections import Counter

text = "the cat and the dog and the bird"

# Method 1: Using split and count
words = text.lower().split()
word_to_find = "the"
count = words.count(word_to_find)
print(f"'{word_to_find}' appears {count} times")  # Output: 3

# Method 2: Counter for all words
word_counts = Counter(words)
print(word_counts)  # Output: Counter({'the': 3, 'and': 2, 'cat': 1, 'dog': 1, 'bird': 1})

# Method 3: Regex with word boundaries
def count_whole_word(text, word):
    pattern = r'\b' + re.escape(word) + r'\b'
    return len(re.findall(pattern, text, re.IGNORECASE))

text = "cat cats cattle"
print(count_whole_word(text, "cat"))  # Output: 1 (only exact matches)
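
The split-based methods above assume words are separated only by whitespace. When punctuation is attached to words, a regex tokenizer can be used instead. A rough sketch with a made-up sample sentence:

```python
import re
from collections import Counter

text = "Cat, cat. CAT! cats"

# split() keeps punctuation attached, so no token equals "cat"
print(text.lower().split().count("cat"))  # Output: 0

# \w+ extracts runs of word characters, dropping the punctuation
words = re.findall(r"\w+", text.lower())
word_counts = Counter(words)
print(word_counts["cat"])   # Output: 3
print(word_counts["cats"])  # Output: 1
```

One caveat: \w+ also splits on apostrophes and hyphens, so "don't" becomes two tokens. Adjust the pattern if that matters for your text.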

Performance Comparison

The three approaches differ in speed. The script below times a single call of each, so treat the results as a rough indication and not a rigorous benchmark:

import re
import time
from collections import Counter

text = "lorem ipsum " * 10000
substring = "ipsum"

# Test count() method
start = time.perf_counter()
result1 = text.count(substring)
time1 = time.perf_counter() - start

# Test regex findall
start = time.perf_counter()
result2 = len(re.findall(substring, text))
time2 = time.perf_counter() - start

# Test Counter (for word counting)
start = time.perf_counter()
result3 = Counter(text.split())[substring]
time3 = time.perf_counter() - start

print(f"count(): {time1:.6f}s")
print(f"regex: {time2:.6f}s")
print(f"Counter: {time3:.6f}s")

# Illustrative output (actual timings depend on hardware and Python version); count() is usually fastest:
# count(): 0.000012s
# regex: 0.001234s
# Counter: 0.002345s
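
Single-run timings are noisy. For steadier numbers, the standard library's timeit module can repeat each call many times. A sketch under the same setup; the repeat counts are arbitrary choices, and no timings are shown because they vary by machine:

```python
import re
import timeit
from collections import Counter

text = "lorem ipsum " * 10000
substring = "ipsum"

candidates = {
    "count()": lambda: text.count(substring),
    "regex": lambda: len(re.findall(substring, text)),
    "Counter": lambda: Counter(text.split())[substring],
}

for name, func in candidates.items():
    # Take the best of 5 repeats, each running the call 20 times
    best = min(timeit.repeat(func, number=20, repeat=5)) / 20
    print(f"{name}: {best:.6f}s per call (result: {func()})")
```

Taking the minimum of several repeats reduces the influence of other processes on the measurement.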

Practical Application: Log File Analysis

Here’s a small example that counts log levels in sample log data:

import re
from collections import Counter

log_data = """
2024-01-15 ERROR: Database connection failed
2024-01-15 INFO: User login successful
2024-01-15 ERROR: Timeout occurred
2024-01-15 WARNING: High memory usage
2024-01-15 ERROR: Invalid request
2024-01-15 INFO: Cache cleared
"""

# Count log levels
error_count = log_data.count("ERROR")
warning_count = log_data.count("WARNING")
info_count = log_data.count("INFO")

print(f"Errors: {error_count}, Warnings: {warning_count}, Info: {info_count}")
# Output: Errors: 3, Warnings: 1, Info: 2

# Using regex to capture every word followed by a colon (in this sample, only the log levels)
log_pattern = r'(\w+):'
log_levels = re.findall(log_pattern, log_data)
level_counts = Counter(log_levels)

print("\nDetailed breakdown:")
for level, count in level_counts.most_common():
    print(f"{level}: {count}")
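
The pattern (\w+): works for the sample above because only the log level is followed by a colon. If messages can contain colons too, anchoring the pattern to the start of each line is safer. A sketch with hypothetical log lines, assuming every entry starts with a YYYY-MM-DD date followed by an uppercase level:

```python
import re
from collections import Counter

# Hypothetical sample: the first message contains a colon of its own
log_data = """
2024-01-15 ERROR: Database connection failed: host unreachable
2024-01-15 INFO: User login successful
2024-01-15 ERROR: Timeout occurred
"""

# The loose pattern also captures "failed" from the message text
loose = Counter(re.findall(r'(\w+):', log_data))
print(loose)  # Output: Counter({'ERROR': 2, 'failed': 1, 'INFO': 1})

# Anchoring to the line start and the date keeps only the level field
level_pattern = r'^\d{4}-\d{2}-\d{2} ([A-Z]+):'
levels = Counter(re.findall(level_pattern, log_data, re.MULTILINE))
print(levels)  # Output: Counter({'ERROR': 2, 'INFO': 1})
```

re.MULTILINE makes ^ match at the start of every line, not only at the start of the string.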

Choose count() for simple substring matching, regex for pattern-based counting, and Counter for frequency analysis across multiple items. Understanding these trade-offs ensures efficient text processing in production applications.
