Python - Count Occurrences in String

Key Insights

Python offers multiple methods to count substring occurrences, from basic count() to regex patterns for complex matching scenarios
Understanding the difference between overlapping and non-overlapping matches is critical for accurate counting in text processing applications
Performance varies significantly between methods: count() is fastest for simple cases, while regex trades speed for flexibility

Using the Built-in count() Method

The count() method is the most straightforward approach for counting non-overlapping occurrences of a substring. It’s a string method that returns an integer representing how many times the substring appears.

text = "Python is awesome. Python is powerful. Python is everywhere."
substring = "Python"

occurrences = text.count(substring)
print(f"'{substring}' appears {occurrences} times")  # Output: 'Python' appears 3 times

# Case-sensitive counting
mixed_case = "Python python PYTHON PyThOn"
print(mixed_case.count("Python"))  # Output: 1
print(mixed_case.count("python"))  # Output: 1

The count() method accepts optional start and end parameters to limit the search range:

text = "abcabcabcabc"
print(text.count("abc"))           # Output: 4
print(text.count("abc", 0, 6))     # Output: 2 (searches indices 0 through 5; end is exclusive)
print(text.count("abc", 3))        # Output: 3 (search from index 3 to end)
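
The start and end arguments behave like slice bounds, so a bounded count should agree with counting inside the corresponding slice. A minimal sketch that checks this on the same string, including a match cut off by the end bound and a negative start index:

```python
text = "abcabcabcabc"

# count(sub, start, end) inspects text[start:end]; index `end` itself is excluded
bounded = text.count("abc", 0, 6)
sliced = text[0:6].count("abc")
print(bounded, sliced)  # 2 2

# A match that crosses the end boundary is not counted
print(text.count("abc", 0, 5))  # 1

# Negative indices count from the end, as in slicing
print(text.count("abc", -6))  # 2
```

Passing start and end avoids building a temporary slice, which can matter for very large strings.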

Case-Insensitive Counting

For case-insensitive counting, convert both the text and search string to the same case:

text = "Python PYTHON python PyThOn"
substring = "python"

# Method 1: Convert to lowercase
count_lower = text.lower().count(substring.lower())
print(f"Case-insensitive count: {count_lower}")  # Output: 4

# Method 2: Wrap the same approach in a reusable helper
def count_case_insensitive(text, substring):
    return text.lower().count(substring.lower())

result = count_case_insensitive("API api Api API", "api")
print(result)  # Output: 4
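
For text outside plain ASCII, lower() does not always make equivalent strings compare equal. The sketch below uses str.casefold(), which is intended for caseless matching; the helper name count_casefold is hypothetical and not part of the original examples:

```python
def count_casefold(text, substring):
    # Hypothetical helper: casefold() is a more aggressive lower(),
    # e.g. it maps the German sharp s to "ss"
    return text.casefold().count(substring.casefold())

text = "Straße STRASSE strasse"

# lower() leaves the sharp s in place, so the first word is missed
print(text.lower().count("strasse"))    # 2

# casefold() normalizes all three spellings to "strasse"
print(count_casefold(text, "strasse"))  # 3
```

For ASCII-only input the two approaches give the same result, so lower() remains fine there.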

Counting Individual Characters

Counting a single character works the same way as counting any substring. To tally every character in one pass, use collections.Counter:

text = "mississippi"

# Count specific character
print(text.count('s'))  # Output: 4
print(text.count('i'))  # Output: 4
print(text.count('p'))  # Output: 2

# Count all character occurrences using Counter
from collections import Counter

char_counts = Counter(text)
print(char_counts)  # Output: Counter({'i': 4, 's': 4, 'p': 2, 'm': 1})

# Get count for specific character
print(char_counts['s'])  # Output: 4
print(char_counts['x'])  # Output: 0 (returns 0 for missing keys)
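
A Counter can also rank characters or total the counts for a group of them. A short sketch on the same string; the vowel set is just an illustrative choice:

```python
from collections import Counter

text = "mississippi"
char_counts = Counter(text)

# The two most frequent characters, as (character, count) pairs
print(char_counts.most_common(2))  # [('i', 4), ('s', 4)]

# Total occurrences of any character in a chosen set
vowels = "aeiou"
vowel_total = sum(char_counts[ch] for ch in vowels)
print(vowel_total)  # 4
```

Because missing keys return 0, the sum works even when some vowels never appear in the text.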

Handling Overlapping Matches

The count() method only counts non-overlapping occurrences. For overlapping matches, implement a custom solution:

def count_overlapping(text, substring):
    count = 0
    start = 0
    
    while True:
        pos = text.find(substring, start)
        if pos == -1:
            break
        count += 1
        start = pos + 1  # Move by 1 to find overlapping matches
    
    return count

text = "aaaa"
print(f"Non-overlapping 'aa': {text.count('aa')}")  # Output: 2
print(f"Overlapping 'aa': {count_overlapping(text, 'aa')}")  # Output: 3

# Another example
text = "abababab"
print(f"Non-overlapping 'aba': {text.count('aba')}")  # Output: 2
print(f"Overlapping 'aba': {count_overlapping(text, 'aba')}")  # Output: 3
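
When the positions of the matches matter as well as their number, the same find() loop can yield each start index. This is a sketch with a hypothetical helper name, find_overlapping; treating an empty substring as having no matches is an assumption made here, not behavior taken from the function above:

```python
def find_overlapping(text, substring):
    """Yield the start index of every match, overlapping ones included."""
    if not substring:
        return  # assumption: an empty substring yields no matches
    pos = text.find(substring)
    while pos != -1:
        yield pos
        # Advance by one character so overlapping matches are found
        pos = text.find(substring, pos + 1)

positions = list(find_overlapping("abababab", "aba"))
print(positions)       # [0, 2, 4]
print(len(positions))  # 3
```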

Using Regular Expressions for Pattern Matching

Regular expressions count matches of a pattern rather than a fixed substring. The length of the list returned by re.findall() is the number of non-overlapping matches:

import re

text = "Contact us at support@example.com or sales@example.com"

# Count email addresses
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
email_count = len(re.findall(email_pattern, text))
print(f"Email addresses found: {email_count}")  # Output: 2

# Count words starting with 's'
text = "she sells seashells by the seashore"
s_words = len(re.findall(r'\bs\w*', text, re.IGNORECASE))
print(f"Words starting with 's': {s_words}")  # Output: 4

# Count digits
text = "Order #12345 contains 3 items, total: $99.99"
digit_sequences = len(re.findall(r'\d+', text))
print(f"Digit sequences: {digit_sequences}")  # Output: 4

For overlapping regex matches, wrap the pattern in a zero-width lookahead so that each match consumes no characters:

import re

def count_overlapping_regex(text, pattern):
    return sum(1 for _ in re.finditer(f'(?={pattern})', text))

text = "aaaa"
pattern = "aa"
count = count_overlapping_regex(text, pattern)
print(f"Overlapping regex matches: {count}")  # Output: 3
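
If the thing being counted is a literal substring rather than a regex, metacharacters such as '.' need escaping before they go into the lookahead. A sketch of that variant, with the hypothetical name count_overlapping_literal:

```python
import re

def count_overlapping_literal(text, substring):
    # Hypothetical variant: re.escape() makes characters such as '.' or '+'
    # match literally inside the lookahead
    pattern = f"(?={re.escape(substring)})"
    return sum(1 for _ in re.finditer(pattern, text))

text = "a.a.a, aXa"

# Unescaped, '.' matches any character, so 'aXa' is counted too
print(sum(1 for _ in re.finditer("(?=a.a)", text)))  # 3

# Escaped, only the literal 'a.a' is counted
print(count_overlapping_literal(text, "a.a"))        # 2
```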

Counting Word Occurrences

Counting whole words requires handling word boundaries to avoid partial matches:

import re
from collections import Counter

text = "the cat and the dog and the bird"

# Method 1: Using split and count
words = text.lower().split()
word_to_find = "the"
count = words.count(word_to_find)
print(f"'{word_to_find}' appears {count} times")  # Output: 3

# Method 2: Counter for all words
word_counts = Counter(words)
print(word_counts)  # Output: Counter({'the': 3, 'and': 2, 'cat': 1, 'dog': 1, 'bird': 1})

# Method 3: Regex with word boundaries
def count_whole_word(text, word):
    pattern = r'\b' + re.escape(word) + r'\b'
    return len(re.findall(pattern, text, re.IGNORECASE))

text = "cat cats cattle"
print(count_whole_word(text, "cat"))  # Output: 1 (only exact matches)
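
The split-based methods assume words are separated only by whitespace. With punctuation in the text, split() leaves it attached to the words. One possible workaround, sketched below, is to extract runs of word characters with a regex before counting:

```python
import re
from collections import Counter

text = "The cat. The dog, the bird!"

# split() keeps punctuation attached, so "cat." is not counted as "cat"
print(text.lower().split().count("cat"))  # 0

# Extracting runs of word characters drops the punctuation first
words = re.findall(r"\w+", text.lower())
word_counts = Counter(words)
print(word_counts["cat"])  # 1
print(word_counts["the"])  # 3
```

Note that \w+ also splits contractions such as "don't" into two tokens, so the pattern may need adjusting for natural-language text.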

Performance Comparison

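The snippet below times three approaches on the same input. A single perf_counter() measurement is noisy, so treat the numbers as a rough guide rather than a benchmark: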

import re
import time
from collections import Counter

text = "lorem ipsum " * 10000
substring = "ipsum"

# Test count() method
start = time.perf_counter()
result1 = text.count(substring)
time1 = time.perf_counter() - start

# Test regex findall
start = time.perf_counter()
result2 = len(re.findall(re.escape(substring), text))  # escape so the substring is matched literally
time2 = time.perf_counter() - start

# Test Counter (for word counting)
start = time.perf_counter()
result3 = Counter(text.split())[substring]
time3 = time.perf_counter() - start

print(f"count(): {time1:.6f}s")
print(f"regex: {time2:.6f}s")
print(f"Counter: {time3:.6f}s")

# Illustrative output; exact timings vary by machine and Python version, but count() is typically fastest:
# count(): 0.000012s
# regex: 0.001234s
# Counter: 0.002345s
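
For steadier numbers, the standard-library timeit module can run each approach many times. The following is a sketch rather than a rigorous benchmark; the repetition counts are arbitrary choices, and no timings are shown because they depend on the machine:

```python
import re
import timeit

text = "lorem ipsum " * 10000
substring = "ipsum"
pattern = re.compile(re.escape(substring))  # compile once, outside the timed code

# Each callable runs 100 times per round; repeat() returns one total per round
count_times = timeit.repeat(lambda: text.count(substring), number=100, repeat=5)
regex_times = timeit.repeat(lambda: len(pattern.findall(text)), number=100, repeat=5)

# The minimum is the measurement least affected by other system activity
print(f"count(): {min(count_times) / 100:.6f}s per call")
print(f"regex:   {min(regex_times) / 100:.6f}s per call")
```

Precompiling the pattern keeps regex compilation out of the measurement, so the comparison reflects matching cost only.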

Practical Application: Log File Analysis

The following example counts log levels in a block of log text:

import re
from collections import Counter

log_data = """
2024-01-15 ERROR: Database connection failed
2024-01-15 INFO: User login successful
2024-01-15 ERROR: Timeout occurred
2024-01-15 WARNING: High memory usage
2024-01-15 ERROR: Invalid request
2024-01-15 INFO: Cache cleared
"""

# Count log levels
error_count = log_data.count("ERROR")
warning_count = log_data.count("WARNING")
info_count = log_data.count("INFO")

print(f"Errors: {error_count}, Warnings: {warning_count}, Info: {info_count}")
# Output: Errors: 3, Warnings: 1, Info: 2

# Using regex to capture whichever uppercase level name precedes a colon
log_pattern = r'\b([A-Z]+):'
log_levels = re.findall(log_pattern, log_data)
level_counts = Counter(log_levels)

print("\nDetailed breakdown:")
for level, count in level_counts.most_common():
    print(f"{level}: {count}")
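
Real log files are often too large to hold in one string. A sketch of the same tally done line by line follows; io.StringIO stands in for an open file, and the line layout (date, space, level, colon) is an assumption based on the sample data above:

```python
import io
import re
from collections import Counter

# io.StringIO stands in for an open log file (an assumption made so the
# example is runnable); a real file object is iterated the same way
log_file = io.StringIO(
    "2024-01-15 ERROR: Database connection failed\n"
    "2024-01-15 INFO: User login successful\n"
    "2024-01-15 ERROR: Timeout occurred\n"
)

level_pattern = re.compile(r"^\S+ (\w+):")  # date, space, level, colon
level_counts = Counter()

# Process one line at a time so memory use stays flat
for line in log_file:
    match = level_pattern.match(line)
    if match:
        level_counts[match.group(1)] += 1

print(level_counts)  # Counter({'ERROR': 2, 'INFO': 1})
```

Anchoring the pattern at the start of the line avoids counting colons that appear inside log messages.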

Choose count() for simple substring matching, regex for pattern-based counting, and Counter for frequency analysis across multiple items. Matching the method to the task keeps text-processing code both correct and fast.