Python - Count Occurrences in String
Key Insights
- Python offers multiple methods to count substring occurrences, from basic count() to regex patterns for complex matching scenarios
- Understanding the difference between overlapping and non-overlapping matches is critical for accurate counting in text processing applications
- Performance characteristics vary significantly between methods: count() is fastest for simple cases, while regex provides flexibility at the cost of speed
Using the Built-in count() Method
The count() method is the most straightforward approach for counting non-overlapping occurrences of a substring. It’s a string method that returns an integer representing how many times the substring appears.
text = "Python is awesome. Python is powerful. Python is everywhere."
substring = "Python"
occurrences = text.count(substring)
print(f"'{substring}' appears {occurrences} times") # Output: 'Python' appears 3 times
# Case-sensitive counting
mixed_case = "Python python PYTHON PyThOn"
print(mixed_case.count("Python")) # Output: 1
print(mixed_case.count("python")) # Output: 1
The count() method accepts optional start and end parameters to limit the search range:
text = "abcabcabcabc"
print(text.count("abc")) # Output: 4
print(text.count("abc", 0, 6)) # Output: 2 (search indices 0 through 5; end is exclusive)
print(text.count("abc", 3)) # Output: 3 (search from index 3 to end)
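The start and end arguments follow slice semantics, so the end index is exclusive and a match only counts if it fits entirely inside the range. A minimal sketch of that behavior, reusing the sample string above (the variable names are illustrative):

```python
text = "abcabcabcabc"

# start/end behave like slice bounds: end is exclusive
in_range = text.count("abc", 0, 6)   # searches text[0:6] == "abcabc"
via_slice = text[0:6].count("abc")   # same result, but copies the slice first

# A match must fit entirely inside the range to be counted
partial = text.count("abc", 0, 5)    # text[0:5] == "abcab" holds one full match

print(in_range, via_slice, partial)  # 2 2 1
```

Passing start and end directly avoids building a temporary substring, which can matter for large inputs.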
Case-Insensitive Counting
For case-insensitive counting, convert both the text and search string to the same case:
text = "Python PYTHON python PyThOn"
substring = "python"
# Method 1: Convert to lowercase
count_lower = text.lower().count(substring.lower())
print(f"Case-insensitive count: {count_lower}") # Output: 4
# Method 2: Using a helper function
def count_case_insensitive(text, substring):
    return text.lower().count(substring.lower())
result = count_case_insensitive("API api Api API", "api")
print(result) # Output: 4
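Lowercasing works well for ASCII text, but some non-ASCII characters do not compare equal after lower(). One possible refinement, shown here as a sketch with a hypothetical helper named count_casefold, is to use str.casefold(), which applies more aggressive case folding:

```python
def count_casefold(text, substring):
    # Hypothetical helper: casefold() is a more aggressive form of lower()
    return text.casefold().count(substring.casefold())

sample = "Straße STRASSE strasse"
print(sample.lower().count("strasse"))    # 2: lower() keeps the "ß"
print(count_casefold(sample, "strasse"))  # 3: casefold() maps "ß" to "ss"
```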
Counting Individual Characters
Counting a single character works the same way as counting a substring. To tally every character in one pass, use collections.Counter:
text = "mississippi"
# Count specific character
print(text.count('s')) # Output: 4
print(text.count('i')) # Output: 4
print(text.count('p')) # Output: 2
# Count all character occurrences using Counter
from collections import Counter
char_counts = Counter(text)
print(char_counts) # Output: Counter({'i': 4, 's': 4, 'p': 2, 'm': 1})
# Get count for specific character
print(char_counts['s']) # Output: 4
print(char_counts['x']) # Output: 0 (returns 0 for missing keys)
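Once the counts are in a Counter, ranking and aggregating them takes little extra code. The following sketch, using the same "mississippi" string, shows most_common() and a sum over a chosen set of characters (the names top_two and vowel_total are illustrative):

```python
from collections import Counter

char_counts = Counter("mississippi")

# The two most frequent characters; ties keep first-seen order
top_two = char_counts.most_common(2)
print(top_two)  # [('i', 4), ('s', 4)]

# Total occurrences of a set of characters, e.g. vowels
vowel_total = sum(char_counts[c] for c in "aeiou")
print(vowel_total)  # 4
```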
Handling Overlapping Matches
The count() method only counts non-overlapping occurrences. For overlapping matches, implement a custom solution:
def count_overlapping(text, substring):
    count = 0
    start = 0
    while True:
        pos = text.find(substring, start)
        if pos == -1:
            break
        count += 1
        start = pos + 1 # Move by 1 to find overlapping matches
    return count
text = "aaaa"
print(f"Non-overlapping 'aa': {text.count('aa')}") # Output: 2
print(f"Overlapping 'aa': {count_overlapping(text, 'aa')}") # Output: 3
# Another example
text = "abababab"
print(f"Non-overlapping 'aba': {text.count('aba')}") # Output: 2
print(f"Overlapping 'aba': {count_overlapping(text, 'aba')}") # Output: 3
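When the positions of the matches matter as well as the total, the same find() loop can be written as a generator. This is a sketch rather than a canonical implementation; find_all_overlapping is a hypothetical name, and skipping empty substrings is a design choice made here:

```python
def find_all_overlapping(text, substring):
    """Yield the start index of every match, including overlapping ones."""
    if not substring:
        return  # an empty substring would match at every position
    pos = text.find(substring)
    while pos != -1:
        yield pos
        pos = text.find(substring, pos + 1)

positions = list(find_all_overlapping("abababab", "aba"))
print(positions)       # [0, 2, 4]
print(len(positions))  # 3
```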
Using Regular Expressions for Pattern Matching
Regular expressions provide powerful pattern-based counting capabilities:
import re
text = "Contact us at support@example.com or sales@example.com"
# Count email addresses
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'  # no "|" inside the class: it would match a literal pipe
email_count = len(re.findall(email_pattern, text))
print(f"Email addresses found: {email_count}") # Output: 2
# Count words starting with 's'
text = "she sells seashells by the seashore"
s_words = len(re.findall(r'\bs\w*', text, re.IGNORECASE))
print(f"Words starting with 's': {s_words}") # Output: 4
# Count digits
text = "Order #12345 contains 3 items, total: $99.99"
digit_sequences = len(re.findall(r'\d+', text))
print(f"Digit sequences: {digit_sequences}") # Output: 4
For overlapping regex matches:
import re
def count_overlapping_regex(text, pattern):
    return sum(1 for _ in re.finditer(f'(?={pattern})', text))
text = "aaaa"
pattern = "aa"
count = count_overlapping_regex(text, pattern)
print(f"Overlapping regex matches: {count}") # Output: 3
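The lookahead approach treats its argument as a regular expression, so metacharacters such as "." keep their special meaning. If the goal is to count a literal substring, one option is to escape it first. A sketch, with count_overlapping_literal as a hypothetical helper:

```python
import re

def count_overlapping_literal(text, substring):
    # Hypothetical helper: escape the substring so characters such as "."
    # are matched literally inside the lookahead
    pattern = f"(?={re.escape(substring)})"
    return sum(1 for _ in re.finditer(pattern, text))

text = "1x1.1.1"
print(sum(1 for _ in re.finditer("(?=1.1)", text)))  # 3: "." also matches "x"
print(count_overlapping_literal(text, "1.1"))        # 2: only literal "1.1"
```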
Counting Word Occurrences
Counting whole words requires handling word boundaries to avoid partial matches:
import re
from collections import Counter
text = "the cat and the dog and the bird"
# Method 1: Using split and count
words = text.lower().split()
word_to_find = "the"
count = words.count(word_to_find)
print(f"'{word_to_find}' appears {count} times") # Output: 3
# Method 2: Counter for all words
word_counts = Counter(words)
print(word_counts) # Output: Counter({'the': 3, 'and': 2, 'cat': 1, 'dog': 1, 'bird': 1})
# Method 3: Regex with word boundaries
def count_whole_word(text, word):
    pattern = r'\b' + re.escape(word) + r'\b'
    return len(re.findall(pattern, text, re.IGNORECASE))
text = "cat cats cattle"
print(count_whole_word(text, "cat")) # Output: 1 (only exact matches)
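Splitting on whitespace leaves punctuation attached to words, so "end." and "end" count as different tokens. A possible workaround, sketched here with an illustrative sample sentence, is to extract tokens with a regex before counting:

```python
import re
from collections import Counter

text = "The end. Is this the end? The end!"

# Whitespace splitting keeps punctuation: "end.", "end?", "end!"
naive = text.lower().split().count("end")

# Extracting runs of word characters drops the punctuation
tokens = re.findall(r"\w+", text.lower())
word_counts = Counter(tokens)

print(naive)                                   # 0
print(word_counts["end"], word_counts["the"])  # 3 3
```

Note that \w+ also splits contractions such as "don't" into two tokens, so the pattern may need adjusting for real text.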
Performance Comparison
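The methods also differ in speed. The script below times each approach once on the same input; note that the Counter variant counts whole words rather than substrings: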
import re
import time
from collections import Counter
text = "lorem ipsum " * 10000
substring = "ipsum"
# Test count() method
start = time.perf_counter()
result1 = text.count(substring)
time1 = time.perf_counter() - start
# Test regex findall
start = time.perf_counter()
result2 = len(re.findall(re.escape(substring), text))  # escape so the substring is matched literally
time2 = time.perf_counter() - start
# Test Counter (for word counting)
start = time.perf_counter()
result3 = Counter(text.split())[substring]
time3 = time.perf_counter() - start
print(f"count(): {time1:.6f}s")
print(f"regex: {time2:.6f}s")
print(f"Counter: {time3:.6f}s")
# Illustrative output (absolute timings vary by machine and Python version); count() is typically fastest:
# count(): 0.000012s
# regex: 0.001234s
# Counter: 0.002345s
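A single timed run is noisy, so the numbers above should be read as rough indications only. For a steadier comparison, the standard library's timeit module can repeat each call many times. A minimal sketch, assuming the same input string (the run count of 50 is arbitrary):

```python
import re
import timeit

text = "lorem ipsum " * 10000
substring = "ipsum"
compiled = re.compile(re.escape(substring))  # compile once, outside the timed call

runs = 50  # arbitrary repeat count for this sketch
t_count = timeit.timeit(lambda: text.count(substring), number=runs)
t_regex = timeit.timeit(lambda: len(compiled.findall(text)), number=runs)

print(f"count(): {t_count / runs:.6f}s per call")
print(f"regex:   {t_regex / runs:.6f}s per call")
```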
Practical Application: Log File Analysis
The following example counts log levels in a sample of log data:
import re
from collections import Counter
log_data = """
2024-01-15 ERROR: Database connection failed
2024-01-15 INFO: User login successful
2024-01-15 ERROR: Timeout occurred
2024-01-15 WARNING: High memory usage
2024-01-15 ERROR: Invalid request
2024-01-15 INFO: Cache cleared
"""
# Count log levels
error_count = log_data.count("ERROR")
warning_count = log_data.count("WARNING")
info_count = log_data.count("INFO")
print(f"Errors: {error_count}, Warnings: {warning_count}, Info: {info_count}")
# Output: Errors: 3, Warnings: 1, Info: 2
# Using regex for more complex patterns
log_pattern = r'(\w+):'
log_levels = re.findall(log_pattern, log_data)
level_counts = Counter(log_levels)
print("\nDetailed breakdown:")
for level, count in level_counts.most_common():
    print(f"{level}: {count}")
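Plain count() calls can overcount when a level name also appears inside a message. One way to guard against that, sketched below with a hypothetical log sample and an assumed "date level: message" line format, is to anchor the pattern to the start of each line:

```python
import re
from collections import Counter

# Hypothetical log sample: the second line mentions ERROR inside an INFO message
sample_log = """\
2024-01-15 ERROR: Database connection failed
2024-01-15 INFO: Retrying after ERROR response
2024-01-15 WARNING: High memory usage
"""

naive_errors = sample_log.count("ERROR")

# Anchor to line start: date, a space, then the level before the colon
level_pattern = re.compile(r"^\d{4}-\d{2}-\d{2} (\w+):", re.MULTILINE)
levels = Counter(level_pattern.findall(sample_log))

print(naive_errors)     # 2
print(levels["ERROR"])  # 1
print(levels["INFO"])   # 1
```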
Choose count() for simple substring matching, regex for pattern-based counting, and Counter for frequency analysis across multiple items. Knowing these trade-offs helps you pick the fastest method that still matches what you intend to count.