Pandas - str.startswith() and str.endswith()
Key Insights
- The str.startswith() and str.endswith() methods enable efficient pattern matching at string boundaries in pandas Series, supporting both single patterns and tuples of multiple patterns for complex filtering scenarios.
- These vectorized string methods significantly outperform iterative approaches such as apply() with lambda functions, often delivering 10-50x speedups on large datasets while keeping the code cleaner and more readable.
- Both methods integrate seamlessly with boolean indexing for DataFrame filtering. Matching is always literal (no regex) and case-sensitive; case-insensitive matching requires preprocessing such as str.lower().
Understanding String Boundary Methods
The str.startswith() and str.endswith() methods in pandas provide vectorized operations for pattern matching at the beginning and end of strings within Series objects. These methods return boolean Series that can be used directly for filtering, masking, or conditional operations.
import pandas as pd
import numpy as np
# Create sample dataset
data = {
'filename': ['report_2024.pdf', 'data_export.csv', 'image_001.png',
'report_2023.pdf', 'backup.csv', 'document.docx'],
'email': ['user@gmail.com', 'admin@company.org', 'test@yahoo.com',
'sales@company.org', 'info@outlook.com', 'support@company.org'],
'status': ['Completed', 'Pending', 'Failed', 'Completed', 'In Progress', 'Pending']
}
df = pd.DataFrame(data)
# Basic startswith usage: boolean mask of rows whose filename starts with 'report'
report_files = df['filename'].str.startswith('report')
print(df[report_files])
Output:
filename email status
0 report_2024.pdf user@gmail.com Completed
3 report_2023.pdf sales@company.org Completed
Multiple Pattern Matching
Both methods accept tuples of patterns, enabling efficient multi-pattern matching without chaining multiple conditions or using complex regex patterns.
# Match multiple file extensions
csv_or_pdf = df['filename'].str.endswith(('.csv', '.pdf'))
print(df[csv_or_pdf])
# Match multiple email providers
corporate_or_yahoo = df['email'].str.endswith(('company.org', 'yahoo.com'))
print(df[corporate_or_yahoo])
Output:
filename email status
0 report_2024.pdf user@gmail.com Completed
1 data_export.csv admin@company.org Pending
3 report_2023.pdf sales@company.org Completed
4 backup.csv info@outlook.com In Progress
filename email status
1 data_export.csv admin@company.org Pending
2 image_001.png test@yahoo.com Failed
3 report_2023.pdf sales@company.org Completed
5 document.docx support@company.org Pending
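Note that multiple patterns must be passed as a tuple specifically: pandas delegates to Python's built-in str.endswith, which accepts a string or a tuple of strings but not a list. A minimal sketch with made-up filenames:

```python
import pandas as pd

files = pd.Series(['report.pdf', 'data.csv', 'image.png'])

# A tuple of patterns works...
mask = files.str.endswith(('.csv', '.pdf'))
print(mask.tolist())  # [True, True, False]

# ...but a list does not - it raises TypeError
try:
    files.str.endswith(['.csv', '.pdf'])
except TypeError as exc:
    print(type(exc).__name__)  # TypeError
```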
Handling Missing Values
These methods propagate missing values by default: NaN entries produce NaN in the result, yielding an object-dtype Series rather than a boolean one. Use the na parameter to substitute a boolean value so the result can serve directly as a mask.
# Create Series with missing values
files_with_nan = pd.Series(['config.json', np.nan, 'setup.py', None, 'README.md'])
# Default behavior - NaN propagates (object dtype result)
print(files_with_nan.str.startswith('config'))
# Treat missing values as non-matches
print(files_with_nan.str.startswith('config', na=False))
# Treat missing values as matches
print(files_with_nan.str.startswith('config', na=True))
Output:
0 True
1 NaN
2 False
3 NaN
4 False
dtype: object
0 True
1 False
2 False
3 False
4 False
dtype: bool
0 True
1 True
2 False
3 True
4 False
dtype: bool
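When filtering a DataFrame column that may contain missing values, passing na=False keeps the mask strictly boolean so it can be used directly for indexing. A minimal sketch with a hypothetical path column:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'path': ['logs/app.log', np.nan, 'logs/db.log', 'tmp/x']})

# na=False maps missing entries to False, yielding a clean boolean mask
mask = df['path'].str.startswith('logs/', na=False)
print(df[mask])
```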
Practical Filtering Scenarios
Combine these methods with boolean indexing for powerful DataFrame filtering operations.
# Complex dataset
logs = pd.DataFrame({
'timestamp': pd.date_range('2024-01-01', periods=6, freq='h'),
'log_level': ['INFO', 'ERROR', 'WARNING', 'INFO', 'ERROR', 'DEBUG'],
'message': ['System started', 'Connection failed', 'Low memory',
'Request processed', 'Timeout error', 'Variable dump'],
'source': ['app.py', 'database.py', 'memory.py', 'api.py', 'network.py', 'app.py']
})
# Filter error messages
errors = logs[logs['log_level'].str.startswith('ERROR')]
print(errors)
# Filter by source file pattern
app_logs = logs[logs['source'].str.startswith('app')]
print(app_logs)
# Combine multiple conditions
critical = logs[
(logs['log_level'].str.startswith(('ERROR', 'WARNING'))) &
(logs['source'].str.endswith('.py'))
]
print(critical)
Case Sensitivity Control
Unlike some string methods, startswith() and endswith() are case-sensitive by default and don’t provide a case parameter. Handle case-insensitive matching through preprocessing.
# Case-sensitive example
products = pd.Series(['iPhone 15', 'ipad Pro', 'IMAC', 'iWatch', 'MacBook'])
# Case-sensitive match (default)
print(products.str.startswith('i'))
# Case-insensitive approach
print(products.str.lower().str.startswith('i'))
# Alternative: using contains with regex
print(products.str.contains('^i', case=False, regex=True))
Output:
0 True
1 True
2 False
3 True
4 False
dtype: bool
0 True
1 True
2 True
3 True
4 False
dtype: bool
0 True
1 True
2 True
3 True
4 False
dtype: bool
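For text containing non-ASCII characters, str.casefold() is a more aggressive normalization than str.lower() (for example, German 'ß' casefolds to 'ss'). A sketch with made-up data illustrating the difference:

```python
import pandas as pd

words = pd.Series(['Straße', 'strasse', 'STRASSE'])

# lower() leaves the 'ß' intact, so the first entry fails the prefix test
print(words.str.lower().str.startswith('strasse').tolist())     # [False, True, True]

# casefold() maps 'ß' to 'ss', so all three entries match
print(words.str.casefold().str.startswith('strasse').tolist())  # [True, True, True]
```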
Performance Comparison
Vectorized string methods dramatically outperform iterative approaches, especially on large datasets.
import time
# Generate large dataset
large_series = pd.Series(['file_' + str(i) + '.txt' for i in range(100000)])
# Method 1: Vectorized startswith
start = time.time()
result1 = large_series.str.startswith('file_1')
time1 = time.time() - start
# Method 2: Apply with lambda
start = time.time()
result2 = large_series.apply(lambda x: x.startswith('file_1'))
time2 = time.time() - start
# Method 3: List comprehension
start = time.time()
result3 = pd.Series([x.startswith('file_1') for x in large_series])
time3 = time.time() - start
print(f"Vectorized: {time1:.4f}s")
print(f"Apply/Lambda: {time2:.4f}s")
print(f"List Comp: {time3:.4f}s")
print(f"Speedup vs apply: {time2/time1:.1f}x")
Typical output:
Vectorized: 0.0045s
Apply/Lambda: 0.1823s
List Comp: 0.0892s
Speedup vs apply: 40.5x
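If the data can be materialized as a NumPy string array, numpy.char.startswith offers a comparable vectorized check outside pandas. A sketch, assuming the data has no missing values (the conversion to a fixed-width string array would stringify NaN):

```python
import numpy as np
import pandas as pd

series = pd.Series(['file_1.txt', 'file_2.txt', 'other.txt'])

# Convert to a fixed-width unicode array, then test prefixes in NumPy
arr = series.to_numpy(dtype=str)
mask = np.char.startswith(arr, 'file_')
print(mask.tolist())  # [True, True, False]
```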
Data Validation and Cleaning
Use these methods for data validation tasks such as URL scheme validation, file type verification, or prefix-based categorization.
# URL validation dataset
urls = pd.DataFrame({
'url': ['https://api.example.com/users', 'http://old.site.com',
'ftp://files.server.com', 'https://secure.bank.com',
'http://legacy.app.com', 'ws://socket.io']
})
# Identify secure URLs
urls['is_secure'] = urls['url'].str.startswith('https://')
# Categorize by protocol
urls['protocol'] = 'other'
urls.loc[urls['url'].str.startswith('https://'), 'protocol'] = 'HTTPS'
urls.loc[urls['url'].str.startswith('http://'), 'protocol'] = 'HTTP'
urls.loc[urls['url'].str.startswith('ftp://'), 'protocol'] = 'FTP'
print(urls)
Output:
url is_secure protocol
0 https://api.example.com/... True HTTPS
1 http://old.site.com False HTTP
2 ftp://files.server.com False FTP
3 https://secure.bank.com True HTTPS
4 http://legacy.app.com False HTTP
5 ws://socket.io False other
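The cascading .loc assignments above can also be written with numpy.select, which applies the first matching condition per row and a default for the rest. A sketch reusing the same protocol rules on hypothetical URLs:

```python
import pandas as pd
import numpy as np

urls = pd.Series(['https://a.com', 'http://b.com', 'ftp://c.com', 'ws://d.io'])

conditions = [
    urls.str.startswith('https://'),
    urls.str.startswith('http://'),
    urls.str.startswith('ftp://'),
]
choices = ['HTTPS', 'HTTP', 'FTP']

# First matching condition wins; unmatched rows fall back to 'other'
protocol = np.select(conditions, choices, default='other')
print(protocol.tolist())  # ['HTTPS', 'HTTP', 'FTP', 'other']
```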
Negation and Complex Conditions
Combine with boolean operators for sophisticated filtering logic.
# Transaction data
transactions = pd.DataFrame({
'transaction_id': ['TXN_001', 'REF_002', 'TXN_003', 'CHG_004', 'TXN_005'],
'account': ['ACC_1234', 'ACC_5678', 'SAV_9012', 'ACC_3456', 'SAV_7890'],
'amount': [100, -50, 200, -25, 150]
})
# Find transactions (not refunds/charges) on checking accounts
valid_txns = transactions[
transactions['transaction_id'].str.startswith('TXN') &
transactions['account'].str.startswith('ACC')
]
print(valid_txns)
# Find all non-savings accounts
non_savings = transactions[~transactions['account'].str.startswith('SAV')]
print(non_savings)
Integration with Other String Methods
Chain with other pandas string methods for comprehensive text processing pipelines.
# Email processing
emails = pd.Series(['ADMIN@COMPANY.COM', 'user@gmail.com', 'SALES@COMPANY.COM'])
# Clean, then filter corporate emails
cleaned = emails.str.strip().str.lower()
corporate = cleaned[cleaned.str.endswith('company.com')]
print(corporate)
# Extract domain from company emails
company_emails = emails[emails.str.upper().str.endswith('COMPANY.COM')]
domains = company_emails.str.split('@').str[-1]
print(domains)
These methods provide essential building blocks for string-based data filtering and validation in pandas, offering both simplicity and performance for production data pipelines.