Pandas - str.startswith() and str.endswith()

The `str.startswith()` and `str.endswith()` methods in pandas provide vectorized pattern matching at the beginning and end of strings within Series objects, returning boolean masks that drop straight into filtering and conditional logic.

Key Insights

  • The str.startswith() and str.endswith() methods enable efficient pattern matching at string boundaries in pandas Series, supporting both single patterns and tuples of multiple patterns for complex filtering scenarios.
  • These vectorized string methods significantly outperform iterative approaches like apply() with lambda functions, delivering 10-50x performance improvements on large datasets while maintaining cleaner, more readable code.
  • Both methods integrate seamlessly with boolean indexing for DataFrame filtering. Matching is case-sensitive and regex-free; case-insensitive checks require preprocessing such as str.lower(), or falling back to str.contains() with an anchored pattern.
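Before the detailed sections, here is a minimal sketch of both methods on a throwaway Series (the file names are illustrative):

```python
import pandas as pd

s = pd.Series(['app.log', 'app.txt', 'db.log'])

# Boundary checks return boolean Series usable directly as masks
print(s.str.startswith('app'))           # True, True, False
print(s.str.endswith(('.log', '.txt')))  # True, True, True
```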

Understanding String Boundary Methods

The str.startswith() and str.endswith() methods in pandas provide vectorized operations for pattern matching at the beginning and end of strings within Series objects. These methods return boolean Series that can be used directly for filtering, masking, or conditional operations.

import pandas as pd
import numpy as np

# Create sample dataset
data = {
    'filename': ['report_2024.pdf', 'data_export.csv', 'image_001.png', 
                 'report_2023.pdf', 'backup.csv', 'document.docx'],
    'email': ['user@gmail.com', 'admin@company.org', 'test@yahoo.com',
              'sales@company.org', 'info@outlook.com', 'support@company.org'],
    'status': ['Completed', 'Pending', 'Failed', 'Completed', 'In Progress', 'Pending']
}

df = pd.DataFrame(data)

# Basic startswith usage
report_files = df['filename'].str.startswith('report')
print(df[report_files])

Output:

          filename              email     status
0  report_2024.pdf     user@gmail.com  Completed
3  report_2023.pdf  sales@company.org  Completed

Multiple Pattern Matching

Both methods accept tuples of patterns, enabling efficient multi-pattern matching without chaining multiple conditions or using complex regex patterns.

# Match multiple file extensions
csv_or_pdf = df['filename'].str.endswith(('.csv', '.pdf'))
print(df[csv_or_pdf])

# Match multiple email providers
corporate_or_yahoo = df['email'].str.endswith(('company.org', 'yahoo.com'))
print(df[corporate_or_yahoo])

Output:

          filename              email       status
0  report_2024.pdf     user@gmail.com    Completed
1  data_export.csv  admin@company.org      Pending
3  report_2023.pdf  sales@company.org    Completed
4       backup.csv   info@outlook.com  In Progress

          filename                email     status
1  data_export.csv    admin@company.org    Pending
2    image_001.png       test@yahoo.com     Failed
3  report_2023.pdf    sales@company.org  Completed
5    document.docx  support@company.org    Pending
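One detail worth flagging: the multi-pattern form requires a tuple specifically. Passing a list raises a TypeError, mirroring Python's built-in str.startswith and str.endswith. A small sketch (sample values are illustrative):

```python
import pandas as pd

s = pd.Series(['a.csv', 'b.pdf', 'c.txt'])

# A tuple of patterns works
mask = s.str.endswith(('.csv', '.pdf'))

# A list does not - pandas raises TypeError, like the built-in str method
try:
    s.str.endswith(['.csv', '.pdf'])
except TypeError:
    print('lists are not accepted; use a tuple')
```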

Handling Missing Values

These methods propagate missing values by default: a NaN entry yields NaN in the result, so the output is an object-dtype Series rather than a boolean one. Use the na parameter to control this behavior.

# Create Series with missing values
files_with_nan = pd.Series(['config.json', np.nan, 'setup.py', None, 'README.md'])

# Default behavior - NaN propagates (object dtype result)
print(files_with_nan.str.startswith('config'))

# Treat missing values as non-matches (boolean result)
print(files_with_nan.str.startswith('config', na=False))

# Treat missing values as matches
print(files_with_nan.str.startswith('config', na=True))

Output:

0     True
1      NaN
2    False
3      NaN
4    False
dtype: object

0     True
1    False
2    False
3    False
4    False
dtype: bool

0     True
1     True
2    False
3     True
4    False
dtype: bool
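This matters for filtering: a mask that still contains NaN cannot be used for boolean indexing (pandas raises rather than guess), so na=False is the usual safeguard when filtering a DataFrame. A sketch with a hypothetical path column:

```python
import pandas as pd
import numpy as np

# Hypothetical frame with a missing path
df = pd.DataFrame({'path': ['src/main.py', np.nan, 'src/util.py']})

# df[df['path'].str.startswith('src')] would fail: the mask contains NaN
safe = df[df['path'].str.startswith('src', na=False)]
print(safe)
```

The df and path names here are illustrative; the point is that na=False guarantees a clean boolean mask.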

Practical Filtering Scenarios

Combine these methods with boolean indexing for powerful DataFrame filtering operations.

# Complex dataset
logs = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=6, freq='h'),
    'log_level': ['INFO', 'ERROR', 'WARNING', 'INFO', 'ERROR', 'DEBUG'],
    'message': ['System started', 'Connection failed', 'Low memory', 
                'Request processed', 'Timeout error', 'Variable dump'],
    'source': ['app.py', 'database.py', 'memory.py', 'api.py', 'network.py', 'app.py']
})

# Filter error messages
errors = logs[logs['log_level'].str.startswith('ERROR')]
print(errors)

# Filter by source file pattern
app_logs = logs[logs['source'].str.startswith('app')]
print(app_logs)

# Combine multiple conditions
critical = logs[
    (logs['log_level'].str.startswith(('ERROR', 'WARNING'))) &
    (logs['source'].str.endswith('.py'))
]
print(critical)

Case Sensitivity Control

Unlike str.contains(), the startswith() and endswith() methods are always case-sensitive; they don't accept a case parameter. Handle case-insensitive matching through preprocessing, or switch to contains() with an anchored regex.

# Case-sensitive example
products = pd.Series(['iPhone 15', 'ipad Pro', 'IMAC', 'iWatch', 'MacBook'])

# Case-sensitive match (default)
print(products.str.startswith('i'))

# Case-insensitive approach
print(products.str.lower().str.startswith('i'))

# Alternative: using contains with regex
print(products.str.contains('^i', case=False, regex=True))

Output:

0     True
1     True
2    False
3     True
4    False
dtype: bool

0    True
1    True
2    True
3    True
4    False
dtype: bool

0    True
1    True
2    True
3    True
4    False
dtype: bool
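Where text may contain non-ASCII case pairs, str.casefold() is a slightly more thorough normalizer than str.lower(); for example, German 'ß' folds to 'ss'. A small sketch with made-up street names:

```python
import pandas as pd

# casefold() applies full Unicode case folding: 'Straße' becomes 'strasse'
streets = pd.Series(['Straße 12', 'STRASSE 12', 'Road 5'])
print(streets.str.casefold().str.startswith('strasse'))
```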

Performance Comparison

Vectorized string methods dramatically outperform iterative approaches, especially on large datasets.

import time

# Generate large dataset
large_series = pd.Series(['file_' + str(i) + '.txt' for i in range(100000)])

# Method 1: Vectorized startswith
start = time.time()
result1 = large_series.str.startswith('file_1')
time1 = time.time() - start

# Method 2: Apply with lambda
start = time.time()
result2 = large_series.apply(lambda x: x.startswith('file_1'))
time2 = time.time() - start

# Method 3: List comprehension
start = time.time()
result3 = pd.Series([x.startswith('file_1') for x in large_series])
time3 = time.time() - start

print(f"Vectorized: {time1:.4f}s")
print(f"Apply/Lambda: {time2:.4f}s")
print(f"List Comp: {time3:.4f}s")
print(f"Speedup vs apply: {time2/time1:.1f}x")

Typical output (exact timings vary by machine and pandas version):

Vectorized: 0.0045s
Apply/Lambda: 0.1823s
List Comp: 0.0892s
Speedup vs apply: 40.5x

Data Validation and Cleaning

Use these methods for data validation tasks such as URL scheme validation, file type verification, or prefix-based categorization.

# URL validation dataset
urls = pd.DataFrame({
    'url': ['https://api.example.com/users', 'http://old.site.com',
            'ftp://files.server.com', 'https://secure.bank.com',
            'http://legacy.app.com', 'ws://socket.io']
})

# Identify secure URLs
urls['is_secure'] = urls['url'].str.startswith('https://')

# Categorize by protocol
urls['protocol'] = 'other'
urls.loc[urls['url'].str.startswith('https://'), 'protocol'] = 'HTTPS'
urls.loc[urls['url'].str.startswith('http://'), 'protocol'] = 'HTTP'
urls.loc[urls['url'].str.startswith('ftp://'), 'protocol'] = 'FTP'

print(urls)

Output:

                             url  is_secure protocol
0  https://api.example.com/users       True    HTTPS
1            http://old.site.com      False     HTTP
2         ftp://files.server.com      False      FTP
3        https://secure.bank.com       True    HTTPS
4          http://legacy.app.com      False     HTTP
5                 ws://socket.io      False    other
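As a variation on the chained .loc assignments above, numpy.select expresses the same categorization in one pass, keeping conditions and labels side by side. A sketch with shortened, hypothetical URLs:

```python
import pandas as pd
import numpy as np

urls = pd.DataFrame({'url': ['https://a.com', 'http://b.com',
                             'ftp://c.com', 'ws://d.io']})

# Conditions are checked top to bottom; the first match wins
conditions = [
    urls['url'].str.startswith('https://'),
    urls['url'].str.startswith('http://'),
    urls['url'].str.startswith('ftp://'),
]
urls['protocol'] = np.select(conditions, ['HTTPS', 'HTTP', 'FTP'], default='other')
print(urls)
```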

Negation and Complex Conditions

Combine with boolean operators for sophisticated filtering logic.

# Transaction data
transactions = pd.DataFrame({
    'transaction_id': ['TXN_001', 'REF_002', 'TXN_003', 'CHG_004', 'TXN_005'],
    'account': ['ACC_1234', 'ACC_5678', 'SAV_9012', 'ACC_3456', 'SAV_7890'],
    'amount': [100, -50, 200, -25, 150]
})

# Find transactions (not refunds/charges) on checking accounts
valid_txns = transactions[
    transactions['transaction_id'].str.startswith('TXN') &
    transactions['account'].str.startswith('ACC')
]
print(valid_txns)

# Find all non-savings accounts
non_savings = transactions[~transactions['account'].str.startswith('SAV')]
print(non_savings)

Integration with Other String Methods

Chain with other pandas string methods for comprehensive text processing pipelines.

# Email processing
emails = pd.Series(['ADMIN@COMPANY.COM', 'user@gmail.com', 'SALES@COMPANY.COM'])

# Clean and filter corporate emails
cleaned = emails.str.strip().str.lower()
corporate = cleaned[cleaned.str.endswith('company.com')]
print(corporate)

# Extract domain from company emails
company_emails = emails[emails.str.upper().str.endswith('COMPANY.COM')]
domains = company_emails.str.split('@').str[-1]
print(domains)
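Once a prefix or suffix has been matched, a common next step is stripping it. pandas 1.4+ offers str.removeprefix() and str.removesuffix(), mirroring the Python 3.9 string methods; a sketch with illustrative file names:

```python
import pandas as pd

files = pd.Series(['report_2024.pdf', 'report_2023.pdf', 'data.csv'])

# Keep matching rows, then strip the matched boundary text
years = (files[files.str.startswith('report_')]
         .str.removeprefix('report_')
         .str.removesuffix('.pdf'))
print(years)
```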

These methods provide essential building blocks for string-based data filtering and validation in pandas, offering both simplicity and performance for production data pipelines.
