Pandas - str.replace() with Examples
The `str.replace()` method operates on Pandas Series containing string data, replacing all occurrences of a search pattern within each string. In pandas versions before 2.0 the pattern was treated as a regular expression by default; since pandas 2.0 the default is `regex=False` (literal matching), so passing `regex` explicitly is the safest habit.
Key Insights
- `str.replace()` handles both pattern-based and exact replacements: pass `regex=True` for regular expressions or `regex=False` for literal string matching (the default in pandas 2.0+)
- Method chaining with `str.replace()` enables complex multi-step text transformations while maintaining code readability and performance
- Case-sensitive matching is the default behavior; combine with regex flags or `str.lower()` for case-insensitive operations when needed
Basic String Replacement
The simplest use is a literal substring replacement, applied element-wise across a Series:
import pandas as pd
df = pd.DataFrame({
'text': ['Hello World', 'Hello Python', 'Hello Pandas', 'Goodbye World']
})
# Replace 'Hello' with 'Hi'
df['modified'] = df['text'].str.replace('Hello', 'Hi', regex=False)
print(df)
Output:
text modified
0 Hello World Hi World
1 Hello Python Hi Python
2 Hello Pandas Hi Pandas
3 Goodbye World Goodbye World
The method returns a new Series with replacements applied. Original data remains unchanged unless you reassign to the same column.
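A minimal sketch of this non-mutating behavior, using a standalone Series for illustration:

```python
import pandas as pd

s = pd.Series(['Hello World', 'Goodbye World'])

# str.replace() returns a new Series; the original is left untouched
modified = s.str.replace('Hello', 'Hi', regex=False)
print(s.tolist())         # original still contains 'Hello World'
print(modified.tolist())  # ['Hi World', 'Goodbye World']

# To persist the change, reassign the result
s = s.str.replace('Hello', 'Hi', regex=False)
```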
Literal vs Regex Matching
Understanding when str.replace() uses regex versus literal matching prevents unexpected behavior. The regex parameter controls this.
df = pd.DataFrame({
'code': ['user.name', 'user.email', 'admin.role', 'user.id']
})
# Regex mode - dot matches any character
df['regex_replace'] = df['code'].str.replace('user.', 'member.', regex=True)
# Literal mode (the default since pandas 2.0) - dot is treated as a literal period
df['literal_replace'] = df['code'].str.replace('user.', 'member.', regex=False)
print(df)
Output:
code regex_replace literal_replace
0 user.name member.name member.name
1 user.email member.email member.email
2 admin.role admin.role admin.role
3 user.id member.id member.id
In this example both modes produce identical results, because the character the regex dot matches happens to be a literal period. With special regex characters such as `.`, `*`, `+`, `?`, and `[]`, the difference becomes critical.
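To make the divergence visible, here is a short sketch with a hypothetical value (`'userXname'`) chosen so the regex dot matches a non-period character:

```python
import pandas as pd

# 'userXname' is a contrived value where the two modes disagree
s = pd.Series(['user.name', 'userXname'])

# Regex mode: the dot matches ANY character, so 'userX' also matches
regex_result = s.str.replace('user.', 'member.', regex=True)
print(regex_result.tolist())    # ['member.name', 'member.name']

# Literal mode: only the actual period matches
literal_result = s.str.replace('user.', 'member.', regex=False)
print(literal_result.tolist())  # ['member.name', 'userXname']
```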
df = pd.DataFrame({
'price': ['$100', '$200.50', '$1,500', '$75.99']
})
# In regex mode, $ is an end-of-string anchor, so nothing is removed
df['wrong'] = df['price'].str.replace('$', '', regex=True)
# Correct approaches - escape the special character, or use literal mode
df['escaped'] = df['price'].str.replace(r'\$', '', regex=True)
df['literal'] = df['price'].str.replace('$', '', regex=False)
print(df[['price', 'escaped', 'literal']])
Pattern-Based Replacements with Regex
Regex patterns enable sophisticated text transformations based on character classes, quantifiers, and groups.
df = pd.DataFrame({
'phone': ['123-456-7890', '987.654.3210', '555 123 4567', '(800)555-1234']
})
# Standardize phone numbers to XXX-XXX-XXXX format
# Remove all non-digits first, then format
df['cleaned'] = df['phone'].str.replace(r'\D', '', regex=True)
df['formatted'] = df['cleaned'].str.replace(r'(\d{3})(\d{3})(\d{4})', r'\1-\2-\3', regex=True)
print(df[['phone', 'formatted']])
Output:
phone formatted
0 123-456-7890 123-456-7890
1 987.654.3210 987-654-3210
2 555 123 4567 555-123-4567
3 (800)555-1234 800-555-1234
Capture groups () in the pattern correspond to backreferences \1, \2, \3 in the replacement string, allowing you to rearrange matched components.
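For longer patterns, named capture groups can make the same rearrangement self-documenting. A sketch, using an assumed unformatted phone string:

```python
import pandas as pd

s = pd.Series(['1234567890'])

# Named groups document what each captured piece represents;
# \g<name> references them in the replacement string
formatted = s.str.replace(
    r'(?P<area>\d{3})(?P<prefix>\d{3})(?P<line>\d{4})',
    r'(\g<area>) \g<prefix>-\g<line>',
    regex=True,
)
print(formatted.tolist())  # ['(123) 456-7890']
```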
Case-Insensitive Replacement
Case-insensitive matching requires regex flags. Pass flags using the flags parameter with values from Python’s re module.
import re
df = pd.DataFrame({
'text': ['Python is great', 'PYTHON rocks', 'I love python', 'PyThOn forever']
})
# Case-insensitive replacement
df['normalized'] = df['text'].str.replace('python', 'Programming', flags=re.IGNORECASE, regex=True)
print(df)
Output:
text normalized
0 Python is great Programming is great
1 PYTHON rocks Programming rocks
2 I love python I love Programming
3 PyThOn forever Programming forever
The re.IGNORECASE flag makes the pattern match regardless of case. The replacement string maintains the case you specify.
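The `str.lower()` alternative mentioned in the key insights avoids regex flags entirely, at the cost of discarding the original casing. A minimal sketch:

```python
import pandas as pd

s = pd.Series(['Python is great', 'PYTHON rocks'])

# Lowercasing first allows a literal match, but the whole string is lowercased
normalized = s.str.lower().str.replace('python', 'programming', regex=False)
print(normalized.tolist())  # ['programming is great', 'programming rocks']
```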
Handling Missing Values
Missing values (NaN) in string columns require explicit handling. By default, str.replace() propagates NaN values without errors.
df = pd.DataFrame({
'status': ['active', None, 'inactive', 'active', pd.NA, 'pending']
})
# NaN values remain NaN after replacement
df['updated'] = df['status'].str.replace('active', 'enabled', regex=False)
print(df)
Output:
status updated
0 active enabled
1 None None
2 inactive inactive
3 active enabled
4 <NA> <NA>
5 pending pending
To replace NaN values, use fillna() before or after string operations:
df['filled'] = df['status'].fillna('unknown').str.replace('active', 'enabled', regex=False)
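Filling after the string operation works just as well, since NaN passes through `str.replace()` untouched. A sketch on a standalone Series:

```python
import pandas as pd

s = pd.Series(['active', None, 'pending'])

# NaN survives str.replace(), then fillna() substitutes the placeholder
updated = s.str.replace('active', 'enabled', regex=False).fillna('unknown')
print(updated.tolist())  # ['enabled', 'unknown', 'pending']
```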
Multiple Replacements with Chaining
Chain multiple str.replace() calls for sequential transformations. Each operation passes its result to the next.
df = pd.DataFrame({
'raw_text': [' Hello, World! ', ' Python 3.9 ', ' Data-Science ']
})
# Clean and standardize text
df['cleaned'] = (df['raw_text']
.str.strip() # Remove leading/trailing spaces
.str.replace(',', '', regex=False) # Remove commas
.str.replace('!', '', regex=False) # Remove exclamation marks
.str.replace('-', ' ', regex=False) # Replace hyphens with spaces
.str.replace(r'\s+', ' ', regex=True)) # Normalize multiple spaces
print(df)
Output:
raw_text cleaned
0 Hello, World! Hello World
1 Python 3.9 Python 3.9
2 Data-Science Data Science
Method chaining maintains readability while avoiding intermediate variables. Each step performs a focused transformation.
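When the list of replacements grows, a loop over a dictionary can express the same pipeline more compactly. A sketch under the assumption that all replacements are literal and order-dependent (dict insertion order is guaranteed in Python 3.7+):

```python
import pandas as pd

df = pd.DataFrame({'raw_text': [' Hello, World! ', ' Data-Science ']})

# Literal replacements, applied in insertion order
replacements = {',': '', '!': '', '-': ' '}

cleaned = df['raw_text'].str.strip()
for old, new in replacements.items():
    cleaned = cleaned.str.replace(old, new, regex=False)
df['cleaned'] = cleaned.str.replace(r'\s+', ' ', regex=True)
print(df['cleaned'].tolist())  # ['Hello World', 'Data Science']
```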
Replacing with Functions
For complex replacement logic, pass a callable that receives match objects and returns replacement strings.
df = pd.DataFrame({
'amounts': ['$10.50', '$200.00', '$1,500.75', '$50']
})
# Convert currency strings to numeric values with 10% markup
def add_markup(match):
value = match.group(0).replace('$', '').replace(',', '')
new_value = float(value) * 1.10
return f'${new_value:,.2f}'
df['with_markup'] = df['amounts'].str.replace(r'\$[\d,]+\.?\d*', add_markup, regex=True)
print(df)
Output:
amounts with_markup
0 $10.50 $11.55
1 $200.00 $220.00
2 $1,500.75 $1,650.82
3 $50 $55.00
The function receives a regex match object, extracts the matched text, performs calculations, and returns the formatted replacement.
Performance Considerations
For large datasets, replacement operations can impact performance. Consider these optimization strategies:
import numpy as np
# Create large dataset
df = pd.DataFrame({
'text': ['status_active'] * 500000 + ['status_inactive'] * 500000
})
# Method 1: str.replace (slower for simple replacements)
%timeit df['text'].str.replace('status_', '', regex=False)
# Method 2: Direct vectorized operation when applicable
%timeit df['text'].str[7:] # Slice off first 7 characters
# Method 3: np.where for conditional replacements (fastest)
%timeit np.where(df['text'] == 'status_active', 'active', 'inactive')
For simple prefix/suffix removal, string slicing outperforms str.replace(). For conditional replacements on exact matches, np.where() or map() with dictionaries provides better performance.
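The dictionary-based `map()` approach mentioned above can be sketched like this, on a small assumed dataset:

```python
import pandas as pd

df = pd.DataFrame({'text': ['status_active', 'status_inactive', 'status_active']})

# Exact-match replacement via dictionary lookup; note that values
# absent from the mapping become NaN rather than passing through
mapping = {'status_active': 'active', 'status_inactive': 'inactive'}
df['mapped'] = df['text'].map(mapping)
print(df['mapped'].tolist())  # ['active', 'inactive', 'active']
```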
Common Patterns and Use Cases
df = pd.DataFrame({
'email': ['user@example.com', 'admin@test.org', 'info@company.net'],
'url': ['https://site.com', 'http://old-site.com', 'https://new.site.com'],
'code': ['ABC-123-XYZ', 'DEF-456-UVW', 'GHI-789-RST']
})
# Extract domain from email
df['domain'] = df['email'].str.replace(r'^.*@', '', regex=True)
# Standardize URLs to https
df['secure_url'] = df['url'].str.replace(r'^http://', 'https://', regex=True)
# Remove middle section from codes
df['simplified'] = df['code'].str.replace(r'-\d+-', '-', regex=True)
print(df)
Output:
email url code domain secure_url simplified
0 user@example.com https://site.com ABC-123-XYZ example.com https://site.com ABC-XYZ
1 admin@test.org http://old-site.com DEF-456-UVW test.org https://old-site.com DEF-UVW
2 info@company.net https://new.site.com GHI-789-RST company.net https://new.site.com GHI-RST
These patterns demonstrate practical data cleaning and standardization tasks common in data engineering workflows.