Pandas - str.replace() with Examples

Key Insights

  • str.replace() matches literal strings by default (since pandas 2.0); pass regex=True to switch to pattern-based matching, making it versatile for both exact and regex replacements
  • Method chaining with str.replace() enables complex multi-step text transformations while maintaining code readability and performance
  • Matching is case-sensitive by default; use the case or flags parameters (or normalize with str.lower()) for case-insensitive operations when needed

Basic String Replacement

The str.replace() method operates on Pandas Series containing string data, replacing every occurrence of the search pattern within each string. Since pandas 2.0 the pattern is treated as a literal string by default; pass regex=True to enable regular-expression matching.

import pandas as pd

df = pd.DataFrame({
    'text': ['Hello World', 'Hello Python', 'Hello Pandas', 'Goodbye World']
})

# Replace 'Hello' with 'Hi'
df['modified'] = df['text'].str.replace('Hello', 'Hi')
print(df)

Output:

            text       modified
0    Hello World       Hi World
1   Hello Python      Hi Python
2   Hello Pandas      Hi Pandas
3  Goodbye World  Goodbye World

The method returns a new Series with replacements applied. Original data remains unchanged unless you reassign to the same column.
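Because the original Series is preserved, the copy-then-reassign behavior can be verified directly; a minimal sketch:

```python
import pandas as pd

s = pd.Series(['Hello World', 'Goodbye World'])

# str.replace returns a new Series; the original is untouched
replaced = s.str.replace('Hello', 'Hi', regex=False)
print(s.tolist())         # ['Hello World', 'Goodbye World'] - original unchanged
print(replaced.tolist())  # ['Hi World', 'Goodbye World']

# Reassign to make the change stick
s = replaced
```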

Literal vs Regex Matching

Understanding when str.replace() uses regex versus literal matching prevents unexpected behavior. The regex parameter controls this, and since pandas 2.0 it defaults to False (literal matching).

df = pd.DataFrame({
    'code': ['user.name', 'user.email', 'admin.role', 'user.id']
})

# Regex mode - dot matches any character
df['regex_replace'] = df['code'].str.replace('user.', 'member.', regex=True)

# Literal mode - dot is treated as literal period
df['literal_replace'] = df['code'].str.replace('user.', 'member.', regex=False)

print(df)

Output:

         code  regex_replace  literal_replace
0   user.name    member.name      member.name
1  user.email   member.email     member.email
2  admin.role     admin.role       admin.role
3     user.id      member.id        member.id

In this example, both produce identical results because the strings match the pattern. However, with special regex characters like ., *, +, ?, [], the difference becomes critical.

df = pd.DataFrame({
    'price': ['$100', '$200.50', '$1,500', '$75.99']
})

# In regex mode, $ is interpreted as end-of-line, so nothing is replaced
df['wrong'] = df['price'].str.replace('$', '', regex=True)

# Correct approach - escape the special character or use literal mode
df['escaped'] = df['price'].str.replace(r'\$', '', regex=True)
df['literal'] = df['price'].str.replace('$', '', regex=False)

print(df[['price', 'escaped', 'literal']])
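When the characters to strip come from configuration or user input, re.escape() can neutralize any regex metacharacters before matching; a sketch (the symbol variable is illustrative):

```python
import re

import pandas as pd

prices = pd.Series(['$100', '$200.50', '$1,500'])

# 'symbol' is illustrative - imagine it arrives from config or user input
symbol = '$'
pattern = re.escape(symbol)  # escapes regex metacharacters, here producing '\\$'
no_symbol = prices.str.replace(pattern, '', regex=True)
print(no_symbol.tolist())  # ['100', '200.50', '1,500']
```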

Pattern-Based Replacements with Regex

Regex patterns enable sophisticated text transformations based on character classes, quantifiers, and groups.

df = pd.DataFrame({
    'phone': ['123-456-7890', '987.654.3210', '555 123 4567', '(800)555-1234']
})

# Standardize phone numbers to XXX-XXX-XXXX format
# Remove all non-digits first, then format
df['cleaned'] = df['phone'].str.replace(r'\D', '', regex=True)
df['formatted'] = df['cleaned'].str.replace(r'(\d{3})(\d{3})(\d{4})', r'\1-\2-\3', regex=True)

print(df[['phone', 'formatted']])

Output:

           phone     formatted
0   123-456-7890  123-456-7890
1   987.654.3210  987-654-3210
2   555 123 4567  555-123-4567
3  (800)555-1234  800-555-1234

Capture groups () in the pattern correspond to backreferences \1, \2, \3 in the replacement string, allowing you to rearrange matched components.
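For longer patterns, named capture groups can make the replacement string easier to read; a small sketch (the group names area/prefix/line are illustrative), assuming already-cleaned digit strings:

```python
import pandas as pd

# Digits-only phone numbers, as produced by the cleaning step above
cleaned = pd.Series(['1234567890', '9876543210'])

# Named groups are referenced in the replacement with \g<name>
pattern = r'(?P<area>\d{3})(?P<prefix>\d{3})(?P<line>\d{4})'
formatted = cleaned.str.replace(pattern, r'(\g<area>) \g<prefix>-\g<line>', regex=True)
print(formatted.tolist())  # ['(123) 456-7890', '(987) 654-3210']
```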

Case-Insensitive Replacement

Case-insensitive matching requires regex flags. Pass flags using the flags parameter with values from Python’s re module.

import re

df = pd.DataFrame({
    'text': ['Python is great', 'PYTHON rocks', 'I love python', 'PyThOn forever']
})

# Case-insensitive replacement
df['normalized'] = df['text'].str.replace('python', 'Programming', flags=re.IGNORECASE, regex=True)

print(df)

Output:

               text            normalized
0  Python is great  Programming is great
1     PYTHON rocks     Programming rocks
2    I love python    I love Programming
3   PyThOn forever   Programming forever

The re.IGNORECASE flag makes the pattern match regardless of case. The replacement string maintains the case you specify.
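pandas also offers a case parameter as a shorthand for re.IGNORECASE, avoiding the re import entirely; note it cannot be combined with a pre-compiled pattern:

```python
import pandas as pd

s = pd.Series(['Python is great', 'PYTHON rocks'])

# case=False is shorthand for flags=re.IGNORECASE
result = s.str.replace('python', 'Programming', case=False, regex=True)
print(result.tolist())  # ['Programming is great', 'Programming rocks']
```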

Handling Missing Values

Missing values (NaN) in string columns require explicit handling. By default, str.replace() propagates NaN values without errors.

df = pd.DataFrame({
    'status': ['active', None, 'inactive', 'active', pd.NA, 'pending']
})

# NaN values remain NaN after replacement
df['updated'] = df['status'].str.replace('active', 'enabled', regex=False)

print(df)
print(f"\nData types:\n{df.dtypes}")

Output:

     status  updated
0    active  enabled
1      None     None
2  inactive  inactive
3    active  enabled
4      <NA>     <NA>
5   pending  pending

To replace NaN values, use fillna() before or after string operations:

df['filled'] = df['status'].fillna('unknown').str.replace('active', 'enabled', regex=False)

Multiple Replacements with Chaining

Chain multiple str.replace() calls for sequential transformations. Each operation passes its result to the next.

df = pd.DataFrame({
    'raw_text': ['  Hello, World!  ', '  Python  3.9  ', '  Data-Science  ']
})

# Clean and standardize text
df['cleaned'] = (df['raw_text']
                 .str.strip()                           # Remove leading/trailing spaces
                 .str.replace(',', '', regex=False)      # Remove commas
                 .str.replace('!', '', regex=False)      # Remove exclamation marks
                 .str.replace('-', ' ', regex=False)     # Replace hyphens with spaces
                 .str.replace(r'\s+', ' ', regex=True))  # Normalize multiple spaces

print(df)

Output:

            raw_text       cleaned
0    Hello, World!     Hello World
1      Python  3.9      Python 3.9
2     Data-Science    Data Science

Method chaining maintains readability while avoiding intermediate variables. Each step performs a focused transformation.
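If you prefer fewer passes over the data, adjacent literal replacements can sometimes be folded into a single regex step with a character class, trading a little readability for fewer operations; a sketch:

```python
import pandas as pd

raw = pd.Series(['  Hello, World!  ', '  Data-Science  '])

cleaned = (raw
           .str.strip()
           .str.replace(r'[,!]', '', regex=True)   # drop commas and exclamation marks in one pass
           .str.replace('-', ' ', regex=False)     # replace hyphens with spaces
           .str.replace(r'\s+', ' ', regex=True))  # normalize multiple spaces
print(cleaned.tolist())  # ['Hello World', 'Data Science']
```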

Replacing with Functions

For complex replacement logic, pass a callable that receives match objects and returns replacement strings.

df = pd.DataFrame({
    'amounts': ['$10.50', '$200.00', '$1,500.75', '$50']
})

# Convert currency strings to numeric values with 10% markup
def add_markup(match):
    value = match.group(0).replace('$', '').replace(',', '')
    new_value = float(value) * 1.10
    return f'${new_value:,.2f}'

df['with_markup'] = df['amounts'].str.replace(r'\$[\d,]+\.?\d*', add_markup, regex=True)

print(df)

Output:

     amounts with_markup
0     $10.50      $11.55
1    $200.00     $220.00
2  $1,500.75   $1,650.82
3        $50      $55.00

The function receives a regex match object, extracts the matched text, performs calculations, and returns the formatted replacement.

Performance Considerations

For large datasets, replacement operations can impact performance. Consider these optimization strategies:

import numpy as np

# Create large dataset
df = pd.DataFrame({
    'text': ['status_active'] * 500000 + ['status_inactive'] * 500000
})

# Method 1: str.replace (slower for simple replacements)
%timeit df['text'].str.replace('status_', '', regex=False)

# Method 2: Direct vectorized operation when applicable
%timeit df['text'].str[7:]  # Slice off first 7 characters

# Method 3: np.where for conditional replacements (fastest)
%timeit np.where(df['text'] == 'status_active', 'active', 'inactive')

For simple prefix/suffix removal, string slicing outperforms str.replace(). For conditional replacements on exact matches, np.where() or map() with dictionaries provides better performance.
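As a sketch of the dictionary approach for exact matches (the column values and mapping are illustrative):

```python
import pandas as pd

s = pd.Series(['status_active', 'status_inactive', 'status_active'])

# Exact-match recoding via a dictionary
mapping = {'status_active': 'active', 'status_inactive': 'inactive'}
result = s.map(mapping)
print(result.tolist())  # ['active', 'inactive', 'active']

# Values absent from the dict become NaN; fall back to the original if needed
safe = s.map(mapping).fillna(s)
```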

Common Patterns and Use Cases

df = pd.DataFrame({
    'email': ['user@example.com', 'admin@test.org', 'info@company.net'],
    'url': ['https://site.com', 'http://old-site.com', 'https://new.site.com'],
    'code': ['ABC-123-XYZ', 'DEF-456-UVW', 'GHI-789-RST']
})

# Extract domain from email
df['domain'] = df['email'].str.replace(r'^.*@', '', regex=True)

# Standardize URLs to https
df['secure_url'] = df['url'].str.replace(r'^http://', 'https://', regex=True)

# Remove middle section from codes
df['simplified'] = df['code'].str.replace(r'-\d+-', '-', regex=True)

print(df)

Output:

              email                   url         code       domain            secure_url  simplified
0  user@example.com      https://site.com  ABC-123-XYZ  example.com      https://site.com     ABC-XYZ
1    admin@test.org   http://old-site.com  DEF-456-UVW     test.org  https://old-site.com     DEF-UVW
2  info@company.net  https://new.site.com  GHI-789-RST  company.net  https://new.site.com     GHI-RST

These patterns demonstrate practical data cleaning and standardization tasks common in data engineering workflows.
