Pandas - str.get() - Get Character by Position

Key Insights

str.get() retrieves characters at specific positions from strings in pandas Series, returning NaN for out-of-bounds indices instead of raising errors
Unlike plain Python indexing, str.get() does not raise on short strings and passes missing values through; it takes only the position argument, so custom fallback values are applied afterwards with fillna()
The method works on strings of any length, as well as on lists, tuples, and dicts stored in a Series, which makes it useful for parsing fixed-width data and extracting specific positions from structured text

Understanding str.get() Basics

The str.get() method in pandas accesses characters at specified positions within strings stored in a Series. This vectorized operation applies to each string element, extracting the character at the given index position.

import pandas as pd

# Create a Series with various strings
s = pd.Series(['apple', 'banana', 'cherry', 'date'])

# Get the first character (index 0)
print(s.str.get(0))
# 0    a
# 1    b
# 2    c
# 3    d
# dtype: object

# Get the third character (index 2)
print(s.str.get(2))
# 0    p
# 1    n
# 2    e
# 3    t
# dtype: object

Negative indexing works identically to Python strings, allowing you to access characters from the end:

# Get the last character
print(s.str.get(-1))
# 0    e
# 1    a
# 2    y
# 3    e
# dtype: object

# Get the second-to-last character
print(s.str.get(-2))
# 0    l
# 1    n
# 2    r
# 3    t
# dtype: object

Handling Out-of-Bounds Indices

The critical difference between str.get() and standard Python indexing is error handling. When the index exceeds string length, str.get() returns NaN instead of raising an IndexError:

s = pd.Series(['cat', 'elephant', 'ox'])

# Get the character at index 5 (the sixth character)
print(s.str.get(5))
# 0    NaN
# 1      a
# 2    NaN
# dtype: object

# Compare with plain Python indexing
# 'cat'[5]  # IndexError: string index out of range
# Note: s.str[5] behaves like s.str.get(5) and also returns NaN
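To make the contrast concrete, the sketch below compares plain Python indexing with str.get(). The helper name safe_get is made up for illustration; it approximates the per-element behavior and is not pandas' actual implementation.

```python
import numpy as np
import pandas as pd

def safe_get(value, i):
    """Hypothetical per-element model of str.get(): NaN instead of IndexError."""
    if not isinstance(value, str):
        return np.nan  # assumption: anything that is not a string counts as missing
    if -len(value) <= i < len(value):
        return value[i]
    return np.nan

words = ['cat', 'elephant', 'ox']

# Plain Python indexing raises for strings that are too short
word = words[0]
try:
    word[5]
    raised = False
except IndexError:
    raised = True

modelled = [safe_get(w, 5) for w in words]
actual = pd.Series(words).str.get(5)
bracket = pd.Series(words).str[5]  # integer indexing on the accessor gives the same result

print(raised)    # True
print(modelled)  # [nan, 'a', nan]
```

Both actual and bracket hold NaN for 'cat' and 'ox' and 'a' for 'elephant', so the error only appears when indexing the Python strings directly.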

This behavior makes str.get() particularly useful when working with variable-length strings where you need consistent behavior across all elements:

# Extract the character at index 10 from product codes
product_codes = pd.Series([
    'PROD-12345-ABC',
    'PROD-67-XY',
    'PROD-890-DEFGH'
])

print(product_codes.str.get(10))
# 0      -
# 1    NaN
# 2      E
# dtype: object
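
Because codes like these are hyphen-delimited rather than truly fixed-width, a given character position may not line up with the same field in every row. One possible alternative, shown here as a sketch, is to split on the delimiter first: str.get() also works on the lists that str.split() produces, again returning NaN when a position does not exist. The third row below is an assumed example with no suffix field.

```python
import pandas as pd

product_codes = pd.Series([
    'PROD-12345-ABC',
    'PROD-67-XY',
    'PROD-890',        # assumed extra row without a suffix field
])

# Each element becomes a list of fields, e.g. ['PROD', '12345', 'ABC']
parts = product_codes.str.split('-')

numeric_part = parts.str.get(1)   # second field of each code
suffix = parts.str.get(2)         # third field, missing for the last row

print(numeric_part.tolist())   # ['12345', '67', '890']
print(suffix.isna().tolist())  # [False, False, True]
```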

Supplying a Fallback Value

str.get() accepts only the position argument; it has no default parameter. To substitute a custom value for out-of-bounds positions, chain fillna() onto the result:

s = pd.Series(['short', 'medium text', 'x'])

# Use an empty string as the fallback
print(s.str.get(10).fillna('').tolist())
# ['', 't', '']

# Use a placeholder character
print(s.str.get(5).fillna('?'))
# 0    ?
# 1    m
# 2    ?
# dtype: object

# Without fillna(), out-of-bounds positions stay NaN
print(s.str.get(8))
# 0    NaN
# 1      e
# 2    NaN
# dtype: object

Fallback values are valuable when building strings where missing positions should have specific placeholders:

# Extract specific positions for fixed-width parsing
data = pd.Series(['ABC123', 'XY', 'DEFGHI789'])

# Build a code, filling missing positions with placeholders
position_0 = data.str.get(0).fillna('_')
position_3 = data.str.get(3).fillna('0')
position_6 = data.str.get(6).fillna('X')

result = position_0 + position_3 + position_6
print(result)
# 0    A1X
# 1    X0X
# 2    DG7
# dtype: object
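
When several positions each need their own placeholder, the get-then-fill step can be wrapped once. The sketch below assumes the documented signature Series.str.get(i), which takes only a position; the helper get_or and the layout mapping are hypothetical names chosen for illustration.

```python
import pandas as pd

def get_or(series, i, fallback):
    """Hypothetical helper: element at position i, or `fallback` where str.get() is missing."""
    return series.str.get(i).fillna(fallback)

data = pd.Series(['ABC123', 'XY', 'DEFGHI789'])

# position -> placeholder, assumed for illustration
layout = {0: '_', 3: '0', 6: 'X'}

table = pd.DataFrame({
    f'pos_{i}': get_or(data, i, fallback)
    for i, fallback in layout.items()
})

print(table['pos_3'].tolist())  # ['1', '0', 'G']
print(table['pos_6'].tolist())  # ['X', 'X', '7']
```

Note that fillna() also replaces values that were already missing in the input, so this helper does not tell a short string apart from a missing one.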

Working with Missing Values

str.get() handles NaN values in the original Series gracefully, propagating them to the result:

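s = pd.Series(['hello', None, 'world', pd.NA, ''])

print(s.str.get(0))
# 0       h
# 1    None
# 2       w
# 3    <NA>
# 4     NaN
# dtype: object

# Missing inputs and out-of-range positions are both reported as missing
print(s.str.get(1).isna())
# 0    False
# 1     True
# 2    False
# 3     True
# 4     True
# dtype: bool

Note that an empty string is a valid string, not a missing value, but it has no character at any position, so str.get() returns NaN for it. The marker printed for missing inputs (None, NaN, or <NA>) can vary with the pandas version and dtype, and a later fillna() replaces all of them alike.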
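
If the two cases need to be told apart, one approach is to compare against the missing-value mask of the original Series. This is a sketch under that assumption; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series(['hello', None, 'world', ''])

first = s.str.get(0)

input_missing = s.isna()                      # the original value was missing
out_of_range = first.isna() & ~input_missing  # a string was present, but too short

# Fill only the positions that were out of range, leaving missing inputs untouched
filled = first.mask(out_of_range, '?')

print(input_missing.tolist())  # [False, True, False, False]
print(out_of_range.tolist())   # [False, False, False, True]
```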

Practical Applications

Parsing Fixed-Width Data

Fixed-width file formats store data at specific character positions. str.get() extracts single-character fields, while str[] slicing handles fields that span several characters:

# Parse fixed-width transaction records
transactions = pd.Series([
    'A20231115USD1000050',
    'B20231116EUR0750025',
    'C20231117GBP0500010'
])

df = pd.DataFrame({
    'type': transactions.str.get(0),
    'year': transactions.str[1:5],
    'currency_1st': transactions.str.get(9),
    'currency_2nd': transactions.str.get(10),
    'currency_3rd': transactions.str.get(11),
    'amount_check': transactions.str.get(12)
})

print(df)
#   type  year currency_1st currency_2nd currency_3rd amount_check
# 0    A  2023            U            S            D            1
# 1    B  2023            E            U            R            0
# 2    C  2023            G            B            P            0
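
For fields wider than one character, slicing each field in a single step is usually tidier than collecting characters one by one. The sketch below assumes a layout of type (1 character), date (8), currency (3), and amount (7); that layout is inferred from the sample records, not from a documented specification.

```python
import pandas as pd

transactions = pd.Series([
    'A20231115USD1000050',
    'B20231116EUR0750025',
    'C20231117GBP0500010',
])

# Assumed layout: type [0], date [1:9], currency [9:12], amount [12:19]
parsed = pd.DataFrame({
    'type': transactions.str.get(0),
    'date': pd.to_datetime(transactions.str[1:9], format='%Y%m%d'),
    'currency': transactions.str[9:12],
    'amount_raw': transactions.str[12:19],  # kept as text; units are not specified
})

print(parsed['currency'].tolist())    # ['USD', 'EUR', 'GBP']
print(parsed['amount_raw'].tolist())  # ['1000050', '0750025', '0500010']
```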

Validating String Formats

Check if strings conform to expected patterns by examining specific positions:

# Validate product SKUs (format: L-DDDD-L where L=letter, D=digit)
skus = pd.Series(['A-1234-X', 'B-567-Y', 'C-8901-Z', '12-3456-A'])

def validate_sku(s):
    return pd.DataFrame({
        'sku': s,
        'first_char_alpha': s.str.get(0).str.isalpha(),
        'second_char_dash': s.str.get(1) == '-',
        'seventh_char_dash': s.str.get(6) == '-',
        'last_char_alpha': s.str.get(-1).str.isalpha()
    })

print(validate_sku(skus))
#          sku  first_char_alpha  second_char_dash  seventh_char_dash  last_char_alpha
# 0   A-1234-X              True              True               True             True
# 1    B-567-Y              True              True              False             True
# 2   C-8901-Z              True              True               True             True
# 3  12-3456-A             False             False              False             True
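
Position checks report which part of a SKU is wrong. When only a pass/fail answer is needed, a whole-string pattern match is a possible alternative. The sketch below uses str.fullmatch() with a pattern assumed from the L-DDDD-L format; it accepts ASCII letters only, whereas str.isalpha() also accepts other Unicode letters.

```python
import pandas as pd

skus = pd.Series(['A-1234-X', 'B-567-Y', 'C-8901-Z', '12-3456-A'])

# Assumed pattern for L-DDDD-L: letter, dash, four digits, dash, letter
pattern = r'[A-Za-z]-\d{4}-[A-Za-z]'
is_valid = skus.str.fullmatch(pattern)

print(is_valid.tolist())  # [True, False, True, False]
```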

Building Character Masks

Create masks for data cleaning or transformation based on character positions:

# Flag records where specific positions contain certain characters
codes = pd.Series(['X1234', 'Y5678', 'X9012', 'Z3456'])

# Find codes starting with 'X' and having '2' in the 3rd position (index 2)
mask = (codes.str.get(0) == 'X') & (codes.str.get(2) == '2')
print(codes[mask])
# 0    X1234
# dtype: object

# Extract codes where 2nd character is a digit > 5
numeric_2nd = pd.to_numeric(codes.str.get(1), errors='coerce')
high_value_codes = codes[numeric_2nd > 5]
print(high_value_codes)
# 2    X9012
# dtype: object

Performance Considerations

str.get() offers a vectorized interface, but it still visits each string individually, so every call is a full pass over the Series. For large datasets with repeated position extractions, consider combining operations:

# Less efficient: multiple str.get() calls
s = pd.Series(['data'] * 100000)
char1 = s.str.get(0)
char2 = s.str.get(1)
char3 = s.str.get(2)

# More efficient: single slice operation when possible
chars = s.str[0:3]  # Returns substring, not individual chars

# For individual characters, one str.get() call per position is still needed
# A small helper keeps those calls in one place (it does not reduce the work done)
def extract_positions(series, positions):
    return pd.DataFrame({
        f'pos_{i}': series.str.get(i) 
        for i in positions
    })

result = extract_positions(s, [0, 2, 4, 6])
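
Whether the difference matters depends on the data and the pandas version, so it is worth measuring. The sketch below is a rough timing harness using the standard timeit module; the Series size and repetition count are arbitrary choices, and no particular timings are implied.

```python
import timeit
import pandas as pd

s = pd.Series(['data'] * 20_000)

def three_gets():
    # Three passes over the Series, then string concatenation
    return s.str.get(0) + s.str.get(1) + s.str.get(2)

def one_slice():
    # One pass over the Series
    return s.str[0:3]

# Both approaches produce the same three-character strings
same = three_gets().equals(one_slice())
print(same)  # True

t_gets = timeit.timeit(three_gets, number=5)
t_slice = timeit.timeit(one_slice, number=5)
print(f'three str.get() calls: {t_gets:.3f}s, one slice: {t_slice:.3f}s')
```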

Comparison with Alternatives

While str.get() is purpose-built for single character extraction, understand when alternatives are more appropriate:

s = pd.Series(['example', 'test', 'data'])

# str.get() - single character
print(s.str.get(2))  # Returns: 'a', 's', 't'

# str[] slicing - substrings
print(s.str[2:5])  # Returns: 'amp', 'st', 'ta'

# str.extract() - pattern-based extraction
print(s.str.extract(r'(\w{2})'))  # Returns first 2 chars using regex

# Choose str.get() when you need:
# - Single character at known position
# - Graceful handling of variable-length strings
# - NaN for missing positions (replace with fillna() if needed)

The method integrates seamlessly into pandas method chains, making it a clean solution for character-level string operations in data pipelines.
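
As an illustration of such a chain, the sketch below derives a prefix column with str.get() inside assign() and aggregates on it. The orders table and its column names are invented for the example.

```python
import pandas as pd

# Hypothetical data for illustration
orders = pd.DataFrame({
    'code': ['X1234', 'Y5678', 'X9012', 'Z3'],
    'qty': [3, 1, 7, 2],
})

summary = (
    orders
    .assign(prefix=lambda d: d['code'].str.get(0))  # first character of each code
    .groupby('prefix', as_index=False)['qty']
    .sum()
)

print(summary['prefix'].tolist())  # ['X', 'Y', 'Z']
print(summary['qty'].tolist())     # [10, 1, 2]
```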