Pandas - str.get() - Get Character by Position
Key Insights
- `str.get()` retrieves the character at a given position from each string in a pandas Series, returning `NaN` for out-of-bounds indices instead of raising an error
- Unlike bracket notation on a plain Python string, `str.get()` handles short strings and missing values gracefully; it has no `default` parameter, so chain `.fillna()` when you need a custom fallback value
- The method works on strings of any length (and on lists and tuples stored in a Series), making it useful for parsing fixed-width data and extracting specific positions from structured text
Understanding str.get() Basics
The str.get() method in pandas accesses characters at specified positions within strings stored in a Series. This vectorized operation applies to each string element, extracting the character at the given index position.
import pandas as pd
# Create a Series with various strings
s = pd.Series(['apple', 'banana', 'cherry', 'date'])
# Get the first character (index 0)
print(s.str.get(0))
# 0 a
# 1 b
# 2 c
# 3 d
# dtype: object
# Get the third character (index 2)
print(s.str.get(2))
# 0 p
# 1 n
# 2 e
# 3 t
# dtype: object
Negative indexing works identically to Python strings, allowing you to access characters from the end:
# Get the last character
print(s.str.get(-1))
# 0 e
# 1 a
# 2 y
# 3 e
# dtype: object
# Get the second-to-last character
print(s.str.get(-2))
# 0 l
# 1 n
# 2 r
# 3 t
# dtype: object
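The same accessor also works when each element is a list rather than a string, which is common after splitting. The sketch below uses hypothetical sample data to show positional access on the lists produced by str.split(); the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical sample data: codes with an optional second part after a dash
parts = pd.Series(['a-b', 'c-d', 'e']).str.split('-')

# Each element is now a list; get(1) picks the second item of each list
# and yields a missing value where the list is too short
second = parts.str.get(1)
print(second)
```

The last row has only one part, so its result is missing rather than an error.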
Handling Out-of-Bounds Indices
The critical difference between str.get() and standard Python indexing is error handling. When the index exceeds string length, str.get() returns NaN instead of raising an IndexError:
s = pd.Series(['cat', 'elephant', 'ox'])
# Attempt to get the sixth character (index 5)
print(s.str.get(5))
# 0 NaN
# 1 a
# 2 NaN
# dtype: object
# Bracket notation on the accessor is equivalent and also returns NaN
# print(s.str[5])
# Plain Python indexing is what raises: 'cat'[5] -> IndexError
This behavior makes str.get() particularly useful when working with variable-length strings where you need consistent behavior across all elements:
# Extract the character at index 10 (the 11th character) from product codes
product_codes = pd.Series([
    'PROD-12345-ABC',
    'PROD-67-XY',
    'PROD-890-DEFGH'
])
print(product_codes.str.get(10))
# 0 -
# 1 NaN
# 2 E
# dtype: object
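To make the contrast with plain Python indexing concrete, here is a minimal sketch. It assumes nothing beyond pandas itself; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series(['cat', 'elephant', 'ox'])

# Plain Python indexing raises IndexError on a string that is too short
try:
    'cat'[5]
    raised = False
except IndexError:
    raised = True

# The pandas accessor returns a missing value instead
via_get = s.str.get(5)

# An integer in brackets on the accessor behaves the same way as get()
via_brackets = s.str[5]

print(raised)
print(via_get)
```

Only the plain string index fails; both accessor forms produce a Series with missing values for 'cat' and 'ox'.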
Supplying a Default Value
str.get() accepts only the position argument; it has no default parameter, and passing one raises a TypeError. To return a custom value instead of NaN for out-of-bounds positions, chain fillna() onto the result:
s = pd.Series(['short', 'medium text', 'x'])
# Use empty string as the fallback
print(s.str.get(10).fillna(''))
# 0
# 1 t
# 2
# dtype: object
# Use a placeholder character
print(s.str.get(5).fillna('?'))
# 0 ?
# 1 m
# 2 ?
# dtype: object
# Without fillna(), out-of-bounds positions stay NaN
print(s.str.get(8))
# 0 NaN
# 1 e
# 2 NaN
# dtype: object
This pattern is valuable when building strings where missing positions should have specific placeholders:
# Extract specific positions for fixed-width parsing
data = pd.Series(['ABC123', 'XY', 'DEFGHI789'])
# Build a code with fallbacks for short strings
position_0 = data.str.get(0).fillna('_')
position_3 = data.str.get(3).fillna('0')
position_6 = data.str.get(6).fillna('X')
result = position_0 + position_3 + position_6
print(result)
# 0 A1X
# 1 X0X
# 2 DG7
# dtype: object
Working with Missing Values
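str.get() handles missing values in the original Series gracefully, propagating them to the result. An empty string is a valid string with no positions, so any index on it is out of bounds and yields NaN:
s = pd.Series(['hello', None, 'world', pd.NA, ''])
print(s.str.get(0))
# Missing entries stay missing; whether they display as None, NaN, or <NA>
# depends on the pandas version and the Series dtype
# 0 h
# 1 None
# 2 w
# 3 <NA>
# 4 NaN
# dtype: object
# fillna() fills both missing inputs and out-of-bounds positions
print(s.str.get(1).fillna('X'))
# 0 e
# 1 X
# 2 o
# 3 X
# 4 X
# dtype: object
Note that empty strings are valid strings, so str.get() on an empty string returns NaN because the position is out of bounds, not because the original value was missing. Both cases count as missing in the result, so a fill applied afterwards cannot tell them apart.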
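When originally missing values must stay missing while out-of-bounds positions receive a placeholder, the two cases have to be separated before filling. The helper below is a hypothetical sketch (get_with_default is not a pandas function) that does this with a boolean mask.

```python
import pandas as pd

def get_with_default(series, i, default):
    """Hypothetical helper: fill only out-of-bounds positions with `default`."""
    chars = series.str.get(i)
    # Out of bounds means: the input was present, but the result is missing
    out_of_bounds = series.notna() & chars.isna()
    return chars.mask(out_of_bounds, default)

s = pd.Series(['hello', None, ''])
result = get_with_default(s, 1, 'X')
print(result)
```

Here 'hello' yields 'e', the missing entry stays missing, and the empty string receives the placeholder.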
Practical Applications
Parsing Fixed-Width Data
Fixed-width file formats store data at specific character positions. str.get() excels at extracting these fields:
# Parse fixed-width transaction records
transactions = pd.Series([
    'A20231115USD1000050',
    'B20231116EUR0750025',
    'C20231117GBP0500010'
])
df = pd.DataFrame({
    'type': transactions.str.get(0),
    'year': transactions.str[1:5],
    'currency_1st': transactions.str.get(9),
    'currency_2nd': transactions.str.get(10),
    'currency_3rd': transactions.str.get(11),
    'amount_check': transactions.str.get(12)
})
print(df)
#   type  year currency_1st currency_2nd currency_3rd amount_check
# 0    A  2023            U            S            D            1
# 1    B  2023            E            U            R            0
# 2    C  2023            G            B            P            0
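Fields wider than one character are usually easier to pull out with slices than with one str.get() call per character. The sketch below assumes a field layout inferred only from the three sample records above (type, date, currency, amount); the layout dictionary and its column names are illustrative, not part of any file specification.

```python
import pandas as pd

transactions = pd.Series([
    'A20231115USD1000050',
    'B20231116EUR0750025',
    'C20231117GBP0500010',
])

# Assumed layout as (start, stop) offsets, inferred from the sample records
layout = {
    'type': (0, 1),
    'date': (1, 9),
    'currency': (9, 12),
    'amount_raw': (12, 19),
}

# One slice per field; each column holds the raw substring
parsed = pd.DataFrame({
    name: transactions.str.slice(start, stop)
    for name, (start, stop) in layout.items()
})
print(parsed)
```

str.get() remains the better fit for genuinely single-character fields such as the record type.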
Validating String Formats
Check if strings conform to expected patterns by examining specific positions:
# Validate product SKUs (format: L-DDDD-L where L=letter, D=digit)
skus = pd.Series(['A-1234-X', 'B-567-Y', 'C-8901-Z', '12-3456-A'])
def validate_sku(s):
    return pd.DataFrame({
        'sku': s,
        'first_char_alpha': s.str.get(0).str.isalpha(),
        'second_char_dash': s.str.get(1) == '-',
        # An out-of-bounds position yields NaN, which compares unequal to '-'
        'seventh_char_dash': s.str.get(6) == '-',
        'last_char_alpha': s.str.get(-1).str.isalpha()
    })
print(validate_sku(skus))
#          sku  first_char_alpha  second_char_dash  seventh_char_dash  last_char_alpha
# 0   A-1234-X              True              True               True             True
# 1    B-567-Y              True              True              False             True
# 2   C-8901-Z              True              True               True             True
# 3  12-3456-A             False             False              False             True
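The individual checks can be collapsed into a single validity flag per SKU. This is a minimal sketch, assuming the L-DDDD-L format stated above implies a fixed length of 8; the column names and the added length and digit checks are illustrative.

```python
import pandas as pd

skus = pd.Series(['A-1234-X', 'B-567-Y', 'C-8901-Z', '12-3456-A'])

checks = pd.DataFrame({
    'length_ok': skus.str.len() == 8,            # assumed fixed length
    'first_alpha': skus.str.get(0).str.isalpha(),
    'dash_at_1': skus.str.get(1) == '-',
    'digits_ok': skus.str[2:6].str.isdigit(),    # the four-digit block
    'dash_at_6': skus.str.get(6) == '-',
    'last_alpha': skus.str.get(-1).str.isalpha(),
})

# A SKU is valid only if every positional check passes
is_valid = checks.all(axis=1)
print(skus[is_valid])
```

Only the first and third SKUs pass every check.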
Building Character Masks
Create masks for data cleaning or transformation based on character positions:
# Flag records where specific positions contain certain characters
codes = pd.Series(['X1234', 'Y5678', 'X9012', 'Z3456'])
# Find codes starting with 'X' and having '2' in the 3rd position (index 2)
mask = (codes.str.get(0) == 'X') & (codes.str.get(2) == '2')
print(codes[mask])
# 0 X1234
# dtype: object
# Extract codes where 2nd character is a digit > 5
numeric_2nd = pd.to_numeric(codes.str.get(1), errors='coerce')
high_value_codes = codes[numeric_2nd > 5]
print(high_value_codes)
# 2 X9012
# dtype: object
Performance Considerations
str.get() is optimized for vectorized operations but still processes each string individually. For large datasets with repeated position extractions, consider combining operations:
# Less efficient: multiple str.get() calls
s = pd.Series(['data'] * 100000)
char1 = s.str.get(0)
char2 = s.str.get(1)
char3 = s.str.get(2)
# More efficient: single slice operation when possible
chars = s.str[0:3] # Returns substring, not individual chars
# For truly individual characters, str.get() remains necessary
# A helper keeps repeated extractions in one place (it does not reduce the work)
def extract_positions(series, positions):
    return pd.DataFrame({
        f'pos_{i}': series.str.get(i)
        for i in positions
    })
# 'data' has only 4 characters, so pos_4 and pos_6 are all NaN
result = extract_positions(s, [0, 2, 4, 6])
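Whether any of this matters for a given workload is best settled by measuring. The sketch below times str.get() against a plain list comprehension using timeit; the comprehension is an assumption-laden stand-in that only works when the Series has no missing values and no strings shorter than the requested index. No timings are quoted here because they vary by machine and pandas version.

```python
import timeit
import pandas as pd

s = pd.Series(['data'] * 100_000)

def via_str_get():
    return s.str.get(0)

def via_comprehension():
    # Assumes every element is a non-empty string (no NaN handling)
    return pd.Series([x[0] for x in s], index=s.index)

t_get = timeit.timeit(via_str_get, number=3)
t_comp = timeit.timeit(via_comprehension, number=3)
print(f'str.get: {t_get:.3f}s, comprehension: {t_comp:.3f}s')
```

If the comprehension turns out faster on your data, keep in mind that it gives up the safe handling of short strings and missing values that str.get() provides.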
Comparison with Alternatives
While str.get() is purpose-built for single character extraction, understand when alternatives are more appropriate:
s = pd.Series(['example', 'test', 'data'])
# str.get() - single character
print(s.str.get(2)) # Returns: 'a', 's', 't'
# str[] slicing - substrings
print(s.str[2:5]) # Returns: 'amp', 'st', 'ta'
# str.extract() - pattern-based extraction
print(s.str.extract(r'(\w{2})')) # Returns first 2 chars using regex
# Choose str.get() when you need:
# - Single character at known position
# - Graceful handling of variable-length strings
# - NaN for missing positions (fill with fillna() if a custom value is needed)
The method integrates seamlessly into pandas method chains, making it a clean solution for character-level string operations in data pipelines.