Pandas - Read Fixed-Width File (read_fwf)
• `read_fwf()` handles fixed-width format files where columns are defined by character positions rather than delimiters, common in legacy systems and government data
Key Insights
• read_fwf() handles fixed-width format files where columns are defined by character positions rather than delimiters, common in legacy systems and government data
• Column specifications can be defined using tuples of start-end positions, automatic width detection, or column width lists—each method suited for different file structures
• Proper handling of whitespace, data types, and missing values is critical since fixed-width formats often contain padded spaces and implicit null representations
Understanding Fixed-Width File Format
Fixed-width files allocate a specific number of characters for each field, padding shorter values with spaces. Unlike CSV files that use delimiters, each column occupies predetermined character positions. This format remains prevalent in mainframe systems, financial data feeds, and government datasets.
import pandas as pd
from io import StringIO
# Example fixed-width data
data = """
John Doe 35Marketing
Jane Smith 28Engineering
Michael Johnson 42Sales
"""
# Basic read_fwf usage with column specifications
df = pd.read_fwf(StringIO(data),
colspecs=[(0, 10), (10, 20), (20, 22), (22, 33)],
names=['FirstName', 'LastName', 'Age', 'Department'])
print(df)
Output:
FirstName LastName Age Department
0 John Doe 35 Marketing
1 Jane Smith 28 Engineering
2 Michael Johnson 42 Sales
Defining Column Specifications
Using colspecs Parameter
The colspecs parameter accepts a list of tuples defining start and end positions for each column. Positions are zero-indexed and the end position is exclusive.
# Reading employee data with explicit column positions
employee_data = """
EMP001 Alice Johnson 2015-03-15 085000
EMP002 Bob Williams 2018-07-22 072000
EMP003 Catherine Brown 2020-01-10 095000
"""
df = pd.read_fwf(
StringIO(employee_data),
colspecs=[
(0, 6), # Employee ID
(7, 17), # First Name
(18, 28), # Last Name
(29, 39), # Hire Date
(40, 46) # Salary
],
names=['EmployeeID', 'FirstName', 'LastName', 'HireDate', 'Salary'],
dtype={'EmployeeID': str, 'Salary': int}
)
print(df)
Using widths Parameter
For simpler cases where you know column widths, use the widths parameter instead of calculating positions.
# Product inventory with known column widths
inventory = """
PROD001 Laptop 001250 0025
PROD002 Mouse 000035 0150
PROD003 Keyboard 000089 0075
"""
df = pd.read_fwf(
StringIO(inventory),
widths=[7, 14, 6, 4],
names=['ProductID', 'ProductName', 'Price', 'Quantity']
)
print(df)
Automatic Width Detection
When colspecs is not specified, pandas attempts to infer column widths by analyzing whitespace patterns. This works well for well-formatted files but may fail with irregular spacing.
# Well-formatted data for automatic detection
formatted_data = """
Name Age City Salary
John 35 New York 85000
Jane 28 Boston 72000
Michael 42 Chicago 95000
"""
df = pd.read_fwf(StringIO(formatted_data))
print(df)
Handling Data Types and Conversions
Fixed-width files often require explicit type conversions since all data is initially read as strings.
from datetime import datetime
# Financial transaction data
transactions = """
20230115 ACC001 0001250.50 D
20230116 ACC002 0000875.25 C
20230117 ACC001 0000500.00 D
"""
df = pd.read_fwf(
StringIO(transactions),
colspecs=[(0, 8), (9, 15), (16, 26), (27, 28)],
names=['Date', 'Account', 'Amount', 'Type'],
converters={
'Date': lambda x: pd.to_datetime(x, format='%Y%m%d'),
'Amount': lambda x: float(x)
}
)
print(df.dtypes)
print(df)
Managing Missing Values and Padding
Fixed-width formats often represent missing values as spaces or specific placeholder characters.
# Data with missing values (spaces and NA markers)
customer_data = """
CUST001 John Doe john@email.com 555-1234
CUST002 Jane Smith 555-5678
CUST003 Bob Johnson bob@email.com
CUST004 Alice alice@email.com 555-9012
"""
df = pd.read_fwf(
StringIO(customer_data),
colspecs=[(0, 7), (8, 18), (18, 28), (28, 46), (46, 58)],
names=['CustomerID', 'FirstName', 'LastName', 'Email', 'Phone'],
na_values=['', 'NA', 'NULL'],
keep_default_na=True
)
# Strip whitespace from string columns
string_cols = df.select_dtypes(include=['object']).columns
df[string_cols] = df[string_cols].apply(lambda x: x.str.strip())
print(df)
Reading from Files with Headers and Footers
Real-world fixed-width files often contain header and footer rows that need to be skipped.
# File with header and footer information
report_data = """
DAILY SALES REPORT - 2023-01-15
STORE PRODUCT QUANTITY REVENUE
================================================
STORE001 Product A 00125 012500.00
STORE001 Product B 00089 008900.00
STORE002 Product A 00156 015600.00
================================================
TOTAL RECORDS: 3
"""
df = pd.read_fwf(
StringIO(report_data),
colspecs=[(0, 8), (9, 21), (22, 31), (32, 42)],
names=['Store', 'Product', 'Quantity', 'Revenue'],
skiprows=3, # Skip header rows
skipfooter=2, # Skip footer rows
engine='python' # Required for skipfooter
)
print(df)
Processing Large Fixed-Width Files
For large files, use chunking to process data in manageable pieces.
# Processing large file in chunks
def process_large_fwf(filename, chunksize=10000):
chunks = []
for chunk in pd.read_fwf(
filename,
colspecs=[(0, 10), (10, 20), (20, 30)],
names=['ID', 'Name', 'Value'],
chunksize=chunksize
):
# Process each chunk
chunk['Value'] = pd.to_numeric(chunk['Value'], errors='coerce')
chunk = chunk[chunk['Value'] > 0] # Filter
chunks.append(chunk)
return pd.concat(chunks, ignore_index=True)
# Example with StringIO simulating a file
large_data = "\n".join([f"ID{i:06d} Name{i:04d} {i*100:08d}" for i in range(100)])
# Process in chunks of 25 rows
result = process_large_fwf(StringIO(large_data), chunksize=25)
print(f"Processed {len(result)} rows")
Handling Complex Nested Structures
Some fixed-width files contain hierarchical data with different record types identified by specific codes.
# Multi-record type file
hierarchical_data = """
H 20230115 BATCH001
D ITEM001 0125.50 05
D ITEM002 0089.75 10
T 00002 0215.25
H 20230116 BATCH002
D ITEM003 0156.00 03
T 00001 0156.00
"""
# Separate by record type
header_spec = [(0, 1), (2, 10), (11, 19)]
detail_spec = [(0, 1), (2, 9), (10, 17), (18, 20)]
trailer_spec = [(0, 1), (2, 7), (11, 18)]
lines = hierarchical_data.strip().split('\n')
headers, details, trailers = [], [], []
for line in lines:
if line.startswith('H'):
headers.append(line)
elif line.startswith('D'):
details.append(line)
elif line.startswith('T'):
trailers.append(line)
df_details = pd.read_fwf(
StringIO('\n'.join(details)),
colspecs=detail_spec,
names=['RecType', 'ItemID', 'Price', 'Qty']
)
print(df_details)
Performance Optimization Tips
# Optimized reading with specific dtypes and options
df = pd.read_fwf(
'large_file.txt',
colspecs=[(0, 10), (10, 20), (20, 30), (30, 40)],
names=['ID', 'Name', 'Value', 'Date'],
dtype={'ID': 'category', 'Name': str, 'Value': 'float32'},
parse_dates=['Date'],
infer_datetime_format=True,
na_filter=False, # Disable if no missing values
low_memory=False,
engine='c' # Faster C engine (default)
)
The read_fwf() function provides robust handling of fixed-width formats with flexible column specification methods. Understanding the structure of your data and choosing appropriate parameters ensures efficient parsing while maintaining data integrity. Always validate output with sample data before processing production files.