Pandas - Read Fixed-Width File (read_fwf)

Key Insights

• read_fwf() handles fixed-width format files where columns are defined by character positions rather than delimiters, a format common in legacy systems and government data
• Column specifications can be defined as tuples of start-end positions, as a list of column widths, or inferred automatically; each method suits a different file structure
• Proper handling of whitespace, data types, and missing values is critical, since fixed-width formats often contain padded spaces and implicit null representations

Understanding Fixed-Width File Format

Fixed-width files allocate a specific number of characters for each field, padding shorter values with spaces. Unlike CSV files that use delimiters, each column occupies predetermined character positions. This format remains prevalent in mainframe systems, financial data feeds, and government datasets.
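Before involving pandas, the idea can be seen with plain string slicing: every field is just a fixed slice of each record (the boundaries here match the example below):

```python
# Each record is a fixed-length string; fields are fixed slices of it
record = "John      Doe       35Marketing"

first_name = record[0:10].strip()   # characters 0-9
last_name = record[10:20].strip()   # characters 10-19
age = int(record[20:22])            # characters 20-21
department = record[22:].strip()    # the rest of the line

print(first_name, last_name, age, department)  # John Doe 35 Marketing
```

read_fwf() automates exactly this slicing, plus type inference, across every row.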

import pandas as pd
from io import StringIO

# Example fixed-width data
data = """
John      Doe       35Marketing 
Jane      Smith     28Engineering
Michael   Johnson   42Sales      
"""

# Basic read_fwf usage with column specifications
df = pd.read_fwf(StringIO(data), 
                 colspecs=[(0, 10), (10, 20), (20, 22), (22, 33)],
                 names=['FirstName', 'LastName', 'Age', 'Department'])

print(df)

Output:

  FirstName  LastName  Age    Department
0      John       Doe   35     Marketing
1      Jane     Smith   28   Engineering
2   Michael   Johnson   42         Sales

Defining Column Specifications

Using colspecs Parameter

The colspecs parameter accepts a list of tuples defining start and end positions for each column. Positions are zero-indexed and the end position is exclusive.

# Reading employee data with explicit column positions
employee_data = """
EMP001 Alice      Johnson    2015-03-15 085000
EMP002 Bob        Williams   2018-07-22 072000
EMP003 Catherine  Brown      2020-01-10 095000
"""

df = pd.read_fwf(
    StringIO(employee_data),
    colspecs=[
        (0, 6),      # Employee ID
        (7, 17),     # First Name
        (18, 28),    # Last Name
        (29, 39),    # Hire Date
        (40, 46)     # Salary
    ],
    names=['EmployeeID', 'FirstName', 'LastName', 'HireDate', 'Salary'],
    dtype={'EmployeeID': str, 'Salary': int}
)

print(df)
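Counting character positions by hand is error-prone. When a file carries a one-line header whose labels start exactly where their columns do, the tuples can be derived programmatically; colspecs_from_header below is a hypothetical helper, not part of pandas:

```python
import re

def colspecs_from_header(header_line):
    """Derive (start, end) tuples from the label positions in a header line.

    Assumes every column starts where its header label starts and that
    labels contain no spaces; the last column runs to the end of the line.
    """
    starts = [m.start() for m in re.finditer(r'\S+', header_line)]
    ends = starts[1:] + [None]  # None lets the last column extend to end-of-line
    return list(zip(starts, ends))

header = "EMPID  FIRSTNAME  LASTNAME"
print(colspecs_from_header(header))  # [(0, 7), (7, 18), (18, None)]
```

pandas accepts None as a start or end position in colspecs, meaning the start or end of the line.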

Using widths Parameter

For simpler cases where you know column widths, use the widths parameter instead of calculating positions.

# Product inventory with known column widths
inventory = """
PROD001 Laptop        001250 0025
PROD002 Mouse         000035 0150
PROD003 Keyboard      000089 0075
"""

df = pd.read_fwf(
    StringIO(inventory),
    widths=[8, 14, 7, 4],  # each width includes the column's trailing padding
    names=['ProductID', 'ProductName', 'Price', 'Quantity']
)

print(df)
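widths is pure shorthand: pandas converts the list into cumulative (start, end) pairs internally, which is easy to reproduce yourself when you need to mix both styles (the width list here is illustrative):

```python
from itertools import accumulate

# A width list is equivalent to cumulative (start, end) pairs
widths = [5, 10, 8]
edges = [0] + list(accumulate(widths))
colspecs = list(zip(edges[:-1], edges[1:]))
print(colspecs)  # [(0, 5), (5, 15), (15, 23)]
```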

Automatic Width Detection

When colspecs is not specified (the default is colspecs='infer'), pandas infers column boundaries from the whitespace pattern of an initial sample of rows. This works well for consistently aligned files but can fail with irregular spacing.

# Well-formatted data for automatic detection
formatted_data = """
Name        Age  City         Salary
John        35   New York     85000
Jane        28   Boston       72000
Michael     42   Chicago      95000
"""

df = pd.read_fwf(StringIO(formatted_data))
print(df)

Handling Data Types and Conversions

Fixed-width files often require explicit type conversions: zero-padded numbers, compact date stamps such as 20230115, and flag codes are easy for the default type inference to misread (for example, treating 20230115 as the integer twenty million rather than a date).

from datetime import datetime

# Financial transaction data
transactions = """
20230115 ACC001 0001250.50 D
20230116 ACC002 0000875.25 C
20230117 ACC001 0000500.00 D
"""

df = pd.read_fwf(
    StringIO(transactions),
    colspecs=[(0, 8), (9, 15), (16, 26), (27, 28)],
    names=['Date', 'Account', 'Amount', 'Type'],
    converters={
        'Date': lambda x: pd.to_datetime(x, format='%Y%m%d'),
        'Amount': lambda x: float(x)
    }
)

print(df.dtypes)
print(df)

Managing Missing Values and Padding

Fixed-width formats often represent missing values as spaces or specific placeholder characters.

# Data with missing values (spaces and NA markers)
customer_data = """
CUST001 John      Doe       john@email.com    555-1234    
CUST002 Jane      Smith                       555-5678    
CUST003 Bob       Johnson   bob@email.com                 
CUST004 Alice               alice@email.com   555-9012    
"""

df = pd.read_fwf(
    StringIO(customer_data),
    colspecs=[(0, 7), (8, 18), (18, 28), (28, 46), (46, 58)],
    names=['CustomerID', 'FirstName', 'LastName', 'Email', 'Phone'],
    na_values=['', 'NA', 'NULL'],
    keep_default_na=True
)

# read_fwf already strips the space padding around each field;
# stripping again is a harmless safeguard for mixed whitespace
string_cols = df.select_dtypes(include=['object']).columns
df[string_cols] = df[string_cols].apply(lambda x: x.str.strip())

print(df)
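Once padded blanks have become NaN, downstream handling is ordinary pandas; for example, flagging records with incomplete contact details (the two sample rows and field layout match the customer data above):

```python
import pandas as pd
from io import StringIO

customer_data = """CUST001 John      Doe       john@email.com    555-1234
CUST003 Bob       Johnson   bob@email.com"""

df = pd.read_fwf(
    StringIO(customer_data),
    colspecs=[(0, 7), (8, 18), (18, 28), (28, 46), (46, 58)],
    names=['CustomerID', 'FirstName', 'LastName', 'Email', 'Phone']
)

# All-blank (or truncated) fields come back as NaN; flag complete rows
df['Complete'] = df[['Email', 'Phone']].notna().all(axis=1)
print(df[['CustomerID', 'Complete']])
```

Lines shorter than the last colspec are handled gracefully: the missing slice simply becomes NaN, as with CUST003's phone here.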

Reading from Files with Headers and Footers

Real-world fixed-width files often contain header and footer rows that need to be skipped.

# File with header and footer information
report_data = """
DAILY SALES REPORT - 2023-01-15
STORE    PRODUCT      QUANTITY  REVENUE
================================================
STORE001 Product A    00125     012500.00
STORE001 Product B    00089     008900.00
STORE002 Product A    00156     015600.00
================================================
TOTAL RECORDS: 3
"""

df = pd.read_fwf(
    StringIO(report_data),
    colspecs=[(0, 8), (9, 21), (22, 31), (32, 42)],
    names=['Store', 'Product', 'Quantity', 'Revenue'],
    skiprows=4,   # Skip the leading blank line, title, column header, and separator
    skipfooter=2  # Skip the trailing separator and TOTAL line
)

print(df)

Processing Large Fixed-Width Files

For large files, use chunking to process data in manageable pieces.

# Processing large file in chunks
def process_large_fwf(filename, chunksize=10000):
    chunks = []
    
    for chunk in pd.read_fwf(
        filename,
        colspecs=[(0, 10), (10, 20), (20, 30)],
        names=['ID', 'Name', 'Value'],
        chunksize=chunksize
    ):
        # Process each chunk
        chunk['Value'] = pd.to_numeric(chunk['Value'], errors='coerce')
        chunk = chunk[chunk['Value'] > 0]  # Filter
        chunks.append(chunk)
    
    return pd.concat(chunks, ignore_index=True)

# Example with StringIO simulating a file
large_data = "\n".join([f"ID{i:06d}  Name{i:04d}  {i*100:08d}" for i in range(100)])

# Process in chunks of 25 rows
result = process_large_fwf(StringIO(large_data), chunksize=25)
print(f"Processed {len(result)} rows")
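Going the other way, pandas has no to_fwf writer, so producing fixed-width output means padding each column yourself. A minimal sketch using Python format specs (the field widths here are arbitrary):

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['A1', 'B2'],
    'Name': ['Alice', 'Bob'],
    'Value': [1250, 35],
})

# Left-align text fields and right-align numbers into fixed-width slots
lines = [
    f"{row.ID:<6}{row.Name:<10}{row.Value:>8d}"
    for row in df.itertuples(index=False)
]
print("\n".join(lines))
```

Every emitted line has the same length (6 + 10 + 8 = 24 characters), which is exactly the invariant a fixed-width consumer expects.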

Handling Complex Nested Structures

Some fixed-width files contain hierarchical data with different record types identified by specific codes.

# Multi-record type file
hierarchical_data = """
H 20230115 BATCH001
D ITEM001  0125.50 05
D ITEM002  0089.75 10
T 00002    0215.25
H 20230116 BATCH002
D ITEM003  0156.00 03
T 00001    0156.00
"""

# Separate by record type
header_spec = [(0, 1), (2, 10), (11, 19)]
detail_spec = [(0, 1), (2, 9), (11, 18), (19, 21)]
trailer_spec = [(0, 1), (2, 7), (11, 18)]

lines = hierarchical_data.strip().split('\n')
headers, details, trailers = [], [], []

for line in lines:
    if line.startswith('H'):
        headers.append(line)
    elif line.startswith('D'):
        details.append(line)
    elif line.startswith('T'):
        trailers.append(line)

df_details = pd.read_fwf(
    StringIO('\n'.join(details)),
    colspecs=detail_spec,
    names=['RecType', 'ItemID', 'Price', 'Qty']
)

print(df_details)
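Trailer records carry control totals that let you validate the parse after the fact. A self-contained sketch checking the declared record counts, reusing the trailer layout above:

```python
import pandas as pd
from io import StringIO

# Trailer lines as separated by the record-type loop above
trailers = ["T 00002    0215.25", "T 00001    0156.00"]

df_trailers = pd.read_fwf(
    StringIO('\n'.join(trailers)),
    colspecs=[(0, 1), (2, 7), (11, 18)],
    names=['RecType', 'Count', 'Total']
)

# The batches declare 2 + 1 = 3 detail records in total
assert df_trailers['Count'].sum() == 3
print(df_trailers)
```

Comparing these declared counts against len(df_details) is a cheap integrity check before loading the data anywhere.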

Performance Optimization Tips

read_fwf() always runs on a Python-based parser (there is no C engine for fixed-width files), so parameter choices matter more than with read_csv(): explicit colspecs skip width inference, explicit dtypes avoid type inference, and na_filter=False skips missing-value detection when the file is known to be complete.

# Optimized reading with explicit dtypes and options
df = pd.read_fwf(
    'large_file.txt',
    colspecs=[(0, 10), (10, 20), (20, 30), (30, 40)],
    names=['ID', 'Name', 'Value', 'Date'],
    dtype={'ID': 'category', 'Name': str, 'Value': 'float32'},
    parse_dates=['Date'],
    na_filter=False  # Only safe when the file truly has no missing values
)

The read_fwf() function provides robust handling of fixed-width formats with flexible column specification methods. Understanding the structure of your data and choosing appropriate parameters ensures efficient parsing while maintaining data integrity. Always validate output with sample data before processing production files.
