Pandas: Reading and Writing Files


Key Insights

  • Pandas supports dozens of file formats out of the box, but choosing the right one for your use case can dramatically impact both performance and memory usage—Parquet files are often 5-10x smaller and several times faster to read than equivalent CSVs.
  • The read_csv() function has over 50 parameters; mastering just a handful (dtype, usecols, chunksize, na_values) will solve 90% of your real-world data loading problems.
  • Always specify data types explicitly when reading files—letting Pandas infer types wastes memory and can introduce subtle bugs: an integer column that contains missing values is silently read as float64.

Introduction to Pandas I/O

Every data project starts and ends with file operations. You pull data from CSVs, databases, or APIs, transform it, then export results for downstream consumers. Pandas makes this deceptively simple—pd.read_csv() just works. But “just works” isn’t good enough when you’re processing gigabytes of data or building production pipelines.

Pandas supports an impressive array of file formats: CSV, Excel, JSON, Parquet, Feather, HDF5, SQL databases, and more. Each has trade-offs in terms of human readability, compression, schema preservation, and read/write speed. Understanding these trade-offs lets you make informed decisions rather than defaulting to CSV for everything.

This article covers the file formats you’ll encounter most frequently, with practical code examples and opinionated recommendations for when to use each.

Reading CSV Files

CSV remains the lingua franca of data exchange. It’s human-readable, universally supported, and simple. It’s also inefficient, schema-less, and prone to encoding issues. Despite its flaws, you’ll work with CSVs constantly.

The read_csv() function handles most scenarios with sensible defaults:

import pandas as pd

# Basic read - Pandas infers everything
df = pd.read_csv('sales_data.csv')

# Production-grade read - explicit control
df = pd.read_csv(
    'sales_data.csv',
    sep=',',                          # Delimiter (use '\t' for TSV)
    header=0,                         # Row number for column names
    usecols=['date', 'product', 'revenue'],  # Only load needed columns
    dtype={
        'product': 'category',        # Categorical for low-cardinality strings
        'revenue': 'float32'          # Smaller float type saves memory
    },
    na_values=['', 'NULL', 'N/A'],    # Treat these as missing
    parse_dates=['date']              # Parse date column
)

The usecols parameter is criminally underused. If your CSV has 50 columns and you need 5, loading all 50 wastes memory and slows everything down. Always specify what you need.
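
To make the effect concrete, here is a minimal, self-contained sketch using an in-memory CSV (the column names are illustrative):

```python
import io

import pandas as pd

# Hypothetical 5-column CSV; we only need two of them
csv_data = io.StringIO(
    "date,product,revenue,region,notes\n"
    "2024-01-01,widget,9.99,EU,ok\n"
    "2024-01-02,gadget,4.50,US,late\n"
)

# Columns outside usecols are never parsed or stored
df = pd.read_csv(csv_data, usecols=["date", "revenue"])
print(list(df.columns))  # ['date', 'revenue']
```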

For large files that don’t fit in memory, use chunked reading:

# Process a 10GB file in manageable chunks
chunks = []
for chunk in pd.read_csv('massive_file.csv', chunksize=100_000):
    # Filter or aggregate each chunk
    filtered = chunk[chunk['status'] == 'active']
    chunks.append(filtered)

df = pd.concat(chunks, ignore_index=True)

This pattern processes data incrementally without loading everything at once. You can also aggregate on-the-fly to keep memory usage constant regardless of file size.
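
Here is a sketch of that on-the-fly aggregation, using an inline CSV as a stand-in for a large file on disk (the status/revenue columns are hypothetical):

```python
import io

import pandas as pd

csv_data = io.StringIO(
    "status,revenue\n"
    "active,10.0\n"
    "inactive,5.0\n"
    "active,7.5\n"
    "active,2.5\n"
)

total = 0.0
for chunk in pd.read_csv(csv_data, chunksize=2):
    # Aggregate each chunk and discard it - memory use stays flat
    total += chunk.loc[chunk["status"] == "active", "revenue"].sum()

print(total)  # 20.0
```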

Writing CSV Files

Exporting DataFrames to CSV is straightforward, but a few options matter:

# Basic export - includes index by default (usually unwanted)
df.to_csv('output.csv')

# Clean export - no index, specific columns
df.to_csv(
    'output.csv',
    index=False,                      # Don't write row numbers
    columns=['date', 'product', 'revenue'],  # Column order
    encoding='utf-8'                  # Explicit encoding
)

# Compressed output - smaller files, slower writes
df.to_csv('output.csv.gz', compression='gzip', index=False)
df.to_csv('output.csv.bz2', compression='bz2', index=False)

The index=False parameter is almost always what you want. By default, Pandas writes the DataFrame index as the first column, leaving downstream consumers with an unnamed column to deal with.

Compression is worth considering for archival or transfer. Gzip typically achieves 5-10x compression on text data; writes are slower, but the decompression cost on read is small.
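
A quick way to gauge the payoff is to compress an in-memory CSV with the standard library's gzip module (the repetitive sample data is illustrative; real-world ratios depend on content):

```python
import gzip

import pandas as pd

# Highly repetitive text compresses extremely well
df = pd.DataFrame({"note": ["payment received on time"] * 10_000})

plain = df.to_csv(index=False).encode("utf-8")
compressed = gzip.compress(plain)

print(f"{len(plain)} -> {len(compressed)} bytes")
```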

Working with Excel Files

Excel files are unavoidable in business contexts. Pandas handles them well, though you’ll need additional libraries installed (openpyxl for .xlsx, xlrd for legacy .xls).

# Read first sheet (default behavior)
df = pd.read_excel('report.xlsx')

# Read specific sheet by name or index
df = pd.read_excel('report.xlsx', sheet_name='Q4 Summary')
df = pd.read_excel('report.xlsx', sheet_name=2)  # Third sheet (0-indexed)

# Read all sheets into a dictionary
all_sheets = pd.read_excel('report.xlsx', sheet_name=None)
for sheet_name, sheet_df in all_sheets.items():
    print(f"{sheet_name}: {len(sheet_df)} rows")

Writing multiple DataFrames to different sheets requires ExcelWriter:

# Write multiple DataFrames to one Excel file
with pd.ExcelWriter('quarterly_report.xlsx', engine='openpyxl') as writer:
    sales_df.to_excel(writer, sheet_name='Sales', index=False)
    inventory_df.to_excel(writer, sheet_name='Inventory', index=False)
    summary_df.to_excel(writer, sheet_name='Summary', index=False)

Excel files preserve some type information and support multiple sheets, but they’re slow to read/write and cap out at 1,048,576 rows per sheet. Use them for human consumption, not data pipelines.

JSON File Operations

JSON is essential for web APIs and configuration data. Pandas supports various JSON structures through the orient parameter:

# Read standard JSON array of objects
df = pd.read_json('data.json')

# Read JSON with specific orientation
df = pd.read_json('data.json', orient='records')  # [{col: val}, ...]
df = pd.read_json('data.json', orient='columns')  # {col: {idx: val}}
df = pd.read_json('data.json', orient='index')    # {idx: {col: val}}

# Read JSON lines format (one JSON object per line)
df = pd.read_json('data.jsonl', lines=True)

For nested JSON from APIs, you’ll often need json_normalize:

import json

# Nested JSON structure
with open('api_response.json') as f:
    data = json.load(f)

# Flatten nested structure
df = pd.json_normalize(
    data['results'],
    record_path=['items'],           # Path to nested records
    meta=['order_id', 'customer'],   # Include parent fields
    sep='_'                          # Separator for nested column names
)
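
Here is a self-contained sketch of the same json_normalize call on a hypothetical inline payload, so the flattening behavior is easy to see:

```python
import pandas as pd

# Hypothetical API payload: one nested item list per order
payload = [
    {"order_id": 1, "customer": "Ada",
     "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]},
    {"order_id": 2, "customer": "Bo",
     "items": [{"sku": "A", "qty": 5}]},
]

# One row per nested item, with the parent fields repeated alongside
flat = pd.json_normalize(payload, record_path=["items"],
                         meta=["order_id", "customer"], sep="_")
print(len(flat))  # 3
```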

Writing JSON offers the same orientation options:

# Export as array of records (most common for APIs)
df.to_json('output.json', orient='records', indent=2)

# Export as JSON lines (better for streaming/large files)
df.to_json('output.jsonl', orient='records', lines=True)

JSON lines format (lines=True) is superior for large datasets—it’s streamable, appendable, and doesn’t require loading the entire file to parse.
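
Because each line is an independent record, JSON lines also pairs with chunksize for streaming reads. A minimal sketch with an inline buffer standing in for a large .jsonl file:

```python
import io

import pandas as pd

jsonl = io.StringIO('{"a": 1}\n{"a": 2}\n{"a": 3}\n')

# chunksize (valid only with lines=True) yields DataFrames lazily,
# so the whole file never has to fit in memory at once
total = 0
for chunk in pd.read_json(jsonl, lines=True, chunksize=2):
    total += int(chunk["a"].sum())

print(total)  # 6
```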

Other File Formats

Parquet

Parquet is a columnar format designed for analytics. It’s compressed, preserves types, and reads dramatically faster than CSV for large datasets:

# Write to Parquet (requires pyarrow or fastparquet)
df.to_parquet('data.parquet', engine='pyarrow', compression='snappy')

# Read from Parquet
df = pd.read_parquet('data.parquet')

# Read only specific columns (columnar format advantage)
df = pd.read_parquet('data.parquet', columns=['date', 'revenue'])

Parquet should be your default for any data pipeline. It’s 5-10x smaller than CSV and reads 3-5x faster. The only downside is it’s not human-readable.

SQL Databases

Pandas integrates directly with SQL databases via SQLAlchemy:

from sqlalchemy import create_engine

# Create database connection
engine = create_engine('postgresql://user:pass@localhost/mydb')

# Read from SQL query
df = pd.read_sql('SELECT * FROM sales WHERE year = 2024', engine)

# Read entire table
df = pd.read_sql_table('customers', engine)

# Write DataFrame to table
df.to_sql('sales_backup', engine, if_exists='replace', index=False)
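
The connection string above is a placeholder; for a runnable sketch, the standard library's in-memory SQLite works with the same read_sql/to_sql calls (the table and values here are made up):

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for a real database server
conn = sqlite3.connect(":memory:")

df = pd.DataFrame({"year": [2023, 2024, 2024],
                   "amount": [10, 20, 30]})
df.to_sql("sales", conn, index=False)

# Push filtering and aggregation into the database
result = pd.read_sql(
    "SELECT SUM(amount) AS total FROM sales WHERE year = 2024", conn
)
print(result["total"].iloc[0])  # 50
conn.close()
```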

Pickle

Pickle serializes Python objects directly. It’s fast and preserves everything, but it’s Python-specific and has security concerns with untrusted data:

# Save DataFrame state exactly
df.to_pickle('checkpoint.pkl')

# Restore later
df = pd.read_pickle('checkpoint.pkl')

Use pickle for checkpointing during development, not for data exchange.

Best Practices and Performance Tips

Always specify dtypes. Pandas’ type inference is convenient but wasteful. A column of integers 0-100 doesn’t need int64 (8 bytes per value) when int8 (1 byte) suffices:

# Memory comparison
import numpy as np

# Create test data
n_rows = 1_000_000
data = {
    'id': np.random.randint(0, 1000000, n_rows),
    'category': np.random.choice(['A', 'B', 'C', 'D'], n_rows),
    'value': np.random.random(n_rows)
}

# Default types - wasteful
df_default = pd.DataFrame(data)
print(f"Default: {df_default.memory_usage(deep=True).sum() / 1e6:.1f} MB")

# Optimized types - efficient
df_optimized = pd.DataFrame(data).astype({
    'id': 'int32',
    'category': 'category',
    'value': 'float32'
})
print(f"Optimized: {df_optimized.memory_usage(deep=True).sum() / 1e6:.1f} MB")

Choose formats based on use case:

  • Human consumption → CSV or Excel
  • Data pipelines → Parquet
  • Web APIs → JSON
  • Python-only checkpoints → Pickle
  • Database integration → SQL

Use chunked reading for memory constraints. If a file is too large, process it in pieces rather than upgrading your hardware.

The right file format and loading strategy can make the difference between a script that crashes and one that processes data efficiently. Invest time in understanding these options—it pays dividends on every project.
