How to Read Parquet Files in Pandas
Parquet is a columnar storage format that has become the de facto standard for analytical workloads. Unlike row-based formats like CSV where data is stored record by record, Parquet stores data...
Key Insights
- Parquet’s columnar format combined with Pandas’
columnsparameter lets you read only the data you need, reducing memory usage by 80-90% on wide datasets - Choose
pyarrowas your default engine for better performance and broader compatibility; only switch tofastparquetwhen you need specific features like append operations - Partitioned Parquet datasets with predicate pushdown filters can turn multi-minute queries into sub-second operations by skipping irrelevant data files entirely
Introduction to Parquet Format
Parquet is a columnar storage format that has become the de facto standard for analytical workloads. Unlike row-based formats like CSV where data is stored record by record, Parquet stores data column by column. This design choice has profound implications for performance.
When you query specific columns from a CSV file, the reader must scan through every row to extract the values you need. With Parquet, the reader jumps directly to the column blocks it needs and ignores everything else. Add built-in compression that exploits column-level data patterns, and you get files that are both smaller and faster to read.
The format originated from the Hadoop ecosystem but has since spread everywhere—data lakes, analytics pipelines, machine learning workflows, and anywhere large datasets live. If you’re working with data larger than a few hundred megabytes, Parquet should be your default choice over CSV.
Prerequisites and Setup
Pandas doesn’t include Parquet support out of the box. You need a backend engine to handle the actual file parsing. You have two options: pyarrow (backed by Apache Arrow) or fastparquet (a pure Python implementation).
Install the recommended setup:
pip install pandas pyarrow
Or if you prefer fastparquet:
pip install pandas fastparquet
For most users, I recommend pyarrow. It’s faster, more actively maintained, and handles edge cases better. The Arrow project has significant backing from the data engineering community, and its memory format integrates seamlessly with other tools in the ecosystem.
Check your installation:
import pandas as pd
import pyarrow
print(f"Pandas version: {pd.__version__}")
print(f"PyArrow version: {pyarrow.__version__}")
You’ll want Pandas 1.0+ and PyArrow 8.0+ for the best experience. Older versions work but lack performance optimizations and bug fixes that matter for production workloads.
Basic Parquet Reading with pd.read_parquet()
Reading a Parquet file is straightforward:
import pandas as pd
df = pd.read_parquet('sales_data.parquet')
print(df.head())
print(f"Shape: {df.shape}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1e6:.2f} MB")
The function accepts local file paths, S3 URLs, GCS paths, and any file-like object. For cloud storage, you’ll need additional dependencies:
# Reading from S3 (requires s3fs)
df = pd.read_parquet('s3://my-bucket/data/sales.parquet')
# Reading from GCS (requires gcsfs)
df = pd.read_parquet('gs://my-bucket/data/sales.parquet')
# Reading from Azure (requires adlfs)
df = pd.read_parquet('abfs://container@account.dfs.core.windows.net/sales.parquet')
Install the appropriate filesystem library for your cloud provider:
pip install s3fs # For AWS S3
pip install gcsfs # For Google Cloud Storage
pip install adlfs # For Azure Data Lake Storage
Reading Specific Columns
Wide datasets with dozens or hundreds of columns are common in analytics. Loading all columns when you only need three wastes memory and time. The columns parameter solves this:
# Only load the columns you actually need
df = pd.read_parquet(
'customer_data.parquet',
columns=['customer_id', 'email', 'signup_date']
)
This isn’t just a convenience feature—it’s a fundamental performance optimization. Parquet’s columnar structure means unselected columns are never read from disk. On a 50-column dataset where you need 5 columns, you’ll see roughly 90% reduction in I/O and memory usage.
Compare the difference:
import time
# Load everything
start = time.time()
df_full = pd.read_parquet('wide_dataset.parquet')
full_time = time.time() - start
full_memory = df_full.memory_usage(deep=True).sum() / 1e6
# Load selectively
start = time.time()
df_subset = pd.read_parquet(
'wide_dataset.parquet',
columns=['id', 'timestamp', 'value']
)
subset_time = time.time() - start
subset_memory = df_subset.memory_usage(deep=True).sum() / 1e6
print(f"Full load: {full_time:.2f}s, {full_memory:.2f} MB")
print(f"Selective load: {subset_time:.2f}s, {subset_memory:.2f} MB")
Always specify columns explicitly in production code. It documents your data dependencies and prevents accidental memory bloat when someone adds new columns to the source data.
Working with Partitioned Parquet Datasets
Real-world Parquet data often lives in partitioned directory structures. Partitioning splits data into subdirectories based on column values, typically by date or category:
sales_data/
├── year=2023/
│ ├── month=01/
│ │ └── data.parquet
│ ├── month=02/
│ │ └── data.parquet
│ └── ...
└── year=2024/
├── month=01/
│ └── data.parquet
└── ...
Pandas reads partitioned datasets by pointing to the root directory:
# Read entire partitioned dataset
df = pd.read_parquet('sales_data/')
The partition columns (year, month) are automatically added to the DataFrame. But the real power comes from filtering at read time:
# Only read 2024 data - skips all 2023 files entirely
df = pd.read_parquet(
'sales_data/',
filters=[('year', '=', 2024)]
)
# Multiple filter conditions
df = pd.read_parquet(
'sales_data/',
filters=[
('year', '=', 2024),
('month', 'in', [1, 2, 3])
]
)
This is predicate pushdown in action. The engine evaluates filters against partition values before reading any data. If your dataset has years of historical data but you only need last month, filtering at read time is orders of magnitude faster than loading everything and filtering in Pandas.
Supported filter operators include =, !=, <, <=, >, >=, in, and not in:
# Complex filtering example
df = pd.read_parquet(
'events/',
columns=['event_id', 'user_id', 'event_type', 'timestamp'],
filters=[
('date', '>=', '2024-01-01'),
('date', '<', '2024-02-01'),
('region', 'in', ['US', 'EU'])
]
)
Engine Selection and Performance
Pandas supports two Parquet engines. Here’s when to use each:
# Explicitly select pyarrow (recommended default)
df = pd.read_parquet('data.parquet', engine='pyarrow')
# Use fastparquet when needed
df = pd.read_parquet('data.parquet', engine='fastparquet')
PyArrow is the better choice for most scenarios. It’s faster on large files, handles complex nested schemas, and integrates with the broader Arrow ecosystem. Use it unless you have a specific reason not to.
Fastparquet has niche uses: it supports append operations to existing files, has lower memory overhead for some operations, and occasionally handles legacy Parquet files that PyArrow rejects.
Performance comparison on a 1GB file:
import time
engines = ['pyarrow', 'fastparquet']
for engine in engines:
start = time.time()
df = pd.read_parquet('large_file.parquet', engine=engine)
elapsed = time.time() - start
print(f"{engine}: {elapsed:.2f} seconds")
In my benchmarks, PyArrow consistently reads 20-40% faster than fastparquet on files over 100MB. The gap widens with compression—PyArrow’s optimized decompression routines handle snappy and zstd more efficiently.
For write operations, engine choice matters more. PyArrow produces smaller files with better compression ratios by default.
Common Issues and Troubleshooting
Parquet reading fails in predictable ways. Here’s how to handle the common cases:
Schema mismatches occur when files in a partitioned dataset have different column types:
def read_parquet_safe(path, **kwargs):
"""Read Parquet with schema mismatch handling."""
try:
return pd.read_parquet(path, **kwargs)
except Exception as e:
if "schema" in str(e).lower():
# Try reading with schema inference disabled
import pyarrow.parquet as pq
table = pq.read_table(path, **kwargs)
return table.to_pandas(safe=False)
raise
df = read_parquet_safe('inconsistent_data/')
Missing columns happen when you request columns that don’t exist:
def read_with_optional_columns(path, required_cols, optional_cols):
"""Read Parquet, gracefully handling missing optional columns."""
import pyarrow.parquet as pq
# Get actual schema
schema = pq.read_schema(path)
available = set(schema.names)
# Build column list
columns = list(required_cols)
for col in optional_cols:
if col in available:
columns.append(col)
df = pd.read_parquet(path, columns=columns)
# Add missing optional columns as NaN
for col in optional_cols:
if col not in df.columns:
df[col] = pd.NA
return df
df = read_with_optional_columns(
'data.parquet',
required_cols=['id', 'timestamp'],
optional_cols=['metadata', 'tags']
)
Memory errors on large files require chunked reading:
import pyarrow.parquet as pq
def read_parquet_chunked(path, chunk_size=100_000):
"""Read large Parquet files in chunks to manage memory."""
parquet_file = pq.ParquetFile(path)
for batch in parquet_file.iter_batches(batch_size=chunk_size):
chunk_df = batch.to_pandas()
yield chunk_df
# Process chunks without loading entire file
for chunk in read_parquet_chunked('huge_file.parquet'):
# Process each chunk
process(chunk)
For truly large datasets that don’t fit in memory, consider switching to Dask or Polars, which handle out-of-core processing natively. But for files up to a few gigabytes, these Pandas techniques will serve you well.