Pandas - Read Parquet File (read_parquet)
Parquet is a columnar storage format designed for analytical workloads. Unlike row-based formats like CSV, Parquet stores data by column, enabling efficient compression and selective column reading.
Key Insights
- Parquet files offer 10-50x faster read speeds and 75-90% smaller file sizes compared to CSV, making them ideal for production data pipelines and analytics workflows
- pd.read_parquet() supports column pruning, row filtering, and partitioned datasets, enabling efficient processing of multi-gigabyte datasets without loading everything into memory
- PyArrow and fastparquet engines provide different trade-offs: PyArrow offers better performance and feature completeness, while fastparquet has fewer dependencies
Why Parquet Over CSV
Because values of the same column are stored contiguously, similar data compresses far better than in row-oriented CSV, and readers can skip columns entirely. A quick size comparison:
import pandas as pd
import numpy as np
# Create sample dataset
df = pd.DataFrame({
'user_id': np.arange(1000000),
'timestamp': pd.date_range('2024-01-01', periods=1000000, freq='1min'),
'value': np.random.randn(1000000),
'category': np.random.choice(['A', 'B', 'C'], 1000000)
})
# Compare file sizes
df.to_csv('data.csv', index=False)
df.to_parquet('data.parquet', index=False)
# CSV: ~45MB, Parquet: ~8MB (82% reduction)
The columnar format also preserves data types, eliminating the need for type inference during reads.
Basic Read Operations
The simplest use case reads an entire Parquet file into a DataFrame:
import pandas as pd
# Basic read
df = pd.read_parquet('data.parquet')
# Specify engine explicitly
df = pd.read_parquet('data.parquet', engine='pyarrow')
df = pd.read_parquet('data.parquet', engine='fastparquet')
Install the required engine if not present:
pip install pyarrow # Recommended
# or
pip install fastparquet
With the default engine='auto', pandas uses PyArrow when it is installed and falls back to fastparquet. PyArrow handles most production scenarios better, so prefer it when you can.
Column Selection (Column Pruning)
Reading only required columns dramatically reduces memory usage and I/O:
# Read specific columns only
df_subset = pd.read_parquet('data.parquet', columns=['user_id', 'value'])
# Memory usage comparison
df_all = pd.read_parquet('data.parquet')
df_subset = pd.read_parquet('data.parquet', columns=['user_id'])
print(f"Full dataset: {df_all.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"Single column: {df_subset.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
# Example output (exact numbers vary by pandas version):
# Full dataset: 38.15 MB, Single column: 7.63 MB
This is particularly powerful when working with wide tables containing hundreds of columns.
Row Filtering with PyArrow
PyArrow supports predicate pushdown, filtering data before loading it into memory:
import pandas as pd
# Filter rows during read (PyArrow only)
df_filtered = pd.read_parquet(
'data.parquet',
filters=[('category', '=', 'A')]
)
# Multiple conditions with AND logic
df_filtered = pd.read_parquet(
'data.parquet',
filters=[
('category', 'in', ['A', 'B']),
('value', '>', 0)
]
)
# OR logic requires multiple filter groups
df_filtered = pd.read_parquet(
'data.parquet',
filters=[
[('category', '=', 'A')],
[('category', '=', 'B')]
]
)
Supported operators: =, !=, <, <=, >, >=, in, not in.
Working with Partitioned Datasets
Partitioned Parquet datasets split data across multiple files based on column values:
import numpy as np
import pandas as pd
# Create partitioned dataset
df = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=1000, freq='h'),
'region': np.random.choice(['US', 'EU', 'ASIA'], 1000),
'sales': np.random.randint(100, 1000, 1000)
})
# Write partitioned by region and year
df['year'] = df['date'].dt.year
df.to_parquet('sales_data/', partition_cols=['region', 'year'], index=False)
# Directory structure:
# sales_data/
# region=US/year=2024/part-0.parquet
# region=EU/year=2024/part-0.parquet
# region=ASIA/year=2024/part-0.parquet
Reading partitioned data:
# Read entire partitioned dataset
df = pd.read_parquet('sales_data/')
# Filter by partition columns (very efficient)
df_us = pd.read_parquet('sales_data/', filters=[('region', '=', 'US')])
# Combine partition and column filtering
df_subset = pd.read_parquet(
'sales_data/',
filters=[('region', '=', 'US'), ('year', '=', 2024)],
columns=['date', 'sales']
)
Partition filtering skips entire files, making it extremely fast for large datasets.
Handling Large Files with Chunking
For datasets that don’t fit in memory, use PyArrow’s dataset API:
import pyarrow.parquet as pq
import pandas as pd
# Read in batches
parquet_file = pq.ParquetFile('large_data.parquet')
for batch in parquet_file.iter_batches(batch_size=100000):
df_chunk = batch.to_pandas()
# Process chunk
process_data(df_chunk)
# Or use dataset API for more control
import pyarrow.dataset as ds
dataset = ds.dataset('sales_data/', format='parquet')
for batch in dataset.to_batches(batch_size=50000, columns=['sales', 'region']):
df_chunk = batch.to_pandas()
# Process chunk
Reading from Cloud Storage
Both engines support reading directly from S3, GCS, and Azure:
# S3 with PyArrow
df = pd.read_parquet('s3://bucket-name/data.parquet')
# Configure filesystem options
df = pd.read_parquet(
's3://bucket-name/data.parquet',
    storage_options={
        'key': 'AWS_ACCESS_KEY',
        'secret': 'AWS_SECRET_KEY',
        'client_kwargs': {'region_name': 'us-east-1'}
    }
)
# Google Cloud Storage
df = pd.read_parquet('gs://bucket-name/data.parquet')
# Azure Blob Storage
df = pd.read_parquet('abfs://container/data.parquet')
Reading from cloud URLs requires fsspec together with the protocol-specific filesystem package:
pip install fsspec s3fs gcsfs adlfs
Metadata and Schema Inspection
Inspect Parquet files without reading data:
import pyarrow.parquet as pq
# Read metadata
parquet_file = pq.ParquetFile('data.parquet')
# Schema information
print(parquet_file.schema)
print(f"Number of rows: {parquet_file.metadata.num_rows}")
print(f"Number of columns: {parquet_file.metadata.num_columns}")
# Column statistics (min, max, null count)
for i in range(parquet_file.metadata.num_row_groups):
row_group = parquet_file.metadata.row_group(i)
for j in range(row_group.num_columns):
col = row_group.column(j)
print(f"Column {col.path_in_schema}: {col.statistics}")
Performance Optimization
# Use PyArrow for best performance
df = pd.read_parquet('data.parquet', engine='pyarrow')
# Enable multi-threading (PyArrow)
df = pd.read_parquet('data.parquet', use_threads=True)
# Read with compression awareness
df = pd.read_parquet('data.parquet') # Auto-detects compression
# For repeated reads, use PyArrow Table
import pyarrow.parquet as pq
table = pq.read_table('data.parquet', columns=['col1', 'col2'])
df = table.to_pandas() # Convert only when needed
Common Pitfalls
# Avoid: Reading entire file then filtering
df = pd.read_parquet('data.parquet')
df_filtered = df[df['category'] == 'A'] # Inefficient
# Better: Filter during read
df_filtered = pd.read_parquet('data.parquet', filters=[('category', '=', 'A')])
# Avoid: Reading all columns unnecessarily
df = pd.read_parquet('data.parquet')
result = df['user_id'].unique() # Loaded every column just to use one
# Better: Select columns upfront
df = pd.read_parquet('data.parquet', columns=['user_id'])
result = df['user_id'].unique()
Parquet’s efficiency comes from leveraging its columnar format. Always specify columns and filters when possible to minimize I/O and memory usage. For production pipelines processing terabytes of data, these optimizations translate to significant cost savings and faster execution times.