Pandas - Read Parquet File (read_parquet)

Parquet is a columnar storage format designed for analytical workloads. Unlike row-based formats like CSV, Parquet stores data by column, enabling efficient compression and selective column reading.

Key Insights

  • Parquet files typically read 10-50x faster and compress to 75-90% smaller files than CSV, making them ideal for production data pipelines and analytics workflows
  • pd.read_parquet() supports column pruning, row filtering, and partitioned datasets, enabling efficient processing of multi-gigabyte datasets without loading everything into memory
  • PyArrow and fastparquet engines provide different trade-offs: PyArrow offers better performance and feature completeness, while fastparquet has fewer dependencies

Why Parquet Over CSV

The difference is easy to demonstrate: write the same million-row DataFrame to both formats and compare the resulting file sizes.

import pandas as pd
import numpy as np

# Create sample dataset
df = pd.DataFrame({
    'user_id': np.arange(1000000),
    'timestamp': pd.date_range('2024-01-01', periods=1000000, freq='1min'),
    'value': np.random.randn(1000000),
    'category': np.random.choice(['A', 'B', 'C'], 1000000)
})

# Compare file sizes
df.to_csv('data.csv', index=False)
df.to_parquet('data.parquet', index=False)

# CSV: ~45MB, Parquet: ~8MB (82% reduction)

The columnar format also preserves data types, eliminating the need for type inference during reads.

Basic Read Operations

The simplest use case reads an entire Parquet file into a DataFrame:

import pandas as pd

# Basic read
df = pd.read_parquet('data.parquet')

# Specify engine explicitly
df = pd.read_parquet('data.parquet', engine='pyarrow')
df = pd.read_parquet('data.parquet', engine='fastparquet')

Install the required engine if not present:

pip install pyarrow  # Recommended
# or
pip install fastparquet

With the default engine='auto', pandas uses PyArrow if it is installed and falls back to fastparquet; PyArrow handles most production scenarios better.

Column Selection (Column Pruning)

Reading only required columns dramatically reduces memory usage and I/O:

# Read specific columns only
df_subset = pd.read_parquet('data.parquet', columns=['user_id', 'value'])

# Memory usage comparison
df_all = pd.read_parquet('data.parquet')
df_subset = pd.read_parquet('data.parquet', columns=['user_id'])

print(f"Full dataset: {df_all.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"Single column: {df_subset.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
# Output: Full dataset: 38.15 MB, Single column: 7.63 MB

This is particularly powerful when working with wide tables containing hundreds of columns.

Row Filtering with PyArrow

PyArrow supports predicate pushdown, filtering data before loading it into memory:

import pandas as pd

# Filter rows during read (PyArrow only)
df_filtered = pd.read_parquet(
    'data.parquet',
    filters=[('category', '=', 'A')]
)

# Multiple conditions with AND logic
df_filtered = pd.read_parquet(
    'data.parquet',
    filters=[
        ('category', 'in', ['A', 'B']),
        ('value', '>', 0)
    ]
)

# OR logic requires multiple filter groups
df_filtered = pd.read_parquet(
    'data.parquet',
    filters=[
        [('category', '=', 'A')],
        [('category', '=', 'B')]
    ]
)

Supported operators: =, !=, <, <=, >, >=, in, not in.

Working with Partitioned Datasets

Partitioned Parquet datasets split data across multiple files based on column values:

import pandas as pd
import numpy as np

# Create partitioned dataset
df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=1000, freq='h'),
    'region': np.random.choice(['US', 'EU', 'ASIA'], 1000),
    'sales': np.random.randint(100, 1000, 1000)
})

# Write partitioned by region and year
df['year'] = df['date'].dt.year
df.to_parquet('sales_data/', partition_cols=['region', 'year'], index=False)

# Directory structure:
# sales_data/
#   region=US/year=2024/part-0.parquet
#   region=EU/year=2024/part-0.parquet
#   region=ASIA/year=2024/part-0.parquet

Reading partitioned data:

# Read entire partitioned dataset
df = pd.read_parquet('sales_data/')

# Filter by partition columns (very efficient)
df_us = pd.read_parquet('sales_data/', filters=[('region', '=', 'US')])

# Combine partition and column filtering
df_subset = pd.read_parquet(
    'sales_data/',
    filters=[('region', '=', 'US'), ('year', '=', 2024)],
    columns=['date', 'sales']
)

Partition filtering skips entire files, making it extremely fast for large datasets.

Handling Large Files with Chunking

For datasets that don’t fit in memory, use PyArrow’s dataset API:

import pyarrow.parquet as pq
import pandas as pd

# Read in batches
parquet_file = pq.ParquetFile('large_data.parquet')

for batch in parquet_file.iter_batches(batch_size=100000):
    df_chunk = batch.to_pandas()
    # Process chunk
    process_data(df_chunk)

# Or use dataset API for more control
import pyarrow.dataset as ds

dataset = ds.dataset('sales_data/', format='parquet')

for batch in dataset.to_batches(batch_size=50000, columns=['sales', 'region']):
    df_chunk = batch.to_pandas()
    # Process chunk

Reading from Cloud Storage

Both engines support reading directly from S3, GCS, and Azure:

# S3 with PyArrow
df = pd.read_parquet('s3://bucket-name/data.parquet')

# Configure filesystem options
df = pd.read_parquet(
    's3://bucket-name/data.parquet',
    storage_options={
        'key': 'AWS_ACCESS_KEY',
        'secret': 'AWS_SECRET_KEY',
        'client_kwargs': {'region_name': 'us-east-1'}
    }
)

# Google Cloud Storage
df = pd.read_parquet('gs://bucket-name/data.parquet')

# Azure Blob Storage
df = pd.read_parquet('abfs://container/data.parquet')

Reading from cloud URLs requires fsspec plus the filesystem package for your provider:

pip install fsspec s3fs gcsfs adlfs

Metadata and Schema Inspection

Inspect Parquet files without reading data:

import pyarrow.parquet as pq

# Read metadata
parquet_file = pq.ParquetFile('data.parquet')

# Schema information
print(parquet_file.schema)
print(f"Number of rows: {parquet_file.metadata.num_rows}")
print(f"Number of columns: {parquet_file.metadata.num_columns}")

# Column statistics (min, max, null count)
for i in range(parquet_file.metadata.num_row_groups):
    row_group = parquet_file.metadata.row_group(i)
    for j in range(row_group.num_columns):
        col = row_group.column(j)
        print(f"Column {col.path_in_schema}: {col.statistics}")

Performance Optimization

# Use PyArrow for best performance
df = pd.read_parquet('data.parquet', engine='pyarrow')

# Multi-threaded reads (PyArrow; enabled by default)
df = pd.read_parquet('data.parquet', use_threads=True)

# Read with compression awareness
df = pd.read_parquet('data.parquet')  # Auto-detects compression

# For repeated reads, use PyArrow Table
import pyarrow.parquet as pq
table = pq.read_table('data.parquet', columns=['col1', 'col2'])
df = table.to_pandas()  # Convert only when needed

Common Pitfalls

# Avoid: Reading entire file then filtering
df = pd.read_parquet('data.parquet')
df_filtered = df[df['category'] == 'A']  # Inefficient

# Better: Filter during read
df_filtered = pd.read_parquet('data.parquet', filters=[('category', '=', 'A')])

# Avoid: Reading all columns unnecessarily
df = pd.read_parquet('data.parquet')
result = df['user_id'].unique()  # Read everything

# Better: Select columns upfront
df = pd.read_parquet('data.parquet', columns=['user_id'])
result = df['user_id'].unique()

Parquet’s efficiency comes from leveraging its columnar format. Always specify columns and filters when possible to minimize I/O and memory usage. For production pipelines processing terabytes of data, these optimizations translate to significant cost savings and faster execution times.
