Pandas - Read SQL Query into DataFrame (read_sql)

`pd.read_sql()` executes a SQL query and loads the result directly into a pandas DataFrame, accepting raw SQL strings, SQLAlchemy selectable objects, and any connection supported by SQLAlchemy or Python's DB-API.

Key Insights

  • pd.read_sql() provides a unified interface for executing SQL queries and loading results directly into DataFrames, supporting both SQLAlchemy engines and legacy database connections
  • Query parameterization through the params argument prevents SQL injection attacks while enabling dynamic query construction with proper type handling
  • Performance optimization requires understanding chunking with chunksize, index management with index_col, and data type specification through dtype to handle large result sets efficiently
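
The essentials above can be sketched end to end with the standard-library sqlite3 driver; the table and column names here are illustrative, and the in-memory database exists only for the demo:

```python
import sqlite3
import pandas as pd

# Build a small in-memory database (schema is illustrative)
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE orders (order_id INTEGER, amount REAL, order_date TEXT)')
conn.executemany(
    'INSERT INTO orders VALUES (?, ?, ?)',
    [(1, 250.0, '2024-01-15'), (2, 1200.0, '2024-02-03'), (3, 90.0, '2024-02-10')],
)

# One call combines parameterization, index_col, and parse_dates
df = pd.read_sql(
    'SELECT order_id, amount, order_date FROM orders WHERE amount > :min_amount',
    conn,
    params={'min_amount': 100},
    index_col='order_id',
    parse_dates=['order_date'],
)
conn.close()
print(df)
```

Each of these arguments is covered in detail in the sections below.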

Basic Query Execution

The read_sql() function executes SQL queries and returns results as a pandas DataFrame. It accepts both raw SQL strings and SQLAlchemy selectable objects, working with any database supported by SQLAlchemy or Python’s DB-API.

import pandas as pd
from sqlalchemy import create_engine

# Create database connection
engine = create_engine('postgresql://user:password@localhost:5432/mydb')

# Execute simple SELECT query
df = pd.read_sql('SELECT * FROM customers', engine)
print(df.head())

For SQLite databases, you can use the built-in sqlite3 module without SQLAlchemy:

import sqlite3
import pandas as pd

# Connect to SQLite database
conn = sqlite3.connect('sales.db')

# Read data into DataFrame
df = pd.read_sql('SELECT product_id, SUM(quantity) as total FROM orders GROUP BY product_id', conn)
conn.close()

Parameterized Queries

Never concatenate user input directly into SQL strings. Use parameterized queries to prevent SQL injection and handle type conversions properly:

# WRONG - vulnerable to SQL injection
user_id = "105 OR 1=1"
df = pd.read_sql(f"SELECT * FROM users WHERE id = {user_id}", engine)

# CORRECT - using parameterized query
user_id = 105
query = "SELECT * FROM users WHERE id = %(user_id)s"
df = pd.read_sql(query, engine, params={'user_id': user_id})

For multiple parameters and complex filtering:

query = """
    SELECT order_id, customer_name, total_amount, order_date
    FROM orders
    WHERE order_date BETWEEN %(start_date)s AND %(end_date)s
    AND total_amount > %(min_amount)s
    ORDER BY order_date DESC
"""

params = {
    'start_date': '2024-01-01',
    'end_date': '2024-12-31',
    'min_amount': 1000
}

df = pd.read_sql(query, engine, params=params)

Parameter syntax varies by database. PostgreSQL and MySQL use %(name)s, while SQLite uses ? or :name:

# SQLite named parameters
query = "SELECT * FROM products WHERE category = :cat AND price > :min_price"
df = pd.read_sql(query, sqlite_conn, params={'cat': 'Electronics', 'min_price': 500})

# SQLite positional parameters
query = "SELECT * FROM products WHERE category = ? AND price > ?"
df = pd.read_sql(query, sqlite_conn, params=['Electronics', 500])

Index Column Configuration

Specify which column(s) should become the DataFrame index using index_col:

# Single column as index
df = pd.read_sql(
    'SELECT customer_id, name, email FROM customers',
    engine,
    index_col='customer_id'
)
print(df.index.name)  # 'customer_id'

# Multiple columns as MultiIndex
df = pd.read_sql(
    'SELECT year, quarter, region, revenue FROM sales',
    engine,
    index_col=['year', 'quarter']
)
print(type(df.index))  # pandas.core.indexes.multi.MultiIndex

Without index_col, pandas creates a default integer RangeIndex:

df = pd.read_sql('SELECT name, age FROM users', engine)
print(df.index)  # RangeIndex(start=0, stop=N, step=1)

Data Type Control

Explicitly specify column data types using the dtype argument to override pandas’ type inference:

query = "SELECT user_id, status_code, created_at, metadata FROM events"

dtype_spec = {
    'user_id': 'int32',  # Save memory vs default int64
    'status_code': 'category',  # Categorical for repeated values
    'metadata': 'string'  # Use pandas string dtype
}

df = pd.read_sql(query, engine, dtype=dtype_spec)
print(df.dtypes)

This is particularly useful for:

  • Reducing memory usage with smaller integer types
  • Treating numeric codes as categorical data
  • Ensuring consistent string handling with pandas string dtype
  • Preventing unwanted float conversion for integer columns with NULLs

# Without dtype specification, integer columns with NULLs become float64
df1 = pd.read_sql('SELECT id, nullable_count FROM stats', engine)
print(df1['nullable_count'].dtype)  # float64

# Force nullable integer type
df2 = pd.read_sql('SELECT id, nullable_count FROM stats', engine, dtype={'nullable_count': 'Int64'})
print(df2['nullable_count'].dtype)  # Int64 (nullable integer)

Chunked Reading for Large Results

When dealing with millions of rows, load data in chunks to avoid memory exhaustion:

query = "SELECT * FROM large_transactions_table"

# Process data in chunks of 50,000 rows
chunk_iterator = pd.read_sql(query, engine, chunksize=50000)

# Aggregate across chunks
total_amount = 0
for chunk in chunk_iterator:
    total_amount += chunk['amount'].sum()
    
print(f"Total: {total_amount}")

Chunked reading returns an iterator, not a DataFrame. Process each chunk individually:

# Filter and concatenate selected chunks
filtered_chunks = []
chunk_iterator = pd.read_sql(
    "SELECT timestamp, sensor_id, temperature FROM readings",
    engine,
    chunksize=100000
)

for chunk in chunk_iterator:
    # Filter high temperature readings
    high_temp = chunk[chunk['temperature'] > 100]
    if not high_temp.empty:
        filtered_chunks.append(high_temp)

# Combine all filtered chunks
result = pd.concat(filtered_chunks, ignore_index=True)
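
The same filter-and-concatenate pattern as a fully runnable sketch, using an in-memory SQLite table with synthetic data (names and values are illustrative):

```python
import sqlite3
import pandas as pd

# Populate a small in-memory table with synthetic readings
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE readings (sensor_id INTEGER, temperature REAL)')
conn.executemany(
    'INSERT INTO readings VALUES (?, ?)',
    [(i, float(i % 150)) for i in range(1000)],
)

# chunksize turns read_sql into an iterator of DataFrames
chunks = pd.read_sql('SELECT * FROM readings', conn, chunksize=300)

# Keep only high-temperature rows from each chunk, then combine
high = [c[c['temperature'] > 100] for c in chunks]
result = pd.concat(high, ignore_index=True)
conn.close()

print(len(result))  # number of rows with temperature > 100
```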

Parse Dates Automatically

Use parse_dates to automatically convert string columns to datetime objects:

query = "SELECT order_id, order_date, ship_date, delivery_date FROM orders"

df = pd.read_sql(
    query,
    engine,
    parse_dates=['order_date', 'ship_date', 'delivery_date']
)

print(df['order_date'].dtype)  # datetime64[ns]

# Calculate delivery time
df['delivery_time'] = df['delivery_date'] - df['order_date']

For columns stored as text in a non-standard format, pass a dict mapping column names to format strings (or to keyword arguments for pd.to_datetime):

# Parse a text column stored in a custom format
df = pd.read_sql(
    "SELECT event_id, event_time FROM logs",
    engine,
    parse_dates={'event_time': '%Y-%m-%d %H:%M:%S'}
)

SQLAlchemy ORM Integration

Use SQLAlchemy’s ORM and query builder for type-safe, composable queries:

from sqlalchemy import create_engine, MetaData, Table, select

engine = create_engine('postgresql://user:password@localhost/db')
metadata = MetaData()

# Reflect existing table
orders = Table('orders', metadata, autoload_with=engine)

# Build query using SQLAlchemy
query = select(orders.c.customer_id, orders.c.total).where(orders.c.status == 'completed')

# Execute and load into DataFrame
df = pd.read_sql(query, engine)

This approach provides:

  • Database-agnostic query syntax
  • Compile-time query validation
  • Better refactoring support in IDEs
  • Easier query composition and reuse

# Complex query with joins
from sqlalchemy import join

customers = Table('customers', metadata, autoload_with=engine)
orders = Table('orders', metadata, autoload_with=engine)

j = join(customers, orders, customers.c.id == orders.c.customer_id)
query = select(customers.c.name, orders.c.total).select_from(j)

df = pd.read_sql(query, engine)

Connection Management

Always close database connections explicitly or use context managers:

# Manual connection management
conn = sqlite3.connect('database.db')
try:
    df = pd.read_sql('SELECT * FROM data', conn)
finally:
    conn.close()

# Context manager: sqlite3's context manager commits or rolls
# back a transaction on exit; it does NOT close the connection
with sqlite3.connect('database.db') as conn:
    df = pd.read_sql('SELECT * FROM data', conn)
conn.close()  # closing must still be done explicitly
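
For genuine close-on-exit behavior with sqlite3, wrap the connection in contextlib.closing, which calls close() when the with-block exits. A minimal runnable sketch using an in-memory database:

```python
import sqlite3
import pandas as pd
from contextlib import closing

# closing() guarantees conn.close() runs when the block exits
with closing(sqlite3.connect(':memory:')) as conn:
    conn.execute('CREATE TABLE data (x INTEGER)')
    conn.execute('INSERT INTO data VALUES (1), (2)')
    df = pd.read_sql('SELECT * FROM data', conn)

print(df['x'].tolist())  # [1, 2]
# conn is now closed; further use raises sqlite3.ProgrammingError
```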

For SQLAlchemy engines, connections are pooled and managed automatically, but you can still use explicit connection handling:

from sqlalchemy import create_engine

engine = create_engine('postgresql://user:pass@localhost/db')

with engine.connect() as conn:
    df1 = pd.read_sql('SELECT * FROM table1', conn)
    df2 = pd.read_sql('SELECT * FROM table2', conn)
# Connection returned to pool

Performance Considerations

Optimize query performance by selecting only required columns and filtering at the database level:

# SLOW - transfers all data then filters in pandas
df = pd.read_sql('SELECT * FROM huge_table', engine)
df_filtered = df[df['status'] == 'active']

# FAST - filters in database
df = pd.read_sql("SELECT id, name, email FROM huge_table WHERE status = 'active'", engine)

Create database indexes for frequently filtered columns and use EXPLAIN to analyze query performance before loading into pandas. For very large datasets, consider using dask or vaex instead of pandas for out-of-core computation.
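
Query plans come back through the same read_sql interface as any other result set. A sketch using SQLite's EXPLAIN QUERY PLAN (table and index names are illustrative; other databases use EXPLAIN with their own output columns):

```python
import sqlite3
import pandas as pd

# Illustrative table with an index on the filtered column
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE huge_table (id INTEGER, status TEXT)')
conn.execute('CREATE INDEX idx_status ON huge_table (status)')

# The plan is returned as a DataFrame like any other query result
plan = pd.read_sql(
    "EXPLAIN QUERY PLAN SELECT id FROM huge_table WHERE status = 'active'",
    conn,
)
print(plan['detail'].iloc[0])  # shows whether idx_status is used
conn.close()
```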
