Pandas - Filter DataFrame by Date Range

Key Insights

• Pandas offers multiple methods to filter DataFrames by date ranges, including boolean indexing, loc[], between(), and query(), each suited for different scenarios and performance requirements. • Converting date columns to proper datetime types using pd.to_datetime() is essential before filtering, as string-based date comparisons can produce incorrect results. • For large datasets, using datetime indexing with loc[] or converting the DatetimeIndex provides significant performance improvements over repeated boolean operations.

Converting Strings to Datetime Objects

Before filtering by date range, ensure your date column is in datetime format. Pandas won’t correctly interpret string dates for range operations.

import pandas as pd
import numpy as np

# Sample DataFrame with string dates
df = pd.DataFrame({
    'date': ['2024-01-15', '2024-02-20', '2024-03-10', '2024-04-05'],
    'sales': [1200, 1500, 1800, 2100],
    'region': ['North', 'South', 'East', 'West']
})

# Convert to datetime
df['date'] = pd.to_datetime(df['date'])

# Verify the data type
print(df.dtypes)
# Output: date datetime64[ns], sales int64, region object

For different date formats, specify the format parameter for faster parsing:

# European date format
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')

# Handle errors gracefully
df['date'] = pd.to_datetime(df['date'], errors='coerce')  # Invalid dates become NaT

Boolean Indexing with Comparison Operators

The most straightforward approach uses boolean indexing with comparison operators. This method is readable and works well for most use cases.

# Create a larger sample dataset
date_range = pd.date_range(start='2024-01-01', end='2024-12-31', freq='D')
df = pd.DataFrame({
    'date': date_range,
    'revenue': np.random.randint(1000, 5000, size=len(date_range)),
    'expenses': np.random.randint(500, 2000, size=len(date_range))
})

# Filter for Q1 2024
start_date = pd.to_datetime('2024-01-01')
end_date = pd.to_datetime('2024-03-31')

q1_data = df[(df['date'] >= start_date) & (df['date'] <= end_date)]
print(f"Q1 records: {len(q1_data)}")

# Filter excluding boundaries
q1_exclusive = df[(df['date'] > start_date) & (df['date'] < end_date)]

Combine date filters with other conditions:

# High revenue days in Q1
high_revenue_q1 = df[
    (df['date'] >= start_date) & 
    (df['date'] <= end_date) & 
    (df['revenue'] > 3000)
]

print(high_revenue_q1.head())

Using the between() Method

The between() method provides cleaner syntax for inclusive range filtering and improves readability.

# Filter using between() - inclusive by default
q2_data = df[df['date'].between('2024-04-01', '2024-06-30')]

# Exclusive boundaries
q2_exclusive = df[df['date'].between('2024-04-01', '2024-06-30', inclusive='neither')]

# Left inclusive, right exclusive
q2_left = df[df['date'].between('2024-04-01', '2024-06-30', inclusive='left')]

The between() method is particularly useful when working with dynamic date ranges:

from datetime import datetime, timedelta

# Last 30 days
today = datetime.now()
thirty_days_ago = today - timedelta(days=30)

recent_data = df[df['date'].between(thirty_days_ago, today)]

# Current month
month_start = today.replace(day=1)
current_month = df[df['date'].between(month_start, today)]

Filtering with loc[] and DatetimeIndex

For repeated filtering operations, setting the date column as the index dramatically improves performance and simplifies syntax.

# Set date as index
df_indexed = df.set_index('date')

# Filter using loc with string dates
q3_data = df_indexed.loc['2024-07-01':'2024-09-30']

# Partial string indexing - get all data for a specific month
july_data = df_indexed.loc['2024-07']

# Get all data for a specific year
year_2024 = df_indexed.loc['2024']

# Combine with column selection
july_revenue = df_indexed.loc['2024-07', 'revenue']

Advanced indexing with DatetimeIndex:

# Filter by time of day (if datetime includes time)
df_with_time = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=100, freq='H'),
    'value': np.random.randn(100)
}).set_index('timestamp')

# Get business hours data (9 AM to 5 PM)
business_hours = df_with_time.between_time('09:00', '17:00')

# Get data for specific hours
morning = df_with_time.at_time('09:00')

Using query() Method

The query() method offers a SQL-like syntax that can be more readable for complex conditions.

# Define date variables
start = '2024-05-01'
end = '2024-05-31'

# Filter using query
may_data = df.query('date >= @start and date <= @end')

# Complex query with multiple conditions
filtered = df.query(
    'date >= @start and date <= @end and revenue > 2500 and expenses < 1500'
)

# Using variables from the environment
threshold = 3000
high_performers = df.query('revenue > @threshold and date >= "2024-06-01"')

Handling Missing Dates and Edge Cases

Real-world data often contains gaps and requires robust filtering logic.

# Create DataFrame with missing dates
dates_with_gaps = pd.to_datetime([
    '2024-01-01', '2024-01-05', '2024-01-10', '2024-02-01'
])
df_gaps = pd.DataFrame({
    'date': dates_with_gaps,
    'value': [100, 200, 300, 400]
})

# Filter and handle NaT (Not a Time)
df_gaps.loc[1, 'date'] = pd.NaT

# Remove NaT before filtering
df_clean = df_gaps.dropna(subset=['date'])
filtered_clean = df_clean[df_clean['date'] >= '2024-01-05']

# Or filter while keeping NaT
filtered_with_nat = df_gaps[
    (df_gaps['date'] >= '2024-01-05') | (df_gaps['date'].isna())
]

Performance Comparison

For large datasets, method choice impacts performance significantly.

import time

# Create large dataset
large_df = pd.DataFrame({
    'date': pd.date_range('2020-01-01', periods=1000000, freq='T'),
    'value': np.random.randn(1000000)
})

# Method 1: Boolean indexing
start = time.time()
result1 = large_df[(large_df['date'] >= '2023-01-01') & 
                    (large_df['date'] <= '2023-12-31')]
print(f"Boolean indexing: {time.time() - start:.4f}s")

# Method 2: between()
start = time.time()
result2 = large_df[large_df['date'].between('2023-01-01', '2023-12-31')]
print(f"between() method: {time.time() - start:.4f}s")

# Method 3: DatetimeIndex with loc
large_df_indexed = large_df.set_index('date')
start = time.time()
result3 = large_df_indexed.loc['2023-01-01':'2023-12-31']
print(f"DatetimeIndex loc: {time.time() - start:.4f}s")

Filtering by Relative Date Ranges

Business logic often requires filtering based on relative dates rather than absolute values.

from datetime import datetime, timedelta

# Current date reference
now = pd.Timestamp.now()

# Last 7 days
last_week = df[df['date'] >= (now - timedelta(days=7))]

# Next 30 days
next_month = df[df['date'] <= (now + timedelta(days=30))]

# Year to date
year_start = pd.Timestamp(now.year, 1, 1)
ytd = df[df['date'] >= year_start]

# Rolling window - last 90 days from each date
df['last_90_days'] = df['date'].apply(
    lambda x: df[(df['date'] >= x - timedelta(days=90)) & 
                  (df['date'] <= x)]['revenue'].sum()
)

Choose boolean indexing for simple, one-time filters. Use between() for cleaner syntax with inclusive ranges. Implement DatetimeIndex with loc[] for repeated filtering operations on large datasets. Apply query() when combining multiple conditions for improved readability. Always convert string dates to datetime objects before filtering to ensure accurate comparisons and leverage Pandas’ optimized datetime operations.