Pandas - Convert Timestamp to Date
Key Insights
• Pandas provides multiple methods to convert timestamps to dates: dt.date, dt.normalize(), and dt.floor(), each serving different use cases from extracting date objects to maintaining datetime64 dtype
• The dt.date accessor returns Python date objects which breaks vectorization, while dt.normalize() and dt.floor('D') preserve datetime64 dtype for better performance on large datasets
• Timezone-aware timestamps require explicit handling before conversion to avoid ambiguity, using dt.tz_localize(None) or dt.tz_convert() as needed
Converting Timestamps Using dt.date
The most straightforward way to extract dates from timestamps is the dt.date accessor. This method returns Python datetime.date objects, stripping away the time component entirely.
import pandas as pd
import numpy as np
# Create sample timestamp data
df = pd.DataFrame({
'timestamp': pd.date_range('2024-01-15 08:30:00', periods=5, freq='6h')
})
print("Original timestamps:")
print(df['timestamp'])
# Convert to date using dt.date
df['date'] = df['timestamp'].dt.date
print("\nConverted to dates:")
print(df['date'])
print(f"Data type: {type(df['date'].iloc[0])}")
Output:
Original timestamps:
0 2024-01-15 08:30:00
1 2024-01-15 14:30:00
2 2024-01-15 20:30:00
3 2024-01-16 02:30:00
4 2024-01-16 08:30:00
Converted to dates:
0 2024-01-15
1 2024-01-15
2 2024-01-15
3 2024-01-16
4 2024-01-16
Data type: <class 'datetime.date'>
The limitation here is performance. Python date objects are not vectorized, making operations on large datasets significantly slower than native pandas datetime64 operations.
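A concrete look at that dtype change (the column names here are illustrative):

```python
import datetime

import pandas as pd

df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2024-01-15 08:30', '2024-01-16 09:00'])
})

# dt.date yields an object-dtype Series of datetime.date instances
dates = df['timestamp'].dt.date
print(dates.dtype)  # object

# Comparisons must use datetime.date, not strings or Timestamps
mask = dates == datetime.date(2024, 1, 15)
print(mask.tolist())  # [True, False]
```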
Preserving datetime64 with dt.normalize()
When you need to maintain the datetime64 dtype while setting time to midnight, use dt.normalize(). This approach keeps your data in pandas’ optimized format.
# Create DataFrame with various timestamps
df = pd.DataFrame({
'timestamp': pd.to_datetime([
'2024-01-15 08:30:45',
'2024-01-15 14:22:10',
'2024-01-16 23:59:59',
'2024-01-17 00:00:01'
])
})
# Normalize to midnight
df['normalized_date'] = df['timestamp'].dt.normalize()
print("Original vs Normalized:")
print(df)
print(f"\nNormalized dtype: {df['normalized_date'].dtype}")
Output:
timestamp normalized_date
0 2024-01-15 08:30:45 2024-01-15
1 2024-01-15 14:22:10 2024-01-15
2 2024-01-16 23:59:59 2024-01-16
3 2024-01-17 00:00:01 2024-01-17
Normalized dtype: datetime64[ns]
This method is ideal for grouping operations, time-based joins, and any scenario where you need date-level granularity while maintaining compatibility with pandas datetime operations.
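For instance, a normalized column can serve directly as the key in a date-level join, since both sides stay datetime64. A minimal sketch (the daily lookup table is hypothetical):

```python
import pandas as pd

events = pd.DataFrame({
    'timestamp': pd.to_datetime(['2024-01-15 08:30', '2024-01-16 14:00']),
    'value': [10, 20],
})
# Hypothetical daily lookup table keyed on midnight timestamps
daily = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-15', '2024-01-16']),
    'rate': [1.1, 1.2],
})

# Normalizing keeps datetime64, so the join key dtypes match
events['date'] = events['timestamp'].dt.normalize()
merged = events.merge(daily, on='date', how='left')
print(merged[['date', 'value', 'rate']])
```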
Using dt.floor() for Date Conversion
The dt.floor() method provides flexible rounding to any time frequency. For date conversion, use floor('D') to round down to the start of each day.
# Create timestamps with various times
df = pd.DataFrame({
'timestamp': pd.to_datetime([
'2024-03-15 08:30:00',
'2024-03-15 16:45:30',
'2024-03-16 00:15:00',
'2024-03-16 23:59:59'
])
})
# Floor to day level
df['floored_date'] = df['timestamp'].dt.floor('D')
# Compare with normalize
df['normalized_date'] = df['timestamp'].dt.normalize()
print(df)
print(f"\nAre floor and normalize equal? {df['floored_date'].equals(df['normalized_date'])}")
Output:
timestamp floored_date normalized_date
0 2024-03-15 08:30:00 2024-03-15 00:00:00 2024-03-15 00:00:00
1 2024-03-15 16:45:30 2024-03-15 00:00:00 2024-03-15 00:00:00
2 2024-03-16 00:15:00 2024-03-16 00:00:00 2024-03-16 00:00:00
3 2024-03-16 23:59:59 2024-03-16 00:00:00 2024-03-16 00:00:00
Are floor and normalize equal? True
While floor('D') and normalize() produce identical results for date conversion, floor() offers additional flexibility for rounding to other frequencies such as hours ('h'), minutes ('min'), or weeks ('W'). Note that older pandas versions used the aliases 'H' and 'T' for hours and minutes; these are deprecated as of pandas 2.2.
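A short sketch of floor() at other frequencies (the aliases shown are the modern lowercase forms; older pandas accepts 'H' and '15T'):

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(['2024-03-15 08:47:12', '2024-03-15 16:03:55']))

hourly = ts.dt.floor('h')       # round down to the start of the hour
quarter = ts.dt.floor('15min')  # round down to 15-minute bins
print(hourly.tolist())   # 08:00:00 and 16:00:00
print(quarter.tolist())  # 08:45:00 and 16:00:00
```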
Handling Timezone-Aware Timestamps
Timezone-aware timestamps require special attention. You must decide whether to convert to a specific timezone or remove timezone information entirely.
# Create timezone-aware timestamps
df = pd.DataFrame({
'timestamp_utc': pd.date_range('2024-01-15 08:00:00', periods=4, freq='6h', tz='UTC')
})
print("Original UTC timestamps:")
print(df['timestamp_utc'])
# Convert to Eastern Time before extracting date
df['timestamp_eastern'] = df['timestamp_utc'].dt.tz_convert('America/New_York')
df['date_eastern'] = df['timestamp_eastern'].dt.normalize().dt.tz_localize(None)  # drop tz so the result is naive
# Remove timezone info then extract date
df['date_utc'] = df['timestamp_utc'].dt.tz_localize(None).dt.normalize()
print("\nConverted dates:")
print(df[['timestamp_utc', 'date_eastern', 'date_utc']])
Output:
Original UTC timestamps:
0 2024-01-15 08:00:00+00:00
1 2024-01-15 14:00:00+00:00
2 2024-01-15 20:00:00+00:00
3 2024-01-16 02:00:00+00:00
Converted dates:
timestamp_utc date_eastern date_utc
0 2024-01-15 08:00:00+00:00 2024-01-15 00:00:00 2024-01-15 00:00:00
1 2024-01-15 14:00:00+00:00 2024-01-15 00:00:00 2024-01-15 00:00:00
2 2024-01-15 20:00:00+00:00 2024-01-15 00:00:00 2024-01-15 00:00:00
3 2024-01-16 02:00:00+00:00 2024-01-15 00:00:00 2024-01-16 00:00:00
Notice that the last timestamp (2024-01-16 02:00 UTC) becomes 2024-01-15 in Eastern Time because it’s still January 15th at 9 PM EST.
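Daylight-saving transitions deserve extra care: two UTC timestamps only minutes apart can land at different local offsets. A sketch around the 2024 US spring-forward transition (2024-03-10, when clocks jump from 02:00 EST to 03:00 EDT):

```python
import pandas as pd

# 07:00 UTC on 2024-03-10 is the moment US Eastern clocks spring forward
utc = pd.to_datetime(['2024-03-10 06:59:00', '2024-03-10 07:01:00']).tz_localize('UTC')
eastern = utc.tz_convert('America/New_York')
print(eastern)  # 01:59 EST, then 03:01 EDT -- no 02:xx exists locally that day
```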
Converting String Timestamps to Dates
When working with timestamp strings, combine pd.to_datetime() with date conversion in a single operation or chain them for clarity.
# Sample data with string timestamps
data = {
'event': ['Login', 'Purchase', 'Logout', 'Login'],
'timestamp_str': [
'2024-01-15 08:30:00',
'2024-01-15 09:45:30',
'2024-01-15 10:22:15',
'2024-01-16 08:15:00'
]
}
df = pd.DataFrame(data)
# Method 1: Convert then extract date
df['date_method1'] = pd.to_datetime(df['timestamp_str']).dt.normalize()
# Method 2: Convert with specific format then extract
df['date_method2'] = pd.to_datetime(
df['timestamp_str'],
format='%Y-%m-%d %H:%M:%S'
).dt.date
print(df[['event', 'timestamp_str', 'date_method1', 'date_method2']])
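Real-world string data is rarely clean. Passing errors='coerce' to pd.to_datetime() converts unparseable strings to NaT instead of raising (the sample values below are illustrative):

```python
import pandas as pd

raw = pd.Series(['2024-01-15 08:30:00', 'not a date', '2024-01-16 09:00:00'])

# Unparseable entries become NaT instead of raising a ValueError
parsed = pd.to_datetime(raw, errors='coerce')
dates = parsed.dt.normalize()
print(dates)
print(f"Failed to parse: {parsed.isna().sum()} row(s)")
```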
Performance Considerations for Large Datasets
When working with millions of rows, the choice of conversion method significantly impacts performance.
import time
# Create large dataset
n_rows = 1_000_000
large_df = pd.DataFrame({
'timestamp': pd.date_range('2024-01-01', periods=n_rows, freq='1min')
})
# Benchmark dt.date (returns Python objects)
start = time.time()
dates_python = large_df['timestamp'].dt.date
time_python = time.time() - start
# Benchmark dt.normalize (returns datetime64)
start = time.time()
dates_normalized = large_df['timestamp'].dt.normalize()
time_normalized = time.time() - start
print(f"dt.date execution time: {time_python:.4f} seconds")
print(f"dt.normalize execution time: {time_normalized:.4f} seconds")
print(f"Speedup: {time_python/time_normalized:.2f}x")
On typical hardware, dt.normalize() runs 10-50x faster than dt.date on large datasets because it maintains vectorized operations.
Grouping and Aggregating by Date
Converting timestamps to dates enables powerful grouping operations for daily aggregations.
# Sales data with timestamps
sales_data = pd.DataFrame({
'timestamp': pd.date_range('2024-01-15 08:00', periods=100, freq='2h'),
'amount': np.random.randint(100, 1000, 100)
})
# Convert to date for grouping
sales_data['date'] = sales_data['timestamp'].dt.normalize()
# Daily aggregations
daily_summary = sales_data.groupby('date').agg({
'amount': ['sum', 'mean', 'count']
}).round(2)
print("Daily Sales Summary:")
print(daily_summary.head())
This pattern is essential for time-series analysis, reporting, and dashboard creation where daily granularity is required.
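An alternative worth knowing: resample(on=...) performs the same daily bucketing without an intermediate date column. A minimal sketch with deterministic amounts (the '2h' alias assumes pandas 2.2+; older versions use '2H'):

```python
import numpy as np
import pandas as pd

sales = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-15 08:00', periods=10, freq='2h'),
    'amount': np.arange(10),
})

# resample(on='timestamp') groups rows into calendar-day bins directly
daily = sales.resample('D', on='timestamp')['amount'].sum()
print(daily)
```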
Handling Missing Values
Missing timestamp values deserve attention during conversion, though both approaches handle them gracefully.
# Data with missing timestamps
df = pd.DataFrame({
'timestamp': pd.to_datetime([
'2024-01-15 08:30:00',
None,
'2024-01-16 14:22:00',
np.nan,
'2024-01-17 09:15:00'
])
})
# dt.normalize preserves NaT (Not a Time)
df['date_normalized'] = df['timestamp'].dt.normalize()
# dt.date also propagates missing values (as NaT)
df['date_python'] = df['timestamp'].dt.date
print(df)
print(f"\nNull count in normalized: {df['date_normalized'].isna().sum()}")
print(f"Null count in python dates: {df['date_python'].isna().sum()}")
Both methods properly handle missing values, propagating NaT through the resulting series. This allows for consistent null handling in downstream operations.
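Downstream, the NaT values can be dropped or filled explicitly; the sentinel date below is purely an illustrative choice:

```python
import pandas as pd

s = pd.to_datetime(pd.Series(['2024-01-15 08:30:00', None, '2024-01-17 09:15:00']))
dates = s.dt.normalize()

# Either drop the missing dates...
print(dates.dropna())

# ...or fill them with an explicit sentinel value (illustrative)
filled = dates.fillna(pd.Timestamp('1970-01-01'))
print(filled)
```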