Pandas - Convert Column to Datetime
The `pd.to_datetime()` function converts string or numeric columns to datetime objects. For standard ISO 8601 formats, Pandas automatically detects the pattern:
Key Insights
- Pandas provides
pd.to_datetime()for flexible datetime conversion with automatic format detection, custom format strings, and error handling options - Understanding datetime parsing parameters like
format,errors, anddayfirstprevents data loss and improves conversion performance by up to 10x - Mixed datetime formats, timezone handling, and epoch timestamps require specific conversion strategies to maintain data integrity
Basic DateTime Conversion
The pd.to_datetime() function converts string or numeric columns to datetime objects. For standard ISO 8601 formats, Pandas automatically detects the pattern:
import pandas as pd
df = pd.DataFrame({
'date_string': ['2024-01-15', '2024-02-20', '2024-03-25']
})
df['date'] = pd.to_datetime(df['date_string'])
print(df.dtypes)
# date_string object
# date datetime64[ns]
For datetime strings with time components:
df = pd.DataFrame({
'timestamp': ['2024-01-15 14:30:00', '2024-02-20 09:15:30']
})
df['datetime'] = pd.to_datetime(df['timestamp'])
print(df['datetime'])
# 0 2024-01-15 14:30:00
# 1 2024-02-20 09:15:30
Specifying Custom Date Formats
Explicitly defining the format string dramatically improves performance and ensures correct parsing. Use Python’s strftime directives:
df = pd.DataFrame({
'date_str': ['15/01/2024', '20/02/2024', '25/03/2024']
})
# Specify day/month/year format
df['date'] = pd.to_datetime(df['date_str'], format='%d/%m/%Y')
print(df['date'])
# 0 2024-01-15
# 1 2024-02-20
# 2 2024-03-25
Common format codes:
formats = {
'MM/DD/YYYY': '%m/%d/%Y',
'DD-MM-YYYY': '%d-%m-%Y',
'YYYY.MM.DD': '%Y.%m.%d',
'Month DD, YYYY': '%B %d, %Y',
'DD Mon YYYY': '%d %b %Y'
}
df = pd.DataFrame({
'date1': ['01/15/2024'],
'date2': ['15-01-2024'],
'date3': ['2024.01.15'],
'date4': ['January 15, 2024'],
'date5': ['15 Jan 2024']
})
df['parsed1'] = pd.to_datetime(df['date1'], format='%m/%d/%Y')
df['parsed2'] = pd.to_datetime(df['date2'], format='%d-%m-%Y')
df['parsed3'] = pd.to_datetime(df['date3'], format='%Y.%m.%d')
df['parsed4'] = pd.to_datetime(df['date4'], format='%B %d, %Y')
df['parsed5'] = pd.to_datetime(df['date5'], format='%d %b %Y')
Handling Conversion Errors
The errors parameter controls behavior when parsing fails. Three options exist: raise (default), coerce, and ignore:
df = pd.DataFrame({
'dates': ['2024-01-15', 'invalid', '2024-03-25', None]
})
# Default: raises ValueError
try:
df['date_raise'] = pd.to_datetime(df['dates'])
except ValueError as e:
print(f"Error: {e}")
# Coerce invalid values to NaT (Not a Time)
df['date_coerce'] = pd.to_datetime(df['dates'], errors='coerce')
print(df['date_coerce'])
# 0 2024-01-15
# 1 NaT
# 2 2024-03-25
# 3 NaT
# Ignore: returns original data if conversion fails
df['date_ignore'] = pd.to_datetime(df['dates'], errors='ignore')
print(df['date_ignore'])
# 0 2024-01-15
# 1 invalid
# 2 2024-03-25
# 3 None
Identify and handle problematic rows:
df['date_parsed'] = pd.to_datetime(df['dates'], errors='coerce')
invalid_dates = df[df['date_parsed'].isna() & df['dates'].notna()]
print(f"Found {len(invalid_dates)} invalid date entries")
Ambiguous Date Formats
When dates could be interpreted multiple ways (01/02/2024 could be Jan 2 or Feb 1), use dayfirst parameter:
df = pd.DataFrame({
'ambiguous': ['01/02/2024', '05/06/2024', '12/11/2024']
})
# US format (month first)
df['us_format'] = pd.to_datetime(df['ambiguous'], dayfirst=False)
print(df['us_format'])
# 0 2024-01-02 (January 2)
# 1 2024-05-06 (May 6)
# 2 2024-12-11 (December 11)
# European format (day first)
df['eu_format'] = pd.to_datetime(df['ambiguous'], dayfirst=True)
print(df['eu_format'])
# 0 2024-02-01 (February 1)
# 1 2024-06-05 (June 5)
# 2 2024-11-12 (November 12)
Converting Unix Timestamps
Handle epoch timestamps (seconds or milliseconds since 1970-01-01):
df = pd.DataFrame({
'timestamp_sec': [1705334400, 1707926400, 1710518400],
'timestamp_ms': [1705334400000, 1707926400000, 1710518400000]
})
# Seconds since epoch
df['date_sec'] = pd.to_datetime(df['timestamp_sec'], unit='s')
# Milliseconds since epoch
df['date_ms'] = pd.to_datetime(df['timestamp_ms'], unit='ms')
print(df[['date_sec', 'date_ms']])
# date_sec date_ms
# 0 2024-01-15 12:00:00 2024-01-15 12:00:00
# 1 2024-02-14 12:00:00 2024-02-14 12:00:00
# 2 2024-03-15 12:00:00 2024-03-15 12:00:00
Available units: D (days), s (seconds), ms (milliseconds), us (microseconds), ns (nanoseconds).
Timezone Aware Conversions
Convert timezone-naive datetimes to timezone-aware:
df = pd.DataFrame({
'datetime': ['2024-01-15 14:30:00', '2024-02-20 09:15:30']
})
# Parse and localize to UTC
df['utc'] = pd.to_datetime(df['datetime'], utc=True)
print(df['utc'])
# 0 2024-01-15 14:30:00+00:00
# 1 2024-02-20 09:15:30+00:00
# Convert to different timezone
df['eastern'] = df['utc'].dt.tz_convert('America/New_York')
print(df['eastern'])
# 0 2024-01-15 09:30:00-05:00
# 1 2024-02-20 04:15:30-05:00
Parse timezone strings directly:
df = pd.DataFrame({
'datetime_tz': ['2024-01-15 14:30:00+00:00', '2024-02-20 09:15:30-05:00']
})
df['parsed'] = pd.to_datetime(df['datetime_tz'])
print(df['parsed'])
# 0 2024-01-15 14:30:00+00:00
# 1 2024-02-20 09:15:30-05:00
Handling Mixed Formats
When a column contains multiple date formats, use infer_datetime_format or apply custom logic:
df = pd.DataFrame({
'mixed_dates': [
'2024-01-15',
'15/01/2024',
'January 15, 2024',
'2024.01.15'
]
})
# Attempt automatic inference (slower but flexible)
df['parsed'] = pd.to_datetime(df['mixed_dates'], infer_datetime_format=True)
# Or apply conditional parsing
def parse_mixed_dates(date_str):
formats = ['%Y-%m-%d', '%d/%m/%Y', '%B %d, %Y', '%Y.%m.%d']
for fmt in formats:
try:
return pd.to_datetime(date_str, format=fmt)
except ValueError:
continue
return pd.NaT
df['custom_parsed'] = df['mixed_dates'].apply(parse_mixed_dates)
Extracting Date Components
After conversion, extract specific components:
df = pd.DataFrame({
'date_string': ['2024-01-15', '2024-02-20', '2024-03-25']
})
df['date'] = pd.to_datetime(df['date_string'])
# Extract components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['day_of_week'] = df['date'].dt.dayofweek # Monday=0
df['day_name'] = df['date'].dt.day_name()
df['quarter'] = df['date'].dt.quarter
print(df[['date', 'year', 'month', 'day', 'day_name', 'quarter']])
# date year month day day_name quarter
# 0 2024-01-15 2024 1 15 Monday 1
# 1 2024-02-20 2024 2 20 Tuesday 1
# 2 2024-03-25 2024 3 25 Monday 1
Performance Optimization
For large datasets, specify format explicitly and cache results:
import numpy as np
# Generate test data
n = 1_000_000
dates = pd.date_range('2020-01-01', periods=n, freq='H').strftime('%Y-%m-%d %H:%M:%S')
df = pd.DataFrame({'date_str': dates})
# Without format specification (slower)
%timeit pd.to_datetime(df['date_str'])
# ~2.5 seconds
# With format specification (faster)
%timeit pd.to_datetime(df['date_str'], format='%Y-%m-%d %H:%M:%S')
# ~250 milliseconds (10x faster)
# For repeated operations, convert once
df['date'] = pd.to_datetime(df['date_str'], format='%Y-%m-%d %H:%M:%S')
Set datetime as index for time-series operations:
df = df.set_index('date')
df_resampled = df.resample('D').count() # Daily aggregation