# Date Functions in PySpark vs Pandas vs SQL
## Key Insights

- PySpark and SQL share nearly identical date function syntax, making transitions between them straightforward—Pandas is the outlier with its object-oriented approach
- Pandas offers the most flexibility for complex date manipulations through its DateOffset and Timedelta classes, but this power comes with a steeper learning curve
- Always specify explicit format strings when parsing dates; implicit parsing is a silent bug waiting to happen in production pipelines
## The Date Function Mental Model Problem
Every data engineer knows this pain: you write a date transformation in Pandas during exploration, then need to port it to PySpark for production, and finally someone asks for the equivalent SQL for a dashboard query. Each time, you’re googling the same basic operations.
The root issue isn’t complexity—date functions are conceptually simple. The problem is that each tool uses different naming conventions and slightly different approaches. PySpark and SQL are close cousins, while Pandas takes an object-oriented path that feels foreign to SQL-trained engineers.
This article gives you the mental map to translate between all three without constant documentation lookups.
## Getting the Current Date/Time
Let’s start with the most basic operation: getting the current timestamp.
```python
# PySpark
from pyspark.sql import functions as F

df = df.withColumn("created_at", F.current_timestamp())
df = df.withColumn("created_date", F.current_date())
```
```python
# Pandas
import pandas as pd

df["created_at"] = pd.Timestamp.now()
df["created_date"] = pd.Timestamp.now().date()

# Or for timezone-aware timestamps
df["created_at"] = pd.Timestamp.now(tz="UTC")
```
```sql
-- SQL Server
SELECT GETDATE() AS created_at, CAST(GETDATE() AS DATE) AS created_date;

-- PostgreSQL / Standard SQL
SELECT CURRENT_TIMESTAMP AS created_at, CURRENT_DATE AS created_date;

-- MySQL
SELECT NOW() AS created_at, CURDATE() AS created_date;
```
The key difference: PySpark and SQL evaluate these functions per-row at execution time, while Pandas assigns a single scalar value to the entire column. If you need row-level evaluation in Pandas (rare, but it happens), you’d need to use apply().
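For the rare case where you do want per-row evaluation in Pandas, apply() is the escape hatch. A minimal sketch contrasting the two behaviors (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3]})

# Scalar assignment: Timestamp.now() runs once; the same value is broadcast to every row
df["batch_ts"] = pd.Timestamp.now()

# Per-row evaluation: the lambda (and thus Timestamp.now()) runs once per row
df["row_ts"] = df["id"].apply(lambda _: pd.Timestamp.now())
```

In practice the scalar form is almost always what you want; per-row clocks only matter when each row's processing time is itself significant.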
## Parsing and Formatting Dates
String-to-date conversion is where bugs breed. Implicit parsing seems convenient until your pipeline breaks because someone uploaded European-formatted dates.
```python
# PySpark - explicit format is mandatory (good)
from pyspark.sql import functions as F

df = df.withColumn(
    "parsed_date",
    F.to_date(F.col("date_string"), "yyyy-MM-dd")
)
df = df.withColumn(
    "parsed_timestamp",
    F.to_timestamp(F.col("ts_string"), "yyyy-MM-dd HH:mm:ss")
)

# Formatting back to string
df = df.withColumn(
    "formatted",
    F.date_format(F.col("parsed_date"), "MM/dd/yyyy")
)
```
```python
# Pandas - format is optional but you should always specify it
import pandas as pd

# Dangerous - relies on inference
df["parsed_date"] = pd.to_datetime(df["date_string"])

# Safe - explicit format
df["parsed_date"] = pd.to_datetime(df["date_string"], format="%Y-%m-%d")

# Formatting to string
df["formatted"] = df["parsed_date"].dt.strftime("%m/%d/%Y")
```
```sql
-- PostgreSQL
SELECT TO_DATE('2024-01-15', 'YYYY-MM-DD') AS parsed_date;
SELECT TO_CHAR(created_at, 'MM/DD/YYYY') AS formatted;

-- SQL Server
SELECT CONVERT(DATE, '2024-01-15', 23) AS parsed_date;
SELECT FORMAT(created_at, 'MM/dd/yyyy') AS formatted;

-- MySQL
SELECT STR_TO_DATE('2024-01-15', '%Y-%m-%d') AS parsed_date;
SELECT DATE_FORMAT(created_at, '%m/%d/%Y') AS formatted;
```
Notice the format string syntax differs: PySpark uses Java’s SimpleDateFormat patterns (yyyy-MM-dd), Pandas uses Python’s strftime codes (%Y-%m-%d), and SQL varies by dialect. This is the single most annoying inconsistency across these tools.
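To see concretely why implicit parsing is dangerous, consider an ambiguous day/month string. A quick Pandas demonstration (the date value is made up for illustration):

```python
import pandas as pd

ambiguous = pd.Series(["03/04/2024"])  # 3 April or 4 March?

# Inference assumes month-first by default: parses as March 4
inferred = pd.to_datetime(ambiguous)

# Explicit formats make the intent unambiguous
us_style = pd.to_datetime(ambiguous, format="%m/%d/%Y")  # March 4
eu_style = pd.to_datetime(ambiguous, format="%d/%m/%Y")  # April 3
```

Both explicit calls succeed silently on this input—only the format string tells you which date you actually got, which is exactly why it belongs in the code.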
## Extracting Date Components
Pulling out year, month, day, and other components is straightforward once you know the function names.
```python
# PySpark
from pyspark.sql import functions as F

df = df.withColumn("year", F.year(F.col("date_col")))
df = df.withColumn("month", F.month(F.col("date_col")))
df = df.withColumn("day", F.dayofmonth(F.col("date_col")))
df = df.withColumn("day_of_week", F.dayofweek(F.col("date_col")))  # 1=Sunday
df = df.withColumn("day_of_year", F.dayofyear(F.col("date_col")))
df = df.withColumn("quarter", F.quarter(F.col("date_col")))
df = df.withColumn("week_of_year", F.weekofyear(F.col("date_col")))
```
```python
# Pandas - uses the .dt accessor
df["year"] = df["date_col"].dt.year
df["month"] = df["date_col"].dt.month
df["day"] = df["date_col"].dt.day
df["day_of_week"] = df["date_col"].dt.dayofweek  # 0=Monday
df["day_of_year"] = df["date_col"].dt.dayofyear
df["quarter"] = df["date_col"].dt.quarter
df["week_of_year"] = df["date_col"].dt.isocalendar().week
```
```sql
-- PostgreSQL
SELECT
    EXTRACT(YEAR FROM date_col) AS year,
    EXTRACT(MONTH FROM date_col) AS month,
    EXTRACT(DAY FROM date_col) AS day,
    EXTRACT(DOW FROM date_col) AS day_of_week,  -- 0=Sunday
    EXTRACT(DOY FROM date_col) AS day_of_year,
    EXTRACT(QUARTER FROM date_col) AS quarter,
    EXTRACT(WEEK FROM date_col) AS week_of_year
FROM table_name;

-- SQL Server
SELECT
    YEAR(date_col) AS year,
    MONTH(date_col) AS month,
    DAY(date_col) AS day,
    DATEPART(WEEKDAY, date_col) AS day_of_week,
    DATEPART(DAYOFYEAR, date_col) AS day_of_year,
    DATEPART(QUARTER, date_col) AS quarter,
    DATEPART(WEEK, date_col) AS week_of_year
FROM table_name;
```
Watch out for day-of-week indexing. PySpark starts with Sunday=1, Pandas starts with Monday=0, and SQL varies by dialect and server settings. Always test this explicitly in your environment.
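One defensive pattern is to normalize to ISO weekday numbers (1=Monday … 7=Sunday) at the boundary of each system, so downstream logic never depends on a tool's native convention. In Pandas that is a one-line shift; a sketch with two known dates:

```python
import pandas as pd

# 2024-01-15 is a Monday, 2024-01-21 is a Sunday
df = pd.DataFrame({"date_col": pd.to_datetime(["2024-01-15", "2024-01-21"])})

# Native Pandas convention: 0=Monday ... 6=Sunday
df["pandas_dow"] = df["date_col"].dt.dayofweek

# ISO weekday: 1=Monday ... 7=Sunday (matches EXTRACT(ISODOW FROM ...) in PostgreSQL)
df["iso_dow"] = df["date_col"].dt.dayofweek + 1
```

Hard-coding a couple of known dates like this into a unit test is a cheap way to catch convention mismatches before they hit production.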
## Date Arithmetic
Adding and subtracting time intervals is where Pandas shines with its expressive API, while PySpark and SQL stick to simpler function-based approaches.
```python
# PySpark
from pyspark.sql import functions as F

# Add/subtract days
df = df.withColumn("next_week", F.date_add(F.col("date_col"), 7))
df = df.withColumn("last_week", F.date_sub(F.col("date_col"), 7))

# Add months
df = df.withColumn("next_month", F.add_months(F.col("date_col"), 1))

# Difference between dates (returns integer days)
df = df.withColumn("days_diff", F.datediff(F.col("end_date"), F.col("start_date")))

# Difference in months
df = df.withColumn("months_diff", F.months_between(F.col("end_date"), F.col("start_date")))
```
```python
# Pandas - multiple approaches
import pandas as pd
from pandas import DateOffset, Timedelta

# Using Timedelta for days/hours/minutes
df["next_week"] = df["date_col"] + Timedelta(days=7)
df["next_hour"] = df["date_col"] + Timedelta(hours=1)

# Using DateOffset for calendar-aware arithmetic
df["next_month"] = df["date_col"] + DateOffset(months=1)
df["next_quarter"] = df["date_col"] + DateOffset(months=3)
df["next_business_day"] = df["date_col"] + pd.offsets.BusinessDay(1)

# Date differences
df["days_diff"] = (df["end_date"] - df["start_date"]).dt.days
df["seconds_diff"] = (df["end_date"] - df["start_date"]).dt.total_seconds()
```
```sql
-- PostgreSQL
SELECT
    date_col + INTERVAL '7 days' AS next_week,
    date_col + INTERVAL '1 month' AS next_month,
    end_date - start_date AS days_diff,
    AGE(end_date, start_date) AS interval_diff
FROM table_name;

-- SQL Server
SELECT
    DATEADD(DAY, 7, date_col) AS next_week,
    DATEADD(MONTH, 1, date_col) AS next_month,
    DATEDIFF(DAY, start_date, end_date) AS days_diff
FROM table_name;

-- MySQL
SELECT
    DATE_ADD(date_col, INTERVAL 7 DAY) AS next_week,
    DATE_ADD(date_col, INTERVAL 1 MONTH) AS next_month,
    DATEDIFF(end_date, start_date) AS days_diff
FROM table_name;
```
Pandas’ DateOffset is particularly powerful for business logic. Need the last day of the month? DateOffset(months=1, days=-1) from the first of the month. Need the next business day excluding holidays? Combine BusinessDay with a custom holiday calendar.
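A concrete sketch of that last point, using pandas' CustomBusinessDay (the holiday list here is hypothetical—substitute your own calendar):

```python
import pandas as pd
from pandas.tseries.offsets import CustomBusinessDay

# Hypothetical company holidays
holidays = [pd.Timestamp("2024-01-01"), pd.Timestamp("2024-07-04")]
next_open_day = CustomBusinessDay(1, holidays=holidays)

# Friday 2024-06-28 + 1 business day skips the weekend -> Monday 2024-07-01
print(pd.Timestamp("2024-06-28") + next_open_day)

# Wednesday 2024-07-03 + 1 business day skips the July 4 holiday -> Friday 2024-07-05
print(pd.Timestamp("2024-07-03") + next_open_day)
```

The same offset can be added to a whole datetime column, so the business-day logic lives in one place instead of being scattered through apply() lambdas.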
## Truncating and Rounding Dates
Truncation is essential for time-series aggregation. You’ll use this constantly for grouping by month, quarter, or year.
```python
# PySpark
from pyspark.sql import functions as F

df = df.withColumn("month_start", F.date_trunc("month", F.col("date_col")))
df = df.withColumn("year_start", F.date_trunc("year", F.col("date_col")))
df = df.withColumn("quarter_start", F.date_trunc("quarter", F.col("date_col")))
df = df.withColumn("week_start", F.date_trunc("week", F.col("date_col")))

# Last day of month
df = df.withColumn("month_end", F.last_day(F.col("date_col")))
```
```python
# Pandas - multiple approaches
# Using period conversion (returns Period, not Timestamp)
df["month_period"] = df["date_col"].dt.to_period("M")

# To get actual timestamps, round-trip through Period
df["month_start"] = df["date_col"].dt.to_period("M").dt.to_timestamp()
df["year_start"] = df["date_col"].dt.to_period("Y").dt.to_timestamp()

# Beware the offset shortcut: subtracting MonthBegin(1) rolls a date that is
# already the first of the month back a full month
# df["month_start"] = df["date_col"].dt.floor("D") - pd.offsets.MonthBegin(1)  # buggy on the 1st

# Last day of month
df["month_end"] = df["date_col"] + pd.offsets.MonthEnd(0)
```
```sql
-- PostgreSQL
SELECT
    DATE_TRUNC('month', date_col) AS month_start,
    DATE_TRUNC('year', date_col) AS year_start,
    DATE_TRUNC('quarter', date_col) AS quarter_start,
    (DATE_TRUNC('month', date_col) + INTERVAL '1 month - 1 day')::DATE AS month_end
FROM table_name;

-- SQL Server
SELECT
    DATETRUNC(MONTH, date_col) AS month_start,  -- SQL Server 2022+
    DATEFROMPARTS(YEAR(date_col), 1, 1) AS year_start,
    EOMONTH(date_col) AS month_end
FROM table_name;
```
## Quick Reference Cheat Sheet
| Operation | PySpark | Pandas | PostgreSQL |
|---|---|---|---|
| Current timestamp | `F.current_timestamp()` | `pd.Timestamp.now()` | `CURRENT_TIMESTAMP` |
| Current date | `F.current_date()` | `pd.Timestamp.now().date()` | `CURRENT_DATE` |
| Parse string | `F.to_date(col, fmt)` | `pd.to_datetime(s, format=)` | `TO_DATE(s, fmt)` |
| Format to string | `F.date_format(col, fmt)` | `s.dt.strftime(fmt)` | `TO_CHAR(col, fmt)` |
| Extract year | `F.year(col)` | `s.dt.year` | `EXTRACT(YEAR FROM col)` |
| Extract month | `F.month(col)` | `s.dt.month` | `EXTRACT(MONTH FROM col)` |
| Add days | `F.date_add(col, n)` | `s + Timedelta(days=n)` | `col + INTERVAL 'n days'` |
| Add months | `F.add_months(col, n)` | `s + DateOffset(months=n)` | `col + INTERVAL 'n months'` |
| Date difference | `F.datediff(end, start)` | `(end - start).dt.days` | `end - start` |
| Truncate to month | `F.date_trunc('month', col)` | `s.dt.to_period('M').dt.to_timestamp()` | `DATE_TRUNC('month', col)` |
| Last day of month | `F.last_day(col)` | `s + pd.offsets.MonthEnd(0)` | `(DATE_TRUNC('month', col) + INTERVAL '1 month - 1 day')::DATE` |
Keep this table bookmarked. After a few weeks of active use, the mappings become automatic. Until then, reference it liberally—there’s no prize for memorizing syntax when you could be shipping features.