Spark SQL - Date and Timestamp Functions
Key Insights
- Spark SQL provides 50+ date and timestamp functions that handle timezone conversions, date arithmetic, and format transformations with significantly better performance than UDFs
- Understanding the difference between `date`, `timestamp`, and `timestamp_ntz` types is critical: timestamps store UTC internally while timestamp_ntz preserves local time without timezone information
- Common pitfalls include implicit timezone conversions, incorrect format strings in parsing functions, and null propagation in date arithmetic that can silently corrupt aggregations
Core Date and Timestamp Types
Spark SQL handles three temporal data types: date (calendar date without time), timestamp (instant in time with timezone), and timestamp_ntz (timestamp without timezone, Spark 3.4+).
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark = SparkSession.builder.appName("date-functions").getOrCreate()
# Create sample data with different temporal types
data = [
("2024-01-15", "2024-01-15 14:30:00", "2024-01-15 14:30:00"),
("2024-06-20", "2024-06-20 09:15:30", "2024-06-20 09:15:30")
]
df = spark.createDataFrame(data, ["date_str", "ts_str", "ts_ntz_str"])
# Cast to appropriate types
df = df.select(
col("date_str").cast("date").alias("date_col"),
col("ts_str").cast("timestamp").alias("ts_col"),
col("ts_ntz_str").cast("timestamp_ntz").alias("ts_ntz_col")
)
df.printSchema()
# root
# |-- date_col: date (nullable = true)
# |-- ts_col: timestamp (nullable = true)
# |-- ts_ntz_col: timestamp_ntz (nullable = true)
The timestamp type stores UTC internally and converts based on session timezone. The timestamp_ntz type stores exactly what you provide without conversion.
Parsing and Formatting Functions
Converting between strings and temporal types requires explicit format patterns. Since Spark 3.0 these follow the java.time DateTimeFormatter conventions (the older SimpleDateFormat behavior is only available via spark.sql.legacy.timeParserPolicy).
# Create data with various date formats
messy_dates = [
("15-Jan-2024", "2024/01/15 2:30 PM", "20240115"),
("20-Jun-2024", "2024/06/20 9:15 AM", "20240620")
]
df = spark.createDataFrame(messy_dates, ["format1", "format2", "format3"])
# Parse different formats
df_parsed = df.select(
to_date("format1", "dd-MMM-yyyy").alias("date1"),
to_timestamp("format2", "yyyy/MM/dd h:mm a").alias("ts1"),
to_date("format3", "yyyyMMdd").alias("date2")
)
# Format timestamps back to strings
df_formatted = df_parsed.select(
date_format("date1", "yyyy-MM-dd").alias("iso_date"),
date_format("ts1", "EEE, MMM dd yyyy HH:mm:ss").alias("readable_ts"),
date_format("date2", "MM/dd/yy").alias("us_date")
)
df_formatted.show(truncate=False)
# +----------+---------------------------+--------+
# |iso_date |readable_ts |us_date |
# +----------+---------------------------+--------+
# |2024-01-15|Mon, Jan 15 2024 14:30:00 |01/15/24|
# |2024-06-20|Thu, Jun 20 2024 09:15:00 |06/20/24|
# +----------+---------------------------+--------+
Under the default (non-ANSI) configuration, strings that do not match the pattern parse to null rather than raising an error, so bad data can slip through silently. Always validate parsing results:
df_with_invalid = spark.createDataFrame(
[("2024-01-15",), ("invalid-date",), (None,)],
["date_str"]
)
df_validated = df_with_invalid.select(
col("date_str"),
to_date("date_str", "yyyy-MM-dd").alias("parsed"),
when(to_date("date_str", "yyyy-MM-dd").isNull(), "INVALID")
.otherwise("VALID").alias("status")
)
df_validated.show()
Date Arithmetic and Interval Operations
Spark provides functions for adding/subtracting time periods and calculating differences between dates.
from pyspark.sql.functions import expr
dates_df = spark.createDataFrame(
[("2024-01-15", "2024-03-20")],
["start_date", "end_date"]
).select(
col("start_date").cast("date").alias("start_date"),
col("end_date").cast("date").alias("end_date")
)
# Date arithmetic
result = dates_df.select(
"start_date",
"end_date",
# Add/subtract days
date_add("start_date", 30).alias("plus_30_days"),
date_sub("start_date", 7).alias("minus_7_days"),
# Add months (handles varying month lengths)
add_months("start_date", 3).alias("plus_3_months"),
add_months("start_date", -1).alias("minus_1_month"),
# Calculate difference
datediff("end_date", "start_date").alias("days_between"),
months_between("end_date", "start_date").alias("months_between"),
# Next/last day of month
last_day("start_date").alias("month_end"),
trunc("start_date", "month").alias("month_start")
)
result.show()
For timestamp arithmetic with hours/minutes/seconds, use expr() with interval syntax:
ts_df = spark.createDataFrame(
[("2024-01-15 10:30:00",)],
["timestamp_str"]
).select(
col("timestamp_str").cast("timestamp").alias("ts")
)
ts_result = ts_df.select(
"ts",
expr("ts + INTERVAL 2 HOURS").alias("plus_2_hours"),
expr("ts - INTERVAL 30 MINUTES").alias("minus_30_min"),
expr("ts + INTERVAL 1 DAY 3 HOURS").alias("plus_1d_3h"),
expr("ts + INTERVAL '1-6' YEAR TO MONTH").alias("plus_1y_6m")
)
ts_result.show(truncate=False)
Extracting Date Components
Breaking down dates into components enables grouping and filtering operations.
df = spark.createDataFrame(
[("2024-01-15 14:30:45",), ("2024-06-20 09:15:30",)],
["ts_str"]
).select(col("ts_str").cast("timestamp").alias("ts"))
components = df.select(
"ts",
year("ts").alias("year"),
month("ts").alias("month"),
dayofmonth("ts").alias("day"),
dayofweek("ts").alias("dow"), # 1=Sunday, 7=Saturday
dayofyear("ts").alias("doy"),
weekofyear("ts").alias("week"),
quarter("ts").alias("quarter"),
hour("ts").alias("hour"),
minute("ts").alias("minute"),
second("ts").alias("second")
)
components.show()
Use these for time-series aggregations:
sales_data = spark.createDataFrame([
("2024-01-15 10:30:00", 100.0),
("2024-01-15 14:45:00", 150.0),
("2024-01-16 09:20:00", 200.0),
("2024-02-01 11:00:00", 175.0)
], ["sale_time", "amount"])
sales_df = sales_data.select(
col("sale_time").cast("timestamp").alias("sale_time"),
col("amount")
)
# Aggregate by month
monthly_sales = sales_df.groupBy(
year("sale_time").alias("year"),
month("sale_time").alias("month")
).agg(
sum("amount").alias("total_sales"),
count("*").alias("num_sales")
)
monthly_sales.show()
Timezone Handling
Timezone conversions are a common source of bugs. Spark’s timestamp type stores UTC internally and interprets values based on the session timezone.
# Set session timezone
spark.conf.set("spark.sql.session.timeZone", "America/New_York")
df = spark.createDataFrame(
[("2024-01-15 10:00:00",)],
["ts_str"]
).select(col("ts_str").cast("timestamp").alias("ts_ny"))
# Shift between timezones. Note: from_utc_timestamp treats its input's wall
# clock as UTC, so first normalize the New York wall clock to UTC, then
# shift into the target zone
df_converted = df.select(
    "ts_ny",
    # Interpret the wall clock as New York time, express it in UTC
    to_utc_timestamp("ts_ny", "America/New_York").alias("ts_utc"),
    # Express that UTC wall clock in other zones
    from_utc_timestamp(to_utc_timestamp("ts_ny", "America/New_York"),
                       "America/Los_Angeles").alias("ts_la"),
    from_utc_timestamp(to_utc_timestamp("ts_ny", "America/New_York"),
                       "Europe/London").alias("ts_london")
)
df_converted.show(truncate=False)
Key functions:
- `from_utc_timestamp(ts, tz)`: Convert a UTC timestamp to the specified timezone
- `to_utc_timestamp(ts, tz)`: Interpret a timestamp as the specified timezone, convert to UTC
- `convert_timezone(source_tz, target_tz, ts)`: Direct conversion between timezones (Spark 3.4+)
# Direct timezone conversion
df_tz = df.select(
"ts_ny",
expr("convert_timezone('America/New_York', 'Asia/Tokyo', ts_ny)").alias("ts_tokyo")
)
Working with Unix Timestamps
Unix timestamps (seconds since epoch) are common in log files and APIs.
df = spark.createDataFrame(
[(1705320000,), (1718870400,)],
["unix_ts"]
)
# Convert unix timestamp to timestamp
df_converted = df.select(
"unix_ts",
from_unixtime("unix_ts").alias("ts_str"),
from_unixtime("unix_ts", "yyyy-MM-dd HH:mm:ss").alias("formatted"),
col("unix_ts").cast("timestamp").alias("ts")
)
# Convert timestamp back to unix
df_back = df_converted.select(
"ts",
unix_timestamp("ts").alias("unix_ts_back"),
unix_timestamp("formatted", "yyyy-MM-dd HH:mm:ss").alias("unix_from_str")
)
df_back.show()
Current Date and Time Functions
Generate current timestamps for audit columns and time-based filtering.
df = spark.range(3).select(
current_date().alias("today"),
current_timestamp().alias("now"),
# Date truncation
date_trunc("hour", current_timestamp()).alias("current_hour"),
date_trunc("day", current_timestamp()).alias("current_day"),
date_trunc("month", current_timestamp()).alias("current_month")
)
df.show(truncate=False)
These functions evaluate once per query, not per row, ensuring consistency within a single execution.
Performance Considerations
Built-in date functions compile to optimized JVM code through Catalyst and whole-stage code generation. Avoid Python UDFs for date operations, which pay per-row serialization to the Python worker:
# SLOW: Python UDF (rows round-trip through the Python worker)
from pyspark.sql.functions import udf, lit
from pyspark.sql.types import DateType
from datetime import timedelta
@udf(returnType=DateType())
def add_days_udf(date_val, days):
    return None if date_val is None else date_val + timedelta(days=days)
df = df.withColumn("future_date", add_days_udf("date_col", lit(30)))
# FAST: Built-in function
df = df.withColumn("future_date", date_add("date_col", 30))
For complex date logic, use when() chains or expr() with SQL:
# Calculate business days (excluding weekends)
df = df.withColumn(
"is_business_day",
when(dayofweek("date_col").isin([1, 7]), False)
.otherwise(True)
)
# Or use SQL expression
df = df.withColumn(
"is_business_day",
expr("dayofweek(date_col) NOT IN (1, 7)")
)
Partition tables by date columns for efficient time-range queries:
# Derive the partition columns before writing
df = df.withColumn("year", year("date_col")).withColumn("month", month("date_col"))
df.write.partitionBy("year", "month").parquet("output/sales")
This enables partition pruning when filtering by date ranges, dramatically reducing data scanned.