PySpark - Convert String to Date/Timestamp
Key Insights
- Use to_date() for date-only conversions and to_timestamp() when you need to preserve time components. The format string follows Java's SimpleDateFormat patterns, not Python's strftime conventions.
- Handle datasets with inconsistent date formats using coalesce() with multiple parsing attempts, preventing entire jobs from failing due to a few malformed records.
- Converting string columns to proper date/timestamp types enables partition pruning and predicate pushdown, dramatically improving query performance on large datasets.
Introduction
Working with dates in PySpark presents unique challenges compared to pandas or standard Python. String-formatted dates are ubiquitous in raw data—CSV files, JSON logs, database exports—but keeping them as strings cripples your ability to perform time-based operations efficiently. Without proper date types, you lose partition pruning, can’t use temporal functions, and sacrifice significant performance gains that Spark’s Catalyst optimizer provides.
The conversion from string to date or timestamp isn’t just about data quality; it’s about unlocking Spark’s distributed computing power for time-series analysis. A properly typed date column allows Spark to push down predicates to the data source, skip entire partitions, and optimize joins. This article covers the practical approaches to string-to-date conversion in PySpark, with real-world examples and performance considerations.
Basic String to Date Conversion using to_date()
The to_date() function is your primary tool for converting string columns to DateType. It strips time components and returns only the date portion.
For standard ISO 8601 formatted dates (YYYY-MM-DD), conversion is straightforward:
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, col
spark = SparkSession.builder.appName("DateConversion").getOrCreate()
data = [
    ("2024-01-15",),
    ("2024-02-20",),
    ("2024-03-10",)
]
df = spark.createDataFrame(data, ["date_string"])
df_converted = df.withColumn("date", to_date(col("date_string")))
df_converted.printSchema()
# date: date (nullable = true)
For non-standard formats, specify the pattern using Java SimpleDateFormat conventions:
data = [
    ("01/15/2024",),
    ("02/20/2024",),
    ("03/10/2024",)
]
df = spark.createDataFrame(data, ["date_string"])
# MM/dd/yyyy pattern
df_converted = df.withColumn("date", to_date(col("date_string"), "MM/dd/yyyy"))
# Different format: dd-MMM-yyyy
data2 = [("15-Jan-2024",), ("20-Feb-2024",)]
df2 = spark.createDataFrame(data2, ["date_string"])
df2_converted = df2.withColumn("date", to_date(col("date_string"), "dd-MMM-yyyy"))
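Because the pattern letters are Java's, a common bug is passing a Python strftime string such as "%m/%d/%Y" to to_date(), which silently yields nulls. The following pure-Python sketch maps a few of the most common Java patterns to their strftime equivalents (a quick reference, not an exhaustive list):

```python
from datetime import datetime

# Common Java SimpleDateFormat patterns (used by to_date/to_timestamp)
# mapped to their Python strftime equivalents -- reference only.
JAVA_TO_STRFTIME = {
    "yyyy-MM-dd": "%Y-%m-%d",                    # 2024-01-15
    "MM/dd/yyyy": "%m/%d/%Y",                    # 01/15/2024
    "dd-MMM-yyyy": "%d-%b-%Y",                   # 15-Jan-2024
    "yyyy-MM-dd HH:mm:ss": "%Y-%m-%d %H:%M:%S",  # with time component
}

# Sanity check: the strftime side parses the matching sample values
samples = {"yyyy-MM-dd": "2024-01-15", "MM/dd/yyyy": "01/15/2024"}
for java_fmt, value in samples.items():
    parsed = datetime.strptime(value, JAVA_TO_STRFTIME[java_fmt])
    assert parsed.date().isoformat() == "2024-01-15"
```

Note the casing differences that matter on the Java side: MM is month while mm is minutes, and HH is 24-hour clock while hh is 12-hour.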
When to_date() encounters invalid formats, it returns null rather than throwing an exception:
data = [
    ("2024-01-15",),
    ("invalid-date",),
    ("2024-13-45",),  # Invalid month and day
    (None,)
]
df = spark.createDataFrame(data, ["date_string"])
df_converted = df.withColumn("date", to_date(col("date_string")))
df_converted.show()
# +-------------+----------+
# | date_string| date|
# +-------------+----------+
# | 2024-01-15|2024-01-15|
# | invalid-date| null|
# | 2024-13-45| null|
# | null| null|
# +-------------+----------+
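This null-on-failure behavior differs from plain Python, where datetime.strptime raises ValueError on bad input; it is worth keeping in mind when porting stdlib or pandas code to Spark. A minimal illustration of the contrast:

```python
from datetime import datetime

# Plain Python raises on an invalid date...
raised = False
try:
    datetime.strptime("2024-13-45", "%Y-%m-%d")  # month 13 does not exist
except ValueError:
    raised = True

# ...whereas Spark's to_date() quietly returns null for the same input.
assert raised
```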
String to Timestamp Conversion using to_timestamp()
When your data includes time components (hours, minutes, seconds), use to_timestamp() instead of to_date(). This preserves the full temporal precision.
from pyspark.sql.functions import to_timestamp
data = [
    ("2024-01-15 14:30:00",),
    ("2024-02-20 09:15:30",),
    ("2024-03-10 18:45:15",)
]
df = spark.createDataFrame(data, ["timestamp_string"])
df_converted = df.withColumn(
    "timestamp",
    to_timestamp(col("timestamp_string"), "yyyy-MM-dd HH:mm:ss")
)
df_converted.printSchema()
# timestamp: timestamp (nullable = true)
For timestamps with milliseconds, adjust the format pattern:
data = [
    ("2024-01-15 14:30:00.123",),
    ("2024-02-20 09:15:30.456",)
]
df = spark.createDataFrame(data, ["timestamp_string"])
df_converted = df.withColumn(
    "timestamp",
    to_timestamp(col("timestamp_string"), "yyyy-MM-dd HH:mm:ss.SSS")
)
Timezone handling in PySpark follows the JVM’s timezone settings. By default, timestamps are interpreted in the session’s timezone:
# Set session timezone
spark.conf.set("spark.sql.session.timeZone", "UTC")
data = [("2024-01-15T14:30:00Z",)]
df = spark.createDataFrame(data, ["timestamp_string"])
df_converted = df.withColumn(
    "timestamp",
    to_timestamp(col("timestamp_string"), "yyyy-MM-dd'T'HH:mm:ss'Z'")
)
Using cast() Method for Type Conversion
The cast() method provides a simpler alternative when your strings follow standard formats. It’s more concise but less flexible than to_date() or to_timestamp().
from pyspark.sql.types import DateType, TimestampType
data = [
    ("2024-01-15",),
    ("2024-02-20",)
]
df = spark.createDataFrame(data, ["date_string"])
# Cast to DateType
df_date = df.withColumn("date", col("date_string").cast(DateType()))
# Cast to TimestampType
data_ts = [("2024-01-15 14:30:00",)]
df_ts = spark.createDataFrame(data_ts, ["timestamp_string"])
df_timestamp = df_ts.withColumn("timestamp", col("timestamp_string").cast(TimestampType()))
The key difference: cast() only works with specific standard formats. For custom formats, you must use to_date() or to_timestamp():
# This works - standard format
df.withColumn("date", col("date_string").cast(DateType()))
# This returns null - non-standard format
data_custom = [("01/15/2024",)]
df_custom = spark.createDataFrame(data_custom, ["date_string"])
df_custom.withColumn("date", col("date_string").cast(DateType())).show()
# Returns null because cast() doesn't recognize MM/dd/yyyy
# Must use to_date() with format
df_custom.withColumn("date", to_date(col("date_string"), "MM/dd/yyyy")).show()
# Works correctly
Handling Multiple Date Formats with coalesce()
Real-world datasets often contain dates in multiple formats—different sources, data quality issues, or historical changes. Use coalesce() to attempt multiple parsing strategies:
from pyspark.sql.functions import coalesce
data = [
    ("2024-01-15",),   # yyyy-MM-dd
    ("01/15/2024",),   # MM/dd/yyyy
    ("15-Jan-2024",),  # dd-MMM-yyyy
    ("2024.01.15",)    # yyyy.MM.dd
]
df = spark.createDataFrame(data, ["date_string"])
df_multi = df.withColumn(
    "date",
    coalesce(
        to_date(col("date_string"), "yyyy-MM-dd"),
        to_date(col("date_string"), "MM/dd/yyyy"),
        to_date(col("date_string"), "dd-MMM-yyyy"),
        to_date(col("date_string"), "yyyy.MM.dd")
    )
)
df_multi.show()
For more complex scenarios, create a reusable parsing function:
from pyspark.sql.functions import udf
from pyspark.sql.types import DateType
from datetime import datetime
def parse_flexible_date(date_str):
    if not date_str:
        return None
    formats = [
        "%Y-%m-%d",
        "%m/%d/%Y",
        "%d-%b-%Y",
        "%Y.%m.%d",
        "%d/%m/%Y"
    ]
    for fmt in formats:
        try:
            return datetime.strptime(date_str, fmt).date()
        except ValueError:
            continue
    return None
parse_date_udf = udf(parse_flexible_date, DateType())
df_flexible = df.withColumn("date", parse_date_udf(col("date_string")))
Note: Python UDFs move each row between the JVM and a Python worker, adding serialization overhead that Catalyst cannot optimize away. Prefer built-in functions with coalesce() when possible.
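One upside of the UDF approach is that the parsing logic is plain Python, so it can be sanity-checked locally with no Spark session before being registered. A quick check (the function is restated here so the snippet runs standalone); note that the format list is tried in order, which matters for ambiguous inputs:

```python
from datetime import date, datetime

# Same flexible parser as above, restated so this snippet is self-contained
def parse_flexible_date(date_str):
    if not date_str:
        return None
    formats = ["%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y", "%Y.%m.%d", "%d/%m/%Y"]
    for fmt in formats:
        try:
            return datetime.strptime(date_str, fmt).date()
        except ValueError:
            continue
    return None

assert parse_flexible_date("2024-01-15") == date(2024, 1, 15)
assert parse_flexible_date("2024.01.15") == date(2024, 1, 15)
assert parse_flexible_date("not a date") is None
assert parse_flexible_date(None) is None
# Ambiguity: "05/04/2024" matches %m/%d/%Y before %d/%m/%Y, so it parses
# as May 4, not April 5 -- order the format list for your dominant source.
assert parse_flexible_date("05/04/2024") == date(2024, 5, 4)
```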
Performance Considerations and Best Practices
Converting to proper date types isn’t just about correctness—it’s about performance. Spark’s Catalyst optimizer can’t optimize string-based date operations.
Partition Pruning Example:
# Bad: String-based filtering scans all partitions
df_string = spark.read.parquet("data/events/")
df_filtered = df_string.filter(col("event_date_string") >= "2024-01-01")
# Good: Date-based filtering enables partition pruning
df_date = df_string.withColumn("event_date", to_date(col("event_date_string")))
df_filtered = df_date.filter(col("event_date") >= "2024-01-01")
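A caveat on the string-based filter above: it only returns correct results at all because ISO 8601 strings happen to sort lexicographically in the same order as the dates they encode. For most other formats, string comparison is simply wrong, as this pure-Python illustration shows:

```python
# ISO 8601 strings sort the same way the dates they encode do...
assert "2023-12-31" < "2024-01-01"

# ...but MM/dd/yyyy strings do not: as text, "02/01/2023" > "01/15/2024",
# even though February 2023 is earlier than January 2024.
assert "02/01/2023" > "01/15/2024"
```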
When writing partitioned data, use date columns for partitioning:
df_converted = df.withColumn("date", to_date(col("date_string")))
# Write partitioned by date
df_converted.write.partitionBy("date").parquet("output/partitioned_data")
# Queries with date filters skip entire partitions
df_read = spark.read.parquet("output/partitioned_data")
df_read.filter(col("date") == "2024-01-15") # Only reads one partition
Cache converted DataFrames when reusing them:
df_converted = df.withColumn("date", to_date(col("date_string")))
df_converted.cache()
# Multiple operations benefit from cached conversion
df_converted.filter(col("date") >= "2024-01-01").count()
df_converted.groupBy("date").count().show()
Common Errors and Troubleshooting
Silent nulls instead of parse exceptions:
With default settings, strings that don't match the format pattern become null rather than raising an exception (enabling ANSI mode via spark.sql.ansi.enabled=true turns such failures into errors instead). Because the nulls are silent, validate your results:
df_converted = df.withColumn("date", to_date(col("date_string"), "yyyy-MM-dd"))
# Check for conversion failures
null_count = df_converted.filter(col("date").isNull()).count()
print(f"Failed conversions: {null_count}")
# Identify problematic patterns
df_converted.filter(col("date").isNull() & col("date_string").isNotNull()).show()
Locale-specific formats:
Month names depend on JVM locale settings:
# For locale-specific month names, ensure JVM locale matches
# This might fail if JVM locale isn't English
data = [("15-Jan-2024",)]
df = spark.createDataFrame(data, ["date_string"])
df.withColumn("date", to_date(col("date_string"), "dd-MMM-yyyy")).show()
# Use numeric months for portability
data = [("15-01-2024",)]
df = spark.createDataFrame(data, ["date_string"])
df.withColumn("date", to_date(col("date_string"), "dd-MM-yyyy")).show()
Validating conversion results:
from pyspark.sql.functions import when
df_validated = df.withColumn("date", to_date(col("date_string"))) \
    .withColumn(
        "conversion_status",
        when(col("date").isNull() & col("date_string").isNotNull(), "FAILED")
        .when(col("date").isNotNull(), "SUCCESS")
        .otherwise("NULL_INPUT")
    )
df_validated.groupBy("conversion_status").count().show()
Converting strings to dates and timestamps in PySpark is fundamental for effective time-series analysis. Master these patterns, understand the performance implications, and your Spark jobs will run faster and more reliably. Always convert at the earliest stage possible, validate your conversions, and leverage proper date types for partitioning and filtering.