How to Use StructType in PySpark

Key Insights

  • StructType and StructField are PySpark’s building blocks for defining complex, nested schemas that mirror real-world hierarchical data structures like JSON documents or nested objects.
  • Use dot notation (col("parent.child")) to access nested fields, and leverage withField() for surgical updates to struct columns without rebuilding entire schemas.
  • Flatten deeply nested structs for analytical queries where performance matters, but keep structs intact when you need to preserve logical groupings or write data back to nested formats like JSON or Parquet.

Introduction to StructType and StructField

PySpark’s StructType is the foundation for defining complex schemas in DataFrames. While simple datasets with flat columns work fine for basic analytics, real-world data is messy and hierarchical. Customer records contain addresses. Orders contain line items. Events contain metadata objects.

StructType lets you model this hierarchy explicitly rather than flattening everything into dozens of columns or storing raw JSON strings. When you define a schema with StructType, Spark knows your data’s shape before any job runs, enabling better query optimization and clearer code.

You should reach for StructType when:

  • Your source data is inherently nested (JSON APIs, document databases)
  • You want to group related fields logically (address components, contact info)
  • You’re building data pipelines that need to preserve structure through transformations
  • Schema enforcement matters more than schema inference flexibility

Creating a Basic StructType Schema

Every StructType schema is composed of StructField objects. Each StructField defines a column’s name, data type, and whether it accepts null values.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("StructTypeDemo").getOrCreate()

# Define schema explicitly
user_schema = StructType([
    StructField("name", StringType(), nullable=False),
    StructField("age", IntegerType(), nullable=True),
    StructField("email", StringType(), nullable=True)
])

# Create DataFrame with schema
users_data = [
    ("Alice", 32, "alice@example.com"),
    ("Bob", 28, "bob@example.com"),
    ("Charlie", None, "charlie@example.com")
]

users_df = spark.createDataFrame(users_data, schema=user_schema)
users_df.printSchema()
users_df.show()

Output:

root
 |-- name: string (nullable = false)
 |-- age: integer (nullable = true)
 |-- email: string (nullable = true)

+-------+----+-------------------+
|   name| age|              email|
+-------+----+-------------------+
|  Alice|  32|  alice@example.com|
|    Bob|  28|    bob@example.com|
|Charlie|null|charlie@example.com|
+-------+----+-------------------+

The nullable parameter is more than documentation: Spark uses it during query planning, and violating the constraint can raise errors when rows are verified or written to strict sinks.

Working with Nested Structs

The real power of StructType emerges when you nest structures. Instead of columns like address_street, address_city, and address_zip, you create a single address column containing a struct.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, LongType

# Define nested schema
address_schema = StructType([
    StructField("street", StringType(), True),
    StructField("city", StringType(), True),
    StructField("zip", StringType(), True)
])

customer_schema = StructType([
    StructField("customer_id", LongType(), False),
    StructField("name", StringType(), False),
    StructField("address", address_schema, True),  # Nested struct
    StructField("loyalty_points", IntegerType(), True)
])

# Create data with nested structure
customers_data = [
    (1001, "Alice Smith", ("123 Main St", "Seattle", "98101"), 500),
    (1002, "Bob Jones", ("456 Oak Ave", "Portland", "97201"), 250),
    (1003, "Carol White", ("789 Pine Rd", "Seattle", "98102"), 750)
]

customers_df = spark.createDataFrame(customers_data, schema=customer_schema)
customers_df.printSchema()
customers_df.show(truncate=False)

Output:

root
 |-- customer_id: long (nullable = false)
 |-- name: string (nullable = false)
 |-- address: struct (nullable = true)
 |    |-- street: string (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- zip: string (nullable = true)
 |-- loyalty_points: integer (nullable = true)

+-----------+-----------+------------------------------+--------------+
|customer_id|name       |address                       |loyalty_points|
+-----------+-----------+------------------------------+--------------+
|1001       |Alice Smith|{123 Main St, Seattle, 98101} |500           |
|1002       |Bob Jones  |{456 Oak Ave, Portland, 97201}|250           |
|1003       |Carol White|{789 Pine Rd, Seattle, 98102} |750           |
+-----------+-----------+------------------------------+--------------+

You can nest structs multiple levels deep—a customer can have an address, which has a coordinates struct containing latitude and longitude. But resist the urge to over-nest; deeply nested schemas become unwieldy to query.

Accessing and Querying Struct Fields

PySpark provides intuitive dot notation for accessing nested fields. This works in both select() and filter() operations.

from pyspark.sql.functions import col

# Select nested fields using dot notation
customers_df.select(
    col("name"),
    col("address.city"),
    col("address.zip")
).show()

# Filter based on nested field values
seattle_customers = customers_df.filter(col("address.city") == "Seattle")
seattle_customers.show(truncate=False)

# Use getField() as an alternative
customers_df.select(
    col("name"),
    col("address").getField("city").alias("city"),
    col("address").getField("zip").alias("zip_code")
).show()

# Combine with other operations
from pyspark.sql.functions import upper

customers_df.select(
    col("name"),
    upper(col("address.city")).alias("city_upper"),
    col("loyalty_points")
).filter(col("loyalty_points") > 300).show()

The dot notation is cleaner for simple access, but getField() becomes useful when you’re building column names dynamically or chaining multiple operations.

Modifying Struct Columns

Updating nested data used to require destructuring and reconstructing entire structs. PySpark 3.1+ introduced withField() for surgical modifications.

from pyspark.sql.functions import col, lit, struct, upper

# Add a new field to existing struct
updated_df = customers_df.withColumn(
    "address",
    col("address").withField("country", lit("USA"))
)
updated_df.printSchema()
updated_df.show(truncate=False)

# Update an existing nested field
normalized_df = customers_df.withColumn(
    "address",
    col("address").withField("city", upper(col("address.city")))
)
normalized_df.show(truncate=False)

# Drop a field from a struct
trimmed_df = customers_df.withColumn(
    "address",
    col("address").dropFields("street")
)
trimmed_df.printSchema()

# Chain multiple modifications
modified_df = customers_df.withColumn(
    "address",
    col("address")
        .withField("country", lit("USA"))
        .withField("city", upper(col("address.city")))
)
modified_df.show(truncate=False)

For PySpark versions before 3.1, you’ll need to rebuild structs manually:

# Legacy approach (pre-3.1)
from pyspark.sql.functions import col, lit, struct, upper

legacy_updated_df = customers_df.withColumn(
    "address",
    struct(
        col("address.street").alias("street"),
        upper(col("address.city")).alias("city"),
        col("address.zip").alias("zip"),
        lit("USA").alias("country")
    )
)

Converting Between Structs and Other Formats

Real pipelines often need to flatten structs for analytics or parse JSON into structs for processing.

from pyspark.sql.functions import col, struct, to_json, from_json

# Flatten struct to individual columns
flat_df = customers_df.select(
    col("customer_id"),
    col("name"),
    col("address.street").alias("street"),
    col("address.city").alias("city"),
    col("address.zip").alias("zip"),
    col("loyalty_points")
)
flat_df.show()

# Convert struct column to JSON string
json_df = customers_df.withColumn(
    "address_json",
    to_json(col("address"))
)
json_df.select("name", "address_json").show(truncate=False)

# Parse JSON string back to struct
json_data = [
    (1, '{"product": "Widget", "price": 29.99, "quantity": 3}'),
    (2, '{"product": "Gadget", "price": 49.99, "quantity": 1}')
]
orders_df = spark.createDataFrame(json_data, ["order_id", "item_json"])

# Define target schema (DoubleType must be imported before use)
from pyspark.sql.types import DoubleType

item_schema = StructType([
    StructField("product", StringType(), True),
    StructField("price", DoubleType(), True),
    StructField("quantity", IntegerType(), True)
])

# Parse JSON to struct
parsed_df = orders_df.withColumn(
    "item",
    from_json(col("item_json"), item_schema)
)
parsed_df.printSchema()
parsed_df.select("order_id", "item.product", "item.price").show()

# Reconstruct struct from flat columns
reconstructed_df = flat_df.withColumn(
    "address",
    struct(
        col("street"),
        col("city"),
        col("zip")
    )
).drop("street", "city", "zip")
reconstructed_df.show(truncate=False)

Best Practices and Performance Considerations

When to use structs vs. flat columns: Use structs when fields are logically grouped and often accessed together. Use flat columns when you frequently filter or aggregate on individual fields—predicate pushdown works better on top-level columns.

Schema evolution: Adding fields to structs is generally safe; removing or renaming fields breaks compatibility. When working with Parquet or Delta Lake, test schema changes in development before production deployments.

Performance implications: Deeply nested structs (3+ levels) incur overhead during serialization and can complicate query plans. If you’re doing heavy analytics on nested data, consider materializing flattened views for frequently-run queries.

Explicit schemas over inference: Always define schemas explicitly for production pipelines. Schema inference reads extra data, can guess wrong types, and makes pipelines fragile to upstream changes.

# Don't do this in production
df = spark.read.json("data.json")  # Schema inference

# Do this instead
explicit_schema = StructType([...])
df = spark.read.schema(explicit_schema).json("data.json")

Null handling in structs: A null struct is different from a struct with null fields. Design your schemas to handle both cases, and use isNull() checks appropriately when filtering.

StructType transforms PySpark from a flat-file processor into a tool capable of handling the complex, hierarchical data that modern applications produce. Master these patterns, and you’ll write cleaner pipelines that preserve data semantics from source to sink.
