Spark SQL - Struct Type Operations

Key Insights

  • Struct types in Spark SQL enable nested data modeling directly within DataFrames, eliminating the need for separate tables and complex joins while maintaining schema validation and type safety
  • Operations on structs include dot notation access, wildcard expansion with .*, selective field extraction, and runtime manipulation through struct(), withField(), and dropFields() functions
  • Performance optimization requires understanding struct serialization overhead, avoiding unnecessary struct explosions, and leveraging predicate pushdown for nested field filtering in columnar formats like Parquet

Understanding Struct Types in Spark SQL

Struct types represent complex data structures within a single column, similar to objects in programming languages or nested JSON documents. Unlike primitive types, structs contain multiple named fields with potentially different data types.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

spark = SparkSession.builder.appName("StructOperations").getOrCreate()

# Define schema with struct type
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), False),
    StructField("address", StructType([
        StructField("street", StringType(), True),
        StructField("city", StringType(), True),
        StructField("zipcode", StringType(), True)
    ]), True),
    StructField("contact", StructType([
        StructField("email", StringType(), True),
        StructField("phone", StringType(), True)
    ]), True)
])

data = [
    (1, "John Doe", ("123 Main St", "New York", "10001"), ("john@example.com", "555-0100")),
    (2, "Jane Smith", ("456 Oak Ave", "Boston", "02101"), ("jane@example.com", "555-0200")),
    (3, "Bob Johnson", ("789 Pine Rd", "Chicago", "60601"), ("bob@example.com", None))
]

df = spark.createDataFrame(data, schema)
df.printSchema()

Accessing Struct Fields

Spark provides multiple methods to access nested struct fields. Dot notation offers the most readable syntax for direct field access.

from pyspark.sql.functions import col

# Dot notation - most common approach
df.select(
    "id",
    "name",
    col("address.city"),
    col("contact.email")
).show()

# Bracket notation - useful for field names with special characters
df.select(
    col("address")["city"],
    col("contact")["email"]
).show()

# getField() method - programmatic access
df.select(
    col("address").getField("city"),
    col("contact").getField("email")
).show()

# Wildcard expansion - extract all struct fields as separate columns
df.select("id", "name", "address.*").show()

Creating and Modifying Structs

The struct() function creates new struct columns from existing columns, while withField() and dropFields() enable runtime struct modification without schema redefinition.

from pyspark.sql.functions import struct, lit

# Create new struct from existing columns
df_with_metadata = df.withColumn(
    "metadata",
    struct(
        lit("2024").alias("year"),
        lit("active").alias("status"),
        col("id").alias("record_id")
    )
)

df_with_metadata.select("id", "metadata").show(truncate=False)

# Add field to existing struct (Spark 3.1+)
df_updated = df.withColumn(
    "address",
    col("address").withField("country", lit("USA"))
)

df_updated.select("address").show(truncate=False)

# Add multiple fields
df_enhanced = df.withColumn(
    "contact",
    col("contact")
        .withField("preferred", lit(True))
        .withField("verified", lit(False))
)

# Drop fields from struct (Spark 3.1+)
df_simplified = df_enhanced.withColumn(
    "contact",
    col("contact").dropFields("verified")
)

Filtering and Conditional Logic with Structs

Filtering on nested struct fields requires proper column reference syntax. Complex conditions can combine multiple struct field predicates.

from pyspark.sql.functions import when

# Filter by nested field
df.filter(col("address.city") == "New York").show()

# Multiple nested conditions
df.filter(
    (col("address.city").isin(["New York", "Boston"])) &
    (col("contact.phone").isNotNull())
).show()

# Conditional struct field updates
df_conditional = df.withColumn(
    "contact",
    when(col("contact.phone").isNull(),
         struct(
             col("contact.email"),
             lit("000-0000").alias("phone")
         )
    ).otherwise(col("contact"))
)

df_conditional.show(truncate=False)

Flattening and Exploding Nested Structures

Converting structs to flat columns improves compatibility with systems that don’t support nested types. However, this increases column count and may impact performance.

# Flatten all struct fields to top level
df_flat = df.select(
    "id",
    "name",
    col("address.street").alias("street"),
    col("address.city").alias("city"),
    col("address.zipcode").alias("zipcode"),
    col("contact.email").alias("email"),
    col("contact.phone").alias("phone")
)

df_flat.show()

# Programmatic flattening for dynamic schemas
def flatten_struct(df, struct_col):
    """Flatten a struct column into separate columns"""
    struct_fields = df.schema[struct_col].dataType.fieldNames()
    
    select_expr = [c for c in df.columns if c != struct_col]
    for field in struct_fields:
        select_expr.append(col(f"{struct_col}.{field}").alias(f"{struct_col}_{field}"))
    
    return df.select(*select_expr)

df_auto_flat = flatten_struct(df, "address")
df_auto_flat.show()

Working with Arrays of Structs

Arrays of structs represent one-to-many relationships within a single row, common in semi-structured data like JSON logs or event streams.

from pyspark.sql.functions import explode, array, transform

# Create DataFrame with array of structs. An explicit schema matters here:
# Python dicts would be inferred as map columns, not structs
orders_schema = (
    "customer_id INT, name STRING, "
    "items ARRAY<STRUCT<product: STRING, price: DOUBLE, quantity: INT>>"
)

orders_data = [
    (1, "John", [
        ("Laptop", 1200.00, 1),
        ("Mouse", 25.00, 2)
    ]),
    (2, "Jane", [
        ("Monitor", 350.00, 1),
        ("Keyboard", 75.00, 1)
    ])
]

orders_df = spark.createDataFrame(orders_data, orders_schema)

# Explode array of structs to separate rows
orders_exploded = orders_df.select(
    "customer_id",
    "name",
    explode("items").alias("item")
).select(
    "customer_id",
    "name",
    "item.product",
    "item.price",
    "item.quantity"
)

orders_exploded.show()

# Transform array of structs without exploding
orders_with_total = orders_df.withColumn(
    "items_with_total",
    transform(
        col("items"),
        lambda x: struct(
            x.product.alias("product"),
            x.price.alias("price"),
            x.quantity.alias("quantity"),
            (x.price * x.quantity).alias("total")
        )
    )
)

orders_with_total.select("customer_id", "items_with_total").show(truncate=False)

Performance Considerations

Struct operations impact query performance differently than flat schemas. Understanding serialization costs and predicate pushdown behavior helps you shape queries into better execution plans.

# Inefficient: Selecting entire struct when only one field needed
df.select("id", "address").filter(col("address.city") == "New York")

# Efficient: Select only required fields
df.select("id", "address.city").filter(col("address.city") == "New York")

# Parquet predicate pushdown works with nested fields
df.write.parquet("/tmp/users_parquet")
df_parquet = spark.read.parquet("/tmp/users_parquet")

# This filter pushes down to Parquet reader
df_parquet.filter(col("address.city") == "New York").explain()

# Cache consideration: structs increase the in-memory footprint
df.cache()
print(f"Storage level: {df.storageLevel}")  # reports how the data is cached, not its size

SQL Syntax for Struct Operations

Spark SQL supports struct operations through standard SQL syntax with dot notation and struct constructor functions.

# Create temporary view (Python, before switching to SQL)
df.createOrReplaceTempView("users")

-- Access nested fields
SELECT id, name, address.city, contact.email
FROM users
WHERE address.city = 'New York';

-- Create struct in SQL
SELECT 
    id,
    name,
    STRUCT(address.city, address.zipcode) as location,
    STRUCT(contact.email, 'primary' as type) as primary_contact
FROM users;

-- Named struct with explicit fields
SELECT
    id,
    NAMED_STRUCT(
        'city', address.city,
        'zip', address.zipcode,
        'country', 'USA'
    ) as full_location
FROM users;

Struct types provide powerful data modeling capabilities in Spark SQL, enabling efficient representation of hierarchical data while maintaining type safety and schema enforcement. Proper use of struct operations balances code readability with query performance, particularly when working with columnar storage formats that support predicate pushdown on nested fields.
