PySpark - Read XML File into DataFrame
Key Insights
- PySpark doesn't natively support XML files (built-in XML support only arrived with Spark 4.0), so earlier versions require the spark-xml library from Databricks to parse XML data into DataFrames with proper schema inference
- XML parsing offers three approaches: automatic schema inference for rapid prototyping, explicit schema definition for production reliability, and complex nested structure handling for hierarchical data
- Performance optimization requires careful consideration of rowTag selection, partition management, and memory allocation when processing large XML files in distributed environments
Installing and Configuring spark-xml
PySpark requires the spark-xml package to read XML files. Include it when creating your Spark session, or pass it to spark-submit with the --packages flag.
from pyspark.sql import SparkSession
# Create Spark session with spark-xml package
spark = SparkSession.builder \
.appName("XML Reader") \
.config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.17.0") \
.getOrCreate()
spark-xml is a JVM library distributed through Maven, not a pip package. For existing environments, pass the coordinate to spark-submit instead:
spark-submit --packages com.databricks:spark-xml_2.12:0.17.0 your_script.py
The version number in the Maven coordinates (_2.12) must match your Scala version. Use _2.13 for Scala 2.13 environments.
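Because the Scala suffix and library version must stay in sync, it can help to build the coordinate in one place. A minimal sketch (the helper name is our own, not part of any API):

```python
# Hypothetical helper: build the Maven coordinate for spark-xml so the
# Scala version and library version are defined in a single place.
def spark_xml_coordinate(scala_version: str = "2.12",
                         library_version: str = "0.17.0") -> str:
    """Return the Maven coordinate for the spark-xml package."""
    return f"com.databricks:spark-xml_{scala_version}:{library_version}"

print(spark_xml_coordinate())  # com.databricks:spark-xml_2.12:0.17.0

# The coordinate would then feed the session builder:
# spark = (SparkSession.builder
#          .config("spark.jars.packages", spark_xml_coordinate())
#          .getOrCreate())
```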
Basic XML File Reading with Schema Inference
The simplest approach lets Spark infer the schema automatically. This works well for exploration and simple XML structures.
# Read XML with automatic schema inference
df = spark.read \
.format("xml") \
.option("rowTag", "book") \
.load("books.xml")
df.show(truncate=False)
df.printSchema()
Sample XML structure (books.xml):
<?xml version="1.0" encoding="UTF-8"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
</book>
</catalog>
The rowTag parameter specifies which XML element represents a row. Each <book> element becomes one DataFrame row.
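Picking the wrong rowTag is a common source of empty or single-row DataFrames. Before launching a Spark job against a large file, a quick local check with the standard library can confirm how many rows a candidate tag would yield (shown here on a trimmed-down version of the sample file):

```python
import xml.etree.ElementTree as ET

# Quick local sanity check (plain Python, no Spark needed): count the
# elements that a given rowTag would turn into DataFrame rows.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<catalog>
  <book id="bk101"><title>XML Developer's Guide</title></book>
  <book id="bk102"><title>Midnight Rain</title></book>
</catalog>"""

root = ET.fromstring(sample)
row_count = len(root.findall(".//book"))  # elements matching rowTag="book"
print(row_count)  # 2
```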
Explicit Schema Definition
Production systems benefit from explicit schemas. This prevents schema drift, improves performance, and ensures data type consistency.
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType
# Define explicit schema
schema = StructType([
StructField("_id", StringType(), True),
StructField("author", StringType(), True),
StructField("title", StringType(), True),
StructField("genre", StringType(), True),
StructField("price", DoubleType(), True),
StructField("publish_date", DateType(), True)
])
# Read with explicit schema
df = spark.read \
.format("xml") \
.option("rowTag", "book") \
.schema(schema) \
.load("books.xml")
# Attributes are prefixed with underscore
df.select("_id", "title", "price").show()
XML attributes (like id="bk101") are automatically prefixed with underscore (_id) to distinguish them from element values.
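If the underscore-prefixed names are awkward downstream, the mapping back to plain names can be computed mechanically. A small sketch operating on a plain list of column names (with Spark you would apply the result via df.withColumnRenamed):

```python
# Sketch: map spark-xml attribute columns like "_id" to names without the
# prefix; "_VALUE" is kept as-is since it is not an attribute.
def strip_attribute_prefix(columns, prefix="_"):
    """Return a rename mapping for attribute-derived column names."""
    renames = {}
    for name in columns:
        if name.startswith(prefix) and name != "_VALUE":
            renames[name] = name[len(prefix):]
    return renames

print(strip_attribute_prefix(["_id", "author", "title", "price"]))
# {'_id': 'id'}

# With a DataFrame (not run here):
# for old, new in strip_attribute_prefix(df.columns).items():
#     df = df.withColumnRenamed(old, new)
```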
Handling Nested XML Structures
Real-world XML often contains nested elements. The spark-xml library handles these by creating nested struct types.
# Complex nested XML example
nested_xml = """
<library>
<book id="bk101">
<metadata>
<author>
<firstname>Matthew</firstname>
<lastname>Gambardella</lastname>
</author>
<publication>
<publisher>Tech Press</publisher>
<year>2000</year>
</publication>
</metadata>
<title>XML Developer's Guide</title>
<price currency="USD">44.95</price>
</book>
</library>
"""
# Read nested structure
df_nested = spark.read \
.format("xml") \
.option("rowTag", "book") \
.load("library.xml")
df_nested.printSchema()
Output schema:
root
|-- _id: string (nullable = true)
|-- metadata: struct (nullable = true)
| |-- author: struct (nullable = true)
| | |-- firstname: string (nullable = true)
| | |-- lastname: string (nullable = true)
| |-- publication: struct (nullable = true)
| | |-- publisher: string (nullable = true)
| | |-- year: long (nullable = true)
|-- title: string (nullable = true)
|-- price: struct (nullable = true)
| |-- _currency: string (nullable = true)
| |-- _VALUE: double (nullable = true)
Access nested fields using dot notation:
df_nested.select(
"_id",
"metadata.author.firstname",
"metadata.author.lastname",
"metadata.publication.publisher",
"price._VALUE",
"price._currency"
).show()
Flattening Nested Structures
Complex nested structures often require flattening for analysis or downstream processing.
from pyspark.sql.functions import col
# Flatten nested DataFrame
df_flat = df_nested.select(
col("_id").alias("book_id"),
col("title"),
col("metadata.author.firstname").alias("author_first"),
col("metadata.author.lastname").alias("author_last"),
col("metadata.publication.publisher").alias("publisher"),
col("metadata.publication.year").alias("pub_year"),
col("price._VALUE").alias("price"),
col("price._currency").alias("currency")
)
df_flat.show()
For deeply nested structures with arrays:
from pyspark.sql.functions import explode
# XML with array elements
df_with_arrays = spark.read \
.format("xml") \
.option("rowTag", "book") \
.load("books_with_reviews.xml")
# Explode array to create one row per review
df_exploded = df_with_arrays.select(
"title",
explode("reviews.review").alias("review")
).select(
"title",
"review.rating",
"review.comment"
)
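To make the effect of explode concrete, the same one-row-per-review expansion can be illustrated locally with the standard library. The <reviews><review> layout here is an assumption about books_with_reviews.xml, not taken from the file itself:

```python
import xml.etree.ElementTree as ET

# Local illustration of what explode() produces: one output row per review.
# The <reviews><review> nesting is an assumed structure for this example.
sample = """<catalog>
  <book>
    <title>Midnight Rain</title>
    <reviews>
      <review><rating>4</rating><comment>Good</comment></review>
      <review><rating>5</rating><comment>Great</comment></review>
    </reviews>
  </book>
</catalog>"""

rows = []
for book in ET.fromstring(sample).findall("book"):
    title = book.findtext("title")
    for review in book.findall("reviews/review"):
        rows.append((title, review.findtext("rating"),
                     review.findtext("comment")))

print(rows)
# [('Midnight Rain', '4', 'Good'), ('Midnight Rain', '5', 'Great')]
```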
Advanced Configuration Options
Fine-tune XML parsing behavior with additional options:
df = spark.read \
.format("xml") \
.option("rowTag", "book") \
.option("nullValue", "NULL") \
.option("attributePrefix", "@") \
.option("valueTag", "#VALUE") \
.option("ignoreSurroundingSpaces", "true") \
.option("dateFormat", "yyyy-MM-dd") \
.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss") \
.load("books.xml")
Key options:
- attributePrefix: change the attribute prefix from _ to a custom value
- valueTag: change the value tag from _VALUE to a custom value
- nullValue: string to treat as null
- ignoreSurroundingSpaces: trim whitespace from values
- dateFormat / timestampFormat: patterns for parsing date/timestamp strings
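With this many options, collecting them in a dict keeps configuration in one place and makes it reusable across jobs. A sketch using functools.reduce over any DataFrameReader-style object (the helper name is our own):

```python
from functools import reduce

# Sketch: hold reader options in one dict and apply them in a single pass.
xml_options = {
    "rowTag": "book",
    "nullValue": "NULL",
    "ignoreSurroundingSpaces": "true",
    "dateFormat": "yyyy-MM-dd",
}

def with_options(reader, options):
    """Apply every (key, value) pair to a reader via chained .option calls."""
    return reduce(lambda r, kv: r.option(*kv), options.items(), reader)

# Usage with Spark (not run here):
# df = with_options(spark.read.format("xml"), xml_options).load("books.xml")
```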
Performance Optimization for Large XML Files
Processing large XML files requires optimization strategies:
# Read with partitioning
df = spark.read \
.format("xml") \
.option("rowTag", "record") \
.load("large_dataset.xml") \
.repartition(200) # Adjust based on cluster size
# Cache frequently accessed data
df.cache()
# Write to Parquet for faster subsequent reads
df.write.mode("overwrite").parquet("output/books.parquet")
# Read from Parquet instead
df_parquet = spark.read.parquet("output/books.parquet")
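The repartition(200) above is a placeholder; a common rule of thumb is to size partitions to roughly 128 MB of input each. A minimal sketch of that heuristic (the function and default are our own, not a Spark API):

```python
# Rule-of-thumb sketch: aim for roughly 128 MB of input per partition.
def target_partitions(total_bytes: int, mb_per_partition: int = 128,
                      minimum: int = 1) -> int:
    """Suggest a partition count from the total input size in bytes."""
    partitions = total_bytes // (mb_per_partition * 1024 * 1024)
    return max(int(partitions), minimum)

print(target_partitions(25 * 1024**3))  # 25 GB of input -> 200 partitions

# df = df.repartition(target_partitions(25 * 1024**3))
```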
When the data is split across many XML files, read them all in parallel with a glob pattern:
# Process multiple XML files in parallel
df_multiple = spark.read \
.format("xml") \
.option("rowTag", "record") \
.load("data/*.xml")
# Monitor partition distribution
print(f"Number of partitions: {df_multiple.rdd.getNumPartitions()}")
print(f"Row count: {df_multiple.count()}")
Writing DataFrames to XML
Convert DataFrames back to XML format when needed:
# Write DataFrame to XML
df.write \
.format("xml") \
.option("rootTag", "catalog") \
.option("rowTag", "book") \
.mode("overwrite") \
.save("output/books_output.xml")
# Write with compression
df.write \
.format("xml") \
.option("rootTag", "catalog") \
.option("rowTag", "book") \
.option("compression", "gzip") \
.mode("overwrite") \
.save("output/books_compressed.xml")
The spark-xml library provides robust XML processing capabilities for PySpark, handling everything from simple flat structures to complex nested hierarchies. Explicit schema definition and proper configuration ensure reliable, performant data pipelines in production environments.