PySpark - Read XML File into DataFrame
Key Insights
- PySpark doesn't natively support XML files (built-in XML support only arrived with Spark 4.0), so earlier versions require the spark-xml library from Databricks to parse XML data into DataFrames with proper schema inference
- XML parsing offers three approaches: automatic schema inference for rapid prototyping, explicit schema definition for production reliability, and complex nested structure handling for hierarchical data
- Performance optimization requires careful consideration of rowTag selection, partition management, and memory allocation when processing large XML files in distributed environments
Installing and Configuring spark-xml
PySpark requires the spark-xml package to read XML files. Include it when creating your Spark session, or pass it to spark-submit with the --packages flag.
from pyspark.sql import SparkSession
# Create Spark session with spark-xml package
spark = SparkSession.builder \
.appName("XML Reader") \
.config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.17.0") \
.getOrCreate()
spark-xml is a JVM library distributed through Maven, not a pip package. For existing environments, pass the coordinate to spark-submit instead:
spark-submit --packages com.databricks:spark-xml_2.12:0.17.0 your_script.py
The version number in the Maven coordinates (_2.12) must match your Scala version. Use _2.13 for Scala 2.13 environments.
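Because the Scala suffix and library version must stay in sync, it can help to build the coordinate in one place. A minimal sketch (the helper name is our own, not part of any API):

```python
# Hypothetical helper: build the Maven coordinate for spark-xml so the
# Scala version and library version are defined in a single place.
def spark_xml_coordinate(scala_version: str = "2.12",
                         library_version: str = "0.17.0") -> str:
    """Return the Maven coordinate for the spark-xml package."""
    return f"com.databricks:spark-xml_{scala_version}:{library_version}"

print(spark_xml_coordinate())  # com.databricks:spark-xml_2.12:0.17.0

# The coordinate would then feed the session builder:
# spark = (SparkSession.builder
#          .config("spark.jars.packages", spark_xml_coordinate())
#          .getOrCreate())
```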
Basic XML File Reading with Schema Inference
The simplest approach lets Spark infer the schema automatically. This works well for exploration and simple XML structures.
# Read XML with automatic schema inference
df = spark.read \
.format("xml") \
.option("rowTag", "book") \
.load("books.xml")
df.show(truncate=False)
df.printSchema()
Sample XML structure (books.xml):
<?xml version="1.0" encoding="UTF-8"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
</book>
</catalog>
The rowTag parameter specifies which XML element represents a row. Each <book> element becomes one DataFrame row.
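Picking the wrong rowTag is a common source of empty or single-row DataFrames. Before launching a Spark job against a large file, a quick local check with the standard library can confirm how many rows a candidate tag would yield (shown here on a trimmed-down version of the sample file):

```python
import xml.etree.ElementTree as ET

# Quick local sanity check (plain Python, no Spark needed): count the
# elements that a given rowTag would turn into DataFrame rows.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<catalog>
  <book id="bk101"><title>XML Developer's Guide</title></book>
  <book id="bk102"><title>Midnight Rain</title></book>
</catalog>"""

root = ET.fromstring(sample)
row_count = len(root.findall(".//book"))  # elements matching rowTag="book"
print(row_count)  # 2
```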
Explicit Schema Definition
Production systems benefit from explicit schemas. This prevents schema drift, improves performance, and ensures data type consistency.
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType
# Define explicit schema
schema = StructType([
StructField("_id", StringType(), True),
StructField("author", StringType(), True),
StructField("title", StringType(), True),
StructField("genre", StringType(), True),
StructField("price", DoubleType(), True),
StructField("publish_date", DateType(), True)
])
# Read with explicit schema
df = spark.read \
.format("xml") \
.option("rowTag", "book") \
.schema(schema) \
.load("books.xml")
# Attributes are prefixed with underscore
df.select("_id", "title", "price").show()
XML attributes (like id="bk101") are automatically prefixed with underscore (_id) to distinguish them from element values.
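If the underscore-prefixed names are awkward downstream, the mapping back to plain names can be computed mechanically. A small sketch operating on a plain list of column names (with Spark you would apply the result via df.withColumnRenamed):

```python
# Sketch: map spark-xml attribute columns like "_id" to names without the
# prefix; "_VALUE" is kept as-is since it is not an attribute.
def strip_attribute_prefix(columns, prefix="_"):
    """Return a rename mapping for attribute-derived column names."""
    renames = {}
    for name in columns:
        if name.startswith(prefix) and name != "_VALUE":
            renames[name] = name[len(prefix):]
    return renames

print(strip_attribute_prefix(["_id", "author", "title", "price"]))
# {'_id': 'id'}

# With a DataFrame (not run here):
# for old, new in strip_attribute_prefix(df.columns).items():
#     df = df.withColumnRenamed(old, new)
```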
Handling Nested XML Structures
Real-world XML often contains nested elements. The spark-xml library handles these by creating nested struct types.
# Complex nested XML example
nested_xml = """
<library>
<book id="bk101">
<metadata>
<author>
<firstname>Matthew</firstname>
<lastname>Gambardella</lastname>
</author>
<publication>
<publisher>Tech Press</publisher>
<year>2000</year>
</publication>
</metadata>
<title>XML Developer's Guide</title>
<price currency="USD">44.95</price>
</book>
</library>
"""
# Read nested structure
df_nested = spark.read \
.format("xml") \
.option("rowTag", "book") \
.load("library.xml")
df_nested.printSchema()
Output schema:
root
|-- _id: string (nullable = true)
|-- metadata: struct (nullable = true)
| |-- author: struct (nullable = true)
| | |-- firstname: string (nullable = true)
| | |-- lastname: string (nullable = true)
| |-- publication: struct (nullable = true)
| | |-- publisher: string (nullable = true)
| | |-- year: long (nullable = true)
|-- title: string (nullable = true)
|-- price: struct (nullable = true)
| |-- _currency: string (nullable = true)
| |-- _VALUE: double (nullable = true)
Access nested fields using dot notation:
df_nested.select(
"_id",
"metadata.author.firstname",
"metadata.author.lastname",
"metadata.publication.publisher",
"price._VALUE",
"price._currency"
).show()
Flattening Nested Structures
Complex nested structures often require flattening for analysis or downstream processing.
from pyspark.sql.functions import col
# Flatten nested DataFrame
df_flat = df_nested.select(
col("_id").alias("book_id"),
col("title"),
col("metadata.author.firstname").alias("author_first"),
col("metadata.author.lastname").alias("author_last"),
col("metadata.publication.publisher").alias("publisher"),
col("metadata.publication.year").alias("pub_year"),
col("price._VALUE").alias("price"),
col("price._currency").alias("currency")
)
df_flat.show()
For deeply nested structures with arrays:
from pyspark.sql.functions import explode
# XML with array elements
df_with_arrays = spark.read \
.format("xml") \
.option("rowTag", "book") \
.load("books_with_reviews.xml")
# Explode array to create one row per review
df_exploded = df_with_arrays.select(
"title",
explode("reviews.review").alias("review")
).select(
"title",
"review.rating",
"review.comment"
)
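To make the effect of explode concrete, the same one-row-per-review expansion can be illustrated locally with the standard library. The <reviews><review> layout here is an assumption about books_with_reviews.xml, not taken from the file itself:

```python
import xml.etree.ElementTree as ET

# Local illustration of what explode() produces: one output row per review.
# The <reviews><review> nesting is an assumed structure for this example.
sample = """<catalog>
  <book>
    <title>Midnight Rain</title>
    <reviews>
      <review><rating>4</rating><comment>Good</comment></review>
      <review><rating>5</rating><comment>Great</comment></review>
    </reviews>
  </book>
</catalog>"""

rows = []
for book in ET.fromstring(sample).findall("book"):
    title = book.findtext("title")
    for review in book.findall("reviews/review"):
        rows.append((title, review.findtext("rating"),
                     review.findtext("comment")))

print(rows)
# [('Midnight Rain', '4', 'Good'), ('Midnight Rain', '5', 'Great')]
```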
Advanced Configuration Options
Fine-tune XML parsing behavior with additional options:
df = spark.read \
.format("xml") \
.option("rowTag", "book") \
.option("nullValue", "NULL") \
.option("attributePrefix", "@") \
.option("valueTag", "#VALUE") \
.option("ignoreSurroundingSpaces", "true") \
.option("dateFormat", "yyyy-MM-dd") \
.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss") \
.load("books.xml")
Key options:
- attributePrefix: change the attribute prefix from _ to a custom value
- valueTag: change the value tag from _VALUE to a custom value
- nullValue: string to treat as null
- ignoreSurroundingSpaces: trim whitespace from values
- dateFormat / timestampFormat: patterns for parsing date/timestamp strings
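With this many options, collecting them in a dict keeps configuration in one place and makes it reusable across jobs. A sketch using functools.reduce over any DataFrameReader-style object (the helper name is our own):

```python
from functools import reduce

# Sketch: hold reader options in one dict and apply them in a single pass.
xml_options = {
    "rowTag": "book",
    "nullValue": "NULL",
    "ignoreSurroundingSpaces": "true",
    "dateFormat": "yyyy-MM-dd",
}

def with_options(reader, options):
    """Apply every (key, value) pair to a reader via chained .option calls."""
    return reduce(lambda r, kv: r.option(*kv), options.items(), reader)

# Usage with Spark (not run here):
# df = with_options(spark.read.format("xml"), xml_options).load("books.xml")
```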
Performance Optimization for Large XML Files
Processing large XML files requires optimization strategies:
# Read with partitioning
df = spark.read \
.format("xml") \
.option("rowTag", "record") \
.load("large_dataset.xml") \
.repartition(200) # Adjust based on cluster size
# Cache frequently accessed data
df.cache()
# Write to Parquet for faster subsequent reads
df.write.mode("overwrite").parquet("output/books.parquet")
# Read from Parquet instead
df_parquet = spark.read.parquet("output/books.parquet")
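The repartition(200) above is a placeholder; a common rule of thumb is to size partitions to roughly 128 MB of input each. A minimal sketch of that heuristic (the function and default are our own, not a Spark API):

```python
# Rule-of-thumb sketch: aim for roughly 128 MB of input per partition.
def target_partitions(total_bytes: int, mb_per_partition: int = 128,
                      minimum: int = 1) -> int:
    """Suggest a partition count from the total input size in bytes."""
    partitions = total_bytes // (mb_per_partition * 1024 * 1024)
    return max(int(partitions), minimum)

print(target_partitions(25 * 1024**3))  # 25 GB of input -> 200 partitions

# df = df.repartition(target_partitions(25 * 1024**3))
```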
When the data is split across many XML files, read them all in parallel with a glob pattern:
# Process multiple XML files in parallel
df_multiple = spark.read \
.format("xml") \
.option("rowTag", "record") \
.load("data/*.xml")
# Monitor partition distribution
print(f"Number of partitions: {df_multiple.rdd.getNumPartitions()}")
print(f"Row count: {df_multiple.count()}")
Writing DataFrames to XML
Convert DataFrames back to XML format when needed:
# Write DataFrame to XML
df.write \
.format("xml") \
.option("rootTag", "catalog") \
.option("rowTag", "book") \
.mode("overwrite") \
.save("output/books_output.xml")
# Write with compression
df.write \
.format("xml") \
.option("rootTag", "catalog") \
.option("rowTag", "book") \
.option("compression", "gzip") \
.mode("overwrite") \
.save("output/books_compressed.xml")
The spark-xml library provides robust XML processing capabilities for PySpark, handling everything from simple flat structures to complex nested hierarchies. Explicit schema definition and proper configuration ensure reliable, performant data pipelines in production environments.