PySpark - Read Excel File into DataFrame
Key Insights
- PySpark doesn't natively support Excel files; you need third-party libraries such as pandas or spark-excel to read .xlsx/.xls formats into DataFrames
- The pandas-based approach converts Excel to a Spark DataFrame via createDataFrame(), suitable for smaller files that fit in driver memory
- For production workloads with large Excel files, use the spark-excel package, which leverages Spark's distributed processing capabilities
Understanding PySpark’s Excel Limitations
PySpark’s native data source API supports formats like CSV, JSON, Parquet, and ORC, but Excel files require additional handling. Excel files are binary formats (.xlsx) or legacy binary formats (.xls) that don’t align with Spark’s distributed file processing model. This means you’ll need to choose between two primary approaches based on your file size and infrastructure requirements.
Method 1: Using Pandas for Small to Medium Files
The simplest approach leverages pandas to read Excel files, then converts the pandas DataFrame to a PySpark DataFrame. This works well when your Excel files are under 1-2 GB and can comfortably fit in the driver’s memory.
from pyspark.sql import SparkSession
import pandas as pd
# Initialize Spark session
spark = SparkSession.builder \
    .appName("ExcelReader") \
    .getOrCreate()
# Read Excel file using pandas
pandas_df = pd.read_excel("sales_data.xlsx", sheet_name="Q1_Sales")
# Convert to PySpark DataFrame
spark_df = spark.createDataFrame(pandas_df)
# Display schema and data
spark_df.printSchema()
spark_df.show(5)
For Excel files with multiple sheets, you can read specific sheets or all sheets:
# Read specific sheet
pandas_df = pd.read_excel("financial_report.xlsx", sheet_name="Revenue")
# Read multiple sheets into separate DataFrames
excel_file = pd.ExcelFile("financial_report.xlsx")
sheet_names = excel_file.sheet_names
dataframes = {}
for sheet in sheet_names:
    pandas_df = pd.read_excel(excel_file, sheet_name=sheet)
    dataframes[sheet] = spark.createDataFrame(pandas_df)
# Access specific sheet DataFrame
revenue_df = dataframes["Revenue"]
Handle specific data types and null values during conversion:
# Read Excel with specific parameters
pandas_df = pd.read_excel(
    "inventory.xlsx",
    sheet_name="Products",
    header=0,
    usecols="A:F",  # Read only columns A through F
    dtype={"ProductID": str, "Quantity": int},
    na_values=["N/A", "Missing"]
)
# Convert with schema inference
spark_df = spark.createDataFrame(pandas_df)
# Or specify schema explicitly
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
schema = StructType([
    StructField("ProductID", StringType(), True),
    StructField("ProductName", StringType(), True),
    StructField("Category", StringType(), True),
    StructField("Quantity", IntegerType(), True),
    StructField("Price", DoubleType(), True),
    StructField("LastUpdated", StringType(), True)
])
spark_df = spark.createDataFrame(pandas_df, schema=schema)
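One conversion pitfall worth flagging before moving on: pandas represents missing values as NaN/NaT, and createDataFrame() can carry those through as floating-point NaN rather than proper SQL nulls, particularly in string columns. A minimal sketch of the usual workaround (the column names here are illustrative, not from a real file):

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame({
    "ProductID": ["A1", np.nan, "C3"],   # NaN leaked into a string column
    "Quantity": [5.0, np.nan, 7.0],
})

# Cast to object and swap NaN/NaT for None so Spark stores genuine nulls
clean_pdf = pdf.astype(object).where(pd.notnull(pdf), None)
```

Passing clean_pdf to spark.createDataFrame(..., schema=schema) then yields nullable columns containing real nulls instead of NaN strings or sentinel floats.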
Method 2: Using spark-excel for Large Files
For production environments handling larger Excel files, the spark-excel package (from com.crealytics) provides native Spark integration with distributed processing capabilities. Note that spark-excel is a JVM library, not a Python package, so there is nothing to pip install; instead you pull it in through Spark's package mechanism when you create the session.
When starting your Spark session, include the necessary JAR dependencies:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("ExcelReaderProduction") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.12:3.3.1_0.18.5") \
    .getOrCreate()

# Read Excel file directly
df = spark.read \
    .format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "'Sheet1'!A1:F1000") \
    .load("large_dataset.xlsx")
df.show()
Configure various Excel-specific options:
# Advanced configuration
df = spark.read \
    .format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "'Sales_Data'!A1") \
    .option("treatEmptyValuesAsNulls", "true") \
    .option("addColorColumns", "false") \
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss") \
    .option("maxRowsInMemory", 20) \
    .option("excerptSize", 10) \
    .load("sales_report.xlsx")

# Apply transformations
result_df = df.filter(df["Revenue"] > 10000) \
    .groupBy("Region") \
    .agg({"Revenue": "sum", "Units": "sum"}) \
    .orderBy("sum(Revenue)", ascending=False)
result_df.show()
Handling Common Excel Scenarios
Reading Excel Files from Cloud Storage
Both methods support reading from cloud storage systems:
# Using pandas with S3
import s3fs
import pandas as pd
s3 = s3fs.S3FileSystem(anon=False)
with s3.open('s3://my-bucket/data/reports.xlsx', 'rb') as f:
    pandas_df = pd.read_excel(f, sheet_name="Summary")
spark_df = spark.createDataFrame(pandas_df)
# Using spark-excel with S3
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "YOUR_SECRET_KEY")
df = spark.read \
    .format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("s3a://my-bucket/data/reports.xlsx")
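The spark-excel read above also needs the S3A connector (hadoop-aws) on the classpath, and credentials can be supplied through spark.hadoop.* config keys rather than by mutating the Hadoop configuration through the private _jsc handle. A sketch of assembling those settings; the helper name is my own, and the artifact versions are assumptions you should pin to your actual Spark and Hadoop builds:

```python
def s3_excel_spark_conf(access_key, secret_key):
    """Hypothetical helper: the extra config an Excel-from-S3 read needs.
    Artifact versions are examples; match them to your Spark/Hadoop builds."""
    return {
        "spark.jars.packages": (
            "com.crealytics:spark-excel_2.12:3.3.1_0.18.5,"
            "org.apache.hadoop:hadoop-aws:3.3.2"
        ),
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }
```

Apply each pair with builder.config(key, value) before getOrCreate(); the spark.hadoop. prefix forwards the setting into the Hadoop configuration for you.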
Processing Multiple Excel Files
Batch process multiple Excel files in a directory:
import os
import glob
# Using pandas approach
excel_files = glob.glob("data/*.xlsx")
all_dataframes = []
for file_path in excel_files:
    pandas_df = pd.read_excel(file_path)
    pandas_df['source_file'] = os.path.basename(file_path)
    all_dataframes.append(pandas_df)
# Combine all pandas DataFrames
combined_pandas = pd.concat(all_dataframes, ignore_index=True)
spark_df = spark.createDataFrame(combined_pandas)
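When concatenating many workbooks like this, a common failure mode is column drift: one file renames or drops a column, and pd.concat papers over it by silently taking the union of columns, filling the gaps with NaN. A small guard you can run before the concat; the helper name is illustrative:

```python
import pandas as pd

def assert_same_columns(frames):
    """Raise if the per-file frames don't share identical columns.
    Hypothetical guard; pd.concat would otherwise silently union them."""
    expected = list(frames[0].columns)
    for i, f in enumerate(frames[1:], start=1):
        if list(f.columns) != expected:
            raise ValueError(
                f"file #{i} has columns {list(f.columns)}, expected {expected}"
            )
    return expected
```

Calling assert_same_columns(all_dataframes) just before pd.concat turns a silent schema mismatch into a loud, actionable error.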
# Using spark-excel with wildcard
df = spark.read \
    .format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("data/*.xlsx")
Optimizing Performance
Implement performance optimizations for Excel processing:
# Pandas approach with manual chunking for very large sheets.
# Note: pd.read_excel has no chunksize parameter (unlike read_csv),
# so page through the sheet with skiprows/nrows instead.
chunk_size = 10000
chunks = []
offset = 0
while True:
    chunk = pd.read_excel(
        "huge_file.xlsx",
        skiprows=range(1, offset + 1),  # always keep row 0, the header
        nrows=chunk_size
    )
    if chunk.empty:
        break
    chunks.append(spark.createDataFrame(chunk))
    offset += chunk_size
# Union all chunks
from functools import reduce
final_df = reduce(lambda df1, df2: df1.union(df2), chunks)
# Repartition for better parallelism
final_df = final_df.repartition(10)
# Cache if reusing multiple times
final_df.cache()
final_df.count() # Trigger caching
Error Handling and Validation
Implement robust error handling:
def read_excel_safely(file_path, sheet_name="Sheet1"):
    try:
        pandas_df = pd.read_excel(file_path, sheet_name=sheet_name)
        if pandas_df.empty:
            raise ValueError(f"Sheet '{sheet_name}' is empty")
        spark_df = spark.createDataFrame(pandas_df)
        # Validate schema
        required_columns = ["ID", "Name", "Value"]
        missing_cols = set(required_columns) - set(spark_df.columns)
        if missing_cols:
            raise ValueError(f"Missing required columns: {missing_cols}")
        return spark_df
    except FileNotFoundError:
        print(f"File not found: {file_path}")
        return None
    except ValueError as e:
        print(f"Validation error: {str(e)}")
        return None
    except Exception as e:
        print(f"Unexpected error: {str(e)}")
        return None
# Usage
df = read_excel_safely("data/input.xlsx", "Sales")
if df is not None:
    df.show()
Choosing the Right Approach
Use the pandas method when files are under 2GB, you need quick prototyping, or you’re working in local mode. Choose spark-excel for production workloads exceeding 2GB, distributed processing requirements, or when reading from distributed file systems. For hybrid scenarios, consider reading metadata with pandas and processing large datasets with spark-excel after initial validation.
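This rule of thumb can be encoded directly if you want ingestion code to route files automatically; the helper name and the 2 GB threshold are illustrative, not a hard limit:

```python
import os

def choose_excel_reader(path, threshold_gb=2.0):
    """Illustrative router: pandas below ~2 GB, spark-excel above.
    Tune the threshold to your driver's available memory."""
    size_gb = os.path.getsize(path) / 1024 ** 3
    return "pandas" if size_gb < threshold_gb else "spark-excel"
```

A dispatcher like this keeps the size-based decision in one place, so pipelines don't hard-code one reading strategy for every file.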