PySpark - Read Excel File into DataFrame
Key Insights
- PySpark doesn't natively support Excel files; you need third-party libraries such as pandas or spark-excel to read .xlsx/.xls formats into DataFrames
- The pandas-based approach converts Excel to a Spark DataFrame via createDataFrame(), suitable for smaller files that fit in driver memory
- For production workloads with large Excel files, use the spark-excel package, which leverages Spark's distributed processing capabilities
Understanding PySpark’s Excel Limitations
PySpark’s native data source API supports formats like CSV, JSON, Parquet, and ORC, but Excel files require additional handling. Excel files are binary formats (.xlsx) or legacy binary formats (.xls) that don’t align with Spark’s distributed file processing model. This means you’ll need to choose between two primary approaches based on your file size and infrastructure requirements.
Method 1: Using Pandas for Small to Medium Files
The simplest approach leverages pandas to read Excel files, then converts the pandas DataFrame to a PySpark DataFrame. This works well when your Excel files are under 1-2 GB and can comfortably fit in the driver’s memory.
from pyspark.sql import SparkSession
import pandas as pd
# Initialize Spark session
spark = SparkSession.builder \
    .appName("ExcelReader") \
    .getOrCreate()
# Read Excel file using pandas
pandas_df = pd.read_excel("sales_data.xlsx", sheet_name="Q1_Sales")
# Convert to PySpark DataFrame
spark_df = spark.createDataFrame(pandas_df)
# Display schema and data
spark_df.printSchema()
spark_df.show(5)
For Excel files with multiple sheets, you can read specific sheets or all sheets:
# Read specific sheet
pandas_df = pd.read_excel("financial_report.xlsx", sheet_name="Revenue")
# Read multiple sheets into separate DataFrames
excel_file = pd.ExcelFile("financial_report.xlsx")
sheet_names = excel_file.sheet_names
dataframes = {}
for sheet in sheet_names:
    pandas_df = pd.read_excel(excel_file, sheet_name=sheet)
    dataframes[sheet] = spark.createDataFrame(pandas_df)
# Access specific sheet DataFrame
revenue_df = dataframes["Revenue"]
Handle specific data types and null values during conversion:
# Read Excel with specific parameters
pandas_df = pd.read_excel(
    "inventory.xlsx",
    sheet_name="Products",
    header=0,
    usecols="A:F",  # Read only columns A through F
    dtype={"ProductID": str, "Quantity": int},
    na_values=["N/A", "Missing"]
)
# Convert with schema inference
spark_df = spark.createDataFrame(pandas_df)
# Or specify schema explicitly
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
schema = StructType([
    StructField("ProductID", StringType(), True),
    StructField("ProductName", StringType(), True),
    StructField("Category", StringType(), True),
    StructField("Quantity", IntegerType(), True),
    StructField("Price", DoubleType(), True),
    StructField("LastUpdated", StringType(), True)
])
spark_df = spark.createDataFrame(pandas_df, schema=schema)
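One conversion pitfall worth flagging before moving on: pandas represents missing values as NaN/NaT, and createDataFrame() can carry those through as floating-point NaN rather than proper SQL nulls, particularly in string columns. A minimal sketch of the usual workaround (the column names here are illustrative, not from a real file):

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame({
    "ProductID": ["A1", np.nan, "C3"],   # NaN leaked into a string column
    "Quantity": [5.0, np.nan, 7.0],
})

# Cast to object and swap NaN/NaT for None so Spark stores genuine nulls
clean_pdf = pdf.astype(object).where(pd.notnull(pdf), None)
```

Passing clean_pdf to spark.createDataFrame(..., schema=schema) then yields nullable columns containing real nulls instead of NaN strings or sentinel floats.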
Method 2: Using spark-excel for Large Files
For production environments handling larger Excel files, the spark-excel package (from com.crealytics) provides native Spark integration with distributed processing capabilities. Note that spark-excel is a JVM library, not a Python package, so there is nothing to pip install; instead you pull it in through Spark's package mechanism when you create the session.
When starting your Spark session, include the necessary JAR dependencies:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("ExcelReaderProduction") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.12:3.3.1_0.18.5") \
    .getOrCreate()

# Read Excel file directly
df = spark.read \
    .format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "'Sheet1'!A1:F1000") \
    .load("large_dataset.xlsx")
df.show()
Configure various Excel-specific options:
# Advanced configuration
df = spark.read \
    .format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "'Sales_Data'!A1") \
    .option("treatEmptyValuesAsNulls", "true") \
    .option("addColorColumns", "false") \
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss") \
    .option("maxRowsInMemory", 20) \
    .option("excerptSize", 10) \
    .load("sales_report.xlsx")

# Apply transformations
result_df = df.filter(df["Revenue"] > 10000) \
    .groupBy("Region") \
    .agg({"Revenue": "sum", "Units": "sum"}) \
    .orderBy("sum(Revenue)", ascending=False)
result_df.show()
Handling Common Excel Scenarios
Reading Excel Files from Cloud Storage
Both methods support reading from cloud storage systems:
# Using pandas with S3
import s3fs
import pandas as pd
s3 = s3fs.S3FileSystem(anon=False)
with s3.open('s3://my-bucket/data/reports.xlsx', 'rb') as f:
    pandas_df = pd.read_excel(f, sheet_name="Summary")
spark_df = spark.createDataFrame(pandas_df)
# Using spark-excel with S3
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "YOUR_SECRET_KEY")
df = spark.read \
    .format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("s3a://my-bucket/data/reports.xlsx")
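The spark-excel read above also needs the S3A connector (hadoop-aws) on the classpath, and credentials can be supplied through spark.hadoop.* config keys rather than by mutating the Hadoop configuration through the private _jsc handle. A sketch of assembling those settings; the helper name is my own, and the artifact versions are assumptions you should pin to your actual Spark and Hadoop builds:

```python
def s3_excel_spark_conf(access_key, secret_key):
    """Hypothetical helper: the extra config an Excel-from-S3 read needs.
    Artifact versions are examples; match them to your Spark/Hadoop builds."""
    return {
        "spark.jars.packages": (
            "com.crealytics:spark-excel_2.12:3.3.1_0.18.5,"
            "org.apache.hadoop:hadoop-aws:3.3.2"
        ),
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }
```

Apply each pair with builder.config(key, value) before getOrCreate(); the spark.hadoop. prefix forwards the setting into the Hadoop configuration for you.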
Processing Multiple Excel Files
Batch process multiple Excel files in a directory:
import os
import glob
# Using pandas approach
excel_files = glob.glob("data/*.xlsx")
all_dataframes = []
for file_path in excel_files:
    pandas_df = pd.read_excel(file_path)
    pandas_df['source_file'] = os.path.basename(file_path)
    all_dataframes.append(pandas_df)
# Combine all pandas DataFrames
combined_pandas = pd.concat(all_dataframes, ignore_index=True)
spark_df = spark.createDataFrame(combined_pandas)
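When concatenating many workbooks like this, a common failure mode is column drift: one file renames or drops a column, and pd.concat papers over it by silently taking the union of columns, filling the gaps with NaN. A small guard you can run before the concat; the helper name is illustrative:

```python
import pandas as pd

def assert_same_columns(frames):
    """Raise if the per-file frames don't share identical columns.
    Hypothetical guard; pd.concat would otherwise silently union them."""
    expected = list(frames[0].columns)
    for i, f in enumerate(frames[1:], start=1):
        if list(f.columns) != expected:
            raise ValueError(
                f"file #{i} has columns {list(f.columns)}, expected {expected}"
            )
    return expected
```

Calling assert_same_columns(all_dataframes) just before pd.concat turns a silent schema mismatch into a loud, actionable error.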
# Using spark-excel with wildcard
df = spark.read \
    .format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("data/*.xlsx")
Optimizing Performance
Implement performance optimizations for Excel processing:
# Pandas approach with manual chunking for very large sheets.
# Note: pd.read_excel has no chunksize parameter (unlike read_csv),
# so page through the sheet with skiprows/nrows instead.
chunk_size = 10000
chunks = []
offset = 0
while True:
    chunk = pd.read_excel(
        "huge_file.xlsx",
        skiprows=range(1, offset + 1),  # always keep row 0, the header
        nrows=chunk_size
    )
    if chunk.empty:
        break
    chunks.append(spark.createDataFrame(chunk))
    offset += chunk_size
# Union all chunks
from functools import reduce
final_df = reduce(lambda df1, df2: df1.union(df2), chunks)
# Repartition for better parallelism
final_df = final_df.repartition(10)
# Cache if reusing multiple times
final_df.cache()
final_df.count() # Trigger caching
Error Handling and Validation
Implement robust error handling:
def read_excel_safely(file_path, sheet_name="Sheet1"):
    try:
        pandas_df = pd.read_excel(file_path, sheet_name=sheet_name)
        if pandas_df.empty:
            raise ValueError(f"Sheet '{sheet_name}' is empty")
        spark_df = spark.createDataFrame(pandas_df)
        # Validate schema
        required_columns = ["ID", "Name", "Value"]
        missing_cols = set(required_columns) - set(spark_df.columns)
        if missing_cols:
            raise ValueError(f"Missing required columns: {missing_cols}")
        return spark_df
    except FileNotFoundError:
        print(f"File not found: {file_path}")
        return None
    except ValueError as e:
        print(f"Validation error: {str(e)}")
        return None
    except Exception as e:
        print(f"Unexpected error: {str(e)}")
        return None
# Usage
df = read_excel_safely("data/input.xlsx", "Sales")
if df is not None:
    df.show()
Choosing the Right Approach
Use the pandas method when files are under 2GB, you need quick prototyping, or you’re working in local mode. Choose spark-excel for production workloads exceeding 2GB, distributed processing requirements, or when reading from distributed file systems. For hybrid scenarios, consider reading metadata with pandas and processing large datasets with spark-excel after initial validation.
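This rule of thumb can be encoded directly if you want ingestion code to route files automatically; the helper name and the 2 GB threshold are illustrative, not a hard limit:

```python
import os

def choose_excel_reader(path, threshold_gb=2.0):
    """Illustrative router: pandas below ~2 GB, spark-excel above.
    Tune the threshold to your driver's available memory."""
    size_gb = os.path.getsize(path) / 1024 ** 3
    return "pandas" if size_gb < threshold_gb else "spark-excel"
```

A dispatcher like this keeps the size-based decision in one place, so pipelines don't hard-code one reading strategy for every file.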