How to Read Excel Files in Pandas


Key Insights

  • The pd.read_excel() function requires an engine like openpyxl for .xlsx files or xlrd for legacy .xls files—install these dependencies before attempting to read Excel data.
  • Use sheet_name, usecols, skiprows, and nrows parameters to surgically extract only the data you need, which dramatically improves performance with large workbooks.
  • Reading all sheets at once with sheet_name=None returns a dictionary of DataFrames, enabling efficient batch processing of multi-sheet workbooks.

Introduction

Excel files remain stubbornly ubiquitous in data workflows. Whether you’re receiving sales reports from finance, customer data from marketing, or research datasets from academic partners, you’ll encounter .xlsx and .xls files constantly. Fighting this reality is pointless—embrace it.

Pandas provides the read_excel() function to handle this exact scenario. It’s powerful, flexible, and once you understand its parameters, you can extract precisely the data you need without manual copying and pasting. This article covers everything from basic usage to advanced techniques for handling multi-sheet workbooks and problematic data types.

Prerequisites and Setup

Before reading any Excel file, you need pandas and an appropriate engine installed. Pandas doesn’t include Excel reading capabilities by default—it delegates to specialized libraries.

# For modern .xlsx files (Excel 2007+)
pip install pandas openpyxl

# For legacy .xls files (Excel 97-2003)
pip install pandas xlrd

The openpyxl engine handles the modern Office Open XML format (.xlsx), while xlrd handles the older binary format (.xls). If you’re working with both formats, install both libraries.

For conda users:

conda install pandas openpyxl xlrd

Verify your installation:

import pandas as pd

# Check pandas version
print(pd.__version__)

# Verify openpyxl is available
try:
    import openpyxl
    print(f"openpyxl version: {openpyxl.__version__}")
except ImportError:
    print("openpyxl not installed")

One common mistake: attempting to read .xlsx files with only xlrd installed. As of xlrd version 2.0, support for .xlsx files was removed for security reasons. Use openpyxl for modern Excel files.
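If you handle both formats regularly, a tiny helper can pick the right engine from the file extension. This is an illustrative sketch, not part of pandas; the function name and extensions handled are assumptions:

```python
from pathlib import Path

def pick_engine(path):
    """Choose a pandas Excel engine from the file extension (illustrative helper)."""
    suffix = Path(path).suffix.lower()
    if suffix == '.xls':
        return 'xlrd'      # legacy binary format
    if suffix in ('.xlsx', '.xlsm'):
        return 'openpyxl'  # Office Open XML format
    raise ValueError(f"Unsupported Excel extension: {suffix}")

print(pick_engine('report.xlsx'))   # openpyxl
print(pick_engine('archive.xls'))   # xlrd
```

You would then call pd.read_excel(path, engine=pick_engine(path)) instead of hardcoding the engine.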

Basic Excel Reading with read_excel()

The simplest usage requires only a file path:

import pandas as pd

# Read the first sheet of an Excel file
df = pd.read_excel('sales_data.xlsx')

# Display the first few rows
print(df.head())

# Check the shape and column names
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")

By default, read_excel() reads the first sheet, uses the first row as column headers, and infers data types automatically. This works perfectly for well-structured spreadsheets.
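You can see these defaults in action without any file on disk by round-tripping a DataFrame through an in-memory buffer (this sketch assumes openpyxl is installed, since to_excel() uses it for .xlsx output):

```python
import pandas as pd
from io import BytesIO

# Build a small workbook in memory
original = pd.DataFrame({'product': ['A', 'B'], 'revenue': [100.0, 250.5]})
buffer = BytesIO()
original.to_excel(buffer, index=False)
buffer.seek(0)

# Default read: first sheet, first row as headers, dtypes inferred
df = pd.read_excel(buffer)
print(df.columns.tolist())   # ['product', 'revenue']
print(df['revenue'].dtype)   # float64
```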

You can also read from URLs or file-like objects:

# Read from a URL
url = 'https://example.com/data/quarterly_report.xlsx'
df = pd.read_excel(url)

# Read from a BytesIO object (useful for data from APIs)
from io import BytesIO

with open('sales_data.xlsx', 'rb') as f:
    excel_bytes = BytesIO(f.read())
    df = pd.read_excel(excel_bytes)

Explicitly specifying the engine prevents ambiguity:

# Explicit engine specification
df = pd.read_excel('sales_data.xlsx', engine='openpyxl')
df_legacy = pd.read_excel('old_report.xls', engine='xlrd')

Selecting Sheets and Ranges

Real-world Excel files rarely contain a single, perfectly formatted sheet. You’ll need to navigate multiple sheets and extract specific ranges.
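Before guessing at names or indices, you can list a workbook's sheets with pd.ExcelFile, which also lets you parse several sheets without reopening the file. The demo below writes a small two-sheet workbook first so the example is self-contained (the file name is illustrative):

```python
import pandas as pd

# Create a small two-sheet workbook for demonstration
with pd.ExcelWriter('workbook_demo.xlsx') as writer:
    pd.DataFrame({'x': [1]}).to_excel(writer, sheet_name='Q1 Sales', index=False)
    pd.DataFrame({'x': [2]}).to_excel(writer, sheet_name='Q2 Sales', index=False)

# ExcelFile opens the workbook once; sheet_names lists its sheets
xl = pd.ExcelFile('workbook_demo.xlsx')
print(xl.sheet_names)        # ['Q1 Sales', 'Q2 Sales']

# parse() reads a sheet without reopening the file
df = xl.parse('Q2 Sales')
print(df['x'].iloc[0])       # 2
```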

Selecting Sheets

Use the sheet_name parameter to target specific sheets:

# By name
df = pd.read_excel('workbook.xlsx', sheet_name='Q4 Sales')

# By index (0-based)
df = pd.read_excel('workbook.xlsx', sheet_name=0)  # First sheet
df = pd.read_excel('workbook.xlsx', sheet_name=2)  # Third sheet

# Multiple sheets (returns a dictionary)
dfs = pd.read_excel('workbook.xlsx', sheet_name=['Q1 Sales', 'Q2 Sales'])
print(dfs['Q1 Sales'].head())

Selecting Columns

The usecols parameter accepts several formats:

# Excel-style column letters
df = pd.read_excel('sales.xlsx', usecols='A:D')
df = pd.read_excel('sales.xlsx', usecols='A,C,E:G')

# Column indices (0-based)
df = pd.read_excel('sales.xlsx', usecols=[0, 1, 3, 5])

# Column names (requires headers to be read first)
df = pd.read_excel('sales.xlsx', usecols=['Date', 'Product', 'Revenue'])

# Callable for dynamic selection
df = pd.read_excel('sales.xlsx', usecols=lambda x: 'price' in x.lower())

Selecting Rows

Control which rows to read with skiprows, nrows, and skipfooter:

# Skip the first 3 rows (useful for files with metadata headers)
df = pd.read_excel('report.xlsx', skiprows=3)

# Skip specific rows by index
df = pd.read_excel('report.xlsx', skiprows=[0, 2, 4])

# Read only the first 100 rows
df = pd.read_excel('large_file.xlsx', nrows=100)

# Skip the last 5 rows (footer/summary rows)
df = pd.read_excel('report.xlsx', skipfooter=5)

# Combine parameters for precise extraction
df = pd.read_excel(
    'complex_report.xlsx',
    sheet_name='Data',
    usecols='B:F',
    skiprows=4,
    nrows=500
)

A practical example combining these parameters:

# Extract a specific table from a complex report
df = pd.read_excel(
    'quarterly_financials.xlsx',
    sheet_name='Income Statement',
    usecols='A:E',
    skiprows=8,      # Skip title and metadata rows
    nrows=25,        # Read only the main data section
    skipfooter=0
)

Handling Data Types and Headers

Excel’s loose typing often causes problems. A column might contain numbers, text, and dates mixed together. Take control with explicit type specifications.

Setting Headers

# Use a different row as headers
df = pd.read_excel('data.xlsx', header=2)  # Third row becomes headers

# No headers (first row is data)
df = pd.read_excel('data.xlsx', header=None)

# Provide custom column names
df = pd.read_excel(
    'data.xlsx',
    header=None,
    names=['id', 'name', 'value', 'timestamp']
)

# Multi-level headers
df = pd.read_excel('data.xlsx', header=[0, 1])  # Two-row header

Specifying Data Types

# Explicit dtype mapping
df = pd.read_excel(
    'sales.xlsx',
    dtype={
        'product_id': str,      # Keep leading zeros
        'quantity': int,
        'price': float,
        'region_code': 'category'
    }
)

# Force all columns to string (useful for initial inspection)
df = pd.read_excel('unknown_data.xlsx', dtype=str)

Parsing Dates

Excel stores dates as serial numbers internally. Pandas usually handles this correctly, but explicit parsing gives you control:
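When a date column does arrive as raw serial numbers (for instance, because mixed types defeated the automatic conversion), you can convert them yourself. Excel counts days from an epoch that pandas exposes as origin '1899-12-30'; the serial values below are illustrative:

```python
import pandas as pd

# Raw Excel date serials (days since the Excel epoch)
serials = pd.Series([44197, 44562])

dates = pd.to_datetime(serials, unit='D', origin='1899-12-30')
print(dates.iloc[0])  # 2021-01-01 00:00:00
print(dates.iloc[1])  # 2022-01-01 00:00:00
```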

# Parse specific columns as dates
df = pd.read_excel(
    'orders.xlsx',
    parse_dates=['order_date', 'ship_date']
)

# Combine separate date and time columns after reading
# (passing a nested dict to parse_dates is deprecated in recent pandas)
df = pd.read_excel('events.xlsx')
df['event_datetime'] = pd.to_datetime(
    df['date_col'].astype(str) + ' ' + df['time_col'].astype(str)
)

# Non-standard formats: use date_format
# (the older date_parser argument was removed in pandas 2.x)
df = pd.read_excel(
    'european_data.xlsx',
    parse_dates=['date'],
    date_format='%d/%m/%Y'
)

A comprehensive example:

df = pd.read_excel(
    'customer_orders.xlsx',
    sheet_name='Orders',
    usecols='A:H',
    dtype={
        'customer_id': str,
        'postal_code': str,
        'quantity': 'Int64',  # Nullable integer
        'unit_price': float
    },
    parse_dates=['order_date', 'delivery_date'],
    na_values=['N/A', 'MISSING', '-']  # Custom NA indicators
)

Reading Multiple Sheets

When you need data from multiple sheets, avoid reading the file repeatedly. Use sheet_name=None to load everything at once:

# Load all sheets into a dictionary
all_sheets = pd.read_excel('annual_report.xlsx', sheet_name=None)

# all_sheets is a dict: {'Sheet1': df1, 'Sheet2': df2, ...}
print(f"Available sheets: {list(all_sheets.keys())}")

# Access individual DataFrames
january_data = all_sheets['January']
february_data = all_sheets['February']

Processing Multiple Sheets

# Concatenate all sheets with a source column
all_sheets = pd.read_excel('monthly_sales.xlsx', sheet_name=None)

dfs = []
for sheet_name, df in all_sheets.items():
    df['source_sheet'] = sheet_name
    dfs.append(df)

combined = pd.concat(dfs, ignore_index=True)
print(f"Combined shape: {combined.shape}")

Selective Multi-Sheet Reading

# Read only specific sheets
quarterly_data = pd.read_excel(
    'annual_report.xlsx',
    sheet_name=['Q1', 'Q2', 'Q3', 'Q4']
)

# Process with consistent parameters
sheets_to_read = ['North', 'South', 'East', 'West']
regional_data = pd.read_excel(
    'regional_sales.xlsx',
    sheet_name=sheets_to_read,
    usecols='A:F',
    skiprows=2,
    dtype={'region_code': str}
)

Common Issues and Troubleshooting

Missing Engine Errors

ImportError: Missing optional dependency 'openpyxl'.

Solution: Install the required engine: pip install openpyxl

Memory Issues with Large Files

Large Excel files can exhaust memory. Strategies to handle this:

# Read in chunks by limiting rows. Note: each call re-parses the
# file from the start, so this trades speed for memory.
chunk_size = 10000

dfs = []
skip = 0
while True:
    chunk = pd.read_excel(
        'huge_file.xlsx',
        skiprows=range(1, skip + 1),  # skip already-read rows, keep the header
        nrows=chunk_size
    )
    if chunk.empty:
        break
    dfs.append(chunk)
    skip += chunk_size

# Read only necessary columns
df = pd.read_excel('huge_file.xlsx', usecols=['id', 'value'])

For truly massive files, consider converting to CSV or using specialized tools like openpyxl directly with read-only mode.
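A minimal sketch of openpyxl's read-only mode, which streams rows instead of loading the whole workbook into memory. The demo writes a small file first so the example is self-contained; in practice you would point it at your large workbook:

```python
import pandas as pd
from openpyxl import load_workbook

# Create a demo file (stand-in for a large workbook)
pd.DataFrame({'id': range(5), 'value': range(5)}).to_excel('big_demo.xlsx', index=False)

# read_only=True streams rows rather than loading everything at once
wb = load_workbook('big_demo.xlsx', read_only=True)
ws = wb.active

rows = ws.iter_rows(values_only=True)
header = next(rows)               # first row holds the column names
data = [row for row in rows]      # remaining rows as tuples
wb.close()                        # read-only workbooks must be closed explicitly

df = pd.DataFrame(data, columns=header)
print(df.shape)  # (5, 2)
```

In a real streaming scenario you would process each row inside the loop instead of collecting them all into a list.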

Encoding and Character Issues

Unlike read_csv(), read_excel() has no encoding parameter. Modern .xlsx files are zipped XML with a fixed internal encoding, so pandas decodes text for you. If you see garbled characters, the corruption is almost always in the source file itself, or introduced later when exporting—for example, use df.to_csv('out.csv', encoding='utf-8-sig') to write CSVs that Excel opens correctly.

Merged Cells and Formatting

Merged cells often cause NaN values in unexpected places:

df = pd.read_excel('formatted_report.xlsx')
# Forward-fill merged cell values
# (fillna(method='ffill') is deprecated; use .ffill() instead)
df['category'] = df['category'].ffill()

Formula Results vs. Formulas

By default, read_excel() reads calculated values, not formulas. This is usually what you want. If cells show #REF! or similar errors, the formulas had issues in Excel itself.
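If you need to inspect the formulas themselves rather than their results, you can drop to openpyxl's data_only flag. A sketch follows; note that a file saved by openpyxl (as here) carries no cached results, because only Excel itself computes and caches them, so data_only=True returns None for the formula cell:

```python
from openpyxl import Workbook, load_workbook

# Build a workbook containing a formula
wb = Workbook()
ws = wb.active
ws['A1'] = 10
ws['A2'] = 32
ws['A3'] = '=A1+A2'
wb.save('formula_demo.xlsx')

# data_only=False (the default) exposes the formula string
wb_formulas = load_workbook('formula_demo.xlsx', data_only=False)
print(wb_formulas.active['A3'].value)  # =A1+A2

# data_only=True returns Excel's cached result instead;
# since Excel never opened this file, there is no cached value
wb_values = load_workbook('formula_demo.xlsx', data_only=True)
print(wb_values.active['A3'].value)    # None
```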

Excel file handling in pandas is straightforward once you understand the key parameters. Start with basic reads, then add specificity as needed. When performance matters, always limit your reads to exactly the data you need using usecols, skiprows, and nrows.
