Pandas - MultiIndex (Hierarchical Indexing) Tutorial
MultiIndex (hierarchical indexing) extends Pandas' indexing capabilities by allowing multiple levels of labels on rows or columns. This structure is essential when working with multi-dimensional data...
Key Insights
- MultiIndex enables hierarchical row and column labels, allowing you to represent higher-dimensional data in a 2D DataFrame structure while maintaining efficient operations
- Creating MultiIndex structures from tuples, arrays, or existing DataFrame columns provides flexibility for different data organization patterns and analytical needs
- Advanced operations like stacking, unstacking, and cross-sections let you reshape and slice hierarchical data without expensive pivot operations or data duplication
Understanding MultiIndex Fundamentals
MultiIndex (hierarchical indexing) extends Pandas’ indexing capabilities by allowing multiple levels of labels on rows or columns. This structure is essential when working with multi-dimensional data that doesn’t fit neatly into a flat table.
import pandas as pd
import numpy as np
# Creating a MultiIndex from tuples
index = pd.MultiIndex.from_tuples([
('US', 'California', 'San Francisco'),
('US', 'California', 'Los Angeles'),
('US', 'Texas', 'Houston'),
('UK', 'England', 'London'),
('UK', 'Scotland', 'Edinburgh')
], names=['Country', 'State', 'City'])
data = pd.Series([875000, 520000, 380000, 650000, 295000], index=index)
print(data)
Output shows hierarchical structure:
Country State City
US California San Francisco 875000
Los Angeles 520000
Texas Houston 380000
UK England London 650000
Scotland Edinburgh 295000
Creating MultiIndex Structures
From Arrays
# MultiIndex from arrays
countries = ['US', 'US', 'US', 'UK', 'UK']
states = ['CA', 'CA', 'TX', 'England', 'Scotland']
cities = ['SF', 'LA', 'Houston', 'London', 'Edinburgh']
multi_idx = pd.MultiIndex.from_arrays(
[countries, states, cities],
names=['Country', 'State', 'City']
)
df = pd.DataFrame({
'Population': [875000, 520000, 380000, 650000, 295000],
'GDP': [550, 380, 290, 480, 180]
}, index=multi_idx)
print(df)
From Product (Cartesian Product)
# Create all combinations of levels
years = [2021, 2022, 2023]
quarters = ['Q1', 'Q2', 'Q3', 'Q4']
multi_idx = pd.MultiIndex.from_product(
[years, quarters],
names=['Year', 'Quarter']
)
df = pd.DataFrame({
'Revenue': np.random.randint(100, 500, size=12),
'Costs': np.random.randint(50, 300, size=12)
}, index=multi_idx)
print(df.head(8))
From DataFrame Columns
# Convert existing columns to MultiIndex
df_flat = pd.DataFrame({
'Country': ['US', 'US', 'UK'],
'City': ['NYC', 'LA', 'London'],
'Year': [2022, 2022, 2022],
'Sales': [1200, 980, 1500]
})
df_multi = df_flat.set_index(['Country', 'City', 'Year'])
print(df_multi)
Indexing and Selection Operations
Basic Selection
# Create sample data
idx = pd.MultiIndex.from_product(
[['North', 'South'], ['Q1', 'Q2', 'Q3'], ['Product_A', 'Product_B']],
names=['Region', 'Quarter', 'Product']
)
sales_df = pd.DataFrame({
'Units': np.random.randint(50, 200, size=12),
'Revenue': np.random.randint(1000, 5000, size=12)
}, index=idx)
# Select outer level
print(sales_df.loc['North'])
# Select specific combination
print(sales_df.loc[('North', 'Q1')])
# Select with tuple
print(sales_df.loc[('South', 'Q2', 'Product_A')])
Cross-Sections with xs()
# Get cross-section at specific level
north_q1 = sales_df.xs('Q1', level='Quarter')
print(north_q1)
# Cross-section on multiple levels
product_a = sales_df.xs('Product_A', level='Product')
print(product_a)
# Drop level after selection
result = sales_df.xs(('North', 'Q1'), level=['Region', 'Quarter'], drop_level=True)
print(result)
Advanced Slicing
# Slice with IndexSlice
idx_slice = pd.IndexSlice
# Select all Q1 and Q2 data for North region
result = sales_df.loc[idx_slice['North', ['Q1', 'Q2'], :], :]
print(result)
# Select range of quarters
result = sales_df.loc[idx_slice[:, 'Q1':'Q2', 'Product_A'], :]
print(result)
Reshaping with Stack and Unstack
Unstacking Levels
# Create sample hierarchical data
idx = pd.MultiIndex.from_product(
[['2022', '2023'], ['Q1', 'Q2'], ['Revenue', 'Costs']],
names=['Year', 'Quarter', 'Metric']
)
data = pd.Series(np.random.randint(100, 1000, size=8), index=idx)
# Unstack innermost level (Metric)
unstacked = data.unstack(level='Metric')
print(unstacked)
# Unstack multiple levels
unstacked_multi = data.unstack(level=['Quarter', 'Metric'])
print(unstacked_multi)
Stacking Operations
# Create DataFrame with hierarchical columns
df = pd.DataFrame({
('Sales', 'Q1'): [100, 150, 200],
('Sales', 'Q2'): [120, 160, 210],
('Costs', 'Q1'): [60, 80, 100],
('Costs', 'Q2'): [70, 85, 105]
}, index=['Product_A', 'Product_B', 'Product_C'])
df.columns = pd.MultiIndex.from_tuples(df.columns, names=['Metric', 'Quarter'])
# Stack to convert columns to rows
stacked = df.stack(level='Quarter')
print(stacked)
# Stack multiple levels
fully_stacked = df.stack(level=['Metric', 'Quarter'])
print(fully_stacked)
Aggregation and GroupBy Operations
# Create sales data with MultiIndex
idx = pd.MultiIndex.from_product(
[['North', 'South', 'East', 'West'],
['2022', '2023'],
['Q1', 'Q2', 'Q3', 'Q4']],
names=['Region', 'Year', 'Quarter']
)
sales_data = pd.DataFrame({
'Sales': np.random.randint(1000, 10000, size=32),
'Units': np.random.randint(50, 500, size=32)
}, index=idx)
# Aggregate by level
yearly_sales = sales_data.groupby(level='Year').sum()
print(yearly_sales)
# Aggregate by multiple levels
regional_yearly = sales_data.groupby(level=['Region', 'Year']).mean()
print(regional_yearly)
# Using sum with level parameter
quarter_totals = sales_data.sum(level='Quarter')
print(quarter_totals)
Sorting and Reordering
# Sort by index levels
sorted_df = sales_data.sort_index(level=['Year', 'Quarter'], ascending=[True, False])
print(sorted_df.head(10))
# Reorder levels
reordered = sales_data.reorder_levels(['Year', 'Region', 'Quarter'])
print(reordered.head())
# Swap specific levels
swapped = sales_data.swaplevel('Region', 'Year')
print(swapped.head())
Practical Use Case: Time Series Analysis
# Financial data with hierarchical structure
dates = pd.date_range('2023-01-01', periods=90, freq='D')
tickers = ['AAPL', 'GOOGL', 'MSFT']
idx = pd.MultiIndex.from_product(
[dates, tickers],
names=['Date', 'Ticker']
)
stock_data = pd.DataFrame({
'Open': np.random.uniform(100, 200, size=270),
'Close': np.random.uniform(100, 200, size=270),
'Volume': np.random.randint(1000000, 10000000, size=270)
}, index=idx)
# Calculate daily returns per ticker
stock_data['Return'] = stock_data.groupby(level='Ticker')['Close'].pct_change()
# Get statistics by ticker
ticker_stats = stock_data.groupby(level='Ticker').agg({
'Close': ['mean', 'std', 'min', 'max'],
'Volume': 'mean'
})
print(ticker_stats)
# Pivot for comparison
comparison = stock_data['Close'].unstack(level='Ticker')
print(comparison.head())
# Rolling calculations within each ticker
rolling_avg = stock_data.groupby(level='Ticker')['Close'].rolling(window=7).mean()
print(rolling_avg.head(20))
Performance Considerations
MultiIndex operations are generally efficient, but keep these optimization strategies in mind:
# Sort index for faster lookups
df_sorted = sales_data.sort_index()
# Use is_monotonic to check if sorted
print(f"Index sorted: {df_sorted.index.is_monotonic_increasing}")
# Query with sorted index is significantly faster
result = df_sorted.loc[('North', '2023')] # Fast lookup on sorted index
# Reset index when hierarchical structure no longer needed
df_flat = sales_data.reset_index()
# Use categorical data for repeated values
sales_data.index = sales_data.index.set_levels(
sales_data.index.levels[0].astype('category'),
level=0
)
MultiIndex provides powerful data organization capabilities without requiring complex database structures. Master these techniques to handle multi-dimensional datasets efficiently in memory while maintaining clean, readable code.