Pandas - Describe/Summary Statistics
Key Insights
• The describe() method provides comprehensive statistical summaries but can be customized with percentiles, inclusion rules, and data type filters to match specific analytical needs
• Summary statistics in Pandas extend beyond describe() to include individual methods like mean(), median(), std(), and quantile() that offer more granular control and better performance for targeted analyses
• GroupBy operations combined with summary statistics enable powerful segmented analysis, while custom aggregation functions allow you to define domain-specific statistical measures
Basic Statistical Summary with describe()
The describe() method generates a statistical summary of numeric columns in a DataFrame, including count, mean, standard deviation, minimum, quartiles, and maximum values.
import pandas as pd
import numpy as np
# Create sample dataset
df = pd.DataFrame({
'revenue': [45000, 52000, 48000, 61000, 55000, 49000, 58000],
'expenses': [32000, 38000, 35000, 42000, 39000, 36000, 41000],
'employees': [12, 15, 13, 18, 16, 14, 17],
'region': ['North', 'South', 'East', 'West', 'North', 'South', 'East']
})
print(df.describe())
Output:
revenue expenses employees
count 7.000000 7.000000 7.000000
mean 52571.428571 37571.428571 15.000000
std 5593.424177 3458.782074 2.160247
min 45000.000000 32000.000000 12.000000
25% 48500.000000 35500.000000 13.500000
50% 52000.000000 38000.000000 15.000000
75% 56500.000000 40500.000000 16.500000
max 61000.000000 42000.000000 18.000000
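For wide DataFrames, transposing the summary so each column becomes a row is a common readability idiom (rounding is optional):
# Transpose: one row per column, one column per statistic
print(df.describe().T.round(2))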
Customizing Percentiles and Include Parameters
Control which percentiles appear in the summary and which data types to include.
# Custom percentiles
print(df.describe(percentiles=[.1, .5, .9]))
# Include all data types
print(df.describe(include='all'))
# Only categorical columns
print(df.describe(include=['object']))
# Exclude specific types
print(df.describe(exclude=['int64']))
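The include and exclude parameters also accept NumPy dtype objects; for instance, np.number selects every numeric column regardless of integer or float width:
# Summarize all numeric columns explicitly
print(df.describe(include=[np.number]))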
For categorical data analysis:
df_categorical = pd.DataFrame({
'product': ['A', 'B', 'A', 'C', 'B', 'A', 'A', 'C'],
'status': ['active', 'inactive', 'active', 'active', 'active',
'inactive', 'active', 'active']
})
print(df_categorical.describe())
Output:
product status
count 8 8
unique 3 2
top A active
freq 4 6
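describe() reports only the most frequent category; for a full frequency table, value_counts() is the usual complement (normalize=True returns proportions instead of counts):
# Full frequency breakdown per column
print(df_categorical['product'].value_counts())
print(df_categorical['status'].value_counts(normalize=True))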
Individual Statistical Methods
When you need only specific metrics, the individual methods are more direct than describe(): each returns a scalar you can use immediately, without computing the full summary.
# Central tendency
print(f"Mean revenue: ${df['revenue'].mean():,.2f}")
print(f"Median revenue: ${df['revenue'].median():,.2f}")
print(f"Mode: {df['region'].mode().values}")
# Dispersion
print(f"Standard deviation: ${df['revenue'].std():,.2f}")
print(f"Variance: {df['revenue'].var():,.2f}")
print(f"Range: ${df['revenue'].max() - df['revenue'].min():,.2f}")
# Quantiles
print(f"90th percentile: ${df['revenue'].quantile(0.9):,.2f}")
print(f"IQR: ${df['revenue'].quantile(0.75) - df['revenue'].quantile(0.25):,.2f}")
# Distribution shape
print(f"Skewness: {df['revenue'].skew():.3f}")
print(f"Kurtosis: {df['revenue'].kurtosis():.3f}")
Multiple Statistics with agg()
The agg() method applies multiple aggregation functions simultaneously, accepting lists, dictionaries, or custom functions.
# Multiple functions on single column
revenue_stats = df['revenue'].agg(['count', 'mean', 'std', 'min', 'max'])
print(revenue_stats)
# Different functions for different columns
summary = df.agg({
'revenue': ['mean', 'median', 'std'],
'expenses': ['sum', 'mean'],
'employees': ['min', 'max', 'mean']
})
print(summary)
# Custom aggregation functions
def coefficient_of_variation(x):
return (x.std() / x.mean()) * 100
def profit_margin(df):
return ((df['revenue'] - df['expenses']) / df['revenue'] * 100).mean()
custom_stats = df.agg({
'revenue': [coefficient_of_variation, 'mean'],
'expenses': [coefficient_of_variation, 'mean']
})
print(custom_stats)
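Because profit_margin takes the whole DataFrame rather than a single column, it cannot go in the column-to-function dictionary above; call it directly instead:
print(f"Average profit margin: {profit_margin(df):.2f}%")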
GroupBy Statistics for Segmented Analysis
Combine groupby() with statistical methods to analyze subsets of data.
# Statistics by group
regional_stats = df.groupby('region').agg({
'revenue': ['mean', 'sum', 'count'],
'expenses': 'mean',
'employees': ['min', 'max']
})
print(regional_stats)
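Named aggregation (pandas 0.25+) is an alternative syntax that yields flat, readable column names instead of a MultiIndex:
# Same statistics, flat column names
regional_flat = df.groupby('region').agg(
    avg_revenue=('revenue', 'mean'),
    total_revenue=('revenue', 'sum'),
    max_employees=('employees', 'max')
)
print(regional_flat)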
# Multiple grouping levels
df['quarter'] = ['Q1', 'Q1', 'Q2', 'Q2', 'Q3', 'Q3', 'Q4']
multi_group = df.groupby(['quarter', 'region'])['revenue'].describe()
print(multi_group)
# Custom aggregations with groupby
def efficiency_ratio(group):
    return (group['revenue'] - group['expenses']).sum() / group['revenue'].sum()
# Note: pandas 2.2+ warns that apply() operates on the grouping column;
# passing include_groups=False silences this (region is not used here anyway)
efficiency = df.groupby('region').apply(efficiency_ratio)
print(f"\nEfficiency by region:\n{efficiency}")
Statistical Analysis with Missing Data
Handle missing values appropriately when calculating statistics.
# Create dataset with missing values
df_missing = pd.DataFrame({
'sales': [100, 150, np.nan, 200, 175, np.nan, 190],
'costs': [80, np.nan, 95, 160, 140, 120, np.nan],
'units': [10, 15, 12, np.nan, 14, 11, 16]
})
# Reductions skip NaN by default, and describe() counts only non-missing values
print("With skipna=True (default):")
print(df_missing.describe())
# count() returns the number of non-missing values per column
print("\nNon-missing values per column:")
print(df_missing.count())
print(f"Total rows: {len(df_missing)}")
# Calculate statistics with different strategies
print("\nMean with different NaN handling:")
print(f"Skip NaN: {df_missing['sales'].mean():.2f}")
print(f"Fill with 0: {df_missing['sales'].fillna(0).mean():.2f}")
print(f"Forward fill: {df_missing['sales'].fillna(method='ffill').mean():.2f}")
Advanced Statistical Summaries
Create custom summary functions for domain-specific analysis.
def comprehensive_summary(series):
"""Generate comprehensive statistical summary"""
return pd.Series({
'count': series.count(),
'missing': series.isna().sum(),
'mean': series.mean(),
'median': series.median(),
'std': series.std(),
'cv': (series.std() / series.mean() * 100) if series.mean() != 0 else np.nan,
'min': series.min(),
'q25': series.quantile(0.25),
'q75': series.quantile(0.75),
'max': series.max(),
'range': series.max() - series.min(),
'iqr': series.quantile(0.75) - series.quantile(0.25),
'skew': series.skew(),
'kurtosis': series.kurtosis()
})
# Apply to DataFrame
detailed_stats = df[['revenue', 'expenses', 'employees']].apply(comprehensive_summary)
print(detailed_stats)
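The same function composes with groupby(); applying it per region returns a Series indexed by (region, statistic):
# Per-group comprehensive summary (small groups yield NaN for std/skew/kurtosis)
regional_detail = df.groupby('region')['revenue'].apply(comprehensive_summary)
print(regional_detail)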
Performance Considerations for Large Datasets
Optimize statistical calculations for large datasets.
# Create large dataset
large_df = pd.DataFrame({
'value': np.random.randn(1_000_000),
'category': np.random.choice(['A', 'B', 'C', 'D'], 1_000_000)
})
import time
# Calling each method separately makes a separate pass over the data
start = time.time()
mean_val = large_df['value'].mean()
std_val = large_df['value'].std()
print(f"Separate calls: {time.time() - start:.4f}s")
# describe() returns count, mean, std, min, quartiles, and max from one call,
# which beats calling each of those methods individually when you need them all
start = time.time()
stats = large_df['value'].describe()
print(f"describe(): {time.time() - start:.4f}s")
# Memory-efficient groupby with observed=True for categoricals
large_df['category'] = large_df['category'].astype('category')
start = time.time()
grouped = large_df.groupby('category', observed=True)['value'].agg(['mean', 'std'])
print(f"Categorical groupby: {time.time() - start:.4f}s")
Rolling and Expanding Statistics
Calculate moving statistics for time-series analysis.
# Time series data
ts_df = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=30, freq='D'),
'sales': np.random.randint(100, 200, 30)
})
ts_df.set_index('date', inplace=True)
# Rolling statistics
ts_df['rolling_mean_7d'] = ts_df['sales'].rolling(window=7).mean()
ts_df['rolling_std_7d'] = ts_df['sales'].rolling(window=7).std()
# Expanding statistics (cumulative)
ts_df['expanding_mean'] = ts_df['sales'].expanding().mean()
ts_df['expanding_max'] = ts_df['sales'].expanding().max()
# Exponentially weighted statistics
ts_df['ewm_mean'] = ts_df['sales'].ewm(span=7).mean()
print(ts_df.tail(10))
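By default the first window - 1 rolling values are NaN; min_periods relaxes this by computing with however many observations are available:
# Partial windows at the start instead of NaN
ts_df['rolling_mean_flex'] = ts_df['sales'].rolling(window=7, min_periods=1).mean()
print(ts_df[['sales', 'rolling_mean_7d', 'rolling_mean_flex']].head(8))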
Summary statistics in Pandas provide the foundation for data exploration and analysis. Use describe() for quick overviews, individual methods for specific metrics, and agg() with groupby() for sophisticated segmented analysis. Understanding these tools enables efficient data profiling and informed decision-making in data pipelines.