Pandas - Describe/Summary Statistics

Key Insights

• The `describe()` method provides comprehensive statistical summaries but can be customized with percentiles, inclusion rules, and data type filters to match specific analytical needs
• Summary statistics in Pandas extend beyond `describe()` to include individual methods like `mean()`, `median()`, `std()`, and `quantile()` that offer more granular control and better performance for targeted analyses
• GroupBy operations combined with summary statistics enable powerful segmented analysis, while custom aggregation functions allow you to define domain-specific statistical measures

Basic Statistical Summary with describe()

The describe() method generates a statistical summary of numeric columns in a DataFrame, including count, mean, standard deviation, minimum, quartiles, and maximum values.

import pandas as pd
import numpy as np

# Create sample dataset
df = pd.DataFrame({
    'revenue': [45000, 52000, 48000, 61000, 55000, 49000, 58000],
    'expenses': [32000, 38000, 35000, 42000, 39000, 36000, 41000],
    'employees': [12, 15, 13, 18, 16, 14, 17],
    'region': ['North', 'South', 'East', 'West', 'North', 'South', 'East']
})

print(df.describe())

Output:

            revenue      expenses  employees
count      7.000000      7.000000   7.000000
mean   52571.428571  37571.428571  15.000000
std     5593.424177   3458.782074   2.160247
min    45000.000000  32000.000000  12.000000
25%    48500.000000  35500.000000  13.500000
50%    52000.000000  38000.000000  15.000000
75%    56500.000000  40500.000000  16.500000
max    61000.000000  42000.000000  18.000000

Customizing Percentiles and Include Parameters

Control which percentiles appear in the summary and which data types to include.

# Custom percentiles
print(df.describe(percentiles=[.1, .5, .9]))

# Include all data types
print(df.describe(include='all'))

# Only categorical columns
print(df.describe(include=['object']))

# Exclude specific types
print(df.describe(exclude=['int64']))
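For DataFrames with many columns, transposing the summary with `.T` (a plain transpose, not a separate API) often reads better, since each column becomes a row. A minimal sketch using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'revenue': [45000, 52000, 48000, 61000, 55000, 49000, 58000],
    'expenses': [32000, 38000, 35000, 42000, 39000, 36000, 41000],
    'employees': [12, 15, 13, 18, 16, 14, 17],
})

# One row per column: easier to scan when there are many numeric columns
wide_summary = df.describe(percentiles=[.25, .5, .75]).T
print(wide_summary[['count', 'mean', 'min', '50%', 'max']].round(2))
```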

For categorical data analysis:

df_categorical = pd.DataFrame({
    'product': ['A', 'B', 'A', 'C', 'B', 'A', 'A', 'C'],
    'status': ['active', 'inactive', 'active', 'active', 'active', 
               'inactive', 'active', 'active']
})

print(df_categorical.describe())

Output:

       product  status
count        8       8
unique       3       2
top          A  active
freq         4       6
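Note that `describe()` reports only the single most frequent category (`top`/`freq`). For the full frequency breakdown, `value_counts()` is the usual companion, with `normalize=True` for proportions. A quick sketch on the same categorical data:

```python
import pandas as pd

df_categorical = pd.DataFrame({
    'product': ['A', 'B', 'A', 'C', 'B', 'A', 'A', 'C'],
    'status': ['active', 'inactive', 'active', 'active', 'active',
               'inactive', 'active', 'active']
})

# Absolute counts per category, most frequent first
counts = df_categorical['product'].value_counts()
print(counts)

# Proportions instead of counts
props = df_categorical['status'].value_counts(normalize=True)
print(props)
```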

Individual Statistical Methods

For specific metrics, individual methods provide better performance and more control than describe().

# Central tendency
print(f"Mean revenue: ${df['revenue'].mean():,.2f}")
print(f"Median revenue: ${df['revenue'].median():,.2f}")
print(f"Mode: {df['region'].mode().values}")

# Dispersion
print(f"Standard deviation: ${df['revenue'].std():,.2f}")
print(f"Variance: {df['revenue'].var():,.2f}")
print(f"Range: ${df['revenue'].max() - df['revenue'].min():,.2f}")

# Quantiles
print(f"90th percentile: ${df['revenue'].quantile(0.9):,.2f}")
print(f"IQR: ${df['revenue'].quantile(0.75) - df['revenue'].quantile(0.25):,.2f}")

# Distribution shape
print(f"Skewness: {df['revenue'].skew():.3f}")
print(f"Kurtosis: {df['revenue'].kurtosis():.3f}")
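The quantile calls above feed directly into the standard 1.5 × IQR outlier rule (Tukey's fences). A minimal sketch on the same revenue values:

```python
import pandas as pd

revenue = pd.Series([45000, 52000, 48000, 61000, 55000, 49000, 58000])

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = revenue.quantile(0.25), revenue.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = revenue[(revenue < lower) | (revenue > upper)]
print(f"Bounds: [{lower:,.0f}, {upper:,.0f}]  Outliers: {outliers.tolist()}")
```

With this small sample every value falls inside the fences, so the outlier list is empty.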

Multiple Statistics with agg()

The agg() method applies multiple aggregation functions simultaneously, accepting lists, dictionaries, or custom functions.

# Multiple functions on single column
revenue_stats = df['revenue'].agg(['count', 'mean', 'std', 'min', 'max'])
print(revenue_stats)

# Different functions for different columns
summary = df.agg({
    'revenue': ['mean', 'median', 'std'],
    'expenses': ['sum', 'mean'],
    'employees': ['min', 'max', 'mean']
})
print(summary)

# Custom aggregation functions
def coefficient_of_variation(x):
    return (x.std() / x.mean()) * 100

def profit_margin(df):
    return ((df['revenue'] - df['expenses']) / df['revenue'] * 100).mean()

custom_stats = df.agg({
    'revenue': [coefficient_of_variation, 'mean'],
    'expenses': [coefficient_of_variation, 'mean']
})
print(custom_stats)

# profit_margin operates on the whole DataFrame, so call it directly
print(f"Average profit margin: {profit_margin(df):.2f}%")

GroupBy Statistics for Segmented Analysis

Combine groupby() with statistical methods to analyze subsets of data.

# Statistics by group
regional_stats = df.groupby('region').agg({
    'revenue': ['mean', 'sum', 'count'],
    'expenses': 'mean',
    'employees': ['min', 'max']
})
print(regional_stats)

# Multiple grouping levels
df['quarter'] = ['Q1', 'Q1', 'Q2', 'Q2', 'Q3', 'Q3', 'Q4']
multi_group = df.groupby(['quarter', 'region'])['revenue'].describe()
print(multi_group)

# Custom aggregations with groupby
def efficiency_ratio(group):
    return (group['revenue'] - group['expenses']).sum() / group['revenue'].sum()

# Note: pandas >= 2.2 warns that apply() will stop operating on grouping
# columns; efficiency_ratio only reads 'revenue' and 'expenses', so the
# result is unaffected
efficiency = df.groupby('region').apply(efficiency_ratio)
print(f"\nEfficiency by region:\n{efficiency}")
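The dictionary form of `agg()` above produces a MultiIndex on the columns. Pandas' named aggregation (the keyword form of `agg()`) yields flat, self-describing column names instead, which is often easier to work with downstream. A sketch on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'revenue': [45000, 52000, 48000, 61000, 55000, 49000, 58000],
    'expenses': [32000, 38000, 35000, 42000, 39000, 36000, 41000],
    'region': ['North', 'South', 'East', 'West', 'North', 'South', 'East']
})

# Named aggregation: output_column=('input_column', 'function')
regional = df.groupby('region').agg(
    avg_revenue=('revenue', 'mean'),
    total_revenue=('revenue', 'sum'),
    avg_expenses=('expenses', 'mean'),
)
print(regional)
```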

Statistical Analysis with Missing Data

Handle missing values appropriately when calculating statistics.

# Create dataset with missing values
df_missing = pd.DataFrame({
    'sales': [100, 150, np.nan, 200, 175, np.nan, 190],
    'costs': [80, np.nan, 95, 160, 140, 120, np.nan],
    'units': [10, 15, 12, np.nan, 14, 11, 16]
})

# Default behavior (skipna=True)
print("With skipna=True (default):")
print(df_missing.describe())

# Include NaN in count
print("\nValue counts including NaN:")
print(df_missing.count())
print(f"Total rows: {len(df_missing)}")

# Calculate statistics with different strategies
print("\nMean with different NaN handling:")
print(f"Skip NaN: {df_missing['sales'].mean():.2f}")
print(f"Fill with 0: {df_missing['sales'].fillna(0).mean():.2f}")
print(f"Forward fill: {df_missing['sales'].ffill().mean():.2f}")
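To see how much the default `skipna=True` matters, passing `skipna=False` propagates any NaN straight through to the result. A quick sketch on the same sales values:

```python
import pandas as pd
import numpy as np

sales = pd.Series([100, 150, np.nan, 200, 175, np.nan, 190])

# Default: NaNs are dropped, so the mean averages the 5 observed values
print(sales.mean())

# skipna=False: a single NaN poisons the whole result
print(sales.mean(skipna=False))  # nan
```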

Advanced Statistical Summaries

Create custom summary functions for domain-specific analysis.

def comprehensive_summary(series):
    """Generate comprehensive statistical summary"""
    return pd.Series({
        'count': series.count(),
        'missing': series.isna().sum(),
        'mean': series.mean(),
        'median': series.median(),
        'std': series.std(),
        'cv': (series.std() / series.mean() * 100) if series.mean() != 0 else np.nan,
        'min': series.min(),
        'q25': series.quantile(0.25),
        'q75': series.quantile(0.75),
        'max': series.max(),
        'range': series.max() - series.min(),
        'iqr': series.quantile(0.75) - series.quantile(0.25),
        'skew': series.skew(),
        'kurtosis': series.kurtosis()
    })

# Apply to DataFrame
detailed_stats = df[['revenue', 'expenses', 'employees']].apply(comprehensive_summary)
print(detailed_stats)

Performance Considerations for Large Datasets

Optimize statistical calculations for large datasets.

# Create large dataset
large_df = pd.DataFrame({
    'value': np.random.randn(1_000_000),
    'category': np.random.choice(['A', 'B', 'C', 'D'], 1_000_000)
})

# Timing separate calls vs a single agg() call
import time

# Two separate calls traverse the data twice
start = time.time()
mean_val = large_df['value'].mean()
std_val = large_df['value'].std()
print(f"Separate calls: {time.time() - start:.4f}s")

# agg() computes both in one call; note that describe() is not a shortcut
# here -- it also computes quantiles, so it does strictly more work
start = time.time()
stats = large_df['value'].agg(['mean', 'std'])
print(f"Single agg call: {time.time() - start:.4f}s")

# Memory-efficient groupby with observed=True for categoricals
large_df['category'] = large_df['category'].astype('category')
start = time.time()
grouped = large_df.groupby('category', observed=True)['value'].agg(['mean', 'std'])
print(f"Categorical groupby: {time.time() - start:.4f}s")

Rolling and Expanding Statistics

Calculate moving statistics for time-series analysis.

# Time series data
ts_df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=30, freq='D'),
    'sales': np.random.randint(100, 200, 30)
})
ts_df.set_index('date', inplace=True)

# Rolling statistics
ts_df['rolling_mean_7d'] = ts_df['sales'].rolling(window=7).mean()
ts_df['rolling_std_7d'] = ts_df['sales'].rolling(window=7).std()

# Expanding statistics (cumulative)
ts_df['expanding_mean'] = ts_df['sales'].expanding().mean()
ts_df['expanding_max'] = ts_df['sales'].expanding().max()

# Exponentially weighted statistics
ts_df['ewm_mean'] = ts_df['sales'].ewm(span=7).mean()

print(ts_df.tail(10))
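One practical detail: rolling windows return NaN until the window fills, so a 7-day window leaves the first 6 rows empty. The `min_periods` parameter controls how many observations are required before a value is emitted. A small sketch:

```python
import pandas as pd

sales = pd.Series([100, 120, 110, 130, 125, 140, 135])

# Default: first 6 values are NaN because the 7-value window isn't full yet
strict = sales.rolling(window=7).mean()

# min_periods=1: emit a partial-window mean from the first observation on
lenient = sales.rolling(window=7, min_periods=1).mean()

print(pd.DataFrame({'strict': strict, 'lenient': lenient}))
```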

Summary statistics in Pandas provide the foundation for data exploration and analysis. Use describe() for quick overviews, individual methods for specific metrics, and agg() with groupby() for sophisticated segmented analysis. Understanding these tools enables efficient data profiling and informed decision-making in data pipelines.
