How to Use Describe in Pandas

Key Insights

  • The describe() method generates eight essential statistics for numeric columns in a single call, making it the fastest way to understand your data’s distribution, central tendency, and spread.
  • Using the include and exclude parameters lets you analyze categorical data, numeric data, or both simultaneously—a feature many developers overlook.
  • Combining describe() with groupby() transforms simple summary statistics into powerful comparative analysis across segments of your data.

Introduction to the Describe Method

Exploratory data analysis starts with one question: what does my data actually look like? Before building models, creating visualizations, or writing complex transformations, you need to understand the basic shape and characteristics of your dataset. Pandas provides the describe() method for exactly this purpose.

The describe() method generates summary statistics for your DataFrame or Series in a single line of code. It calculates count, mean, standard deviation, minimum, maximum, and quartile values—everything you need to quickly assess data quality and distribution patterns.

Here’s the basic usage:

import pandas as pd
import numpy as np

# Create a sample DataFrame
df = pd.DataFrame({
    'age': [25, 32, 47, 51, 62, 23, 34, 45, 56, 38],
    'salary': [50000, 65000, 82000, 95000, 120000, 45000, 70000, 88000, 105000, 72000],
    'years_experience': [2, 5, 15, 20, 30, 1, 8, 18, 25, 10],
    'department': ['Engineering', 'Sales', 'Engineering', 'HR', 'Sales', 
                   'Engineering', 'HR', 'Sales', 'Engineering', 'HR']
})

# Get summary statistics
print(df.describe())

Output:

             age         salary  years_experience
count  10.000000      10.000000         10.000000
mean   41.300000   79200.000000         13.400000
std    13.064371   23630.488781          9.845473
min    23.000000   45000.000000          1.000000
25%    32.500000   66250.000000          5.750000
50%    41.500000   77000.000000         12.500000
75%    50.000000   93250.000000         19.500000
max    62.000000  120000.000000         30.000000

With one method call, you now know the range, central tendency, and spread of every numeric column in your dataset.

Understanding the Default Output

Each row in the describe() output tells you something specific about your data. Understanding these metrics helps you spot issues and patterns immediately.

count reveals how many non-null values exist in each column. If count differs across columns, you have missing data. In our example, all columns show 10, meaning no missing values.
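To see that signal in a minimal sketch (the small frame and its NaN below are illustrative, not part of the article's dataset):

```python
import numpy as np
import pandas as pd

# A tiny frame with one deliberately missing salary
demo = pd.DataFrame({
    'age': [25, 32, 47, 51],
    'salary': [50000, np.nan, 82000, 95000],
})

# count drops to 3 for salary -- the mismatch flags missing data
counts = demo.describe().loc['count']
print(counts)
```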

mean provides the arithmetic average. Compare this to the median (50%) to detect skewness. When mean significantly exceeds median, you likely have right-skewed data with high outliers.

std (standard deviation) measures spread around the mean. Higher values indicate more variability. Our salary column has a std of roughly 23,630, suggesting salaries vary considerably.

min and max show the range boundaries. These help identify potential data entry errors—a negative age or a salary of 1 would stand out immediately.

25%, 50%, 75% are the quartiles. The 50% value is the median. The interquartile range (75% minus 25%) contains the middle half of your data and resists outlier influence.

# Interpreting the output programmatically
stats = df.describe()

# Check for potential outliers using IQR
salary_iqr = stats.loc['75%', 'salary'] - stats.loc['25%', 'salary']
salary_upper_bound = stats.loc['75%', 'salary'] + (1.5 * salary_iqr)
salary_lower_bound = stats.loc['25%', 'salary'] - (1.5 * salary_iqr)

print(f"Salary IQR: ${salary_iqr:,.0f}")
print(f"Outlier bounds: ${salary_lower_bound:,.0f} - ${salary_upper_bound:,.0f}")
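The mean-versus-median skew check described above can be automated the same way. This is a minimal sketch on a deliberately skewed sample, with the 10% gap threshold chosen arbitrarily for illustration:

```python
import pandas as pd

# Deliberately skewed sample: one large value drags the mean upward
s = pd.Series([30, 35, 32, 31, 34, 33, 200], name='value')

stats = s.describe()
# Flag right skew when the mean sits well above the median
right_skewed = stats['mean'] > stats['50%'] * 1.1
print(f"mean={stats['mean']:.1f}, median={stats['50%']:.1f}, "
      f"right_skewed={right_skewed}")
```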

Describing Different Data Types

By default, describe() only analyzes numeric columns. The department column from our example doesn’t appear in the output. This behavior makes sense for most use cases, but you’ll often need to examine categorical data too.

The include parameter controls which data types to analyze:

# Include all columns regardless of type
print(df.describe(include='all'))

Output:

              age         salary  years_experience   department
count   10.000000      10.000000         10.000000           10
unique        NaN            NaN               NaN            3
top           NaN            NaN               NaN  Engineering
freq          NaN            NaN               NaN            4
mean    41.300000   79200.000000         13.400000          NaN
std     13.064371   23630.488781          9.845473          NaN
min     23.000000   45000.000000          1.000000          NaN
25%     32.500000   66250.000000          5.750000          NaN
50%     41.500000   77000.000000         12.500000          NaN
75%     50.000000   93250.000000         19.500000          NaN
max     62.000000  120000.000000         30.000000          NaN

For categorical columns, you get different statistics: unique (number of distinct values), top (most frequent value), and freq (frequency of the top value). Numeric statistics show NaN for categorical columns and vice versa.

Target specific data types with these patterns:

# Only object (string) columns
print(df.describe(include='object'))

# Only numeric columns (explicit)
print(df.describe(include=[np.number]))

# Multiple specific types
print(df.describe(include=['int64', 'float64']))

# Exclude certain types
print(df.describe(exclude=['object']))

For categorical data specifically:

# Analyze only categorical columns
categorical_stats = df.describe(include='object')
print(categorical_stats)

Output:

       department
count          10
unique          3
top     Engineering
freq            4

This tells us we have 3 unique departments, with Engineering being the most common (4 occurrences).
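When the top/freq summary isn't enough, value_counts() gives the full per-category breakdown. Here's a sketch using the same department values:

```python
import pandas as pd

departments = pd.Series(
    ['Engineering', 'Sales', 'Engineering', 'HR', 'Sales',
     'Engineering', 'HR', 'Sales', 'Engineering', 'HR'],
    name='department'
)

# Frequency of every category, not just the most common one
vc = departments.value_counts()
print(vc)
```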

Customizing Percentiles

The default quartiles (25%, 50%, 75%) work well for general analysis, but specific use cases demand different percentile breakdowns. The percentiles parameter accepts a list of values between 0 and 1.

# Tail-and-center view (10th, 50th, 90th percentiles)
print(df.describe(percentiles=[.1, .5, .9]))

Output:

             age         salary  years_experience
count  10.000000      10.000000         10.000000
mean   41.300000   79200.000000         13.400000
std    13.064371   23630.488781          9.845473
min    23.000000   45000.000000          1.000000
10%    24.800000   49500.000000          1.900000
50%    41.500000   77000.000000         12.500000
90%    56.600000  106500.000000         25.500000
max    62.000000  120000.000000         30.000000

This view helps identify extreme values. The gap between 90th percentile and max reveals whether your maximum values are outliers or part of a natural distribution.

Common percentile configurations for different analyses:

# Risk analysis (focus on tails)
risk_percentiles = df.describe(percentiles=[.01, .05, .95, .99])

# Fine-grained distribution (quintiles)
quintiles = df.describe(percentiles=[.2, .4, .6, .8])

# Custom business thresholds
custom = df.describe(percentiles=[.1, .25, .5, .75, .9, .95])

Note that describe() always includes the 50th percentile (median) even if you don’t specify it.
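You can confirm that directly; even a tails-only request still returns the median row:

```python
import pandas as pd

s = pd.Series([23, 25, 32, 34, 38, 45, 47, 51, 56, 62])

# Only the 5th and 95th percentiles are requested...
tails = s.describe(percentiles=[.05, .95])

# ...but the 50% (median) row is always included
print(tails.index.tolist())
```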

Describing Series vs. DataFrames

Calling describe() on a single column returns a Series instead of a DataFrame. The output format differs slightly, and you can chain it with other Series methods.

# DataFrame describe - returns DataFrame
df_stats = df.describe()
print(type(df_stats))  # <class 'pandas.core.frame.DataFrame'>

# Series describe - returns Series
series_stats = df['salary'].describe()
print(type(series_stats))  # <class 'pandas.core.series.Series'>

print(series_stats)

Output:

count        10.000000
mean      79200.000000
std       23630.488781
min       45000.000000
25%       66250.000000
50%       77000.000000
75%       93250.000000
max      120000.000000
Name: salary, dtype: float64

Series output is easier to work with when you need specific statistics:

# Access individual statistics directly
salary_stats = df['salary'].describe()

print(f"Average salary: ${salary_stats['mean']:,.0f}")
print(f"Salary range: ${salary_stats['min']:,.0f} - ${salary_stats['max']:,.0f}")
print(f"Middle 50% earn: ${salary_stats['25%']:,.0f} - ${salary_stats['75%']:,.0f}")

Practical Use Cases and Tips

Detecting Outliers

Use the IQR method with describe() output to flag potential outliers:

def find_outliers(df, column):
    stats = df[column].describe()
    iqr = stats['75%'] - stats['25%']
    lower = stats['25%'] - 1.5 * iqr
    upper = stats['75%'] + 1.5 * iqr
    
    outliers = df[(df[column] < lower) | (df[column] > upper)]
    return outliers

# Find salary outliers
salary_outliers = find_outliers(df, 'salary')
print(f"Found {len(salary_outliers)} outliers")

Comparing Groups with GroupBy

Combine groupby() with describe() for segmented analysis:

# Statistics by department
grouped_stats = df.groupby('department')['salary'].describe()
print(grouped_stats)

Output:

             count     mean           std      min      25%      50%       75%       max
department                                                                              
Engineering    4.0  70500.0  28242.993231  45000.0  48750.0  66000.0   87750.0  105000.0
HR             3.0  79000.0  13892.443990  70000.0  71000.0  72000.0   83500.0   95000.0
Sales          3.0  91000.0  27622.454634  65000.0  76500.0  88000.0  104000.0  120000.0

For better readability with multiple columns, transpose the output:

# Transpose for easier reading
full_grouped = df.groupby('department').describe()
print(full_grouped.T)
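Because the grouped result is a regular DataFrame, individual statistics are plain column lookups. This sketch pulls the per-department salary medians from the sample data (recreated here so the snippet runs on its own):

```python
import pandas as pd

df = pd.DataFrame({
    'salary': [50000, 65000, 82000, 95000, 120000,
               45000, 70000, 88000, 105000, 72000],
    'department': ['Engineering', 'Sales', 'Engineering', 'HR', 'Sales',
                   'Engineering', 'HR', 'Sales', 'Engineering', 'HR'],
})

grouped = df.groupby('department')['salary'].describe()

# Each statistic is an ordinary column in the grouped output
medians = grouped['50%']
print(medians)
```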

Checking Data Quality

Build a quick data quality report:

def data_quality_report(df):
    # One row per column: dtype, completeness, and cardinality
    report = pd.DataFrame({
        'dtype': df.dtypes,
        'non_null': df.count(),
        'null_count': df.isnull().sum(),
        'unique': df.nunique()
    })
    
    return report

print(data_quality_report(df))

Saving Describe Output

Export statistics for documentation or sharing:

# Save to CSV
df.describe().to_csv('summary_statistics.csv')

# Round for cleaner display as a Markdown table
# (to_markdown() requires the tabulate package)
print(df.describe().round(2).to_markdown())

Conclusion

The describe() method provides immediate insight into your data’s characteristics. You now know how to interpret each statistic, analyze different data types with include and exclude, customize percentiles for specific analyses, and combine describe() with groupby() for comparative analysis.

For deeper statistical analysis, consider these next steps: use df.corr() to examine relationships between numeric columns, apply value_counts() to individual columns for detailed categorical breakdowns, or leverage scipy.stats for hypothesis testing and distribution fitting. But start with describe()—it remains the fastest way to understand what you’re working with.
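As a quick sketch of the corr() follow-up on the same sample columns (recreated here so the snippet runs standalone):

```python
import pandas as pd

df = pd.DataFrame({
    'age': [25, 32, 47, 51, 62, 23, 34, 45, 56, 38],
    'salary': [50000, 65000, 82000, 95000, 120000,
               45000, 70000, 88000, 105000, 72000],
    'years_experience': [2, 5, 15, 20, 30, 1, 8, 18, 25, 10],
})

# Pairwise Pearson correlations between numeric columns
correlations = df.corr(numeric_only=True)
print(correlations.round(2))
```

In this sample, age, salary, and experience all rise together, so the off-diagonal correlations come out strongly positive.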
