How to Use Describe in Pandas
Key Insights
- The describe() method generates eight essential statistics for numeric columns in a single call, making it the fastest way to understand your data’s distribution, central tendency, and spread.
- Using the include and exclude parameters lets you analyze categorical data, numeric data, or both simultaneously, a feature many developers overlook.
- Combining describe() with groupby() transforms simple summary statistics into powerful comparative analysis across segments of your data.
Introduction to the Describe Method
Exploratory data analysis starts with one question: what does my data actually look like? Before building models, creating visualizations, or writing complex transformations, you need to understand the basic shape and characteristics of your dataset. Pandas provides the describe() method for exactly this purpose.
The describe() method generates summary statistics for your DataFrame or Series in a single line of code. It calculates count, mean, standard deviation, minimum, maximum, and quartile values—everything you need to quickly assess data quality and distribution patterns.
Here’s the basic usage:
import pandas as pd
import numpy as np
# Create a sample DataFrame
df = pd.DataFrame({
    'age': [25, 32, 47, 51, 62, 23, 34, 45, 56, 38],
    'salary': [50000, 65000, 82000, 95000, 120000, 45000, 70000, 88000, 105000, 72000],
    'years_experience': [2, 5, 15, 20, 30, 1, 8, 18, 25, 10],
    'department': ['Engineering', 'Sales', 'Engineering', 'HR', 'Sales',
                   'Engineering', 'HR', 'Sales', 'Engineering', 'HR']
})
# Get summary statistics
print(df.describe())
Output:
             age         salary  years_experience
count  10.000000      10.000000         10.000000
mean   41.300000   79200.000000         13.400000
std    13.064371   23630.488781          9.845473
min    23.000000   45000.000000          1.000000
25%    32.500000   66250.000000          5.750000
50%    41.500000   77000.000000         12.500000
75%    50.000000   93250.000000         19.500000
max    62.000000  120000.000000         30.000000
With one method call, you now know the range, central tendency, and spread of every numeric column in your dataset.
Understanding the Default Output
Each row in the describe() output tells you something specific about your data. Understanding these metrics helps you spot issues and patterns immediately.
count reveals how many non-null values exist in each column. If count differs across columns, you have missing data. In our example, all columns show 10, meaning no missing values.
mean provides the arithmetic average. Compare this to the median (50%) to detect skewness. When mean significantly exceeds median, you likely have right-skewed data with high outliers.
std (standard deviation) measures spread around the mean. Higher values indicate more variability. Our salary column has a std of ~23,630, suggesting salaries vary considerably.
min and max show the range boundaries. These help identify potential data entry errors—a negative age or a salary of 1 would stand out immediately.
25%, 50%, 75% are the quartiles. The 50% value is the median. The interquartile range (75% minus 25%) contains the middle half of your data and resists outlier influence.
# Interpreting the output programmatically
stats = df.describe()
# Check for potential outliers using IQR
salary_iqr = stats.loc['75%', 'salary'] - stats.loc['25%', 'salary']
salary_upper_bound = stats.loc['75%', 'salary'] + (1.5 * salary_iqr)
salary_lower_bound = stats.loc['25%', 'salary'] - (1.5 * salary_iqr)
print(f"Salary IQR: ${salary_iqr:,.0f}")
print(f"Outlier bounds: ${salary_lower_bound:,.0f} - ${salary_upper_bound:,.0f}")
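The mean-versus-median comparison from the previous section can also be checked directly in code. A minimal sketch using the salary values from the sample data:

```python
import pandas as pd

salaries = pd.Series([50000, 65000, 82000, 95000, 120000,
                      45000, 70000, 88000, 105000, 72000], name='salary')

# A mean well above the median hints at right skew: a few high
# values pulling the average up.
mean_salary = salaries.mean()
median_salary = salaries.median()
skew_hint = "right-skewed" if mean_salary > median_salary else "not right-skewed"
print(f"mean=${mean_salary:,.0f}, median=${median_salary:,.0f} -> {skew_hint}")
# → mean=$79,200, median=$77,000 -> right-skewed
```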
Describing Different Data Types
By default, describe() only analyzes numeric columns. The department column from our example doesn’t appear in the output. This behavior makes sense for most use cases, but you’ll often need to examine categorical data too.
The include parameter controls which data types to analyze:
# Include all columns regardless of type
print(df.describe(include='all'))
Output:
              age         salary  years_experience   department
count   10.000000      10.000000         10.000000           10
unique        NaN            NaN               NaN            3
top           NaN            NaN               NaN  Engineering
freq          NaN            NaN               NaN            4
mean    41.300000   79200.000000         13.400000          NaN
std     13.064371   23630.488781          9.845473          NaN
min     23.000000   45000.000000          1.000000          NaN
25%     32.500000   66250.000000          5.750000          NaN
50%     41.500000   77000.000000         12.500000          NaN
75%     50.000000   93250.000000         19.500000          NaN
max     62.000000  120000.000000         30.000000          NaN
For categorical columns, you get different statistics: unique (number of distinct values), top (most frequent value), and freq (frequency of the top value). Numeric statistics show NaN for categorical columns and vice versa.
Target specific data types with these patterns:
# Only object (string) columns
print(df.describe(include='object'))
# Only numeric columns (explicit)
print(df.describe(include=[np.number]))
# Multiple specific types
print(df.describe(include=['int64', 'float64']))
# Exclude certain types
print(df.describe(exclude=['object']))
For categorical data specifically:
# Analyze only categorical columns
categorical_stats = df.describe(include='object')
print(categorical_stats)
Output:
department
count 10
unique 3
top Engineering
freq 4
This tells us we have 3 unique departments, with Engineering being the most common (4 occurrences).
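describe() only surfaces the single most frequent category. The standard value_counts() method gives the full frequency breakdown, and its first row corresponds to describe()'s top and freq:

```python
import pandas as pd

departments = pd.Series(['Engineering', 'Sales', 'Engineering', 'HR', 'Sales',
                         'Engineering', 'HR', 'Sales', 'Engineering', 'HR'])

# Full frequency table, sorted by count descending
counts = departments.value_counts()
print(counts)

# The first entry matches describe()'s 'top' and 'freq'
print(counts.index[0], counts.iloc[0])  # → Engineering 4
```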
Customizing Percentiles
The default quartiles (25%, 50%, 75%) work well for general analysis, but specific use cases demand different percentile breakdowns. The percentiles parameter accepts a list of values between 0 and 1.
# Decile analysis (10th, 50th, 90th percentiles)
print(df.describe(percentiles=[.1, .5, .9]))
Output:
age salary years_experience
count  10.000000      10.000000         10.000000
mean   41.300000   79200.000000         13.400000
std    13.064371   23630.488781          9.845473
min    23.000000   45000.000000          1.000000
10%    24.800000   49500.000000          1.900000
50%    41.500000   77000.000000         12.500000
90%    56.600000  106500.000000         25.500000
max    62.000000  120000.000000         30.000000
This view helps identify extreme values. The gap between 90th percentile and max reveals whether your maximum values are outliers or part of a natural distribution.
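That tail check can be scripted directly; a small sketch using the sample salary data:

```python
import pandas as pd

salaries = pd.Series([50000, 65000, 82000, 95000, 120000,
                      45000, 70000, 88000, 105000, 72000], name='salary')

stats = salaries.describe(percentiles=[.1, .5, .9])
# A large gap between the 90th percentile and the max suggests the
# top value sits well apart from the bulk of the distribution.
tail_gap = stats['max'] - stats['90%']
print(f"90%: ${stats['90%']:,.0f}, max: ${stats['max']:,.0f}, gap: ${tail_gap:,.0f}")
# → 90%: $106,500, max: $120,000, gap: $13,500
```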
Common percentile configurations for different analyses:
# Risk analysis (focus on tails)
risk_percentiles = df.describe(percentiles=[.01, .05, .95, .99])
# Fine-grained distribution (quintiles)
quintiles = df.describe(percentiles=[.2, .4, .6, .8])
# Custom business thresholds
custom = df.describe(percentiles=[.1, .25, .5, .75, .9, .95])
Note that describe() always includes the 50th percentile (median) even if you don’t specify it.
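You can verify this with a throwaway Series:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
# Even though only the 90th percentile is requested, the 50% (median)
# row always appears in the output
stats = s.describe(percentiles=[.9])
print(stats.index.tolist())
# → ['count', 'mean', 'std', 'min', '50%', '90%', 'max']
```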
Describing Series vs. DataFrames
Calling describe() on a single column returns a Series instead of a DataFrame. The output format differs slightly, and you can chain it with other Series methods.
# DataFrame describe - returns DataFrame
df_stats = df.describe()
print(type(df_stats)) # <class 'pandas.core.frame.DataFrame'>
# Series describe - returns Series
series_stats = df['salary'].describe()
print(type(series_stats)) # <class 'pandas.core.series.Series'>
print(series_stats)
Output:
count        10.000000
mean      79200.000000
std       23630.488781
min       45000.000000
25%       66250.000000
50%       77000.000000
75%       93250.000000
max      120000.000000
Name: salary, dtype: float64
Series output is easier to work with when you need specific statistics:
# Access individual statistics directly
salary_stats = df['salary'].describe()
print(f"Average salary: ${salary_stats['mean']:,.0f}")
print(f"Salary range: ${salary_stats['min']:,.0f} - ${salary_stats['max']:,.0f}")
print(f"Middle 50% earn: ${salary_stats['25%']:,.0f} - ${salary_stats['75%']:,.0f}")
Practical Use Cases and Tips
Detecting Outliers
Use the IQR method with describe() output to flag potential outliers:
def find_outliers(df, column):
    stats = df[column].describe()
    iqr = stats['75%'] - stats['25%']
    lower = stats['25%'] - 1.5 * iqr
    upper = stats['75%'] + 1.5 * iqr
    outliers = df[(df[column] < lower) | (df[column] > upper)]
    return outliers
# Find salary outliers
salary_outliers = find_outliers(df, 'salary')
print(f"Found {len(salary_outliers)} outliers")
Comparing Groups with GroupBy
Combine groupby() with describe() for segmented analysis:
# Statistics by department
grouped_stats = df.groupby('department')['salary'].describe()
print(grouped_stats)
Output:
count mean std min 25% 50% 75% max
department
Engineering    4.0  70500.0  28242.99323  45000.0  48750.0  66000.0   87750.0  105000.0
HR             3.0  79000.0  13892.44399  70000.0  71000.0  72000.0   83500.0   95000.0
Sales          3.0  91000.0  27622.45463  65000.0  76500.0  88000.0  104000.0  120000.0
For better readability with multiple columns, transpose the output:
# Transpose for easier reading
full_grouped = df.groupby('department').describe()
print(full_grouped.T)
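If you only need a few of the grouped statistics, you can index the describe() result like any DataFrame. A sketch using the sample department and salary data:

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Engineering', 'Sales', 'Engineering', 'HR', 'Sales',
                   'Engineering', 'HR', 'Sales', 'Engineering', 'HR'],
    'salary': [50000, 65000, 82000, 95000, 120000,
               45000, 70000, 88000, 105000, 72000],
})

# describe() per group, then keep only the columns of interest
grouped = df.groupby('department')['salary'].describe()
print(grouped[['count', 'mean', '50%']])
```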
Checking Data Quality
Build a quick data quality report:
def data_quality_report(df):
    report = pd.DataFrame({
        'dtype': df.dtypes,
        'non_null': df.count(),
        'null_count': df.isnull().sum(),
        'unique': df.nunique()
    })
    return report
print(data_quality_report(df))
Saving Describe Output
Export statistics for documentation or sharing:
# Save to CSV
df.describe().to_csv('summary_statistics.csv')
# Round and convert to Markdown for cleaner display
# (to_markdown() requires the optional tabulate package)
print(df.describe().round(2).to_markdown())
Conclusion
The describe() method provides immediate insight into your data’s characteristics. You now know how to interpret each statistic, analyze different data types with include and exclude, customize percentiles for specific analyses, and combine describe() with groupby() for comparative analysis.
For deeper statistical analysis, consider these next steps: use df.corr() to examine relationships between numeric columns, apply df.value_counts() for detailed categorical breakdowns, or leverage scipy.stats for hypothesis testing and distribution fitting. But start with describe()—it remains the fastest way to understand what you’re working with.