How to Bin Data in Pandas
Key Insights
- Use pd.cut() when you need equal-width bins (same range per bin) and pd.qcut() when you need equal-frequency bins (same count per bin)—choosing the wrong one is a common source of misleading analysis
- The labels, right, and include_lowest parameters give you precise control over bin boundaries and how your data gets categorized
- Binning transforms continuous variables into categorical ones, which is essential for grouped analysis, visualization, and preparing features for certain machine learning algorithms
Introduction to Data Binning
Binning—also called discretization or bucketing—converts continuous numerical data into discrete categories. You take a range of values and group them into bins, turning something like “age: 27” into “age group: 25-35.”
Why bother? Three main reasons:
- Noise reduction: Raw data often has more precision than you need. Binning smooths out minor variations and highlights broader patterns.
- Categorical analysis: Many analyses work better with groups. “How do sales differ by age group?” is often more useful than correlating exact ages.
- ML preprocessing: Some algorithms (decision trees, naive Bayes) work better with categorical features. Binning also helps when you have non-linear relationships that linear models can’t capture.
Pandas gives you two primary tools: pd.cut() and pd.qcut(). They look similar but serve different purposes. Let’s dig in.
Using pd.cut() for Equal-Width Bins
pd.cut() divides data into bins of equal width. If your data ranges from 0 to 100 and you request 4 bins, each bin spans 25 units: 0-25, 25-50, 50-75, 75-100.
Here’s the basic syntax:
import pandas as pd
import numpy as np
# Sample age data
ages = pd.Series([5, 17, 22, 34, 45, 52, 67, 78, 89, 12, 28, 41])
# Create 4 equal-width bins
age_bins = pd.cut(ages, bins=4)
print(age_bins)
Output:
0 (4.916, 26.0]
1 (4.916, 26.0]
2 (4.916, 26.0]
3 (26.0, 47.0]
4 (26.0, 47.0]
5 (47.0, 68.0]
6 (47.0, 68.0]
7 (68.0, 89.0]
8 (68.0, 89.0]
9 (4.916, 26.0]
10 (26.0, 47.0]
11 (26.0, 47.0]
dtype: category
The output shows interval notation: (4.916, 26.0] means “greater than 4.916 and less than or equal to 26.0.” The parenthesis means exclusive; the bracket means inclusive. Notice the first edge is 4.916 rather than 5: pandas nudges the lowest boundary slightly below the minimum so the smallest value isn't excluded by the open left side.
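If those auto-generated edges look noisy, pd.cut() accepts a precision parameter that controls how many decimal places appear in the bin labels. A quick sketch with the same illustrative age data:

```python
import pandas as pd

ages = pd.Series([5, 17, 22, 34, 45, 52, 67, 78, 89, 12, 28, 41])

# precision controls the decimal places shown in the auto-generated labels;
# the underlying bin assignment still uses the exact edges
rounded = pd.cut(ages, bins=4, precision=0)
print(rounded.cat.categories)
```

This only affects how the intervals are displayed, which is handy for reports where edges like 4.916 would distract.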
More often, you’ll want custom bin edges that make sense for your domain:
# Define meaningful age ranges
bin_edges = [0, 18, 35, 50, 100]
age_groups = pd.cut(ages, bins=bin_edges)
print(age_groups.value_counts().sort_index())
Output:
(0, 18] 3
(18, 35] 3
(35, 50] 2
(50, 100] 4
dtype: int64
Now we have interpretable categories: children/teens (0-18), young adults (18-35), middle-aged (35-50), and older adults (50-100).
Using pd.qcut() for Quantile-Based Bins
pd.qcut() creates bins with equal numbers of observations rather than equal widths. This is quantile-based binning—each bin contains roughly the same percentage of your data.
When should you use qcut() instead of cut()? When your data is skewed. Income data is the classic example:
# Simulated income data (right-skewed, as income typically is)
np.random.seed(42)
incomes = pd.Series(np.random.exponential(scale=50000, size=1000))
# Equal-width bins (problematic for skewed data)
income_cut = pd.cut(incomes, bins=4)
print("Equal-width bins:")
print(income_cut.value_counts().sort_index())
print()
# Quantile-based bins (equal frequency)
income_qcut = pd.qcut(incomes, q=4)
print("Quantile-based bins:")
print(income_qcut.value_counts().sort_index())
Output:
Equal-width bins:
(-357.597, 89498.648] 948
(89498.648, 178639.893] 46
(178639.893, 267781.138] 5
(267781.138, 356922.383] 1
dtype: int64
Quantile-based bins:
(155.513, 24175.652] 250
(24175.652, 46024.291] 250
(46024.291, 79058.129] 250
(79058.129, 356922.383] 250
dtype: int64
With cut(), 948 of 1000 observations land in the first bin—useless for analysis. With qcut(), each bin has exactly 250 observations. You can now meaningfully compare “bottom quartile earners” to “top quartile earners.”
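If you would rather work with integer bin codes than Interval objects, both cut() and qcut() accept labels=False. A small sketch with made-up data:

```python
import pandas as pd

data = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])

# labels=False returns the integer code of each bin instead of intervals
codes = pd.qcut(data, q=4, labels=False)
print(codes.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Integer codes are convenient when the binned column feeds directly into a model or a sort order.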
You can also specify exact quantiles:
# Create quintiles (5 groups of 20% each)
quintiles = pd.qcut(incomes, q=5, labels=['Bottom 20%', '20-40%', '40-60%', '60-80%', 'Top 20%'])
print(quintiles.value_counts())
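One gotcha with qcut(): if your data has many repeated values, two quantile edges can coincide and pandas raises a ValueError by default. The duplicates='drop' parameter merges the colliding edges, at the cost of ending up with fewer bins than requested. A sketch with deliberately repetitive data:

```python
import pandas as pd

# Heavily repeated values make the 0% and 25% quantile edges collide
data = pd.Series([1, 1, 1, 1, 1, 2, 3, 4, 5, 6])

# Default duplicates='raise' would fail here; 'drop' merges duplicate edges
binned = pd.qcut(data, q=4, duplicates='drop')
print(binned.value_counts().sort_index())
print(len(binned.cat.categories))  # 3 bins instead of the requested 4
```

Check the resulting category count afterward so you notice when bins were silently merged.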
Customizing Bin Labels and Edges
Raw interval notation isn’t user-friendly. Add labels to make your bins readable:
# Customer satisfaction scores (1-10)
scores = pd.Series([2, 4, 5, 6, 7, 8, 9, 3, 7, 8, 6, 5, 9, 10, 1])
# Create bins with descriptive labels
bins = [0, 4, 7, 10]
labels = ['Detractor', 'Passive', 'Promoter']
nps_categories = pd.cut(scores, bins=bins, labels=labels)
print(nps_categories.value_counts())
Output:
Promoter 6
Passive 5
Detractor 4
dtype: int64
Now you have a Net Promoter Score-style categorization.
The right parameter controls which side of the bin is inclusive:
# Default: right=True means (a, b] — excludes left, includes right
pd.cut([1, 2, 3], bins=[0, 1, 2, 3], right=True)
# Result: (0, 1], (1, 2], (2, 3]
# right=False means [a, b) — includes left, excludes right
pd.cut([1, 2, 3], bins=[0, 1, 2, 3], right=False)
# Result: [0, 1), [1, 2), [2, 3)
The include_lowest parameter handles the leftmost edge:
data = pd.Series([0, 5, 10, 15, 20])
# Without include_lowest=True, the minimum value 0 would fall outside (0, 10] and become NaN
result = pd.cut(data, bins=[0, 10, 20], include_lowest=True)
print(result)
This ensures the minimum value in your data gets included in the first bin rather than becoming NaN.
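A related parameter worth knowing is retbins=True, which returns the computed edges alongside the binned data. That lets you apply the exact same bins to new data later, for example binning a test set with edges learned from a training set. A minimal sketch with illustrative values:

```python
import pandas as pd

train = pd.Series([3, 8, 15, 22, 40, 55])

# retbins=True also returns the edge array pandas computed
binned_train, edges = pd.cut(train, bins=3, retbins=True)
print(edges)

# Reuse those edges so new data gets identical categories
new_data = pd.Series([5, 30, 50])
binned_new = pd.cut(new_data, bins=edges)
print(binned_new)
```

Without this, calling pd.cut(new_data, bins=3) would compute fresh edges from the new data's own range, producing categories that don't line up with the originals.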
Handling Edge Cases and Missing Values
What happens when values fall outside your bin ranges?
data = pd.Series([5, 15, 25, 35, 150]) # 150 is outside our bins
bins = [0, 10, 20, 30, 40]
result = pd.cut(data, bins=bins)
print(result)
Output:
0 (0, 10]
1 (10, 20]
2 (20, 30]
3 (30, 40]
4 NaN
dtype: category
Values outside the bin range become NaN. This is actually useful—it flags data quality issues.
For missing values in your input data, both cut() and qcut() preserve them as NaN:
data_with_nan = pd.Series([5, np.nan, 15, 25, np.nan])
result = pd.cut(data_with_nan, bins=[0, 10, 20, 30])
print(result)
Output:
0 (0, 10]
1 NaN
2 (10, 20]
3 (20, 30]
4 NaN
dtype: category
If you need to handle out-of-range values differently, extend your bins:
# Use infinity to catch all values
bins = [0, 10, 20, 30, np.inf]
labels = ['0-10', '10-20', '20-30', '30+']
result = pd.cut(data, bins=bins, labels=labels)
print(result)
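Another option is to keep the bins tight but convert the resulting NaNs into an explicit category, so out-of-range values stay visible in counts and group-bys instead of being silently dropped. A sketch reusing the same example data:

```python
import pandas as pd

data = pd.Series([5, 15, 25, 35, 150])  # 150 is outside the bins
result = pd.cut(data, bins=[0, 10, 20, 30, 40])

# Register a new category, then fill the NaNs produced by out-of-range values
result = result.cat.add_categories('Out of range').fillna('Out of range')
print(result.value_counts())
```

The add_categories step is required first: fillna on a categorical Series only accepts values that are already valid categories.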
You can also access the underlying interval objects for programmatic work:
binned = pd.cut(data, bins=[0, 10, 20, 30, 40])
print(binned.cat.categories)
# IntervalIndex([(0, 10], (10, 20], (20, 30], (30, 40]], dtype='interval[int64, right]')
# Check if a value falls in an interval
interval = binned.cat.categories[0] # (0, 10]
print(5 in interval) # True
print(10 in interval) # True (right-inclusive)
print(0 in interval) # False (left-exclusive)
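Since binning produces a categorical column, it also slots neatly into the ML preprocessing use case from the introduction: pd.get_dummies() one-hot encodes the bins for models that need numeric input. A sketch with illustrative ages and labels:

```python
import pandas as pd

ages = pd.Series([5, 17, 22, 34, 45, 52, 67])
groups = pd.cut(ages, bins=[0, 18, 35, 50, 100],
                labels=['child', 'young', 'middle', 'older'])

# One-hot encode the binned categories into indicator columns
features = pd.get_dummies(groups, prefix='age')
print(features.head())
```

Each row gets exactly one True indicator, and because the input is categorical, unused bins would still appear as (all-False) columns, keeping the feature layout stable.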
Practical Example: Binning for Analysis
Let’s put it all together with a realistic analysis scenario. You have sales data and want to understand how order size affects other metrics:
import pandas as pd
import numpy as np
# Create sample sales data
np.random.seed(42)
n = 500
sales_data = pd.DataFrame({
    'order_id': range(1, n + 1),
    'order_value': np.random.exponential(scale=150, size=n),
    'items_count': np.random.poisson(lam=3, size=n) + 1,
    'customer_age': np.random.normal(loc=40, scale=12, size=n).clip(18, 80),
    'return_flag': np.random.choice([0, 1], size=n, p=[0.85, 0.15])
})
# Bin order values into meaningful categories using qcut (equal frequency)
sales_data['order_tier'] = pd.qcut(
    sales_data['order_value'],
    q=4,
    labels=['Small', 'Medium', 'Large', 'Premium']
)
# Bin customer age using cut (meaningful ranges)
age_bins = [18, 25, 35, 50, 65, 80]
age_labels = ['18-25', '26-35', '36-50', '51-65', '65+']
sales_data['age_group'] = pd.cut(
    sales_data['customer_age'],
    bins=age_bins,
    labels=age_labels,
    include_lowest=True  # ages clipped to exactly 18 stay in the first bin
)
# Analyze by order tier
tier_analysis = sales_data.groupby('order_tier', observed=True).agg({
    'order_value': ['mean', 'min', 'max'],
    'items_count': 'mean',
    'return_flag': 'mean'  # Return rate
}).round(2)
print("Analysis by Order Tier:")
print(tier_analysis)
print()
# Cross-tabulation: order tier by age group
cross_tab = (pd.crosstab(
    sales_data['age_group'],
    sales_data['order_tier'],
    normalize='index'
) * 100).round(1)
print("Order Tier Distribution by Age Group (%):")
print(cross_tab)
Output:
Analysis by Order Tier:
order_value items_count return_flag
mean min max mean mean
order_tier
Small 34.98 0.47 68.97 3.06 0.14
Medium 100.09 69.07 133.86 3.10 0.18
Large 182.84 134.04 246.38 3.06 0.14
Premium 390.71 246.96 1017.03 3.06 0.14
Order Tier Distribution by Age Group (%):
order_tier Small Medium Large Premium
age_group
18-25 28.6 22.9 22.9 25.7
26-35 26.4 24.5 23.6 25.5
36-50 24.0 25.3 27.3 23.3
51-65 24.8 26.4 23.2 25.6
65+ 26.7 23.3 23.3 26.7
This analysis reveals insights you couldn’t get from raw continuous data: return rates by order tier, spending patterns by age group, and clear segments for business decisions.
The key takeaway: binning isn’t just data transformation—it’s a lens that reveals patterns hidden in continuous noise. Choose your binning strategy based on your data distribution (cut vs. qcut) and your analytical goals (equal ranges vs. equal frequencies). Get this right, and your analysis becomes both more interpretable and more actionable.