How to Use pd.cut in Pandas
Key Insights
- pd.cut transforms continuous numerical data into discrete bins, enabling categorical analysis and cleaner visualizations of distributions
- Custom bin edges and labels give you precise control over how data gets categorized, making results interpretable for stakeholders
- Understanding the right and include_lowest parameters prevents off-by-one errors that silently corrupt your analysis
Introduction to Binning with pd.cut
Continuous numerical data is messy. When you’re analyzing customer ages, transaction amounts, or test scores, the raw numbers often obscure patterns that become obvious once you group them into meaningful categories. This process—called binning or discretization—converts continuous variables into discrete intervals.
pd.cut is Pandas’ primary tool for this job. It takes a series of numbers and assigns each value to a bin based on where it falls within specified ranges. The use cases are everywhere: converting ages into demographic groups for marketing analysis, transforming income into tax brackets, categorizing response times into performance tiers, or turning raw scores into letter grades.
Binning serves three practical purposes. First, it simplifies analysis by reducing noise in your data. Second, it enables categorical operations like groupby aggregations across meaningful segments. Third, it produces visualizations that humans can actually interpret—nobody wants to see a chart with 47 distinct age values when “18-25” and “26-35” tell the story better.
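As a quick sketch of the idea (the ages and group labels here are made up for illustration):

```python
import pandas as pd

# Raw ages are hard to summarize directly
ages = pd.Series([19, 23, 27, 31, 34, 41])

# Binning turns them into a handful of interpretable groups
groups = pd.cut(ages, bins=[18, 25, 35, 45],
                labels=['18-25', '26-35', '36-45'])
print(groups.value_counts().sort_index())
```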
Basic Syntax and Parameters
The function signature looks like this:
pd.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')
Here’s what each parameter does:
- x: The input array or Series you want to bin
- bins: Either an integer (number of equal-width bins) or a sequence of bin edges
- right: Whether bins include the right edge (default True)
- labels: Custom names for each bin
- retbins: Whether to return the bin edges along with the result
- precision: Decimal precision for bin labels
- include_lowest: Whether the first bin should include its left edge
- duplicates: How to handle duplicate bin edges (‘raise’ or ‘drop’)
Let’s start with the simplest case—automatic equal-width bins:
import pandas as pd
import numpy as np
# Sample data
values = pd.Series([15, 22, 35, 42, 58, 67, 73, 88, 91, 99])
# Create 4 equal-width bins
binned = pd.cut(values, bins=4)
print(binned)
Output:
0 (14.916, 36.0]
1 (14.916, 36.0]
2 (14.916, 36.0]
3 (36.0, 57.0]
4 (57.0, 78.0]
5 (57.0, 78.0]
6 (57.0, 78.0]
7 (78.0, 99.0]
8 (78.0, 99.0]
9 (78.0, 99.0]
dtype: category
The output is a Categorical Series. Each value shows the interval notation: parentheses mean “exclusive” and brackets mean “inclusive.” So (14.916, 36.0] includes 36 but not 14.916.
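These intervals are real objects, not just strings—you can inspect their endpoints and test membership directly:

```python
import pandas as pd

values = pd.Series([15, 22, 35, 42, 58, 67, 73, 88, 91, 99])
binned = pd.cut(values, bins=4)

# The bins live in .cat.categories as an IntervalIndex
print(binned.cat.categories)

# Each Interval exposes its endpoints and supports `in`
first = binned.cat.categories[0]
print(first.left, first.right)
print(36.0 in first)  # True: right edge is inclusive
print(36.5 in first)  # False: beyond the right edge
```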
Creating Custom Bin Edges
Automatic equal-width bins rarely match real-world requirements. You almost always need custom breakpoints that align with business logic or domain knowledge.
Pass a list of edges to the bins parameter:
ages = pd.Series([5, 12, 17, 23, 34, 45, 52, 61, 78, 85])
# Define custom age groups
bins = [0, 18, 35, 50, 65, 100]
age_groups = pd.cut(ages, bins=bins)
print(age_groups)
Output:
0 (0, 18]
1 (0, 18]
2 (0, 18]
3 (18, 35]
4 (18, 35]
5 (35, 50]
6 (50, 65]
7 (50, 65]
8 (65, 100]
9 (65, 100]
dtype: category
Notice that you need one more edge than the number of bins you want. Five edges create four bins. The edges define the boundaries: 0-18, 18-35, 35-50, 50-65, and 65-100.
One gotcha: values outside your bin range become NaN. If you had an age of 105 in this dataset, it wouldn’t fit into any bin and would be assigned a null value. Always ensure your edges cover the full range of your data.
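A quick sketch of that failure mode, plus one common fix—making the last bin open-ended with np.inf:

```python
import pandas as pd
import numpy as np

ages = pd.Series([5, 23, 105])
bins = [0, 18, 35, 50, 65, 100]

# 105 falls outside the last edge and becomes NaN
groups = pd.cut(ages, bins=bins)
print(groups.isna().sum())  # 1

# An open-ended last bin catches any large value
safe = pd.cut(ages, bins=[0, 18, 35, 50, 65, np.inf])
print(safe.isna().sum())    # 0
```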
Adding Labels to Bins
Interval notation is precise but ugly. The labels parameter lets you assign human-readable names:
scores = pd.Series([95, 87, 76, 68, 54, 82, 91, 73, 45, 88])
# Define grade boundaries and labels
bins = [0, 60, 70, 80, 90, 100]
labels = ['F', 'D', 'C', 'B', 'A']
grades = pd.cut(scores, bins=bins, labels=labels)
print(pd.DataFrame({'score': scores, 'grade': grades}))
Output:
score grade
0 95 A
1 87 B
2 76 C
3 68 D
4 54 F
5 82 B
6 91 A
7 73 C
8 45 F
9 88 B
The number of labels must equal the number of bins (one fewer than the number of edges). Get this wrong and Pandas throws a ValueError.
You can also pass labels=False to get integer bin indices instead of categories:
grades_numeric = pd.cut(scores, bins=bins, labels=False)
print(grades_numeric.tolist())
# Output: [4, 3, 2, 1, 0, 3, 4, 2, 0, 3]
This is useful when you need ordinal encoding for machine learning models.
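Because pd.cut returns an ordered categorical, labeled grades also support direct comparisons—a sketch using the same scores:

```python
import pandas as pd

scores = pd.Series([95, 87, 76, 68, 54, 82, 91, 73, 45, 88])
bins = [0, 60, 70, 80, 90, 100]
labels = ['F', 'D', 'C', 'B', 'A']

grades = pd.cut(scores, bins=bins, labels=labels)

# The categorical preserves bin order, so comparisons work
passing = grades >= 'C'
print(scores[passing])

# .cat.codes gives the same integer encoding as labels=False
print(grades.cat.codes.tolist())
```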
Handling Edge Cases
The right and include_lowest parameters control exactly which values fall into which bins. This matters more than you’d think.
By default, right=True means bins are closed on the right: (a, b] includes b but not a. Setting right=False flips this to [a, b):
values = pd.Series([10, 20, 30, 40, 50])
bins = [10, 20, 30, 40, 50]
# Default: right-closed intervals
right_closed = pd.cut(values, bins=bins, right=True)
print("right=True:")
print(right_closed)
# Left-closed intervals
left_closed = pd.cut(values, bins=bins, right=False)
print("\nright=False:")
print(left_closed)
Output:
right=True:
0 NaN
1 (10, 20]
2 (20, 30]
3 (30, 40]
4 (40, 50]
dtype: category
right=False:
0 [10, 20)
1 [20, 30)
2 [30, 40)
3 [40, 50)
4 NaN
dtype: category
See the problem? With right=True, the value 10 falls outside all bins because (10, 20] excludes 10. With right=False, the value 50 becomes NaN because [40, 50) excludes 50.
The include_lowest parameter fixes the left boundary issue:
fixed = pd.cut(values, bins=bins, right=True, include_lowest=True)
print(fixed)
Output:
0 (9.999, 20.0]
1 (9.999, 20.0]
2 (20.0, 30.0]
3 (30.0, 40.0]
4 (40.0, 50.0]
dtype: category
Now 10 is included in the first bin. The label shows a slightly adjusted left edge to indicate this.
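If that adjusted 9.999 edge bothers you, custom labels hide it entirely—a sketch with made-up range labels:

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50])
bins = [10, 20, 30, 40, 50]

# Labels replace the interval notation, so the adjustment never shows
fixed = pd.cut(values, bins=bins, include_lowest=True,
               labels=['10-20', '21-30', '31-40', '41-50'])
print(fixed.tolist())
```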
pd.cut vs pd.qcut
Given an integer bins argument, pd.cut creates equal-width bins. pd.qcut creates equal-frequency bins (quantiles). The difference matters enormously for skewed data.
# Skewed income data (in thousands)
incomes = pd.Series([25, 30, 32, 35, 38, 42, 45, 48, 55, 150])
# Equal-width bins
cut_bins = pd.cut(incomes, bins=4)
print("pd.cut (equal-width):")
print(cut_bins.value_counts().sort_index())
# Equal-frequency bins
qcut_bins = pd.qcut(incomes, q=4)
print("\npd.qcut (equal-frequency):")
print(qcut_bins.value_counts().sort_index())
Output:
pd.cut (equal-width):
(24.875, 56.25] 9
(56.25, 87.5] 0
(87.5, 118.75] 0
(118.75, 150.0] 1
dtype: int64
pd.qcut (equal-frequency):
(24.999, 32.75]    3
(32.75, 40.0]      2
(40.0, 47.25]      2
(47.25, 150.0]     3
dtype: int64
With pd.cut, the outlier income of 150 stretches the bins so wide that 9 of 10 values land in the first bin. The data is effectively not binned at all. pd.qcut forces roughly equal counts per bin, giving you a more useful distribution.
Use pd.cut when the bin boundaries have inherent meaning (age groups, grade thresholds). Use pd.qcut when you want to segment data into percentiles or ensure balanced group sizes.
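One practical pattern when you do reach for pd.qcut: capture its computed edges with retbins=True, then bin new data against the same boundaries using pd.cut. A sketch (the train/new split here is hypothetical):

```python
import pandas as pd

train = pd.Series([25, 30, 32, 35, 38, 42, 45, 48, 55, 150])

# Learn quartile boundaries from the training data
train_binned, edges = pd.qcut(train, q=4, retbins=True)

# Apply those same boundaries to unseen data
new_data = pd.Series([28, 41, 60])
new_binned = pd.cut(new_data, bins=edges, include_lowest=True)
print(new_binned)
```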
Practical Applications
Here’s where binning becomes powerful: combining pd.cut with groupby for segment analysis.
# Simulated e-commerce data
np.random.seed(42)
n = 1000
customers = pd.DataFrame({
'customer_id': range(n),
'age': np.random.randint(18, 75, n),
'purchase_amount': np.random.exponential(50, n) + 10,
'items_bought': np.random.randint(1, 10, n)
})
# Create age segments
age_bins = [18, 25, 35, 45, 55, 65, 75]
age_labels = ['18-24', '25-34', '35-44', '45-54', '55-64', '65+']
customers['age_group'] = pd.cut(customers['age'], bins=age_bins, labels=age_labels, right=False)
# Analyze by segment
segment_analysis = customers.groupby('age_group', observed=True).agg({
'purchase_amount': ['mean', 'median', 'count'],
'items_bought': 'mean'
}).round(2)
print(segment_analysis)
Output:
purchase_amount items_bought
mean median count mean
age_group
18-24 56.83 43.67 118 4.97
25-34 62.35 47.89 175 5.06
35-44 61.29 48.35 182 4.95
45-54 57.23 44.53 188 4.89
55-64 59.69 46.79 171 5.09
65+ 60.14 47.15 166 5.10
This pattern—bin continuous variables, then aggregate—appears constantly in business analytics. You can quickly answer questions like “Which age group has the highest average order value?” or “How does purchase frequency vary across income brackets?”
The observed=True parameter in groupby is important when working with categorical data from pd.cut. It ensures you only see groups that actually exist in your data, not empty categories from the full categorical type.
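A minimal sketch of the difference, using a tiny hypothetical frame where two age groups have no members:

```python
import pandas as pd

df = pd.DataFrame({'age': [22, 28, 31], 'spend': [40, 55, 60]})
df['age_group'] = pd.cut(df['age'], bins=[18, 25, 35, 45, 55],
                         labels=['18-24', '25-34', '35-44', '45-54'])

# observed=False keeps every category, even empty ones (NaN means)
print(df.groupby('age_group', observed=False)['spend'].mean())

# observed=True drops the groups with no members
print(df.groupby('age_group', observed=True)['spend'].mean())
```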
Binning with pd.cut is a fundamental data transformation skill. Master the parameters, understand the edge cases, and you’ll find yourself reaching for it constantly when exploring and presenting numerical data.