How to Use pd.cut in Pandas
Key Insights
- pd.cut transforms continuous numerical data into discrete bins, enabling categorical analysis and cleaner visualizations of distributions
- Custom bin edges and labels give you precise control over how data gets categorized, making results interpretable for stakeholders
- Understanding the right and include_lowest parameters prevents off-by-one errors that silently corrupt your analysis
Introduction to Binning with pd.cut
Continuous numerical data is messy. When you’re analyzing customer ages, transaction amounts, or test scores, the raw numbers often obscure patterns that become obvious once you group them into meaningful categories. This process—called binning or discretization—converts continuous variables into discrete intervals.
pd.cut is Pandas’ primary tool for this job. It takes a series of numbers and assigns each value to a bin based on where it falls within specified ranges. The use cases are everywhere: converting ages into demographic groups for marketing analysis, transforming income into tax brackets, categorizing response times into performance tiers, or turning raw scores into letter grades.
Binning serves three practical purposes. First, it simplifies analysis by reducing noise in your data. Second, it enables categorical operations like groupby aggregations across meaningful segments. Third, it produces visualizations that humans can actually interpret—nobody wants to see a chart with 47 distinct age values when “18-25” and “26-35” tell the story better.
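As a quick sketch of the idea (the ages and group labels here are made up for illustration):

```python
import pandas as pd

# Raw ages are hard to summarize directly
ages = pd.Series([19, 23, 27, 31, 34, 41])

# Binning turns them into a handful of interpretable groups
groups = pd.cut(ages, bins=[18, 25, 35, 45],
                labels=['18-25', '26-35', '36-45'])
print(groups.value_counts().sort_index())
```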
Basic Syntax and Parameters
The function signature looks like this:
pd.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')
Here’s what each parameter does:
- x: The input array or Series you want to bin
- bins: Either an integer (number of equal-width bins) or a sequence of bin edges
- right: Whether bins include the right edge (default True)
- labels: Custom names for each bin
- retbins: Whether to return the bin edges along with the result
- precision: Decimal precision for bin labels
- include_lowest: Whether the first bin should include its left edge
- duplicates: How to handle duplicate bin edges (‘raise’ or ‘drop’)
Let’s start with the simplest case—automatic equal-width bins:
import pandas as pd
import numpy as np
# Sample data
values = pd.Series([15, 22, 35, 42, 58, 67, 73, 88, 91, 99])
# Create 4 equal-width bins
binned = pd.cut(values, bins=4)
print(binned)
Output:
0 (14.916, 36.0]
1 (14.916, 36.0]
2 (14.916, 36.0]
3 (36.0, 57.0]
4 (57.0, 78.0]
5 (57.0, 78.0]
6 (57.0, 78.0]
7 (78.0, 99.0]
8 (78.0, 99.0]
9 (78.0, 99.0]
dtype: category
The output is a Categorical Series. Each value shows the interval notation: parentheses mean “exclusive” and brackets mean “inclusive.” So (14.916, 36.0] includes 36 but not 14.916.
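These intervals are real objects, not just strings—you can inspect their endpoints and test membership directly:

```python
import pandas as pd

values = pd.Series([15, 22, 35, 42, 58, 67, 73, 88, 91, 99])
binned = pd.cut(values, bins=4)

# The bins live in .cat.categories as an IntervalIndex
print(binned.cat.categories)

# Each Interval exposes its endpoints and supports `in`
first = binned.cat.categories[0]
print(first.left, first.right)
print(36.0 in first)  # True: right edge is inclusive
print(36.5 in first)  # False: beyond the right edge
```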
Creating Custom Bin Edges
Automatic equal-width bins rarely match real-world requirements. You almost always need custom breakpoints that align with business logic or domain knowledge.
Pass a list of edges to the bins parameter:
ages = pd.Series([5, 12, 17, 23, 34, 45, 52, 61, 78, 85])
# Define custom age groups
bins = [0, 18, 35, 50, 65, 100]
age_groups = pd.cut(ages, bins=bins)
print(age_groups)
Output:
0 (0, 18]
1 (0, 18]
2 (0, 18]
3 (18, 35]
4 (18, 35]
5 (35, 50]
6 (50, 65]
7 (50, 65]
8 (65, 100]
9 (65, 100]
dtype: category
Notice that you need one more edge than the number of bins you want. Five edges create four bins. The edges define the boundaries: 0-18, 18-35, 35-50, 50-65, and 65-100.
One gotcha: values outside your bin range become NaN. If you had an age of 105 in this dataset, it wouldn’t fit into any bin and would be assigned a null value. Always ensure your edges cover the full range of your data.
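A quick sketch of that failure mode, plus one common fix—making the last bin open-ended with np.inf:

```python
import pandas as pd
import numpy as np

ages = pd.Series([5, 23, 105])
bins = [0, 18, 35, 50, 65, 100]

# 105 falls outside the last edge and becomes NaN
groups = pd.cut(ages, bins=bins)
print(groups.isna().sum())  # 1

# An open-ended last bin catches any large value
safe = pd.cut(ages, bins=[0, 18, 35, 50, 65, np.inf])
print(safe.isna().sum())    # 0
```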
Adding Labels to Bins
Interval notation is precise but ugly. The labels parameter lets you assign human-readable names:
scores = pd.Series([95, 87, 76, 68, 54, 82, 91, 73, 45, 88])
# Define grade boundaries and labels
bins = [0, 60, 70, 80, 90, 100]
labels = ['F', 'D', 'C', 'B', 'A']
grades = pd.cut(scores, bins=bins, labels=labels)
print(pd.DataFrame({'score': scores, 'grade': grades}))
Output:
score grade
0 95 A
1 87 B
2 76 C
3 68 D
4 54 F
5 82 B
6 91 A
7 73 C
8 45 F
9 88 B
The number of labels must equal the number of bins (one fewer than the number of edges). Get this wrong and Pandas throws a ValueError.
You can also pass labels=False to get integer bin indices instead of categories:
grades_numeric = pd.cut(scores, bins=bins, labels=False)
print(grades_numeric.tolist())
# Output: [4, 3, 2, 1, 0, 3, 4, 2, 0, 3]
This is useful when you need ordinal encoding for machine learning models.
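Because pd.cut returns an ordered categorical, labeled grades also support direct comparisons—a sketch using the same scores:

```python
import pandas as pd

scores = pd.Series([95, 87, 76, 68, 54, 82, 91, 73, 45, 88])
bins = [0, 60, 70, 80, 90, 100]
labels = ['F', 'D', 'C', 'B', 'A']

grades = pd.cut(scores, bins=bins, labels=labels)

# The categorical preserves bin order, so comparisons work
passing = grades >= 'C'
print(scores[passing])

# .cat.codes gives the same integer encoding as labels=False
print(grades.cat.codes.tolist())
```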
Handling Edge Cases
The right and include_lowest parameters control exactly which values fall into which bins. This matters more than you’d think.
By default, right=True means bins are closed on the right: (a, b] includes b but not a. Setting right=False flips this to [a, b):
values = pd.Series([10, 20, 30, 40, 50])
bins = [10, 20, 30, 40, 50]
# Default: right-closed intervals
right_closed = pd.cut(values, bins=bins, right=True)
print("right=True:")
print(right_closed)
# Left-closed intervals
left_closed = pd.cut(values, bins=bins, right=False)
print("\nright=False:")
print(left_closed)
Output:
right=True:
0 NaN
1 (10, 20]
2 (20, 30]
3 (30, 40]
4 (40, 50]
dtype: category
right=False:
0 [10, 20)
1 [20, 30)
2 [30, 40)
3 [40, 50)
4 NaN
dtype: category
See the problem? With right=True, the value 10 falls outside all bins because (10, 20] excludes 10. With right=False, the value 50 becomes NaN because [40, 50) excludes 50.
The include_lowest parameter fixes the left boundary issue:
fixed = pd.cut(values, bins=bins, right=True, include_lowest=True)
print(fixed)
Output:
0 (9.999, 20.0]
1 (9.999, 20.0]
2 (20.0, 30.0]
3 (30.0, 40.0]
4 (40.0, 50.0]
dtype: category
Now 10 is included in the first bin. The label shows a slightly adjusted left edge to indicate this.
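If that adjusted 9.999 edge bothers you, custom labels hide it entirely—a sketch with made-up range labels:

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50])
bins = [10, 20, 30, 40, 50]

# Labels replace the interval notation, so the adjustment never shows
fixed = pd.cut(values, bins=bins, include_lowest=True,
               labels=['10-20', '21-30', '31-40', '41-50'])
print(fixed.tolist())
```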
pd.cut vs pd.qcut
Given an integer bins argument, pd.cut creates equal-width bins. pd.qcut creates equal-frequency bins (quantiles). The difference matters enormously for skewed data.
# Skewed income data (in thousands)
incomes = pd.Series([25, 30, 32, 35, 38, 42, 45, 48, 55, 150])
# Equal-width bins
cut_bins = pd.cut(incomes, bins=4)
print("pd.cut (equal-width):")
print(cut_bins.value_counts().sort_index())
# Equal-frequency bins
qcut_bins = pd.qcut(incomes, q=4)
print("\npd.qcut (equal-frequency):")
print(qcut_bins.value_counts().sort_index())
Output:
pd.cut (equal-width):
(24.875, 56.25] 9
(56.25, 87.5] 0
(87.5, 118.75] 0
(118.75, 150.0] 1
dtype: int64
pd.qcut (equal-frequency):
(24.999, 32.75]    3
(32.75, 40.0]      2
(40.0, 47.25]      2
(47.25, 150.0]     3
dtype: int64
With pd.cut, the outlier income of 150 stretches the bins so wide that 9 of 10 values land in the first bin. The data is effectively not binned at all. pd.qcut forces roughly equal counts per bin, giving you a more useful distribution.
Use pd.cut when the bin boundaries have inherent meaning (age groups, grade thresholds). Use pd.qcut when you want to segment data into percentiles or ensure balanced group sizes.
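One practical pattern when you do reach for pd.qcut: capture its computed edges with retbins=True, then bin new data against the same boundaries using pd.cut. A sketch (the train/new split here is hypothetical):

```python
import pandas as pd

train = pd.Series([25, 30, 32, 35, 38, 42, 45, 48, 55, 150])

# Learn quartile boundaries from the training data
train_binned, edges = pd.qcut(train, q=4, retbins=True)

# Apply those same boundaries to unseen data
new_data = pd.Series([28, 41, 60])
new_binned = pd.cut(new_data, bins=edges, include_lowest=True)
print(new_binned)
```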
Practical Applications
Here’s where binning becomes powerful: combining pd.cut with groupby for segment analysis.
# Simulated e-commerce data
np.random.seed(42)
n = 1000
customers = pd.DataFrame({
'customer_id': range(n),
'age': np.random.randint(18, 75, n),
'purchase_amount': np.random.exponential(50, n) + 10,
'items_bought': np.random.randint(1, 10, n)
})
# Create age segments
age_bins = [18, 25, 35, 45, 55, 65, 75]
age_labels = ['18-24', '25-34', '35-44', '45-54', '55-64', '65+']
customers['age_group'] = pd.cut(customers['age'], bins=age_bins, labels=age_labels, right=False)
# Analyze by segment
segment_analysis = customers.groupby('age_group', observed=True).agg({
'purchase_amount': ['mean', 'median', 'count'],
'items_bought': 'mean'
}).round(2)
print(segment_analysis)
Output:
purchase_amount items_bought
mean median count mean
age_group
18-24 56.83 43.67 118 4.97
25-34 62.35 47.89 175 5.06
35-44 61.29 48.35 182 4.95
45-54 57.23 44.53 188 4.89
55-64 59.69 46.79 171 5.09
65+ 60.14 47.15 166 5.10
This pattern—bin continuous variables, then aggregate—appears constantly in business analytics. You can quickly answer questions like “Which age group has the highest average order value?” or “How does purchase frequency vary across income brackets?”
The observed=True parameter in groupby is important when working with categorical data from pd.cut. It ensures you only see groups that actually exist in your data, not empty categories from the full categorical type.
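A minimal sketch of the difference, using a tiny hypothetical frame where two age groups have no members:

```python
import pandas as pd

df = pd.DataFrame({'age': [22, 28, 31], 'spend': [40, 55, 60]})
df['age_group'] = pd.cut(df['age'], bins=[18, 25, 35, 45, 55],
                         labels=['18-24', '25-34', '35-44', '45-54'])

# observed=False keeps every category, even empty ones (NaN means)
print(df.groupby('age_group', observed=False)['spend'].mean())

# observed=True drops the groups with no members
print(df.groupby('age_group', observed=True)['spend'].mean())
```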
Binning with pd.cut is a fundamental data transformation skill. Master the parameters, understand the edge cases, and you’ll find yourself reaching for it constantly when exploring and presenting numerical data.