How to Bin Data in Pandas


Key Insights

  • Use pd.cut() when you need equal-width bins (same range per bin) and pd.qcut() when you need equal-frequency bins (same count per bin)—choosing the wrong one is a common source of misleading analysis
  • The labels, right, and include_lowest parameters give you precise control over bin boundaries and how your data gets categorized
  • Binning transforms continuous variables into categorical ones, which is essential for grouped analysis, visualization, and preparing features for certain machine learning algorithms

Introduction to Data Binning

Binning—also called discretization or bucketing—converts continuous numerical data into discrete categories. You take a range of values and group them into bins, turning something like “age: 27” into “age group: 25-35.”

Why bother? Three main reasons:

  1. Noise reduction: Raw data often has more precision than you need. Binning smooths out minor variations and highlights broader patterns.
  2. Categorical analysis: Many analyses work better with groups. “How do sales differ by age group?” is often more useful than correlating exact ages.
  3. ML preprocessing: Some algorithms (categorical naive Bayes, for example) expect discrete inputs. Binning also helps linear models approximate non-linear relationships they can’t otherwise capture.

Pandas gives you two primary tools: pd.cut() and pd.qcut(). They look similar but serve different purposes. Let’s dig in.

Using pd.cut() for Equal-Width Bins

pd.cut() divides data into bins of equal width. If your data ranges from 0 to 100 and you request 4 bins, each bin spans 25 units: 0-25, 25-50, 50-75, 75-100.

Here’s the basic syntax:

import pandas as pd
import numpy as np

# Sample age data
ages = pd.Series([5, 17, 22, 34, 45, 52, 67, 78, 89, 12, 28, 41])

# Create 4 equal-width bins
age_bins = pd.cut(ages, bins=4)
print(age_bins)

Output:

0      (4.916, 26.0]
1      (4.916, 26.0]
2      (4.916, 26.0]
3      (26.0, 47.0]
4      (26.0, 47.0]
5      (47.0, 68.0]
6      (47.0, 68.0]
7      (68.0, 89.0]
8      (68.0, 89.0]
9      (4.916, 26.0]
10     (26.0, 47.0]
11     (26.0, 47.0]
dtype: category

The output shows interval notation: (4.916, 26.0] means “greater than 4.916 and less than or equal to 26.0.” The parenthesis means exclusive; the bracket means inclusive.

More often, you’ll want custom bin edges that make sense for your domain:

# Define meaningful age ranges
bin_edges = [0, 18, 35, 50, 100]
age_groups = pd.cut(ages, bins=bin_edges)
print(age_groups.value_counts().sort_index())

Output:

(0, 18]      3
(18, 35]     3
(35, 50]     2
(50, 100]    4
dtype: int64

Now we have interpretable categories: children/teens (0-18), young adults (18-35), middle-aged (35-50), and older adults (50-100).
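If you let pandas compute the edges but need them later, for example to bin new data with the same boundaries, both functions accept a retbins=True parameter that returns the edges alongside the categories. A quick sketch with the same ages series:

```python
import pandas as pd

ages = pd.Series([5, 17, 22, 34, 45, 52, 67, 78, 89, 12, 28, 41])

# retbins=True returns (categories, edges); save the edges
# to apply identical boundaries to fresh data later
age_groups, edges = pd.cut(ages, bins=4, retbins=True)
print(edges)  # array([ 4.916, 26., 47., 68., 89.])

# Reuse the saved edges on new data
new_ages = pd.Series([19, 33, 71])
print(pd.cut(new_ages, bins=edges))
```

This matters in ML pipelines: compute edges on the training set, then reuse them on validation and test data so every split is binned identically.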

Using pd.qcut() for Quantile-Based Bins

pd.qcut() creates bins with equal numbers of observations rather than equal widths. This is quantile-based binning—each bin contains roughly the same percentage of your data.

When should you use qcut() instead of cut()? When your data is skewed. Income data is the classic example:

# Simulated income data (right-skewed, as income typically is)
np.random.seed(42)
incomes = pd.Series(np.random.exponential(scale=50000, size=1000))

# Equal-width bins (problematic for skewed data)
income_cut = pd.cut(incomes, bins=4)
print("Equal-width bins:")
print(income_cut.value_counts().sort_index())
print()

# Quantile-based bins (equal frequency)
income_qcut = pd.qcut(incomes, q=4)
print("Quantile-based bins:")
print(income_qcut.value_counts().sort_index())

Output:

Equal-width bins:
(-357.597, 89498.648]     948
(89498.648, 178639.893]    46
(178639.893, 267781.138]    5
(267781.138, 356922.383]    1
dtype: int64

Quantile-based bins:
(155.513, 24175.652]      250
(24175.652, 46024.291]    250
(46024.291, 79058.129]    250
(79058.129, 356922.383]   250
dtype: int64

With cut(), 948 of 1000 observations land in the first bin—useless for analysis. With qcut(), each bin has exactly 250 observations. You can now meaningfully compare “bottom quartile earners” to “top quartile earners.”

You can also request a different number of quantiles and label them directly:

# Create quintiles (5 groups of 20% each)
quintiles = pd.qcut(incomes, q=5, labels=['Bottom 20%', '20-40%', '40-60%', '60-80%', 'Top 20%'])
print(quintiles.value_counts())
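One qcut() pitfall worth knowing: if your data has heavy ties, two quantile edges can coincide and qcut() raises a ValueError. Passing duplicates='drop' collapses the repeated edges instead, at the cost of fewer bins than you asked for. A small sketch:

```python
import pandas as pd

# Heavily tied data: most values are 0, so the 25th and 50th
# percentiles are both 0 and the quartile edges collide
tied = pd.Series([0, 0, 0, 0, 0, 0, 1, 2, 3, 4])

try:
    pd.qcut(tied, q=4)
    raised = False
except ValueError:
    raised = True
print("raised:", raised)  # raised: True

# duplicates='drop' merges the repeated edges instead of raising
binned = pd.qcut(tied, q=4, duplicates='drop')
print(binned.cat.categories)  # fewer than 4 intervals remain
```

If you hit this, it usually means quantile binning isn't a great fit for that column in the first place; a tie-heavy distribution can't be split into equal-frequency groups cleanly.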

Customizing Bin Labels and Edges

Raw interval notation isn’t user-friendly. Add labels to make your bins readable:

# Customer satisfaction scores (1-10)
scores = pd.Series([2, 4, 5, 6, 7, 8, 9, 3, 7, 8, 6, 5, 9, 10, 1])

# Create bins with descriptive labels
bins = [0, 4, 7, 10]
labels = ['Detractor', 'Passive', 'Promoter']

nps_categories = pd.cut(scores, bins=bins, labels=labels)
print(nps_categories.value_counts())

Output:

Passive      6
Promoter     5
Detractor    4
dtype: int64

Now you have a Net Promoter Score-style categorization.
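A useful side effect: when you pass labels, pd.cut() returns an ordered categorical, so the labels sort and compare in bin order rather than alphabetically:

```python
import pandas as pd

scores = pd.Series([2, 4, 5, 6, 7, 8, 9, 3, 7, 8, 6, 5, 9, 10, 1])
nps = pd.cut(scores, bins=[0, 4, 7, 10],
             labels=['Detractor', 'Passive', 'Promoter'])

# The categorical dtype remembers the bin order
print(nps.cat.ordered)  # True

# So comparisons follow that order, not alphabetical order
print((nps > 'Passive').sum())  # counts only the Promoters
```

This is why sorting or filtering binned data "just works": 'Detractor' < 'Passive' < 'Promoter' even though the alphabet disagrees.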

The right parameter controls which side of the bin is inclusive:

# Default: right=True means (a, b] — excludes left, includes right
pd.cut([1, 2, 3], bins=[0, 1, 2, 3], right=True)
# Result: (0, 1], (1, 2], (2, 3]

# right=False means [a, b) — includes left, excludes right
pd.cut([1, 2, 3], bins=[0, 1, 2, 3], right=False)
# Result: [1, 2), [2, 3), NaN (3 now falls outside the last bin)

The include_lowest parameter handles the leftmost edge:

data = pd.Series([0, 5, 10, 15, 20])

# Without include_lowest, the 0 would fall outside (0, 10] and become NaN
result = pd.cut(data, bins=[0, 10, 20], include_lowest=True)
print(result)

This ensures the minimum value in your data gets included in the first bin rather than becoming NaN.
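To see the difference directly, compare the same bins with and without the flag:

```python
import pandas as pd

data = pd.Series([0, 5, 10, 15, 20])

# Default: the first interval is (0, 10], so the 0 becomes NaN
default_bins = pd.cut(data, bins=[0, 10, 20])
print(default_bins.isna().sum())  # 1

# include_lowest=True closes the first interval on the left, keeping the 0
closed_bins = pd.cut(data, bins=[0, 10, 20], include_lowest=True)
print(closed_bins.isna().sum())  # 0
```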

Handling Edge Cases and Missing Values

What happens when values fall outside your bin ranges?

data = pd.Series([5, 15, 25, 35, 150])  # 150 is outside our bins
bins = [0, 10, 20, 30, 40]

result = pd.cut(data, bins=bins)
print(result)

Output:

0      (0, 10]
1    (10, 20]
2    (20, 30]
3    (30, 40]
4        NaN
dtype: category

Values outside the bin range become NaN. This is actually useful—it flags data quality issues.
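Since this input contains no missing values, every NaN in the result marks an out-of-range value, which makes the offenders easy to count and inspect:

```python
import pandas as pd

data = pd.Series([5, 15, 25, 35, 150])
binned = pd.cut(data, bins=[0, 10, 20, 30, 40])

# The input has no NaNs, so every NaN in the result is out-of-range
print(binned.isna().sum())  # 1

# Use the NaN mask to pull out the raw values that didn't fit
print(data[binned.isna()].tolist())  # [150]
```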

For missing values in your input data, both cut() and qcut() preserve them as NaN:

data_with_nan = pd.Series([5, np.nan, 15, 25, np.nan])
result = pd.cut(data_with_nan, bins=[0, 10, 20, 30])
print(result)

Output:

0      (0, 10]
1          NaN
2    (10, 20]
3    (20, 30]
4          NaN
dtype: category

If you need to handle out-of-range values differently, extend your bins:

# Use infinity to catch all values
bins = [0, 10, 20, 30, np.inf]
labels = ['0-10', '10-20', '20-30', '30+']

result = pd.cut(data, bins=bins, labels=labels)
print(result)

You can also access the underlying interval objects for programmatic work:

binned = pd.cut(data, bins=[0, 10, 20, 30, 40])
print(binned.cat.categories)
# IntervalIndex([(0, 10], (10, 20], (20, 30], (30, 40]], dtype='interval[int64, right]')

# Check if a value falls in an interval
interval = binned.cat.categories[0]  # (0, 10]
print(5 in interval)  # True
print(10 in interval)  # True (right-inclusive)
print(0 in interval)  # False (left-exclusive)

Practical Example: Binning for Analysis

Let’s put it all together with a realistic analysis scenario. You have sales data and want to understand how order size affects other metrics:

import pandas as pd
import numpy as np

# Create sample sales data
np.random.seed(42)
n = 500

sales_data = pd.DataFrame({
    'order_id': range(1, n + 1),
    'order_value': np.random.exponential(scale=150, size=n),
    'items_count': np.random.poisson(lam=3, size=n) + 1,
    'customer_age': np.random.normal(loc=40, scale=12, size=n).clip(18, 80),
    'return_flag': np.random.choice([0, 1], size=n, p=[0.85, 0.15])
})

# Bin order values into meaningful categories using qcut (equal frequency)
sales_data['order_tier'] = pd.qcut(
    sales_data['order_value'],
    q=4,
    labels=['Small', 'Medium', 'Large', 'Premium']
)

# Bin customer age using cut (meaningful ranges); clipping at 18 leaves some
# values sitting exactly on the lowest edge, so include_lowest=True is needed
# to keep them out of the NaN bucket
age_bins = [18, 25, 35, 50, 65, 80]
age_labels = ['18-25', '26-35', '36-50', '51-65', '65+']
sales_data['age_group'] = pd.cut(
    sales_data['customer_age'],
    bins=age_bins,
    labels=age_labels,
    include_lowest=True
)

# Analyze by order tier
tier_analysis = sales_data.groupby('order_tier', observed=True).agg({
    'order_value': ['mean', 'min', 'max'],
    'items_count': 'mean',
    'return_flag': 'mean'  # Return rate
}).round(2)

print("Analysis by Order Tier:")
print(tier_analysis)
print()

# Cross-tabulation: order tier by age group
cross_tab = pd.crosstab(
    sales_data['age_group'],
    sales_data['order_tier'],
    normalize='index'
).round(3) * 100

print("Order Tier Distribution by Age Group (%):")
print(cross_tab)

Output:

Analysis by Order Tier:
            order_value              items_count return_flag
                   mean    min     max        mean        mean
order_tier                                                    
Small             34.98   0.47   68.97        3.06        0.14
Medium           100.09  69.07  133.86        3.10        0.18
Large            182.84 134.04  246.38        3.06        0.14
Premium          390.71 246.96 1017.03        3.06        0.14

Order Tier Distribution by Age Group (%):
order_tier   Small  Medium  Large  Premium
age_group                                  
18-25         28.6    22.9   22.9     25.7
26-35         26.4    24.5   23.6     25.5
36-50         24.0    25.3   27.3     23.3
51-65         24.8    26.4   23.2     25.6
65+           26.7    23.3   23.3     26.7

This analysis reveals insights you couldn’t get from raw continuous data: return rates by order tier, spending patterns by age group, and clear segments for business decisions.

The key takeaway: binning isn’t just data transformation—it’s a lens that reveals patterns hidden in continuous noise. Choose your binning strategy based on your data distribution (cut vs. qcut) and your analytical goals (equal ranges vs. equal frequencies). Get this right, and your analysis becomes both more interpretable and more actionable.
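One last shortcut worth knowing: for a quick equal-width count, Series.value_counts() accepts a bins parameter, a convenience wrapper around pd.cut():

```python
import pandas as pd

values = pd.Series([3, 7, 12, 18, 24, 29, 35, 41])

# bins=4 groups the values into 4 equal-width intervals before counting;
# sort=False keeps the intervals in order instead of sorting by count
counts = values.value_counts(bins=4, sort=False)
print(counts)
```

For anything beyond a one-off look at a distribution, though, reach for pd.cut() or pd.qcut() directly so you control the edges and labels.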
