Pandas - Bin Continuous Data (cut/qcut)


Key Insights

  • pd.cut() bins data into equal-width intervals based on value ranges, ideal when you need consistent bin sizes regardless of data distribution
  • pd.qcut() creates equal-sized bins based on quantiles, ensuring each bin contains approximately the same number of observations
  • Both functions return categorical data by default, but can generate integer labels or custom bin names for downstream processing and analysis
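A minimal sketch of that last point, using a small standalone series (the values and labels here are illustrative, not the article's dataset):

```python
import pandas as pd

ages = pd.Series([22, 35, 47, 63, 71])

# cut returns an ordered Categorical by default
cats = pd.cut(ages, bins=[18, 40, 60, 80], labels=['young', 'mid', 'senior'])
print(cats.dtype)               # category
print(cats.cat.codes.tolist())  # [0, 0, 1, 2, 2]

# labels=False skips the categories and returns the integer codes directly
codes = pd.cut(ages, bins=[18, 40, 60, 80], labels=False)
print(codes.tolist())           # [0, 0, 1, 2, 2]
```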

Understanding Binning in Pandas

Binning transforms continuous numerical data into discrete categories or intervals. This technique is essential for data analysis, visualization, and machine learning feature engineering. Pandas provides two primary functions: cut() for equal-width binning and qcut() for quantile-based binning.

import pandas as pd
import numpy as np

# Sample dataset
np.random.seed(42)
data = pd.DataFrame({
    'age': np.random.randint(18, 80, 1000),
    'income': np.random.exponential(50000, 1000),
    'score': np.random.normal(75, 15, 1000)
})

print(data.head())

Using pd.cut() for Equal-Width Bins

pd.cut() divides the range of values into equal-width intervals. Specify either the number of bins or explicit bin edges.

# Create 5 equal-width bins
data['age_bins'] = pd.cut(data['age'], bins=5)
print(data['age_bins'].value_counts().sort_index())

# Output shows right-closed intervals like (17.939, 30.2], (30.2, 42.4], etc.

Define custom bin edges for precise control:

# Custom age groups
age_edges = [0, 25, 40, 60, 100]
age_labels = ['Young', 'Middle', 'Senior', 'Elderly']

data['age_group'] = pd.cut(data['age'], 
                            bins=age_edges, 
                            labels=age_labels)

print(data['age_group'].value_counts())
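Because labeled bins form an ordered categorical, you can also filter with comparison operators. A small sketch with standalone data (not the article's DataFrame):

```python
import pandas as pd

ages = pd.Series([23, 31, 45, 58, 72])
groups = pd.cut(ages, bins=[0, 25, 40, 60, 100],
                labels=['Young', 'Middle', 'Senior', 'Elderly'])

# Labels are ordered by bin position, so comparisons work
print(groups.cat.ordered)                # True
print(ages[groups > 'Middle'].tolist())  # [45, 58, 72]
```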

Control bin boundaries with the right parameter:

# Left-inclusive bins (default is right-inclusive)
data['age_left'] = pd.cut(data['age'], 
                          bins=5, 
                          right=False)

# Compare: right-closed (17.939, 30.2] vs left-closed [18.0, 30.2)

Return integer labels instead of intervals:

data['age_numeric'] = pd.cut(data['age'], 
                              bins=5, 
                              labels=False)

print(data[['age', 'age_numeric']].head(10))
# Returns 0, 1, 2, 3, 4 for each bin

Using pd.qcut() for Quantile-Based Bins

pd.qcut() creates bins with approximately equal numbers of observations, useful when data is skewed or you need balanced categories.

# Create 4 quartiles
data['income_quartile'] = pd.qcut(data['income'], q=4)
print(data['income_quartile'].value_counts())

# Each quartile contains ~250 observations

Specify custom quantiles:

# Deciles (10 equal groups)
data['income_decile'] = pd.qcut(data['income'], 
                                q=10, 
                                labels=False)

# Custom percentiles
percentiles = [0, 0.25, 0.5, 0.75, 0.9, 1.0]
data['income_custom'] = pd.qcut(data['income'], 
                                q=percentiles,
                                labels=['Bottom 25%', 'Q2', 'Q3', '75-90%', 'Top 10%'])

Handle duplicate bin edges with duplicates='drop':

# When data has many identical values
uniform_data = pd.Series([1, 1, 1, 1, 2, 3, 4, 5])

# This would raise ValueError without duplicates parameter
binned = pd.qcut(uniform_data, 
                 q=4, 
                 duplicates='drop')

print(binned.value_counts())

Practical Applications

Feature Engineering for Machine Learning

# Bin continuous features for tree-based models
from sklearn.ensemble import RandomForestClassifier

# Create target variable (high income)
data['high_income'] = (data['income'] > data['income'].median()).astype(int)

# Bin features
data['age_binned'] = pd.qcut(data['age'], q=5, labels=False)
data['score_binned'] = pd.cut(data['score'], bins=10, labels=False)

# Use binned features
X = data[['age_binned', 'score_binned']]
y = data['high_income']

# Train model
rf = RandomForestClassifier(random_state=42)
rf.fit(X, y)

Risk Segmentation

# Credit risk scoring
credit_data = pd.DataFrame({
    'credit_score': np.random.randint(300, 850, 500),
    'debt_ratio': np.random.uniform(0, 2, 500)
})

# Define risk categories
score_bins = [0, 580, 670, 740, 850]
score_labels = ['Poor', 'Fair', 'Good', 'Excellent']

credit_data['credit_category'] = pd.cut(credit_data['credit_score'],
                                         bins=score_bins,
                                         labels=score_labels)

# Debt-to-income ratio quantiles
credit_data['debt_risk'] = pd.qcut(credit_data['debt_ratio'],
                                    q=3,
                                    labels=['Low', 'Medium', 'High'])

# Cross-tabulation
risk_matrix = pd.crosstab(credit_data['credit_category'],
                          credit_data['debt_risk'])
print(risk_matrix)

Data Visualization Preparation

import matplotlib.pyplot as plt

# Bin data for histogram-like categorization
data['income_range'] = pd.cut(data['income'],
                               bins=10,
                               precision=0)

# Group and aggregate
income_stats = data.groupby('income_range', observed=False).agg({
    'age': 'mean',
    'score': 'mean'
}).reset_index()

print(income_stats)

Advanced Techniques

Include Lowest Value

When bins is an integer, cut() pads the range slightly so the minimum is always included. With explicit bin edges, however, the first interval is right-closed — (1, 2] — so the lowest value becomes NaN unless you pass include_lowest=True:

values = pd.Series([1, 2, 3, 4, 5])
edges = [1, 2, 4, 5]

# Value 1 falls outside (1, 2] and becomes NaN
bins_default = pd.cut(values, bins=edges)

# include_lowest=True closes the first interval on the left
bins_inclusive = pd.cut(values, bins=edges, include_lowest=True)

print(f"Default: {bins_default.isna().sum()} NaN values")      # 1
print(f"Inclusive: {bins_inclusive.isna().sum()} NaN values")  # 0

Retrieve Bin Information

Access bin edges and understand the binning structure:

result, bins = pd.cut(data['age'], bins=5, retbins=True)

print("Bin edges:", bins)
print("Bin width:", bins[1] - bins[0])

# For qcut
result_q, bins_q = pd.qcut(data['income'], q=4, retbins=True)
print("Quantile edges:", bins_q)

Handling Missing Values

Both functions propagate NaN values by default:

data_with_nan = pd.Series([1, 2, np.nan, 4, 5, np.nan, 7, 8])

binned = pd.cut(data_with_nan, bins=3)
print(binned)
# NaN values remain as NaN in output

# Count bins, including NaN as its own category
print(binned.value_counts(dropna=False))

Performance Considerations

For large datasets, binning is computationally efficient:

# Large dataset
large_data = pd.Series(np.random.randn(10_000_000))

# Efficient binning
%timeit pd.cut(large_data, bins=100)
# Typically a few hundred milliseconds for 10M records, hardware-dependent

%timeit pd.qcut(large_data, q=100)
# Slightly slower due to quantile calculation

Common Pitfalls

Empty bins with cut(): Equal-width bins may produce sparse or empty categories when data is heavily skewed, because most observations cluster in a narrow part of the range.

skewed = pd.Series(np.random.exponential(1, 1000))

# Upper bins will be sparse and may be empty
bins_cut = pd.cut(skewed, bins=10)
print(bins_cut.value_counts().sort_index())

Duplicate edges with qcut(): Identical values can create duplicate bin edges.

# Many repeated values
repeated = pd.Series([1]*100 + [2]*100 + [3]*50)

# Handle duplicates explicitly
try:
    pd.qcut(repeated, q=5)
except ValueError as e:
    print(f"Error: {e}")
    
# Solution
binned = pd.qcut(repeated, q=5, duplicates='drop')

Choose cut() when you need consistent interval widths for interpretability or when bin boundaries have domain-specific meaning. Use qcut() when you need balanced sample sizes across bins or when working with skewed distributions where equal representation matters more than equal ranges.
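To see that trade-off concretely, here is a quick side-by-side sketch on skewed data:

```python
import pandas as pd
import numpy as np

np.random.seed(0)
skewed = pd.Series(np.random.exponential(1, 1000))

# Equal-width bins: consistent ranges, heavily uneven counts
width_counts = pd.cut(skewed, bins=4).value_counts().sort_index()

# Quantile bins: uneven ranges, balanced counts (exactly 250 each here)
quantile_counts = pd.qcut(skewed, q=4).value_counts().sort_index()

print(width_counts)
print(quantile_counts)
```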
