How to Use pd.qcut in Pandas

Key Insights

  • pd.qcut divides data into bins with approximately equal numbers of observations, making it ideal for percentile-based analysis and avoiding empty or sparse bins that plague equal-width binning.
  • The duplicates='drop' parameter is essential when working with real-world data that contains repeated values at quantile boundaries—without it, pd.qcut raises a ValueError on skewed distributions.
  • Use pd.qcut for customer segmentation, feature engineering in machine learning, and any analysis where relative ranking matters more than absolute values.

Introduction to Quantile-Based Binning

Binning continuous data into discrete categories is a fundamental data preparation task. Pandas offers two primary functions for this: pd.cut and pd.qcut. Understanding when to use each will save you from common analytical mistakes.

pd.qcut performs quantile-based discretization. It divides your data into bins that each contain roughly the same number of observations. If you have 1000 customers and request 4 bins, each bin will contain approximately 250 customers.
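This equal-count behavior is easy to verify. The sketch below (using synthetic data, not the article's examples) bins 1,000 random values into quartiles and counts each bin:

```python
import numpy as np
import pandas as pd

# 1,000 synthetic "customers" with random spend values
rng = np.random.default_rng(42)
spend = pd.Series(rng.normal(loc=500, scale=100, size=1000))

# Four quantile bins -> four groups of exactly 250 observations each
quartile_counts = pd.qcut(spend, q=4).value_counts()
print(quartile_counts)
```

With 1,000 distinct values, each of the four bins holds exactly 250 observations.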

pd.cut, by contrast, creates equal-width bins. It divides the range of your data into intervals of the same size, regardless of how many observations fall into each bin. This often results in uneven distributions—some bins overflowing while others sit nearly empty.

The choice matters. When analyzing income distributions, pd.cut might put 90% of your customers in the lowest bracket while the top bracket contains three billionaires. pd.qcut ensures each bracket represents an equal portion of your customer base.
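A quick sketch makes the contrast concrete. With right-skewed data (simulated here with an exponential distribution as a stand-in for income), pd.cut piles most observations into the first bin while pd.qcut keeps the counts even:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = pd.Series(rng.exponential(scale=50_000, size=1000))

# Equal-width bins: counts are heavily lopsided toward the low end
cut_counts = pd.cut(income, bins=4).value_counts().sort_index()

# Equal-frequency bins: 250 observations in each
qcut_counts = pd.qcut(income, q=4).value_counts().sort_index()

print(cut_counts)
print(qcut_counts)
```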

Basic Syntax and Parameters

Here’s the complete function signature:

pd.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise')

Let’s break down each parameter:

  • x: The input array or Series to bin
  • q: Number of quantiles (integer) or list of quantile boundaries (floats between 0 and 1)
  • labels: Custom names for the resulting bins; set to False to return integer indicators
  • retbins: If True, also returns the bin edges as an array
  • precision: Number of decimal places for bin labels
  • duplicates: How to handle duplicate bin edges—'raise' throws an error, 'drop' silently removes duplicates

The simplest use case creates quartiles:

import pandas as pd
import numpy as np

# Sample data: exam scores
scores = pd.Series([72, 85, 91, 68, 77, 82, 95, 61, 88, 79, 84, 90, 73, 86, 69])

# Create quartiles (4 equal-frequency bins)
quartiles = pd.qcut(scores, q=4)
print(quartiles)

Output:

0     (60.999, 72.5]
1       (82.0, 87.0]
2       (87.0, 95.0]
3     (60.999, 72.5]
4       (72.5, 82.0]
...
dtype: category
Categories (4, interval[float64, right]): [(60.999, 72.5] < (72.5, 82.0] < (82.0, 87.0] < (87.0, 95.0]]

Each category contains approximately the same number of students, giving you a true quartile ranking.
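You can confirm the equal-frequency property directly. With 15 scores and 4 bins, the counts come out as close to equal as the arithmetic allows (15 doesn't divide evenly by 4):

```python
import pandas as pd

scores = pd.Series([72, 85, 91, 68, 77, 82, 95, 61, 88, 79, 84, 90, 73, 86, 69])

# Count observations per quartile bin
counts = pd.qcut(scores, q=4).value_counts().sort_index()
print(counts)  # three bins of 4 and one bin of 3
```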

Creating Custom Quantile Bins

Sometimes you need more control than simply specifying the number of bins. Pass a list of quantile boundaries to define exactly where the splits occur:

# Customer purchase amounts
purchases = pd.Series([15, 45, 120, 89, 200, 55, 78, 340, 92, 67, 
                       150, 88, 42, 175, 95, 110, 65, 280, 73, 99])

# Create deciles (10 equal-frequency bins)
deciles = pd.qcut(purchases, q=10, labels=False)
print("Decile assignments:")
print(deciles.value_counts().sort_index())

# Create tertiles (thirds) with explicit boundaries
tertiles = pd.qcut(purchases, q=[0, 1/3, 2/3, 1.0])
print("\nTertile distribution:")
print(tertiles.value_counts())

This approach is particularly useful when your analysis requires specific percentile cutoffs. Marketing teams often want to identify the top 10% of customers or the bottom 20%:

# Identify top performers vs rest
performance_tiers = pd.qcut(
    purchases, 
    q=[0, 0.8, 0.9, 1.0],
    labels=['Standard', 'High Value', 'VIP']
)
print(performance_tiers.value_counts())

Adding Custom Labels

Interval notation like (60.999, 68.0] is precise but not user-friendly. The labels parameter lets you assign meaningful names:

# Student grades with letter labels
grades = pd.Series([72, 85, 91, 68, 77, 82, 95, 61, 88, 79, 84, 90, 73, 86, 69, 
                    58, 93, 76, 81, 87])

letter_grades = pd.qcut(
    grades, 
    q=5, 
    labels=['F', 'D', 'C', 'B', 'A']
)

print(letter_grades.value_counts().sort_index())

Output:

F    4
D    4
C    4
B    4
A    4
dtype: int64

For business applications, descriptive labels communicate findings more effectively:

# Customer lifetime value segmentation
clv = pd.Series([150, 2500, 890, 450, 3200, 175, 1100, 780, 2100, 320,
                 4500, 650, 1800, 290, 980, 1450, 3800, 520, 2800, 710])

segments = pd.qcut(
    clv,
    q=4,
    labels=['Bronze', 'Silver', 'Gold', 'Platinum']
)

# Create a summary DataFrame
summary = pd.DataFrame({
    'CLV': clv,
    'Segment': segments
})

print(summary.groupby('Segment')['CLV'].agg(['min', 'max', 'mean', 'count']))

Set labels=False when you need integer bin indicators for machine learning:

# Integer encoding for ML features
encoded_bins = pd.qcut(clv, q=4, labels=False)
print(encoded_bins)  # Returns: 0, 3, 1, 0, 3, 0, 2, ...

Handling Duplicate Bin Edges

Real-world data is messy. When many observations share the same value, quantile boundaries can overlap, causing pd.qcut to fail:

# Skewed data with many repeated values
skewed_data = pd.Series([1, 1, 1, 1, 1, 2, 2, 3, 5, 10])

# This will raise an error
try:
    bins = pd.qcut(skewed_data, q=4)
except ValueError as e:
    print(f"Error: {e}")

Output:

Error: Bin edges must be unique: array([ 1.  ,  1.  ,  1.5 ,  2.75, 10.  ])

The 25th percentile and minimum are both 1, creating duplicate bin edges. The duplicates parameter solves this:

# Solution: drop duplicate edges
bins = pd.qcut(skewed_data, q=4, duplicates='drop')
print(bins.value_counts())

This creates fewer bins than requested, but each bin has distinct boundaries. Always use duplicates='drop' when working with data that might contain ties at quantile boundaries—survey responses, ratings, and categorical-like numeric data are common culprits.

# Robust binning function for production code
def safe_qcut(series, q, labels=None):
    """Safely bin data, handling duplicates automatically."""
    return pd.qcut(
        series, 
        q=q, 
        labels=labels, 
        duplicates='drop'
    )

# Works even with problematic data
ratings = pd.Series([5, 5, 5, 4, 4, 4, 3, 3, 2, 1])
safe_bins = safe_qcut(ratings, q=5)
print(safe_bins.value_counts())

Practical Use Cases

Customer Segmentation

# E-commerce customer data
customers = pd.DataFrame({
    'customer_id': range(1, 101),
    'total_spend': np.random.exponential(500, 100),
    'order_count': np.random.poisson(5, 100),
    'days_since_last_order': np.random.exponential(30, 100)
})

# Create RFM-style segments
customers['spend_tier'] = pd.qcut(
    customers['total_spend'], 
    q=5, 
    labels=['Very Low', 'Low', 'Medium', 'High', 'Very High']
)

customers['frequency_tier'] = pd.qcut(
    customers['order_count'], 
    q=3, 
    labels=['Rare', 'Occasional', 'Frequent'],
    duplicates='drop'
)

# Recency: lower is better, so reverse the labels
customers['recency_tier'] = pd.qcut(
    customers['days_since_last_order'],
    q=3,
    labels=['Active', 'Cooling', 'At Risk'],
    duplicates='drop'
)

print(customers.head(10))

Feature Engineering for Machine Learning

# Preparing features for a model
df = pd.DataFrame({
    'income': np.random.lognormal(10, 1, 1000),
    'age': np.random.normal(45, 15, 1000).clip(18, 90),
    'credit_score': np.random.normal(700, 50, 1000).clip(300, 850)
})

# Create percentile-based features (often better than raw values for tree models)
for col in ['income', 'age', 'credit_score']:
    df[f'{col}_percentile'] = pd.qcut(df[col], q=10, labels=False, duplicates='drop')

# Get the actual bin edges for documentation
_, income_bins = pd.qcut(df['income'], q=10, retbins=True, duplicates='drop')
print(f"Income bin edges: {income_bins}")

Percentile Rankings

# Sales performance ranking
sales_data = pd.DataFrame({
    'rep_name': [f'Rep_{i}' for i in range(50)],
    'quarterly_sales': np.random.exponential(100000, 50)
})

sales_data['percentile_rank'] = pd.qcut(
    sales_data['quarterly_sales'],
    q=100,
    labels=False,
    duplicates='drop'
) + 1  # Shift from 0-99 to 1-100 (with only 50 reps, 50 of the 100 ranks appear)

sales_data['performance_band'] = pd.qcut(
    sales_data['quarterly_sales'],
    q=[0, 0.25, 0.75, 0.9, 1.0],
    labels=['Needs Improvement', 'Meets Expectations', 'Exceeds Expectations', 'Top Performer']
)

print(sales_data.sort_values('quarterly_sales', ascending=False).head(10))

Common Pitfalls and Best Practices

Choose q wisely. More bins isn’t always better. With 100 observations, requesting 20 bins means only 5 observations per bin—too few for meaningful analysis. Aim for at least 30 observations per bin for statistical reliability.
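One way to respect that rule of thumb is to cap q based on sample size. The helper below is a hypothetical convenience (not part of pandas) that never requests more bins than the data can support at roughly 30 observations each:

```python
import pandas as pd

def capped_qcut(series: pd.Series, q: int, min_per_bin: int = 30, **kwargs):
    """Bin into at most q quantiles, capped so each bin holds ~min_per_bin rows."""
    n = series.notna().sum()
    effective_q = max(2, min(q, n // min_per_bin))
    return pd.qcut(series, q=effective_q, duplicates='drop', **kwargs)

# With 100 rows, asking for 20 bins quietly falls back to 3 (100 // 30)
s = pd.Series(range(100))
capped_counts = capped_qcut(s, q=20).value_counts().sort_index()
print(capped_counts)
```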

Handle NaN values explicitly. pd.qcut excludes NaN values from binning and returns NaN for those positions. If you need to track missing data as a separate category, handle it before binning:

data = pd.Series([10, 20, np.nan, 30, 40, np.nan, 50])

# Option 1: Fill NaN before binning
filled_data = data.fillna(data.median())
bins = pd.qcut(filled_data, q=3)

# Option 2: Bin non-null values, then add 'Unknown' category
bins = pd.qcut(data, q=3)
bins = bins.cat.add_categories(['Unknown'])
bins = bins.fillna('Unknown')

Know when to use pd.cut instead. Use pd.cut when the bin boundaries have inherent meaning (age groups like 18-25, 26-35, etc.) or when you need consistent bins across different datasets. Use pd.qcut when relative ranking matters more than absolute values.
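The age-group case illustrates the difference. In this sketch, pd.cut uses fixed boundaries that carry inherent meaning and stay stable across datasets, while pd.qcut would produce different cutoffs for every sample:

```python
import pandas as pd

ages = pd.Series([22, 34, 29, 41, 55, 38, 27, 63, 45, 31])

# pd.cut: explicit, reusable boundaries with inherent meaning
age_groups = pd.cut(
    ages,
    bins=[18, 25, 35, 50, 100],
    labels=['18-25', '26-35', '36-50', '50+']
)
print(age_groups.value_counts().sort_index())

# pd.qcut: boundaries are data-dependent percentile cutoffs
print(pd.qcut(ages, q=4).value_counts())
```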

Always inspect your bins. After creating bins, verify the distribution makes sense:

bins, edges = pd.qcut(data, q=4, retbins=True)
print(f"Bin edges: {edges}")
print(f"Counts per bin:\n{bins.value_counts().sort_index()}")

pd.qcut is a workhorse function for data analysis. Master it, and you’ll find yourself reaching for it constantly—from quick exploratory analysis to production feature engineering pipelines.
