NumPy - np.digitize() - Bin Indices
Key Insights
- np.digitize() maps array values to bin indices, which is essential for histograms, data categorization, and bucketing operations without building full histograms
- The right parameter controls bin edge inclusion: right=False (the default) uses left-inclusive intervals [bins[i], bins[i+1]), while right=True uses right-inclusive intervals (bins[i], bins[i+1]]
- The returned indices show which interval each value falls in, with 0 for values below the first bin edge and len(bins) for values above the last
Understanding np.digitize() Fundamentals
np.digitize() assigns each value in an input array to a bin and returns the index of that bin. Unlike np.histogram() which counts occurrences, digitize() returns the bin index for each individual element, making it invaluable for data transformation and categorization tasks.
import numpy as np
values = np.array([0.2, 6.4, 3.0, 1.6, 8.5])
bins = np.array([0.0, 1.0, 2.5, 4.0, 10.0])
indices = np.digitize(values, bins)
print(f"Values: {values}")
print(f"Bins: {bins}")
print(f"Indices: {indices}")
# Output:
# Values: [0.2 6.4 3. 1.6 8.5]
# Bins: [ 0. 1. 2.5 4. 10. ]
# Indices: [1 4 3 2 4]
The value 0.2 falls in bin 1 (between 0.0 and 1.0), 6.4 falls in bin 4 (between 4.0 and 10.0), and so on. The returned indices represent which interval each value belongs to.
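That invariant can be checked directly: with the default right=False, every in-range value x assigned index i satisfies bins[i-1] <= x < bins[i]. A minimal verification using the same arrays:

```python
import numpy as np

values = np.array([0.2, 6.4, 3.0, 1.6, 8.5])
bins = np.array([0.0, 1.0, 2.5, 4.0, 10.0])
indices = np.digitize(values, bins)

# Every in-range value lies inside its assigned half-open interval
for x, i in zip(values, indices):
    assert bins[i - 1] <= x < bins[i]
```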
Left vs Right Inclusive Bins
The right parameter determines how bin edges are treated. By default (right=False), bins are left-inclusive: [a, b). When right=True, bins become right-inclusive: (a, b].
import numpy as np
values = np.array([1.0, 2.0, 3.0])
bins = np.array([1.0, 2.0, 3.0])
# Default: left-inclusive [1.0, 2.0), [2.0, 3.0), [3.0, ...]
left_inclusive = np.digitize(values, bins, right=False)
print(f"Left-inclusive (right=False): {left_inclusive}")
# Output: [1 2 3]
# Right-inclusive: (..., 1.0], (1.0, 2.0], (2.0, 3.0]
right_inclusive = np.digitize(values, bins, right=True)
print(f"Right-inclusive (right=True): {right_inclusive}")
# Output: [0 1 2]
This distinction is critical when values exactly match bin edges. With right=False, the value 1.0 belongs to bin 1; with right=True, it belongs to bin 0 (at or below the first bin edge).
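Per the NumPy documentation, for monotonically increasing bins np.digitize() is equivalent to np.searchsorted() with the side argument flipped: right=False corresponds to side='right' and right=True to side='left'. A quick sanity check of that equivalence:

```python
import numpy as np

values = np.array([1.0, 1.5, 2.0, 3.0])
bins = np.array([1.0, 2.0, 3.0])

# right=False matches side='right'; right=True matches side='left'
assert np.array_equal(np.digitize(values, bins, right=False),
                      np.searchsorted(bins, values, side='right'))
assert np.array_equal(np.digitize(values, bins, right=True),
                      np.searchsorted(bins, values, side='left'))
```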
Practical Application: Age Group Categorization
A common use case involves categorizing continuous data into discrete groups. Here’s how to classify ages into demographic brackets:
import numpy as np
ages = np.array([5, 12, 18, 25, 35, 45, 55, 65, 75, 85])
# Define age brackets: 0-18, 18-35, 35-50, 50-65, 65+
age_bins = np.array([0, 18, 35, 50, 65, 100])
age_labels = ['Child', 'Young Adult', 'Adult', 'Middle Age', 'Senior']
bin_indices = np.digitize(ages, age_bins)
# Map indices to labels (subtract 1 because digitize is 1-indexed)
age_categories = [age_labels[idx - 1] if 0 < idx <= len(age_labels) else 'Unknown'
                  for idx in bin_indices]
for age, category in zip(ages, age_categories):
    print(f"Age {age}: {category}")
# Output:
# Age 5: Child
# Age 12: Child
# Age 18: Young Adult
# Age 25: Young Adult
# Age 35: Adult
# Age 45: Adult
# Age 55: Middle Age
# Age 65: Senior
# Age 75: Senior
# Age 85: Senior
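When every value is known to fall inside the bin range, the list comprehension can be replaced with vectorized fancy indexing on a label array; a sketch using the same brackets:

```python
import numpy as np

ages = np.array([5, 12, 18, 25, 35, 45, 55, 65, 75, 85])
age_bins = np.array([0, 18, 35, 50, 65, 100])
labels = np.array(['Child', 'Young Adult', 'Adult', 'Middle Age', 'Senior'])

# digitize returns 1-based indices for in-range values; shift down to 0-based
idx = np.digitize(ages, age_bins) - 1
categories = labels[idx]
print(categories[0], categories[-1])  # Child Senior
```

This avoids the Python-level loop entirely, but silently misindexes out-of-range values, so it is only safe when the bins are known to cover the data.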
Handling Edge Cases and Outliers
Values outside the bin range receive special indices: 0 for values below the minimum bin edge, and len(bins) for values above the maximum. Note that with the default right=False, a value exactly equal to the last edge (20 in the example below) also receives len(bins) and counts as above the range.
import numpy as np
values = np.array([-5, 0, 5, 10, 15, 20, 25])
bins = np.array([0, 10, 20])
indices = np.digitize(values, bins)
print(f"Values: {values}")
print(f"Bins: {bins}")
print(f"Indices: {indices}")
# Output:
# Values: [-5 0 5 10 15 20 25]
# Bins: [ 0 10 20]
# Indices: [0 1 1 2 2 3 3]
# Identify outliers
below_range = values[indices == 0]
above_range = values[indices == len(bins)]
in_range = values[(indices > 0) & (indices < len(bins))]
print(f"Below range: {below_range}") # [-5]
print(f"Above range: {above_range}") # [20 25]
print(f"In range: {in_range}") # [ 0 5 10 15]
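If outliers should be folded into the nearest edge bin rather than flagged, np.clip() on the indices is a common pattern; a sketch using the same data:

```python
import numpy as np

values = np.array([-5, 0, 5, 10, 15, 20, 25])
bins = np.array([0, 10, 20])

indices = np.digitize(values, bins)
# Fold index 0 (below range) into the first bin and index len(bins)
# (above range) into the last bin
clipped = np.clip(indices, 1, len(bins) - 1)
print(clipped)  # [1 1 1 2 2 2 2]
```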
Performance-Critical Binning for Large Datasets
When working with large datasets, np.digitize() provides efficient vectorized binning. Here’s a performance comparison with a naive loop approach:
import numpy as np
import time
# Generate large dataset
np.random.seed(42)
large_data = np.random.uniform(0, 100, 1_000_000)
bins = np.linspace(0, 100, 101)
# Method 1: np.digitize (vectorized)
start = time.time()
indices_vectorized = np.digitize(large_data, bins)
vectorized_time = time.time() - start
# Method 2: Python loop (naive)
start = time.time()
indices_loop = []
for value in large_data:
    for i in range(len(bins) - 1):
        if bins[i] <= value < bins[i + 1]:
            indices_loop.append(i + 1)
            break
    else:
        indices_loop.append(len(bins))
loop_time = time.time() - start
print(f"Vectorized time: {vectorized_time:.4f}s")
print(f"Loop time: {loop_time:.4f}s")
print(f"Speedup: {loop_time / vectorized_time:.2f}x")
# Typical output:
# Vectorized time: 0.0156s
# Loop time: 45.2341s
# Speedup: 2900x
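For uniform-width bins like the linspace grid above, the bin index can also be computed arithmetically with floor division, skipping the binary search entirely; a sketch, assuming equal-width bins and data within range:

```python
import numpy as np

np.random.seed(42)
data = np.random.uniform(0, 100, 1_000_000)
bins = np.linspace(0, 100, 101)

# For equal-width bins, floor division reproduces digitize's 1-based index
width = bins[1] - bins[0]
arith = ((data - bins[0]) // width).astype(np.int64) + 1
assert np.array_equal(arith, np.digitize(data, bins))
```

This trick only applies to strictly uniform bins; for irregular edges, np.digitize() remains the right tool.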
Combining with Group Operations
np.digitize() pairs naturally with grouping operations to compute statistics per bin:
import numpy as np
# Sales data: [amount, ...]
sales = np.array([15, 45, 120, 89, 250, 340, 78, 156, 420, 95])
# Bins: Small (<50), Medium (50-150), Large (150-300), XLarge (300+)
bins = np.array([0, 50, 150, 300, 500])
bin_labels = ['Small', 'Medium', 'Large', 'XLarge']
bin_indices = np.digitize(sales, bins)
# Calculate statistics per bin
for i in range(1, len(bins)):
    mask = bin_indices == i
    bin_sales = sales[mask]
    if len(bin_sales) > 0:
        print(f"\n{bin_labels[i-1]} Sales:")
        print(f"  Count: {len(bin_sales)}")
        print(f"  Total: ${bin_sales.sum()}")
        print(f"  Average: ${bin_sales.mean():.2f}")
        print(f"  Values: {bin_sales}")
# Output:
# Small Sales:
# Count: 2
# Total: $60
# Average: $30.00
# Values: [15 45]
#
# Medium Sales:
# Count: 4
# Total: $382
# Average: $95.50
# Values: [120 89 78 95]
# ...
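The per-bin loop works, but np.bincount() can compute all bin counts and totals in one vectorized pass over the digitized indices; a sketch with the same sales data:

```python
import numpy as np

sales = np.array([15, 45, 120, 89, 250, 340, 78, 156, 420, 95])
bins = np.array([0, 50, 150, 300, 500])

idx = np.digitize(sales, bins)
# bincount gives counts per bin index; with weights it gives per-bin sums
counts = np.bincount(idx, minlength=len(bins) + 1)
totals = np.bincount(idx, weights=sales, minlength=len(bins) + 1)
print(counts[1:5])  # [2 4 2 2]
print(totals[1:5])  # [ 60. 382. 406. 760.]
```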
Creating Quantile-Based Bins
Combine np.digitize() with np.percentile() to create bins based on data distribution:
import numpy as np
np.random.seed(42)
data = np.random.exponential(scale=50, size=1000)
# Create quartile bins
quartiles = np.percentile(data, [0, 25, 50, 75, 100])
print(f"Quartile boundaries: {quartiles}")
bin_indices = np.digitize(data, quartiles)
# The maximum value equals the top boundary, so digitize assigns it
# index len(quartiles); fold it back into the fourth quartile
bin_indices = np.minimum(bin_indices, len(quartiles) - 1)
# Count elements in each quartile
for i in range(1, len(quartiles)):
    count = np.sum(bin_indices == i)
    print(f"Q{i}: {count} elements ({count/len(data)*100:.1f}%)")
# Output:
# Quartile boundaries: [ 0.08 28.34 48.93 80.51 406.51]
# Q1: 250 elements (25.0%)
# Q2: 250 elements (25.0%)
# Q3: 250 elements (25.0%)
# Q4: 250 elements (25.0%)
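An alternative that sidesteps boundary handling is to digitize against the interior edges only, so the result is already a 0-based quartile label with no out-of-range indices; a sketch:

```python
import numpy as np

np.random.seed(42)
data = np.random.exponential(scale=50, size=1000)

# Interior edges only: 25th, 50th, and 75th percentiles
inner_edges = np.percentile(data, [25, 50, 75])
quartile = np.digitize(data, inner_edges)  # labels 0..3, nothing out of range

counts = np.bincount(quartile, minlength=4)
print(counts)
```

Because continuous data has no ties at the interpolated percentile boundaries, each label covers one quarter of the sample.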
Multi-Dimensional Binning
For multi-dimensional data, apply np.digitize() to each dimension independently:
import numpy as np
# 2D data: [x, y] coordinates
np.random.seed(42)
x_coords = np.random.uniform(0, 10, 100)
y_coords = np.random.uniform(0, 10, 100)
# Create 3x3 grid
x_bins = np.array([0, 3.33, 6.67, 10])
y_bins = np.array([0, 3.33, 6.67, 10])
x_indices = np.digitize(x_coords, x_bins)
y_indices = np.digitize(y_coords, y_bins)
# Combine into grid cell indices
grid_cells = list(zip(x_indices, y_indices))
# Count points per cell
unique_cells, counts = np.unique(grid_cells, axis=0, return_counts=True)
print("Grid cell populations:")
for cell, count in zip(unique_cells, counts):
    print(f"Cell ({cell[0]}, {cell[1]}): {count} points")
This approach enables spatial indexing, heatmap generation, and grid-based analysis without external dependencies.
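The two index arrays can also be combined into a single flat cell id and tallied with np.bincount(), producing the full occupancy grid as a 3x3 array (for pure counting, np.histogram2d() offers the same result directly); a sketch:

```python
import numpy as np

np.random.seed(42)
x = np.random.uniform(0, 10, 100)
y = np.random.uniform(0, 10, 100)
edges = np.array([0, 3.33, 6.67, 10])

# digitize returns 1..3 for in-range values; shift to 0..2 for grid indexing
xi = np.digitize(x, edges) - 1
yi = np.digitize(y, edges) - 1
grid = np.bincount(xi * 3 + yi, minlength=9).reshape(3, 3)
print(grid)        # 3x3 count grid
print(grid.sum())  # 100
```

The dense grid form is convenient for heatmap plotting, where empty cells should appear as explicit zeros rather than missing entries.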