NumPy - np.digitize() - Bin Indices
Key Insights
- np.digitize() maps array values to bin indices, which is essential for histograms, data categorization, and bucketing operations without building full histograms
- The right parameter controls bin edge inclusion: right=False (the default) uses left-inclusive intervals [bins[i], bins[i+1]), while right=True uses right-inclusive intervals (bins[i], bins[i+1]]
- The returned indices show which interval each value falls in, with 0 for values below the first bin edge and len(bins) for values above the last
Understanding np.digitize() Fundamentals
np.digitize() assigns each value in an input array to a bin and returns the index of that bin. Unlike np.histogram() which counts occurrences, digitize() returns the bin index for each individual element, making it invaluable for data transformation and categorization tasks.
import numpy as np
values = np.array([0.2, 6.4, 3.0, 1.6, 8.5])
bins = np.array([0.0, 1.0, 2.5, 4.0, 10.0])
indices = np.digitize(values, bins)
print(f"Values: {values}")
print(f"Bins: {bins}")
print(f"Indices: {indices}")
# Output:
# Values: [0.2 6.4 3. 1.6 8.5]
# Bins: [ 0. 1. 2.5 4. 10. ]
# Indices: [1 4 3 2 4]
The value 0.2 falls in bin 1 (between 0.0 and 1.0), 6.4 falls in bin 4 (between 4.0 and 10.0), and so on. The returned indices represent which interval each value belongs to.
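That invariant can be checked directly: with the default right=False, every in-range value x assigned index i satisfies bins[i-1] <= x < bins[i]. A minimal verification using the same arrays:

```python
import numpy as np

values = np.array([0.2, 6.4, 3.0, 1.6, 8.5])
bins = np.array([0.0, 1.0, 2.5, 4.0, 10.0])
indices = np.digitize(values, bins)

# Every in-range value lies inside its assigned half-open interval
for x, i in zip(values, indices):
    assert bins[i - 1] <= x < bins[i]
```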
Left vs Right Inclusive Bins
The right parameter determines how bin edges are treated. By default (right=False), bins are left-inclusive: [a, b). When right=True, bins become right-inclusive: (a, b].
import numpy as np
values = np.array([1.0, 2.0, 3.0])
bins = np.array([1.0, 2.0, 3.0])
# Default: left-inclusive [1.0, 2.0), [2.0, 3.0), [3.0, ...]
left_inclusive = np.digitize(values, bins, right=False)
print(f"Left-inclusive (right=False): {left_inclusive}")
# Output: [1 2 3]
# Right-inclusive: (..., 1.0], (1.0, 2.0], (2.0, 3.0]
right_inclusive = np.digitize(values, bins, right=True)
print(f"Right-inclusive (right=True): {right_inclusive}")
# Output: [0 1 2]
This distinction is critical when values exactly match bin edges. With right=False, the value 1.0 belongs to bin 1; with right=True, it belongs to bin 0 (at or below the first bin edge).
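Per the NumPy documentation, for monotonically increasing bins np.digitize() is equivalent to np.searchsorted() with the side argument flipped: right=False corresponds to side='right' and right=True to side='left'. A quick sanity check of that equivalence:

```python
import numpy as np

values = np.array([1.0, 1.5, 2.0, 3.0])
bins = np.array([1.0, 2.0, 3.0])

# right=False matches side='right'; right=True matches side='left'
assert np.array_equal(np.digitize(values, bins, right=False),
                      np.searchsorted(bins, values, side='right'))
assert np.array_equal(np.digitize(values, bins, right=True),
                      np.searchsorted(bins, values, side='left'))
```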
Practical Application: Age Group Categorization
A common use case involves categorizing continuous data into discrete groups. Here’s how to classify ages into demographic brackets:
import numpy as np
ages = np.array([5, 12, 18, 25, 35, 45, 55, 65, 75, 85])
# Define age brackets: 0-18, 18-35, 35-50, 50-65, 65+
age_bins = np.array([0, 18, 35, 50, 65, 100])
age_labels = ['Child', 'Young Adult', 'Adult', 'Middle Age', 'Senior']
bin_indices = np.digitize(ages, age_bins)
# Map indices to labels (subtract 1 because digitize is 1-indexed)
age_categories = [age_labels[idx - 1] if 0 < idx <= len(age_labels) else 'Unknown'
                  for idx in bin_indices]
for age, category in zip(ages, age_categories):
    print(f"Age {age}: {category}")
# Output:
# Age 5: Child
# Age 12: Child
# Age 18: Young Adult
# Age 25: Young Adult
# Age 35: Adult
# Age 45: Adult
# Age 55: Middle Age
# Age 65: Senior
# Age 75: Senior
# Age 85: Senior
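When every value is known to fall inside the bin range, the list comprehension can be replaced with vectorized fancy indexing on a label array; a sketch using the same brackets:

```python
import numpy as np

ages = np.array([5, 12, 18, 25, 35, 45, 55, 65, 75, 85])
age_bins = np.array([0, 18, 35, 50, 65, 100])
labels = np.array(['Child', 'Young Adult', 'Adult', 'Middle Age', 'Senior'])

# digitize returns 1-based indices for in-range values; shift down to 0-based
idx = np.digitize(ages, age_bins) - 1
categories = labels[idx]
print(categories[0], categories[-1])  # Child Senior
```

This avoids the Python-level loop entirely, but silently misindexes out-of-range values, so it is only safe when the bins are known to cover the data.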
Handling Edge Cases and Outliers
Values outside the bin range receive special indices: 0 for values below the minimum bin edge, and len(bins) for values above the maximum. Note that with the default right=False, a value exactly equal to the last edge (20 in the example below) also receives len(bins) and counts as above the range.
import numpy as np
values = np.array([-5, 0, 5, 10, 15, 20, 25])
bins = np.array([0, 10, 20])
indices = np.digitize(values, bins)
print(f"Values: {values}")
print(f"Bins: {bins}")
print(f"Indices: {indices}")
# Output:
# Values: [-5 0 5 10 15 20 25]
# Bins: [ 0 10 20]
# Indices: [0 1 1 2 2 3 3]
# Identify outliers
below_range = values[indices == 0]
above_range = values[indices == len(bins)]
in_range = values[(indices > 0) & (indices < len(bins))]
print(f"Below range: {below_range}") # [-5]
print(f"Above range: {above_range}") # [20 25]
print(f"In range: {in_range}") # [ 0 5 10 15]
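If outliers should be folded into the nearest edge bin rather than flagged, np.clip() on the indices is a common pattern; a sketch using the same data:

```python
import numpy as np

values = np.array([-5, 0, 5, 10, 15, 20, 25])
bins = np.array([0, 10, 20])

indices = np.digitize(values, bins)
# Fold index 0 (below range) into the first bin and index len(bins)
# (above range) into the last bin
clipped = np.clip(indices, 1, len(bins) - 1)
print(clipped)  # [1 1 1 2 2 2 2]
```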
Performance-Critical Binning for Large Datasets
When working with large datasets, np.digitize() provides efficient vectorized binning. Here’s a performance comparison with a naive loop approach:
import numpy as np
import time
# Generate large dataset
np.random.seed(42)
large_data = np.random.uniform(0, 100, 1_000_000)
bins = np.linspace(0, 100, 101)
# Method 1: np.digitize (vectorized)
start = time.time()
indices_vectorized = np.digitize(large_data, bins)
vectorized_time = time.time() - start
# Method 2: Python loop (naive)
start = time.time()
indices_loop = []
for value in large_data:
    for i in range(len(bins) - 1):
        if bins[i] <= value < bins[i + 1]:
            indices_loop.append(i + 1)
            break
    else:
        indices_loop.append(len(bins))
loop_time = time.time() - start
print(f"Vectorized time: {vectorized_time:.4f}s")
print(f"Loop time: {loop_time:.4f}s")
print(f"Speedup: {loop_time / vectorized_time:.2f}x")
# Typical output:
# Vectorized time: 0.0156s
# Loop time: 45.2341s
# Speedup: 2900x
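For uniform-width bins like the linspace grid above, the bin index can also be computed arithmetically with floor division, skipping the binary search entirely; a sketch, assuming equal-width bins and data within range:

```python
import numpy as np

np.random.seed(42)
data = np.random.uniform(0, 100, 1_000_000)
bins = np.linspace(0, 100, 101)

# For equal-width bins, floor division reproduces digitize's 1-based index
width = bins[1] - bins[0]
arith = ((data - bins[0]) // width).astype(np.int64) + 1
assert np.array_equal(arith, np.digitize(data, bins))
```

This trick only applies to strictly uniform bins; for irregular edges, np.digitize() remains the right tool.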
Combining with Group Operations
np.digitize() pairs naturally with grouping operations to compute statistics per bin:
import numpy as np
# Sales data: [amount, ...]
sales = np.array([15, 45, 120, 89, 250, 340, 78, 156, 420, 95])
# Bins: Small (<50), Medium (50-150), Large (150-300), XLarge (300+)
bins = np.array([0, 50, 150, 300, 500])
bin_labels = ['Small', 'Medium', 'Large', 'XLarge']
bin_indices = np.digitize(sales, bins)
# Calculate statistics per bin
for i in range(1, len(bins)):
    mask = bin_indices == i
    bin_sales = sales[mask]
    if len(bin_sales) > 0:
        print(f"\n{bin_labels[i-1]} Sales:")
        print(f"  Count: {len(bin_sales)}")
        print(f"  Total: ${bin_sales.sum()}")
        print(f"  Average: ${bin_sales.mean():.2f}")
        print(f"  Values: {bin_sales}")
# Output:
# Small Sales:
# Count: 2
# Total: $60
# Average: $30.00
# Values: [15 45]
#
# Medium Sales:
# Count: 4
# Total: $382
# Average: $95.50
# Values: [120 89 78 95]
# ...
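The per-bin loop works, but np.bincount() can compute all bin counts and totals in one vectorized pass over the digitized indices; a sketch with the same sales data:

```python
import numpy as np

sales = np.array([15, 45, 120, 89, 250, 340, 78, 156, 420, 95])
bins = np.array([0, 50, 150, 300, 500])

idx = np.digitize(sales, bins)
# bincount gives counts per bin index; with weights it gives per-bin sums
counts = np.bincount(idx, minlength=len(bins) + 1)
totals = np.bincount(idx, weights=sales, minlength=len(bins) + 1)
print(counts[1:5])  # [2 4 2 2]
print(totals[1:5])  # [ 60. 382. 406. 760.]
```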
Creating Quantile-Based Bins
Combine np.digitize() with np.percentile() to create bins based on data distribution:
import numpy as np
np.random.seed(42)
data = np.random.exponential(scale=50, size=1000)
# Create quartile bins
quartiles = np.percentile(data, [0, 25, 50, 75, 100])
print(f"Quartile boundaries: {quartiles}")
bin_indices = np.digitize(data, quartiles)
# The maximum value equals the top boundary, so digitize assigns it
# index len(quartiles); fold it back into the fourth quartile
bin_indices = np.minimum(bin_indices, len(quartiles) - 1)
# Count elements in each quartile
for i in range(1, len(quartiles)):
    count = np.sum(bin_indices == i)
    print(f"Q{i}: {count} elements ({count/len(data)*100:.1f}%)")
# Output:
# Quartile boundaries: [ 0.08 28.34 48.93 80.51 406.51]
# Q1: 250 elements (25.0%)
# Q2: 250 elements (25.0%)
# Q3: 250 elements (25.0%)
# Q4: 250 elements (25.0%)
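An alternative that sidesteps boundary handling is to digitize against the interior edges only, so the result is already a 0-based quartile label with no out-of-range indices; a sketch:

```python
import numpy as np

np.random.seed(42)
data = np.random.exponential(scale=50, size=1000)

# Interior edges only: 25th, 50th, and 75th percentiles
inner_edges = np.percentile(data, [25, 50, 75])
quartile = np.digitize(data, inner_edges)  # labels 0..3, nothing out of range

counts = np.bincount(quartile, minlength=4)
print(counts)
```

Because continuous data has no ties at the interpolated percentile boundaries, each label covers one quarter of the sample.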
Multi-Dimensional Binning
For multi-dimensional data, apply np.digitize() to each dimension independently:
import numpy as np
# 2D data: [x, y] coordinates
np.random.seed(42)
x_coords = np.random.uniform(0, 10, 100)
y_coords = np.random.uniform(0, 10, 100)
# Create 3x3 grid
x_bins = np.array([0, 3.33, 6.67, 10])
y_bins = np.array([0, 3.33, 6.67, 10])
x_indices = np.digitize(x_coords, x_bins)
y_indices = np.digitize(y_coords, y_bins)
# Combine into grid cell indices
grid_cells = list(zip(x_indices, y_indices))
# Count points per cell
unique_cells, counts = np.unique(grid_cells, axis=0, return_counts=True)
print("Grid cell populations:")
for cell, count in zip(unique_cells, counts):
    print(f"Cell ({cell[0]}, {cell[1]}): {count} points")
This approach enables spatial indexing, heatmap generation, and grid-based analysis without external dependencies.
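The two index arrays can also be combined into a single flat cell id and tallied with np.bincount(), producing the full occupancy grid as a 3x3 array (for pure counting, np.histogram2d() offers the same result directly); a sketch:

```python
import numpy as np

np.random.seed(42)
x = np.random.uniform(0, 10, 100)
y = np.random.uniform(0, 10, 100)
edges = np.array([0, 3.33, 6.67, 10])

# digitize returns 1..3 for in-range values; shift to 0..2 for grid indexing
xi = np.digitize(x, edges) - 1
yi = np.digitize(y, edges) - 1
grid = np.bincount(xi * 3 + yi, minlength=9).reshape(3, 3)
print(grid)        # 3x3 count grid
print(grid.sum())  # 100
```

The dense grid form is convenient for heatmap plotting, where empty cells should appear as explicit zeros rather than missing entries.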