NumPy - np.histogram() - Compute Histogram
Key Insights

  • np.histogram() computes frequency distributions by binning data into intervals, returning both counts and bin edges as separate arrays
  • The function offers multiple binning strategies, including uniform width, custom edges, and automatic methods like 'auto', 'sturges', and 'fd' for data-driven bin selection
  • Understanding the relationship between bin edges (n+1 elements) and counts (n elements) is critical for correctly interpreting and visualizing histogram data

Understanding np.histogram() Basics

np.histogram() takes an array of values and divides them into bins, counting how many values fall into each bin. Unlike plotting functions, it returns raw numerical data: bin counts and bin edges.

import numpy as np

data = np.array([1.2, 2.3, 2.8, 3.1, 4.5, 4.7, 5.2, 6.8, 7.1, 9.3])
counts, bin_edges = np.histogram(data, bins=5)

print("Counts:", counts)
print("Bin edges:", bin_edges)

Output:

Counts: [3 1 3 2 1]
Bin edges: [1.2  2.82 4.44 6.06 7.68 9.3 ]

The function returns two arrays: counts with 5 elements (one per bin) and bin_edges with 6 elements (the boundaries of those 5 bins). Each bin is half-open, covering [left, right), except the last, which also includes its right edge: the first bin spans [1.2, 2.82), the second [2.82, 4.44), and so on.
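This half-open convention is easy to verify with a tiny sketch on made-up values, where the edges fall exactly on data points:

```python
import numpy as np

data = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
counts, edges = np.histogram(data, bins=2)

# Bins are [0, 2) and [2, 4]: the value 2.0 goes to the second bin,
# and the maximum 4.0 is still counted because the last bin is closed.
print(counts)  # [2 3]
print(edges)   # [0. 2. 4.]
```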

Specifying Bin Count and Edges

You control binning through the bins parameter, which accepts integers, sequences, or string methods.

data = np.random.normal(100, 15, 1000)

# Integer: number of equal-width bins
counts_10, edges_10 = np.histogram(data, bins=10)

# Sequence: explicit bin edges
custom_edges = [50, 75, 90, 100, 110, 125, 150]
counts_custom, edges_custom = np.histogram(data, bins=custom_edges)

# String: automatic binning method
counts_auto, edges_auto = np.histogram(data, bins='auto')

print(f"10 bins: {len(counts_10)} counts, {len(edges_10)} edges")
print(f"Custom bins: {len(counts_custom)} counts, {len(edges_custom)} edges")
print(f"Auto method: {len(counts_auto)} bins")

Custom bin edges let you create non-uniform intervals, useful for analyzing data with known thresholds or categories. Note that values falling outside the outermost edges are not counted at all.
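Because out-of-range values are dropped silently, it is worth checking that the counts account for all of your data. A small sketch with made-up values:

```python
import numpy as np

data = np.array([5, 60, 80, 95, 105, 130, 200])
edges = [50, 75, 90, 100, 110, 125, 150]

counts, _ = np.histogram(data, bins=edges)

# 5 and 200 fall outside [50, 150] and are silently dropped,
# so the counts sum to 5, not 7.
print(counts.sum())  # 5
```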

Automatic Binning Strategies

NumPy provides several algorithms for determining optimal bin counts. Each uses different statistical principles.

data = np.random.exponential(scale=2.0, size=500)

methods = ['auto', 'sturges', 'fd', 'doane', 'scott', 'sqrt']
results = {}

for method in methods:
    counts, edges = np.histogram(data, bins=method)
    results[method] = len(counts)

for method, num_bins in results.items():
    print(f"{method:10s}: {num_bins:3d} bins")

Output (approximate):

auto      :  17 bins
sturges   :  10 bins
fd        :  17 bins
doane     :  11 bins
scott     :  14 bins
sqrt      :  22 bins

  • sturges: Works well for Gaussian data, uses log₂(n) + 1
  • fd (Freedman-Diaconis): Robust to outliers, based on IQR
  • auto: Takes the larger of the sturges and fd estimates; a good default
  • scott: Uses the standard deviation; derived by minimizing integrated mean squared error for normal data
  • sqrt: Simple rule using √n bins
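If you only need the edges a rule would choose, for instance to bin several datasets on a common grid, np.histogram_bin_edges() computes them without counting. A sketch with a seeded generator:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0, 1, 1000)

# Compute edges once with the 'fd' rule, then reuse them so two
# samples share identical bins and their counts are comparable.
edges = np.histogram_bin_edges(data, bins='fd')
counts_a, _ = np.histogram(data, bins=edges)
counts_b, _ = np.histogram(rng.normal(0, 1, 1000), bins=edges)

print(len(edges) - 1, "shared bins")
```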

Working with Range and Density

The range parameter limits the histogram to a specific interval, and density converts counts to probability densities.

data = np.random.normal(50, 10, 10000)

# Default range uses min/max of data
counts_default, edges_default = np.histogram(data, bins=20)

# Explicit range
counts_range, edges_range = np.histogram(data, bins=20, range=(30, 70))

# Density normalization
counts_density, edges_density = np.histogram(data, bins=20, range=(30, 70), density=True)

print(f"Default range: [{edges_default[0]:.2f}, {edges_default[-1]:.2f}]")
print(f"Explicit range: [{edges_range[0]:.2f}, {edges_range[-1]:.2f}]")
print(f"Sum of counts (default): {counts_default.sum()}")
print(f"Sum of counts (range): {counts_range.sum()}")
print(f"Integral of density: {np.sum(counts_density * np.diff(edges_density)):.4f}")

Samples outside the specified range are simply excluded, which is why the sum of counts over (30, 70) falls short of 10000. When density=True, the histogram integrates to 1.0, making it a discrete approximation of the probability density function. The relationship is: density × bin_width = fraction of in-range samples in that bin.
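A quick check of that relationship, as a sketch with a seeded generator:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(50, 10, 10_000)

counts, edges = np.histogram(data, bins=20, range=(30, 70))
density, _ = np.histogram(data, bins=20, range=(30, 70), density=True)

# density * bin_width recovers the fraction of in-range samples per bin,
# which matches counts normalized by their own total.
fractions = density * np.diff(edges)
print(f"{fractions.sum():.4f}")  # 1.0000
```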

Handling Multidimensional Data

np.histogramdd() extends histogram computation to multiple dimensions, while np.histogram2d() handles the common 2D case.

# 2D histogram
x = np.random.normal(0, 1, 1000)
y = np.random.normal(0, 1.5, 1000)

counts_2d, x_edges, y_edges = np.histogram2d(x, y, bins=[10, 15])
print(f"2D histogram shape: {counts_2d.shape}")

# Multidimensional histogram
data_3d = np.random.randn(500, 3)
counts_nd, edges_nd = np.histogramdd(data_3d, bins=[8, 8, 8])
print(f"3D histogram shape: {counts_nd.shape}")
print(f"Number of edge arrays: {len(edges_nd)}")

For 2D histograms, counts_2d[i, j] represents the frequency of values where x falls in bin i and y falls in bin j. This is particularly useful for analyzing correlations between variables.
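To look up which bin a particular point fell into, you can search the returned edges; the sketch below uses np.searchsorted, subtracting 1 to convert an edge index into a bin index and clipping to keep edge cases in range:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0, 1, 1000)
y = rng.normal(0, 1.5, 1000)
counts_2d, x_edges, y_edges = np.histogram2d(x, y, bins=[10, 15])

# Bin indices for the point (0.0, 0.0)
i = np.clip(np.searchsorted(x_edges, 0.0, side='right') - 1, 0, 9)
j = np.clip(np.searchsorted(y_edges, 0.0, side='right') - 1, 0, 14)
print(counts_2d[i, j], "points share the bin containing the origin")
```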

Weighted Histograms

The weights parameter assigns importance to individual data points, enabling weighted frequency distributions.

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
weights = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])

# Unweighted histogram
counts_unweighted, edges = np.histogram(values, bins=5)

# Weighted histogram
counts_weighted, _ = np.histogram(values, bins=5, weights=weights)

print("Unweighted counts:", counts_unweighted)
print("Weighted counts:", counts_weighted)
print("Total weight:", counts_weighted.sum())

Output:

Unweighted counts: [2 2 2 2 2]
Weighted counts: [0.3 0.7 1.1 1.5 1.9]
Total weight: 5.5

Weighted histograms are essential for Monte Carlo simulations, survey data with sampling weights, or when aggregating pre-binned data.
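A common trick built on weights is the binned mean: histogram one variable weighted by another, then divide by the plain counts. The sketch below assumes a noisy linear relationship between a hypothetical x and y:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 1000)
y = 2 * x + rng.normal(0, 1, 1000)

counts, edges = np.histogram(x, bins=10)
sums, _ = np.histogram(x, bins=edges, weights=y)

# Mean of y within each x bin; empty bins become NaN instead of raising.
binned_mean = np.divide(sums, counts, out=np.full(10, np.nan),
                        where=counts > 0)
print(binned_mean.round(1))  # roughly 2 * (bin centers)
```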

Practical Application: Distribution Analysis

Combine histogram computation with statistical analysis to characterize data distributions.

def analyze_distribution(data, bins='auto'):
    counts, edges = np.histogram(data, bins=bins)
    bin_centers = (edges[:-1] + edges[1:]) / 2
    
    # Find mode (most frequent bin)
    mode_idx = np.argmax(counts)
    mode_range = (edges[mode_idx], edges[mode_idx + 1])
    
    # Calculate weighted mean from histogram
    hist_mean = np.sum(counts * bin_centers) / np.sum(counts)
    
    # Find bins containing 95% of data
    cumulative = np.cumsum(counts)
    total = cumulative[-1]
    lower_idx = np.searchsorted(cumulative, 0.025 * total)
    upper_idx = np.searchsorted(cumulative, 0.975 * total)
    
    return {
        'bins': len(counts),
        'mode_range': mode_range,
        'hist_mean': hist_mean,
        'range_95': (edges[lower_idx], edges[upper_idx + 1])
    }

# Test with bimodal distribution
data = np.concatenate([
    np.random.normal(20, 3, 500),
    np.random.normal(40, 4, 500)
])

stats = analyze_distribution(data, bins=30)
print(f"Number of bins: {stats['bins']}")
print(f"Mode in range: [{stats['mode_range'][0]:.2f}, {stats['mode_range'][1]:.2f}]")
print(f"Histogram mean: {stats['hist_mean']:.2f}")
print(f"95% range: [{stats['range_95'][0]:.2f}, {stats['range_95'][1]:.2f}]")

This approach reconstructs distribution characteristics from binned data, useful when working with large datasets where storing raw values is impractical.

Performance Considerations

For large datasets, histogram computation is memory-efficient since it reduces data to a fixed number of bins regardless of input size.

import time

sizes = [10**4, 10**5, 10**6, 10**7]

for size in sizes:
    data = np.random.randn(size)
    
    start = time.perf_counter()
    counts, edges = np.histogram(data, bins=100)
    elapsed = time.perf_counter() - start
    
    print(f"n={size:8d}: {elapsed:.4f}s, memory={counts.nbytes + edges.nbytes} bytes")

The output array size depends only on the number of bins, not the input size. This makes histograms ideal for data reduction in streaming applications or when preprocessing data for machine learning pipelines where you need fixed-size feature vectors from variable-length sequences.
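Because counts over fixed edges are additive, a large or streaming dataset can be histogrammed chunk by chunk and summed; only the running counts ever live in memory. A sketch with synthetic chunks:

```python
import numpy as np

rng = np.random.default_rng(4)
edges = np.linspace(-4, 4, 101)  # fix the edges up front: 100 bins
total = np.zeros(100, dtype=np.int64)

# Process one chunk at a time; each chunk's counts add into the total.
for _ in range(10):
    chunk = rng.standard_normal(100_000)
    counts, _ = np.histogram(chunk, bins=edges)
    total += counts

print(total.sum())  # samples inside [-4, 4]; a handful may fall outside
```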
