How to Create a Histogram in Matplotlib

Key Insights

  • Histograms reveal data distribution patterns that summary statistics miss—use them to spot skewness, outliers, and multimodal distributions before applying statistical models
  • Bin selection dramatically affects interpretation: too few bins hide important patterns, too many create noise, and Matplotlib’s automatic binning often needs manual adjustment
  • Overlaying multiple histograms with transparency (alpha=0.6) enables direct distribution comparisons, but normalize to density when sample sizes differ significantly

Introduction to Histograms

Histograms are fundamental tools for understanding data distribution. Unlike bar charts that show categorical data, histograms group continuous numerical data into bins and display the frequency of observations within each range. They answer critical questions: Is your data normally distributed? Are there outliers? Do you have multiple peaks suggesting distinct subpopulations?

Matplotlib remains the workhorse of Python data visualization despite newer alternatives like Plotly and Altair. Its histogram implementation is battle-tested, highly customizable, and integrates seamlessly with NumPy and pandas workflows. For exploratory data analysis and publication-quality figures, plt.hist() delivers consistent results with minimal code.

Basic Histogram Creation

The plt.hist() function requires just one argument: your data array. By default, Matplotlib divides the data range into 10 equal-width bins, a setting you'll often override.

import matplotlib.pyplot as plt
import numpy as np

# Generate sample data: 1000 random values from normal distribution
data = np.random.normal(loc=50, scale=10, size=1000)

# Create basic histogram
plt.hist(data)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Basic Histogram')
plt.show()

This creates a functional histogram, but the default styling and automatic bin selection rarely suit your needs. The function returns three values—counts, bin edges, and patch objects—which you can capture for further analysis:

counts, bins, patches = plt.hist(data)
print(f"Bin edges: {bins}")
print(f"Counts per bin: {counts}")

Understanding these return values helps when you need precise control over your visualization or want to extract statistical information programmatically.
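As a small sketch of that programmatic use, the returned counts and bin edges are enough to locate the histogram's peak without reading it off the plot (sample data assumed, matching the earlier examples):

```python
import numpy as np
import matplotlib.pyplot as plt

# Sample data matching the earlier examples
data = np.random.normal(loc=50, scale=10, size=1000)
counts, bins, patches = plt.hist(data, bins=30)

# Locate the most populated bin (the histogram's peak);
# bins has one more entry than counts, so bins[i] and bins[i+1]
# bracket the i-th bar
peak = int(np.argmax(counts))
print(f"Peak bin: {bins[peak]:.1f} to {bins[peak + 1]:.1f} "
      f"({int(counts[peak])} observations)")
```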

Customizing Histogram Appearance

Professional histograms require customization. The bins parameter is your most important tool—it accepts an integer bin count, an array defining exact bin edges, or a strategy name such as 'auto' or 'fd'.

# Customized histogram with explicit styling
plt.figure(figsize=(10, 6))
plt.hist(data, bins=30, color='steelblue', edgecolor='black', 
         alpha=0.7, linewidth=1.2)
plt.xlabel('Value', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Customized Histogram with 30 Bins', fontsize=14)
plt.grid(axis='y', alpha=0.3)
plt.show()

The edgecolor parameter is critical—without it, bins blur together. Setting alpha below 1.0 adds transparency, essential when overlaying multiple distributions.

Bin strategy affects interpretation significantly. Compare these approaches:

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Fixed bin count
axes[0].hist(data, bins=10, color='coral', edgecolor='black')
axes[0].set_title('10 Bins (Too Few)')

axes[1].hist(data, bins=50, color='lightgreen', edgecolor='black')
axes[1].set_title('50 Bins (Too Many)')

# Custom bin edges for specific ranges
bin_edges = np.arange(20, 81, 5)  # Bins from 20 to 80, width of 5
axes[2].hist(data, bins=bin_edges, color='steelblue', edgecolor='black')
axes[2].set_title('Custom Bin Edges (Width=5)')

plt.tight_layout()
plt.show()

Too few bins oversimplify the distribution, hiding important features. Too many bins create a jagged appearance dominated by random noise. Custom bin edges let you align bins with meaningful boundaries—age groups, price ranges, or measurement precision limits.

Multiple Histograms and Overlays

Comparing distributions requires plotting multiple datasets together. Transparency is non-negotiable here—without it, one histogram obscures the other.

# Generate two datasets with different characteristics
data1 = np.random.normal(loc=50, scale=10, size=1000)
data2 = np.random.normal(loc=55, scale=8, size=1000)

# Overlay histograms with transparency
plt.figure(figsize=(10, 6))
plt.hist(data1, bins=30, color='blue', alpha=0.6, label='Group A', edgecolor='black')
plt.hist(data2, bins=30, color='red', alpha=0.6, label='Group B', edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Overlapping Histograms')
plt.legend()
plt.grid(axis='y', alpha=0.3)
plt.show()

For side-by-side comparisons without overlap, use subplots:

fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharey=True)

axes[0].hist(data1, bins=30, color='blue', edgecolor='black')
axes[0].set_title('Group A')
axes[0].set_xlabel('Value')
axes[0].set_ylabel('Frequency')

axes[1].hist(data2, bins=30, color='red', edgecolor='black')
axes[1].set_title('Group B')
axes[1].set_xlabel('Value')

plt.tight_layout()
plt.show()

The sharey=True parameter ensures both plots use the same y-axis scale, making visual comparison accurate. Without it, Matplotlib auto-scales each subplot independently, potentially misleading viewers.

Advanced Histogram Techniques

Cumulative histograms show the running total of observations up to each bin, useful for understanding percentiles and thresholds:

plt.figure(figsize=(10, 6))
plt.hist(data, bins=30, cumulative=True, color='purple', 
         edgecolor='black', alpha=0.7)
plt.xlabel('Value')
plt.ylabel('Cumulative Frequency')
plt.title('Cumulative Histogram')
plt.grid(axis='y', alpha=0.3)
plt.show()

Normalized histograms (density plots) display probability density instead of raw counts. This is essential when comparing datasets with different sample sizes:

# Compare datasets with different sizes
large_sample = np.random.normal(loc=50, scale=10, size=5000)
small_sample = np.random.normal(loc=50, scale=10, size=500)

plt.figure(figsize=(10, 6))
plt.hist(large_sample, bins=30, density=True, alpha=0.6, 
         color='blue', label='n=5000', edgecolor='black')
plt.hist(small_sample, bins=30, density=True, alpha=0.6, 
         color='red', label='n=500', edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Probability Density')
plt.title('Normalized Histograms')
plt.legend()
plt.show()

Setting density=True normalizes the histogram so the total area of the bars equals 1, making the y-axis represent probability density rather than raw counts. Without normalization, the larger sample would completely dominate the visualization.
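You can confirm the normalization numerically: each bar's area is its density times its bin width, and those areas must sum to 1 (a quick check with assumed sample data):

```python
import numpy as np
import matplotlib.pyplot as plt

sample = np.random.normal(loc=50, scale=10, size=500)
densities, edges, _ = plt.hist(sample, bins=30, density=True)

# Bar area = density * bin width; summed over all bars this is 1
total_area = np.sum(densities * np.diff(edges))
print(f"Total area under histogram: {total_area:.6f}")
```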

For horizontal histograms, use orientation='horizontal':

plt.figure(figsize=(8, 10))
plt.hist(data, bins=30, orientation='horizontal', 
         color='teal', edgecolor='black')
plt.ylabel('Value')
plt.xlabel('Frequency')
plt.title('Horizontal Histogram')
plt.show()

Real-World Example

Let’s analyze a realistic dataset using pandas integration. This example demonstrates a complete workflow from data loading through styled visualization:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Create sample sales data
np.random.seed(42)
sales_data = pd.DataFrame({
    'transaction_amount': np.concatenate([
        np.random.gamma(shape=2, scale=50, size=800),  # Regular purchases
        np.random.gamma(shape=5, scale=150, size=200)  # Large purchases
    ])
})

# Create publication-ready histogram
plt.figure(figsize=(12, 7))
n, bins, patches = plt.hist(sales_data['transaction_amount'], 
                             bins=40, 
                             color='#2E86AB', 
                             edgecolor='white',
                             linewidth=0.8,
                             alpha=0.85)

# Add statistical annotations
mean_val = sales_data['transaction_amount'].mean()
median_val = sales_data['transaction_amount'].median()

plt.axvline(mean_val, color='red', linestyle='--', linewidth=2, 
            label=f'Mean: ${mean_val:.2f}')
plt.axvline(median_val, color='orange', linestyle='--', linewidth=2,
            label=f'Median: ${median_val:.2f}')

plt.xlabel('Transaction Amount ($)', fontsize=13, fontweight='bold')
plt.ylabel('Number of Transactions', fontsize=13, fontweight='bold')
plt.title('Sales Transaction Distribution - Q4 2024', 
          fontsize=15, fontweight='bold', pad=20)
plt.legend(fontsize=11, loc='upper right')
plt.grid(axis='y', alpha=0.3, linestyle=':', linewidth=0.8)

# Format x-axis as currency
ax = plt.gca()
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x:.0f}'))

plt.tight_layout()
plt.show()

# Print summary statistics
print(f"Total transactions: {len(sales_data)}")
print(f"Mean: ${mean_val:.2f}")
print(f"Median: ${median_val:.2f}")
print(f"Std Dev: ${sales_data['transaction_amount'].std():.2f}")

This example demonstrates several professional techniques: using specific color codes for brand consistency, adding reference lines for key statistics, formatting axes with domain-specific labels, and extracting numerical summaries alongside visualization.

Best Practices and Common Pitfalls

Bin selection matters more than you think. Matplotlib's default of 10 bins is rarely right, so always inspect the result. For skewed data, try the Freedman-Diaconis rule by passing bins='fd' to plt.hist(), or let bins='auto' choose between strategies. For large datasets (n > 10,000), increase the bin count to capture detail.
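As a sketch of the difference, here is Sturges binning next to Freedman-Diaconis on skewed sample data (the gamma parameters are illustrative assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

# Skewed sample data (illustrative)
rng = np.random.default_rng(0)
skewed = rng.gamma(shape=2, scale=30, size=2000)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(skewed, bins='sturges', edgecolor='black')
axes[0].set_title("bins='sturges'")
axes[1].hist(skewed, bins='fd', edgecolor='black')  # Freedman-Diaconis
axes[1].set_title("bins='fd'")
plt.tight_layout()
```

For skewed data at this sample size, 'fd' typically produces far more bins than 'sturges', revealing the long right tail that the coarser binning smooths away.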

Avoid misleading scales. Always start your y-axis at zero for frequency histograms. Truncated axes exaggerate differences. For density histograms, the y-axis scale is less intuitive—add annotations explaining what density means.
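When working with the object-oriented interface, you can anchor the frequency axis explicitly rather than trusting auto-scaling (sample data assumed):

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(loc=50, scale=10, size=1000)

fig, ax = plt.subplots()
ax.hist(data, bins=30, color='steelblue', edgecolor='black')
ax.set_ylim(bottom=0)  # anchor the frequency axis at zero
```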

Consider alternatives for specific scenarios. Kernel density estimation (KDE) plots smooth out histogram noise for presentation. Box plots better highlight outliers and quartiles. Violin plots combine both approaches. Use plt.hist() for exploration and when showing actual data distribution matters.
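Matplotlib ships both of those alternatives; a minimal side-by-side sketch, assuming the same kind of sample data as above:

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(loc=50, scale=10, size=1000)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].boxplot(data)      # highlights median, quartiles, and outliers
axes[0].set_title('Box plot')
axes[1].violinplot(data)   # adds a smoothed density outline
axes[1].set_title('Violin plot')
plt.tight_layout()
```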

Label everything explicitly. Axis labels should include units. Titles should specify what data you’re showing and any filters applied. Legends are mandatory when comparing multiple distributions.

Test different bin counts. Create 3-4 versions with varying bin counts during exploration. The “right” number reveals structure without overfitting to noise. When in doubt, show multiple versions to stakeholders.
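A simple loop makes this comparison cheap; the candidate counts below are arbitrary starting points, not recommendations:

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(loc=50, scale=10, size=1000)
candidates = [10, 20, 40, 80]  # illustrative bin counts to compare

fig, axes = plt.subplots(1, len(candidates), figsize=(16, 3.5))
for ax, n_bins in zip(axes, candidates):
    ax.hist(data, bins=n_bins, color='steelblue', edgecolor='black')
    ax.set_title(f'{n_bins} bins')
plt.tight_layout()
```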

Histograms are deceptively simple but require thoughtful configuration to communicate effectively. Master these techniques, and you’ll catch data issues that summary statistics miss entirely.
