How to Calculate the Mean in Python
Key Insights
- Pure Python's sum()/len() works for simple cases, but statistics.mean() handles edge cases and provides clearer intent for production code.
- NumPy's mean() is 10-100x faster than pure Python for large datasets and supports multidimensional operations along specific axes.
- Pandas automatically handles missing values when calculating means, making it the right choice for real-world data that's rarely clean.
The arithmetic mean—the sum of values divided by their count—is the most commonly used measure of central tendency in statistics. Whether you’re analyzing user engagement metrics, processing sensor data, or building machine learning features, you’ll calculate means constantly. Python offers multiple ways to compute this fundamental statistic, each with distinct trade-offs. This guide covers them all, from basic Python to specialized libraries, so you can choose the right tool for your specific situation.
Calculating Mean Manually with Pure Python
Before reaching for libraries, understand the basic operation. The arithmetic mean is simply the sum of all values divided by the number of values.
```python
# Manual calculation with a loop
numbers = [23, 45, 67, 89, 12, 34, 56, 78]
total = 0
for num in numbers:
    total += num

mean = total / len(numbers)
print(f"Mean: {mean}")  # Mean: 50.5
```
This works, but it’s verbose. Python’s built-in functions make it cleaner:
```python
# Using built-in sum() and len()
numbers = [23, 45, 67, 89, 12, 34, 56, 78]
mean = sum(numbers) / len(numbers)
print(f"Mean: {mean}")  # Mean: 50.5
```
This one-liner is readable and performs well for small to medium datasets. However, it has a critical flaw: it crashes on empty sequences with an unhelpful error.
```python
# This raises ZeroDivisionError
empty_list = []
mean = sum(empty_list) / len(empty_list)  # ZeroDivisionError

# You need explicit handling
def safe_mean(numbers):
    if not numbers:
        return None  # or raise a custom exception
    return sum(numbers) / len(numbers)
```
For quick scripts and throwaway code, sum()/len() is fine. For anything that might see production, use a library that handles edge cases properly.
Using the Statistics Module
Python’s standard library includes a statistics module specifically designed for statistical operations. It’s been available since Python 3.4, so there’s no excuse not to use it.
```python
import statistics

numbers = [23, 45, 67, 89, 12, 34, 56, 78]
mean = statistics.mean(numbers)
print(f"Mean: {mean}")  # Mean: 50.5

# Handles empty sequences properly
try:
    statistics.mean([])
except statistics.StatisticsError as e:
    print(f"Error: {e}")  # Error: mean requires at least one data point
```
The statistics.mean() function raises a descriptive StatisticsError instead of a cryptic ZeroDivisionError. It also works with any iterable, not just lists.
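As a quick illustration of that flexibility, mean() accepts tuples, sets, and even generator expressions just as readily as lists:

```python
import statistics

# mean() works with any iterable, not just lists
print(statistics.mean((10, 20, 30)))                # 20

# A generator works too: mean of 1, 4, 9, 16
print(statistics.mean(x * x for x in range(1, 5)))  # 7.5
```

Note that a generator is consumed by the call, so it can only be averaged once.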
Python 3.8 introduced statistics.fmean() for faster floating-point arithmetic:
```python
import statistics
import timeit

numbers = [23, 45, 67, 89, 12, 34, 56, 78]

# fmean() converts all values to float and uses a faster algorithm
fast_mean = statistics.fmean(numbers)
print(f"Fast mean: {fast_mean}")  # Fast mean: 50.5

# fmean() is significantly faster for large datasets
large_dataset = list(range(1, 100001))
print(timeit.timeit(lambda: statistics.mean(large_dataset), number=10))   # ~0.8s
print(timeit.timeit(lambda: statistics.fmean(large_dataset), number=10))  # ~0.08s
```
Use statistics.mean() when you need exact decimal arithmetic (financial calculations) and statistics.fmean() when speed matters and floating-point precision is acceptable.
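To see what "exact arithmetic" means in practice, here is a small sketch using Fraction values: mean() preserves the exact type, while fmean() converts everything to float.

```python
import statistics
from fractions import Fraction

# mean() preserves exact numeric types: Fractions in, Fraction out
thirds = [Fraction(1, 3), Fraction(1, 3), Fraction(2, 3)]
print(statistics.mean(thirds))   # 4/9 (exact)

# fmean() converts everything to float, trading exactness for speed
print(statistics.fmean(thirds))  # ~0.4444...
```

The same type preservation applies to decimal.Decimal, which is why mean() is the safer choice for money.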
Calculating Mean with NumPy
When working with large datasets or numerical computing, NumPy is the standard choice. Its mean() function operates on arrays and is optimized for performance.
```python
import numpy as np

# Basic mean calculation
numbers = np.array([23, 45, 67, 89, 12, 34, 56, 78])
mean = np.mean(numbers)
print(f"Mean: {mean}")  # Mean: 50.5

# Also available as an array method
mean = numbers.mean()
print(f"Mean: {mean}")  # Mean: 50.5
```
NumPy’s real power shows with multidimensional arrays. The axis parameter lets you compute means along specific dimensions:
```python
import numpy as np

# 2D array: 3 rows, 4 columns
data = np.array([
    [10, 20, 30, 40],
    [15, 25, 35, 45],
    [12, 22, 32, 42]
])

# Mean of all elements
overall_mean = np.mean(data)
print(f"Overall mean: {overall_mean}")  # Overall mean: 27.333...

# Mean along axis 0 (column means)
column_means = np.mean(data, axis=0)
print(f"Column means: {column_means}")  # Column means: [12.33 22.33 32.33 42.33]

# Mean along axis 1 (row means)
row_means = np.mean(data, axis=1)
print(f"Row means: {row_means}")  # Row means: [25. 30. 27.]
```
For performance-critical applications, NumPy’s vectorized operations are dramatically faster than pure Python:
```python
import numpy as np
import timeit

# Compare performance on 1 million elements
python_list = list(range(1000000))
numpy_array = np.array(python_list)

# Pure Python: ~50ms
print(timeit.timeit(lambda: sum(python_list) / len(python_list), number=10))

# NumPy: ~2ms (25x faster)
print(timeit.timeit(lambda: np.mean(numpy_array), number=10))
```
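One caveat worth knowing before moving on: unlike Pandas (covered next), NumPy's np.mean() propagates NaN rather than skipping it. The NaN-aware variant is np.nanmean(). A quick sketch:

```python
import numpy as np

data = np.array([1.0, 2.0, np.nan, 4.0])

# np.mean() propagates NaN: one missing value poisons the result
print(np.mean(data))     # nan

# np.nanmean() ignores missing values: (1 + 2 + 4) / 3
print(np.nanmean(data))  # 2.333...
```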
Mean Calculation with Pandas
Real-world data lives in tables, not clean arrays. Pandas is built for this reality, with robust handling of missing values and intuitive syntax for column-wise operations.
```python
import pandas as pd

# Mean of a Series
temperatures = pd.Series([72.5, 68.3, 75.1, 69.8, 71.2])
mean_temp = temperatures.mean()
print(f"Mean temperature: {mean_temp}")  # Mean temperature: 71.38

# Mean of DataFrame columns
df = pd.DataFrame({
    'temperature': [72.5, 68.3, 75.1, 69.8, 71.2],
    'humidity': [45, 52, 38, 61, 55],
    'wind_speed': [12.3, 8.7, 15.2, 6.4, 10.1]
})

# Mean of each numeric column
print(df.mean())
# temperature    71.38
# humidity       50.20
# wind_speed     10.54

# Mean of a specific column
print(f"Mean humidity: {df['humidity'].mean()}")  # Mean humidity: 50.2
```
The killer feature is automatic handling of missing values:
```python
import pandas as pd
import numpy as np

# Data with missing values (common in real datasets)
df = pd.DataFrame({
    'sales': [100, 150, np.nan, 200, 175, np.nan, 225],
    'returns': [5, np.nan, 8, 12, np.nan, 7, 10]
})

# By default, mean() skips NaN values
print(f"Mean sales: {df['sales'].mean()}")  # Mean sales: 170.0 (calculated from 5 values)

# Explicit control with the skipna parameter
print(f"Skip NaN: {df['sales'].mean(skipna=True)}")      # 170.0
print(f"Include NaN: {df['sales'].mean(skipna=False)}")  # nan

# Mean across all columns, handling NaN appropriately
print(df.mean())
# sales      170.0
# returns      8.4
```
When skipna=True (the default), Pandas excludes missing values from both the sum and the count. This is almost always what you want with real data.
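A related caveat: since pandas 2.0, DataFrame.mean() raises a TypeError when the frame contains non-numeric columns instead of silently dropping them as older versions did. Passing numeric_only=True restricts the calculation to numeric columns (the data below is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['north', 'south', 'east'],  # non-numeric column
    'sales': [100, 150, 200],
})

# Restrict the aggregation to numeric columns; without numeric_only=True,
# pandas 2.0+ raises a TypeError on the string column
print(df.mean(numeric_only=True))
# sales    150.0
```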
Weighted and Other Mean Variants
The arithmetic mean treats all values equally, but sometimes values have different importance. A weighted mean accounts for this.
```python
import numpy as np

# Student grades with different credit weights
grades = np.array([85, 90, 78, 92])  # four courses
credits = np.array([3, 4, 2, 3])     # credit hours per course

# Weighted mean (GPA-style calculation)
weighted_mean = np.average(grades, weights=credits)
print(f"Weighted mean: {weighted_mean}")  # Weighted mean: 87.25

# Compare to simple mean
simple_mean = np.mean(grades)
print(f"Simple mean: {simple_mean}")  # Simple mean: 86.25
```
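Under the hood, np.average() implements the weighted-mean formula sum(w_i * x_i) / sum(w_i), which is easy to verify by hand:

```python
import numpy as np

grades = np.array([85, 90, 78, 92])
credits = np.array([3, 4, 2, 3])

# Weighted mean formula: sum(w_i * x_i) / sum(w_i)
# = (85*3 + 90*4 + 78*2 + 92*3) / 12 = 1047 / 12
manual = (grades * credits).sum() / credits.sum()
print(manual)  # 87.25, same as np.average(grades, weights=credits)
```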
The statistics module provides specialized mean functions for specific use cases:
```python
import statistics

# Geometric mean: useful for growth rates and ratios
# (nth root of the product of n numbers)
growth_rates = [1.05, 1.08, 1.03, 1.10, 1.02]  # 5%, 8%, 3%, 10%, 2% growth
geometric_mean = statistics.geometric_mean(growth_rates)
print(f"Average growth factor: {geometric_mean}")  # ~1.0556 (about 5.56% average growth)

# Harmonic mean: useful for rates
# (reciprocal of the arithmetic mean of reciprocals)
speeds = [60, 40, 50]  # mph for three equal-distance segments
harmonic_mean = statistics.harmonic_mean(speeds)
print(f"Average speed: {harmonic_mean}")  # ~48.65 mph (not 50!)

# Why the harmonic mean for speeds?
# If you drive 60 mph for 60 miles, then 40 mph for 60 miles:
# Time = 60/60 + 60/40 = 1 + 1.5 = 2.5 hours
# Average speed = 120 miles / 2.5 hours = 48 mph
```
The geometric mean is essential for financial calculations involving compound growth. The harmonic mean is correct for averaging rates when the denominator varies (like speed over equal distances).
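The two-segment drive worked through above can be checked directly in code: the true average speed (total distance over total time) comes out identical to the harmonic mean of the two speeds.

```python
import statistics

# Two equal-distance segments: 60 mph, then 40 mph
distance = 60  # miles per segment
total_time = distance / 60 + distance / 40  # 1 + 1.5 = 2.5 hours
true_average = 2 * distance / total_time    # 120 / 2.5 = 48 mph

# The harmonic mean reproduces the true average speed
print(statistics.harmonic_mean([60, 40]))  # ~48 mph
print(true_average)                        # 48.0
```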
Conclusion
Choose your mean calculation method based on your context:
- Pure Python (sum()/len()): Quick scripts, learning exercises, or when you can't install packages. Always add empty-sequence handling for production code.
- statistics.mean() / statistics.fmean(): Standard library solution with proper error handling. Use mean() for exact arithmetic, fmean() for speed. Best for general-purpose Python applications.
- NumPy (np.mean()): Large datasets, numerical computing, or when you need axis-based operations on multidimensional data. The performance difference is substantial—use NumPy when processing thousands of values or more.
- Pandas (.mean()): Tabular data with potential missing values. The automatic NaN handling alone justifies using Pandas for any real-world data analysis.
- Weighted/specialized means: Use np.average() with weights for weighted calculations, statistics.geometric_mean() for growth rates, and statistics.harmonic_mean() for averaging rates.
The mean is deceptively simple. Choosing the right implementation for your use case—considering performance, error handling, and data quality—separates production-ready code from fragile scripts.