How to Calculate the Mean in NumPy
Key Insights
- NumPy's np.mean() is 10-100x faster than Python's built-in statistics.mean() for large arrays, and the axis parameter lets you compute means across specific dimensions without loops.
- Always use np.nanmean() when your data might contain missing values; standard np.mean() will silently return NaN and corrupt your entire calculation.
- For weighted averages (like calculating GPA or portfolio returns), use np.average() with the weights parameter instead of manually implementing the calculation.
Introduction to NumPy Mean Calculations
Calculating the mean seems trivial until you’re working with millions of data points, multidimensional arrays, or datasets riddled with missing values. Python’s built-in statistics.mean() works fine for small lists, but it falls apart at scale.
NumPy solves this with vectorized operations that run at C speed. You get two ways to calculate means: the np.mean() function and the ndarray.mean() method. They’re functionally identical—use whichever reads better in your code.
import numpy as np
data = np.array([1, 2, 3, 4, 5])
# Both produce the same result
result1 = np.mean(data) # Function syntax
result2 = data.mean() # Method syntax
print(result1, result2) # 3.0 3.0
The function syntax is more flexible when you’re working with array-like inputs that aren’t already NumPy arrays. The method syntax is cleaner when chaining operations.
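The speed claim is easy to check on your own machine. A rough sketch using the standard library's timeit module; the exact speedup depends on array size and hardware, so treat the 10-100x figure as a ballpark:

```python
import timeit
import statistics
import numpy as np

# One million values, once as a Python list and once as a NumPy array
values = list(range(1_000_000))
arr = np.array(values)

# Time each approach a few times and compare totals
t_stdlib = timeit.timeit(lambda: statistics.mean(values), number=3)
t_numpy = timeit.timeit(lambda: np.mean(arr), number=3)
print(f"statistics.mean: {t_stdlib:.4f}s, np.mean: {t_numpy:.4f}s")
```

On typical hardware the NumPy version wins by two orders of magnitude, because the loop and the arithmetic both happen in compiled code.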
Basic Mean Calculation with np.mean()
The basic syntax is straightforward: pass an array, get a mean. NumPy handles the type conversion and returns a scalar.
import numpy as np
# From a Python list
prices = [29.99, 34.50, 22.00, 45.99, 31.25]
avg_price = np.mean(prices)
print(f"Average price: ${avg_price:.2f}") # Average price: $32.75
# From a NumPy array
temperatures = np.array([72.1, 68.5, 75.3, 71.8, 69.2, 73.6])
avg_temp = np.mean(temperatures)
print(f"Average temperature: {avg_temp:.1f}°F") # Average temperature: 71.8°F
# Works with different numeric types
integers = np.array([10, 20, 30, 40, 50], dtype=np.int32)
print(np.mean(integers)) # 30.0 (always returns float by default)
Notice that np.mean() always returns a floating-point result, even when the input is integers. This prevents the truncation errors you’d get with integer division.
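A quick sketch makes the float promotion concrete, using a sum that integer division would truncate:

```python
import numpy as np

# 7 / 3 would truncate to 2 under integer division;
# np.mean() promotes the result to float64 instead
counts = np.array([1, 2, 4], dtype=np.int64)
result = np.mean(counts)
print(result)        # 2.3333333333333335
print(result.dtype)  # float64
```

If you genuinely need a different result type, pass the dtype parameter explicitly rather than casting afterward.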
Calculating Mean Along Axes
Real-world data is rarely one-dimensional. When you’re working with matrices or higher-dimensional arrays, the axis parameter becomes essential.
Think of axis as “the dimension you want to collapse.” For a 2D array:
- axis=0 collapses rows, giving you column means
- axis=1 collapses columns, giving you row means
- axis=None (default) flattens everything and returns a single value
import numpy as np
# Sales data: 4 quarters (rows) x 3 products (columns)
sales = np.array([
[150, 200, 175], # Q1
[180, 220, 190], # Q2
[200, 250, 210], # Q3
[170, 230, 185] # Q4
])
# Mean sales per product (across all quarters)
product_means = np.mean(sales, axis=0)
print(f"Product averages: {product_means}")
# Product averages: [175. 225. 190.]
# Mean sales per quarter (across all products)
quarter_means = np.mean(sales, axis=1)
print(f"Quarterly averages: {quarter_means}")
# Quarterly averages: [175. 196.67 220. 195.] (196.67 shown rounded)
# Overall mean
overall = np.mean(sales)
print(f"Overall average: {overall:.2f}")
# Overall average: 196.67
For 3D arrays, the same logic extends. If you have a shape of (depth, rows, cols), then axis=0 averages across depth, axis=1 across rows, and axis=2 across columns.
# Monthly data for 2 years, 4 quarters, 3 products
# Shape: (2, 4, 3)
yearly_sales = np.array([
[[150, 200, 175], [180, 220, 190], [200, 250, 210], [170, 230, 185]], # Year 1
[[160, 210, 180], [190, 230, 200], [220, 270, 230], [180, 240, 195]] # Year 2
])
# Average across years (for each quarter/product combination)
print(np.mean(yearly_sales, axis=0).shape) # (4, 3)
# Average across quarters (for each year/product combination)
print(np.mean(yearly_sales, axis=1).shape) # (2, 3)
# Average across products (for each year/quarter combination)
print(np.mean(yearly_sales, axis=2).shape) # (2, 4)
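The axis parameter also accepts a tuple, which collapses several dimensions in one call. A sketch reusing the same yearly_sales array to get one mean per product across both years and all quarters:

```python
import numpy as np

# Shape: (2 years, 4 quarters, 3 products)
yearly_sales = np.array([
    [[150, 200, 175], [180, 220, 190], [200, 250, 210], [170, 230, 185]],  # Year 1
    [[160, 210, 180], [190, 230, 200], [220, 270, 230], [180, 240, 195]]   # Year 2
])

# Collapse years AND quarters at once, leaving one mean per product
per_product = np.mean(yearly_sales, axis=(0, 1))
print(per_product.shape)  # (3,)
print(per_product)        # means per product: 181.25, 231.25, 195.625
```

This is equivalent to chaining two single-axis means, but it's clearer and avoids an intermediate array.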
Handling Missing Data (NaN Values)
Here’s where many developers get burned. If your array contains even a single NaN value, np.mean() returns NaN for the entire calculation:
import numpy as np
# Sensor readings with a missing value
readings = np.array([23.5, 24.1, np.nan, 23.8, 24.3])
# This silently fails
bad_mean = np.mean(readings)
print(f"Standard mean: {bad_mean}") # Standard mean: nan
This behavior is technically correct (NaN propagation), but it’s rarely what you want. Use np.nanmean() to ignore NaN values:
import numpy as np
readings = np.array([23.5, 24.1, np.nan, 23.8, 24.3])
# Ignores NaN values
good_mean = np.nanmean(readings)
print(f"Mean ignoring NaN: {good_mean:.2f}") # Mean ignoring NaN: 23.93
# Works with axis parameter too
data_with_gaps = np.array([
[1.0, 2.0, np.nan],
[4.0, np.nan, 6.0],
[7.0, 8.0, 9.0]
])
print("Column means (ignoring NaN):")
print(np.nanmean(data_with_gaps, axis=0))
# [4. 5. 7.5]
print("Row means (ignoring NaN):")
print(np.nanmean(data_with_gaps, axis=1))
# [1.5 5. 8. ]
Pro tip: Always check for NaN values before deciding which function to use. A quick np.isnan(data).any() tells you if you need nanmean().
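That check can be wrapped into a small helper. The safe_mean function below is a hypothetical convenience, not a NumPy API; it simply applies the pro tip automatically:

```python
import numpy as np

def safe_mean(data):
    """Hypothetical helper: fall back to nanmean only when NaNs exist."""
    arr = np.asarray(data, dtype=float)
    if np.isnan(arr).any():
        return np.nanmean(arr)
    return np.mean(arr)

print(safe_mean([1.0, np.nan, 3.0]))  # 2.0
print(safe_mean([1.0, 2.0, 3.0]))     # 2.0
```

One caveat: np.nanmean() of an all-NaN slice still returns NaN (with a RuntimeWarning), so a helper like this doesn't remove the need to sanity-check truly empty data.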
Data Types and Precision
The dtype parameter controls the precision used for intermediate accumulation. For integer inputs, np.mean() already accumulates in float64 by default, so large integer arrays are safe. Where it really matters is float32 arrays, which accumulate in float32 by default and can lose precision over millions of elements.
import numpy as np
# Integer inputs: np.mean() accumulates in float64 by default,
# so even values near the int32 limit don't overflow
large_ints = np.array([2_000_000_000, 2_000_000_000, 2_000_000_000], dtype=np.int32)
print(f"Integer mean: {np.mean(large_ints)}") # Integer mean: 2000000000.0
# float32 inputs accumulate in float32 by default; force float64
# accumulation when precision matters
data_f32 = np.random.randn(1_000_000).astype(np.float32)
print(f"Safe mean: {np.mean(data_f32, dtype=np.float64)}")
For financial calculations where precision matters, explicitly specify the dtype:
import numpy as np
# Currency values that need precision
transactions = np.array([0.1, 0.2, 0.3, 0.1, 0.1, 0.2])
# Standard float64 is usually sufficient
mean_transaction = np.mean(transactions, dtype=np.float64)
print(f"Mean transaction: ${mean_transaction:.10f}")
# Mean transaction: $0.1666666667
When working with very large arrays, you can save memory by using float32, but be aware of the precision tradeoff:
import numpy as np
data = np.random.randn(1_000_000)
mean_64 = np.mean(data, dtype=np.float64)
mean_32 = np.mean(data, dtype=np.float32)
print(f"Float64: {mean_64:.10f}")
print(f"Float32: {mean_32:.10f}")
print(f"Difference: {abs(mean_64 - mean_32):.2e}")
Weighted Mean with np.average()
np.mean() treats all values equally. When values have different importance—like calculating GPA, portfolio returns, or weighted survey responses—use np.average():
import numpy as np
# GPA calculation: grades with credit hours as weights
grades = np.array([4.0, 3.7, 3.3, 4.0, 3.0]) # A, A-, B+, A, B
credits = np.array([3, 4, 3, 2, 4]) # Credit hours
# Unweighted mean (wrong for GPA)
simple_mean = np.mean(grades)
print(f"Simple mean: {simple_mean:.2f}") # 3.60
# Weighted mean (correct GPA)
gpa = np.average(grades, weights=credits)
print(f"Weighted GPA: {gpa:.2f}") # 3.54
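Under the hood, np.average() computes sum(weights * values) / sum(weights). A quick sketch verifying the GPA by hand against the built-in:

```python
import numpy as np

grades = np.array([4.0, 3.7, 3.3, 4.0, 3.0])
credits = np.array([3, 4, 3, 2, 4])

# Manual weighted mean: sum(w * x) / sum(w)
manual = np.sum(grades * credits) / np.sum(credits)
builtin = np.average(grades, weights=credits)
print(f"{manual:.5f} {builtin:.5f}")  # 3.54375 3.54375
```

Prefer np.average() in real code anyway: it validates that the weights broadcast against the values and raises if all weights sum to zero, which the manual version silently gets wrong.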
np.average() also works with axes for multidimensional data:
import numpy as np
# Quarterly returns for 3 stocks
returns = np.array([
[0.05, 0.08, 0.03, 0.06], # Stock A
[0.12, -0.02, 0.15, 0.08], # Stock B
[0.03, 0.04, 0.02, 0.05] # Stock C
])
# Portfolio weights
portfolio_weights = np.array([0.5, 0.3, 0.2]) # 50%, 30%, 20%
# Weighted average return per quarter
quarterly_portfolio_returns = np.average(returns, axis=0, weights=portfolio_weights)
print(f"Quarterly portfolio returns: {quarterly_portfolio_returns}")
# [0.067 0.042 0.064 0.064]
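np.average() can also hand back the sum of the weights via returned=True, which is useful when the weights don't already sum to 1 and you want to re-normalize or audit them. A sketch on the same portfolio data:

```python
import numpy as np

# Quarterly returns for 3 stocks (rows) across 4 quarters (columns)
returns = np.array([
    [0.05, 0.08, 0.03, 0.06],   # Stock A
    [0.12, -0.02, 0.15, 0.08],  # Stock B
    [0.03, 0.04, 0.02, 0.05]    # Stock C
])
portfolio_weights = np.array([0.5, 0.3, 0.2])

# returned=True yields (average, sum_of_weights)
avg, weight_sum = np.average(returns, axis=0,
                             weights=portfolio_weights, returned=True)
print(weight_sum)  # [1. 1. 1. 1.] (these weights already sum to 1)
```

The sum-of-weights array has the same shape as the averages, one entry per quarter here.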
Performance Tips and Best Practices
Use keepdims for broadcasting compatibility. When you need to subtract the mean from your data (centering), keepdims=True preserves the array’s dimensionality:
import numpy as np
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Without keepdims, this requires reshaping
mean_no_keepdims = np.mean(data, axis=1)
print(mean_no_keepdims.shape) # (3,)
# With keepdims, broadcasting works directly
mean_keepdims = np.mean(data, axis=1, keepdims=True)
print(mean_keepdims.shape) # (3, 1)
# Center the data (subtract row means)
centered = data - mean_keepdims
print(centered)
# [[-1. 0. 1.]
# [-1. 0. 1.]
# [-1. 0. 1.]]
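The same keepdims pattern extends naturally to full standardization (z-scores), where both the mean and the standard deviation need to broadcast back against the data. A sketch:

```python
import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)

# keepdims=True keeps both statistics as (3, 1) columns,
# so broadcasting subtracts/divides row-wise without reshaping
row_means = np.mean(data, axis=1, keepdims=True)
row_stds = np.std(data, axis=1, keepdims=True)
z_scores = (data - row_means) / row_stds

# Each row now has mean ~0 and standard deviation ~1
print(np.mean(z_scores, axis=1))
print(np.std(z_scores, axis=1))
```

Note that np.std() defaults to the population standard deviation (ddof=0); pass ddof=1 if you want the sample version.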
Choose function vs method syntax based on context. Use np.mean(arr) when the input might be a list or when you want explicit clarity. Use arr.mean() when chaining operations or when the array is already clearly defined.
Pre-allocate output arrays for repeated calculations. Rather than calling np.mean() once per chunk in a Python loop, stack the chunks into a single array and reduce along an axis, writing the results into a pre-allocated array with the out parameter:
import numpy as np
chunks = np.random.randn(10, 1000) # 10 chunks of 1000 samples each
result = np.empty(10)
np.mean(chunks, axis=1, out=result) # results written in place, no new allocation
The bottom line: NumPy’s mean functions are fast, flexible, and battle-tested. Use np.mean() for simple cases, np.nanmean() when missing data is possible, and np.average() when weights matter. Master the axis parameter, and you’ll never write a loop to calculate means again.