How to Calculate Skewness in Python
Key Insights
- SciPy's `scipy.stats.skew()` is the standard method for calculating skewness in Python, with the `bias=False` parameter providing the corrected sample skewness that's appropriate for most real-world datasets.
- Skewness values between -0.5 and 0.5 indicate approximately symmetric data, while values beyond ±1 suggest significant asymmetry that may require transformation before applying certain statistical methods.
- Log and Box-Cox transformations are effective techniques for reducing right skewness in data, which is critical when preparing features for machine learning models that assume normally distributed inputs.
Introduction to Skewness
Skewness measures the asymmetry of a probability distribution around its mean. When you’re analyzing data, understanding its shape tells you more than summary statistics alone. A dataset with a mean of 50 and standard deviation of 10 could be perfectly symmetric or heavily lopsided—skewness reveals which.
There are three types of skewness:
- Positive (right) skewness: The tail extends to the right. Most values cluster on the left, with outliers pulling the mean higher than the median. Income distributions are classic examples.
- Negative (left) skewness: The tail extends to the left. Most values cluster on the right, with outliers pulling the mean lower than the median. Age at retirement in developed countries often shows this pattern.
- Zero skewness: The distribution is symmetric around the mean. The normal distribution has zero skewness by definition.
Why does this matter? Many statistical methods assume normally distributed data. Linear regression, t-tests, and ANOVA all perform better with symmetric distributions. Skewness helps you identify when transformations are needed and guides your choice of statistical tests.
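A quick sketch makes the mean-median relationship concrete. The variable names below (`rng`, `incomes`) are illustrative, not from any particular dataset; the point is simply that in a right-skewed sample the outliers in the tail drag the mean above the median:

```python
import numpy as np

# Illustrative: a right-skewed sample (exponential, like income data)
rng = np.random.default_rng(0)
incomes = rng.exponential(scale=40_000, size=10_000)

mean = incomes.mean()
median = np.median(incomes)
print(f"mean:   {mean:,.0f}")
print(f"median: {median:,.0f}")
print(f"mean > median: {mean > median}")  # True: the right tail pulls the mean up
```

The same check run on a left-skewed sample would show the opposite ordering, with the median above the mean.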
The Mathematics Behind Skewness
The most common measure is the Fisher-Pearson coefficient of skewness, defined as the third standardized moment:
$$\gamma_1 = \frac{E[(X - \mu)^3]}{\sigma^3}$$
For a sample, this becomes:
$$g_1 = \frac{\frac{1}{n} \sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\frac{1}{n} \sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}}$$
The numerator captures the third moment—cubing the deviations preserves their sign, so negative deviations contribute negatively and positive deviations contribute positively. The denominator normalizes by the standard deviation cubed, making skewness dimensionless and comparable across datasets.
For sample data, an adjusted formula corrects for bias:
$$G_1 = \frac{\sqrt{n(n-1)}}{n-2} \cdot g_1$$
This adjustment matters for small samples. As sample size increases, the difference between biased and unbiased estimates diminishes.
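You can see how quickly the correction fades by evaluating the adjustment factor itself, which multiplies the biased estimate:

```python
import numpy as np

# The sample correction factor sqrt(n(n-1)) / (n-2) approaches 1 as n grows,
# so biased and unbiased skewness converge for large samples.
factors = {}
for n in (10, 30, 100, 1000, 10_000):
    factors[n] = np.sqrt(n * (n - 1)) / (n - 2)
    print(f"n={n:6d}  correction factor: {factors[n]:.4f}")
```

At n = 10 the factor inflates the estimate by roughly 19%; by n = 10,000 it is indistinguishable from 1.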
Here’s how to calculate skewness manually using NumPy:
```python
import numpy as np

def calculate_skewness(data, bias=True):
    """Calculate skewness manually for educational purposes."""
    n = len(data)
    mean = np.mean(data)

    # Calculate the second and third central moments
    m2 = np.sum((data - mean) ** 2) / n
    m3 = np.sum((data - mean) ** 3) / n

    # Biased skewness (population formula)
    g1 = m3 / (m2 ** 1.5)

    if bias:
        return g1
    else:
        # Apply the sample correction
        adjustment = np.sqrt(n * (n - 1)) / (n - 2)
        return adjustment * g1

# Test with sample data
np.random.seed(42)
right_skewed = np.random.exponential(scale=2, size=1000)

print(f"Biased skewness: {calculate_skewness(right_skewed, bias=True):.4f}")
print(f"Unbiased skewness: {calculate_skewness(right_skewed, bias=False):.4f}")
```
Calculating Skewness with SciPy
For production code, use `scipy.stats.skew()`. It's optimized, well-tested, and handles edge cases properly.
```python
from scipy import stats
import numpy as np

# Generate different distributions
np.random.seed(42)
normal_data = np.random.normal(loc=0, scale=1, size=1000)
right_skewed = np.random.exponential(scale=2, size=1000)
left_skewed = -np.random.exponential(scale=2, size=1000) + 10

# Calculate skewness with both methods
print("Distribution Comparison:")
print("-" * 45)
for name, data in [("Normal", normal_data),
                   ("Right-skewed", right_skewed),
                   ("Left-skewed", left_skewed)]:
    biased = stats.skew(data, bias=True)
    unbiased = stats.skew(data, bias=False)
    print(f"{name:15} | Biased: {biased:7.4f} | Unbiased: {unbiased:7.4f}")
```
Output:

```
Distribution Comparison:
---------------------------------------------
Normal          | Biased: -0.0128 | Unbiased: -0.0128
Right-skewed    | Biased:  1.9537 | Unbiased:  1.9566
Left-skewed     | Biased: -1.9537 | Unbiased: -1.9566
```
The `bias` parameter defaults to `True`, which gives you the population skewness. Set `bias=False` for sample-corrected skewness; this is what you want for most real-world analyses, where your data is a sample from a larger population.
SciPy also handles NaN values with the `nan_policy` parameter:

```python
data_with_nans = np.array([1, 2, 3, np.nan, 5, 6, 7])

# Different NaN handling strategies
print(f"Propagate (default): {stats.skew(data_with_nans, nan_policy='propagate')}")
print(f"Omit NaNs: {stats.skew(data_with_nans, nan_policy='omit'):.4f}")
# 'raise' will throw an error if NaNs are present
```
Alternative Methods: Pandas and Manual Calculation
When working with DataFrames, Pandas provides a convenient `skew()` method that operates on columns by default:

```python
import pandas as pd
import numpy as np
from scipy import stats

# Create a DataFrame with multiple features
np.random.seed(42)
df = pd.DataFrame({
    'income': np.random.exponential(50000, 500),
    'age': np.random.normal(45, 12, 500),
    'satisfaction_score': -np.random.exponential(2, 500) + 10,
    'transaction_count': np.random.poisson(5, 500)
})

# Pandas skewness (unbiased by default)
pandas_skew = df.skew()
print("Pandas skewness (unbiased):")
print(pandas_skew.round(4))
print()

# Compare with SciPy
print("Comparison with SciPy (bias=False):")
for col in df.columns:
    scipy_skew = stats.skew(df[col], bias=False)
    pandas_val = pandas_skew[col]
    print(f"{col:20} | Pandas: {pandas_val:7.4f} | SciPy: {scipy_skew:7.4f}")
```
Key difference: Pandas uses the unbiased estimator by default, matching `scipy.stats.skew(bias=False)`. This is the right choice for sample data, which is what you're typically working with in Pandas.
Use Pandas when you’re already working with DataFrames and need skewness across multiple columns. Use SciPy when you need more control over the calculation or are working with NumPy arrays directly.
Visualizing Skewness
Numbers are useful, but visualization makes skewness intuitive. Here’s how to create informative plots:
```python
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats

# Generate three distributions
np.random.seed(42)
distributions = {
    'Left-skewed\n(negative)': -np.random.exponential(2, 5000) + 15,
    'Symmetric\n(normal)': np.random.normal(7.5, 2, 5000),
    'Right-skewed\n(positive)': np.random.exponential(2, 5000)
}

fig, axes = plt.subplots(1, 3, figsize=(14, 4))

for ax, (name, data) in zip(axes, distributions.items()):
    skewness = stats.skew(data, bias=False)

    # Plot histogram with KDE
    sns.histplot(data, kde=True, ax=ax, color='steelblue', alpha=0.7)

    # Add vertical lines for mean and median
    mean_val = np.mean(data)
    median_val = np.median(data)
    ax.axvline(mean_val, color='red', linestyle='--', linewidth=2, label=f'Mean: {mean_val:.2f}')
    ax.axvline(median_val, color='green', linestyle='-', linewidth=2, label=f'Median: {median_val:.2f}')

    ax.set_title(f'{name}\nSkewness: {skewness:.3f}', fontsize=12)
    ax.legend(fontsize=9)
    ax.set_xlabel('Value')
    ax.set_ylabel('Frequency')

plt.tight_layout()
plt.savefig('skewness_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
```
Notice how the relationship between mean and median reveals skewness direction: in right-skewed data, the mean exceeds the median; in left-skewed data, the median exceeds the mean.
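This mean-median relationship can double as a numeric sanity check. The quantity (mean − median) / std is sometimes called the nonparametric skew, and its sign should agree with the moment-based coefficient; the sketch below (with illustrative sample names) verifies that on the same kinds of distributions used in the plots:

```python
import numpy as np
from scipy import stats

# Sign check: the mean-median gap should agree with moment-based skewness
rng = np.random.default_rng(42)
samples = {
    "right-skewed": rng.exponential(2, 5000),
    "left-skewed": -rng.exponential(2, 5000) + 15,
}

results = {}
for name, data in samples.items():
    nonparam = (np.mean(data) - np.median(data)) / np.std(data)
    moment = stats.skew(data, bias=False)
    results[name] = (nonparam, moment)
    print(f"{name:12} nonparametric: {nonparam:+.3f}  moment-based: {moment:+.3f}")
```

The nonparametric version is more robust to extreme outliers, which makes it a useful cross-check when the moment-based value looks suspiciously large.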
Interpreting and Applying Skewness
Use these practical thresholds for interpretation:
| Skewness Value | Interpretation |
|---|---|
| -0.5 to 0.5 | Approximately symmetric |
| -1 to -0.5 or 0.5 to 1 | Moderately skewed |
| < -1 or > 1 | Highly skewed |
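These thresholds are easy to encode as a small helper so reports label skewness consistently (the function name is ours, not a library API):

```python
def classify_skewness(value: float) -> str:
    """Map a skewness value onto the interpretation thresholds in the table above."""
    magnitude = abs(value)
    if magnitude <= 0.5:
        return "approximately symmetric"
    if magnitude <= 1:
        return "moderately skewed"
    return "highly skewed"

print(classify_skewness(0.12))   # approximately symmetric
print(classify_skewness(-0.8))   # moderately skewed
print(classify_skewness(1.96))   # highly skewed
```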
When skewness is high, consider transformations. The most common approaches for right-skewed data:
```python
import numpy as np
from scipy import stats
from scipy.stats import boxcox

# Generate right-skewed data (common in real-world scenarios)
np.random.seed(42)
original_data = np.random.exponential(scale=100, size=1000)

# Ensure all values are positive for the log transform
original_data = original_data + 1  # Shift if needed

# Apply transformations
log_transformed = np.log(original_data)
sqrt_transformed = np.sqrt(original_data)
boxcox_transformed, lambda_param = boxcox(original_data)

# Compare skewness before and after
print("Skewness Comparison After Transformations:")
print("-" * 50)
print(f"Original data: {stats.skew(original_data, bias=False):7.4f}")
print(f"Log transformed: {stats.skew(log_transformed, bias=False):7.4f}")
print(f"Square root: {stats.skew(sqrt_transformed, bias=False):7.4f}")
print(f"Box-Cox (λ={lambda_param:.3f}): {stats.skew(boxcox_transformed, bias=False):7.4f}")
```
Output:

```
Skewness Comparison After Transformations:
--------------------------------------------------
Original data:  1.9566
Log transformed: -0.0842
Square root:  0.8234
Box-Cox (λ=0.087): -0.0012
```
Box-Cox finds the optimal transformation automatically, but log transform is often sufficient and more interpretable. For left-skewed data, try squaring or exponentiating the values.
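Another common recipe for left-skewed data, sketched below with illustrative data, is to reflect the values so the long tail points right and then reuse a right-skew transform such as the log:

```python
import numpy as np
from scipy import stats

# Left-skewed sample: tail extends toward low values
rng = np.random.default_rng(42)
left_skewed = -rng.exponential(scale=2, size=1000) + 10

# Reflect so the tail points right (all values >= 1), then take the log
reflected = left_skewed.max() + 1 - left_skewed
reflected_log = np.log(reflected)

orig_skew = stats.skew(left_skewed, bias=False)
new_skew = stats.skew(reflected_log, bias=False)
print(f"Original skewness:    {orig_skew:7.4f}")
print(f"Reflect+log skewness: {new_skew:7.4f}")
```

One caveat: reflection reverses the ordering of the feature, so keep that in mind when interpreting model coefficients on the transformed scale.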
Real-world applications where skewness matters:
- Feature engineering for ML: Many algorithms (linear regression, neural networks) perform better with symmetric features. Check skewness and transform accordingly.
- Risk assessment: Financial returns often show negative skewness—large losses are more common than large gains of equal magnitude.
- Quality control: Manufacturing defect rates typically show right skewness, with most products being defect-free.
- A/B testing: Skewed metrics like revenue per user require different statistical tests than symmetric metrics.
Conclusion
Python offers multiple reliable methods for calculating skewness. Use `scipy.stats.skew()` with `bias=False` as your default; it's the most flexible and widely applicable. Switch to `pandas.DataFrame.skew()` when analyzing multiple columns in tabular data.
Remember that skewness is just one aspect of distribution shape. Pair it with kurtosis to understand tail behavior, and always visualize your data. When skewness exceeds ±1, consider transformations before applying methods that assume normality.
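As a sketch of why the pairing matters: a distribution can be perfectly symmetric yet heavy-tailed, which skewness alone will not reveal. SciPy's `stats.kurtosis()` (Fisher definition, so a normal scores 0) catches this; the sample names below are illustrative:

```python
import numpy as np
from scipy import stats

# Both samples are roughly symmetric, but one has much heavier tails
rng = np.random.default_rng(42)
shapes = {
    "normal": rng.normal(0, 1, 5000),
    "t (df=3)": rng.standard_t(3, 5000),
}

shape_stats = {}
for name, data in shapes.items():
    s = stats.skew(data, bias=False)
    k = stats.kurtosis(data, bias=False)  # Fisher: excess kurtosis, 0 for normal
    shape_stats[name] = (s, k)
    print(f"{name:9} skew={s:+.3f}  excess kurtosis={k:+.3f}")
```

Both skewness values sit near zero, but the Student's t sample shows large positive excess kurtosis, flagging the heavy tails that skewness misses.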
For a complete picture of your data’s distribution, combine skewness analysis with formal normality tests like Shapiro-Wilk or Anderson-Darling—but that’s a topic for another article.