How to Perform the Anderson-Darling Test in Python


Key Insights

  • The Anderson-Darling test is more sensitive to deviations in distribution tails than the Kolmogorov-Smirnov test, making it better for detecting outliers and extreme values
  • SciPy’s anderson() function returns critical values at five significance levels, allowing you to assess confidence without running multiple tests
  • Always combine statistical tests with visual inspection—a Q-Q plot alongside the Anderson-Darling test gives you both quantitative evidence and intuitive understanding

Introduction to the Anderson-Darling Test

The Anderson-Darling test is a goodness-of-fit test that determines whether your data follows a specific probability distribution. While it’s commonly used for normality testing, it can evaluate fit against several distributions including exponential, logistic, and Gumbel.

Why choose Anderson-Darling over alternatives? The Shapiro-Wilk test is excellent for normality but limited to that single distribution. The Kolmogorov-Smirnov test is more general but treats all parts of the distribution equally. Anderson-Darling applies greater weight to the tails, catching deviations that K-S might miss.

This matters in practice. Financial risk models, quality control processes, and scientific experiments often care most about extreme values. A manufacturing process might produce parts that look normally distributed in the middle but have suspicious outliers. Anderson-Darling will catch that.

Common use cases include:

  • Validating assumptions before parametric statistical tests
  • Quality control in manufacturing (Six Sigma applications)
  • Financial modeling where tail behavior determines risk
  • Preprocessing checks in machine learning pipelines
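As a rough illustration of the tail-sensitivity point (a sketch; the seed, sample size, and the choice of a Student's t alternative are arbitrary), you can run both tests on heavy-tailed data that looks normal in the center:

```python
import numpy as np
from scipy import stats

# Student's t with 4 degrees of freedom: roughly normal in the center,
# but with noticeably fatter tails than a true normal distribution
rng = np.random.default_rng(0)
heavy_tailed = rng.standard_t(df=4, size=1000)

# Anderson-Darling test against the normal distribution
ad = stats.anderson(heavy_tailed, dist='norm')
print(f"A-D statistic: {ad.statistic:.4f} (5% critical value: {ad.critical_values[2]:.4f})")

# Kolmogorov-Smirnov test against a normal fitted to the same sample
ks_stat, ks_p = stats.kstest(heavy_tailed, 'norm',
                             args=(heavy_tailed.mean(), heavy_tailed.std()))
print(f"K-S statistic: {ks_stat:.4f} (p-value: {ks_p:.4f})")
```

One caveat: plugging estimated parameters into kstest makes its p-value optimistic (the Lilliefors problem), so the comparison here is only meant to contrast the two statistics, not their p-values.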

Understanding the Test Statistic and Critical Values

The Anderson-Darling statistic measures how well your data fits a theoretical distribution by comparing the empirical cumulative distribution function (ECDF) to the theoretical CDF. The formula applies a weighting function that emphasizes discrepancies in the tails:

$$A^2 = -n - \frac{1}{n} \sum_{i=1}^{n} (2i-1)[\ln(F(Y_i)) + \ln(1-F(Y_{n+1-i}))]$$

Don’t worry about computing this manually—SciPy handles it. What you need to understand is interpretation.

The null hypothesis states that your data follows the specified distribution. A larger A² statistic indicates greater deviation from the theoretical distribution. SciPy returns critical values at five significance levels: 15%, 10%, 5%, 2.5%, and 1%.

The decision rule is straightforward:

  • If your statistic exceeds the critical value at a given significance level, reject the null hypothesis at that level
  • If your statistic is below all critical values, you cannot reject the null hypothesis—the data is consistent with the distribution

Note the asymmetry: failing to reject doesn’t prove your data follows the distribution. It means you lack sufficient evidence to claim otherwise.
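The decision rule above is easy to wrap in a small helper (illustrative only; `interpret_anderson` is our own name, not part of SciPy):

```python
import numpy as np
from scipy import stats

def interpret_anderson(result):
    """Print reject/fail-to-reject for each significance level of an anderson() result."""
    for cv, sl in zip(result.critical_values, result.significance_level):
        verdict = "reject H0" if result.statistic > cv else "fail to reject H0"
        print(f"{sl:>5}% level: statistic {result.statistic:.4f} vs critical {cv:.4f} -> {verdict}")

# Example: data actually drawn from a normal distribution
interpret_anderson(stats.anderson(np.random.default_rng(1).normal(size=100), dist='norm'))
```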

Basic Implementation with SciPy

Let’s start with the fundamental usage. SciPy’s anderson() function lives in scipy.stats:

import numpy as np
from scipy import stats

# Generate normally distributed data
np.random.seed(42)
normal_data = np.random.normal(loc=50, scale=10, size=200)

# Perform Anderson-Darling test for normality
result = stats.anderson(normal_data, dist='norm')

print(f"Test Statistic: {result.statistic:.4f}")
print(f"Critical Values: {result.critical_values}")
print(f"Significance Levels: {result.significance_level}")

Output:

Test Statistic: 0.2508
Critical Values: [0.555 0.632 0.759 0.885 1.053]
Significance Levels: [15.  10.   5.   2.5  1. ]

The statistic (0.2508) is below all critical values. We cannot reject normality at any significance level. This makes sense—we generated the data from a normal distribution.

Now let’s test data that isn’t normal:

# Generate exponentially distributed data
exponential_data = np.random.exponential(scale=10, size=200)

# Test for normality (should fail)
result = stats.anderson(exponential_data, dist='norm')

print(f"Test Statistic: {result.statistic:.4f}")
print("\nComparison with critical values:")
for cv, sl in zip(result.critical_values, result.significance_level):
    comparison = ">" if result.statistic > cv else "<"
    reject = "REJECT" if result.statistic > cv else "fail to reject"
    print(f"  {sl}% level: {result.statistic:.4f} {comparison} {cv:.4f} → {reject}")

Output:

Test Statistic: 14.7823

Comparison with critical values:
  15.0% level: 14.7823 > 0.5550 → REJECT
  10.0% level: 14.7823 > 0.6320 → REJECT
  5.0% level: 14.7823 > 0.7590 → REJECT
  2.5% level: 14.7823 > 0.8850 → REJECT
  1.0% level: 14.7823 > 1.0530 → REJECT

The statistic massively exceeds all critical values. We reject normality with high confidence.

Testing Against Different Distributions

SciPy's anderson() supports testing against 'norm', 'expon', 'logistic', 'gumbel_l' (the left-skewed Gumbel; 'gumbel' and 'extreme1' are accepted synonyms), and 'gumbel_r' (the right-skewed Gumbel). Recent SciPy versions also accept 'weibull_min'.

# Generate data from different distributions
np.random.seed(123)
exp_data = np.random.exponential(scale=5, size=300)
logistic_data = np.random.logistic(loc=10, scale=2, size=300)

# Test exponential data against exponential distribution
exp_result = stats.anderson(exp_data, dist='expon')
print("Exponential data tested against exponential distribution:")
print(f"  Statistic: {exp_result.statistic:.4f}")
print(f"  Critical value at 5%: {exp_result.critical_values[2]:.4f}")
print(f"  Result: {'Reject' if exp_result.statistic > exp_result.critical_values[2] else 'Fail to reject'}")

# Test logistic data against logistic distribution
log_result = stats.anderson(logistic_data, dist='logistic')
print("\nLogistic data tested against logistic distribution:")
print(f"  Statistic: {log_result.statistic:.4f}")
print(f"  Critical value at 5%: {log_result.critical_values[2]:.4f}")
print(f"  Result: {'Reject' if log_result.statistic > log_result.critical_values[2] else 'Fail to reject'}")

# Cross-test: exponential data against normal
cross_result = stats.anderson(exp_data, dist='norm')
print("\nExponential data tested against normal distribution:")
print(f"  Statistic: {cross_result.statistic:.4f}")
print(f"  Critical value at 5%: {cross_result.critical_values[2]:.4f}")
print(f"  Result: {'Reject' if cross_result.statistic > cross_result.critical_values[2] else 'Fail to reject'}")

For distributions not built into SciPy, you’ll need alternative approaches. The scipy.stats.anderson_ksamp function tests whether multiple samples come from the same distribution. For arbitrary distributions, consider using the statsmodels library or implementing a parametric bootstrap approach.
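For example, anderson_ksamp can check whether two samples plausibly share a distribution (a sketch with arbitrary data; note that SciPy caps the returned significance level to the range [0.001, 0.25] and warns when the cap is hit):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample_a = rng.normal(loc=0, scale=1, size=200)
sample_b = rng.normal(loc=0, scale=1, size=200)

# k-sample Anderson-Darling: H0 is that all samples come from the same
# (unspecified) continuous distribution
result = stats.anderson_ksamp([sample_a, sample_b])
print(f"Statistic: {result.statistic:.4f}")
print(f"Approximate p-value: {result.significance_level:.4f}")
```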

Practical Example: Real-World Data Analysis

Let’s work through a complete analysis workflow. Imagine you’re analyzing response times from a web service and need to determine if they’re normally distributed before applying parametric statistical methods.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Simulate realistic response time data (often right-skewed)
np.random.seed(456)
response_times = np.random.lognormal(mean=4.5, sigma=0.5, size=500)

# Create visualization
fig, axes = plt.subplots(1, 3, figsize=(14, 4))

# Histogram with normal fit overlay
axes[0].hist(response_times, bins=30, density=True, alpha=0.7, edgecolor='black')
mu, std = response_times.mean(), response_times.std()
x = np.linspace(response_times.min(), response_times.max(), 100)
axes[0].plot(x, stats.norm.pdf(x, mu, std), 'r-', linewidth=2, label='Normal fit')
axes[0].set_xlabel('Response Time (ms)')
axes[0].set_ylabel('Density')
axes[0].set_title('Distribution of Response Times')
axes[0].legend()

# Q-Q plot
stats.probplot(response_times, dist="norm", plot=axes[1])
axes[1].set_title('Q-Q Plot (Normal)')

# Box plot to show outliers
axes[2].boxplot(response_times, vert=True)
axes[2].set_ylabel('Response Time (ms)')
axes[2].set_title('Box Plot')

plt.tight_layout()
plt.savefig('response_time_analysis.png', dpi=150)
plt.show()

# Run Anderson-Darling test
print("=" * 50)
print("Anderson-Darling Test Results")
print("=" * 50)

result = stats.anderson(response_times, dist='norm')
print(f"\nTest Statistic: {result.statistic:.4f}")
print(f"\nSignificance Level | Critical Value | Decision")
print("-" * 50)

for cv, sl in zip(result.critical_values, result.significance_level):
    decision = "Reject H0" if result.statistic > cv else "Fail to reject H0"
    print(f"{sl:>17}% | {cv:>14.4f} | {decision}")

# Recommendation based on results
print("\n" + "=" * 50)
if result.statistic > result.critical_values[2]:  # 5% level
    print("RECOMMENDATION: Data is NOT normally distributed.")
    print("Consider using non-parametric methods or transforming the data.")
    
    # Try log transformation
    log_times = np.log(response_times)
    log_result = stats.anderson(log_times, dist='norm')
    print(f"\nLog-transformed data test statistic: {log_result.statistic:.4f}")
    if log_result.statistic < log_result.critical_values[2]:
        print("Log transformation achieves normality!")
else:
    print("RECOMMENDATION: Data is consistent with normal distribution.")
    print("Parametric methods are appropriate.")

This workflow demonstrates the practical integration of visual and statistical analysis. The Q-Q plot shows systematic deviation from the diagonal line for non-normal data, while the histogram reveals skewness. The Anderson-Darling test quantifies what you see visually.

Limitations and Best Practices

Sample size matters. With very small samples (n < 20), the test has low power—it may fail to detect non-normality even when present. With very large samples (n > 5000), the test becomes overly sensitive and may reject normality for trivially small deviations that don’t matter practically.

A practical heuristic: for samples between 20 and 5000 observations, trust the test results. Outside that range, weight visual inspection more heavily.
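You can see the power effect directly with a quick simulation (a sketch; the exponential alternative, seed, and trial count are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def rejection_rate(n, trials=500):
    """Fraction of exponential samples of size n that reject normality at the 5% level."""
    rejections = 0
    for _ in range(trials):
        sample = rng.exponential(scale=1.0, size=n)
        result = stats.anderson(sample, dist='norm')
        rejections += result.statistic > result.critical_values[2]
    return rejections / trials

print(f"n = 10:  rejection rate {rejection_rate(10):.2f}")
print(f"n = 200: rejection rate {rejection_rate(200):.2f}")
```

With clearly non-normal (exponential) data, the rejection rate climbs sharply with sample size: at n = 10 the test misses the skew in many trials, while at n = 200 it rejects almost every time.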

When the test fails normality:

  1. Consider transformations (log, square root, Box-Cox)
  2. Use non-parametric alternatives (Mann-Whitney instead of t-test, Spearman instead of Pearson)
  3. Apply robust statistical methods designed for non-normal data
  4. Check for outliers that might be corrupting otherwise normal data
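Step 1 can be sketched with SciPy's Box-Cox transform (illustrative data; boxcox requires strictly positive values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
skewed = rng.lognormal(mean=0.0, sigma=0.8, size=300)  # strictly positive, right-skewed

# Box-Cox fits the power-transform lambda that best normalizes the data;
# a lambda near 0 corresponds to a plain log transform
transformed, lam = stats.boxcox(skewed)

before = stats.anderson(skewed, dist='norm')
after = stats.anderson(transformed, dist='norm')
print(f"Fitted lambda: {lam:.3f}")
print(f"A-D statistic before: {before.statistic:.4f}, after: {after.statistic:.4f}")
```

Since the simulated data is lognormal, the fitted lambda lands near zero and the transformed sample passes the normality test that the raw sample fails.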

Complementary validation approaches:

  • Run multiple tests: Shapiro-Wilk for samples under 5000, Kolmogorov-Smirnov for comparison
  • Always create Q-Q plots—they reveal the nature of non-normality (heavy tails, skewness, multimodality)
  • Calculate skewness and kurtosis for additional context

A small helper can bundle these checks:

def comprehensive_normality_check(data, alpha=0.05):
    """Run multiple normality tests and return summary."""
    results = {}
    
    # Anderson-Darling
    ad = stats.anderson(data, dist='norm')
    results['anderson_darling'] = {
        'statistic': ad.statistic,
        'reject': ad.statistic > ad.critical_values[2]  # 5% level
    }
    
    # Shapiro-Wilk (for smaller samples)
    if len(data) <= 5000:
        sw_stat, sw_p = stats.shapiro(data)
        results['shapiro_wilk'] = {
            'statistic': sw_stat,
            'p_value': sw_p,
            'reject': sw_p < alpha
        }
    
    # Descriptive statistics
    results['skewness'] = stats.skew(data)
    results['kurtosis'] = stats.kurtosis(data)
    
    return results

The Anderson-Darling test is a reliable tool in your statistical toolkit, but it’s not a replacement for understanding your data. Use it alongside visualization and domain knowledge to make informed decisions about your analysis approach.
