How to Perform the Sign Test in Python
Key Insights
- The sign test is a non-parametric alternative to the paired t-test that makes no assumptions about data distribution, making it ideal for ordinal data or small samples with outliers.
- Python offers two main approaches: manual calculation using SciPy’s binomial test, or the convenient sign_test function from statsmodels.
- While less powerful than the Wilcoxon signed-rank test, the sign test’s simplicity and minimal assumptions make it valuable when you only care about the direction of change, not magnitude.
Introduction to the Sign Test
The sign test is one of the oldest and simplest non-parametric statistical tests. It determines whether there’s a consistent difference between pairs of observations—think before/after measurements, matched subjects, or repeated measures on the same individuals.
Unlike the paired t-test, the sign test doesn’t assume your data follows a normal distribution. It doesn’t even care about the magnitude of differences. All it asks is: do values tend to increase or decrease?
Use the sign test when:
- Your data is ordinal (ranked but not necessarily numeric)
- Sample sizes are small and normality is questionable
- You have outliers that would distort parametric tests
- You only care about direction of change, not how much
Common applications include comparing user preferences between two products, measuring patient outcomes before and after treatment, and evaluating performance changes across matched pairs.
How the Sign Test Works
The logic is straightforward. For each pair of observations, calculate the difference. Classify each difference as positive (+), negative (−), or zero (tie). Discard ties. Count the positives and negatives.
Under the null hypothesis of no difference, positive and negative signs should occur with equal probability (0.5 each). The test statistic is simply the count of the less frequent sign. We then use the binomial distribution to calculate how likely we’d see such an extreme result by chance.
Here’s the manual process:
import numpy as np
# Before and after measurements
before = np.array([72, 85, 68, 90, 75, 82, 78, 88, 71, 80])
after = np.array([68, 82, 65, 88, 70, 85, 74, 84, 68, 78])
# Calculate differences
differences = after - before
print(f"Differences: {differences}")
# Count signs
positive_count = np.sum(differences > 0)
negative_count = np.sum(differences < 0)
ties = np.sum(differences == 0)
print(f"Positive signs: {positive_count}")
print(f"Negative signs: {negative_count}")
print(f"Ties (excluded): {ties}")
# Test statistic is the smaller count
n = positive_count + negative_count # Total non-tied pairs
test_statistic = min(positive_count, negative_count)
print(f"Test statistic: {test_statistic}")
print(f"n (excluding ties): {n}")
Output:
Differences: [-4 -3 -3 -2 -5 3 -4 -4 -3 -2]
Positive signs: 1
Negative signs: 9
Ties (excluded): 0
Test statistic: 1
n (excluding ties): 10
With 9 negative signs and only 1 positive, the data strongly suggests values decreased after treatment.
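Where does “strongly suggests” become a number? Under the null hypothesis the sign count follows a Binomial(n, 0.5) distribution, so we can compute the exact p-value by hand with nothing but the standard library (a quick sketch using the counts from the example above):

```python
from math import comb

n = 10   # non-tied pairs from the example above
k = 1    # the smaller sign count (positives)

# P(X <= k) under Binomial(n, 0.5): sum the lower-tail probabilities
lower_tail = sum(comb(n, i) * 0.5**n for i in range(k + 1))

# Two-sided p-value: double the tail (the distribution is symmetric)
p_value = min(1.0, 2 * lower_tail)
print(f"Exact two-sided p-value: {p_value:.4f}")  # 0.0215
```

At roughly 0.021, the result is significant at the 5% level, which is exactly what SciPy’s binomial test will report in the next section.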
Performing the Sign Test with SciPy
SciPy doesn’t have a dedicated sign test function, but we can use the binomial test directly. The logic: if there’s no true difference, each non-tied observation has a 50% chance of being positive.
from scipy import stats
import numpy as np
# Sample data: reaction times before and after training
before = np.array([245, 312, 278, 295, 267, 301, 289, 256, 283, 271, 298, 264])
after = np.array([238, 298, 271, 289, 258, 295, 276, 251, 279, 268, 287, 259])
# Calculate differences and signs
differences = after - before
positive_count = np.sum(differences > 0)
negative_count = np.sum(differences < 0)
n = positive_count + negative_count
print(f"Sample size (excluding ties): {n}")
print(f"Positive differences: {positive_count}")
print(f"Negative differences: {negative_count}")
# Two-tailed sign test using binomial test
# Two-sided test: probability of a sign count at least as extreme as k in either direction
k = min(positive_count, negative_count)
# For scipy >= 1.7, use binomtest
result = stats.binomtest(k, n, p=0.5, alternative='two-sided')
p_value = result.pvalue
print(f"\nSign Test Results:")
print(f"Test statistic (smaller count): {k}")
print(f"P-value (two-tailed): {p_value:.4f}")
# Interpretation
alpha = 0.05
if p_value < alpha:
    print(f"Result: Significant at α={alpha}. Reject null hypothesis.")
else:
    print(f"Result: Not significant at α={alpha}. Fail to reject null hypothesis.")
Output:
Sample size (excluding ties): 12
Positive differences: 0
Negative differences: 12
Sign Test Results:
Test statistic (smaller count): 0
P-value (two-tailed): 0.0005
Result: Significant at α=0.05. Reject null hypothesis.
All 12 subjects showed decreased reaction times after training. The probability of this happening by chance (if training had no effect) is about 0.05%, providing strong evidence that training improves reaction times.
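If you run this test often, the steps above are easy to wrap in a small reusable helper. The function below (sign_test_paired is a hypothetical name, not part of SciPy) is a sketch of the same two-sided procedure:

```python
import numpy as np
from scipy import stats

def sign_test_paired(before, after):
    """Two-sided sign test for paired samples via the exact binomial test.

    Returns (smaller sign count, n excluding ties, p-value).
    """
    differences = np.asarray(after) - np.asarray(before)
    positive = int(np.sum(differences > 0))
    negative = int(np.sum(differences < 0))
    n = positive + negative          # ties are discarded
    k = min(positive, negative)
    result = stats.binomtest(k, n, p=0.5, alternative='two-sided')
    return k, n, result.pvalue

# Same reaction time data as above
before = [245, 312, 278, 295, 267, 301, 289, 256, 283, 271, 298, 264]
after = [238, 298, 271, 289, 258, 295, 276, 251, 279, 268, 287, 259]
k, n, p = sign_test_paired(before, after)
print(f"k={k}, n={n}, p={p:.4f}")  # k=0, n=12, p=0.0005
```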
Using Statsmodels for the Sign Test
Statsmodels provides a more direct approach with its sign_test function. This handles the difference calculation and statistical testing in one call.
from statsmodels.stats.descriptivestats import sign_test
import numpy as np
# Same reaction time data
before = np.array([245, 312, 278, 295, 267, 301, 289, 256, 283, 271, 298, 264])
after = np.array([238, 298, 271, 289, 258, 295, 276, 251, 279, 268, 287, 259])
# Calculate differences
differences = after - before
# Perform sign test
# Tests whether the median of differences equals mu0 (default 0)
statistic, p_value = sign_test(differences, mu0=0)
print(f"Statsmodels Sign Test Results:")
print(f"Test statistic (M): {statistic}")
print(f"P-value (two-tailed): {p_value:.4f}")
# Compare with manual calculation
print(f"\nVerification:")
print(f"Positive count: {np.sum(differences > 0)}")
print(f"Negative count: {np.sum(differences < 0)}")
Output:
Statsmodels Sign Test Results:
Test statistic (M): -6.0
P-value (two-tailed): 0.0005
Verification:
Positive count: 0
Negative count: 12
The statsmodels function returns M = (N(+) − N(−))/2, half the difference between the positive and negative counts (−6 here, since all 12 differences are negative). Its sign tells you which direction dominates. The p-value matches our SciPy calculation.
Note that statsmodels tests against a hypothesized median (mu0). Setting mu0=0 tests whether the median difference is zero—equivalent to testing for no systematic change.
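A nonzero mu0 is useful for testing a stronger claim. For example, to ask whether the median reaction time dropped by more than 5 seconds, shift the hypothesized median (differences equal to −5 count as ties and are dropped):

```python
import numpy as np
from statsmodels.stats.descriptivestats import sign_test

# Same reaction time data as above
before = np.array([245, 312, 278, 295, 267, 301, 289, 256, 283, 271, 298, 264])
after = np.array([238, 298, 271, 289, 258, 295, 276, 251, 279, 268, 287, 259])
differences = after - before

# H0: the median difference is -5 seconds
statistic, p_value = sign_test(differences, mu0=-5)
print(f"M = {statistic}, p = {p_value:.4f}")
```

Here only 2 differences lie above −5 while 8 lie below it, but with the two exact −5 values excluded that split is not lopsided enough to reject the null at α = 0.05.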
Interpreting Results and Handling Edge Cases
Understanding the output requires attention to several nuances.
Handling ties: Observations where the difference equals zero are typically excluded. This reduces your effective sample size, potentially affecting statistical power.
import numpy as np
from scipy import stats
# Data with ties
before = np.array([50, 55, 60, 65, 70, 75, 80, 85])
after = np.array([52, 55, 58, 65, 72, 73, 82, 85]) # Three ties
differences = after - before
print(f"Differences: {differences}")
# Identify and handle ties
non_zero_diff = differences[differences != 0]
ties = np.sum(differences == 0)
print(f"Ties excluded: {ties}")
print(f"Effective sample size: {len(non_zero_diff)}")
positive_count = np.sum(non_zero_diff > 0)
negative_count = np.sum(non_zero_diff < 0)
n = len(non_zero_diff)
k = min(positive_count, negative_count)
# Two-tailed test
result_two_tailed = stats.binomtest(k, n, p=0.5, alternative='two-sided')
print(f"\nTwo-tailed p-value: {result_two_tailed.pvalue:.4f}")
# One-tailed test: testing if after > before
# Count how many times after was greater
greater_count = np.sum(non_zero_diff > 0)
result_one_tailed = stats.binomtest(greater_count, n, p=0.5, alternative='greater')
print(f"One-tailed p-value (after > before): {result_one_tailed.pvalue:.4f}")
One-tailed vs. two-tailed: Use a one-tailed test when you have a directional hypothesis (e.g., “treatment improves outcomes”). Use two-tailed when you’re testing for any difference.
Limitations: The sign test ignores magnitude. A difference of 1 counts the same as a difference of 1000. This makes it less powerful than the Wilcoxon signed-rank test when magnitude information is meaningful. However, this “weakness” becomes a strength when dealing with ordinal data or extreme outliers.
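To see this trade-off concretely, here is a sketch (with made-up difference values) comparing the sign test against scipy.stats.wilcoxon on data containing one extreme outlier. The sign test’s p-value is identical whether the outlier is −50 or −2, because only the sign matters; Wilcoxon ranks the magnitudes, so its result shifts:

```python
import numpy as np
from scipy import stats

# Seven small positive differences plus one negative value
diffs_outlier = np.array([2, 3, 2, 4, 3, 2, 3, -50])  # extreme outlier
diffs_mild = np.array([2, 3, 2, 4, 3, 2, 3, -2])      # same signs, tame value

def sign_test_p(diffs):
    pos, neg = np.sum(diffs > 0), np.sum(diffs < 0)
    return stats.binomtest(min(pos, neg), pos + neg, p=0.5).pvalue

# The sign test sees only signs, so both datasets give identical p-values
print(f"Sign test (outlier): {sign_test_p(diffs_outlier):.4f}")
print(f"Sign test (mild):    {sign_test_p(diffs_mild):.4f}")

# Wilcoxon uses ranked magnitudes, so the outlier changes its result
print(f"Wilcoxon (outlier):  {stats.wilcoxon(diffs_outlier).pvalue:.4f}")
print(f"Wilcoxon (mild):     {stats.wilcoxon(diffs_mild).pvalue:.4f}")
```

With ties among the magnitudes, SciPy falls back to a normal approximation for Wilcoxon and may emit a warning; the point of the comparison is unaffected.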
Practical Example: A/B Testing Scenario
Let’s apply the sign test to a realistic scenario. You’ve run an A/B test where each user saw both versions of a checkout flow (counterbalanced order). You measured completion time and want to know if Version B is faster.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# Completion times (seconds) for 25 users (hard-coded sample data)
version_a = np.array([45, 62, 38, 71, 55, 48, 67, 52, 59, 44,
                      73, 41, 58, 65, 49, 56, 63, 47, 54, 69,
                      51, 60, 43, 66, 57])
version_b = np.array([42, 58, 35, 65, 51, 45, 61, 48, 55, 42,
                      68, 39, 54, 60, 46, 52, 58, 44, 51, 64,
                      47, 56, 41, 62, 53])
# Calculate differences (A - B, positive means B was faster)
differences = version_a - version_b
# Perform sign test
positive_count = np.sum(differences > 0) # B faster
negative_count = np.sum(differences < 0) # A faster
ties = np.sum(differences == 0)
n = positive_count + negative_count
print("A/B Test Analysis: Checkout Flow Comparison")
print("=" * 45)
print(f"Total users: {len(differences)}")
print(f"Version B faster: {positive_count} users")
print(f"Version A faster: {negative_count} users")
print(f"No difference: {ties} users")
# One-tailed test: Is Version B significantly faster?
result = stats.binomtest(positive_count, n, p=0.5, alternative='greater')
print(f"\nHypothesis: Version B reduces completion time")
print(f"P-value (one-tailed): {result.pvalue:.4f}")
print(f"95% CI for proportion favoring B: [{result.proportion_ci().low:.3f}, {result.proportion_ci().high:.3f}]")
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# Plot 1: Paired differences
colors = ['green' if d > 0 else 'red' if d < 0 else 'gray' for d in differences]
axes[0].bar(range(len(differences)), differences, color=colors, alpha=0.7)
axes[0].axhline(y=0, color='black', linestyle='-', linewidth=0.5)
axes[0].set_xlabel('User')
axes[0].set_ylabel('Time Difference (A - B) in seconds')
axes[0].set_title('Completion Time Differences\n(Green = B faster, Red = A faster)')
# Plot 2: Sign distribution
signs = ['B Faster\n(positive)', 'A Faster\n(negative)', 'Tie']
counts = [positive_count, negative_count, ties]
colors = ['green', 'red', 'gray']
axes[1].bar(signs, counts, color=colors, alpha=0.7)
axes[1].set_ylabel('Count')
axes[1].set_title(f'Sign Distribution\np-value = {result.pvalue:.4f}')
plt.tight_layout()
plt.savefig('sign_test_results.png', dpi=150)
plt.show()
# Decision
alpha = 0.05
print(f"\nConclusion at α={alpha}:")
if result.pvalue < alpha:
    print("✓ Version B significantly reduces checkout time.")
    print("  Recommendation: Deploy Version B.")
else:
    print("✗ No significant difference detected.")
    print("  Recommendation: Gather more data or consider other metrics.")
This example demonstrates a complete workflow: data preparation, hypothesis testing, confidence interval estimation, and visualization. The sign test tells us whether users consistently completed checkout faster with Version B, regardless of how much faster.
Conclusion
The sign test occupies a specific niche in your statistical toolkit. Choose it:
- Over the paired t-test: when your data isn’t normally distributed, the sample is small, or you have extreme outliers.
- Over the Wilcoxon signed-rank test: when your data is ordinal, you don’t trust magnitude information, or you want the simplest possible test.
Choose alternatives instead:
- Paired t-test: when data is approximately normal and you want maximum statistical power.
- Wilcoxon signed-rank: when data is at least interval-scaled and the magnitude of differences is meaningful.
The sign test’s power comes from its simplicity. By reducing data to signs, it becomes robust to outliers and distribution assumptions. In Python, you can implement it manually with SciPy’s binomial test for full control, or use statsmodels for convenience. Either approach gives you a reliable non-parametric tool for paired comparisons.