Experiment Design for Data Scientists
Good experiment design prevents the most common analytics mistakes: confounding, p-hacking, and underpowered tests.
Key Insights
- Calculate sample size before running the experiment, not after
- Randomizing at the right unit (user vs session vs page view) determines whether the comparison between variants is valid
- Pre-registration of hypotheses and metrics prevents p-hacking
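To see why pre-registration matters, the sketch below simulates A/A tests. Both groups come from the same distribution, so every significant result is a false positive. It compares two habits: testing one primary metric fixed in advance, and reporting whichever of 20 metrics happens to cross p < 0.05. The function name and parameters are illustrative. The metrics are simulated as independent. Real product metrics are usually correlated, which reduces the inflation somewhat but does not remove it.

```python
import numpy as np
from scipy import stats


def simulate_false_positive_rates(n_experiments=1000, n_metrics=20,
                                  n_per_group=200, alpha=0.05, seed=0):
    """Simulate A/A tests (no true effect) and return two false-positive rates."""
    rng = np.random.default_rng(seed)
    primary_hits = 0
    any_metric_hits = 0
    for _ in range(n_experiments):
        # Both groups share one distribution, so any "significant" result is noise.
        control = rng.normal(size=(n_per_group, n_metrics))
        treatment = rng.normal(size=(n_per_group, n_metrics))
        # Welch's t-test per metric; axis=0 compares the groups column by column.
        p_values = stats.ttest_ind(control, treatment, axis=0, equal_var=False).pvalue
        # Pre-registered: only the single primary metric (column 0) counts.
        primary_hits += int(p_values[0] < alpha)
        # P-hacked: declare a win if any of the metrics crosses the threshold.
        any_metric_hits += int((p_values < alpha).any())
    return primary_hits / n_experiments, any_metric_hits / n_experiments


primary_rate, any_metric_rate = simulate_false_positive_rates()
print(f"pre-registered primary metric: {primary_rate:.3f}")
print(f"best of 20 metrics:            {any_metric_rate:.3f}")
```

With 20 independent metrics, the chance that at least one crosses alpha = 0.05 by luck is 1 - 0.95^20, about 0.64. The second rate should therefore land far above the nominal 5%, while the first stays near it.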
Sample Size Calculation
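```python
import math

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# solve_power returns nobs1, the required sample size for the FIRST group
sample_size = analysis.solve_power(
    effect_size=0.05,  # minimum detectable effect as Cohen's d (mean difference / SD)
    alpha=0.05,        # significance level (two-sided test by default)
    power=0.80,        # desired power
    ratio=1.0,         # nobs2 / nobs1; 1.0 means equal group sizes
)
print(math.ceil(sample_size))  # round up: you cannot recruit a fraction of a user
```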
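`TTestIndPower.solve_power` expects `effect_size` as a standardized effect (Cohen's d), not a raw difference in the metric. As a cross-check, the sketch below computes the per-group sample size with the standard normal-approximation formula n = 2(z_{1-alpha/2} + z_{1-beta})^2 / d^2. It converts a raw minimum detectable effect into d by dividing by the metric's standard deviation. The function name and example values (SD of 20, MDE of 1 unit) are hypothetical. In practice the SD would be estimated from historical data for the metric.

```python
import math

from scipy.stats import norm


def sample_size_per_group(mde: float, sd: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Normal-approximation sample size per group for a two-sided, two-sample test of means."""
    d = mde / sd                       # Cohen's d: raw effect in standard-deviation units
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for a two-sided test
    z_power = norm.ppf(power)          # quantile corresponding to the desired power
    n = 2 * (z_alpha + z_power) ** 2 / d ** 2
    return math.ceil(n)                # round up to whole units


# Hypothetical metric: per-user SD of 20, smallest lift worth detecting is 1 unit (d = 0.05)
print(sample_size_per_group(mde=1.0, sd=20.0))
```

The t-based `solve_power` result comes out marginally larger because t quantiles have heavier tails than normal ones. At sample sizes in the thousands, the difference is negligible. The formula also shows why small effects are expensive: halving d quadruples the required sample.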
Randomization Units
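Choose the unit that prevents contamination between variants:
- User-level: Best for most product experiments
- Session-level: Only if the experience doesn’t persist across sessions (otherwise the same user may see both variants)
- Cluster-level: When users interact (marketplaces, social networks), so one user’s treatment can spill over into another user’s outcome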
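A common way to randomize at a chosen unit is deterministic hashing. You hash the experiment ID together with the unit's ID and map the hash to a variant, so the same unit always lands in the same group without storing assignments. The sketch below assumes SHA-256 bucketing and an even two-variant split. `assign_variant` and all IDs are hypothetical. Changing the randomization unit only changes which ID you pass in: a user ID, a session ID, or a cluster ID such as a marketplace region.

```python
import hashlib


def assign_variant(unit_id: str, experiment_id: str,
                   variants: tuple = ("control", "treatment")) -> str:
    """Deterministically map a randomization unit to a variant (hypothetical helper)."""
    # Salting with the experiment ID keeps assignments independent across experiments.
    key = f"{experiment_id}:{unit_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % len(variants)
    return variants[bucket]


# User-level: every session and page view of user-42 sees the same variant.
print(assign_variant("user-42", "checkout-redesign"))

# Cluster-level: randomize on the cluster key, so all users in a market share a variant.
user_market = {"u1": "market-berlin", "u2": "market-berlin", "u3": "market-paris"}
for user, market in user_market.items():
    print(user, assign_variant(market, "checkout-redesign"))
```

Hashing makes assignment reproducible from logs alone. Salting by experiment prevents the same users from always landing in treatment across every experiment.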
Guardrails
Define guardrail metrics and their thresholds before launch. If a key business metric degrades beyond its threshold, the experiment stops automatically, regardless of how the primary metric looks.
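This is a minimal sketch of a guardrail check declared before launch. It assumes each guardrail is a metric name plus a maximum tolerated relative degradation. The `Guardrail` class, the `breached_guardrails` function, and the metric names are hypothetical. For brevity it compares point estimates. A production check would usually ask whether a confidence interval for the change crosses the threshold, so that noise alone does not stop the experiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    metric: str
    max_degradation: float          # tolerated relative worsening, e.g. 0.02 = 2%
    higher_is_better: bool = True


def breached_guardrails(guardrails, control, treatment):
    """Return names of guardrail metrics whose treatment value degrades past the threshold."""
    breached = []
    for g in guardrails:
        baseline = control[g.metric]
        relative_change = (treatment[g.metric] - baseline) / baseline
        # Flip the sign for metrics where an increase is bad (e.g. latency).
        degradation = -relative_change if g.higher_is_better else relative_change
        if degradation > g.max_degradation:
            breached.append(g.metric)
    return breached


# Declared before launch (hypothetical metrics and thresholds)
GUARDRAILS = (
    Guardrail("revenue_per_user", max_degradation=0.02),
    Guardrail("p95_latency_ms", max_degradation=0.05, higher_is_better=False),
)

control = {"revenue_per_user": 10.0, "p95_latency_ms": 200.0}
treatment = {"revenue_per_user": 9.7, "p95_latency_ms": 205.0}

breached = breached_guardrails(GUARDRAILS, control, treatment)
if breached:
    print("Stop experiment; breached guardrails:", breached)
```

Here revenue per user drops 3%, past its 2% limit. Latency rises 2.5%, within its 5% limit. Only the revenue guardrail triggers a stop. Freezing the dataclass reflects the advice to fix thresholds up front, because guardrail definitions cannot be reassigned while the experiment runs.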