Experiment Design for Data Scientists

Good experiment design prevents the most common analytics mistakes: confounding, p-hacking, and underpowered tests.

Key Insights

  • Calculate sample size before running the experiment, not after
  • Randomization at the right unit (user vs session vs page view) determines validity
  • Pre-registration of hypotheses and metrics prevents p-hacking

Sample Size Calculation

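import math

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# nobs1 is left unset, so solve_power solves for the size of group 1
sample_size = analysis.solve_power(
    effect_size=0.05,  # minimum detectable effect, standardized (Cohen's d)
    alpha=0.05,        # significance level
    power=0.80,        # desired power
    ratio=1.0          # equal group sizes
)
print(f"Required users per group: {math.ceil(sample_size)}")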
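
The effect size in a power calculation is a standardized quantity, so a raw lift such as "10% to 11% conversion" has to be converted before it can be used. The sketch below shows one way to do that with only the standard library. It assumes a conversion-rate metric, uses Cohen's h (the arcsine transform) with a normal approximation rather than the t-test solver, and the baseline and target rates are hypothetical.

```python
import math
from statistics import NormalDist


def cohens_h(p_baseline, p_treatment):
    """Standardized effect size for two proportions (arcsine transform)."""
    return 2 * math.asin(math.sqrt(p_treatment)) - 2 * math.asin(math.sqrt(p_baseline))


def users_per_group(p_baseline, p_treatment, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided test, normal approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    h = cohens_h(p_baseline, p_treatment)
    return math.ceil(2 * ((z_alpha + z_power) / h) ** 2)


# Hypothetical numbers: 10% baseline conversion, and 11% is the
# smallest improvement worth detecting.
n = users_per_group(0.10, 0.11)
print(f"Cohen's h: {cohens_h(0.10, 0.11):.4f}, users per group: {n}")
```

Halving the detectable lift roughly quadruples the required sample, which is why the minimum detectable effect is worth agreeing on before the experiment starts.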

Randomization Units

Choose the unit that prevents contamination:

  • User-level: Best for most product experiments
  • Session-level: Only if experiences don’t persist
  • Cluster-level: When users interact (marketplaces, social networks)
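
Whichever unit is chosen, assignment should be stable: the same unit must always land in the same variant. One common way to get that is to hash the unit ID together with the experiment name. The following is a minimal sketch under that assumption, not a full assignment service; the function name, IDs, and experiment name are all hypothetical.

```python
import hashlib


def assign_variant(unit_id, experiment, variants=("control", "treatment")):
    """Deterministically map a randomization unit to a variant."""
    # Including the experiment name keeps assignments independent
    # across experiments that share the same units.
    key = f"{experiment}:{unit_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % len(variants)
    return variants[bucket]


# User-level: the same user always sees the same variant.
print(assign_variant("user-123", "checkout-redesign"))

# Cluster-level: pass the cluster ID (for example a city or seller group)
# so that users who interact with each other share a variant.
print(assign_variant("city-berlin", "checkout-redesign"))
```

Switching the randomization unit is then just a matter of which ID is passed in: a user ID, a session ID, or a cluster ID.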

Guardrails

Set guardrail metrics and their thresholds before launch. If a key business metric degrades beyond its threshold, the experiment stops automatically, regardless of how the primary metric is performing.
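
A guardrail check can be as simple as comparing each metric's relative change against a limit agreed in advance. The sketch below assumes that higher is better for every guardrail metric and compares point estimates only; a real check would likely also account for noise, for example with a confidence interval. The metric names, values, and thresholds are hypothetical.

```python
def breached_guardrails(control, treatment, max_relative_drop):
    """Return the guardrail metrics whose relative drop exceeds its limit."""
    breaches = {}
    for metric, limit in max_relative_drop.items():
        change = (treatment[metric] - control[metric]) / control[metric]
        if change < -limit:
            breaches[metric] = change
    return breaches


# Hypothetical metrics and thresholds, fixed before launch.
thresholds = {"revenue_per_user": 0.02, "checkout_rate": 0.01}
control = {"revenue_per_user": 4.00, "checkout_rate": 0.200}
treatment = {"revenue_per_user": 3.80, "checkout_rate": 0.199}

breaches = breached_guardrails(control, treatment, thresholds)
if breaches:
    # The stop decision ignores the primary metric entirely.
    print("Stop experiment:", breaches)
```

Here revenue per user falls by about 5%, which exceeds its 2% limit, so the experiment would stop even if the primary metric looked good.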
