Feature Toggles: Gradual Feature Rollout

Key Insights

Feature toggles enable gradual rollouts by decoupling deployment from release, letting you ship code to production without exposing it to all users simultaneously.
Consistent hashing with user identifiers creates deterministic, sticky assignments that ensure users see the same feature state across sessions and services.
Every toggle incurs technical debt—implement sunset policies and automated detection from day one to prevent your codebase from becoming a graveyard of dead branches.

The Case for Gradual Rollouts

Big-bang releases are a gamble. You write code for weeks, merge it all at once, and hope nothing breaks. When something does break—and it will—you’re debugging under pressure while your entire user base experiences the problem.

Gradual rollouts flip this model. Instead of exposing new features to 100% of users immediately, you start with 1%, monitor for issues, then expand to 5%, 25%, and eventually everyone. If something goes wrong at 5%, you’ve impacted a fraction of your users and can roll back instantly.

Feature toggles (also called feature flags) are the mechanism that makes this possible. They come in several flavors:

Release toggles: Control feature visibility during rollout, removed once the feature reaches 100% of users
Ops toggles: Kill switches for operational concerns, often long-lived
Experiment toggles: A/B testing variations, removed after experiment concludes
Permission toggles: Entitlement-based access, often permanent

This article focuses on release toggles for gradual rollouts, but the underlying infrastructure supports all types.

Anatomy of a Feature Toggle System

A feature toggle system has three core components: configuration storage, evaluation context, and the evaluation engine itself.

interface ToggleConfig {
  name: string;
  enabled: boolean;
  rolloutPercentage: number;
  targetingRules: TargetingRule[];
  createdAt: Date;
  expectedRemovalDate: Date; // Lifecycle metadata: when this toggle should be gone
  owner: string;
  description: string;
}

interface EvaluationContext {
  userId: string;
  userEmail?: string;
  userGroups?: string[];
  region?: string;
  deviceType?: string;
  customAttributes?: Record<string, string>;
}

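class FeatureToggleClient {
  private config: Map<string, ToggleConfig>;

  // ConfigSource is assumed to push batches of changed ToggleConfig entries to subscribers
  constructor(private configSource: ConfigSource) {
    this.config = new Map();
    this.configSource.subscribe((updates) => this.updateConfig(updates));
  }

  // Apply pushed updates so later evaluations see the latest configuration
  private updateConfig(updates: ToggleConfig[]): void {
    for (const toggle of updates) {
      this.config.set(toggle.name, toggle);
    }
  }

  isEnabled(toggleName: string, context: EvaluationContext): boolean {
    const toggle = this.config.get(toggleName);

    if (!toggle) {
      return false; // Default to off for unknown toggles
    }

    if (!toggle.enabled) {
      return false;
    }

    // Targeting rules take precedence (evaluateRules is defined in the targeting section)
    if (toggle.targetingRules.length > 0) {
      const ruleResult = evaluateRules(toggle.targetingRules, context);
      if (ruleResult !== null) {
        return ruleResult;
      }
    }

    // Fall back to percentage rollout (isInRolloutPercentage is defined in the next section)
    return isInRolloutPercentage(toggleName, context.userId, toggle.rolloutPercentage);
  }
}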

The lifecycle matters as much as the implementation. Every toggle should have an owner, a creation date, and an expected removal date. Without this metadata, toggles accumulate like sediment, and six months later nobody knows if new_checkout_flow_v2 is safe to remove.
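
The client above depends on a ConfigSource that the article never defines. As a rough sketch of one possible shape, here is an in-memory source that pushes updates to subscribers. The names ToggleUpdate, UpdateListener, InMemoryConfigSource, and publish are assumptions made for illustration, and ToggleUpdate is a trimmed stand-in for ToggleConfig.

```typescript
// Hypothetical, trimmed stand-in for ToggleConfig.
interface ToggleUpdate {
  name: string;
  enabled: boolean;
  rolloutPercentage: number;
}

type UpdateListener = (updates: ToggleUpdate[]) => void;

// Assumed contract: subscribe returns a function that cancels the subscription.
interface ConfigSource {
  subscribe(listener: UpdateListener): () => void;
}

class InMemoryConfigSource implements ConfigSource {
  private listeners = new Set<UpdateListener>();
  private current = new Map<string, ToggleUpdate>();

  subscribe(listener: UpdateListener): () => void {
    this.listeners.add(listener);
    // Replay known state so late subscribers start from a consistent view.
    if (this.current.size > 0) {
      listener(Array.from(this.current.values()));
    }
    return () => {
      this.listeners.delete(listener);
    };
  }

  // Store the new values, then notify every active subscriber.
  publish(updates: ToggleUpdate[]): void {
    updates.forEach((update) => this.current.set(update.name, update));
    this.listeners.forEach((listener) => listener(updates));
  }
}
```

A production source would more likely poll or stream from a config service, but the push-on-change contract would look much the same.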

Implementing Percentage-Based Rollouts

The naive approach to percentage rollouts—generating a random number and checking if it’s under the threshold—creates a terrible user experience. Users see the feature on one page load, then it disappears on the next. They report bugs that your team can’t reproduce.

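You need sticky assignment. The same user should always see the same toggle state for a given rollout percentage. Hashing a stable user identifier into a fixed bucket solves this (the technique is often loosely called consistent hashing, although it is simpler than the ring-based algorithm of that name):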

import { createHash } from 'crypto';

function isInRolloutPercentage(
  toggleName: string,
  userId: string,
  percentage: number
): boolean {
  // Guard the extremes so out-of-range values behave predictably
  if (percentage <= 0) return false;
  if (percentage >= 100) return true;

  // Combine toggle name and user ID for unique bucketing per toggle
  const hashInput = `${toggleName}:${userId}`;
  const hash = createHash('sha256').update(hashInput).digest('hex');

  // Take first 8 hex characters (32 bits) and convert to number
  const hashValue = parseInt(hash.substring(0, 8), 16);

  // Normalize to [0, 100): divide by 2^32 so the top hash value cannot land on exactly 100
  const bucket = (hashValue / 0x100000000) * 100;

  return bucket < percentage;
}

This approach has critical properties:

Deterministic: Same inputs always produce the same output
Uniform distribution: Users spread evenly across buckets
Toggle isolation: Including the toggle name means different toggles get different user distributions
Cross-service consistency: Any service with the same algorithm and inputs produces the same result

When you increase the rollout from 10% to 20%, the original 10% of users stay in—you’re only adding new users, never removing existing ones. This prevents the jarring experience of features disappearing.
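
The "only adding, never removing" property follows from the comparison bucket < percentage: a user's bucket never changes, so raising the threshold can only admit more users. The sketch below checks this on synthetic user IDs. The helper names rolloutBucket and usersInRollout are made up for this example, and it divides by 2^32 so that every bucket is strictly below 100.

```typescript
import { createHash } from 'crypto';

// Same bucketing idea as above, but returning the bucket so it can be inspected.
function rolloutBucket(toggleName: string, userId: string): number {
  const hash = createHash('sha256').update(`${toggleName}:${userId}`).digest('hex');
  const hashValue = parseInt(hash.substring(0, 8), 16);
  return (hashValue / 0x100000000) * 100; // always in [0, 100)
}

function usersInRollout(toggleName: string, userIds: string[], percentage: number): string[] {
  return userIds.filter((id) => rolloutBucket(toggleName, id) < percentage);
}

// Synthetic user IDs, purely for illustration.
const sampleUserIds = Array.from({ length: 10000 }, (_, i) => `user-${i}`);

const enabledAt10 = usersInRollout('new_checkout_flow_v2', sampleUserIds, 10);
const enabledAt20 = new Set(usersInRollout('new_checkout_flow_v2', sampleUserIds, 20));

// Every user enabled at 10% must still be enabled at 20%.
const everyoneRetained = enabledAt10.every((id) => enabledAt20.has(id));

console.log(
  `10%: ${enabledAt10.length} users, 20%: ${enabledAt20.size} users, retained: ${everyoneRetained}`
);
```

The cohort sizes will sit near 10% and 20% of the sample rather than matching them exactly, since hashing gives an approximately uniform spread, not a perfect one.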

Targeting Strategies Beyond Percentages

Percentage rollouts are blunt instruments. Often you want more precision: internal users first, then beta testers, then specific regions, then everyone else.

A rule engine handles this complexity:

interface TargetingRule {
  conditions: Condition[];
  operator: 'AND' | 'OR';
  result: boolean;
}

interface Condition {
  attribute: string;
  operator: 'equals' | 'contains' | 'in' | 'matches';
  value: string | string[];
}

function evaluateRules(
  rules: TargetingRule[],
  context: EvaluationContext
): boolean | null {
  for (const rule of rules) {
    const conditionResults = rule.conditions.map((condition) =>
      evaluateCondition(condition, context)
    );
    
    const ruleMatches =
      rule.operator === 'AND'
        ? conditionResults.every(Boolean)
        : conditionResults.some(Boolean);
    
    if (ruleMatches) {
      return rule.result;
    }
  }
  
  return null; // No rules matched, fall through to percentage
}

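function evaluateCondition(
  condition: Condition,
  context: EvaluationContext
): boolean {
  // getContextValue is assumed to return a string, a string[] (e.g. userGroups), or undefined
  const contextValue = getContextValue(context, condition.attribute);

  if (contextValue === undefined) {
    return false;
  }

  // Treat every value as a list so multi-valued attributes such as userGroups work
  const values = Array.isArray(contextValue)
    ? contextValue.map(String)
    : [String(contextValue)];

  switch (condition.operator) {
    case 'equals':
      return values.some((v) => v === condition.value);
    case 'contains':
      return values.some((v) => v.includes(String(condition.value)));
    case 'in': {
      const allowed = condition.value;
      return Array.isArray(allowed) && values.some((v) => allowed.includes(v));
    }
    case 'matches':
      try {
        const pattern = new RegExp(String(condition.value));
        return values.some((v) => pattern.test(v));
      } catch {
        return false; // An invalid pattern in config should not crash evaluation
      }
    default:
      return false;
  }
}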
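
The getContextValue helper used above is not shown in the article. One plausible version, sketched here, resolves built-in attributes by name and looks up custom attributes through a "custom." prefix. That prefix convention and the ContextValue type are assumptions for illustration. The EvaluationContext interface is repeated so the snippet stands on its own.

```typescript
interface EvaluationContext {
  userId: string;
  userEmail?: string;
  userGroups?: string[];
  region?: string;
  deviceType?: string;
  customAttributes?: Record<string, string>;
}

type ContextValue = string | string[] | undefined;

// Hypothetical helper: maps an attribute name from a rule to a value in the context.
function getContextValue(context: EvaluationContext, attribute: string): ContextValue {
  const customPrefix = 'custom.'; // assumed naming convention for custom attributes
  if (attribute.startsWith(customPrefix)) {
    return context.customAttributes?.[attribute.slice(customPrefix.length)];
  }

  switch (attribute) {
    case 'userId':
      return context.userId;
    case 'userEmail':
      return context.userEmail;
    case 'userGroups':
      return context.userGroups;
    case 'region':
      return context.region;
    case 'deviceType':
      return context.deviceType;
    default:
      return undefined; // Unknown attributes never match a condition
  }
}
```

Listing the attributes explicitly, rather than indexing the context object dynamically, keeps a typo in a rule from silently reading an unrelated property.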

A typical rollout strategy layers these rules:

Enable for users with email ending in @yourcompany.com (internal dogfooding)
Enable for users in the beta-testers group
Enable for users in the us-west-2 region at 10%
Enable for all other users at 0%

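Rules evaluate in order, and the first match wins. This gives you precise control over rollout sequencing. Note that a TargetingRule as defined above can only force a toggle on or off, so a step such as "us-west-2 at 10%" needs a rule that also carries its own percentage.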
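
To make the layered strategy concrete, here is a minimal sketch in which each layer carries its own rollout percentage. This is a hypothetical extension rather than the TargetingRule shape shown earlier: RolloutLayer, bucketFor, and isEnabledForLayers are invented names, and the matchers are plain functions to keep the example short.

```typescript
import { createHash } from 'crypto';

interface UserContext {
  userId: string;
  userEmail?: string;
  userGroups?: string[];
  region?: string;
}

// Hypothetical layer: a matcher plus the percentage of matching users to enable.
interface RolloutLayer {
  description: string;
  matches: (context: UserContext) => boolean;
  percentage: number; // 0 to 100
}

// Deterministic bucket in [0, 100), derived from toggle name and user ID.
function bucketFor(toggleName: string, userId: string): number {
  const hash = createHash('sha256').update(`${toggleName}:${userId}`).digest('hex');
  return (parseInt(hash.substring(0, 8), 16) / 0x100000000) * 100;
}

const checkoutLayers: RolloutLayer[] = [
  {
    description: 'internal dogfooding',
    matches: (c) => c.userEmail?.endsWith('@yourcompany.com') ?? false,
    percentage: 100,
  },
  {
    description: 'beta testers',
    matches: (c) => c.userGroups?.includes('beta-testers') ?? false,
    percentage: 100,
  },
  { description: 'us-west-2 at 10%', matches: (c) => c.region === 'us-west-2', percentage: 10 },
  { description: 'everyone else', matches: () => true, percentage: 0 },
];

function isEnabledForLayers(
  toggleName: string,
  layers: RolloutLayer[],
  context: UserContext
): boolean {
  // First matching layer wins, mirroring the ordered evaluation described above.
  const layer = layers.find((candidate) => candidate.matches(context));
  if (!layer) {
    return false;
  }
  return bucketFor(toggleName, context.userId) < layer.percentage;
}
```

Expanding the rollout then means editing percentages in this list, for example raising the us-west-2 layer from 10 to 50, without touching application code.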

Monitoring and Rollback Mechanisms

A gradual rollout without monitoring is just a slow release. You need to know if the feature is causing problems before expanding exposure.

class MonitoredToggleClient {
  constructor(
    private toggleClient: FeatureToggleClient,
    private metrics: MetricsClient,
    private circuitBreaker: CircuitBreaker
  ) {}
  
  isEnabled(toggleName: string, context: EvaluationContext): boolean {
    // Check kill switch first
    if (this.circuitBreaker.isOpen(toggleName)) {
      this.metrics.increment(`toggle.${toggleName}.circuit_breaker_open`);
      return false;
    }
    
    const startTime = Date.now();
    const result = this.toggleClient.isEnabled(toggleName, context);
    const duration = Date.now() - startTime;
    
    // Emit metrics for every evaluation
    this.metrics.timing(`toggle.${toggleName}.evaluation_time`, duration);
    this.metrics.increment(`toggle.${toggleName}.evaluated`, {
      result: String(result),
    });
    
    return result;
  }
  
  async executeWithToggle<T>(
    toggleName: string,
    context: EvaluationContext,
    enabledPath: () => Promise<T>,
    disabledPath: () => Promise<T>
  ): Promise<T> {
    const enabled = this.isEnabled(toggleName, context);
    const path = enabled ? 'enabled' : 'disabled';
    
    try {
      const startTime = Date.now();
      const result = await (enabled ? enabledPath() : disabledPath());
      
      this.metrics.timing(`toggle.${toggleName}.${path}.duration`, Date.now() - startTime);
      this.metrics.increment(`toggle.${toggleName}.${path}.success`);
      
      return result;
    } catch (error) {
      this.metrics.increment(`toggle.${toggleName}.${path}.error`);
      
      // Trip circuit breaker on elevated error rates
      if (enabled) {
        this.circuitBreaker.recordFailure(toggleName);
      }
      
      throw error;
    }
  }
}

Track these metrics during rollout:

Error rates: Compare enabled vs. disabled cohorts
Latency: P50, P95, P99 for both paths
Business KPIs: Conversion rates, engagement metrics
Resource utilization: CPU, memory, database load
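
Comparing cohorts comes down to simple arithmetic on counters. The sketch below computes the error-rate gap between the enabled and disabled cohorts and uses it as a gate before expanding a rollout. CohortStats, isSafeToExpand, and the default thresholds are illustrative assumptions, not recommended values.

```typescript
interface CohortStats {
  requests: number;
  errors: number;
}

function errorRate(stats: CohortStats): number {
  return stats.requests === 0 ? 0 : stats.errors / stats.requests;
}

// Error-rate gap (enabled minus disabled), expressed in percentage points.
function errorRateDeltaPoints(enabled: CohortStats, disabled: CohortStats): number {
  return (errorRate(enabled) - errorRate(disabled)) * 100;
}

// Assumed policy: require a minimum sample, then cap the allowed gap.
function isSafeToExpand(
  enabled: CohortStats,
  disabled: CohortStats,
  maxDeltaPoints = 1,
  minRequests = 1000
): boolean {
  if (enabled.requests < minRequests) {
    return false; // Not enough traffic on the new path to judge it yet
  }
  return errorRateDeltaPoints(enabled, disabled) <= maxDeltaPoints;
}
```

The minimum-sample guard matters most at the 1% stage, where a handful of requests can make the enabled cohort look either perfect or disastrous.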

Set up automated alerts that trip the circuit breaker when error rates exceed thresholds. For example, if the enabled cohort's error rate climbs 5 percentage points above the disabled cohort's, the feature should be disabled automatically, faster than any human can respond.
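
The CircuitBreaker that MonitoredToggleClient relies on is left undefined above. As a minimal sketch, the class below exposes the same isOpen and recordFailure methods and trips once a toggle records too many failures inside a time window. It counts failures rather than computing a true error rate, which is a deliberate simplification. The class name, defaults, injectable clock, and reset method are all assumptions for illustration.

```typescript
class FailureWindowCircuitBreaker {
  private failureTimes = new Map<string, number[]>(); // toggle name -> recent failure timestamps
  private open = new Set<string>();

  constructor(
    private readonly maxFailures = 5,
    private readonly windowMs = 60 * 1000,
    private readonly now: () => number = () => Date.now() // injectable clock for tests
  ) {}

  recordFailure(toggleName: string): void {
    const currentTime = this.now();
    // Keep only failures that are still inside the window, then add this one.
    const recent = (this.failureTimes.get(toggleName) ?? []).filter(
      (time) => currentTime - time < this.windowMs
    );
    recent.push(currentTime);
    this.failureTimes.set(toggleName, recent);

    if (recent.length >= this.maxFailures) {
      this.open.add(toggleName);
    }
  }

  isOpen(toggleName: string): boolean {
    return this.open.has(toggleName);
  }

  // Stays open until someone explicitly resets it after investigating.
  reset(toggleName: string): void {
    this.open.delete(toggleName);
    this.failureTimes.delete(toggleName);
  }
}
```

Requiring a manual reset is a design choice in this sketch: a breaker that closes itself after a cooldown would re-expose users to a feature nobody has looked at yet.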

Managing Toggle Lifecycle and Technical Debt

Feature toggles are technical debt by design. You’re intentionally adding branching logic that should eventually be removed. The problem is “eventually” often means “never.”

Enforce hygiene with automation:

// toggle-lint.ts - Run in CI pipeline
interface ToggleMetadata {
  name: string;
  createdAt: Date;
  expectedRemovalDate: Date;
  owner: string;
  jiraTicket: string;
}

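const MAX_TOGGLE_AGE_DAYS = 90;

// Shape of a single finding reported by the linter
interface LintResult {
  toggle: string;
  severity: 'error' | 'warning';
  message: string;
}

// Synchronous: nothing here awaits, so there is no need to return a Promise
function lintToggles(toggles: ToggleMetadata[]): LintResult[] {
  const results: LintResult[] = [];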
  const now = new Date();
  
  for (const toggle of toggles) {
    const ageInDays = Math.floor(
      (now.getTime() - toggle.createdAt.getTime()) / (1000 * 60 * 60 * 24)
    );
    
    if (ageInDays > MAX_TOGGLE_AGE_DAYS) {
      results.push({
        toggle: toggle.name,
        severity: 'error',
        message: `Toggle is ${ageInDays} days old (max: ${MAX_TOGGLE_AGE_DAYS}). ` +
                 `Owner: ${toggle.owner}, Ticket: ${toggle.jiraTicket}`,
      });
    }
    
    if (now > toggle.expectedRemovalDate) {
      results.push({
        toggle: toggle.name,
        severity: 'warning',
        message: `Toggle is past its expected removal date. ` +
                 `Review and either remove or extend with justification.`,
      });
    }
  }
  
  return results;
}

Naming conventions also matter. Include the type and expected lifespan in the name:

release_new_checkout_2024q1 — Release toggle, expected removal Q1 2024
ops_disable_external_payments — Ops toggle, long-lived kill switch
exp_blue_button_conversion — Experiment toggle, remove after analysis
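
A naming convention is only useful if something enforces it. The sketch below parses names that follow the pattern in the examples above. The exact grammar (a prefix, a snake_case slug, and an optional year-and-quarter suffix) and the rule that release toggles must carry a quarter are assumptions inferred from those three examples.

```typescript
type ToggleKind = 'release' | 'ops' | 'experiment';

interface ParsedToggleName {
  kind: ToggleKind;
  slug: string;
  removalQuarter?: string; // e.g. "2024q1"
}

const PREFIX_TO_KIND: Record<string, ToggleKind> = {
  release: 'release',
  ops: 'ops',
  exp: 'experiment',
};

// Assumed grammar: <prefix>_<snake_case_slug>[_<yyyy>q<1-4>]
const TOGGLE_NAME_PATTERN =
  /^(release|ops|exp)_([a-z0-9]+(?:_[a-z0-9]+)*?)(?:_(\d{4}q[1-4]))?$/;

function parseToggleName(name: string): ParsedToggleName | null {
  const match = TOGGLE_NAME_PATTERN.exec(name);
  if (!match) {
    return null;
  }

  const kind = PREFIX_TO_KIND[match[1]];
  const removalQuarter: string | undefined = match[3];

  // Assumed policy: release toggles must say when they are expected to go away.
  if (kind === 'release' && removalQuarter === undefined) {
    return null;
  }

  return { kind, slug: match[2], removalQuarter };
}
```

A check like this can run alongside the age linter in CI, rejecting a new toggle whose name does not parse before it ever reaches production.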

When removing toggles, don’t just delete the configuration. Search your codebase for all references, remove the branching logic, and keep only the “enabled” path. This is tedious but essential—orphaned toggle checks are confusing and sometimes cause subtle bugs.
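
Finding every reference is the step most easily automated. As a rough sketch, the function below scans source text for whole-token occurrences of a toggle name and reports file and line. It takes file contents as a plain object so that reading from disk stays with the caller. The names findToggleReferences and ToggleReference are invented for this example.

```typescript
interface ToggleReference {
  file: string;
  line: number; // 1-based
  text: string;
}

// Escape characters that have special meaning inside a regular expression.
function escapeRegExp(value: string): string {
  return value.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// sources maps a file path to that file's contents.
function findToggleReferences(
  sources: Record<string, string>,
  toggleName: string
): ToggleReference[] {
  // Match the name only as a whole token, so "new_checkout" does not match "new_checkout_v2".
  const pattern = new RegExp(
    `(^|[^A-Za-z0-9_])${escapeRegExp(toggleName)}($|[^A-Za-z0-9_])`
  );
  const references: ToggleReference[] = [];

  Object.keys(sources).forEach((file) => {
    sources[file].split('\n').forEach((text, index) => {
      if (pattern.test(text)) {
        references.push({ file, line: index + 1, text: text.trim() });
      }
    });
  });

  return references;
}
```

A text search like this will miss toggle names that are assembled at runtime, which is one more reason to pass them around as string literals or shared constants.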

Best Practices Checklist

Before implementing feature toggles for gradual rollout:

Use consistent hashing for percentage rollouts—never random assignment
Layer targeting rules for controlled exposure: internal → beta → regional → global
Instrument everything with metrics comparing enabled and disabled cohorts
Implement circuit breakers that automatically disable problematic features
Require metadata including owner, creation date, and expected removal date
Automate cleanup detection with linters that fail builds for stale toggles
Document the removal process so engineers know how to fully clean up

Feature toggles aren’t the only deployment strategy. Consider alternatives:

Blue-green deployments: When you need instant rollback of infrastructure changes
Canary releases: When the feature can’t be easily toggled at runtime
Branch by abstraction: For long-running refactors where toggle complexity would be excessive

Use feature toggles when you need user-level granularity, when the feature can be toggled without deployment, and when you want to decouple release timing from deployment timing. For everything else, simpler strategies often suffice.