Feature Toggles: Gradual Feature Rollout
Key Insights
- Feature toggles enable gradual rollouts by decoupling deployment from release, letting you ship code to production without exposing it to all users simultaneously.
- Consistent hashing with user identifiers creates deterministic, sticky assignments that ensure users see the same feature state across sessions and services.
- Every toggle incurs technical debt—implement sunset policies and automated detection from day one to prevent your codebase from becoming a graveyard of dead branches.
The Case for Gradual Rollouts
Big-bang releases are a gamble. You write code for weeks, merge it all at once, and hope nothing breaks. When something does break—and it will—you’re debugging under pressure while your entire user base experiences the problem.
Gradual rollouts flip this model. Instead of exposing new features to 100% of users immediately, you start with 1%, monitor for issues, then expand to 5%, 25%, and eventually everyone. If something goes wrong at 5%, you’ve impacted a fraction of your users and can roll back instantly.
Feature toggles (also called feature flags) are the mechanism that makes this possible. They come in several flavors:
- Release toggles: Control feature visibility during rollout, removed after full deployment
- Ops toggles: Kill switches for operational concerns, often long-lived
- Experiment toggles: A/B testing variations, removed after experiment concludes
- Permission toggles: Entitlement-based access, often permanent
This article focuses on release toggles for gradual rollouts, but the underlying infrastructure supports all types.
Anatomy of a Feature Toggle System
A feature toggle system has three core components: configuration storage, evaluation context, and the evaluation engine itself.
interface ToggleConfig {
  name: string;
  enabled: boolean;
  rolloutPercentage: number; // 0-100
  targetingRules: TargetingRule[];
  createdAt: Date;
  expectedRemovalDate: Date; // lifecycle metadata, used to detect stale toggles
  owner: string;
  description: string;
}
interface EvaluationContext {
  userId: string;
  userEmail?: string;
  userGroups?: string[];
  region?: string;
  deviceType?: string;
  customAttributes?: Record<string, string>;
}
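// Assumed shape of the config source: it pushes changed toggles to subscribers
interface ConfigSource {
  subscribe(listener: (updates: ToggleConfig[]) => void): void;
}

class FeatureToggleClient {
  private config: Map<string, ToggleConfig>;

  constructor(private configSource: ConfigSource) {
    this.config = new Map();
    this.configSource.subscribe((updates) => this.updateConfig(updates));
  }

  isEnabled(toggleName: string, context: EvaluationContext): boolean {
    const toggle = this.config.get(toggleName);
    if (!toggle) {
      return false; // Default to off for unknown toggles
    }
    if (!toggle.enabled) {
      return false;
    }
    // Check targeting rules first (evaluateRules is defined later in this article)
    if (toggle.targetingRules.length > 0) {
      const ruleResult = evaluateRules(toggle.targetingRules, context);
      if (ruleResult !== null) {
        return ruleResult;
      }
    }
    // Fall back to percentage rollout (isInRolloutPercentage is defined in the next section)
    return isInRolloutPercentage(toggleName, context.userId, toggle.rolloutPercentage);
  }

  // Add or replace each updated toggle, keyed by name
  private updateConfig(updates: ToggleConfig[]): void {
    for (const toggle of updates) {
      this.config.set(toggle.name, toggle);
    }
  }
}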
The lifecycle matters as much as the implementation. Every toggle should have an owner, a creation date, and an expected removal date. Without this metadata, toggles accumulate like sediment, and six months later nobody knows if new_checkout_flow_v2 is safe to remove.
Implementing Percentage-Based Rollouts
The naive approach to percentage rollouts—generating a random number and checking if it’s under the threshold—creates a terrible user experience. Users see the feature on one page load, then it disappears on the next. They report bugs that your team can’t reproduce.
You need sticky assignment. The same user should always see the same toggle state for a given rollout percentage. Consistent hashing solves this:
import { createHash } from 'crypto';

function isInRolloutPercentage(
  toggleName: string,
  userId: string,
  percentage: number
): boolean {
  if (percentage <= 0) return false;
  if (percentage >= 100) return true;
  // Combine toggle name and user ID for unique bucketing per toggle
  const hashInput = `${toggleName}:${userId}`;
  const hash = createHash('sha256').update(hashInput).digest('hex');
  // Take first 8 hex characters (32 bits) and convert to number
  const hashValue = parseInt(hash.substring(0, 8), 16);
  // Normalize to [0, 100): dividing by 2^32 keeps the bucket strictly below 100
  const bucket = (hashValue / 0x100000000) * 100;
  return bucket < percentage;
}
This approach has critical properties:
- Deterministic: Same inputs always produce the same output
- Uniform distribution: Users spread evenly across buckets
- Toggle isolation: Including the toggle name means different toggles get different user distributions
- Cross-service consistency: Any service with the same algorithm and inputs produces the same result
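When you increase the rollout from 10% to 20%, the original 10% of users stay in. Each user's bucket is fixed, so raising the threshold only admits users whose buckets fall between 10 and 20. You add new users and never remove existing ones, which prevents the jarring experience of features disappearing.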
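The stickiness property can be checked empirically. The following is a minimal sketch, assuming the same SHA-256 bucketing described above; the helper names `bucketOf` and `enabledUsers` and the synthetic `user-N` identifiers are made up for illustration.

```typescript
import { createHash } from 'crypto';

// Stable bucket in [0, 100) for a (toggle, user) pair, using the same
// "first 32 bits of SHA-256" idea as the rollout function above.
function bucketOf(toggleName: string, userId: string): number {
  const hash = createHash('sha256').update(`${toggleName}:${userId}`).digest('hex');
  return (parseInt(hash.substring(0, 8), 16) / 0x100000000) * 100;
}

// Set of users who would see the feature at a given rollout percentage.
function enabledUsers(
  toggleName: string,
  userIds: string[],
  percentage: number
): Set<string> {
  return new Set(userIds.filter((id) => bucketOf(toggleName, id) < percentage));
}

// Synthetic population, for illustration only.
const userIds = Array.from({ length: 10000 }, (_, i) => `user-${i}`);
const at10 = enabledUsers('new_checkout_flow_v2', userIds, 10);
const at20 = enabledUsers('new_checkout_flow_v2', userIds, 20);

// Every user enabled at 10% must still be enabled at 20%.
const everyoneKept = Array.from(at10).every((id) => at20.has(id));
console.log(at10.size, at20.size, everyoneKept);
```

The two set sizes should land near 10% and 20% of the population, and `everyoneKept` is always `true` because the comparison threshold only moves upward.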
Targeting Strategies Beyond Percentages
Percentage rollouts are blunt instruments. Often you want more precision: internal users first, then beta testers, then specific regions, then everyone else.
A rule engine handles this complexity:
interface TargetingRule {
  conditions: Condition[];
  operator: 'AND' | 'OR';
  result: boolean;
}

interface Condition {
  attribute: string;
  operator: 'equals' | 'contains' | 'in' | 'matches';
  value: string | string[];
}

function evaluateRules(
  rules: TargetingRule[],
  context: EvaluationContext
): boolean | null {
  for (const rule of rules) {
    const conditionResults = rule.conditions.map((condition) =>
      evaluateCondition(condition, context)
    );
    const ruleMatches =
      rule.operator === 'AND'
        ? conditionResults.every(Boolean)
        : conditionResults.some(Boolean);
    if (ruleMatches) {
      return rule.result;
    }
  }
  return null; // No rules matched, fall through to percentage
}
function evaluateCondition(
  condition: Condition,
  context: EvaluationContext
): boolean {
  // getContextValue looks up the named attribute on the context (helper defined elsewhere)
  const contextValue = getContextValue(context, condition.attribute);
  if (contextValue === undefined) {
    return false;
  }
  switch (condition.operator) {
    case 'equals':
      return contextValue === condition.value;
    case 'contains':
      return String(contextValue).includes(String(condition.value));
    case 'in':
      return Array.isArray(condition.value) &&
        condition.value.includes(String(contextValue));
    case 'matches':
      try {
        return new RegExp(String(condition.value)).test(String(contextValue));
      } catch {
        return false; // An invalid pattern in config should not crash evaluation
      }
    default:
      return false;
  }
}
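The condition evaluator relies on a `getContextValue` helper that the article does not define. One possible sketch follows, assuming that well-known attributes map to the named fields of `EvaluationContext` and anything else is looked up in `customAttributes`; that lookup order is an assumption, not something the article specifies.

```typescript
// Same context shape as defined earlier in the article.
interface EvaluationContext {
  userId: string;
  userEmail?: string;
  userGroups?: string[];
  region?: string;
  deviceType?: string;
  customAttributes?: Record<string, string>;
}

// Hypothetical helper: resolve an attribute name to a context value.
// An explicit switch avoids reading arbitrary (or inherited) properties.
function getContextValue(
  context: EvaluationContext,
  attribute: string
): string | string[] | undefined {
  switch (attribute) {
    case 'userId':
      return context.userId;
    case 'userEmail':
      return context.userEmail;
    case 'userGroups':
      return context.userGroups;
    case 'region':
      return context.region;
    case 'deviceType':
      return context.deviceType;
    default: {
      const custom = context.customAttributes;
      // Only return keys the caller actually set, never prototype members.
      if (custom && Object.prototype.hasOwnProperty.call(custom, attribute)) {
        return custom[attribute];
      }
      return undefined;
    }
  }
}
```

Returning `undefined` for unknown attributes matches the evaluator's behavior of treating a missing value as a non-match.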
A typical rollout strategy layers these rules:
- Enable for users with email ending in @yourcompany.com (internal dogfooding)
- Enable for users in the beta-testers group
- Enable for users in the us-west-2 region at 10%
- Enable for all other users at 0%
Rules evaluate in order, and the first match wins. This gives you precise control over rollout sequencing.
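Note that the `TargetingRule` shape shown earlier returns only a boolean, so a step like "us-west-2 at 10%" needs a rule that carries its own percentage. The sketch below shows one way this could look. The `LayeredRule` type, the `isEnabledLayered` function, and the predicate-based matching are all hypothetical extensions, not part of the rule engine above.

```typescript
import { createHash } from 'crypto';

interface User {
  userId: string;
  userEmail?: string;
  userGroups?: string[];
  region?: string;
}

// Hypothetical rule shape: the first matching rule decides the rollout percentage.
interface LayeredRule {
  description: string;
  matches: (user: User) => boolean;
  percentage: number; // 0-100
}

const layeredRules: LayeredRule[] = [
  {
    description: 'internal dogfooding',
    matches: (u) => u.userEmail?.endsWith('@yourcompany.com') ?? false,
    percentage: 100,
  },
  {
    description: 'beta testers',
    matches: (u) => u.userGroups?.includes('beta-testers') ?? false,
    percentage: 100,
  },
  { description: 'us-west-2 at 10%', matches: (u) => u.region === 'us-west-2', percentage: 10 },
  { description: 'everyone else', matches: () => true, percentage: 0 },
];

// Stable bucket in [0, 100), same hashing idea as the percentage rollout.
function bucketFor(toggleName: string, userId: string): number {
  const hash = createHash('sha256').update(`${toggleName}:${userId}`).digest('hex');
  return (parseInt(hash.substring(0, 8), 16) / 0x100000000) * 100;
}

function isEnabledLayered(toggleName: string, user: User, rules: LayeredRule[]): boolean {
  const rule = rules.find((r) => r.matches(user)); // first match wins
  if (!rule) return false;
  return bucketFor(toggleName, user.userId) < rule.percentage;
}
```

Because the bucket is always below 100 and never below 0, a rule at 100 enables every matching user and a rule at 0 enables none, so the same comparison covers all four steps.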
Monitoring and Rollback Mechanisms
A gradual rollout without monitoring is just a slow release. You need to know if the feature is causing problems before expanding exposure.
class MonitoredToggleClient {
  constructor(
    private toggleClient: FeatureToggleClient,
    private metrics: MetricsClient,
    private circuitBreaker: CircuitBreaker
  ) {}

  isEnabled(toggleName: string, context: EvaluationContext): boolean {
    // Check kill switch first
    if (this.circuitBreaker.isOpen(toggleName)) {
      this.metrics.increment(`toggle.${toggleName}.circuit_breaker_open`);
      return false;
    }
    const startTime = Date.now();
    const result = this.toggleClient.isEnabled(toggleName, context);
    const duration = Date.now() - startTime;
    // Emit metrics for every evaluation
    this.metrics.timing(`toggle.${toggleName}.evaluation_time`, duration);
    this.metrics.increment(`toggle.${toggleName}.evaluated`, {
      result: String(result),
    });
    return result;
  }

  async executeWithToggle<T>(
    toggleName: string,
    context: EvaluationContext,
    enabledPath: () => Promise<T>,
    disabledPath: () => Promise<T>
  ): Promise<T> {
    const enabled = this.isEnabled(toggleName, context);
    const path = enabled ? 'enabled' : 'disabled';
    try {
      const startTime = Date.now();
      const result = await (enabled ? enabledPath() : disabledPath());
      this.metrics.timing(`toggle.${toggleName}.${path}.duration`, Date.now() - startTime);
      this.metrics.increment(`toggle.${toggleName}.${path}.success`);
      return result;
    } catch (error) {
      this.metrics.increment(`toggle.${toggleName}.${path}.error`);
      // Trip circuit breaker on elevated error rates
      if (enabled) {
        this.circuitBreaker.recordFailure(toggleName);
      }
      throw error;
    }
  }
}
Track these metrics during rollout:
- Error rates: Compare enabled vs. disabled cohorts
- Latency: P50, P95, P99 for both paths
- Business KPIs: Conversion rates, engagement metrics
- Resource utilization: CPU, memory, database load
Set up automated alerts that trip the circuit breaker when error rates exceed thresholds. A 5% increase in the enabled cohort's error rate over the disabled cohort's baseline should disable the feature automatically, faster than any human can respond.
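The article uses a `CircuitBreaker` with `isOpen` and `recordFailure` but does not show its internals. Here is a minimal sketch of one possible implementation, assuming a sliding time window and an absolute error-rate threshold. The `recordSuccess` and `reset` methods, the option names, and the injectable clock are assumptions added for illustration. A production version would more likely compare against the disabled cohort's baseline.

```typescript
// Hypothetical options; the values used in practice depend on your traffic.
interface BreakerOptions {
  windowMs: number;     // how far back to look
  minSamples: number;   // avoid tripping on a handful of requests
  maxErrorRate: number; // e.g. 0.05 for 5%
}

class ErrorRateCircuitBreaker {
  private events = new Map<string, { at: number; failed: boolean }[]>();
  private open = new Set<string>();

  constructor(
    private options: BreakerOptions,
    private now: () => number = Date.now // injectable clock for tests
  ) {}

  isOpen(toggleName: string): boolean {
    return this.open.has(toggleName);
  }

  recordSuccess(toggleName: string): void {
    this.record(toggleName, false);
  }

  recordFailure(toggleName: string): void {
    this.record(toggleName, true);
  }

  // Stays open until a human (or automation) explicitly resets it.
  reset(toggleName: string): void {
    this.open.delete(toggleName);
    this.events.delete(toggleName);
  }

  private record(toggleName: string, failed: boolean): void {
    const timestamp = this.now();
    const cutoff = timestamp - this.options.windowMs;
    // Drop events that have aged out of the window, then add the new one.
    const recent = (this.events.get(toggleName) ?? []).filter((e) => e.at >= cutoff);
    recent.push({ at: timestamp, failed });
    this.events.set(toggleName, recent);

    if (recent.length < this.options.minSamples) return;
    const failures = recent.filter((e) => e.failed).length;
    if (failures / recent.length > this.options.maxErrorRate) {
      this.open.add(toggleName);
    }
  }
}
```

For the error rate to mean anything, the caller has to report successes as well as failures, so a wrapper such as `executeWithToggle` would need to call `recordSuccess` on the enabled path's success branch.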
Managing Toggle Lifecycle and Technical Debt
Feature toggles are technical debt by design. You’re intentionally adding branching logic that should eventually be removed. The problem is “eventually” often means “never.”
Enforce hygiene with automation:
// toggle-lint.ts - Run in CI pipeline
interface ToggleMetadata {
  name: string;
  createdAt: Date;
  expectedRemovalDate: Date;
  owner: string;
  jiraTicket: string;
}

// Shape of a single finding, inferred from the objects pushed below
interface LintResult {
  toggle: string;
  severity: 'error' | 'warning';
  message: string;
}

const MAX_TOGGLE_AGE_DAYS = 90;

async function lintToggles(toggles: ToggleMetadata[]): Promise<LintResult[]> {
  const results: LintResult[] = [];
  const now = new Date();
  for (const toggle of toggles) {
    const ageInDays = Math.floor(
      (now.getTime() - toggle.createdAt.getTime()) / (1000 * 60 * 60 * 24)
    );
    // Hard limit: toggles older than the maximum age fail the build
    if (ageInDays > MAX_TOGGLE_AGE_DAYS) {
      results.push({
        toggle: toggle.name,
        severity: 'error',
        message: `Toggle is ${ageInDays} days old (max: ${MAX_TOGGLE_AGE_DAYS}). ` +
          `Owner: ${toggle.owner}, Ticket: ${toggle.jiraTicket}`,
      });
    }
    // Soft limit: past the owner's own removal estimate
    if (now > toggle.expectedRemovalDate) {
      results.push({
        toggle: toggle.name,
        severity: 'warning',
        message: `Toggle is past its expected removal date. ` +
          `Review and either remove or extend with justification.`,
      });
    }
  }
  return results;
}
Naming conventions also matter. Include the type and expected lifespan in the name:
- release_new_checkout_2024q1: Release toggle, expected removal Q1 2024
- ops_disable_external_payments: Ops toggle, long-lived kill switch
- exp_blue_button_conversion: Experiment toggle, remove after analysis
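A naming convention is easier to keep when a tool checks it. The following sketch parses names in the style of the examples above. The exact grammar (lowercase snake case, an optional `_YYYYqN` suffix, and a mandatory quarter for release toggles) is an assumption extrapolated from those three examples, and `parseToggleName` is a hypothetical helper.

```typescript
type ToggleKind = 'release' | 'ops' | 'exp';

interface ParsedToggleName {
  kind: ToggleKind;
  slug: string;
  removalQuarter?: { year: number; quarter: number };
}

// <kind>_<slug>[_<year>q<quarter>], all lowercase snake case.
// The slug is matched lazily so a trailing "_2024q1" is captured as the quarter.
const NAME_PATTERN =
  /^(release|ops|exp)_([a-z0-9]+(?:_[a-z0-9]+)*?)(?:_(\d{4})q([1-4]))?$/;

function parseToggleName(name: string): ParsedToggleName | null {
  const match = NAME_PATTERN.exec(name);
  if (!match) return null;

  const kind = match[1] as ToggleKind;
  const slug = match[2];
  const removalQuarter = match[3]
    ? { year: Number(match[3]), quarter: Number(match[4]) }
    : undefined;

  // Assumed policy: short-lived release toggles must state a removal quarter.
  if (kind === 'release' && !removalQuarter) return null;

  return { kind, slug, removalQuarter };
}
```

A check like this could run next to the age linter, rejecting names that do not parse before the toggle is ever created.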
When removing toggles, don’t just delete the configuration. Search your codebase for all references, remove the branching logic, and keep only the “enabled” path. This is tedious but essential—orphaned toggle checks are confusing and sometimes cause subtle bugs.
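Finding every reference is the step most easily automated. Below is a small sketch of a reference scanner, assuming source files have already been loaded into memory as a map from path to content. The function name `findToggleReferences` and the input shape are hypothetical, and a real tool would read files from disk or use the language's AST.

```typescript
interface ToggleReference {
  file: string;
  line: number; // 1-based
  text: string;
}

// Escape characters that have special meaning in a regular expression.
function escapeRegExp(value: string): string {
  return value.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

function findToggleReferences(
  files: Record<string, string>,
  toggleName: string
): ToggleReference[] {
  // Word boundaries keep "new_checkout" from matching "new_checkout_v2".
  const pattern = new RegExp(`\\b${escapeRegExp(toggleName)}\\b`);
  const references: ToggleReference[] = [];

  for (const [file, content] of Object.entries(files)) {
    content.split('\n').forEach((text, index) => {
      if (pattern.test(text)) {
        references.push({ file, line: index + 1, text: text.trim() });
      }
    });
  }
  return references;
}
```

An empty result after the cleanup change is a reasonable precondition for deleting the toggle's configuration.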
Best Practices Checklist
Before implementing feature toggles for gradual rollout:
- Use consistent hashing for percentage rollouts—never random assignment
- Layer targeting rules for controlled exposure: internal → beta → regional → global
- Instrument everything with metrics comparing enabled and disabled cohorts
- Implement circuit breakers that automatically disable problematic features
- Require metadata including owner, creation date, and expected removal date
- Automate cleanup detection with linters that fail builds for stale toggles
- Document the removal process so engineers know how to fully clean up
Feature toggles aren’t the only deployment strategy. Consider alternatives:
- Blue-green deployments: When you need instant rollback of infrastructure changes
- Canary releases: When the feature can’t be easily toggled at runtime
- Branch by abstraction: For long-running refactors where toggle complexity would be excessive
Use feature toggles when you need user-level granularity, when the feature can be toggled without deployment, and when you want to decouple release timing from deployment timing. For everything else, simpler strategies often suffice.