# Feature Flags in Production: Gradual Rollouts, A/B Testing, and Kill Switches

Every major tech company ships features to a subset of users before rolling them out to everyone. Facebook rolls out changes to 1% of traffic first. Stripe tests payment flow changes with 5% of merchants before general availability. GitHub ships dark mode to beta users months before the official launch.
The mechanism behind all of this: feature flags. A feature flag is a conditional in your code that controls whether a feature is active — evaluated at runtime, configurable without deploying new code.
In 2026, feature flags have expanded beyond simple on/off toggles into a platform for gradual rollouts, targeted experiments, operational kill switches, and progressive delivery. This guide covers how to implement them correctly, which tools to use, and the patterns that make them powerful.
## The Problem: Deploy and Pray
The traditional deploy model is binary: code ships to all users simultaneously. This creates several failure modes:
Big-bang releases: The "release day" model where a feature that took 3 months to build ships to 100% of users at once. When something breaks, you roll back the entire deployment — including unrelated changes.
Long-lived feature branches: Teams isolate features in branches to avoid shipping half-finished work. Branches diverge from main for weeks. Merging becomes painful. Integration issues surface late.
No experimentation infrastructure: Measuring whether a change actually improves user behavior requires A/B testing infrastructure most teams don't have.
Feature flags solve all three: features are deployed (to 0% of users) long before release, mainline development continues without branches, and experiments can be run with proper statistical controls.
```mermaid
graph LR
    subgraph "Traditional deploy"
        A[Code merged] --> B[All users]
        B --> C{Bug?}
        C -- Yes --> D[Rollback entire deploy]
    end
    subgraph "Feature flags"
        E[Code deployed] --> F[0% rollout]
        F --> G[1% → internal team]
        G --> H[5% → beta users]
        H --> I[25% → gradual]
        I --> J[100% → complete]
        J --> K{Bug?}
        K -- Yes --> L[Toggle flag off in 1s]
    end
    style D fill:#ef4444,color:#fff
    style L fill:#22c55e,color:#fff
```
## How It Works: Anatomy of a Feature Flag

A feature flag evaluation has three parts:

1. Flag definition: Name, type (boolean, string, number), default value, targeting rules
2. Context: Information about the current request — user ID, company, region, plan tier
3. Evaluation: Rules evaluated against context → returns variant

```python
# Simplified flag evaluation logic
def evaluate_flag(flag_name: str, context: dict) -> bool | str:
    flag = get_flag_definition(flag_name)
    # Check targeting rules in order
    for rule in flag.targeting_rules:
        if rule.matches(context):
            return rule.variant  # Return the matched variant
    # No rules matched — return default
    return flag.default_variant
```
The evaluation happens in milliseconds, in-process (the SDK has a local copy of flag definitions cached from the flag service). The flag service doesn't sit in the critical path of every request.
### Flag Types
| Type | Use Case | Example |
|------|----------|---------|
| Boolean | Feature on/off | new_checkout_flow: true/false |
| String | A/B variants | homepage_hero: "control"/"variant_a"/"variant_b" |
| Number | Gradual rollout % | ai_summary_rollout: 0.25 (25% of users) |
| JSON | Complex configuration | rate_limits: {"free": 100, "pro": 1000} |
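To make the table concrete, here is a toy, dependency-free flag store showing how each type maps to a typed lookup with a default. This is only a sketch; real SDKs expose typed getters (`get_boolean_value`, `get_string_value`, and so on) instead of a raw dict:

```python
from typing import Any

# Hypothetical in-memory flag store: one flag of each type from the table.
FLAGS: dict[str, Any] = {
    "new_checkout_flow": True,                  # Boolean
    "homepage_hero": "variant_a",               # String
    "ai_summary_rollout": 0.25,                 # Number
    "rate_limits": {"free": 100, "pro": 1000},  # JSON
}

def get_flag(name: str, default: Any) -> Any:
    """Return the flag value if present and of the expected type, else the default."""
    value = FLAGS.get(name, default)
    # A type mismatch falls back to the default, mirroring how typed
    # SDK getters protect callers from misconfigured flags.
    if not isinstance(value, type(default)):
        return default
    return value

assert get_flag("new_checkout_flow", False) is True
assert get_flag("homepage_hero", "control") == "variant_a"
assert get_flag("rate_limits", {})["pro"] == 1000
```

The default doubles as both the fallback value and the expected type, which is why every evaluation call in this post passes an explicit `default_value`.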
## Implementation: OpenFeature Standard

OpenFeature is a CNCF standard that decouples your feature flag code from the specific vendor. You write against the OpenFeature SDK; you can swap providers without changing application code.

```python
# pip install openfeature-sdk openfeature-provider-launchdarkly
from datetime import datetime

from openfeature import api
from openfeature.evaluation_context import EvaluationContext
from openfeature.provider.launchdarkly import LaunchDarklyProvider  # or any other provider

# Initialize with your provider (done once at startup)
api.set_provider(LaunchDarklyProvider(sdk_key="sdk-your-key-here"))
client = api.get_client()

# In your request handler
def checkout_handler(request):
    # Build context from request
    ctx = EvaluationContext(
        targeting_key=str(request.user.id),
        attributes={
            "plan": request.user.plan,  # "free" / "pro" / "enterprise"
            "region": request.headers.get("CF-IPCountry", "US"),
            "email": request.user.email,  # For beta cohorts
            "company_id": str(request.user.company_id),
            "user_age_days": (datetime.now() - request.user.created_at).days,
        },
    )

    # Boolean flag — is the new checkout enabled for this user?
    use_new_checkout = client.get_boolean_value(
        "new-checkout-flow",
        default_value=False,
        evaluation_context=ctx,
    )

    # String flag — which pricing experiment variant?
    pricing_variant = client.get_string_value(
        "pricing-page-experiment",
        default_value="control",
        evaluation_context=ctx,
    )

    if use_new_checkout:
        return new_checkout_view(request, pricing_variant)
    else:
        return legacy_checkout_view(request)
```
## Gradual Rollout with LaunchDarkly

LaunchDarkly is the market leader in 2026. Configuration is in their dashboard, but also exportable as JSON:

```json
{
  "key": "new-checkout-flow",
  "kind": "boolean",
  "variations": [false, true],
  "rules": [
    {
      "description": "Internal team always sees new checkout",
      "clauses": [{"attribute": "email", "op": "endsWith", "values": ["@mycompany.com"]}],
      "variation": 1
    },
    {
      "description": "Enterprise customers excluded (revenue risk)",
      "clauses": [{"attribute": "plan", "op": "in", "values": ["enterprise"]}],
      "variation": 0
    }
  ],
  "fallthrough": {
    "rollout": {
      "variations": [
        {"variation": 0, "weight": 80000},
        {"variation": 1, "weight": 20000}
      ]
    }
  },
  "offVariation": 0,
  "on": true
}
```
This flag configuration: always shows new checkout to @mycompany.com users, never shows it to enterprise, rolls it out to 20% of everyone else.
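The 80,000/20,000 weights are out of 100,000, and the 20% rollout is sticky: each user is deterministically hashed into a bucket, so the same user sees the same variant on every request. A minimal sketch of weight-based bucketing; the hash scheme here is illustrative, not LaunchDarkly's exact algorithm:

```python
import hashlib

BUCKETS = 100_000  # weights in the JSON above are out of 100,000

def bucket(flag_key: str, user_key: str) -> int:
    """Deterministically map a (flag, user) pair to a bucket in [0, 100000)."""
    digest = hashlib.sha256(f"{flag_key}.{user_key}".encode()).hexdigest()
    return int(digest[:8], 16) % BUCKETS

def in_rollout(flag_key: str, user_key: str, weight: int) -> bool:
    """True if this user falls inside the first `weight` buckets."""
    return bucket(flag_key, user_key) < weight

# Same user, same answer every time ("sticky" bucketing)...
assert in_rollout("new-checkout-flow", "user-42", 20_000) == in_rollout("new-checkout-flow", "user-42", 20_000)

# ...and across many users, roughly 20% land inside a 20,000-bucket rollout.
hits = sum(in_rollout("new-checkout-flow", f"user-{i}", 20_000) for i in range(10_000))
assert 1_700 < hits < 2_300
```

Hashing on the flag key as well as the user key means a user in the 20% for one flag isn't automatically in the 20% for every other flag.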
## Self-Hosted: Unleash

For teams that can't send user data to a SaaS vendor (GDPR, security policies), Unleash is the best open-source alternative:

```yaml
# docker-compose.yml for Unleash
version: '3'
services:
  unleash:
    image: unleashorg/unleash-server:latest
    ports:
      - "4242:4242"
    environment:
      DATABASE_URL: postgres://unleash:password@db/unleash
      UNLEASH_DEFAULT_ADMIN_USERNAME: admin
      UNLEASH_DEFAULT_ADMIN_PASSWORD: changeme
    depends_on:
      - db
  db:
    image: postgres:16
    environment:
      POSTGRES_DB: unleash
      POSTGRES_USER: unleash
      POSTGRES_PASSWORD: password
```

```python
# Python SDK for Unleash
from UnleashClient import UnleashClient

client = UnleashClient(
    url="http://localhost:4242/api",
    app_name="my-app",
    custom_headers={"Authorization": "Bearer your-api-key"},
)
client.initialize_client()

# Evaluate with context
context = {
    "userId": str(user.id),
    "properties": {
        "plan": user.plan,
        "region": user.region,
    },
}

if client.is_enabled("new-checkout-flow", context):
    return new_checkout()
else:
    return legacy_checkout()
```
## Four Patterns That Make Feature Flags Powerful

### Pattern 1: The Kill Switch

The most operationally valuable flag type. A kill switch is a boolean flag that's ON in production — until something breaks. Then you turn it OFF in 5 seconds without a deploy.

```python
# Kill switch for a new payment processor
@app.route('/api/payments', methods=['POST'])
def process_payment():
    if not client.get_boolean_value("new-payment-processor", default_value=False, evaluation_context=ctx):
        # Old processor
        return legacy_payment_processor.charge(request.json)
    try:
        return new_payment_processor.charge(request.json)
    except NewProcessorException as e:
        # Automatic fallback + alert
        alert_pagerduty(f"New payment processor failed: {e}")
        return legacy_payment_processor.charge(request.json)
```
When the new processor has issues, ops turns off the flag. No deploy. No rollback. No 3am war room. Just a toggle.
### Pattern 2: Ring Deployment

Deploy to progressively larger rings of users, validating metrics at each stage:

```mermaid
flowchart LR
    A["Ring 0\nInternal (0.1%)"] --> B["Ring 1\nBeta users (1%)"]
    B --> C["Ring 2\nFree tier (10%)"]
    C --> D["Ring 3\nPro tier (50%)"]
    D --> E["Ring 4\nAll users (100%)"]
    A -.->|"Monitor:\nerror rate\nlatency\nbusiness metrics"| A
    B -.->|"Monitor 24hrs"| B
    C -.->|"Monitor 48hrs"| C
    D -.->|"Monitor 72hrs"| D
    style A fill:#3b82f6,color:#fff
    style E fill:#22c55e,color:#fff
```

```python
# Ring deployment configuration
rings = [
    {"name": "internal", "targeting": {"email": {"endsWith": "@mycompany.com"}}, "weight": 100},
    {"name": "beta", "targeting": {"properties.beta_user": True}, "weight": 100},
    {"name": "free_10_percent", "targeting": {"plan": "free"}, "weight": 10},
    {"name": "pro_50_percent", "targeting": {"plan": "pro"}, "weight": 50},
    {"name": "all_users", "targeting": None, "weight": 100},
]

# Move to the next ring only after validating metrics
def advance_ring(flag_name: str, current_ring: int) -> bool:
    metrics = get_feature_metrics(flag_name, hours=24)
    if metrics.error_rate_increase > 0.01:  # 1% error rate increase
        alert(f"Flag {flag_name}: error rate elevated, holding at ring {current_ring}")
        return False
    if metrics.p99_latency_increase_ms > 50:  # 50ms p99 latency increase
        alert(f"Flag {flag_name}: latency elevated, holding at ring {current_ring}")
        return False
    return True  # Safe to advance
```
### Pattern 3: Experiment Flags with Statistical Significance

Feature flags become A/B testing infrastructure when you add metric tracking and significance testing:

```python
import numpy as np
import scipy.stats as stats

def evaluate_experiment(flag_key: str, metric_name: str, min_sample: int = 1000) -> dict:
    """
    Check if an experiment has reached statistical significance.
    Returns: variant recommendation and confidence level.
    """
    control_data = get_metric_data(flag_key, variant="control", metric=metric_name)
    treatment_data = get_metric_data(flag_key, variant="treatment", metric=metric_name)
    if min(len(control_data), len(treatment_data)) < min_sample:
        return {"status": "insufficient_data", "samples": len(control_data) + len(treatment_data)}

    # Two-sample t-test for continuous metrics (e.g., revenue per user, session
    # duration). For binary outcomes like conversion, a proportions z-test is the
    # standard choice, though the t-test is a reasonable large-sample approximation.
    t_stat, p_value = stats.ttest_ind(control_data, treatment_data)
    control_mean = np.mean(control_data)
    treatment_mean = np.mean(treatment_data)
    lift = (treatment_mean - control_mean) / control_mean * 100

    return {
        "status": "significant" if p_value < 0.05 else "not_significant",
        "p_value": round(p_value, 4),
        "lift_percent": round(lift, 2),
        "control_mean": round(control_mean, 4),
        "treatment_mean": round(treatment_mean, 4),
        "recommendation": (
            "ship" if (p_value < 0.05 and lift > 0)
            else "rollback" if (p_value < 0.05 and lift < 0)
            else "continue"
        ),
        "samples": len(control_data) + len(treatment_data),
    }

# Usage:
result = evaluate_experiment("checkout-redesign", "conversion_rate")
# → {"status": "significant", "lift_percent": 3.4, "p_value": 0.012, "recommendation": "ship"}
```
### Pattern 4: Operational Configuration Flags

Flags aren't just for features. Use them for runtime configuration that operations may need to adjust under load:

```python
# Rate limit configuration that ops can adjust without a deploy
rate_config = client.get_object_value(
    "api-rate-limits",
    default_value={"free": 100, "pro": 1000, "enterprise": 10000},
    evaluation_context=ctx,
)

# Under attack, ops sets: {"free": 10, "pro": 100, "enterprise": 1000}
# 10× reduction across the board, in 30 seconds, without a deploy
if request.rate_count > rate_config[user.plan]:
    return Response(status=429, headers={"Retry-After": "60"})
```
## Cost and Latency: What Flag Evaluation Actually Costs
The operational concern teams often raise: "Won't feature flag evaluation add latency?" The answer, when implemented correctly: no.
Modern flag SDKs use a streaming architecture. On startup, the SDK downloads all flag definitions and stores them in memory. Flag evaluation happens entirely in-process — no network call, no database lookup. The SDK subscribes to a server-sent event stream and updates its local cache when flags change.
Evaluation time: sub-millisecond. Typically 50-200 microseconds, including context evaluation and rule matching.
The only performance concern is the initial SDK initialization (100-500ms to download and cache all flags). Don't evaluate flags before initialization completes — use the async initialization pattern with defaults.
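One way to apply that pattern, sketched with a hypothetical `FlagClient` stub (the stub stands in for a real SDK, so the names and timings are illustrative): bound how long startup waits for initialization, and serve defaults until it completes.

```python
import asyncio

class FlagClient:
    """Hypothetical flag client stub standing in for a real SDK."""

    def __init__(self) -> None:
        self.ready = False
        self._flags: dict[str, bool] = {}

    async def initialize(self) -> None:
        await asyncio.sleep(0.05)  # simulate downloading flag definitions
        self._flags = {"new-feature": True}
        self.ready = True

    def get_boolean_value(self, key: str, default: bool) -> bool:
        # Safety net: before initialization completes, always return the default.
        if not self.ready:
            return default
        return self._flags.get(key, default)

async def startup(client: FlagClient, timeout_s: float = 2.0) -> None:
    try:
        # Bound the wait; on timeout the app still boots and every
        # evaluation falls back to its default value.
        await asyncio.wait_for(client.initialize(), timeout=timeout_s)
    except asyncio.TimeoutError:
        pass

client = FlagClient()
assert client.get_boolean_value("new-feature", default=False) is False  # pre-init: default
asyncio.run(startup(client))
assert client.get_boolean_value("new-feature", default=False) is True   # post-init: real value
```

The important property is that evaluation never blocks and never raises during the initialization window; it just degrades to defaults.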
```python
# Benchmark flag evaluation latency
import statistics
import time

latencies = []
for _ in range(10000):
    start = time.perf_counter()
    client.get_boolean_value("new-feature", default_value=False, evaluation_context=ctx)
    latencies.append((time.perf_counter() - start) * 1000)

print(f"p50: {statistics.median(latencies):.3f}ms")                # → 0.041ms
print(f"p99: {statistics.quantiles(latencies, n=100)[98]:.3f}ms")  # → 0.128ms
```
The LaunchDarkly and Unleash SDKs both benchmark at under 0.2ms p99 for flag evaluation. For 99.9% of applications, feature flag evaluation won't register in your performance budget.
## Managing Technical Debt: Flag Lifecycle
The danger of feature flags is accumulating hundreds of stale flags in your codebase. Each flag is a branch in your logic — too many, and the code becomes impossible to reason about.
```python
# Flag with built-in expiry tracking (flag_lifecycle is a hypothetical decorator)
@flag_lifecycle(
    flag_key="new-checkout-flow",
    expected_ship_date="2026-06-01",
    owner="team-checkout",
    jira_ticket="ENG-4521",
)
def checkout_handler(request):
    if client.get_boolean_value("new-checkout-flow", default_value=False, evaluation_context=ctx):
        ...
```
Enforce flag retirement:
1. Set a ticket at flag creation: Create the cleanup ticket before the flag goes live
2. Alert on old flags: Monitor for flags > 90 days old that haven't been cleaned up
3. Regular flag reviews: Quarterly audit of all flags — is each one still needed?
```sql
-- Query to find stale flags (LaunchDarkly stores flag metadata)
SELECT flag_key, created_date, last_modified, owner
FROM feature_flags
WHERE last_modified < NOW() - INTERVAL '90 days'
  AND is_permanent = false
ORDER BY last_modified ASC;
```
## Server-Side vs Client-Side Flags
Feature flags can be evaluated in two places:
Server-side flags: Evaluated in your backend. The client never sees the flag state — it only receives the feature or doesn't. No flag state exposed in client-side JavaScript. Good for: security-sensitive features, anything involving backend logic, pricing experiments.
Client-side flags: Evaluated in the browser or mobile app. The SDK downloads flag definitions and evaluates them locally. Enables UI personalization without a server round trip. Risk: flag rules are visible in client-side JavaScript — don't use for features you want to hide from users who inspect network traffic.
```tsx
// Client-side SDK (LaunchDarkly Browser SDK)
import { LDClient, initialize } from 'launchdarkly-js-client-sdk';

const user = {
  kind: 'user',
  key: currentUser.id,
  plan: currentUser.plan,
  email: currentUser.email,
};

const client: LDClient = initialize('client-side-sdk-key', user);
await client.waitForInitialization();

// Evaluate a flag — happens locally, no server call
const showNewNav = client.variation('new-navigation', false);
if (showNewNav) {
  renderNewNavigation();
}

// React hook pattern
import { useLDClient, useFlags } from 'launchdarkly-react-client-sdk';

function Navigation() {
  const { 'new-navigation': showNewNav } = useFlags();
  return showNewNav ? <NewNavigation /> : <LegacyNavigation />;
}
```
For the backend equivalent:
```python
# Server-side: flag evaluated in Python, result passed to template
def homepage_view(request):
    ctx = EvaluationContext(
        targeting_key=str(request.user.id),
        attributes={"plan": request.user.plan},
    )
    show_new_nav = client.get_boolean_value("new-navigation", default_value=False, evaluation_context=ctx)
    return render(request, "homepage.html", {
        "show_new_nav": show_new_nav,
        # Flag state is in template context, not exposed as JS to the client
    })
```
The hybrid pattern: Use server-side flags for feature gating; use client-side for UI personalization where the latency of a server round-trip would be noticeable (navigation, layout).
## Flag Targeting: Beyond Percentage Rollouts
Percentage rollouts are the most common targeting strategy, but several others are more appropriate in specific situations:
```python
# Targeting strategies and when to use them

# 1. User segment: specific users by attribute
#    Use for: beta cohorts, VIP customers, internal team
flag_config = {
    "rules": [
        {"clauses": [{"attribute": "email", "op": "endsWith", "values": ["@mycompany.com"]}], "variation": 1},
        {"clauses": [{"attribute": "beta_opt_in", "op": "in", "values": [True]}], "variation": 1},
    ]
}

# 2. Sticky bucketing: same user always gets the same variant
#    Default behavior in most SDKs — user hash is consistent
#    Critical for A/B tests: users shouldn't switch groups mid-experiment

# 3. Time-based: flag automatically turns off after a date
#    Use for: temporary maintenance banners, holiday promotions
from datetime import datetime

def time_gated_flag(flag_name: str, end_date: datetime) -> bool:
    if datetime.now() > end_date:
        return False
    return client.get_boolean_value(flag_name, default_value=False, evaluation_context=ctx)

# 4. Dependency: flag only active if another flag is active
#    Use for: progressive feature builds
def dependent_flag(parent_flag: str, child_flag: str) -> bool:
    if not client.get_boolean_value(parent_flag, default_value=False, evaluation_context=ctx):
        return False
    return client.get_boolean_value(child_flag, default_value=False, evaluation_context=ctx)

# 5. Context-based: target by request properties
#    Use for: region-specific features, mobile vs web
ctx_with_request = EvaluationContext(
    targeting_key=str(user.id),
    attributes={
        "region": request.headers.get("CF-IPCountry", "US"),
        "platform": request.headers.get("X-Platform", "web"),  # "ios", "android", "web"
        "app_version": request.headers.get("X-App-Version", "0.0.0"),
    },
)
```
## Production Considerations

### SDK Initialization and Fallbacks

The flag SDK must not block application startup or add request latency. Initialize asynchronously, provide defaults, cache aggressively:

```python
# Async initialization — don't block startup
async def startup():
    await flag_client.initialize()  # Fetches flags from service, caches locally

# During a startup failure, default_value is your safety net.
# Never let flag evaluation throw exceptions into your business logic:
try:
    enabled = client.get_boolean_value("new-feature", default_value=False, evaluation_context=ctx)
except Exception as e:
    log.error(f"Flag evaluation failed: {e}")
    enabled = False  # Fail safe
```
### SDKs Are Local — Flag Changes Are Near-Instant
Modern flag SDKs stream flag updates via server-sent events. Changes in the LaunchDarkly or Unleash dashboard propagate to all running SDK instances in 1-2 seconds. You don't need a deploy, a restart, or an API call — the change propagates automatically.
This is what makes kill switches operationally effective: you flip the flag, and within seconds, traffic shifts.
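That streaming loop can be sketched as a background listener applying updates to a shared in-memory cache. Here a `queue.Queue` stands in for the SSE stream, so the transport is simulated; the point is the shape of the cache and listener:

```python
import queue
import threading

class FlagCache:
    """In-memory flag store updated by a background 'stream' listener."""

    def __init__(self) -> None:
        self._flags: dict[str, bool] = {}
        self._lock = threading.Lock()

    def set(self, key: str, value: bool) -> None:
        with self._lock:
            self._flags[key] = value

    def get(self, key: str, default: bool = False) -> bool:
        # Reads never leave the process: this is the sub-millisecond path.
        with self._lock:
            return self._flags.get(key, default)

def listen(stream: "queue.Queue[tuple[str, bool] | None]", cache: FlagCache) -> None:
    """Apply (flag, value) events from the stream until a None sentinel."""
    while (event := stream.get()) is not None:
        cache.set(*event)

stream: "queue.Queue[tuple[str, bool] | None]" = queue.Queue()
cache = FlagCache()
t = threading.Thread(target=listen, args=(stream, cache), daemon=True)
t.start()

stream.put(("new-payment-processor", True))   # ops toggles the flag on
stream.put(("new-payment-processor", False))  # ...and kills it seconds later
stream.put(None)
t.join(timeout=2)
assert cache.get("new-payment-processor") is False
```

Request handlers only ever touch the cache; the listener is the single writer, which is why a dashboard toggle reaches every instance without restarting anything.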
## Feature Flags in CI/CD: Testing Against Real Flag States
Testing with feature flags requires understanding which variant your tests should run against. Two strategies:
Test both variants: Run your integration test suite against each variant. This ensures neither path regresses during a rollout.
```python
# pytest parameterization over flag variants
from unittest.mock import patch

import pytest

@pytest.fixture(params=["control", "treatment"])
def checkout_variant(request):
    """Run each test against both checkout variants."""
    with patch.object(flag_client, 'get_string_value', return_value=request.param):
        yield request.param

def test_checkout_completes(client, checkout_variant):
    """Both variants should complete checkout successfully."""
    response = client.post('/api/checkout', json={"items": [{"id": 1, "qty": 1}]})
    assert response.status_code == 200
    assert response.json()["order_id"] is not None
    # Test passes regardless of which variant — both paths must work
```
Flag overrides in staging: Set specific users to specific variants in staging environments for deterministic testing.
```yaml
# LaunchDarkly: staging environment flag overrides
# Individual user targeting (by ID) overrides all rules
user_targets:
  - variation: 1  # Treatment variant
    values:
      - user_id_of_test_account_1
      - user_id_of_qa_bot
      - user_id_of_automated_test_user
```
This ensures your QA environment always sees the new variant, while developers can test the control variant by using their personal accounts.
### The Flag-Deployment Dependency

One subtle CI/CD consideration: the flag must exist in the flag service before the code that references it is deployed. If you deploy code that calls `client.get_boolean_value("new-feature", ...)` before creating the flag in LaunchDarkly/Unleash, the SDK returns the default value. That's fine if your default value is the safe path (`default_value=False` for disabled features).
The workflow:
1. Create flag in flag service (off by default, 0% rollout)
2. Deploy code (all users get default value = old behavior)
3. Enable flag for internal team (0.1% → validate)
4. Gradual rollout (1% → 10% → 50% → 100%)
5. Clean up flag (remove from code + delete from flag service)
Never deploy code that requires a flag to be already enabled on deploy. Always code for the default-value path to be safe.
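A cheap guard for step 1 is a CI check that fails when code references a flag key the service doesn't know about. A sketch, assuming you can export the service's flag keys somehow; the regex matches the `get_*_value("...")` call style used throughout this post:

```python
import re

FLAG_CALL = re.compile(r'get_\w+_value\(\s*["\']([\w-]+)["\']')

def referenced_flags(source: str) -> set[str]:
    """Flag keys referenced via client.get_*_value("key", ...) calls."""
    return set(FLAG_CALL.findall(source))

def missing_flags(source: str, service_flags: set[str]) -> set[str]:
    """Keys the code references that don't exist in the flag service yet."""
    return referenced_flags(source) - service_flags

code = '''
enabled = client.get_boolean_value("new-checkout-flow", default_value=False)
variant = client.get_string_value("pricing-page-experiment", default_value="control")
'''
# Fail CI when a referenced flag hasn't been created in the service yet.
assert missing_flags(code, {"new-checkout-flow"}) == {"pricing-page-experiment"}
```

Run it over the repo in CI and exit non-zero if the set is non-empty; the same scan, inverted, also surfaces flags that exist in the service but are no longer referenced anywhere (cleanup candidates).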
## Conclusion
Feature flags are one of the highest-leverage tools in modern software delivery. They separate deployment from release, enable safe experimentation, and give operations teams the ability to respond to incidents in seconds rather than minutes.
The key practices:
- OpenFeature standard decouples your code from vendor lock-in
- Ring deployments validate changes incrementally before full rollout
- Kill switches are the most operationally valuable flag type — add them proactively
- Statistical significance testing turns experiments into data, not opinions
- Flag lifecycle management prevents the codebase from becoming an unmaintainable branch forest
Start simple: add a kill switch to your next risky feature. Measure the reduction in rollback frequency and incident duration. The ROI makes itself obvious quickly.
The broader shift feature flags enable is cultural: deployment becomes routine rather than an event. When you can deploy code to production with zero users seeing it, and gradually roll it out with metrics validation at each step, the fear of shipping disappears. Teams ship more often, in smaller increments, with more confidence. That's the real value of the pattern — not just the kill switch, but the deployment culture it enables.
## Sources
- OpenFeature CNCF project documentation
- LaunchDarkly engineering blog: Feature flag best practices
- Unleash self-hosted documentation
- Experimentation at scale: Airbnb, Stripe, Facebook engineering blogs
- Continuous Delivery (Jez Humble, David Farley) — chapter on dark launching