# Feature Flags in Production: Gradual Rollouts, A/B Testing, and Kill Switches

Every major tech company ships features to a subset of users before rolling them out to everyone. Facebook rolls out changes to 1% of traffic first. Stripe tests payment flow changes with 5% of merchants before general availability. GitHub ships dark mode to beta users months before the official launch.
The mechanism behind all of this: feature flags. A feature flag is a conditional in your code that controls whether a feature is active — evaluated at runtime, configurable without deploying new code.
In 2026, feature flags have expanded beyond simple on/off toggles into a platform for gradual rollouts, targeted experiments, operational kill switches, and progressive delivery. This guide covers how to implement them correctly, which tools to use, and the patterns that make them powerful.
## The Problem: Deploy and Pray
The traditional deploy model is binary: code ships to all users simultaneously. This creates several failure modes:
Big-bang releases: The "release day" model where a feature that took 3 months to build ships to 100% of users at once. When something breaks, you roll back the entire deployment — including unrelated changes.
Long-lived feature branches: Teams isolate features in branches to avoid shipping half-finished work. Branches diverge from main for weeks. Merging becomes painful. Integration issues surface late.
No experimentation infrastructure: Measuring whether a change actually improves user behavior requires A/B testing infrastructure most teams don't have.
Feature flags solve all three: features are deployed (to 0% of users) long before release, mainline development continues without branches, and experiments can be run with proper statistical controls.
```mermaid
graph LR
    subgraph "Traditional deploy"
        A[Code merged] --> B[All users]
        B --> C{Bug?}
        C -- Yes --> D[Rollback entire deploy]
    end
    subgraph "Feature flags"
        E[Code deployed] --> F[0% rollout]
        F --> G[1% → internal team]
        G --> H[5% → beta users]
        H --> I[25% → gradual]
        I --> J[100% → complete]
        J --> K{Bug?}
        K -- Yes --> L[Toggle flag off in 1s]
    end
    style D fill:#ef4444,color:#fff
    style L fill:#22c55e,color:#fff
```
## How It Works: Anatomy of a Feature Flag

A feature flag evaluation has three parts:

1. Flag definition: Name, type (boolean, string, number), default value, targeting rules
2. Context: Information about the current request — user ID, company, region, plan tier
3. Evaluation: Rules evaluated against context → returns variant

```python
# Simplified flag evaluation logic
def evaluate_flag(flag_name: str, context: dict) -> bool | str:
    flag = get_flag_definition(flag_name)
    # Check targeting rules in order
    for rule in flag.targeting_rules:
        if rule.matches(context):
            return rule.variant  # Return the matched variant
    # No rules matched — return default
    return flag.default_variant
```
The evaluation happens in milliseconds, in-process (the SDK has a local copy of flag definitions cached from the flag service). The flag service doesn't sit in the critical path of every request.
### Flag Types
| Type | Use Case | Example |
|------|----------|---------|
| Boolean | Feature on/off | new_checkout_flow: true/false |
| String | A/B variants | homepage_hero: "control"/"variant_a"/"variant_b" |
| Number | Gradual rollout % | ai_summary_rollout: 0.25 (25% of users) |
| JSON | Complex configuration | rate_limits: {"free": 100, "pro": 1000} |
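To make the table concrete, here is a toy, dependency-free flag store showing how each type maps to a typed lookup with a default. This is only a sketch; real SDKs expose typed getters (`get_boolean_value`, `get_string_value`, and so on) instead of a raw dict:

```python
from typing import Any

# Hypothetical in-memory flag store: one flag of each type from the table.
FLAGS: dict[str, Any] = {
    "new_checkout_flow": True,                  # Boolean
    "homepage_hero": "variant_a",               # String
    "ai_summary_rollout": 0.25,                 # Number
    "rate_limits": {"free": 100, "pro": 1000},  # JSON
}

def get_flag(name: str, default: Any) -> Any:
    """Return the flag value if present and of the expected type, else the default."""
    value = FLAGS.get(name, default)
    # A type mismatch falls back to the default, mirroring how typed
    # SDK getters protect callers from misconfigured flags.
    if not isinstance(value, type(default)):
        return default
    return value

assert get_flag("new_checkout_flow", False) is True
assert get_flag("homepage_hero", "control") == "variant_a"
assert get_flag("rate_limits", {})["pro"] == 1000
```

The default doubles as both the fallback value and the expected type, which is why every evaluation call in this post passes an explicit `default_value`.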
## Implementation: OpenFeature Standard

OpenFeature is a CNCF standard that decouples your feature flag code from the specific vendor. You write against the OpenFeature SDK; you can swap providers without changing application code.

```python
# pip install openfeature-sdk openfeature-provider-launchdarkly
from datetime import datetime

from openfeature import api
from openfeature.evaluation_context import EvaluationContext
from openfeature.provider.launchdarkly import LaunchDarklyProvider  # or any other provider

# Initialize with your provider (done once at startup)
api.set_provider(LaunchDarklyProvider(sdk_key="sdk-your-key-here"))
client = api.get_client()

# In your request handler
def checkout_handler(request):
    # Build context from request
    ctx = EvaluationContext(
        targeting_key=str(request.user.id),
        attributes={
            "plan": request.user.plan,  # "free" / "pro" / "enterprise"
            "region": request.headers.get("CF-IPCountry", "US"),
            "email": request.user.email,  # For beta cohorts
            "company_id": str(request.user.company_id),
            "user_age_days": (datetime.now() - request.user.created_at).days,
        },
    )

    # Boolean flag — is the new checkout enabled for this user?
    use_new_checkout = client.get_boolean_value(
        "new-checkout-flow",
        default_value=False,
        evaluation_context=ctx,
    )

    # String flag — which pricing experiment variant?
    pricing_variant = client.get_string_value(
        "pricing-page-experiment",
        default_value="control",
        evaluation_context=ctx,
    )

    if use_new_checkout:
        return new_checkout_view(request, pricing_variant)
    else:
        return legacy_checkout_view(request)
```
## Gradual Rollout with LaunchDarkly

LaunchDarkly is the market leader in 2026. Configuration is in their dashboard, but also exportable as JSON:

```json
{
  "key": "new-checkout-flow",
  "kind": "boolean",
  "variations": [false, true],
  "rules": [
    {
      "description": "Internal team always sees new checkout",
      "clauses": [{"attribute": "email", "op": "endsWith", "values": ["@mycompany.com"]}],
      "variation": 1
    },
    {
      "description": "Enterprise customers excluded (revenue risk)",
      "clauses": [{"attribute": "plan", "op": "in", "values": ["enterprise"]}],
      "variation": 0
    }
  ],
  "fallthrough": {
    "rollout": {
      "variations": [
        {"variation": 0, "weight": 80000},
        {"variation": 1, "weight": 20000}
      ]
    }
  },
  "offVariation": 0,
  "on": true
}
```
This flag configuration: always shows new checkout to @mycompany.com users, never shows it to enterprise, rolls it out to 20% of everyone else.
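The 80,000/20,000 weights are out of 100,000, and the 20% rollout is sticky: each user is deterministically hashed into a bucket, so the same user sees the same variant on every request. A minimal sketch of weight-based bucketing; the hash scheme here is illustrative, not LaunchDarkly's exact algorithm:

```python
import hashlib

BUCKETS = 100_000  # weights in the JSON above are out of 100,000

def bucket(flag_key: str, user_key: str) -> int:
    """Deterministically map a (flag, user) pair to a bucket in [0, 100000)."""
    digest = hashlib.sha256(f"{flag_key}.{user_key}".encode()).hexdigest()
    return int(digest[:8], 16) % BUCKETS

def in_rollout(flag_key: str, user_key: str, weight: int) -> bool:
    """True if this user falls inside the first `weight` buckets."""
    return bucket(flag_key, user_key) < weight

# Same user, same answer every time ("sticky" bucketing)...
assert in_rollout("new-checkout-flow", "user-42", 20_000) == in_rollout("new-checkout-flow", "user-42", 20_000)

# ...and across many users, roughly 20% land inside a 20,000-bucket rollout.
hits = sum(in_rollout("new-checkout-flow", f"user-{i}", 20_000) for i in range(10_000))
assert 1_700 < hits < 2_300
```

Hashing on the flag key as well as the user key means a user in the 20% for one flag isn't automatically in the 20% for every other flag.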
## Self-Hosted: Unleash

For teams that can't send user data to a SaaS vendor (GDPR, security policies), Unleash is the best open-source alternative:

```yaml
# docker-compose.yml for Unleash
version: '3'
services:
  unleash:
    image: unleashorg/unleash-server:latest
    ports:
      - "4242:4242"
    environment:
      DATABASE_URL: postgres://unleash:password@db/unleash
      UNLEASH_DEFAULT_ADMIN_USERNAME: admin
      UNLEASH_DEFAULT_ADMIN_PASSWORD: changeme
    depends_on:
      - db
  db:
    image: postgres:16
    environment:
      POSTGRES_DB: unleash
      POSTGRES_USER: unleash
      POSTGRES_PASSWORD: password
```

```python
# Python SDK for Unleash
from UnleashClient import UnleashClient

client = UnleashClient(
    url="http://localhost:4242/api",
    app_name="my-app",
    custom_headers={"Authorization": "Bearer your-api-key"},
)
client.initialize_client()

# Evaluate with context
context = {
    "userId": str(user.id),
    "properties": {
        "plan": user.plan,
        "region": user.region,
    },
}

if client.is_enabled("new-checkout-flow", context):
    return new_checkout()
else:
    return legacy_checkout()
```
## Four Patterns That Make Feature Flags Powerful

### Pattern 1: The Kill Switch

The most operationally valuable flag type. A kill switch is a boolean flag that's ON in production — until something breaks. Then you turn it OFF in 5 seconds without a deploy.

```python
# Kill switch for a new payment processor
@app.route('/api/payments', methods=['POST'])
def process_payment():
    if not client.get_boolean_value("new-payment-processor", default_value=False, evaluation_context=ctx):
        # Old processor
        return legacy_payment_processor.charge(request.json)
    try:
        return new_payment_processor.charge(request.json)
    except NewProcessorException as e:
        # Automatic fallback + alert
        alert_pagerduty(f"New payment processor failed: {e}")
        return legacy_payment_processor.charge(request.json)
```
When the new processor has issues, ops turns off the flag. No deploy. No rollback. No 3am war room. Just a toggle.
### Pattern 2: Ring Deployment

Deploy to progressively larger rings of users, validating metrics at each stage:

```mermaid
flowchart LR
    A["Ring 0\nInternal (0.1%)"] --> B["Ring 1\nBeta users (1%)"]
    B --> C["Ring 2\nFree tier (10%)"]
    C --> D["Ring 3\nPro tier (50%)"]
    D --> E["Ring 4\nAll users (100%)"]
    A -.->|"Monitor:\nerror rate\nlatency\nbusiness metrics"| A
    B -.->|"Monitor 24hrs"| B
    C -.->|"Monitor 48hrs"| C
    D -.->|"Monitor 72hrs"| D
    style A fill:#3b82f6,color:#fff
    style E fill:#22c55e,color:#fff
```

```python
# Ring deployment configuration
rings = [
    {"name": "internal", "targeting": {"email": {"endsWith": "@mycompany.com"}}, "weight": 100},
    {"name": "beta", "targeting": {"properties.beta_user": True}, "weight": 100},
    {"name": "free_10_percent", "targeting": {"plan": "free"}, "weight": 10},
    {"name": "pro_50_percent", "targeting": {"plan": "pro"}, "weight": 50},
    {"name": "all_users", "targeting": None, "weight": 100},
]

# Move to the next ring only after validating metrics
def advance_ring(flag_name: str, current_ring: int) -> bool:
    metrics = get_feature_metrics(flag_name, hours=24)
    if metrics.error_rate_increase > 0.01:  # 1% error rate increase
        alert(f"Flag {flag_name}: error rate elevated, holding at ring {current_ring}")
        return False
    if metrics.p99_latency_increase_ms > 50:  # 50ms p99 latency increase
        alert(f"Flag {flag_name}: latency elevated, holding at ring {current_ring}")
        return False
    return True  # Safe to advance
```
### Pattern 3: Experiment Flags with Statistical Significance

Feature flags become A/B testing infrastructure when you add metric tracking and significance testing:

```python
import numpy as np
import scipy.stats as stats

def evaluate_experiment(flag_key: str, metric_name: str, min_sample: int = 1000) -> dict:
    """
    Check if an experiment has reached statistical significance.
    Returns: variant recommendation and confidence level.
    """
    control_data = get_metric_data(flag_key, variant="control", metric=metric_name)
    treatment_data = get_metric_data(flag_key, variant="treatment", metric=metric_name)
    if min(len(control_data), len(treatment_data)) < min_sample:
        return {"status": "insufficient_data", "samples": len(control_data) + len(treatment_data)}

    # Two-sample t-test for continuous metrics (e.g., revenue per user, session
    # duration). For binary outcomes like conversion, a proportions z-test is the
    # standard choice, though the t-test is a reasonable large-sample approximation.
    t_stat, p_value = stats.ttest_ind(control_data, treatment_data)
    control_mean = np.mean(control_data)
    treatment_mean = np.mean(treatment_data)
    lift = (treatment_mean - control_mean) / control_mean * 100

    return {
        "status": "significant" if p_value < 0.05 else "not_significant",
        "p_value": round(p_value, 4),
        "lift_percent": round(lift, 2),
        "control_mean": round(control_mean, 4),
        "treatment_mean": round(treatment_mean, 4),
        "recommendation": (
            "ship" if (p_value < 0.05 and lift > 0)
            else "rollback" if (p_value < 0.05 and lift < 0)
            else "continue"
        ),
        "samples": len(control_data) + len(treatment_data),
    }

# Usage:
result = evaluate_experiment("checkout-redesign", "conversion_rate")
# → {"status": "significant", "lift_percent": 3.4, "p_value": 0.012, "recommendation": "ship"}
```
### Pattern 4: Operational Configuration Flags

Flags aren't just for features. Use them for runtime configuration that operations may need to adjust under load:

```python
# Rate limit configuration that ops can adjust without a deploy
rate_config = client.get_object_value(
    "api-rate-limits",
    default_value={"free": 100, "pro": 1000, "enterprise": 10000},
    evaluation_context=ctx,
)

# Under attack, ops sets: {"free": 10, "pro": 100, "enterprise": 1000}
# 10× reduction across the board, in 30 seconds, without a deploy
if request.rate_count > rate_config[user.plan]:
    return Response(status=429, headers={"Retry-After": "60"})
```
## Cost and Latency: What Flag Evaluation Actually Costs
The operational concern teams often raise: "Won't feature flag evaluation add latency?" The answer, when implemented correctly: no.
Modern flag SDKs use a streaming architecture. On startup, the SDK downloads all flag definitions and stores them in memory. Flag evaluation happens entirely in-process — no network call, no database lookup. The SDK subscribes to a server-sent event stream and updates its local cache when flags change.
Evaluation time: sub-millisecond. Typically 50-200 microseconds, including context evaluation and rule matching.
The only performance concern is the initial SDK initialization (100-500ms to download and cache all flags). Don't evaluate flags before initialization completes — use the async initialization pattern with defaults.
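One way to apply that pattern, sketched with a hypothetical `FlagClient` stub (the stub stands in for a real SDK, so the names and timings are illustrative): bound how long startup waits for initialization, and serve defaults until it completes.

```python
import asyncio

class FlagClient:
    """Hypothetical flag client stub standing in for a real SDK."""

    def __init__(self) -> None:
        self.ready = False
        self._flags: dict[str, bool] = {}

    async def initialize(self) -> None:
        await asyncio.sleep(0.05)  # simulate downloading flag definitions
        self._flags = {"new-feature": True}
        self.ready = True

    def get_boolean_value(self, key: str, default: bool) -> bool:
        # Safety net: before initialization completes, always return the default.
        if not self.ready:
            return default
        return self._flags.get(key, default)

async def startup(client: FlagClient, timeout_s: float = 2.0) -> None:
    try:
        # Bound the wait; on timeout the app still boots and every
        # evaluation falls back to its default value.
        await asyncio.wait_for(client.initialize(), timeout=timeout_s)
    except asyncio.TimeoutError:
        pass

client = FlagClient()
assert client.get_boolean_value("new-feature", default=False) is False  # pre-init: default
asyncio.run(startup(client))
assert client.get_boolean_value("new-feature", default=False) is True   # post-init: real value
```

The important property is that evaluation never blocks and never raises during the initialization window; it just degrades to defaults.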
```python
# Benchmark flag evaluation latency
import statistics
import time

latencies = []
for _ in range(10000):
    start = time.perf_counter()
    client.get_boolean_value("new-feature", default_value=False, evaluation_context=ctx)
    latencies.append((time.perf_counter() - start) * 1000)

print(f"p50: {statistics.median(latencies):.3f}ms")                # → 0.041ms
print(f"p99: {statistics.quantiles(latencies, n=100)[98]:.3f}ms")  # → 0.128ms
```
The LaunchDarkly and Unleash SDKs both benchmark at under 0.2ms p99 for flag evaluation. For 99.9% of applications, feature flag evaluation won't register in your performance budget.
## Managing Technical Debt: Flag Lifecycle
The danger of feature flags is accumulating hundreds of stale flags in your codebase. Each flag is a branch in your logic — too many, and the code becomes impossible to reason about.
```python
# Flag with built-in expiry tracking (flag_lifecycle is a hypothetical decorator)
@flag_lifecycle(
    flag_key="new-checkout-flow",
    expected_ship_date="2026-06-01",
    owner="team-checkout",
    jira_ticket="ENG-4521",
)
def checkout_handler(request):
    if client.get_boolean_value("new-checkout-flow", default_value=False, evaluation_context=ctx):
        ...
```
Enforce flag retirement:
1. Set a ticket at flag creation: Create the cleanup ticket before the flag goes live
2. Alert on old flags: Monitor for flags > 90 days old that haven't been cleaned up
3. Regular flag reviews: Quarterly audit of all flags — is each one still needed?
```sql
-- Query to find stale flags (LaunchDarkly stores flag metadata)
SELECT flag_key, created_date, last_modified, owner
FROM feature_flags
WHERE last_modified < NOW() - INTERVAL '90 days'
  AND is_permanent = false
ORDER BY last_modified ASC;
```
## Server-Side vs Client-Side Flags
Feature flags can be evaluated in two places:
Server-side flags: Evaluated in your backend. The client never sees the flag state — it only receives the feature or doesn't. No flag state exposed in client-side JavaScript. Good for: security-sensitive features, anything involving backend logic, pricing experiments.
Client-side flags: Evaluated in the browser or mobile app. The SDK downloads flag definitions and evaluates them locally. Enables UI personalization without a server round trip. Risk: flag rules are visible in client-side JavaScript — don't use for features you want to hide from users who inspect network traffic.
```tsx
// Client-side SDK (LaunchDarkly Browser SDK)
import { LDClient, initialize } from 'launchdarkly-js-client-sdk';

const user = {
  kind: 'user',
  key: currentUser.id,
  plan: currentUser.plan,
  email: currentUser.email,
};

const client: LDClient = initialize('client-side-sdk-key', user);
await client.waitForInitialization();

// Evaluate a flag — happens locally, no server call
const showNewNav = client.variation('new-navigation', false);
if (showNewNav) {
  renderNewNavigation();
}

// React hook pattern
import { useLDClient, useFlags } from 'launchdarkly-react-client-sdk';

function Navigation() {
  const { 'new-navigation': showNewNav } = useFlags();
  return showNewNav ? <NewNavigation /> : <LegacyNavigation />;
}
```
For the backend equivalent:
```python
# Server-side: flag evaluated in Python, result passed to template
def homepage_view(request):
    ctx = EvaluationContext(
        targeting_key=str(request.user.id),
        attributes={"plan": request.user.plan},
    )
    show_new_nav = client.get_boolean_value("new-navigation", default_value=False, evaluation_context=ctx)
    return render(request, "homepage.html", {
        "show_new_nav": show_new_nav,
        # Flag state is in template context, not exposed as JS to the client
    })
```
The hybrid pattern: Use server-side flags for feature gating; use client-side for UI personalization where the latency of a server round-trip would be noticeable (navigation, layout).
## Flag Targeting: Beyond Percentage Rollouts
Percentage rollouts are the most common targeting strategy, but several others are more appropriate in specific situations:
```python
# Targeting strategies and when to use them

# 1. User segment: specific users by attribute
#    Use for: beta cohorts, VIP customers, internal team
flag_config = {
    "rules": [
        {"clauses": [{"attribute": "email", "op": "endsWith", "values": ["@mycompany.com"]}], "variation": 1},
        {"clauses": [{"attribute": "beta_opt_in", "op": "in", "values": [True]}], "variation": 1},
    ]
}

# 2. Sticky bucketing: same user always gets the same variant
#    Default behavior in most SDKs — user hash is consistent
#    Critical for A/B tests: users shouldn't switch groups mid-experiment

# 3. Time-based: flag automatically turns off after a date
#    Use for: temporary maintenance banners, holiday promotions
from datetime import datetime

def time_gated_flag(flag_name: str, end_date: datetime) -> bool:
    if datetime.now() > end_date:
        return False
    return client.get_boolean_value(flag_name, default_value=False, evaluation_context=ctx)

# 4. Dependency: flag only active if another flag is active
#    Use for: progressive feature builds
def dependent_flag(parent_flag: str, child_flag: str) -> bool:
    if not client.get_boolean_value(parent_flag, default_value=False, evaluation_context=ctx):
        return False
    return client.get_boolean_value(child_flag, default_value=False, evaluation_context=ctx)

# 5. Context-based: target by request properties
#    Use for: region-specific features, mobile vs web
ctx_with_request = EvaluationContext(
    targeting_key=str(user.id),
    attributes={
        "region": request.headers.get("CF-IPCountry", "US"),
        "platform": request.headers.get("X-Platform", "web"),  # "ios", "android", "web"
        "app_version": request.headers.get("X-App-Version", "0.0.0"),
    },
)
```
## Production Considerations

### SDK Initialization and Fallbacks

The flag SDK must not block application startup or add request latency. Initialize asynchronously, provide defaults, cache aggressively:

```python
# Async initialization — don't block startup
async def startup():
    await flag_client.initialize()  # Fetches flags from service, caches locally

# During a startup failure, default_value is your safety net.
# Never let flag evaluation throw exceptions into your business logic:
try:
    enabled = client.get_boolean_value("new-feature", default_value=False, evaluation_context=ctx)
except Exception as e:
    log.error(f"Flag evaluation failed: {e}")
    enabled = False  # Fail safe
```
### SDKs Are Local — Flag Changes Are Near-Instant
Modern flag SDKs stream flag updates via server-sent events. Changes in the LaunchDarkly or Unleash dashboard propagate to all running SDK instances in 1-2 seconds. You don't need a deploy, a restart, or an API call — the change propagates automatically.
This is what makes kill switches operationally effective: you flip the flag, and within seconds, traffic shifts.
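That streaming loop can be sketched as a background listener applying updates to a shared in-memory cache. Here a `queue.Queue` stands in for the SSE stream, so the transport is simulated; the point is the shape of the cache and listener:

```python
import queue
import threading

class FlagCache:
    """In-memory flag store updated by a background 'stream' listener."""

    def __init__(self) -> None:
        self._flags: dict[str, bool] = {}
        self._lock = threading.Lock()

    def set(self, key: str, value: bool) -> None:
        with self._lock:
            self._flags[key] = value

    def get(self, key: str, default: bool = False) -> bool:
        # Reads never leave the process: this is the sub-millisecond path.
        with self._lock:
            return self._flags.get(key, default)

def listen(stream: "queue.Queue[tuple[str, bool] | None]", cache: FlagCache) -> None:
    """Apply (flag, value) events from the stream until a None sentinel."""
    while (event := stream.get()) is not None:
        cache.set(*event)

stream: "queue.Queue[tuple[str, bool] | None]" = queue.Queue()
cache = FlagCache()
t = threading.Thread(target=listen, args=(stream, cache), daemon=True)
t.start()

stream.put(("new-payment-processor", True))   # ops toggles the flag on
stream.put(("new-payment-processor", False))  # ...and kills it seconds later
stream.put(None)
t.join(timeout=2)
assert cache.get("new-payment-processor") is False
```

Request handlers only ever touch the cache; the listener is the single writer, which is why a dashboard toggle reaches every instance without restarting anything.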
## Feature Flags in CI/CD: Testing Against Real Flag States
Testing with feature flags requires understanding which variant your tests should run against. Two strategies:
Test both variants: Run your integration test suite against each variant. This ensures neither path regresses during a rollout.
```python
# pytest parameterization over flag variants
from unittest.mock import patch

import pytest

@pytest.fixture(params=["control", "treatment"])
def checkout_variant(request):
    """Run each test against both checkout variants."""
    with patch.object(flag_client, 'get_string_value', return_value=request.param):
        yield request.param

def test_checkout_completes(client, checkout_variant):
    """Both variants should complete checkout successfully."""
    response = client.post('/api/checkout', json={"items": [{"id": 1, "qty": 1}]})
    assert response.status_code == 200
    assert response.json()["order_id"] is not None
    # Test passes regardless of which variant — both paths must work
```
Flag overrides in staging: Set specific users to specific variants in staging environments for deterministic testing.
```yaml
# LaunchDarkly: staging environment flag overrides
# Individual user targeting (by ID) overrides all rules
user_targets:
  - variation: 1  # Treatment variant
    values:
      - user_id_of_test_account_1
      - user_id_of_qa_bot
      - user_id_of_automated_test_user
```
This ensures your QA environment always sees the new variant, while developers can test the control variant by using their personal accounts.
### The Flag-Deployment Dependency

One subtle CI/CD consideration: the flag must exist in the flag service before the code that references it is deployed. If you deploy code that calls `client.get_boolean_value("new-feature", ...)` before creating the flag in LaunchDarkly/Unleash, the SDK returns the default value. That's fine if your default value is the safe path (`default_value=False` for disabled features).
The workflow:
1. Create flag in flag service (off by default, 0% rollout)
2. Deploy code (all users get default value = old behavior)
3. Enable flag for internal team (0.1% → validate)
4. Gradual rollout (1% → 10% → 50% → 100%)
5. Clean up flag (remove from code + delete from flag service)
Never deploy code that requires a flag to be already enabled on deploy. Always code for the default-value path to be safe.
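A cheap guard for step 1 is a CI check that fails when code references a flag key the service doesn't know about. A sketch, assuming you can export the service's flag keys somehow; the regex matches the `get_*_value("...")` call style used throughout this post:

```python
import re

FLAG_CALL = re.compile(r'get_\w+_value\(\s*["\']([\w-]+)["\']')

def referenced_flags(source: str) -> set[str]:
    """Flag keys referenced via client.get_*_value("key", ...) calls."""
    return set(FLAG_CALL.findall(source))

def missing_flags(source: str, service_flags: set[str]) -> set[str]:
    """Keys the code references that don't exist in the flag service yet."""
    return referenced_flags(source) - service_flags

code = '''
enabled = client.get_boolean_value("new-checkout-flow", default_value=False)
variant = client.get_string_value("pricing-page-experiment", default_value="control")
'''
# Fail CI when a referenced flag hasn't been created in the service yet.
assert missing_flags(code, {"new-checkout-flow"}) == {"pricing-page-experiment"}
```

Run it over the repo in CI and exit non-zero if the set is non-empty; the same scan, inverted, also surfaces flags that exist in the service but are no longer referenced anywhere (cleanup candidates).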
## Conclusion
Feature flags are one of the highest-leverage tools in modern software delivery. They separate deployment from release, enable safe experimentation, and give operations teams the ability to respond to incidents in seconds rather than minutes.
The key practices:
- OpenFeature standard decouples your code from vendor lock-in
- Ring deployments validate changes incrementally before full rollout
- Kill switches are the most operationally valuable flag type — add them proactively
- Statistical significance testing turns experiments into data, not opinions
- Flag lifecycle management prevents the codebase from becoming an unmaintainable branch forest
Start simple: add a kill switch to your next risky feature. Measure the reduction in rollback frequency and incident duration. The ROI makes itself obvious quickly.
The broader shift feature flags enable is cultural: deployment becomes routine rather than an event. When you can deploy code to production with zero users seeing it, and gradually roll it out with metrics validation at each step, the fear of shipping disappears. Teams ship more often, in smaller increments, with more confidence. That's the real value of the pattern — not just the kill switch, but the deployment culture it enables.
## Sources
- OpenFeature CNCF project documentation
- LaunchDarkly engineering blog: Feature flag best practices
- Unleash self-hosted documentation
- Experimentation at scale: Airbnb, Stripe, Facebook engineering blogs
- Continuous Delivery (Jez Humble, David Farley) — chapter on dark launching