A/B Testing Infrastructure: Complete Setup Guide

Whistl continuously tests intervention variations to maximise effectiveness. This technical guide explains experiment design, feature flags, statistical analysis, sequential testing, and how Whistl runs hundreds of experiments to optimise user outcomes.

Why A/B Testing Matters

Intervention effectiveness varies by individual:

  • Message tone: Tough Love vs. Supportive coaching
  • Timing: Immediate vs. delayed intervention
  • Step ordering: Which negotiation steps work best
  • Visual design: Goal imagery that motivates
  • Notification copy: Messages that drive engagement

Whistl runs 50+ concurrent experiments to continuously improve outcomes.

Experiment Architecture

Whistl's experimentation platform supports complex testing:

System Overview

┌─────────────┐    ┌──────────────┐    ┌─────────────┐
│   Client    │    │   Experiment │    │   Analytics │
│   App       │───▶│   Service    │───▶│   Pipeline  │
│             │    │              │    │             │
└─────────────┘    └──────────────┘    └─────────────┘
      │                    │                    │
      │  - Request config  │  - Assign variant  │
      │  - Report event    │  - Log exposure    │
      │                    │                    │
      ▼                    ▼                    ▼
  Feature Flags       Randomization        Statistical
  (Local Cache)       (Hash-based)         Analysis

Feature Flag System

Feature flags control experiment variants:

Flag Configuration

{
  "experiment_id": "intervention_tone_v3",
  "name": "Intervention Tone Optimization",
  "status": "running",
  "start_date": "2026-02-01T00:00:00Z",
  "end_date": null,
  "variants": [
    {
      "id": "control",
      "name": "Current Tone",
      "allocation": 0.25,
      "config": {
        "tone": "balanced",
        "emoji_usage": "moderate"
      }
    },
    {
      "id": "tough_love",
      "name": "Direct Approach",
      "allocation": 0.25,
      "config": {
        "tone": "direct",
        "emoji_usage": "minimal"
      }
    },
    {
      "id": "supportive",
      "name": "Empathetic Approach",
      "allocation": 0.25,
      "config": {
        "tone": "supportive",
        "emoji_usage": "frequent"
      }
    },
    {
      "id": "adaptive",
      "name": "ML-Selected Tone",
      "allocation": 0.25,
      "config": {
        "tone": "ml_selected",
        "emoji_usage": "ml_selected"
      }
    }
  ],
  "targeting": {
    "countries": ["AU", "NZ", "US", "UK"],
    "app_versions": ["1.8.0+"],
    "user_segments": ["active_30d"]
  },
  "primary_metric": "intervention_acceptance_rate",
  "guardrail_metrics": ["app_uninstall_rate", "session_duration"]
}
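Client apps can decode this payload with Codable. A minimal sketch follows; the model covers only a subset of the fields above, and the type and property names are illustrative, not Whistl's actual SDK types:

```swift
import Foundation

// Minimal Codable model for the flag payload above. Snake_case JSON keys
// (experiment_id, primary_metric, ...) map to camelCase properties via
// the .convertFromSnakeCase strategy; unmodeled keys are simply ignored.
struct ExperimentFlagConfig: Codable {
    struct VariantConfig: Codable {
        let id: String
        let name: String
        let allocation: Double
        let config: [String: String]
    }
    let experimentId: String
    let status: String
    let variants: [VariantConfig]
    let primaryMetric: String
}

func decodeFlag(_ data: Data) throws -> ExperimentFlagConfig {
    let decoder = JSONDecoder()
    decoder.keyDecodingStrategy = .convertFromSnakeCase
    return try decoder.decode(ExperimentFlagConfig.self, from: data)
}
```

A sanity check worth running after decoding is that variant allocations sum to 1.0 before the flag is used for assignment.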

Client-Side Flag Evaluation

class ExperimentManager {
    private var flags: [String: ExperimentFlag] = [:]
    private var userVariant: [String: String] = [:]
    
    func evaluateFlag(experimentId: String) -> Variant? {
        // Unknown or not-yet-synced experiment: return nil so the caller
        // falls back to default behaviour
        guard let flag = flags[experimentId] else {
            return nil
        }
        
        // Check if user already assigned
        if let assignedVariant = userVariant[experimentId] {
            return flag.getVariant(assignedVariant)
        }
        
        // Assign variant based on user ID hash
        let variant = assignVariant(user: currentUser, flag: flag)
        userVariant[experimentId] = variant.id
        
        // Log exposure
        logExposure(experimentId: experimentId, variant: variant)
        
        return variant
    }
    
    private func assignVariant(user: User, flag: ExperimentFlag) -> Variant {
        // Hash user ID + experiment ID for consistent assignment
        let hash = hash("\(user.id):\(flag.id)")
        let bucket = Double(hash % 100) / 100.0  // uniform in 0.0..<1.0
        
        // Find variant based on allocation
        var cumulative: Double = 0
        for variant in flag.variants {
            cumulative += variant.allocation
            if bucket < cumulative {
                return variant
            }
        }
        
        // Allocations should sum to 1.0; fall back to the last variant to
        // guard against floating-point rounding
        return flag.variants.last!
    }
}
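One subtlety in `assignVariant`: Swift's built-in `Hasher` (and `hashValue`) is randomly seeded on every process launch, so it cannot back the `hash(_:)` call when assignments must stay consistent across sessions. A deterministic alternative is sketched below; FNV-1a and the 10,000-bucket granularity are assumptions, and any stable hash (murmur3, SHA-256) works equally well:

```swift
import Foundation

// Deterministic 64-bit FNV-1a hash over the key's UTF-8 bytes.
// Unlike Swift's Hasher, this gives the same value on every launch,
// so a user keeps the same variant for the life of the experiment.
func stableHash(_ key: String) -> UInt64 {
    var hash: UInt64 = 0xcbf29ce484222325        // FNV offset basis
    for byte in key.utf8 {
        hash ^= UInt64(byte)
        hash = hash &* 0x100000001b3             // FNV prime (wrapping multiply)
    }
    return hash
}

// Map "userId:experimentId" to a bucket in [0.0, 1.0) for allocation lookup.
func bucket(userId: String, experimentId: String) -> Double {
    Double(stableHash("\(userId):\(experimentId)") % 10_000) / 10_000.0
}
```

Including the experiment ID in the key decorrelates assignments across experiments, so a user bucketed into "treatment" in one test is not systematically bucketed into "treatment" in every other.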

Experiment Types

Whistl runs multiple experiment types:

A/B Test (Two Variants)

Variant        Allocation   Description
Control (A)    50%          Current intervention message
Treatment (B)  50%          New message with urgency framing

Multivariate Test (Multiple Factors)

Factor           Variants
Message Length   Short, Medium, Long
Tone             Supportive, Direct, Neutral
CTA Button       "Call Sponsor", "Breathe", "I'm OK"

Tests all combinations: 3 × 3 × 3 = 27 variants
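The full set of combinations can be enumerated mechanically rather than defined by hand. A sketch (factor values mirror the table above; the struct and its identifier scheme are illustrative):

```swift
// Enumerate every factor combination for the multivariate test above.
let lengths = ["short", "medium", "long"]
let tones = ["supportive", "direct", "neutral"]
let ctas = ["Call Sponsor", "Breathe", "I'm OK"]

struct MultivariateVariant {
    let length: String
    let tone: String
    let cta: String
    // Stable identifier derived from the factor values
    var id: String { "\(length)_\(tone)_\(cta)" }
}

// Cartesian product: 3 x 3 x 3 = 27 variants
let combinations = lengths.flatMap { length in
    tones.flatMap { tone in
        ctas.map { cta in
            MultivariateVariant(length: length, tone: tone, cta: cta)
        }
    }
}
```

In practice the combination count is the main cost of multivariate tests: 27 variants need roughly 27 times the sample of a two-arm test to reach the same per-variant power.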

Sequential Test

Traffic allocation adapts as results accumulate, shifting users toward better-performing variants:

class SequentialTest {
    func allocateTraffic(variants: [Variant]) -> [Double] {
        // Thompson Sampling: draw one sample from each variant's Beta
        // posterior, then weight traffic by the draws
        var allocations: [Double] = []
        
        for variant in variants {
            // Beta(successes + 1, failures + 1) posterior (uniform prior)
            let sample = betaSample(
                alpha: variant.successes + 1,
                beta: variant.failures + 1
            )
            allocations.append(sample)
        }
        
        // Normalize draws to allocation probabilities
        let sum = allocations.reduce(0, +)
        return allocations.map { $0 / sum }
    }
}
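The Beta posterior sampling used above has no standard-library implementation, but it needs no external dependency either. Assuming the posterior parameters are integers (successes + 1, failures + 1 under a uniform prior), Gamma(k, 1) reduces to a sum of k Exponential(1) draws, and Beta(a, b) = X / (X + Y) for independent X ~ Gamma(a), Y ~ Gamma(b). A sketch:

```swift
import Foundation

// Gamma(k, 1) for integer shape k: sum of k Exponential(1) draws,
// each obtained by inverse transform, -log(U) for U ~ Uniform(0, 1).
func gammaSample(shape k: Int) -> Double {
    var sum = 0.0
    for _ in 0..<k {
        sum += -log(Double.random(in: Double.leastNonzeroMagnitude..<1.0))
    }
    return sum
}

// Beta(a, b) via the ratio-of-Gammas construction.
func betaSample(alpha: Int, beta: Int) -> Double {
    let x = gammaSample(shape: alpha)
    let y = gammaSample(shape: beta)
    return x / (x + y)
}
```

For large or non-integer shapes a Marsaglia-Tsang gamma sampler is the usual choice; the sum-of-exponentials form is shown here only because it is short and exact for integer shapes.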

Statistical Analysis

Experiments use rigorous statistical methods:

Sample Size Calculation

class SampleSizeCalculator {
    func calculate(
        baselineRate: Double,
        minimumDetectableEffect: Double,
        power: Double = 0.8,
        significance: Double = 0.05
    ) -> Int {
        let p1 = baselineRate
        let p2 = baselineRate * (1 + minimumDetectableEffect)
        
        let zAlpha = zScore(1 - significance / 2)  // 1.96 for 95%
        let zBeta = zScore(power)  // 0.84 for 80% power
        
        let pooledP = (p1 + p2) / 2
        
        let numerator = pow(zAlpha * sqrt(2 * pooledP * (1 - pooledP)) + 
                           zBeta * sqrt(p1 * (1 - p1) + p2 * (1 - p2)), 2)
        let denominator = pow(p1 - p2, 2)
        
        // Per-group n; the factor of 2 inside the first sqrt already
        // accounts for the two-group comparison
        return Int(ceil(numerator / denominator))
    }
}

// Example: baseline 70% acceptance, detecting a 5% relative improvement
// (70% -> 73.5%) at 80% power needs roughly 2,600 users per variant

Significance Testing

class SignificanceTest {
    func test(
        controlSuccesses: Int,
        controlTotal: Int,
        treatmentSuccesses: Int,
        treatmentTotal: Int
    ) -> TestResult {
        let p1 = Double(controlSuccesses) / Double(controlTotal)
        let p2 = Double(treatmentSuccesses) / Double(treatmentTotal)
        
        let pooledP = Double(controlSuccesses + treatmentSuccesses) / 
                      Double(controlTotal + treatmentTotal)
        
        let se = sqrt(pooledP * (1 - pooledP) * 
                     (1.0 / Double(controlTotal) + 1.0 / Double(treatmentTotal)))
        
        let zScore = (p2 - p1) / se
        let pValue = 2 * (1 - normalCDF(abs(zScore)))
        
        return TestResult(
            zScore: zScore,
            pValue: pValue,
            significant: pValue < 0.05,
            confidenceInterval: calculateCI(p1, p2, se)
        )
    }
}
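The `normalCDF` helper is not in the Swift standard library, but it follows directly from the C math library's `erf`, which Foundation re-exports on both Apple and Linux platforms:

```swift
import Foundation

// Standard normal CDF via the error function:
// Phi(z) = (1 + erf(z / sqrt(2))) / 2
func normalCDF(_ z: Double) -> Double {
    0.5 * (1.0 + erf(z / 2.0.squareRoot()))
}

// Two-sided p-value for a z statistic, as used in SignificanceTest.test
func twoSidedPValue(z: Double) -> Double {
    2.0 * (1.0 - normalCDF(abs(z)))
}
```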

Bayesian Analysis

class BayesianAnalysis {
    func analyze(control: VariantData, treatment: VariantData) -> BayesianResult {
        // Beta distributions for each variant
        let controlDist = BetaDistribution(
            alpha: control.successes + 1,
            beta: control.failures + 1
        )
        
        let treatmentDist = BetaDistribution(
            alpha: treatment.successes + 1,
            beta: treatment.failures + 1
        )
        
        // Monte Carlo simulation
        let samples = 10000
        var treatmentBetter = 0
        
        for _ in 0..<samples {
            let controlSample = controlDist.sample()
            let treatmentSample = treatmentDist.sample()
            if treatmentSample > controlSample {
                treatmentBetter += 1
            }
        }
        
        let probabilityTreatmentBetter = Double(treatmentBetter) / Double(samples)
        
        return BayesianResult(
            probabilityTreatmentBetter: probabilityTreatmentBetter,
            credibleInterval: calculateCredibleInterval(treatmentDist),
            expectedLoss: calculateExpectedLoss(controlDist, treatmentDist)
        )
    }
}

Experiment Metrics

Whistl tracks multiple metrics per experiment:

Primary Metrics

Metric                         Definition                             Target
Intervention Acceptance Rate   % who engage with intervention         >70%
Breathing Completion Rate      % who complete breathing exercise      >50%
Partner Contact Rate           % who contact accountability partner   >30%
Goal Engagement                Dream board views per week             >5

Guardrail Metrics

Metric               Threshold       Action
App Uninstall Rate   <2x control     Stop experiment if exceeded
Session Duration     >0.8x control   Warning if too low
Support Tickets      <3x control     Investigate if elevated
Crash Rate           No increase     Immediate stop if increased
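Guardrail checks of this kind are easy to automate against each analytics snapshot. A sketch, assuming per-variant metrics arrive as simple rate dictionaries (the metric keys and the `GuardrailAction` type are illustrative):

```swift
// Possible outcomes of a guardrail evaluation, in increasing severity.
enum GuardrailAction {
    case continueExperiment
    case warn(String)
    case stop(String)
}

// Evaluate treatment metrics against the guardrail thresholds above.
// Missing metrics are skipped rather than treated as violations.
func checkGuardrails(control: [String: Double],
                     treatment: [String: Double]) -> GuardrailAction {
    // Hard stop: uninstall rate at or above 2x control
    if let c = control["app_uninstall_rate"],
       let t = treatment["app_uninstall_rate"], t >= 2 * c {
        return .stop("Uninstall rate reached 2x control")
    }
    // Hard stop: any crash-rate increase
    if let c = control["crash_rate"],
       let t = treatment["crash_rate"], t > c {
        return .stop("Crash rate increased over control")
    }
    // Soft warning: session duration below 0.8x control
    if let c = control["session_duration"],
       let t = treatment["session_duration"], t < 0.8 * c {
        return .warn("Session duration below 0.8x control")
    }
    return .continueExperiment
}
```

Ordering matters: stop conditions are checked before warnings, so a variant that is both crashing and shortening sessions is halted rather than merely flagged.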

Experiment Lifecycle

Experiments follow a structured process:

Stages

  1. Hypothesis: Define expected outcome
  2. Design: Specify variants, metrics, sample size
  3. Review: Ethics and privacy review
  4. Launch: Deploy to small percentage
  5. Monitor: Watch guardrail metrics
  6. Analyze: Statistical analysis when complete
  7. Decide: Roll out, iterate, or abandon

Example Experiment

// Experiment: Notification Timing
{
  "hypothesis": "Sending notifications at personalized optimal times will increase engagement by 15%",
  
  "variants": [
    {"id": "control", "timing": "immediate", "allocation": 0.5},
    {"id": "treatment", "timing": "ml_optimized", "allocation": 0.5}
  ],
  
  "primary_metric": "notification_open_rate",
  "guardrail_metrics": ["app_uninstall_rate", "notification_disable_rate"],
  
  "sample_size": 10000,  // per variant
  "duration": "14 days",
  
  "success_criteria": {
    "min_improvement": 0.15,
    "significance_level": 0.05,
    "power": 0.8
  }
}
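The decide stage can apply the success criteria above mechanically; human review then sits on top of this baseline. A sketch (the `Decision` cases mirror stage 7 of the lifecycle; the struct and rate inputs are illustrative):

```swift
// Success criteria mirroring the JSON above.
struct SuccessCriteria {
    let minImprovement: Double      // required relative lift, e.g. 0.15
    let significanceLevel: Double   // e.g. 0.05
}

// Lifecycle stage 7 outcomes: roll out, iterate, or abandon.
enum Decision {
    case rollOut, iterate, abandon
}

func decide(controlRate: Double,
            treatmentRate: Double,
            pValue: Double,
            criteria: SuccessCriteria) -> Decision {
    let lift = (treatmentRate - controlRate) / controlRate
    if pValue < criteria.significanceLevel && lift >= criteria.minImprovement {
        return .rollOut        // significant and hit the target improvement
    }
    if pValue < criteria.significanceLevel && lift > 0 {
        return .iterate        // real but undersized win: refine and retest
    }
    return .abandon            // no significant positive effect
}
```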

Results Dashboard

Experiment results are visualized for the team:

Dashboard Metrics

  • Daily active users per variant
  • Cumulative metric values
  • Statistical significance over time
  • Segment breakdowns (iOS/Android, region)
  • Guardrail metric status

Ethical Considerations

Whistl follows ethical experimentation principles:

Ethics Guidelines

  • No harm: Experiments must not increase risk
  • Control is valid: Control must be current best practice
  • Privacy: No sensitive data used for targeting
  • Transparency: Users can opt out of experiments
  • Quick stopping: Harmful variants stopped immediately

Conclusion

Whistl's A/B testing infrastructure enables data-driven optimisation of intervention effectiveness. Through rigorous statistical analysis, ethical experimentation, and continuous learning, Whistl gets better at helping users every day.

Every experiment is an opportunity to improve outcomes—Whistl tests to learn, not just to win.

Experience Optimized Protection

Whistl continuously tests and improves intervention effectiveness. Download free and benefit from data-driven optimisation.

Download Whistl Free

Related: ML Model Updates | Privacy-Compliant Analytics | 8-Step Negotiation Engine