A/B Testing Infrastructure: Complete Setup Guide

Whistl continuously tests intervention variations to maximise effectiveness. This technical guide explains experiment design, feature flags, statistical analysis, sequential testing, and how Whistl runs hundreds of experiments to optimise user outcomes.

Why A/B Testing Matters

Intervention effectiveness varies by individual:

  • Message tone: Tough Love vs. Supportive coaching
  • Timing: Immediate vs. delayed intervention
  • Step ordering: Which negotiation steps work best
  • Visual design: Goal imagery that motivates
  • Notification copy: Messages that drive engagement

Whistl runs 50+ concurrent experiments to continuously improve outcomes.

Experiment Architecture

Whistl's experimentation platform supports complex testing:

System Overview

┌─────────────┐    ┌──────────────┐    ┌─────────────┐
│   Client    │    │   Experiment │    │   Analytics │
│   App       │───▶│   Service    │───▶│   Pipeline  │
│             │    │              │    │             │
└─────────────┘    └──────────────┘    └─────────────┘
      │                    │                    │
      │  - Request config  │  - Assign variant  │
      │  - Report event    │  - Log exposure    │
      │                    │                    │
      ▼                    ▼                    ▼
  Feature Flags       Randomization        Statistical
  (Local Cache)       (Hash-based)         Analysis

Feature Flag System

Feature flags control experiment variants:

Flag Configuration

{
  "experiment_id": "intervention_tone_v3",
  "name": "Intervention Tone Optimization",
  "status": "running",
  "start_date": "2026-02-01T00:00:00Z",
  "end_date": null,
  "variants": [
    {
      "id": "control",
      "name": "Current Tone",
      "allocation": 0.25,
      "config": {
        "tone": "balanced",
        "emoji_usage": "moderate"
      }
    },
    {
      "id": "tough_love",
      "name": "Direct Approach",
      "allocation": 0.25,
      "config": {
        "tone": "direct",
        "emoji_usage": "minimal"
      }
    },
    {
      "id": "supportive",
      "name": "Empathetic Approach",
      "allocation": 0.25,
      "config": {
        "tone": "supportive",
        "emoji_usage": "frequent"
      }
    },
    {
      "id": "adaptive",
      "name": "ML-Selected Tone",
      "allocation": 0.25,
      "config": {
        "tone": "ml_selected",
        "emoji_usage": "ml_selected"
      }
    }
  ],
  "targeting": {
    "countries": ["AU", "NZ", "US", "UK"],
    "app_versions": ["1.8.0+"],
    "user_segments": ["active_30d"]
  },
  "primary_metric": "intervention_acceptance_rate",
  "guardrail_metrics": ["app_uninstall_rate", "session_duration"]
}
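Client apps can decode this payload with Codable. A minimal sketch follows; the model covers only a subset of the fields above, and the type and property names are illustrative, not Whistl's actual SDK types:

```swift
import Foundation

// Minimal Codable model for the flag payload above. Snake_case JSON keys
// (experiment_id, primary_metric, ...) map to camelCase properties via
// the .convertFromSnakeCase strategy; unmodeled keys are simply ignored.
struct ExperimentFlagConfig: Codable {
    struct VariantConfig: Codable {
        let id: String
        let name: String
        let allocation: Double
        let config: [String: String]
    }
    let experimentId: String
    let status: String
    let variants: [VariantConfig]
    let primaryMetric: String
}

func decodeFlag(_ data: Data) throws -> ExperimentFlagConfig {
    let decoder = JSONDecoder()
    decoder.keyDecodingStrategy = .convertFromSnakeCase
    return try decoder.decode(ExperimentFlagConfig.self, from: data)
}
```

A sanity check worth running after decoding is that variant allocations sum to 1.0 before the flag is used for assignment.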

Client-Side Flag Evaluation

class ExperimentManager {
    private var flags: [String: ExperimentFlag] = [:]
    private var userVariant: [String: String] = [:]
    
    func evaluateFlag(experimentId: String) -> Variant? {
        // Unknown or not-yet-synced experiment: return nil so the caller
        // falls back to default behaviour
        guard let flag = flags[experimentId] else {
            return nil
        }
        
        // Check if user already assigned
        if let assignedVariant = userVariant[experimentId] {
            return flag.getVariant(assignedVariant)
        }
        
        // Assign variant based on user ID hash
        let variant = assignVariant(user: currentUser, flag: flag)
        userVariant[experimentId] = variant.id
        
        // Log exposure
        logExposure(experimentId: experimentId, variant: variant)
        
        return variant
    }
    
    private func assignVariant(user: User, flag: ExperimentFlag) -> Variant {
        // Hash user ID + experiment ID for consistent assignment
        let hash = hash("\(user.id):\(flag.id)")
        let bucket = Double(hash % 100) / 100.0  // uniform in 0.0..<1.0
        
        // Find variant based on allocation
        var cumulative: Double = 0
        for variant in flag.variants {
            cumulative += variant.allocation
            if bucket < cumulative {
                return variant
            }
        }
        
        // Allocations should sum to 1.0; fall back to the last variant to
        // guard against floating-point rounding
        return flag.variants.last!
    }
}
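One subtlety in `assignVariant`: Swift's built-in `Hasher` (and `hashValue`) is randomly seeded on every process launch, so it cannot back the `hash(_:)` call when assignments must stay consistent across sessions. A deterministic alternative is sketched below; FNV-1a and the 10,000-bucket granularity are assumptions, and any stable hash (murmur3, SHA-256) works equally well:

```swift
import Foundation

// Deterministic 64-bit FNV-1a hash over the key's UTF-8 bytes.
// Unlike Swift's Hasher, this gives the same value on every launch,
// so a user keeps the same variant for the life of the experiment.
func stableHash(_ key: String) -> UInt64 {
    var hash: UInt64 = 0xcbf29ce484222325        // FNV offset basis
    for byte in key.utf8 {
        hash ^= UInt64(byte)
        hash = hash &* 0x100000001b3             // FNV prime (wrapping multiply)
    }
    return hash
}

// Map "userId:experimentId" to a bucket in [0.0, 1.0) for allocation lookup.
func bucket(userId: String, experimentId: String) -> Double {
    Double(stableHash("\(userId):\(experimentId)") % 10_000) / 10_000.0
}
```

Including the experiment ID in the key decorrelates assignments across experiments, so a user bucketed into "treatment" in one test is not systematically bucketed into "treatment" in every other.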

Experiment Types

Whistl runs multiple experiment types:

A/B Test (Two Variants)

Variant        Allocation   Description
Control (A)    50%          Current intervention message
Treatment (B)  50%          New message with urgency framing

Multivariate Test (Multiple Factors)

Factor           Variants
Message Length   Short, Medium, Long
Tone             Supportive, Direct, Neutral
CTA Button       "Call Sponsor", "Breathe", "I'm OK"

Tests all combinations: 3 × 3 × 3 = 27 variants
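The full set of combinations can be enumerated mechanically rather than defined by hand. A sketch (factor values mirror the table above; the struct and its identifier scheme are illustrative):

```swift
// Enumerate every factor combination for the multivariate test above.
let lengths = ["short", "medium", "long"]
let tones = ["supportive", "direct", "neutral"]
let ctas = ["Call Sponsor", "Breathe", "I'm OK"]

struct MultivariateVariant {
    let length: String
    let tone: String
    let cta: String
    // Stable identifier derived from the factor values
    var id: String { "\(length)_\(tone)_\(cta)" }
}

// Cartesian product: 3 x 3 x 3 = 27 variants
let combinations = lengths.flatMap { length in
    tones.flatMap { tone in
        ctas.map { cta in
            MultivariateVariant(length: length, tone: tone, cta: cta)
        }
    }
}
```

In practice the combination count is the main cost of multivariate tests: 27 variants need roughly 27 times the sample of a two-arm test to reach the same per-variant power.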

Sequential Test

Traffic allocation adapts as results accumulate, shifting users toward better-performing variants:

class SequentialTest {
    func allocateTraffic(variants: [Variant]) -> [Double] {
        // Thompson Sampling: draw one sample from each variant's Beta
        // posterior, then weight traffic by the draws
        var allocations: [Double] = []
        
        for variant in variants {
            // Beta(successes + 1, failures + 1) posterior (uniform prior)
            let sample = betaSample(
                alpha: variant.successes + 1,
                beta: variant.failures + 1
            )
            allocations.append(sample)
        }
        
        // Normalize draws to allocation probabilities
        let sum = allocations.reduce(0, +)
        return allocations.map { $0 / sum }
    }
}
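The Beta posterior sampling used above has no standard-library implementation, but it needs no external dependency either. Assuming the posterior parameters are integers (successes + 1, failures + 1 under a uniform prior), Gamma(k, 1) reduces to a sum of k Exponential(1) draws, and Beta(a, b) = X / (X + Y) for independent X ~ Gamma(a), Y ~ Gamma(b). A sketch:

```swift
import Foundation

// Gamma(k, 1) for integer shape k: sum of k Exponential(1) draws,
// each obtained by inverse transform, -log(U) for U ~ Uniform(0, 1).
func gammaSample(shape k: Int) -> Double {
    var sum = 0.0
    for _ in 0..<k {
        sum += -log(Double.random(in: Double.leastNonzeroMagnitude..<1.0))
    }
    return sum
}

// Beta(a, b) via the ratio-of-Gammas construction.
func betaSample(alpha: Int, beta: Int) -> Double {
    let x = gammaSample(shape: alpha)
    let y = gammaSample(shape: beta)
    return x / (x + y)
}
```

For large or non-integer shapes a Marsaglia-Tsang gamma sampler is the usual choice; the sum-of-exponentials form is shown here only because it is short and exact for integer shapes.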

Statistical Analysis

Experiments use rigorous statistical methods:

Sample Size Calculation

class SampleSizeCalculator {
    func calculate(
        baselineRate: Double,
        minimumDetectableEffect: Double,
        power: Double = 0.8,
        significance: Double = 0.05
    ) -> Int {
        let p1 = baselineRate
        let p2 = baselineRate * (1 + minimumDetectableEffect)
        
        let zAlpha = zScore(1 - significance / 2)  // 1.96 for 95%
        let zBeta = zScore(power)  // 0.84 for 80% power
        
        let pooledP = (p1 + p2) / 2
        
        let numerator = pow(zAlpha * sqrt(2 * pooledP * (1 - pooledP)) + 
                           zBeta * sqrt(p1 * (1 - p1) + p2 * (1 - p2)), 2)
        let denominator = pow(p1 - p2, 2)
        
        // Per-group n; the factor of 2 inside the first sqrt already
        // accounts for the two-group comparison
        return Int(ceil(numerator / denominator))
    }
}

// Example: baseline 70% acceptance, detecting a 5% relative improvement
// (70% -> 73.5%) at 80% power needs roughly 2,600 users per variant

Significance Testing

class SignificanceTest {
    func test(
        controlSuccesses: Int,
        controlTotal: Int,
        treatmentSuccesses: Int,
        treatmentTotal: Int
    ) -> TestResult {
        let p1 = Double(controlSuccesses) / Double(controlTotal)
        let p2 = Double(treatmentSuccesses) / Double(treatmentTotal)
        
        let pooledP = Double(controlSuccesses + treatmentSuccesses) / 
                      Double(controlTotal + treatmentTotal)
        
        let se = sqrt(pooledP * (1 - pooledP) * 
                     (1.0 / Double(controlTotal) + 1.0 / Double(treatmentTotal)))
        
        let zScore = (p2 - p1) / se
        let pValue = 2 * (1 - normalCDF(abs(zScore)))
        
        return TestResult(
            zScore: zScore,
            pValue: pValue,
            significant: pValue < 0.05,
            confidenceInterval: calculateCI(p1, p2, se)
        )
    }
}
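The `normalCDF` helper is not in the Swift standard library, but it follows directly from the C math library's `erf`, which Foundation re-exports on both Apple and Linux platforms:

```swift
import Foundation

// Standard normal CDF via the error function:
// Phi(z) = (1 + erf(z / sqrt(2))) / 2
func normalCDF(_ z: Double) -> Double {
    0.5 * (1.0 + erf(z / 2.0.squareRoot()))
}

// Two-sided p-value for a z statistic, as used in SignificanceTest.test
func twoSidedPValue(z: Double) -> Double {
    2.0 * (1.0 - normalCDF(abs(z)))
}
```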

Bayesian Analysis

class BayesianAnalysis {
    func analyze(control: VariantData, treatment: VariantData) -> BayesianResult {
        // Beta distributions for each variant
        let controlDist = BetaDistribution(
            alpha: control.successes + 1,
            beta: control.failures + 1
        )
        
        let treatmentDist = BetaDistribution(
            alpha: treatment.successes + 1,
            beta: treatment.failures + 1
        )
        
        // Monte Carlo simulation
        let samples = 10000
        var treatmentBetter = 0
        
        for _ in 0..<samples {
            let controlSample = controlDist.sample()
            let treatmentSample = treatmentDist.sample()
            if treatmentSample > controlSample {
                treatmentBetter += 1
            }
        }
        
        let probabilityTreatmentBetter = Double(treatmentBetter) / Double(samples)
        
        return BayesianResult(
            probabilityTreatmentBetter: probabilityTreatmentBetter,
            credibleInterval: calculateCredibleInterval(treatmentDist),
            expectedLoss: calculateExpectedLoss(controlDist, treatmentDist)
        )
    }
}

Experiment Metrics

Whistl tracks multiple metrics per experiment:

Primary Metrics

Metric                         Definition                             Target
Intervention Acceptance Rate   % who engage with intervention         >70%
Breathing Completion Rate      % who complete breathing exercise      >50%
Partner Contact Rate           % who contact accountability partner   >30%
Goal Engagement                Dream board views per week             >5

Guardrail Metrics

Metric               Threshold       Action
App Uninstall Rate   <2x control     Stop experiment if exceeded
Session Duration     >0.8x control   Warning if too low
Support Tickets      <3x control     Investigate if elevated
Crash Rate           No increase     Immediate stop if increased
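Guardrail checks of this kind are easy to automate against each analytics snapshot. A sketch, assuming per-variant metrics arrive as simple rate dictionaries (the metric keys and the `GuardrailAction` type are illustrative):

```swift
// Possible outcomes of a guardrail evaluation, in increasing severity.
enum GuardrailAction {
    case continueExperiment
    case warn(String)
    case stop(String)
}

// Evaluate treatment metrics against the guardrail thresholds above.
// Missing metrics are skipped rather than treated as violations.
func checkGuardrails(control: [String: Double],
                     treatment: [String: Double]) -> GuardrailAction {
    // Hard stop: uninstall rate at or above 2x control
    if let c = control["app_uninstall_rate"],
       let t = treatment["app_uninstall_rate"], t >= 2 * c {
        return .stop("Uninstall rate reached 2x control")
    }
    // Hard stop: any crash-rate increase
    if let c = control["crash_rate"],
       let t = treatment["crash_rate"], t > c {
        return .stop("Crash rate increased over control")
    }
    // Soft warning: session duration below 0.8x control
    if let c = control["session_duration"],
       let t = treatment["session_duration"], t < 0.8 * c {
        return .warn("Session duration below 0.8x control")
    }
    return .continueExperiment
}
```

Ordering matters: stop conditions are checked before warnings, so a variant that is both crashing and shortening sessions is halted rather than merely flagged.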

Experiment Lifecycle

Experiments follow a structured process:

Stages

  1. Hypothesis: Define expected outcome
  2. Design: Specify variants, metrics, sample size
  3. Review: Ethics and privacy review
  4. Launch: Deploy to small percentage
  5. Monitor: Watch guardrail metrics
  6. Analyze: Statistical analysis when complete
  7. Decide: Roll out, iterate, or abandon

Example Experiment

// Experiment: Notification Timing
{
  "hypothesis": "Sending notifications at personalized optimal times will increase engagement by 15%",
  
  "variants": [
    {"id": "control", "timing": "immediate", "allocation": 0.5},
    {"id": "treatment", "timing": "ml_optimized", "allocation": 0.5}
  ],
  
  "primary_metric": "notification_open_rate",
  "guardrail_metrics": ["app_uninstall_rate", "notification_disable_rate"],
  
  "sample_size": 10000,  // per variant
  "duration": "14 days",
  
  "success_criteria": {
    "min_improvement": 0.15,
    "significance_level": 0.05,
    "power": 0.8
  }
}
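The decide stage can apply the success criteria above mechanically; human review then sits on top of this baseline. A sketch (the `Decision` cases mirror stage 7 of the lifecycle; the struct and rate inputs are illustrative):

```swift
// Success criteria mirroring the JSON above.
struct SuccessCriteria {
    let minImprovement: Double      // required relative lift, e.g. 0.15
    let significanceLevel: Double   // e.g. 0.05
}

// Lifecycle stage 7 outcomes: roll out, iterate, or abandon.
enum Decision {
    case rollOut, iterate, abandon
}

func decide(controlRate: Double,
            treatmentRate: Double,
            pValue: Double,
            criteria: SuccessCriteria) -> Decision {
    let lift = (treatmentRate - controlRate) / controlRate
    if pValue < criteria.significanceLevel && lift >= criteria.minImprovement {
        return .rollOut        // significant and hit the target improvement
    }
    if pValue < criteria.significanceLevel && lift > 0 {
        return .iterate        // real but undersized win: refine and retest
    }
    return .abandon            // no significant positive effect
}
```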

Results Dashboard

Experiment results are visualized for the team:

Dashboard Metrics

  • Daily active users per variant
  • Cumulative metric values
  • Statistical significance over time
  • Segment breakdowns (iOS/Android, region)
  • Guardrail metric status

Ethical Considerations

Whistl follows ethical experimentation principles:

Ethics Guidelines

  • No harm: Experiments must not increase risk
  • Control is valid: Control must be current best practice
  • Privacy: No sensitive data used for targeting
  • Transparency: Users can opt out of experiments
  • Quick stopping: Harmful variants stopped immediately

Conclusion

Whistl's A/B testing infrastructure enables data-driven optimisation of intervention effectiveness. Through rigorous statistical analysis, ethical experimentation, and continuous learning, Whistl gets better at helping users every day.

Every experiment is an opportunity to improve outcomes—Whistl tests to learn, not just to win.

Experience Optimized Protection

Whistl continuously tests and improves intervention effectiveness. Download free and benefit from data-driven optimisation.

Download Whistl Free

Related: ML Model Updates | Privacy-Compliant Analytics | 8-Step Negotiation Engine