A/B Testing Engine: Optimising Intervention Effectiveness
Whistl never stops improving. The A/B Testing Engine continuously tests variations of interventions—different wording, timing, and formats—to discover what works best. Winning variations are promoted across the user base while maintaining personalisation. This is evidence-based behaviour change at scale.
Why A/B Testing Matters
Small changes can have big impacts on intervention effectiveness:
Research on Message Framing
- Gain vs. loss framing: Different users respond to different frames (Rothman et al., 2006)
- Message length: Shorter isn't always better (Keller & Lehmann, 2008)
- Tone matching: Coaching style must match user preference (Miller & Rollnick, 2012)
- Timing effects: When you intervene matters as much as what you say (Heron & Smyth, 2010)
The Limits of Expert Design
- Experts can't predict: What researchers think works ≠ what actually works
- Context matters: Effectiveness varies by user, time, situation
- Continuous improvement: What works today may not work tomorrow
- Scale reveals patterns: Large user base enables statistical confidence
What Whistl A/B Tests
The testing engine evaluates variations across multiple dimensions:
Message Wording
| Step | Variant A | Variant B | Variant C |
|---|---|---|---|
| Acknowledge | "I hear you. What's driving this?" | "I understand this is hard. Talk to me." | "You want through. I get it. Why?" |
| Reflect | "Last time you felt this way..." | "Remember what happened last time..." | "Think about the last time..." |
| Breathe | "Let's breathe together." | "Time to breathe. With me." | "Breathe. Just breathe." |
| Visualize | "Picture your goal..." | "Remember what you're saving for..." | "This is what you're working toward..." |
Timing Variations
- Immediate vs. delayed: Intervene right away or wait 30 seconds?
- Breathing duration: 2 minutes vs. 3 minutes vs. 90 seconds
- Step spacing: Show steps one at a time or all together?
- Follow-up timing: Check in after 1 hour or 2 hours?
Visual Format
- Text-only vs. image: Does showing goal images help?
- Progress bar style: Linear vs. circular vs. numeric
- Color schemes: Calming blues vs. urgent reds vs. neutral
- Animation: Animated breathing pacer vs. static
How A/B Testing Works
Whistl's testing engine follows rigorous methodology:
Test Assignment
```python
# User assignment to test variants
import hashlib

def assign_test_variant(user_id, test_id):
    # Use a stable hash for consistent assignment (Python's built-in
    # hash() is salted per process, so it would change between sessions)
    digest = hashlib.sha256(f"{user_id}:{test_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    # Assign to variant based on bucket
    if bucket < 33:
        return "A"  # Control group
    elif bucket < 66:
        return "B"  # Variant B
    else:
        return "C"  # Variant C

# A user always sees the same variant for a given test;
# different users are spread evenly across variants.
```
Sample Size Requirements
- Minimum per variant: 100 interventions
- Statistical power: 80% (standard for behavioural research)
- Confidence level: 95% (p < 0.05)
- Minimum effect size: 5% improvement to adopt
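The requirements above translate into a concrete sample-size calculation. Here is a minimal sketch of the standard normal approximation for a two-proportion test (the function name and the 64% baseline acceptance rate used in the example are illustrative, not taken from Whistl's internals):

```python
from math import ceil
from statistics import NormalDist

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-variant n for a two-sided two-proportion z-test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value, ~1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)           # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Detecting a 5-point lift from a 64% baseline acceptance rate
n = sample_size_two_proportions(0.64, 0.69)
```

Under these assumptions the answer is roughly 1,395 interventions per variant, which is why the 100-intervention figure is best read as a floor for even looking at a test, not as enough data to call a winner on a 5-point effect.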
Success Metrics
| Metric | Definition | Target |
|---|---|---|
| Intervention Acceptance | User engaged with intervention | >70% |
| Urge Pass Rate | Urge didn't return within 2 hours | >60% |
| Step Completion | User completed the full step | >80% |
| Helpfulness Rating | User rated 4+ stars | >4.0/5.0 |
| No Bypass | User didn't bypass after intervention | >75% |
Current Active Tests
Examples of tests running across the Whistl user base:
Test 1: Acknowledge Message Tone
Test ID: ACK_TONE_001
Status: Running (67% complete)

| Variant | Message | Sample | Acceptance Rate | Helpfulness |
|---|---|---|---|---|
| A (Control) | "I hear you. What's driving this?" | 1,234 | 89% | 4.2/5 |
| B | "I understand this is frustrating. Talk to me." | 1,198 | 91% | 4.4/5 |
| C | "This is hard. I'm here. What's happening?" | 1,211 | 87% | 4.3/5 |

Current leader: Variant B (+2% acceptance, +0.2 helpfulness)
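Whether Variant B's lead is already statistically significant can be checked with a standard two-proportion z-test. A minimal sketch using the Test 1 acceptance figures above (the helper name is illustrative):

```python
from math import sqrt, erfc

def two_proportion_z_test(p1, n1, p2, n2):
    """Two-sided z-test for the difference between two proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided p-value
    return z, p_value

# Variant A vs. Variant B acceptance rates from Test 1
z, p = two_proportion_z_test(0.89, 1234, 0.91, 1198)
```

With these figures the p-value comes out around 0.10, short of the p < 0.05 bar, which is consistent with the test still being listed as running rather than complete.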
Test 2: Breathing Duration
Test ID: BREATHE_DURATION_002
Status: Running (45% complete)

| Variant | Duration | Sample | Completion Rate | Urge Pass Rate |
|---|---|---|---|---|
| A (Control) | 2 minutes | 892 | 78% | 54% |
| B | 90 seconds | 867 | 84% | 49% |
| C | 3 minutes | 901 | 71% | 58% |

Current leader: Variant A (best balance of completion and effectiveness)
Test 3: Visualization Format
Test ID: VISUAL_FORMAT_003
Status: Complete (Variant B winner)

| Variant | Format | Sample | Motivation Increase |
|---|---|---|---|
| A (Control) | Text-only goal reminder | 2,100 | 45% |
| B | Goal image + progress bar | 2,087 | 61% ✓ WINNER |
| C | Goal image + time travel projection | 2,134 | 58% |

Result: Variant B promoted to all users
From Test to Production
When a test completes, winning variants are rolled out:
Rollout Process
- Statistical validation: Confirm significance and effect size
- Segment analysis: Check if winner varies by user type
- Gradual rollout: 10% → 50% → 100% over 1 week
- Monitoring: Watch for unexpected effects
- Documentation: Update intervention library
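The gradual-rollout step can reuse the same deterministic bucketing as test assignment. A minimal sketch (the function name is illustrative; the stage percentages come from the schedule above):

```python
import hashlib

def in_rollout(user_id, feature_id, rollout_percent):
    """True when this user falls inside the current rollout percentage."""
    digest = hashlib.sha256(f"{user_id}:{feature_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    # Buckets are stable, so widening 10% -> 50% -> 100% only ever adds
    # users; nobody who already received the winner loses it mid-rollout.
    return bucket < rollout_percent

# Example: check one user against each stage of the week-long schedule
stages = [in_rollout("user_42", "VISUAL_FORMAT_003", pct) for pct in (10, 50, 100)]
```

Because each user's bucket never changes, the enabled population at 10% is always a strict subset of the population at 50%, which keeps monitoring comparisons clean across stages.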
Personalisation Override
Even winning variants respect personal preferences:
- Coaching style: Tough Love users still get Tough Love variants
- Step order: Personal step ordering takes precedence
- Opt-out: Users can disable experimental features
Ethical Considerations
Whistl's A/B testing follows ethical guidelines:
Ethical Principles
- No harmful variants: All variants must be supportive, not punitive
- Crisis exclusion: Users in crisis don't receive test variants
- Transparency: Users can view active tests in settings
- Opt-out available: Users can choose control variants only
Data Privacy
- Anonymous aggregation: Test results are aggregated, not individual
- No external sharing: Test data stays within Whistl
- Minimal collection: Only data needed for testing is collected
Effectiveness Improvements
A/B testing has driven measurable improvements:
Cumulative Impact (12 Months)
| Metric | Baseline | Current | Improvement |
|---|---|---|---|
| Overall Intervention Acceptance | 64% | 73% | +9 pts |
| Urge Pass Rate | 52% | 61% | +9 pts |
| User Satisfaction | 4.1/5 | 4.6/5 | +0.5 |
| Step Completion Rate | 67% | 78% | +11 pts |
User Testimonials
"I noticed the messages changed over time. They got... better? More helpful. Didn't realise they were testing stuff." — Marcus, 28
"The breathing timer changed from 2 minutes to something else and back. Asked support—they said they were testing what works. Cool that they care about getting it right." — Sarah, 34
"Love that Whistl is always improving. It's not static software—it's getting smarter." — Jake, 31
Conclusion
Whistl's A/B Testing Engine ensures that every intervention is backed by evidence, not just intuition. By continuously testing and learning, Whistl gets more effective over time—for every user.
This is behaviour change science in action: hypothesis, test, learn, improve. Repeat forever.
Experience Evidence-Based Protection
Whistl's interventions are continuously tested and improved. Download free and benefit from ongoing optimisation.
Download Whistl Free

Related: 8-Step Negotiation Engine | Step Effectiveness Tracking | Intervention Type Predictor