Reinforcement Learning for Intervention Optimization: How AI Learns What Works for You
Not all interventions work equally well for all people. Reinforcement learning enables Whistl to discover which intervention strategies are most effective for each individual user, continuously adapting based on real-world outcomes.
The Challenge of Intervention Selection
When Whistl detects elevated impulse risk, it has multiple intervention options:
- Gentle reminder: "You seem stressed. Consider waiting before purchasing."
- Hard block: Temporarily blocking access to shopping apps
- Accountability notification: Alerting your chosen accountability partner
- Mindfulness prompt: Guiding a brief breathing exercise
- Goal reminder: Showing your savings goals and progress
- Cooling-off period: Enforcing a 24-hour wait before purchase
Which intervention should Whistl choose? The answer depends on:
- Your personality and preferences
- The specific context and risk level
- What has worked (or failed) in the past
- Time of day, location, and current emotional state
Reinforcement learning addresses this personalisation challenge by treating intervention selection as a decision-making problem that is learned from observed outcomes rather than hand-coded rules.
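In the code sketches throughout this post, the intervention options above can be represented as a simple list of identifiers. The names here are illustrative stand-ins, not Whistl's actual API:

```python
# Illustrative intervention identifiers (names are examples, not Whistl's API)
INTERVENTIONS = [
    "gentle_reminder",
    "hard_block",
    "accountability_notification",
    "mindfulness_prompt",
    "goal_reminder",
    "cooling_off_period",
]
```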
What Is Reinforcement Learning?
Reinforcement learning (RL) is a machine learning paradigm where an agent learns to make decisions by interacting with an environment and receiving rewards or penalties. Unlike supervised learning (which learns from labelled examples), RL learns from trial and error.
In Whistl's context:
- Agent: The intervention selection system
- Environment: The user and their behavioural context
- Action: Choosing which intervention to deliver
- Reward: Successful impulse resistance (or purchase avoidance)
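The four pieces above interact in a simple loop: observe the context, choose an action, receive a reward, learn. The sketch below is a minimal, self-contained illustration of that cycle; the agent, environment, and success probabilities are all toy stand-ins, not Whistl's production code.

```python
import random

class RandomAgent:
    """Toy agent: picks interventions uniformly and records outcomes."""
    def __init__(self, actions):
        self.actions = actions
        self.history = []

    def select(self, context):
        return random.choice(self.actions)

    def update(self, context, action, reward):
        self.history.append((context, action, reward))

class ToyEnvironment:
    """Toy user model: 'gentle' succeeds 80% of the time, 'block' only 40%."""
    def observe(self):
        return {"stress_level": random.random()}

    def respond(self, action):
        success_prob = 0.8 if action == "gentle" else 0.4
        return 1.0 if random.random() < success_prob else 0.0

def rl_interaction_loop(agent, environment, n_steps=100):
    """Generic RL loop: observe, act, receive reward, learn."""
    for _ in range(n_steps):
        context = environment.observe()        # user's behavioural context
        action = agent.select(context)         # choose an intervention
        reward = environment.respond(action)   # did the user resist?
        agent.update(context, action, reward)  # learn from the outcome
```

A smarter agent replaces `RandomAgent`'s uniform choice with a learned policy; the loop itself stays the same.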
Contextual Bandits for Intervention Selection
Whistl uses contextual bandits, a simplified form of RL well suited to intervention selection. Unlike full RL, contextual bandits don't need to model long-term consequences: each intervention is treated as an independent decision.
The Contextual Bandit Framework
import numpy as np
from sklearn.linear_model import LogisticRegression

class ContextualBandit:
    """
    Contextual bandit for intervention selection.
    Learns which intervention works best in each context.
    """

    def __init__(self, interventions, context_dim):
        self.interventions = interventions  # List of intervention types
        self.n_interventions = len(interventions)
        self.context_dim = context_dim
        # Separate model for each intervention
        self.models = {
            intervention: LogisticRegression()
            for intervention in interventions
        }
        # Track contexts and rewards for each intervention
        self.context_history = {
            intervention: [] for intervention in interventions
        }
        self.reward_history = {
            intervention: [] for intervention in interventions
        }

    def select_intervention(self, context):
        """
        Select intervention based on current context.
        Adds Thompson-Sampling-style exploration noise to balance
        exploration and exploitation.
        """
        intervention_scores = {}
        for intervention in self.interventions:
            model = self.models[intervention]
            if hasattr(model, 'coef_'):
                # Model has been trained: predicted success probability
                prob = model.predict_proba([context])[0][1]
                # Add exploration noise scaled by uncertainty
                uncertainty = self._estimate_uncertainty(intervention)
                prob += np.random.normal(0, uncertainty)
            else:
                # No training data yet - explore with random tie-breaking
                prob = 0.5 + np.random.normal(0, 0.1)
            intervention_scores[intervention] = prob
        # Select intervention with highest score
        return max(intervention_scores, key=intervention_scores.get)

    def update(self, context, intervention, reward):
        """
        Update model based on intervention outcome.
        Args:
            context: Feature vector describing the situation
            intervention: Which intervention was chosen
            reward: 1 if intervention succeeded, 0 if failed
        """
        self.context_history[intervention].append(context)
        self.reward_history[intervention].append(reward)
        # Retrain model for this intervention
        X = np.array(self.context_history[intervention])
        y = np.array(self.reward_history[intervention])
        # Need minimum samples, and both outcome classes, before fitting
        if len(X) > 10 and len(np.unique(y)) > 1:
            self.models[intervention].fit(X, y)

    def _estimate_uncertainty(self, intervention):
        """Estimate prediction uncertainty for exploration."""
        # Simplified: more uncertainty when this intervention has few samples
        n_samples = len(self.reward_history[intervention])
        return 1.0 / np.sqrt(n_samples + 1)
Thompson Sampling for Exploration-Exploitation
A key challenge in reinforcement learning is balancing:
- Exploitation: Using interventions known to work well
- Exploration: Trying less-tested interventions to discover new insights
Whistl uses Thompson Sampling, which naturally balances this trade-off by sampling from the posterior distribution of each intervention's success rate. Interventions with high uncertainty get explored more often.
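For a single fixed context with Bernoulli (success/failure) rewards, Thompson Sampling takes a particularly clean form: keep a Beta posterior over each intervention's success rate, draw one sample from each posterior, and pick the winner. The sketch below is a minimal non-contextual illustration of that idea, not Whistl's implementation.

```python
import random

class BetaThompsonSampler:
    """Non-contextual Thompson Sampling with Beta(1, 1) priors per arm."""
    def __init__(self, interventions):
        # alpha - 1 counts successes, beta - 1 counts failures
        self.alpha = {i: 1.0 for i in interventions}
        self.beta = {i: 1.0 for i in interventions}

    def select(self):
        # Sample a plausible success rate from each posterior, pick the best
        samples = {i: random.betavariate(self.alpha[i], self.beta[i])
                   for i in self.alpha}
        return max(samples, key=samples.get)

    def update(self, intervention, reward):
        # reward is 1 (success) or 0 (failure)
        self.alpha[intervention] += reward
        self.beta[intervention] += 1 - reward
```

Arms with few observations have wide posteriors, so their samples occasionally come out on top and they keep getting tried; once an intervention's record is well established, its posterior tightens and exploration naturally tapers off.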
Defining Rewards: What Counts as Success?
The reward signal drives everything the system learns, so defining it correctly is crucial:
Immediate Rewards
The simplest reward is whether the user made an impulse purchase within a defined window after the intervention:
def calculate_immediate_reward(user_id, intervention_time, window_hours=2):
    """
    Calculate reward based on purchase behaviour after intervention.
    Returns:
        1.0 if no impulse purchase within window
        0.0 if impulse purchase occurred
        -0.5 if user disabled interventions (negative feedback)
    """
    # Strong negative signal: the user switched interventions off
    if user_disabled_interventions(user_id, intervention_time, window_hours):
        return -0.5

    # Check for purchases in the window after intervention
    purchases = get_purchases_in_window(
        user_id,
        intervention_time,
        window_hours
    )

    # Filter for impulse purchases (high-risk categories, unusual amounts)
    impulse_purchases = [p for p in purchases if is_impulse_purchase(p)]

    if impulse_purchases:
        return 0.0  # Intervention failed
    return 1.0  # Intervention succeeded
Delayed Rewards
Some intervention effects aren't immediate. A user might resist an impulse today but feel depleted and overspend tomorrow. Whistl tracks delayed rewards over longer time horizons:
- Total spending over 24 hours post-intervention
- Number of impulse urges reported
- User mood and stress levels
- Engagement with the app (continued use indicates satisfaction)
Composite Reward Function
def calculate_composite_reward(user_id, intervention_id, timestamp):
    """
    Calculate composite reward from multiple signals.
    """
    # Immediate success (no purchase in 2 hours)
    immediate = calculate_immediate_reward(user_id, timestamp, window_hours=2)

    # Short-term spending (24 hours), normalised against a $500 cap
    spending_24h = get_total_spending(user_id, timestamp, hours=24)
    spending_reward = 1.0 - min(spending_24h / 500, 1.0)

    # User feedback (explicit rating if provided; neutral 0.5 otherwise)
    user_feedback = get_user_feedback(user_id, intervention_id)
    feedback_reward = user_feedback if user_feedback is not None else 0.5

    # Engagement (did the user continue using the app?)
    engagement_reward = get_engagement_metric(user_id, days=7)

    # Weighted combination
    return (
        0.4 * immediate +
        0.3 * spending_reward +
        0.2 * feedback_reward +
        0.1 * engagement_reward
    )
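Keeping the weighting step separate from the data-fetching helpers makes it easy to test on its own. A pure version of the combination, using the illustrative weights above, might look like this:

```python
def combine_reward_signals(immediate, spending, feedback, engagement,
                           weights=(0.4, 0.3, 0.2, 0.1)):
    """Weighted combination of reward signals, each roughly in [0, 1]."""
    signals = (immediate, spending, feedback, engagement)
    assert abs(sum(weights) - 1.0) < 1e-9, "weights should sum to 1"
    return sum(w * s for w, s in zip(weights, signals))
```

Because the weights sum to 1 and each signal is bounded, the composite reward stays on a comparable scale across users and contexts.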
Personalisation Through Reinforcement Learning
The power of RL becomes apparent as the system learns individual differences:
User A: Responds to Gentle Reminders
Sarah finds hard blocks frustrating and counterproductive. Her RL profile shows:
- Gentle reminder: 87% success rate
- Goal reminder: 82% success rate
- Mindfulness prompt: 76% success rate
- Hard block: 34% success rate (often circumvented)
Whistl learns to favour supportive interventions for Sarah.
User B: Needs Strong Boundaries
Marcus knows he can't trust himself in high-risk situations. His profile:
- Hard block: 91% success rate
- Cooling-off period: 88% success rate
- Accountability notification: 85% success rate
- Gentle reminder: 45% success rate (easily ignored)
Whistl learns that Marcus benefits from firmer boundaries.
Handling Contextual Variation
The same user might respond differently to interventions depending on context. Whistl's contextual bandit learns these nuances:
class ContextualInterventionSelector:
    """
    Learn intervention effectiveness across different contexts.
    """

    def __init__(self):
        # Separate bandit for each context cluster
        self.context_clusters = ['low_stress', 'moderate_stress', 'high_stress']
        self.bandits = {
            cluster: ContextualBandit(
                interventions=INTERVENTIONS,
                context_dim=CONTEXT_DIM
            )
            for cluster in self.context_clusters
        }

    def select(self, user_context):
        """Select intervention based on contextual cluster."""
        cluster = self._classify_context(user_context)
        return self.bandits[cluster].select_intervention(user_context)

    def update(self, user_context, intervention, reward):
        """Update the appropriate contextual bandit."""
        cluster = self._classify_context(user_context)
        self.bandits[cluster].update(user_context, intervention, reward)

    def _classify_context(self, context):
        """Classify context into a stress-level cluster."""
        stress_level = context['stress_level']
        if stress_level < 0.3:
            return 'low_stress'
        elif stress_level < 0.7:
            return 'moderate_stress'
        return 'high_stress'

# Example learned patterns:
# - LOW_STRESS contexts: gentle reminders work best (85% success)
# - MODERATE_STRESS: goal reminders are most effective (78% success)
# - HIGH_STRESS: hard blocks necessary (gentle reminders drop to 40%)
Continuous Learning and Adaptation
User preferences and circumstances change over time. Whistl's RL system continuously adapts:
Recency Weighting
Recent outcomes matter more than distant history. Whistl uses exponentially weighted rewards:
def calculate_weighted_reward(history, decay_factor=0.95):
    """
    Calculate reward with exponential decay for older observations.
    Args:
        history: List of (context, reward) tuples ordered by time
        decay_factor: How quickly to discount old data
                      (0.95 = 5% extra discount per step into the past)
    """
    weighted_sum = 0.0
    total_weight = 0.0
    # Iterate newest-first so the most recent reward gets weight 1
    for i, (context, reward) in enumerate(reversed(history)):
        weight = decay_factor ** i
        weighted_sum += reward * weight
        total_weight += weight
    return weighted_sum / total_weight if total_weight > 0 else 0.0
Detecting Preference Shifts
If an intervention's success rate suddenly drops, Whistl increases exploration to discover new effective strategies:
- Monitor rolling success rates for each intervention
- Detect significant drops using statistical tests
- Temporarily increase exploration when changes detected
- Allow user to explicitly rate interventions
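One simple way to implement the drop detection above is a two-proportion z-test comparing a recent window of outcomes against the longer-run baseline. The sketch below shows the statistical check only; the window sizes and threshold are illustrative, not Whistl's exact monitoring code.

```python
import math

def detect_success_drop(baseline_outcomes, recent_outcomes, z_threshold=1.96):
    """
    Two-proportion z-test: has the recent success rate dropped significantly
    below the baseline? Outcomes are lists of 0/1 rewards.
    Returns True when the drop is significant at roughly the 5% level.
    """
    n1, n2 = len(baseline_outcomes), len(recent_outcomes)
    if n1 == 0 or n2 == 0:
        return False
    p1 = sum(baseline_outcomes) / n1
    p2 = sum(recent_outcomes) / n2
    # Pooled proportion under the null hypothesis of no change
    p = (sum(baseline_outcomes) + sum(recent_outcomes)) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    if se == 0:
        return False
    z = (p1 - p2) / se  # positive z means the recent rate is lower
    return z > z_threshold
```

When a drop is flagged, the bandit's exploration rate can be temporarily increased so alternative interventions get re-tested against the user's new preferences.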
Ethical Considerations in Behavioural RL
Using RL to influence behaviour raises important ethical questions:
Respecting Autonomy
Whistl's RL optimises for user-stated goals, not app engagement or revenue. Users define their own spending limits and financial goals; the RL system learns to support those goals, not override them.
Transparency
Users can see which interventions have worked best for them:
# Example user-facing intervention effectiveness report
intervention_report = {
"gentle_reminder": {
"times_used": 47,
"success_rate": 0.87,
"your_rating": 4.2,
"recommendation": "Highly effective for you"
},
"hard_block": {
"times_used": 12,
"success_rate": 0.34,
"your_rating": 2.1,
"recommendation": "Consider alternatives"
},
"mindfulness_prompt": {
"times_used": 23,
"success_rate": 0.76,
"your_rating": 4.5,
"recommendation": "Effective and well-received"
}
}
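Recommendation labels like those in the report could be produced by a simple rule over the tracked statistics. The thresholds below are illustrative, chosen to reproduce the example report, not Whistl's actual cut-offs:

```python
def recommendation_label(success_rate, user_rating):
    """Map effectiveness stats to a user-facing recommendation.
    Thresholds are illustrative examples only."""
    if success_rate >= 0.85:
        return "Highly effective for you"
    if success_rate >= 0.70 and user_rating >= 4.0:
        return "Effective and well-received"
    if success_rate < 0.50 and user_rating < 3.0:
        return "Consider alternatives"
    return "Worth monitoring"
```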
Opt-Out and Control
Users can:
- Disable specific intervention types
- Set maximum intervention frequency
- Pause all interventions temporarily
- Reset learned preferences and start fresh
Performance Results
After deploying RL-based intervention selection, Whistl observed significant improvements:
| Metric | Before RL | After RL | Relative Improvement |
|---|---|---|---|
| Intervention Success Rate | 62% | 79% | +27% |
| User Satisfaction | 3.4/5 | 4.3/5 | +26% |
| Intervention Acceptance | 58% | 81% | +40% |
| 30-Day Retention | 67% | 82% | +22% |
"What impressed me most was how Whistl learned. The first few weeks, interventions felt hit-or-miss. But after a month, it was uncanny—it knew exactly which approach would work. Gentle reminders when I was mildly tempted, stronger blocks when I was really struggling. It felt like having a coach who actually understood me."
The Future of RL in Behavioural Support
Whistl's research team is exploring advanced RL techniques:
- Multi-agent RL: Coordinating interventions across accountability partners
- Offline RL: Learning from historical data without online exploration
- Meta-RL: Faster personalisation for new users by learning from similar users
- Constrained RL: Ensuring interventions respect hard ethical boundaries
Getting Started with Whistl
Experience interventions that learn and adapt to you. Whistl's reinforcement learning system discovers the most effective support strategies for your unique psychology and circumstances.
Personalised Intervention Support
Join thousands of Australians using Whistl's adaptive AI to receive interventions that learn what works best for them personally.
Crisis Support Resources
If you're experiencing severe financial distress or gambling-related harm, professional support is available:
- Gambling Help: 1800 858 858 (24/7, free and confidential)
- Lifeline: 13 11 14 (24/7 crisis support)
- Beyond Blue: 1300 22 4636 (mental health support)
- Financial Counselling Australia: 1800 007 007