
Reinforcement Learning for Intervention Optimization: How AI Learns What Works for You

Not all interventions work equally well for all people. Reinforcement learning enables Whistl to discover which intervention strategies are most effective for each individual user, continuously adapting based on real-world outcomes.

The Challenge of Intervention Selection

When Whistl detects elevated impulse risk, it has multiple intervention options, ranging from gentle reminders and goal prompts to mindfulness exercises and hard blocks.

Which intervention should Whistl choose? The answer depends on the individual user, their current context (such as stress level), and how they have responded to past interventions.

Reinforcement learning solves this personalisation challenge by treating intervention selection as a sequential decision-making problem.

What Is Reinforcement Learning?

Reinforcement learning (RL) is a machine learning paradigm where an agent learns to make decisions by interacting with an environment and receiving rewards or penalties. Unlike supervised learning (which learns from labelled examples), RL learns from trial and error.

In Whistl's context, the agent is the intervention selector, the environment is the user's purchasing behaviour, the actions are the available interventions, and the reward reflects whether the intervention helped the user avoid an impulse purchase.

Contextual Bandits for Intervention Selection

Whistl uses contextual bandits, a simplified form of RL well suited to intervention selection. Unlike full RL, contextual bandits don't model long-term consequences: each intervention is treated as an independent decision given the current context.

The Contextual Bandit Framework

import numpy as np
from sklearn.linear_model import LogisticRegression

class ContextualBandit:
    """
    Contextual bandit for intervention selection.
    Learns which intervention works best in each context.
    """
    def __init__(self, interventions, context_dim):
        self.interventions = interventions  # List of intervention types
        self.n_interventions = len(interventions)
        self.context_dim = context_dim
        
        # Separate model for each intervention
        self.models = {
            intervention: LogisticRegression() 
            for intervention in interventions
        }
        
        # Track contexts and rewards for each intervention
        self.context_history = {
            intervention: [] for intervention in interventions
        }
        self.reward_history = {
            intervention: [] for intervention in interventions
        }
    
    def select_intervention(self, context):
        """
        Select intervention based on current context.
        Adds Gaussian exploration noise as a lightweight
        approximation of Thompson Sampling.
        """
        intervention_scores = {}
        
        for intervention in self.interventions:
            model = self.models[intervention]
            
            if hasattr(model, 'coef_'):
                # Model has been trained: get predicted success probability
                prob = model.predict_proba([context])[0][1]
                
                # Add exploration noise scaled by uncertainty
                uncertainty = self._estimate_uncertainty(intervention)
                prob += np.random.normal(0, uncertainty)
            else:
                # No trained model yet - score as a coin flip to encourage exploration
                prob = 0.5
            
            intervention_scores[intervention] = prob
        
        # Select intervention with highest (noisy) score
        best_intervention = max(intervention_scores, key=intervention_scores.get)
        return best_intervention
    
    def update(self, context, intervention, reward):
        """
        Update model based on intervention outcome.
        
        Args:
            context: Feature vector describing the situation
            intervention: Which intervention was chosen
            reward: 1 if intervention succeeded, 0 if failed
        """
        self.context_history[intervention].append(context)
        self.reward_history[intervention].append(reward)
        
        # Retrain model for this intervention
        X = np.array(self.context_history[intervention])
        y = np.array(self.reward_history[intervention])
        
        # Need minimum samples and both outcome classes to fit
        if len(X) > 10 and len(set(y)) > 1:
            self.models[intervention].fit(X, y)
    
    def _estimate_uncertainty(self, intervention):
        """Estimate prediction uncertainty for exploration."""
        # Simplified: uncertainty shrinks as this intervention accumulates data
        n_samples = len(self.reward_history[intervention])
        return 1.0 / np.sqrt(n_samples + 1)
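To illustrate the select → observe → update loop end to end, here is a self-contained simulation. The success rates and the ε-greedy running-mean learner below are illustrative stand-ins (not the class above, and not Whistl's production values):

```python
import random

def simulate(n_rounds=2000, epsilon=0.1, seed=42):
    rng = random.Random(seed)
    # Hypothetical per-intervention success rates for one user
    true_rates = {'gentle_reminder': 0.8, 'hard_block': 0.4, 'mindfulness_prompt': 0.7}
    counts = {k: 0 for k in true_rates}
    means = {k: 0.0 for k in true_rates}

    for _ in range(n_rounds):
        # Explore with probability epsilon, otherwise exploit the best estimate
        if rng.random() < epsilon:
            choice = rng.choice(list(true_rates))
        else:
            choice = max(means, key=means.get)
        # Observe a Bernoulli reward and update the running mean
        reward = 1.0 if rng.random() < true_rates[choice] else 0.0
        counts[choice] += 1
        means[choice] += (reward - means[choice]) / counts[choice]

    return counts, means

counts, means = simulate()
# The intervention with the highest true rate ends up selected most often
```

After a few hundred rounds the learner concentrates on the intervention that actually works for this simulated user, while the ε fraction of random picks keeps the estimates for the others from going stale.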

Thompson Sampling for Exploration-Exploitation

A key challenge in reinforcement learning is balancing exploration (trying interventions to learn how well they work) against exploitation (using the intervention currently believed to be most effective).

Whistl uses Thompson Sampling, which naturally balances this trade-off by sampling from the posterior distribution of each intervention's success rate. Interventions with high uncertainty get explored more often.
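The idea can be sketched with Bernoulli rewards and Beta posteriors; the success/failure counts below are illustrative:

```python
import numpy as np

def thompson_select(stats, rng):
    """Pick the intervention whose sampled success rate is highest.

    stats maps each intervention to (successes, failures). Sampling from
    Beta(successes + 1, failures + 1) means sparsely tried interventions
    have wide posteriors and still win some draws (exploration), while
    well-understood winners dominate most draws (exploitation).
    """
    samples = {
        name: rng.beta(s + 1, f + 1)
        for name, (s, f) in stats.items()
    }
    return max(samples, key=samples.get)

rng = np.random.default_rng(0)
stats = {
    'gentle_reminder': (40, 10),   # well-tried, roughly 80% success
    'hard_block': (2, 3),          # barely tried, very uncertain
}
picks = [thompson_select(stats, rng) for _ in range(1000)]
# 'gentle_reminder' wins most draws, but 'hard_block' is still tried occasionally
```

No explicit exploration parameter is needed: the width of each posterior does the work, which is why uncertain interventions get explored more often.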

Defining Rewards: What Counts as Success?

The reward signal drives all learning in reinforcement learning. Defining it correctly is crucial:

Immediate Rewards

The simplest reward is whether the user made an impulse purchase within a defined window after the intervention:

def calculate_immediate_reward(user_id, intervention_time, window_hours=2):
    """
    Calculate reward based on purchase behaviour after intervention.
    
    Returns:
        1.0 if no impulse purchase within window
        0.0 if impulse purchase occurred
        -0.5 if user disabled interventions (negative feedback)
    """
    # Negative feedback: the user switched interventions off afterwards
    # (relies on a helper analogous to the others shown here)
    if interventions_disabled_after(user_id, intervention_time):
        return -0.5
    
    # Check for purchases in the window after intervention
    purchases = get_purchases_in_window(
        user_id, 
        intervention_time, 
        window_hours
    )
    
    # Filter for impulse purchases (high-risk categories, unusual amounts)
    impulse_purchases = [
        p for p in purchases 
        if is_impulse_purchase(p)
    ]
    
    if len(impulse_purchases) > 0:
        return 0.0  # Intervention failed
    else:
        return 1.0  # Intervention succeeded

Delayed Rewards

Some intervention effects aren't immediate. A user might resist an impulse today but feel depleted and overspend tomorrow. Whistl therefore tracks delayed rewards over longer time horizons and folds them into a composite reward.

Composite Reward Function

def calculate_composite_reward(user_id, intervention_id, timestamp):
    """
    Calculate composite reward from multiple signals.
    """
    # Immediate success (no purchase in 2 hours)
    immediate = calculate_immediate_reward(user_id, timestamp, window_hours=2)
    
    # Short-term spending (24 hours)
    spending_24h = get_total_spending(user_id, timestamp, hours=24)
    spending_reward = 1.0 - min(spending_24h / 500, 1.0)  # Normalise against a $500 cap
    
    # User feedback (explicit rating if provided; neutral default otherwise)
    user_feedback = get_user_feedback(user_id, intervention_id)
    feedback_reward = user_feedback if user_feedback is not None else 0.5
    
    # Engagement (did user continue using app?)
    engagement = get_engagement_metric(user_id, days=7)
    engagement_reward = engagement
    
    # Weighted combination
    reward = (
        0.4 * immediate +
        0.3 * spending_reward +
        0.2 * feedback_reward +
        0.1 * engagement_reward
    )
    
    return reward
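As a sanity check on the weighting, here is a worked example with hypothetical signal values:

```python
# Hypothetical signals for one intervention outcome
immediate = 1.0          # no impulse purchase within 2 hours
spending_reward = 0.8    # $100 spent in 24h -> 1.0 - 100/500
feedback_reward = 0.5    # no explicit rating, so neutral default
engagement_reward = 0.7  # user kept using the app during the week

# Same weighted combination as the composite reward function
reward = (
    0.4 * immediate +
    0.3 * spending_reward +
    0.2 * feedback_reward +
    0.1 * engagement_reward
)
# 0.40 + 0.24 + 0.10 + 0.07 = 0.81
```

The immediate outcome dominates, but a user who succeeds in the moment yet overspends later or stops engaging still sees their reward pulled down.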

Personalisation Through Reinforcement Learning

The power of RL becomes apparent as the system learns individual differences:

User A: Responds to Gentle Reminders

Sarah finds hard blocks frustrating and counterproductive. Her RL profile shows high success rates and positive ratings for gentle reminders and mindfulness prompts, but poor outcomes and negative feedback whenever hard blocks are used.

Whistl learns to favour supportive interventions for Sarah.

User B: Needs Strong Boundaries

Marcus knows he can't trust himself in high-risk situations. His profile shows the opposite pattern: gentle reminders are easily dismissed, while hard blocks reliably prevent impulse purchases.

Whistl learns that Marcus benefits from firmer boundaries.

Handling Contextual Variation

The same user might respond differently to interventions depending on context. Whistl's contextual bandit learns these nuances:

# Intervention set drawn from those discussed above; CONTEXT_DIM is illustrative
INTERVENTIONS = ['gentle_reminder', 'goal_reminder', 'mindfulness_prompt', 'hard_block']
CONTEXT_DIM = 16

class ContextualInterventionSelector:
    """
    Learn intervention effectiveness across different contexts.
    """
    def __init__(self):
        # Separate bandit for each stress-level cluster
        self.context_clusters = ['low_stress', 'moderate_stress', 'high_stress']
        self.bandits = {
            cluster: ContextualBandit(
                interventions=INTERVENTIONS,
                context_dim=CONTEXT_DIM
            )
            for cluster in self.context_clusters
        }
    
    def select(self, user_context):
        """Select intervention based on contextual cluster."""
        cluster = self._classify_context(user_context)
        bandit = self.bandits[cluster]
        return bandit.select_intervention(user_context)
    
    def update(self, user_context, intervention, reward):
        """Update the appropriate contextual bandit."""
        cluster = self._classify_context(user_context)
        bandit = self.bandits[cluster]
        bandit.update(user_context, intervention, reward)
    
    def _classify_context(self, context):
        """Classify context into stress level cluster."""
        stress_level = context['stress_level']
        if stress_level < 0.3:
            return 'low_stress'
        elif stress_level < 0.7:
            return 'moderate_stress'
        else:
            return 'high_stress'

# Example learned patterns:
# - In LOW_STRESS contexts: gentle reminders work best (85% success)
# - In MODERATE_STRESS: goal reminders are most effective (78% success)
# - In HIGH_STRESS: hard blocks necessary (gentle reminders drop to 40%)

Continuous Learning and Adaptation

User preferences and circumstances change over time. Whistl's RL system continuously adapts:

Recency Weighting

Recent outcomes matter more than distant history. Whistl uses exponentially weighted rewards:

def calculate_weighted_reward(history, decay_factor=0.95):
    """
    Calculate reward with exponential decay for older observations.
    
    Args:
        history: List of (context, reward) tuples ordered by time
        decay_factor: How quickly to discount old data (0.95 = 5% decay per step)
    """
    weighted_sum = 0
    total_weight = 0
    
    for i, (context, reward) in enumerate(reversed(history)):
        weight = decay_factor ** i
        weighted_sum += reward * weight
        total_weight += weight
    
    return weighted_sum / total_weight if total_weight > 0 else 0
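A quick demonstration of the effect on a synthetic reward history: early failures followed by a recent streak of successes pull the weighted estimate above the plain average.

```python
rewards = [0.0, 0.0, 1.0, 1.0, 1.0]  # ordered oldest -> newest
decay = 0.95

# Weight each reward by decay**age, where the newest observation has age 0
weights = [decay ** age for age in range(len(rewards))]
weighted = sum(r * w for r, w in zip(reversed(rewards), weights))
total = sum(weights)

weighted_mean = weighted / total          # ~0.63: recent successes count more
plain_mean = sum(rewards) / len(rewards)  # 0.60: every observation equal
```

With a decay factor of 0.95 the difference is modest; a smaller factor makes the system forget old behaviour faster, at the cost of noisier estimates.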

Detecting Preference Shifts

If an intervention's success rate suddenly drops, Whistl increases exploration to discover new effective strategies:
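One simple way to sketch this (window sizes, thresholds, and exploration rates here are illustrative, not Whistl's production values): compare an intervention's recent success rate with its long-run rate, and widen exploration when they diverge.

```python
def detect_preference_shift(rewards, window=20, drop_threshold=0.2):
    """Return True if the recent success rate has fallen well below
    the long-run success rate for this intervention."""
    if len(rewards) < 2 * window:
        return False  # not enough history to compare
    recent = sum(rewards[-window:]) / window
    overall = sum(rewards) / len(rewards)
    return (overall - recent) > drop_threshold

def exploration_rate(rewards, base=0.05, boosted=0.3):
    """Temporarily explore more aggressively after a detected shift."""
    return boosted if detect_preference_shift(rewards) else base

# An intervention that worked for a long time, then stopped working
history = [1] * 80 + [0] * 20
# detect_preference_shift(history) flags the drop; exploration widens
```

Once the refreshed estimates stabilise, the exploration rate falls back to its base level and the system settles on whatever now works best.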

Ethical Considerations in Behavioural RL

Using RL to influence behaviour raises important ethical questions:

Respecting Autonomy

Whistl's RL optimises for user-stated goals, not app engagement or revenue. Users define their own spending limits and financial goals; the RL system learns to support those goals, not override them.

Transparency

Users can see which interventions have worked best for them:

# Example user-facing intervention effectiveness report
intervention_report = {
    "gentle_reminder": {
        "times_used": 47,
        "success_rate": 0.87,
        "your_rating": 4.2,
        "recommendation": "Highly effective for you"
    },
    "hard_block": {
        "times_used": 12,
        "success_rate": 0.34,
        "your_rating": 2.1,
        "recommendation": "Consider alternatives"
    },
    "mindfulness_prompt": {
        "times_used": 23,
        "success_rate": 0.76,
        "your_rating": 4.5,
        "recommendation": "Effective and well-received"
    }
}

Opt-Out and Control

Users can adjust how often interventions appear, disable specific intervention types, or switch interventions off entirely at any time.

Performance Results

After deploying RL-based intervention selection, Whistl observed significant improvements:

Metric                        Before RL    After RL    Improvement
Intervention Success Rate     62%          79%         +27%
User Satisfaction             3.4/5        4.3/5       +26%
Intervention Acceptance       58%          81%         +40%
30-Day Retention              67%          82%         +22%

"What impressed me most was how Whistl learned. The first few weeks, interventions felt hit-or-miss. But after a month, it was uncanny—it knew exactly which approach would work. Gentle reminders when I was mildly tempted, stronger blocks when I was really struggling. It felt like having a coach who actually understood me."
— David L., Whistl user since 2025

The Future of RL in Behavioural Support

Whistl's research team is exploring more advanced RL techniques, including moving beyond contextual bandits toward full sequential RL that accounts for the longer-term effects of interventions.

Getting Started with Whistl

Experience interventions that learn and adapt to you. Whistl's reinforcement learning system discovers the most effective support strategies for your unique psychology and circumstances.

Personalised Intervention Support

Join thousands of Australians using Whistl's adaptive AI to receive interventions that learn what works best for them personally.

Crisis Support Resources

If you're experiencing severe financial distress or gambling-related harm, professional support is available.
