Ensemble Methods for Risk Prediction: Why Multiple Models Beat Single Models
Just as diverse groups make better decisions than individuals, combining multiple machine learning models produces more accurate and robust predictions. Discover how Whistl uses ensemble methods—Random Forests, Gradient Boosting, and model stacking—to deliver reliable impulse risk predictions.
The Wisdom of Crowds in Machine Learning
In 1906, statistician Francis Galton observed a remarkable phenomenon at a country fair. Visitors were asked to guess the weight of an ox. Individually, guesses varied wildly. But the average of all guesses was 1,197 pounds—remarkably close to the actual weight of 1,198 pounds.
This "wisdom of crowds" effect applies to machine learning. A single model might be brilliant in some situations and blind in others. But combine multiple models, and their individual errors tend to cancel out while their correct predictions reinforce each other.
At Whistl, ensemble methods are fundamental to our risk prediction system. No single algorithm captures all the complexity of human financial behaviour—but together, multiple models achieve remarkable accuracy.
Why Ensembles Work
Ensemble methods reduce two types of error:
- Bias: Systematic errors from oversimplified assumptions
- Variance: Errors from sensitivity to training data fluctuations
Different models have different bias-variance profiles. By combining them, ensembles achieve a better balance than any single model could.
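A toy simulation (not from the Whistl codebase) illustrates the variance side of this: averaging many noisy, roughly independent predictors shrinks the spread of the combined estimate by a factor of about the square root of the ensemble size.

```python
import numpy as np

rng = np.random.default_rng(42)
true_value = 0.7  # the quantity every model is trying to estimate

# 1,000 trials: each "model" returns the truth plus independent noise
single_model = true_value + rng.normal(0, 0.2, size=1000)
ensemble_of_25 = true_value + rng.normal(0, 0.2, size=(1000, 25)).mean(axis=1)

print(f"single model std:      {single_model.std():.3f}")    # ~0.2
print(f"25-model ensemble std: {ensemble_of_25.std():.3f}")  # ~0.2 / sqrt(25) = 0.04
```

Real models are correlated rather than independent, so the reduction is smaller in practice, which is exactly why ensemble methods work hard to inject diversity between their members.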
Random Forests: Diversity Through Bootstrap Aggregation
Random Forests are among the most popular ensemble methods. They combine many decision trees, each trained on a different subset of data and features.
How Random Forests Work
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class ImpulseRiskRandomForest:
    def __init__(self, n_trees=100, max_depth=15):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.trees = []
        self.n_features = None

    def fit(self, X, y):
        """
        Train Random Forest using bootstrap aggregation (bagging).
        Each tree sees a different random subset of data and features.
        """
        n_samples = len(X)
        self.n_features = X.shape[1]
        for i in range(self.n_trees):
            # Bootstrap sample (sample with replacement)
            indices = np.random.choice(n_samples, size=n_samples, replace=True)
            X_bootstrap = X[indices]
            y_bootstrap = y[indices]

            # Train decision tree with random feature subset
            tree = DecisionTreeClassifier(
                max_depth=self.max_depth,
                max_features='sqrt',  # Random feature subset at each split
                random_state=i
            )
            tree.fit(X_bootstrap, y_bootstrap)
            self.trees.append(tree)

    def predict_proba(self, X):
        """
        Aggregate predictions from all trees.
        Final probability = average of individual tree predictions.
        """
        predictions = np.zeros((len(X), 2))
        for tree in self.trees:
            predictions += tree.predict_proba(X)
        predictions /= self.n_trees
        return predictions

    def get_feature_importance(self):
        """Average feature importance across all trees."""
        importances = np.zeros(self.n_features)
        for tree in self.trees:
            importances += tree.feature_importances_
        return importances / self.n_trees
```
Why Random Forests Excel at Risk Prediction
- Handles non-linear relationships: Decision trees capture complex interactions between features
- Robust to outliers: Individual trees might be affected, but the ensemble averages them out
- Feature importance: Provides interpretable rankings of which signals matter most
- Low variance: Averaging many trees reduces overfitting
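In practice there is no need to implement bagging by hand: scikit-learn's built-in `RandomForestClassifier` does all of the above. A minimal sketch on synthetic data (the dataset here is an illustrative stand-in, not Whistl's actual feature set):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for transaction features, with an imbalanced positive class
X, y = make_classification(n_samples=2000, n_features=8, n_informative=5,
                           weights=[0.8, 0.2], random_state=42)

forest = RandomForestClassifier(n_estimators=100, max_depth=15,
                                max_features='sqrt', random_state=42)
forest.fit(X, y)

# Importances are averaged over all trees and sum to 1
for idx in np.argsort(forest.feature_importances_)[::-1][:3]:
    print(f"feature {idx}: importance {forest.feature_importances_[idx]:.3f}")
```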
Gradient Boosting: Learning from Mistakes
While Random Forests train trees independently, Gradient Boosting trains trees sequentially, with each tree learning to correct the mistakes of its predecessors.
The Boosting Process
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class GradientBoostingRiskPredictor:
    def __init__(self, n_estimators=100, learning_rate=0.1):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.trees = []
        self.initial_prediction = None

    def fit(self, X, y):
        """
        Train Gradient Boosting classifier.
        Each tree fits the residuals (errors) of previous trees.
        """
        # Initial prediction (log-odds of the positive class)
        self.initial_prediction = np.log(y.mean() / (1 - y.mean()))
        current_predictions = np.full(len(X), self.initial_prediction)

        for i in range(self.n_estimators):
            # Calculate residuals (negative gradient of the log loss)
            probabilities = 1 / (1 + np.exp(-current_predictions))
            residuals = y - probabilities

            # Fit a shallow tree to the residuals
            tree = DecisionTreeRegressor(max_depth=3)
            tree.fit(X, residuals)
            self.trees.append(tree)

            # Update predictions
            current_predictions += self.learning_rate * tree.predict(X)

    def predict_proba(self, X):
        """Aggregate predictions from all trees."""
        predictions = np.full(len(X), self.initial_prediction)
        for tree in self.trees:
            predictions += self.learning_rate * tree.predict(X)
        probabilities = 1 / (1 + np.exp(-predictions))
        return np.column_stack([1 - probabilities, probabilities])
```
XGBoost: Optimised Gradient Boosting
Whistl uses XGBoost (Extreme Gradient Boosting), an optimised implementation that includes:
- Regularisation: Prevents overfitting with L1 and L2 penalties
- Handling missing values: Learns optimal default directions
- Parallel processing: Faster training through parallelisation
- Tree pruning: Removes branches with negative gain
```python
import xgboost as xgb

# XGBoost configuration for Whistl risk prediction
xgb_params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'max_depth': 6,
    'learning_rate': 0.05,
    'n_estimators': 200,
    'subsample': 0.8,          # Row subsampling
    'colsample_bytree': 0.8,   # Column subsampling
    'reg_alpha': 0.1,          # L1 regularisation
    'reg_lambda': 1.0,         # L2 regularisation
    'scale_pos_weight': 3.0,   # Handle class imbalance
    'random_state': 42
}

model = xgb.XGBClassifier(**xgb_params)
model.fit(X_train, y_train)
```
Model Stacking: Learning to Combine Predictions
Model stacking (or stacked generalisation) takes ensembling further: instead of simply averaging predictions, a meta-learner learns the optimal way to combine base model predictions.
Stacking Architecture
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier

class StackedRiskPredictor:
    def __init__(self):
        # Level 0: Base models (diverse algorithms)
        self.base_models = {
            'random_forest': RandomForestClassifier(
                n_estimators=100, max_depth=15, random_state=42
            ),
            'gradient_boosting': GradientBoostingClassifier(
                n_estimators=100, learning_rate=0.1, random_state=42
            ),
            'neural_network': MLPClassifier(
                hidden_layer_sizes=(100, 50), random_state=42
            ),
            'logistic_regression': LogisticRegression(random_state=42)
        }
        # Level 1: Meta-learner (combines base model predictions)
        self.meta_learner = LogisticRegression()

    def fit(self, X, y):
        """
        Train stacked ensemble using cross-validation.
        Use out-of-fold predictions to train the meta-learner (prevents overfitting).
        """
        kf = KFold(n_splits=5, shuffle=True, random_state=42)

        # Generate out-of-fold predictions for meta-learner training
        oof_predictions = np.zeros((len(X), len(self.base_models)))
        for train_idx, val_idx in kf.split(X):
            X_train_fold, X_val_fold = X[train_idx], X[val_idx]
            y_train_fold = y[train_idx]
            for model_idx, model in enumerate(self.base_models.values()):
                # Train on this fold, predict on the held-out fold
                model.fit(X_train_fold, y_train_fold)
                oof_predictions[val_idx, model_idx] = \
                    model.predict_proba(X_val_fold)[:, 1]

        # Train meta-learner on out-of-fold predictions
        self.meta_learner.fit(oof_predictions, y)

        # Retrain all base models on the full data
        for model in self.base_models.values():
            model.fit(X, y)

    def predict_proba(self, X):
        """Generate predictions using the stacked ensemble."""
        # Collect base model predictions as meta-learner features
        base_predictions = np.column_stack([
            model.predict_proba(X)[:, 1]
            for model in self.base_models.values()
        ])
        # Meta-learner combines the base predictions
        return self.meta_learner.predict_proba(base_predictions)
```
Why Stacking Outperforms Simple Averaging
The meta-learner discovers which models are most reliable in different situations:
- Random Forest might be best for users with regular spending patterns
- Neural Networks might excel for users with complex, non-linear behaviour
- Gradient Boosting might be most accurate for high-risk predictions
The meta-learner learns these patterns and weights models accordingly.
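With a logistic-regression meta-learner, these learned weightings are directly inspectable: each coefficient is, roughly, the weight given to one base model's prediction. A minimal sketch on synthetic data (the two base models and the dataset here are illustrative placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

base_models = {
    'random_forest': RandomForestClassifier(n_estimators=50, random_state=42),
    'gradient_boosting': GradientBoostingClassifier(n_estimators=50, random_state=42),
}

# Out-of-fold probabilities become the meta-learner's input features
oof = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method='predict_proba')[:, 1]
    for m in base_models.values()
])

meta = LogisticRegression().fit(oof, y)
for name, coef in zip(base_models, meta.coef_[0]):
    print(f"{name}: weight {coef:.2f}")
```

A larger coefficient means the meta-learner has found that model's probabilities more informative on the out-of-fold data; a coefficient near zero means the model adds little beyond the others.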
Ensemble Performance in Whistl
Whistl has extensively benchmarked ensemble methods against individual models:
| Model | Precision | Recall | F1 Score | AUC-ROC |
|---|---|---|---|---|
| Logistic Regression | 71.2% | 65.8% | 68.4% | 0.74 |
| Single Decision Tree | 68.5% | 71.2% | 69.8% | 0.71 |
| Random Forest | 84.2% | 79.6% | 81.8% | 0.88 |
| XGBoost | 85.7% | 81.3% | 83.5% | 0.89 |
| Neural Network | 82.1% | 78.9% | 80.5% | 0.86 |
| Stacked Ensemble | 88.4% | 84.7% | 86.5% | 0.92 |
Handling Class Imbalance with Ensembles
Impulse purchases are relatively rare compared to routine transactions. This class imbalance challenges all machine learning models. Ensembles offer several solutions:
Balanced Random Forests
```python
from imblearn.ensemble import BalancedRandomForestClassifier

# Balanced Random Forest automatically handles class imbalance
brf = BalancedRandomForestClassifier(
    n_estimators=100,
    max_depth=15,
    sampling_strategy='auto',  # Balance classes in each bootstrap sample
    replacement=True,
    random_state=42
)
brf.fit(X_train, y_train)
```
Focal Loss for Hard Examples
Focal loss down-weights easy examples and focuses training on hard-to-classify cases:
```python
import numpy as np

def focal_loss(y_true, y_pred, gamma=2.0, alpha=0.25):
    """
    Focal loss for handling class imbalance.
    Down-weights easy examples, focuses on hard examples.
    """
    epsilon = 1e-7
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)

    # Cross-entropy per example
    ce = -y_true * np.log(y_pred) - (1 - y_true) * np.log(1 - y_pred)

    # Focal weight: small for confidently correct (easy) examples
    pt = y_true * y_pred + (1 - y_true) * (1 - y_pred)
    focal_weight = alpha * (1 - pt) ** gamma

    return np.mean(focal_weight * ce)
```
Ensemble Interpretability
While ensembles are more complex than single models, they remain interpretable:
Feature Importance Aggregation
```python
import numpy as np

def get_ensemble_feature_importance(ensemble, feature_names):
    """
    Aggregate feature importance across all models in the ensemble.
    Only tree-based models expose feature_importances_; others are skipped.
    """
    importance_dict = {}
    for name, model in ensemble.base_models.items():
        if hasattr(model, 'feature_importances_'):
            importance_dict[name] = dict(zip(
                feature_names, model.feature_importances_
            ))

    # Average each feature's importance across the models that report it
    avg_importance = {
        feature: np.mean([
            importance_dict[model].get(feature, 0)
            for model in importance_dict
        ])
        for feature in feature_names
    }

    # Sort by importance, highest first
    return sorted(avg_importance.items(), key=lambda x: x[1], reverse=True)

# Example output:
# [
#     ('stress_level', 0.18),
#     ('time_since_payday', 0.15),
#     ('location_risk', 0.12),
#     ('spending_velocity', 0.11),
#     ('category_momentum', 0.09),
#     ...
# ]
```
SHAP Values for Ensemble Predictions
SHAP values explain individual predictions from the tree-based models in an ensemble (SHAP's fast `TreeExplainer` supports Random Forests and XGBoost; non-tree models need a different explainer):

```python
import shap

# Create a SHAP explainer for a tree-based model, e.g. the XGBoost model above
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_sample)

# Visualise feature contributions
shap.summary_plot(shap_values, X_sample, feature_names=feature_names)
```
> "I was impressed by how consistent Whistl's predictions were. Even when my behaviour was erratic, the app seemed to 'get it'. Later I learned they use ensemble methods—multiple models voting on each prediction. That explains the reliability."
The Future of Ensemble Methods
Whistl continues to advance ensemble techniques:
- Deep ensembles: Combining multiple neural networks with different initialisations
- Snapshot ensembles: Multiple models from different points in training
- Neural Architecture Search: Automatically discovering optimal ensemble configurations
- Online ensembles: Continuously updating ensemble as new data arrives
Getting Started with Whistl
Experience the reliability of ensemble-powered risk prediction. Whistl's multi-model approach delivers consistent, accurate predictions that help you stay on track with your financial goals.
Robust AI-Powered Risk Prediction
Join thousands of Australians using Whistl's ensemble-based prediction system for reliable, accurate impulse risk detection.
Crisis Support Resources
If you're experiencing severe financial distress or gambling-related harm, professional support is available:
- Gambling Help: 1800 858 858 (24/7, free and confidential)
- Lifeline: 13 11 14 (24/7 crisis support)
- Beyond Blue: 1300 22 4636 (mental health support)
- Financial Counselling Australia: 1800 007 007