Problem Description
Solution
To integrate a scikit-learn machine learning model into a QuantConnect algorithm, you generally follow a workflow of Training (using historical data to fit the model) and Inference (using the trained model to predict future price movements).
Below is a complete, robust example of a strategy that uses a Random Forest Classifier to predict whether the SPY will close higher the next day based on recent momentum features.
Strategy Logic
- Goal: Predict if the next day's return will be positive.
- Features:
- 1-Day Return (Momentum)
- 5-Day Return (Weekly Trend)
- Volatility (Standard Deviation of returns)
- Model: RandomForestClassifier from scikit-learn.
- Training: Retrains the model at the start of every month using a rolling window of historical data.
- Execution: If the model predicts "Up", buy; otherwise, liquidate.
Python Implementation
# region imports
from AlgorithmImports import *
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
# endregion
class ScikitLearnIntegrationAlgorithm(QCAlgorithm):

    def initialize(self):
        # 1. Set Strategy Settings
        self.set_start_date(2018, 1, 1)
        self.set_end_date(2023, 1, 1)
        self.set_cash(100000)

        # 2. Add Assets
        self.symbol = self.add_equity("SPY", Resolution.DAILY).symbol

        # 3. Initialize Model and State Variables
        self.model = RandomForestClassifier(n_estimators=100, min_samples_leaf=5, random_state=42)
        self.lookback = 252  # Training data history length (1 year)
        self.is_trained = False

        # 4. Schedule Training
        # Train the model at the start of every month
        self.schedule.on(
            self.date_rules.month_start(self.symbol),
            self.time_rules.after_market_open(self.symbol, 30),
            self.train_model
        )

        # 5. Schedule Prediction/Trading
        # Trade every day 30 minutes before close
        self.schedule.on(
            self.date_rules.every_day(self.symbol),
            self.time_rules.before_market_close(self.symbol, 30),
            self.trade
        )

    def get_features_and_labels(self, history_df):
        """
        Feature Engineering function.
        Constructs input features (X) and target labels (y).
        """
        # Ensure we have data
        if history_df.empty:
            return None, None

        # Calculate Returns
        # Feature 1: Daily Return
        history_df['return_1d'] = history_df['close'].pct_change(1)
        # Feature 2: Weekly Return
        history_df['return_5d'] = history_df['close'].pct_change(5)
        # Feature 3: Volatility (10-day rolling std dev of daily returns)
        history_df['volatility'] = history_df['close'].pct_change(1).rolling(10).std()

        # Drop NaNs created by the rolling windows
        history_df.dropna(inplace=True)

        # Create Target: 1 if the next day's close is higher, else 0
        # We shift(-1) to align today's features with tomorrow's outcome for training
        history_df['target'] = (history_df['close'].shift(-1) > history_df['close']).astype(int)

        # Drop the last row because it has no 'tomorrow' to label
        training_df = history_df.iloc[:-1]

        # Define Feature Columns
        feature_cols = ['return_1d', 'return_5d', 'volatility']
        X = training_df[feature_cols]
        y = training_df['target']
        return X, y

    def train_model(self):
        """
        Fetches history, prepares data, and fits the Random Forest model.
        """
        # Request historical data (extra bars cover the rolling-window warm-up)
        history = self.history(self.symbol, self.lookback + 10, Resolution.DAILY)
        if history.empty:
            return

        # Prepare features
        X, y = self.get_features_and_labels(history)

        # Fit the model
        if X is not None and not X.empty:
            self.model.fit(X, y)
            self.is_trained = True
            self.debug(f"Model trained on {len(X)} samples at {self.time}")

    def trade(self):
        """
        Generates features for the current moment and predicts direction.
        """
        if not self.is_trained:
            return

        # Get recent history to generate the *current* feature vector
        # We need enough data to calculate the rolling indicators (e.g., 10 days for volatility)
        recent_history = self.history(self.symbol, 15, Resolution.DAILY)
        if recent_history.empty:
            return

        # Re-create features on recent data
        # Note: We don't need 'target' here, just the features
        recent_history['return_1d'] = recent_history['close'].pct_change(1)
        recent_history['return_5d'] = recent_history['close'].pct_change(5)
        recent_history['volatility'] = recent_history['close'].pct_change(1).rolling(10).std()

        # Select the last row (most recent data point)
        current_features = recent_history[['return_1d', 'return_5d', 'volatility']].iloc[[-1]]

        # Check if we have valid data (no NaNs)
        if current_features.isnull().values.any():
            return

        # Predict
        prediction = self.model.predict(current_features)[0]

        # Execution Logic
        if prediction == 1:
            # Model predicts Up -> Go Long
            self.set_holdings(self.symbol, 1.0)
            self.plot("Trade", "Prediction", 1)
        else:
            # Model predicts Down -> Liquidate (or go Short)
            self.liquidate(self.symbol)
            self.plot("Trade", "Prediction", 0)
Key Implementation Details
- Scheduled Training (train_model):
  - Machine learning models in trading need to adapt to changing market regimes. We use self.schedule to retrain the model monthly.
  - We fetch self.history inside the training function to get the most recent data available up to that point in the backtest.
- Feature Engineering (get_features_and_labels):
  - Look-ahead Bias Prevention: When creating the training set, we align today's features with tomorrow's return using .shift(-1).
  - Consistency: The logic used to create features for training must be identical to the logic used for live prediction (see the shared-helper sketch after this list).
- Handling Data Gaps:
  - Calculations like pct_change(5) or rolling(10) introduce NaN values at the beginning of the DataFrame.
  - We use .dropna() to ensure scikit-learn receives clean numerical data; otherwise it will throw an error.
- Inference (trade):
  - We fetch a small slice of history (15 days) just to calculate the indicators for the current day.
  - We use .iloc[[-1]] to grab the very last row as a DataFrame (keeping the shape (1, n_features)) to pass into model.predict().
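One way to enforce that consistency is to move the feature construction into a single helper that both train_model and trade call. The sketch below is illustrative rather than part of the algorithm above; compute_features and FEATURE_COLS are names introduced here for the example.

# region imports
from AlgorithmImports import *
import pandas as pd
# endregion

FEATURE_COLS = ['return_1d', 'return_5d', 'volatility']

def compute_features(history_df: pd.DataFrame) -> pd.DataFrame:
    """Build the same feature columns for both training and inference."""
    df = history_df.copy()
    df['return_1d'] = df['close'].pct_change(1)
    df['return_5d'] = df['close'].pct_change(5)
    df['volatility'] = df['close'].pct_change(1).rolling(10).std()
    return df

# train_model would then build its matrix as: X = compute_features(history).dropna()[FEATURE_COLS]
# trade would build its single row as:        x = compute_features(recent_history)[FEATURE_COLS].iloc[[-1]]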
Q&A: Machine Learning in QuantConnect
Q: Can I use deep learning libraries like TensorFlow or PyTorch?
A: Yes. QuantConnect supports tensorflow and torch. The integration logic is similar: initialize the network in initialize, train it using historical data, and run inference in your trading method. However, deep learning models are computationally heavier and may require using the Train() method to offload training to a separate node if the training takes longer than the 10-minute timeout limit for OnData or Scheduled Events.
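For reference, here is a minimal PyTorch sketch of the same pattern; the class name, network size, epoch count, and feature choices are illustrative assumptions, not a tuned model.

# region imports
from AlgorithmImports import *
import torch
import torch.nn as nn
import pandas as pd
# endregion

class TorchExampleAlgorithm(QCAlgorithm):

    def initialize(self):
        self.set_start_date(2020, 1, 1)
        self.set_cash(100000)
        self.symbol = self.add_equity("SPY", Resolution.DAILY).symbol
        # Small feed-forward net: 3 features -> logit for an "up day"
        self.net = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1))
        # Retrain monthly through Train() to get the extended training time budget
        self.train(self.date_rules.month_start(self.symbol),
                   self.time_rules.after_market_open(self.symbol, 30),
                   self.train_net)

    def train_net(self):
        history = self.history(self.symbol, 262, Resolution.DAILY)
        if history.empty:
            return
        closes = history['close']
        df = pd.DataFrame({'return_1d': closes.pct_change(1),
                           'return_5d': closes.pct_change(5),
                           'volatility': closes.pct_change(1).rolling(10).std()})
        df['target'] = (closes.shift(-1) > closes).astype(float)
        df = df.dropna().iloc[:-1]  # drop warm-up rows and the unlabeled last row
        x = torch.tensor(df[['return_1d', 'return_5d', 'volatility']].values, dtype=torch.float32)
        y = torch.tensor(df['target'].values, dtype=torch.float32).unsqueeze(1)
        optimizer = torch.optim.Adam(self.net.parameters(), lr=1e-3)
        loss_fn = nn.BCEWithLogitsLoss()
        for _ in range(200):  # a short loop is enough for ~250 samples
            optimizer.zero_grad()
            loss = loss_fn(self.net(x), y)
            loss.backward()
            optimizer.step()
        # Inference elsewhere would look like: prob_up = torch.sigmoid(self.net(latest_features)).item()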
Q: How do I save the model so I don't have to retrain every time I restart?
A: In a live trading environment, you can use the ObjectStore to save your trained model (pickled) and reload it. In backtesting, models are usually retrained on the fly because the "current time" changes rapidly.
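A minimal sketch of that pattern with pickle, assuming the ObjectStore bytes overloads (save_bytes / read_bytes / contains_key); the key name is an illustrative choice:

import pickle

MODEL_KEY = "spy-rf-model"  # illustrative ObjectStore key

# At the end of train_model(), persist the fitted model:
self.object_store.save_bytes(MODEL_KEY, pickle.dumps(self.model))

# In initialize() of a live deployment, restore it if present:
if self.object_store.contains_key(MODEL_KEY):
    self.model = pickle.loads(bytes(self.object_store.read_bytes(MODEL_KEY)))
    self.is_trained = True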
Q: Why did you use Resolution.DAILY?
A: Daily resolution is used here for clarity and speed. If you use Resolution.MINUTE, the history calls return significantly more data, and feature engineering requires careful handling of intraday timestamps. The logic remains the same, but the data volume increases.
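For instance, one way to keep the daily features while requesting minute bars is to resample the minute closes back to daily before computing returns (a sketch, assuming the default multi-index history format):

# Request roughly 5 trading days of minute bars (~390 per US equity session)
minute_history = self.history(self.symbol, 5 * 390, Resolution.MINUTE)
if not minute_history.empty:
    closes = minute_history['close']
    if isinstance(closes.index, pd.MultiIndex):
        closes = closes.droplevel(0)  # drop the symbol level, keep the timestamps
    daily_closes = closes.resample('1D').last().dropna()  # one close per trading day
    return_1d = daily_closes.pct_change(1)  # same daily features as before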
Q: What happens if the model takes too long to train?
A: QuantConnect has a timeout for OnData and Scheduled Events (typically 10 minutes). If your model training is heavy (e.g., large Random Forest or Neural Network), you should use the dedicated self.train(self.my_training_function) method, which runs in a separate thread/process and does not block the trading execution loop.
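In this algorithm, that would mean registering train_model through Train() in initialize instead of the training Schedule.On call, for example:

# In initialize(): give the monthly retrain the extended training time budget
self.train(
    self.date_rules.month_start(self.symbol),
    self.time_rules.after_market_open(self.symbol, 30),
    self.train_model
)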