
How to Use a Scikit-learn Model for Predictions in a QuantConnect Algorithm

QuantConnect | Published on: 2025/12/10
The following content was generated by EasyQuant.

Problem Description

How do I integrate a scikit-learn machine learning model into a trading algorithm?

Solution

To integrate a scikit-learn machine learning model into a QuantConnect algorithm, you generally follow a workflow of Training (using historical data to fit the model) and Inference (using the trained model to predict future price movements).

Below is a complete, robust example of a strategy that uses a Random Forest Classifier to predict whether the SPY will close higher the next day based on recent momentum features.

Strategy Logic

  1. Goal: Predict if the next day's return will be positive.
  2. Features:
    • 1-Day Return (Momentum)
    • 5-Day Return (Weekly Trend)
    • Volatility (Standard Deviation of returns)
  3. Model: RandomForestClassifier from scikit-learn.
  4. Training: Retrains the model at the start of every month using a rolling window of historical data.
  5. Execution: If the model predicts "Up", buy; otherwise, liquidate.

Python Implementation

# region imports
from AlgorithmImports import *
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
# endregion

class ScikitLearnIntegrationAlgorithm(QCAlgorithm):

    def initialize(self):
        # 1. Set Strategy Settings
        self.set_start_date(2018, 1, 1)
        self.set_end_date(2023, 1, 1)
        self.set_cash(100000)

        # 2. Add Assets
        self.symbol = self.add_equity("SPY", Resolution.DAILY).symbol

        # 3. Initialize Model and State Variables
        self.model = RandomForestClassifier(n_estimators=100, min_samples_leaf=5, random_state=42)
        self.lookback = 252  # Training data history length (1 year)
        self.is_trained = False
        
        # 4. Schedule Training
        # Train the model at the start of every month
        self.schedule.on(
            self.date_rules.month_start(self.symbol),
            self.time_rules.after_market_open(self.symbol, 30),
            self.train_model
        )

        # 5. Schedule Prediction/Trading
        # Trade every day 30 minutes before close
        self.schedule.on(
            self.date_rules.every_day(self.symbol),
            self.time_rules.before_market_close(self.symbol, 30),
            self.trade
        )

    def get_features_and_labels(self, history_df):
        """
        Feature Engineering function.
        Constructs input features (X) and target labels (y).
        """
        # Ensure we have data
        if history_df.empty:
            return None, None

        # Calculate Returns
        # Feature 1: Daily Return
        history_df['return_1d'] = history_df['close'].pct_change(1)
        # Feature 2: Weekly Return
        history_df['return_5d'] = history_df['close'].pct_change(5)
        # Feature 3: Volatility (10 day rolling std dev)
        history_df['volatility'] = history_df['close'].pct_change(1).rolling(10).std()

        # Drop NaNs created by rolling windows
        history_df.dropna(inplace=True)

        # Create Target: 1 if the next day's close is higher than today's, else 0
        # We shift(-1) to align today's features with tomorrow's outcome for training
        history_df['target'] = (history_df['close'].shift(-1) > history_df['close']).astype(int)

        # Drop the last row explicitly: it has no 'tomorrow', and the astype(int) cast
        # turns its missing comparison into 0, so dropna() would not remove it
        training_df = history_df.iloc[:-1]

        # Define Feature Columns
        feature_cols = ['return_1d', 'return_5d', 'volatility']
        
        X = training_df[feature_cols]
        y = training_df['target']

        return X, y

    def train_model(self):
        """
        Fetches history, prepares data, and fits the Random Forest model.
        """
        # Request historical data
        history = self.history(self.symbol, self.lookback + 10, Resolution.DAILY)
        
        if history.empty:
            return

        # Prepare features
        X, y = self.get_features_and_labels(history)

        # Fit the model
        if X is not None and not X.empty:
            self.model.fit(X, y)
            self.is_trained = True
            self.debug(f"Model trained on {len(X)} samples at {self.time}")

    def trade(self):
        """
        Generates features for the current moment and predicts direction.
        """
        if not self.is_trained:
            return

        # Get recent history to generate the *current* feature vector
        # We need enough data to calculate the rolling indicators (e.g., 10 days for volatility)
        recent_history = self.history(self.symbol, 15, Resolution.DAILY)
        
        if recent_history.empty:
            return

        # Re-create features on recent data
        # Note: We don't need 'target' here, just the features
        recent_history['return_1d'] = recent_history['close'].pct_change(1)
        recent_history['return_5d'] = recent_history['close'].pct_change(5)
        recent_history['volatility'] = recent_history['close'].pct_change(1).rolling(10).std()
        
        # Select the last row (most recent data point)
        current_features = recent_history[['return_1d', 'return_5d', 'volatility']].iloc[[-1]]

        # Check if we have valid data (no NaNs)
        if current_features.isnull().values.any():
            return

        # Predict
        prediction = self.model.predict(current_features)[0]
        
        # Execution Logic
        if prediction == 1:
            # Model predicts Up -> Go Long
            self.set_holdings(self.symbol, 1.0)
            self.plot("Trade", "Prediction", 1)
        else:
            # Model predicts Down -> Liquidate (or go Short)
            self.liquidate(self.symbol)
            self.plot("Trade", "Prediction", 0)

Key Implementation Details

  1. Scheduled Training (train_model):

    • Machine learning models in trading need to adapt to changing market regimes. We use self.schedule to retrain the model monthly.
    • We fetch self.history inside the training function to get the most recent data available up to that point in the backtest.
  2. Feature Engineering (get_features_and_labels):

    • Look-ahead Bias Prevention: When creating the training set, we align today's features with tomorrow's return using .shift(-1).
    • Consistency: The logic used to create features for training must be identical to the logic used for live prediction (see the sketch after this list).
  3. Handling Data Gaps:

    • Calculations like pct_change(5) or rolling(10) introduce NaN values at the beginning of the DataFrame.
    • We use .dropna() to ensure scikit-learn receives clean numerical data; otherwise, fit() and predict() will raise an error.
  4. Inference (trade):

    • We fetch a small slice of history (15 days) just to calculate the indicators for the current day.
    • We use .iloc[[-1]] to grab the very last row as a DataFrame (keeping the shape (1, n_features)) to pass into model.predict().
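
One way to enforce the consistency point above is to route both training and inference through a single feature-construction helper. The sketch below is a minimal refactor of the algorithm above; build_features is a hypothetical method name added for illustration, and the column logic is copied unchanged from get_features_and_labels.

    def build_features(self, df):
        """Adds the three feature columns used by both training and inference."""
        df = df.copy()
        df['return_1d'] = df['close'].pct_change(1)
        df['return_5d'] = df['close'].pct_change(5)
        df['volatility'] = df['close'].pct_change(1).rolling(10).std()
        return df

    # In get_features_and_labels():
    #     history_df = self.build_features(history_df)
    # In trade():
    #     recent_history = self.build_features(recent_history)
    #     current_features = recent_history[['return_1d', 'return_5d', 'volatility']].iloc[[-1]]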

Q&A: Machine Learning in QuantConnect

Q: Can I use deep learning libraries like TensorFlow or PyTorch?
A: Yes. QuantConnect supports tensorflow and torch. The integration logic is similar: initialize the network in initialize, train it using historical data, and run inference in your trading method. However, deep learning models are computationally heavier and may require using the Train() method to offload training to a separate node if the training takes longer than the 10-minute timeout limit for OnData or Scheduled Events.
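
As a minimal, hedged sketch of that wiring (the layer sizes, optimizer settings, and the method names fit_network and predict_up are illustrative assumptions, not a QuantConnect-prescribed pattern), a PyTorch version of the same Training/Inference split could look like this:

# region imports
from AlgorithmImports import *
import torch
import torch.nn as nn
# endregion

class TorchSketchAlgorithm(QCAlgorithm):

    def initialize(self):
        self.set_start_date(2022, 1, 1)
        self.set_cash(100000)
        self.symbol = self.add_equity("SPY", Resolution.DAILY).symbol
        # Tiny illustrative network: 3 features -> probability-like score
        self.net = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
        self.optimizer = torch.optim.Adam(self.net.parameters(), lr=1e-3)
        self.loss_fn = nn.BCELoss()

    def fit_network(self, X, y):
        """Trains on a (n_samples, 3) numpy feature matrix and 0/1 numpy labels."""
        inputs = torch.tensor(X, dtype=torch.float32)
        labels = torch.tensor(y, dtype=torch.float32).unsqueeze(1)
        for _ in range(200):  # small fixed epoch count for the sketch
            self.optimizer.zero_grad()
            loss = self.loss_fn(self.net(inputs), labels)
            loss.backward()
            self.optimizer.step()

    def predict_up(self, features):
        """Returns True if the network scores the next day as 'up'."""
        with torch.no_grad():
            score = self.net(torch.tensor(features, dtype=torch.float32))
        return score.item() > 0.5

Here fit_network would be hooked up exactly like train_model above (via a Scheduled Event or self.train), and predict_up called from the trading method with the same one-row feature slice, converted to numpy via .to_numpy().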

Q: How do I save the model so I don't have to retrain every time I restart?
A: In a live trading environment, you can use the ObjectStore to save your trained model (pickled) and reload it. In backtesting, models are usually retrained on the fly because the "current time" changes rapidly.
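
As a hedged sketch of that pattern (the key name "spy_rf_model" and the helper method names are arbitrary choices for this example; joblib is assumed to be available in the QuantConnect Python environment), saving and restoring the Random Forest above could look like:

import joblib

# Methods intended to live inside ScikitLearnIntegrationAlgorithm:

    def save_model(self):
        """Persist the fitted model to the ObjectStore (e.g. call at the end of train_model)."""
        file_path = self.object_store.get_file_path("spy_rf_model")
        joblib.dump(self.model, file_path)

    def load_model(self):
        """Restore a previously saved model, e.g. at the top of initialize in live mode."""
        if self.object_store.contains_key("spy_rf_model"):
            self.model = joblib.load(self.object_store.get_file_path("spy_rf_model"))
            self.is_trained = True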

Q: Why did you use Resolution.DAILY?
A: Daily resolution is used here for clarity and speed. If you use Resolution.MINUTE, the history calls return significantly more data, and feature engineering requires careful handling of intraday timestamps. The logic remains the same, but the data volume increases.
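
For example, one way to keep the daily feature logic above while requesting minute data is to resample the minute closes back to a daily series first. In the sketch below, droplevel and resample are plain pandas calls rather than QuantConnect-specific APIs, and the 30-day window is an arbitrary choice:

        # Inside a scheduled method of the algorithm above:
        minute_history = self.history(self.symbol, timedelta(days=30), Resolution.MINUTE)
        if not minute_history.empty:
            closes = minute_history['close'].droplevel(0)          # drop the symbol level of the index
            daily_closes = closes.resample('1D').last().dropna()   # last minute bar of each day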

Q: What happens if the model takes too long to train?
A: QuantConnect has a timeout for OnData and Scheduled Events (typically 10 minutes). If your model training is heavy (e.g., large Random Forest or Neural Network), you should use the dedicated self.train(self.my_training_function) method, which runs in a separate thread/process and does not block the trading execution loop.
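
As a sketch, the scheduled training hook in initialize above could be swapped for self.train, which accepts the same date and time rules and also has an immediate, one-off form:

        # In initialize(): run train_model via the Train() helper instead of schedule.on
        self.train(
            self.date_rules.month_start(self.symbol),
            self.time_rules.after_market_open(self.symbol, 30),
            self.train_model
        )

        # One-off training pass, e.g. as a warm-up at the end of initialize()
        self.train(self.train_model)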