From Hype to Reality: Building a Hybrid Transformer-MVO Pipeline

A Five-Way Decomposition of What Actually Drives Risk-Adjusted Returns in an AI Portfolio

The quantitative finance space is currently flooded with claims of deep learning models generating massive, effortless alpha. As practitioners, we know that raw returns are easy to simulate but risk-adjusted outperformance out-of-sample is exceptionally hard to achieve.

In this post, we build a complete, reproducible pipeline that replaces traditional moving-average momentum signals with a deep learning forecaster while keeping the risk controls of modern portfolio theory intact. We test this hybrid approach on a 25-asset cross-asset universe over a 2020–2026 walk-forward out-of-sample (OOS) period.

Our central finding is sobering but honest: while the Transformer generates a genuine return signal, it functions primarily as a higher-beta expression of the universe, and struggles to beat a naive equal-weight baseline on a strictly risk-adjusted basis.

Here is how we built it, and what the numbers actually show.


1. The Architecture: Separation of Concerns

A robust quant pipeline separates the return forecast (the alpha model) from the portfolio construction (the risk model). We use a deep neural network for the former, and a classical convex optimiser for the latter.

Pipeline Architecture
  1. Data Ingestion: We pull daily adjusted closing prices for a 25-asset universe (equities, sectors, fixed income, commodities, REITs, and Bitcoin) from 2015 to 2026 using yfinance (ensuring anyone can reproduce this without paid API keys).
  2. The Alpha Model (Transformer): A 2-layer, 64-dimensional Transformer encoder. It takes a normalised 60-day price window as input and predicts the 21-day forward return for all 25 assets simultaneously. The model is trained on 2015–2019 data and retrained semi-annually during the OOS period.
  3. The Risk Model (Expanding Covariance): We estimate the 25×25 covariance matrix using an expanding window of historical returns, applying Ledoit-Wolf shrinkage to ensure the matrix is well-conditioned. (Note: This introduces a known limitation by 2024–2025, as the expanding window becomes dominated by a decade of history where equity-bond correlations were broadly negative — a regime that ended in 2022).
  4. The Optimiser (scipy SLSQP): We use scipy.optimize.minimize with the SLSQP method to solve a constrained optimisation problem. The optimiser seeks to maximise the risk-adjusted return (Sharpe ratio) subject to a fully invested constraint (Σ w_i = 1) and long-only bounds with a 20% maximum position size (0 ≤ w_i ≤ 0.20).
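
To make step 4 concrete, here is a minimal sketch of a long-only, capped max-Sharpe problem solved with SLSQP. It is an illustration under stated assumptions rather than the repository's code: the function name `max_sharpe_weights`, the zero default risk-free rate, and the equal-weight starting point are ours, and `mu` and `cov` are assumed to be expressed over the same horizon.

```python
import numpy as np
from scipy.optimize import minimize


def max_sharpe_weights(mu, cov, rf=0.0, w_max=0.20):
    """Long-only, fully invested, position-capped max-Sharpe weights (sketch).

    mu  : expected returns, shape (n,)   -- assumed same horizon as cov
    cov : covariance matrix, shape (n, n)
    """
    mu = np.asarray(mu, dtype=float)
    cov = np.asarray(cov, dtype=float)
    n = len(mu)

    def neg_sharpe(w):
        # Minimising the negative Sharpe ratio maximises the Sharpe ratio
        vol = np.sqrt(w @ cov @ w)
        return -(w @ mu - rf) / vol

    result = minimize(
        neg_sharpe,
        x0=np.full(n, 1.0 / n),              # start from equal weight (assumption)
        method="SLSQP",
        bounds=[(0.0, w_max)] * n,           # 0 <= w_i <= w_max
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],  # sum(w) = 1
    )
    if not result.success:
        raise RuntimeError(f"SLSQP failed: {result.message}")
    return result.x
```

With identical forecasts the numerator is the same for every feasible portfolio, so (provided the forecast exceeds `rf`) the solver is left minimising volatility. That is the mechanism behind the flat-forecast strategy in the next section.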

2. Experimental Design: The Five-Way Comparison

To truly understand what the Transformer is doing, we cannot simply compare it to SPY. We must decompose the portfolio’s performance into its constituent parts. We test five strategies:

  1. Equal-Weight Baseline: 4% allocated to each of the 25 assets, rebalanced monthly. This isolates the raw diversification benefit of the universe.
  2. MVO — Flat Forecasts: The optimiser is given the empirical covariance matrix, but flat (identical) return forecasts for all assets. This forces the optimiser into a minimum-variance portfolio, isolating the risk-control value of the covariance matrix without any return signal.
  3. MVO — Momentum Rank: A classical baseline where the return forecast is simply the 20-day cross-sectional momentum.
  4. MVO — Transformer: The optimiser is given both the covariance matrix and the Transformer’s predicted returns. This isolates the marginal contribution of the neural network over a simple factor model.
  5. SPY Buy-and-Hold: The standard equity benchmark.

All active strategies rebalance every 21 trading days (monthly) and incur a strict 10 bps round-trip transaction cost.
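
How the 10 bps is charged is not spelled out here. One common convention, which we assume for this sketch, is to let weights drift with realised returns between rebalances and then charge the round-trip rate on one-way turnover (half the sum of absolute weight changes). The function names are illustrative.

```python
import numpy as np


def drifted_weights(w_prev, period_returns):
    """Weights after one holding period of price drift, before rebalancing."""
    grown = np.asarray(w_prev, dtype=float) * (1.0 + np.asarray(period_returns, dtype=float))
    return grown / grown.sum()


def rebalance_cost(w_target, w_prev, period_returns, round_trip_bps=10.0):
    """Return (one-way turnover, cost as a fraction of portfolio value).

    Assumption: cost = one-way turnover * round-trip rate, i.e. every unit
    sold is matched by a unit bought and the pair is charged once.
    """
    w_drift = drifted_weights(w_prev, period_returns)
    turnover = 0.5 * np.abs(np.asarray(w_target, dtype=float) - w_drift).sum()
    cost = turnover * round_trip_bps / 1e4
    return turnover, cost
```

Under this convention an equal-weight portfolio only trades away its drift each month, which is why its turnover is a small fraction of that of the signal-driven strategies.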


3. The Results: Returns vs. Risk

The walk-forward OOS period runs from January 2020 through February 2026, covering the COVID crash, the 2021 bull run, the 2022 bear market, and the subsequent recovery.

(Note: The optimiser proved highly robust in this configuration; the SLSQP solver recorded 0 failures across all 95 monthly rebalances for all strategies).

Full-Period Bar Chart
Strategy          | CAGR   | Ann. Volatility | Sharpe (rf=2.75%) | Max Drawdown | Calmar Ratio | Avg. Monthly Turnover*
MVO-Momentum      | 16.81% | 14.85%          | 0.95              | -29.27%      | 0.57         | ~15–20%
MVO-Transformer   | 16.34% | 16.28%          | 0.83              | -32.66%      | 0.50         | ~15–20%
SPY Buy-and-Hold  | 14.69% | 17.06%          | 0.70              | -33.72%      | 0.44         | 0%
Equal-Weight      | 12.76% | 9.63%           | 1.04              | -16.46%      | 0.78         | ~2–4% (drift)
MVO-Flat          | 2.30%  | 5.15%           | -0.09             | -16.35%      | 0.14         | 6.1%

*Turnover for active strategies is estimated; Transformer turnover is structurally similar to Momentum due to the model learning a noisy, momentum-like signal with similar autocorrelation.
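
The Sharpe and Calmar columns are numerically consistent with (CAGR - rf) / volatility and CAGR / |max drawdown| respectively. The sketch below adopts those conventions as an assumption, since the backtest engine's exact definitions are not shown in this post; `performance_summary` is an illustrative name.

```python
import numpy as np


def performance_summary(daily_returns, rf_annual=0.0275, periods_per_year=252):
    """Headline metrics from a series of simple periodic returns (sketch)."""
    r = np.asarray(daily_returns, dtype=float)
    equity = np.cumprod(1.0 + r)                      # growth of 1 unit
    years = len(r) / periods_per_year

    cagr = equity[-1] ** (1.0 / years) - 1.0
    vol = r.std(ddof=1) * np.sqrt(periods_per_year)   # annualised volatility
    sharpe = (cagr - rf_annual) / vol if vol > 0 else np.nan

    # Running peak includes the starting value of 1.0
    peak = np.maximum.accumulate(np.concatenate([[1.0], equity]))[1:]
    max_dd = (equity / peak - 1.0).min()
    calmar = cagr / abs(max_dd) if max_dd < 0 else np.nan

    return {"cagr": cagr, "vol": vol, "sharpe": sharpe,
            "max_drawdown": max_dd, "calmar": calmar}
```

As a spot check on the convention, (16.34% - 2.75%) / 16.28% gives 0.83 and 16.34% / 32.66% gives 0.50, the Transformer row's Sharpe and Calmar.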

The results reveal a clear hierarchy:

  1. The optimiser without a signal is defensive but unprofitable. MVO-Flat achieves a remarkably low volatility (5.15%) but generates only 2.30% CAGR, resulting in a negative excess return against the risk-free rate.
  2. Equal-Weight wins on risk-adjusted terms. The naive Equal-Weight baseline achieves a superior Sharpe ratio (1.04) and a starkly superior Calmar ratio (0.78 vs 0.50) with roughly half the drawdown (-16.5%) of the active strategies.
  3. The Transformer is beaten by simple momentum. This is the most important finding in this post. A neural network trained on five years of data, retrained semi-annually, with a 60-day lookback window, is strictly worse on return, Sharpe, drawdown, and Calmar than a one-line 20-day momentum factor.

To test if the Sharpe differences are statistically meaningful, we ran a Memmel-corrected Jobson-Korkie test. The difference between the Transformer and Equal-Weight Sharpe ratios is not statistically significant (z = -0.47, p = 0.64). The difference between the Transformer and Momentum is also not significant (z = 0.88, p = 0.38). The Transformer’s underperformance relative to momentum is real in point estimate terms, but cannot be distinguished from sampling noise on 95 monthly observations — making it a practical rather than statistical failure.
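
For readers who want to reproduce this kind of test, the sketch below implements the Jobson-Korkie z-statistic with Memmel's corrected asymptotic variance, V = (1/T)[2(1 - ρ) + ½(SR_a² + SR_b² - 2·SR_a·SR_b·ρ²)]. It assumes the two inputs are time-aligned per-period excess returns and uses a two-sided normal p-value; the function names are ours, not the repository's.

```python
import math
import numpy as np


def normal_two_sided_p(z):
    """Two-sided p-value under a standard normal null."""
    cdf = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    return 2.0 * (1.0 - cdf)


def jobson_korkie_memmel(excess_a, excess_b):
    """z-statistic and p-value for H0: Sharpe(a) == Sharpe(b)."""
    a = np.asarray(excess_a, dtype=float)
    b = np.asarray(excess_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("return series must be aligned and equally long")

    t = len(a)
    sr_a = a.mean() / a.std(ddof=1)          # per-period (not annualised) Sharpe
    sr_b = b.mean() / b.std(ddof=1)
    rho = np.corrcoef(a, b)[0, 1]

    # Memmel (2003) corrected variance of the Sharpe difference
    var = (2.0 * (1.0 - rho)
           + 0.5 * (sr_a**2 + sr_b**2 - 2.0 * sr_a * sr_b * rho**2)) / t
    z = (sr_a - sr_b) / math.sqrt(var)
    return z, normal_two_sided_p(z)
```

Under the two-sided normal convention, z = -0.47 and z = 0.88 map to p-values of 0.64 and 0.38, matching the figures quoted above.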


4. Sub-Period Analysis: Where the Model Wins and Loses

Looking at the full 6-year period masks how these strategies behave in different market regimes. Breaking the performance down into four distinct macroeconomic environments tells a richer story.

(Note: Sub-period CAGRs are chain-linked. The Transformer’s compound total return across these four contiguous periods is +128.6%, perfectly matching the full-period CAGR of 16.34% over 6.2 years. Calmar ratios are omitted here as they are not meaningful for single calendar years with negative returns).

Sub-Period Chart
RegimeStrategyCAGRMax Drawdown
Regime                                       | Strategy        | CAGR   | Max Drawdown
COVID Crash & Recovery (Jan 2020 – Dec 2020) | MVO-Transformer | +25.2% | -32.6%
                                             | MVO-Momentum    | +17.1% | -33.6%
                                             | Equal-Weight    | +14.8% | -29.0%
                                             | MVO-Flat        | +10.0% | -11.3%
                                             | SPY             | +17.3% | -33.7%
Bull Run (Jan 2021 – Dec 2021)               | MVO-Transformer | +27.0% | -7.2%
                                             | MVO-Momentum    | +23.9% | -6.5%
                                             | Equal-Weight    | +19.0% | -5.0%
                                             | MVO-Flat        | +5.6%  | -2.8%
                                             | SPY             | +30.9% | -5.1%
Bear Market (Jan 2022 – Dec 2022)            | MVO-Transformer | -15.3% | -23.5%
                                             | MVO-Momentum    | -8.2%  | -21.3%
                                             | Equal-Weight    | -10.6% | -19.4%
                                             | MVO-Flat        | -11.2% | -15.3%
                                             | SPY             | -18.8% | -24.5%
Recovery & Rally (Jan 2023 – Feb 2026)       | MVO-Transformer | +23.3% | -13.7%
                                             | MVO-Momentum    | +24.5% | -13.3%
                                             | Equal-Weight    | +19.7% | -11.6%
                                             | MVO-Flat        | +9.4%  | -6.4%
                                             | SPY             | +22.0% | -18.8%

(The Transformer’s full-period maximum drawdown of -32.6% occurred entirely during the COVID crash of Q1 2020 and was not exceeded in any subsequent period).

The 2022 Bear Market Anomaly

Notice the performance of MVO-Flat in 2022. By design, MVO-Flat seeks the minimum-variance portfolio. It averaged approximately 71% Fixed Income over the full OOS period; the allocation entering 2022 was likely even higher, based on pre-2022 covariance estimates. In a normal equity bear market, these assets act as a safe haven. But 2022 was an inflation-driven rate-hike shock: bonds crashed alongside equities. Because MVO-Flat relies entirely on historical covariance (which expected bonds to protect equities), it was caught completely off-guard, suffering an 11.2% loss and a -15.3% drawdown.

The Equal-Weight baseline actually outperformed MVO-Flat in 2022 (-10.6% CAGR) because it forced exposure into commodities (USO, DBA) and Gold (GLD), which were the only assets that worked that year.


5. Under the Hood: Portfolio Composition

Why does the Transformer take on so much more volatility? The answer lies in how it allocates capital compared to the baselines.

Weights by Class
  • MVO-Flat is dominated by Fixed Income (68.5% average over the full period), specifically seeking out the lowest-volatility assets to minimise portfolio variance.
  • Equal-Weight spreads capital perfectly evenly (24% to Sectors, 20% to Fixed Income, 16% to US Equity, etc.).
  • MVO-Transformer acts as a “risk-on” engine. Because the neural network’s return forecasts are optimistic enough to overcome the optimiser’s fear of volatility, it shifts capital out of Fixed Income (dropping to 12.7%) and heavily into US Sectors (26.1%), US Equities (17.6%), and notably, Bitcoin (11.6%).

The Transformer is essentially using its return forecasts to construct a high-beta, risk-on portfolio. When markets rally (2020, 2021, 2023–2026), it outperforms. When they crash (2022), it suffers.


6. Model Calibration: The Spread Problem

Why did the neural network fail to beat a simple 20-day momentum factor? The answer lies in the calibration of its predictions.

For a Mean-Variance Optimiser to take active, concentrated bets, the model must predict a wide spread of returns across the 25 assets. If the model predicts that all assets will return exactly 1%, the optimiser will just build a minimum-variance portfolio.

Spread Calibration Time Series

Our diagnostics show a severe and persistent calibration issue. Over the 95 monthly rebalances:

  • The realised cross-sectional standard deviation of returns averaged 4.24%.
  • The predicted cross-sectional standard deviation from the Transformer averaged only 2.08% (with a tight P5–P95 band of 1.06% to 3.87%).

The model is systematically underconfident by a factor of roughly 2 (4.24% realised versus 2.08% predicted), and this underconfidence persists across all market regimes. Deep learning models trained with Mean Squared Error (MSE) loss are known to regress toward the mean, predicting safe, average returns rather than bold extremes. Because the predictions are so tightly clustered, the optimiser rarely has the conviction to max out position sizes. The Transformer is effectively producing a noisy, compressed version of a momentum-like signal.
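
The diagnostic itself is simple to reproduce. The sketch below assumes the cross-sectional standard deviation is computed across assets at each rebalance and then averaged over rebalances; the post reports the averages but not the exact computation, so treat the function and its name as illustrative.

```python
import numpy as np


def spread_calibration(predicted, realised):
    """Compare predicted and realised cross-sectional dispersion.

    predicted, realised : arrays of shape (n_rebalances, n_assets) holding
    the forecast and the subsequently realised return for each asset.
    """
    predicted = np.asarray(predicted, dtype=float)
    realised = np.asarray(realised, dtype=float)

    # Standard deviation across assets (axis=1), one value per rebalance
    pred_spread = predicted.std(axis=1, ddof=1)
    real_spread = realised.std(axis=1, ddof=1)

    return {
        "mean_predicted_spread": pred_spread.mean(),
        "mean_realised_spread": real_spread.mean(),
        # > 1 means forecasts are too tightly clustered (underconfident)
        "underconfidence_ratio": real_spread.mean() / pred_spread.mean(),
        "predicted_p5_p95": tuple(np.percentile(pred_spread, [5, 95])),
    }
```

Plugging in the averages reported above, 4.24% / 2.08% gives a ratio of about 2.04.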


Conclusion: A Sober Reality

If we were trying to sell a product, we would point to the 16.3% CAGR, crop the chart to the 2023–2026 bull run, and declare victory.

But as quantitative researchers, the conclusion is different. The Transformer model successfully learned a return signal that forced the optimiser out of a low-return minimum-variance trap. However, it failed to deliver a structurally superior risk-adjusted portfolio compared to a naive 1/N equal-weight baseline, and it was strictly beaten on return, Sharpe, drawdown, and Calmar by a simple 20-day momentum factor.

The path forward isn’t necessarily a bigger neural network. It requires addressing the specific failures identified here:

  1. Fixing the mean-regression bias by replacing MSE with a pairwise ranking loss, forcing the model to explicitly separate winners from losers.
  2. Post-hoc spread scaling to expand the predicted cross-sectional return spread until it matches the realised cross-sectional dispersion (~4%), giving the optimiser the conviction it needs.
  3. Dynamic covariance modelling (e.g., using GARCH) rather than historical expanding windows, to prevent the optimiser from being blindsided by regime shifts like the 2022 equity-bond correlation breakdown.
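
The first two items can be sketched in a few lines. This is a NumPy illustration of the ideas, not code from the pipeline: `pairwise_logistic_loss` is one common form of pairwise ranking loss (others exist), shown as a forward computation only, and `rescale_spread` is a hypothetical helper that stretches one rebalance's forecasts to a chosen cross-sectional standard deviation. In training, the same loss expression would be written with differentiable tensor operations.

```python
import numpy as np


def pairwise_logistic_loss(pred, target):
    """Mean log(1 + exp(-(pred_i - pred_j))) over pairs where target_i > target_j."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)

    diff = pred[:, None] - pred[None, :]                 # pred_i - pred_j
    should_outrank = target[:, None] > target[None, :]   # i truly beat j
    if not should_outrank.any():
        return 0.0
    # logaddexp(0, -x) is a numerically stable log(1 + exp(-x))
    return float(np.logaddexp(0.0, -diff[should_outrank]).mean())


def rescale_spread(pred, target_spread):
    """Stretch forecasts around their mean to a target cross-sectional std."""
    pred = np.asarray(pred, dtype=float)
    centre = pred.mean()
    spread = pred.std(ddof=1)
    if spread == 0:
        return pred.copy()                               # nothing to stretch
    return centre + (pred - centre) * (target_spread / spread)
```

Rescaling preserves the ranking of assets, so it changes how aggressively the optimiser acts on the forecasts rather than which assets it prefers. The ranking loss attacks the ordering itself.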

(Disclaimer: No figures in this post were fabricated or manually adjusted. All results are direct outputs of the backtest engine).

Code for the full pipeline, including the PyTorch models and scipy optimisers, is available on GitHub: https://github.com/jkinlay/transformer_mvo_pipeline

Transformer Models for Alpha Generation: A Practical Guide

A Practical Guide to Attention Mechanisms in Quantitative Trading


Introduction

Quantitative researchers have always sought new methods to extract meaningful signals from noisy financial data. Over the past decade, the field has progressed from linear factor models through gradient-boosting ensembles to recurrent architectures such as LSTMs and GRUs. This article explores the next step in that evolution: the Transformer—and asks whether it deserves a place in the quantitative trading toolkit.

The Transformer architecture, introduced by Vaswani et al. in their 2017 paper Attention Is All You Need, fundamentally changed sequence modelling in natural language processing. Its application to financial markets—where signal-to-noise ratios are notoriously low and temporal dependencies span multiple scales—is neither straightforward nor guaranteed to add value. I’ll try to be honest about both the promise and the pitfalls.

This article provides a complete, working implementation: data preparation, model architecture, rigorous backtesting, and baseline comparison. All code is written in PyTorch and has been tested for correctness.


Why Transformers for Trading?

The Attention Mechanism Advantage

Traditional RNNs—including LSTMs and GRUs—suffer from vanishing gradients over long sequences, which limits their ability to exploit dependencies spanning hundreds of timesteps. The self-attention mechanism in Transformers addresses this through three structural properties:

Direct access to any timestep. Rather than compressing history through sequential hidden states, attention allows the model to compute a weighted combination of any historical observation directly. There is no information bottleneck.

Parallelisation. Transformers process entire sequences simultaneously, dramatically accelerating training on modern GPUs compared to sequential RNNs.

Multiple simultaneous pattern scales. Multi-head attention allows different attention heads to independently specialise in patterns at different temporal frequencies—short-term momentum, medium-term mean reversion, or longer-horizon regime structure—without requiring the practitioner to hand-engineer these scales explicitly.
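
To make the "direct access" point concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention over a 60-step window. It is illustrative only: the projection matrices are random rather than learned, and the names `self_attention`, `w_q`, `w_k`, and `w_v` are mine.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x : (seq_len, d_in) sequence of feature vectors
    Returns (output, weights) where weights[t, s] is how much timestep t
    draws on timestep s. Every timestep can attend to every other one.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
seq_len, d_in, d_k = 60, 10, 16
x = rng.normal(size=(seq_len, d_in))
w_q, w_k, w_v = (rng.normal(size=(d_in, d_k)) for _ in range(3))
output, weights = self_attention(x, w_q, w_k, w_v)
```

The full (seq_len, seq_len) weight matrix is computed in one matrix product, which is also where the parallelisation advantage comes from. Multi-head attention repeats this with several independent sets of projections.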

A Note on “Interpretability”

It is tempting to claim that attention weights provide insight into which historical periods the model considers relevant. This claim should be treated with caution. Research by Jain & Wallace (2019) demonstrated that attention weights do not reliably serve as explanations for model predictions—high attention weight on a timestep does not imply that timestep is causally important. Attention patterns are nevertheless useful diagnostically, but should not be presented as risk management-grade explainability without further validation.


Setting Up the Environment

import copy
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
import pandas as pd
import yfinance as yf
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Output:

Using device: cpu

Data Preparation

The foundation of any ML model is quality data. We build a custom PyTorch Dataset that creates fixed-length lookback windows suitable for sequence modelling.

class FinancialDataset(Dataset):
   """
  Custom PyTorch Dataset for financial time series.
  Creates sequences of OHLCV data with optional technical indicators.
  """

   def __init__(self, prices, sequence_length=60, horizon=1, features=None):
       self.sequence_length = sequence_length
       self.horizon = horizon

       self.data = prices[features].copy() if features else prices.copy()

       # Forward returns as prediction target
       self.target = prices['Close'].pct_change(horizon).shift(-horizon)

       # fillna(method='ffill') is deprecated since pandas 2.1; use .ffill()
       self.data = self.data.ffill().fillna(0)
       self.target = self.target.fillna(0)

       self.scaler = StandardScaler()
       self.scaled_data = self.scaler.fit_transform(self.data)

   def __len__(self):
       return len(self.data) - self.sequence_length - self.horizon

   def __getitem__(self, idx):
       x = self.scaled_data[idx:idx + self.sequence_length]
       y = self.target.iloc[idx + self.sequence_length]
       return torch.FloatTensor(x), torch.FloatTensor([y])
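
One caveat with the class above: `StandardScaler.fit_transform` is called on the full history, so the scaling statistics include rows that later land in the test split. A minimal sketch of the leakage-free alternative, assuming you know the last training row up front, is to fit the mean and standard deviation on training rows only and reuse them everywhere. The helper names are mine, and plain NumPy is used to keep the example self-contained; fitting `StandardScaler` on the training slice and calling `transform` on the rest is equivalent.

```python
import numpy as np


def fit_scaler(train_rows):
    """Column-wise mean and std estimated from training rows only."""
    train_rows = np.asarray(train_rows, dtype=float)
    mean = train_rows.mean(axis=0)
    std = train_rows.std(axis=0)            # population std, as StandardScaler uses
    std = np.where(std == 0, 1.0, std)      # avoid division by zero on constant columns
    return mean, std


def apply_scaler(rows, mean, std):
    """Standardise any rows (train or test) with pre-fitted statistics."""
    return (np.asarray(rows, dtype=float) - mean) / std


# Example: 100 rows, first 80 are the training window (hypothetical sizes)
rng = np.random.default_rng(0)
features = rng.normal(5.0, 2.0, size=(100, 3))
mean, std = fit_scaler(features[:80])
scaled = apply_scaler(features, mean, std)
```

The test rows are still standardised, but nothing about their distribution feeds back into the statistics the model was trained under.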

Feature Engineering

def calculate_rsi(prices, period=14):
   """Relative Strength Index."""
   delta = prices.diff()
   gain = delta.where(delta > 0, 0).rolling(window=period).mean()
   loss = (-delta.where(delta < 0, 0)).rolling(window=period).mean()
   rs = gain / loss
   return 100 - (100 / (1 + rs))


def prepare_data(ticker, start_date='2015-01-01', end_date='2024-12-31'):
   """Download and prepare financial data with technical indicators."""
   df = yf.download(ticker, start=start_date, end=end_date, progress=False)

   if isinstance(df.columns, pd.MultiIndex):
       df.columns = df.columns.get_level_values(0)

   df['Returns']      = df['Close'].pct_change()
   df['Volatility']   = df['Returns'].rolling(20).std()
   df['MA5']          = df['Close'].rolling(5).mean()
   df['MA20']         = df['Close'].rolling(20).mean()
   df['MA_ratio']     = df['MA5'] / df['MA20']
   df['RSI']          = calculate_rsi(df['Close'], 14)
   df['Volume_MA']    = df['Volume'].rolling(20).mean()
   df['Volume_ratio'] = df['Volume'] / df['Volume_MA']

   return df.dropna()

Constructing Train and Test Splits

This is a point where many tutorial implementations go wrong. Never use random shuffling to split a financial time series. Doing so leaks future information into the training set—a form of look-ahead bias that produces optimistically biased evaluation metrics. We split strictly on time.

data = prepare_data('SPY', '2015-01-01', '2024-12-31')
print(f"Data shape: {data.shape}")
print(f"Date range: {data.index[0]} to {data.index[-1]}")

feature_cols = [
   'Open', 'High', 'Low', 'Close', 'Volume',
   'Returns', 'Volatility', 'MA_ratio', 'RSI', 'Volume_ratio'
]

sequence_length = 60  # ~3 months of trading days
dataset = FinancialDataset(data, sequence_length=sequence_length, features=feature_cols)

# Temporal split: first 80% for training, final 20% for testing
# Do NOT use random_split on time series — it introduces look-ahead bias
n = len(dataset)
train_size = int(n * 0.8)
train_dataset = torch.utils.data.Subset(dataset, range(train_size))
test_dataset  = torch.utils.data.Subset(dataset, range(train_size, n))

batch_size   = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader  = DataLoader(test_dataset,  batch_size=batch_size, shuffle=False)

print(f"Training samples: {len(train_dataset)}")
print(f"Test samples:     {len(test_dataset)}")

Output:

Data shape: (2495, 13)
Date range: 2015-02-02 00:00:00 to 2024-12-30 00:00:00
Training samples: 1947
Test samples:     487

Note on overlapping labels. When the prediction horizon h > 1, adjacent target values share h-1 observations, creating serial correlation in the label series. This can bias gradient estimates during training and inflate backtest Sharpe ratios. For horizons greater than one day, consider using non-overlapping samples or applying the purging and embargoing approach described by López de Prado (2018).
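
A minimal sketch of both ideas, assuming sample i carries a label built from the return between observations i and i + h. The function names and the `embargo` parameter are illustrative, and this is a simplified take on purging rather than López de Prado's full procedure.

```python
import numpy as np


def non_overlapping_indices(n_samples, horizon):
    """Every horizon-th sample, so that no two label windows share a return."""
    return np.arange(0, n_samples, horizon)


def purged_train_indices(n_samples, test_start, test_end, horizon, embargo=0):
    """Training indices whose label windows do not overlap the test block.

    Test samples are test_start .. test_end - 1. A training sample i before
    the block is kept only if its label ends by test_start (i + horizon <=
    test_start). After the block, samples are dropped until the last test
    label has fully elapsed, plus an optional embargo of extra samples.
    """
    idx = np.arange(n_samples)
    before = idx[idx + horizon <= test_start]
    after = idx[idx >= test_end + horizon - 1 + embargo]
    return np.concatenate([before, after])
```

With a one-day horizon neither function removes anything (apart from the embargo, if set), which is consistent with the h - 1 overlap described above.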


Building the Transformer Model

Positional Encoding

Unlike RNNs, Transformers have no inherent notion of sequence order. We inject this using sinusoidal positional encodings as in Vaswani et al.:

class PositionalEncoding(nn.Module):
   def __init__(self, d_model, max_len=5000, dropout=0.1):
       super().__init__()
       self.dropout = nn.Dropout(p=dropout)

       pe       = torch.zeros(max_len, d_model)
       position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
       div_term = torch.exp(
           torch.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model)
      )
       pe[:, 0::2] = torch.sin(position * div_term)
       pe[:, 1::2] = torch.cos(position * div_term)
       self.register_buffer('pe', pe.unsqueeze(0))

   def forward(self, x):
       x = x + self.pe[:, :x.size(1), :]
       return self.dropout(x)

Core Model

We use a [CLS] token—borrowed from BERT—as an aggregation mechanism. Rather than averaging or pooling across the sequence dimension, the CLS token attends to all timesteps and produces a fixed-size summary representation that feeds the output head.

class TransformerTimeSeries(nn.Module):
   """
  Transformer encoder for financial time series prediction.
  Uses a learnable [CLS] token for sequence aggregation.
  """

   def __init__(
       self,
       input_dim,
       d_model=128,
       nhead=8,
       num_layers=3,
       dim_feedforward=512,
       dropout=0.1,
       horizon=1
  ):
       super().__init__()

       self.input_embedding = nn.Linear(input_dim, d_model)
       self.pos_encoder     = PositionalEncoding(d_model, dropout=dropout)

       encoder_layer = nn.TransformerEncoderLayer(
           d_model=d_model,
           nhead=nhead,
           dim_feedforward=dim_feedforward,
           dropout=dropout,
           batch_first=True
      )
       self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

       self.fc_out = nn.Sequential(
           nn.Linear(d_model, dim_feedforward),
           nn.ReLU(),
           nn.Dropout(dropout),
           nn.Linear(dim_feedforward, horizon)
      )

       # Learnable aggregation token
       self.cls_token = nn.Parameter(torch.randn(1, 1, d_model))

   def forward(self, x):
       """
      Args:
          x: (batch_size, sequence_length, input_dim)
      Returns:
          predictions: (batch_size, horizon)
      """
       batch_size = x.size(0)

       x = self.input_embedding(x)
       x = self.pos_encoder(x)

       cls_tokens = self.cls_token.expand(batch_size, -1, -1)
       x = torch.cat([cls_tokens, x], dim=1)

       x          = self.transformer_encoder(x)
       cls_output = x[:, 0, :]  # CLS token output

       return self.fc_out(cls_output)

Training

Training Loop

def train_epoch(model, train_loader, optimizer, criterion, device):
   model.train()
   total_loss = 0.0

   for data, target in train_loader:
       data, target = data.to(device), target.to(device)

       optimizer.zero_grad()
       output = model(data)
       loss   = criterion(output, target)
       loss.backward()

       # Gradient clipping is important: financial data can produce large gradient
       # spikes that destabilise training without it
       torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
       optimizer.step()

       total_loss += loss.item()

   return total_loss / len(train_loader)


def evaluate(model, loader, criterion, device):
   model.eval()
   total_loss  = 0.0
   predictions = []
   actuals     = []

   with torch.no_grad():
       for data, target in loader:
           data, target = data.to(device), target.to(device)
           output = model(data)
           total_loss += criterion(output, target).item()
           predictions.extend(output.cpu().numpy().flatten())
           actuals.extend(target.cpu().numpy().flatten())

   return total_loss / len(loader), predictions, actuals

Complete Training Pipeline

def train_transformer(model, train_loader, test_loader, epochs=50, lr=0.0001):
   """
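  Training pipeline with early stopping and learning rate scheduling.

  Both early stopping and the scheduler monitor the loss on `test_loader`,
  so the reported test loss is no longer a clean out-of-sample estimate.
  For an unbiased final evaluation, pass a validation split carved from
  the end of the training window here and keep the test set untouched.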

  Note on model saving: model.state_dict().copy() only performs a shallow
  copy — tensors are shared and will be mutated by subsequent training steps.
  Use copy.deepcopy() to correctly capture a snapshot of the best weights.
  """
   model     = model.to(device)
   criterion = nn.MSELoss()
   optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)

   # The verbose argument is deprecated since PyTorch 2.2; omit it
   scheduler = optim.lr_scheduler.ReduceLROnPlateau(
       optimizer, mode='min', factor=0.5, patience=5
  )

   best_test_loss  = float('inf')
   best_model_state = None
   patience_counter = 0
   early_stop_patience = 10
   history = {'train_loss': [], 'test_loss': []}

   for epoch in range(epochs):
       train_loss              = train_epoch(model, train_loader, optimizer, criterion, device)
       test_loss, preds, acts = evaluate(model, test_loader, criterion, device)
       scheduler.step(test_loss)

       history['train_loss'].append(train_loss)
       history['test_loss'].append(test_loss)

       if test_loss < best_test_loss:
           best_test_loss   = test_loss
           best_model_state = copy.deepcopy(model.state_dict())  # Deep copy is essential
           patience_counter = 0
       else:
           patience_counter += 1

       if (epoch + 1) % 5 == 0:
           print(
               f"Epoch {epoch+1:>3}/{epochs} | "
               f"Train Loss: {train_loss:.6f} | "
               f"Test Loss:  {test_loss:.6f}"
          )

       if patience_counter >= early_stop_patience:
           print(f"Early stopping triggered at epoch {epoch + 1}")
           break

   model.load_state_dict(best_model_state)
   return model, history


# Initialise and train
input_dim = len(feature_cols)
model = TransformerTimeSeries(
   input_dim=input_dim,
   d_model=128,
   nhead=8,
   num_layers=3,
   dim_feedforward=256,
   dropout=0.1,
   horizon=1
)

print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
# 15 epochs, matching the run logged below
model, history = train_transformer(model, train_loader, test_loader, epochs=15, lr=0.0005)

Output:

Model parameters: 432,257
Epoch   5/15 | Train Loss: 0.000306 | Test Loss: 0.000155
Epoch 10/15 | Train Loss: 0.000190 | Test Loss: 0.000072
Epoch 15/15 | Train Loss: 0.000169 | Test Loss: 0.000065

Training Loss Curve

Figure 1: Training and validation loss convergence. The model converges rapidly within the first few epochs, with validation loss stabilising.


Backtesting Framework

A model that predicts well in-sample but fails to generate risk-adjusted returns after costs is worthless in practice. The framework below implements threshold-based signal generation with explicit transaction costs and a mark-to-market portfolio valuation based on actual price data.

class Backtester:
   """
  Backtesting framework with transaction costs, position sizing,
  and standard performance metrics.

  Prices are required explicitly so that portfolio valuation is based
  on actual market prices rather than arbitrary assumptions.
  """

   def __init__(
       self,
       prices,                  # Actual close price series (aligned to test period)
       initial_capital=100_000,
       transaction_cost=0.001,  # 0.1% per side; charged on both the buy and the sell
  ):
       self.prices           = np.array(prices)
       self.initial_capital  = initial_capital
       self.transaction_cost = transaction_cost

   def run_backtest(self, predictions, threshold=0.0):
       """
      Threshold-based long-only strategy.

      Args:
          predictions: Predicted next-day returns (aligned to self.prices)
          threshold:   Minimum predicted return required to hold a long position;
                       at or below it, any open position is sold

      Returns:
          dict of performance metrics and time series
      """
       assert len(predictions) == len(self.prices) - 1, (
           "predictions must have length len(prices) - 1"
      )

       cash             = float(self.initial_capital)
       shares_held      = 0.0
       portfolio_values = []
       trades           = []

       for i, pred in enumerate(predictions):
           price_today     = self.prices[i]
           price_tomorrow  = self.prices[i + 1]

           # --- Signal execution (trade at today's close, value at tomorrow's close) ---
           if pred > threshold and shares_held == 0.0:
               # Buy: allocate full capital
               shares_to_buy = cash / (price_today * (1 + self.transaction_cost))
               cash         -= shares_to_buy * price_today * (1 + self.transaction_cost)
               shares_held   = shares_to_buy
               trades.append({'day': i, 'action': 'BUY', 'price': price_today})

           elif pred <= threshold and shares_held > 0.0:
               # Sell
               proceeds    = shares_held * price_today * (1 - self.transaction_cost)
               cash       += proceeds
               trades.append({'day': i, 'action': 'SELL', 'price': price_today})
               shares_held = 0.0

           # Mark-to-market at tomorrow's close
           portfolio_value = cash + shares_held * price_tomorrow
           portfolio_values.append(portfolio_value)

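       portfolio_values = np.array(portfolio_values)
       # Measure the first day's return against initial capital so that the
       # entry cost of a day-one trade is not silently dropped
       equity_path      = np.concatenate([[self.initial_capital], portfolio_values])
       daily_returns    = np.diff(equity_path) / equity_path[:-1]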

       # --- Performance metrics ---
       total_return     = (portfolio_values[-1] - self.initial_capital) / self.initial_capital
       n_trading_days   = len(portfolio_values)
       annual_factor    = 252 / n_trading_days
       annual_return    = (1 + total_return) ** annual_factor - 1
       annual_vol       = daily_returns.std() * np.sqrt(252)
       sharpe_ratio     = (annual_return - 0.02) / annual_vol if annual_vol > 0 else 0.0

       cumulative       = portfolio_values / self.initial_capital
       running_max      = np.maximum.accumulate(cumulative)
       drawdowns        = (cumulative - running_max) / running_max
       max_drawdown     = drawdowns.min()

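       # Win rate over days with a non-zero P&L; 0.0 if the strategy never traded
       active_days = daily_returns[daily_returns != 0]
       win_rate    = (active_days > 0).mean() if active_days.size else 0.0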

       return {
           'total_return':     total_return,
           'annual_return':    annual_return,
           'annual_volatility': annual_vol,
           'sharpe_ratio':     sharpe_ratio,
           'max_drawdown':     max_drawdown,
           'win_rate':         win_rate,
           'num_trades':       len(trades),
           'portfolio_values': portfolio_values,
           'daily_returns':    daily_returns,
           'drawdowns':        drawdowns,
      }

   def plot_performance(self, results, title='Backtest Results'):
       fig, axes = plt.subplots(2, 2, figsize=(14, 10))

       axes[0, 0].plot(results['portfolio_values'])
       axes[0, 0].axhline(self.initial_capital, color='r', linestyle='--', alpha=0.5)
       axes[0, 0].set_title('Portfolio Value ($)')

       axes[0, 1].hist(results['daily_returns'], bins=50, edgecolor='black', alpha=0.7)
       axes[0, 1].set_title('Daily Returns Distribution')

       cumulative = np.cumprod(1 + results['daily_returns'])
       axes[1, 0].plot(cumulative)
       axes[1, 0].set_title('Cumulative Returns (rebased to 1)')

       axes[1, 1].fill_between(range(len(results['drawdowns'])), results['drawdowns'], 0, alpha=0.7)
       axes[1, 1].set_title(f"Drawdown (max: {results['max_drawdown']:.2%})")

       plt.suptitle(title, fontsize=14, fontweight='bold')
       plt.tight_layout()
       return fig, axes

Running the Backtest

# Extract test-period prices aligned to predictions
test_prices = data['Close'].values[train_size + sequence_length : train_size + sequence_length + len(test_dataset) + 1]

_, predictions, actuals = evaluate(model, test_loader, nn.MSELoss(), device)

backtester = Backtester(prices=test_prices, initial_capital=100_000, transaction_cost=0.001)
results    = backtester.run_backtest(predictions, threshold=0.001)

print("\n=== Backtest Results ===")
print(f"Total Return:      {results['total_return']:.2%}")
print(f"Annual Return:     {results['annual_return']:.2%}")
print(f"Annual Volatility: {results['annual_volatility']:.2%}")
print(f"Sharpe Ratio:      {results['sharpe_ratio']:.2f}")
print(f"Max Drawdown:      {results['max_drawdown']:.2%}")
print(f"Win Rate:          {results['win_rate']:.2%}")
print(f"Number of Trades:  {results['num_trades']}")

Output:

=== Backtest Results ===
Total Return:     20.31%
Annual Return:     10.04%
Annual Volatility: 7.90%
Sharpe Ratio:     1.02
Max Drawdown:     -7.54%
Win Rate:         57.06%
Number of Trades: 4

Backtest Performance Charts

Figure 2: Transformer backtest performance. Top-left: portfolio value over time. Top-right: daily returns distribution. Bottom-left: cumulative returns. Bottom-right: drawdown profile.

Figure 3: Predicted vs actual returns scatter plot. The tight clustering near zero reflects the model’s conservative predictions—typical for return prediction tasks where the signal-to-noise ratio is extremely low.


Walk-Forward Validation

A single train/test split is rarely sufficient for financial ML evaluation. Market regimes shift—what holds in a 2015–2022 training window may not generalise to a 2022–2024 test window that includes rate-hiking cycles, bank stress events, and AI-driven sector rotations. Walk-forward validation repeatedly re-trains the model on an expanding window and evaluates it on the subsequent out-of-sample period, producing a distribution of performance outcomes rather than a single point estimate.

def walk_forward_validation(
   data,
   feature_cols,
   sequence_length=60,
   initial_train_years=4,
   test_months=6,
   model_kwargs=None,
   training_kwargs=None
):
   """
  Expanding-window walk-forward cross-validation for time series models.

  Returns a list of per-fold backtest result dicts.
  """
   if model_kwargs    is None: model_kwargs    = {}
   if training_kwargs is None: training_kwargs = {}

   dates      = data.index
   results    = []
   train_days = initial_train_years * 252
   step_days  = test_months * 21  # approximate trading days per month

   fold = 0
   while train_days + step_days <= len(data):
       train_end = train_days
       test_end  = min(train_days + step_days, len(data))

       train_data = data.iloc[:train_end]
       test_data  = data.iloc[train_end:test_end]

       if len(test_data) < sequence_length + 2:
           break

       # Build datasets
       # Fit scaler on training data only — no leakage
       train_ds = FinancialDataset(train_data, sequence_length=sequence_length, features=feature_cols)
       test_ds  = FinancialDataset(test_data,  sequence_length=sequence_length, features=feature_cols)
       # Apply training scaler to test data
       test_ds.scaled_data = train_ds.scaler.transform(test_ds.data)

       train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
       test_loader  = DataLoader(test_ds,  batch_size=64, shuffle=False)

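       # Train fresh model for each fold.
       # Caution: test_loader is passed in the validation slot here. If
       # train_transformer uses it for early stopping or checkpoint selection,
       # the fold's test metrics are optimistically biased; a validation slice
       # carved from the end of train_data would avoid that.
       fold_model = TransformerTimeSeries(
           input_dim=len(feature_cols), **model_kwargs
      )
       fold_model, _ = train_transformer(
           fold_model, train_loader, test_loader, **training_kwargs
      )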

       _, preds, acts = evaluate(fold_model, test_loader, nn.MSELoss(), device)

       test_prices = test_data['Close'].values[sequence_length : sequence_length + len(preds) + 1]
       bt          = Backtester(prices=test_prices)
       fold_result = bt.run_backtest(preds)
       fold_result['fold']          = fold
       fold_result['train_end_date'] = str(dates[train_end - 1].date())
       fold_result['test_end_date']  = str(dates[test_end - 1].date())
       results.append(fold_result)

       print(
           f"Fold {fold}: train through {fold_result['train_end_date']}, "
           f"Sharpe = {fold_result['sharpe_ratio']:.2f}, "
           f"Return = {fold_result['annual_return']:.2%}"
      )

       fold       += 1
       train_days += step_days  # expand the training window

   return results

Output:

Walk-Forward Summary (5 folds):
Sharpe Range: -1.63 to 1.77
Mean Sharpe: 0.62
Median Sharpe: 1.01
Return Range: -11.74% to 32.41%
Mean Return: 13.14%

Walk-Forward Results by Fold

Fold  Train End   Test End    Sharpe  Return (%)  Max DD (%)  Trades
0     2019-02-01  2020-02-03    1.20        13.9        -6.1       8
1     2020-02-03  2021-02-02    1.77        32.4        -9.4       5
2     2021-02-02  2022-02-01   -1.63       -11.7       -11.3      12
3     2022-02-01  2023-02-02    1.01        22.1       -12.2       5
4     2023-02-02  2024-02-05    0.73         9.0        -9.2       7

Figure 4: Walk-forward validation—Sharpe ratio and annualised return by fold. The variation across folds (Sharpe from -1.63 to 1.77) illustrates regime sensitivity.

Walk-forward results reveal instability that a single split conceals. Fold 2 (training through Feb 2021, testing into early 2022) produced a negative Sharpe of -1.63. That test window closed just before the first rate hike of the cycle in March 2022, but it did capture the repricing of rate expectations and the equity drawdown of January 2022. The model struggled to adapt to a regime shift not represented in its training window. If the Sharpe ratio varies between -1.6 and 1.8 across folds, the strategy is fragile regardless of how the mean looks.
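The summary block shown earlier is not produced by walk_forward_validation itself, so here is a minimal sketch of how the per-fold dicts could be collapsed into those dispersion statistics. The helper name summarise_folds is hypothetical, and the inputs are simply the Sharpe and return figures copied from the table, with returns expressed as fractions.

```python
import statistics


def summarise_folds(fold_results):
    """Collapse per-fold backtest dicts into dispersion statistics."""
    sharpes = [r["sharpe_ratio"] for r in fold_results]
    returns = [r["annual_return"] for r in fold_results]
    return {
        "n_folds": len(fold_results),
        "sharpe_min": min(sharpes),
        "sharpe_max": max(sharpes),
        "sharpe_mean": statistics.mean(sharpes),
        "sharpe_median": statistics.median(sharpes),
        "return_min": min(returns),
        "return_max": max(returns),
        "return_mean": statistics.mean(returns),
        # Fraction of folds with a losing risk-adjusted outcome
        "share_negative_sharpe": sum(s < 0 for s in sharpes) / len(sharpes),
    }


# Per-fold values copied from the walk-forward table (returns as fractions)
fold_results = [
    {"sharpe_ratio": 1.20, "annual_return": 0.139},
    {"sharpe_ratio": 1.77, "annual_return": 0.324},
    {"sharpe_ratio": -1.63, "annual_return": -0.117},
    {"sharpe_ratio": 1.01, "annual_return": 0.221},
    {"sharpe_ratio": 0.73, "annual_return": 0.090},
]

summary = summarise_folds(fold_results)
print(f"Walk-Forward Summary ({summary['n_folds']} folds):")
print(f"Sharpe Range: {summary['sharpe_min']:.2f} to {summary['sharpe_max']:.2f}")
print(f"Mean Sharpe: {summary['sharpe_mean']:.2f}")
print(f"Median Sharpe: {summary['sharpe_median']:.2f}")
print(f"Mean Return: {summary['return_mean']:.2%}")
print(f"Folds with negative Sharpe: {summary['share_negative_sharpe']:.0%}")
```

Reporting the share of losing folds next to the mean keeps a single bad regime from being averaged away.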


Comparing with Baseline Models

To evaluate whether the Transformer adds value, we compare against classical ML baselines. One important caveat: flattening a 60 × 10 sequence into a 600-dimensional feature vector—as is commonly done—creates a high-dimensional, temporally unstructured input that favours regularised linear models. The comparison below makes this limitation explicit.

from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error


def train_baseline_models(X_train, y_train, X_test, y_test):
   """
  Fit and evaluate classical ML baselines.
  Note: flattened sequences lose temporal structure. These results represent
  baselines on a different (and arguably weaker) representation of the data.
  """
   results = {}

   for name, clf in [
      ('Ridge Regression',   Ridge(alpha=1.0)),
      ('Random Forest',      RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)),
      ('Gradient Boosting',  GradientBoostingRegressor(n_estimators=100, max_depth=5, random_state=42)),
  ]:
       clf.fit(X_train, y_train)
       preds = clf.predict(X_test)
       results[name] = {
           'predictions': preds,
           'mse': mean_squared_error(y_test, preds),
           'mae': mean_absolute_error(y_test, preds),
      }

   return results


# Flatten sequences for sklearn (acknowledging the representational trade-off)
X_train = np.array([dataset[i][0].numpy().flatten() for i in range(train_size)])
y_train = np.array([dataset[i][1].numpy()           for i in range(train_size)])
X_test  = np.array([dataset[i][0].numpy().flatten() for i in range(train_size, n)])
y_test  = np.array([dataset[i][1].numpy()           for i in range(train_size, n)])

baseline_results = train_baseline_models(X_train, y_train.ravel(), X_test, y_test.ravel())
baseline_results['Transformer'] = {
   'predictions': predictions,
   'mse': mean_squared_error(actuals, predictions),
   'mae': mean_absolute_error(actuals, predictions),
}

print("\n=== Model Comparison ===")
print(f"{'Model':<22} {'MSE':>10} {'MAE':>10} {'Sharpe':>8} {'Return':>10}")
print("-" * 64)

for name, res in baseline_results.items():
   # Same prices, costs, and threshold for every model so the rows are comparable
   bt_res = Backtester(prices=test_prices).run_backtest(res['predictions'], threshold=0.001)
   print(
       f"{name:<22} {res['mse']:>10.6f} {res['mae']:>10.6f} "
       f"{bt_res['sharpe_ratio']:>8.2f} "
       f"{bt_res['annual_return']:>10.2%}"
  )

Output:

Model              MSE       MAE       Sharpe  Return
Transformer        0.000064  0.006118    1.02   10.0%
Random Forest      0.000064  0.006134    0.61    3.7%
Gradient Boosting  0.000078  0.006823   -0.99   -3.6%
Ridge Regression   0.000087  0.007221   -1.42   -8.8%

Figure 5: Visual comparison of MSE, Sharpe ratio, and annualised return across all models. The Transformer (orange) leads on risk-adjusted metrics.

The Transformer achieved the highest Sharpe ratio (1.02) and best annualised return (10.0%) among all models tested. It also tied with Random Forest for the lowest MSE. Ridge Regression and Gradient Boosting both produced negative returns on this test period. However, these results come from a single test window and should be interpreted alongside the walk-forward evidence, which shows significant regime sensitivity.

If the Transformer does not meaningfully outperform Ridge Regression on a risk-adjusted basis, that is important information—not a failure of the exercise. Financial time series are notoriously resistant to complexity, and Occam’s razor applies.


Inspecting Attention Patterns

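Attention weights cannot be read off a stock nn.TransformerEncoder with a simple forward hook, because each encoder layer calls its self-attention module with need_weights=False, so the weights are never returned. The implementation below instead replays the forward pass layer by layer, calling each layer's self_attn directly with need_weights=True and collecting the per-head weights.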

def extract_attention_weights(model, x_tensor):
   """
  Extract per-layer, per-head attention weights from a trained model.

  Args:
      model:   Trained TransformerTimeSeries instance
      x_tensor: Input tensor of shape (1, sequence_length, input_dim)

  Returns:
      List of attention weight tensors, one per encoder layer,
      each of shape (num_heads, seq_len+1, seq_len+1)
  """
   model.eval()
   attention_outputs = []

   # nn.TransformerEncoderLayer calls its self-attention with need_weights=False,
   # so a forward hook would only see None in place of the weights. Instead,
   # replay each layer by hand. This assumes the default post-norm layout
   # (norm_first=False), batch_first=True, and no attention mask.
   with torch.no_grad():
       x = model.input_embedding(x_tensor)
       x = model.pos_encoder(x)

       batch_size = x.size(0)
       cls_tokens = model.cls_token.expand(batch_size, -1, -1)
       x = torch.cat([cls_tokens, x], dim=1)

       for layer in model.transformer_encoder.layers:
           # Forward through self-attention with weights returned
           src2, attn_weights = layer.self_attn(
               x, x, x,
               need_weights=True,
               average_attn_weights=False  # retain per-head weights
          )
           attention_outputs.append(attn_weights.squeeze(0).cpu().numpy())
           # Continue through rest of layer
           x = x + layer.dropout1(src2)
           x = layer.norm1(x)
           x = x + layer.dropout2(layer.linear2(layer.dropout(layer.activation(layer.linear1(x)))))
           x = layer.norm2(x)

   return attention_outputs


def plot_attention_heatmap(attn_weights, sequence_length, layer=0, head=0):
   """
  Plot attention weights for a specific layer and head.

  Reminder: attention weights indicate what each position attended to,
  but should not be interpreted as causal feature importance without
  further analysis (Jain & Wallace, 2019).
  """
   fig, ax = plt.subplots(figsize=(10, 8))
   weights  = attn_weights[layer][head]  # (seq_len+1, seq_len+1)

   im = ax.imshow(weights, cmap='viridis', aspect='auto')
   ax.set_title(f'Attention Weights — Layer {layer}, Head {head}')
   ax.set_xlabel('Key Position (0 = CLS token)')
   ax.set_ylabel('Query Position (0 = CLS token)')
   plt.colorbar(im, ax=ax, label='Attention weight')
   plt.tight_layout()
   return fig

Figure 6: Attention weight heatmaps for Head 0 across all three encoder layers. Layer 0 shows distributed attention; deeper layers develop more structured patterns with stronger vertical bands indicating specific timesteps that attract attention across all query positions.

Figure 7: [CLS] token attention distribution across the 60-day lookback window. All three layers show a mild recency bias (higher attention to recent timesteps) while maintaining broad coverage across the full sequence.

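The CLS token attention plots reveal a consistent pattern: while the model attends across the full 60-day window, there is a mild recency bias with higher attention weights on the most recent timesteps, particularly in Layer 1. This is intuitive for a daily return prediction task. Layer 0 shows a notable peak around day 7. It is tempting to read this as weekly seasonality, but a trading week is five sessions, so the link is tentative at best and, given the caveat that attention weights are not causal feature importance, should not be treated as evidence on its own.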
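To make the recency claim measurable rather than visual, the CLS row can be pulled out of each layer's attention array and compared with a uniform baseline. This is a minimal sketch with hypothetical helper names (cls_attention_profile, recency_share); it assumes the array layout returned by extract_attention_weights, with the CLS token at index 0, and it runs on synthetic weights rather than the trained model.

```python
import numpy as np


def cls_attention_profile(layer_weights):
    """Attention from the CLS query to each timestep, averaged over heads.

    layer_weights: array of shape (num_heads, seq_len + 1, seq_len + 1),
    with the CLS token at index 0 on both axes.
    """
    w = np.asarray(layer_weights, dtype=float)
    cls_to_steps = w[:, 0, 1:]             # drop the CLS-to-CLS weight
    profile = cls_to_steps.mean(axis=0)    # average over heads
    return profile / profile.sum()         # renormalise over the timesteps


def recency_share(profile, last_n=5):
    """Share of CLS attention mass on the most recent last_n timesteps."""
    return float(profile[-last_n:].sum())


# Synthetic stand-in for one layer: 4 heads, 60 timesteps plus CLS
rng = np.random.default_rng(0)
raw = rng.random((4, 61, 61))
weights = raw / raw.sum(axis=-1, keepdims=True)   # rows sum to 1, like softmax

profile = cls_attention_profile(weights)
uniform_share = 5 / 60
print(f"Share on last 5 days: {recency_share(profile):.3f} (uniform: {uniform_share:.3f})")
```

A share well above the uniform figure would support the recency reading; a share close to it would suggest the heatmap impression is mostly noise.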


Practical Considerations

Data Quality Takes Priority

A Transformer will amplify whatever is present in your features—signal and noise alike. Before tuning model architecture, ensure you have addressed:

  • Survivorship bias: historical universes must include delisted securities
  • Corporate actions: price series require dividend and split adjustment
  • Timestamp alignment: ensure features and labels reference the same point in time, with no future information leaking through lookahead in technical indicator calculations
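The timestamp point is the easiest of the three to get wrong in code. The sketch below builds feature and label arrays from a price series so that every feature is known at the close of day t and the label is the return from t to t+1. The function name make_supervised and the toy prices are illustrative, not part of the pipeline above.

```python
import numpy as np


def make_supervised(prices, lookback=3):
    """Build (X, y) with features known at day t and label realised at t+1."""
    prices = np.asarray(prices, dtype=float)
    # rets[i] is the return from day i to day i+1, known at the close of day i+1
    rets = prices[1:] / prices[:-1] - 1.0
    X, y = [], []
    for t in range(lookback, len(rets)):
        X.append(rets[t - lookback:t])   # last return used is realised at day t
        y.append(rets[t])                # return from day t to day t+1
    return np.array(X), np.array(y)


prices = np.array([100.0, 101.0, 103.0, 102.0, 104.0, 105.0, 107.0])
X, y = make_supervised(prices, lookback=3)

# Leakage check: changing the final price may move the last label only
bumped = prices.copy()
bumped[-1] = 120.0
X_bumped, y_bumped = make_supervised(bumped, lookback=3)
print("features unchanged:", np.array_equal(X, X_bumped))
print("last label changed:", y[-1] != y_bumped[-1])
```

The same perturbation test can be applied to any indicator pipeline: bump a future price and confirm that no earlier feature row changes.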

Regularisation is Non-Negotiable

Financial data is effectively low-sample relative to the dimensionality of learnable parameters in a Transformer. The following regularisation tools are all relevant:

  • Dropout (0.1–0.3) on attention and feedforward layers
  • Weight decay (1e-5 to 1e-4) in the Adam optimiser
  • Early stopping monitored on a held-out validation set
  • Sequence length tuning—longer is not always better

Transaction Costs Are Strategy-Killers

A model with 51% directional accuracy but 1% transaction cost per round-trip will consistently lose money. Always calibrate thresholds so that expected signal magnitude exceeds the breakeven cost. In the framework above, the threshold parameter on run_backtest serves this purpose.
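The arithmetic behind that claim is worth spelling out. Under the simplifying assumption that winning and losing trades have the same average absolute move m, a hit rate p gives an expected gross return of (2p - 1) * m per trade, so the breakeven move is cost / (2p - 1). The function names below are hypothetical and the symmetric-payoff assumption is mine, not something the backtester enforces.

```python
def expected_net_edge(hit_rate, avg_abs_move, round_trip_cost):
    """Expected net return per trade, assuming wins and losses of equal size."""
    return (2.0 * hit_rate - 1.0) * avg_abs_move - round_trip_cost


def breakeven_move(hit_rate, round_trip_cost):
    """Average absolute move per trade needed to cover costs."""
    edge = 2.0 * hit_rate - 1.0
    if edge <= 0:
        return float("inf")   # no directional edge, so no move size is enough
    return round_trip_cost / edge


# 51% accuracy with a 1% round-trip cost, as in the text
required = breakeven_move(hit_rate=0.51, round_trip_cost=0.01)
print(f"Breakeven average move per trade: {required:.1%}")

# A more typical 1% daily move at the same accuracy and cost
net = expected_net_edge(hit_rate=0.51, avg_abs_move=0.01, round_trip_cost=0.01)
print(f"Expected net return per trade: {net:.4%}")
```

With a 2% directional edge, each trade would need to move about 50% on average just to cover a 1% round trip, which is why either the cost or the trade frequency has to come down.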

Computational Cost

Transformer self-attention scales as O(n²) in sequence length, where n is the number of timesteps. For daily data with sequence lengths of 60–250 days, this is manageable. For intraday or tick data with sequence lengths in the thousands, consider linearised attention (Performer), sliding-window sparse attention (Longformer), or Informer-style ProbSparse attention.

Multiple Testing and the Overfitting Surface

Each architectural choice—number of heads, depth, feedforward width, dropout rate—is a degree of freedom through which you can inadvertently fit to your test set. If you evaluate 50 hyperparameter configurations against a fixed test window, some will look good by chance. Use a strict holdout set that is never touched during development, rely on walk-forward validation for performance estimation, and treat single backtest results with appropriate scepticism.
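A small simulation shows how large the selection effect can be. The sketch below generates 50 return streams with zero true edge, scores each over one year, and reports the best. All parameters (50 configurations, 252 days, 1% daily volatility) are illustrative assumptions.

```python
import numpy as np


def null_sharpes(n_configs=50, n_days=252, daily_vol=0.01, seed=0):
    """Annualised Sharpe ratios of strategies that have no edge by construction."""
    rng = np.random.default_rng(seed)
    # Zero-mean daily returns: any apparent performance is pure luck
    rets = rng.normal(0.0, daily_vol, size=(n_configs, n_days))
    return rets.mean(axis=1) / rets.std(axis=1, ddof=1) * np.sqrt(252)


sharpes = null_sharpes()
print(f"Mean Sharpe across configs: {sharpes.mean():.2f}")
print(f"Best Sharpe across configs: {sharpes.max():.2f}")
```

Because a one-year Sharpe estimate has a standard error of roughly one, the best of 50 skill-free configurations will usually look like a strong strategy. The winner of a hyperparameter sweep needs a fresh holdout for the same reason.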


Conclusion

Transformer models offer genuine advantages for financial time series: direct access to long-range dependencies, parallel training, and multiple simultaneous pattern scales. They are not, however, a reliable source of alpha in themselves. In practice, their value is highly contingent on data quality, rigorous validation methodology, realistic transaction cost assumptions, and honest comparison against simpler baselines.

The complete implementation provided here demonstrates the full pipeline—from data preparation through walk-forward validation and backtest attribution. Three principles determine whether any of this adds value in production:

  1. Temporal discipline: never let future information touch the training set in any form
  2. Cost realism: evaluate alpha net of all realistic friction before drawing conclusions
  3. Baseline honesty: if gradient boosting matches or beats the Transformer at a fraction of the compute cost, use gradient boosting

The practitioners best positioned to extract sustainable alpha from these methods are those who combine domain knowledge with methodological rigour—and who remain genuinely sceptical of results that look too good.


References

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., & Zhang, W. (2021). Informer: Beyond efficient transformer for long sequence time-series forecasting. Proceedings of the AAAI Conference on Artificial Intelligence, 35(12), 11106–11115.

Wu, H., Xu, J., Wang, J., & Long, M. (2021). Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems, 34.

Lim, B., Arık, S. Ö., Loeff, N., & Pfister, T. (2021). Temporal fusion transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting, 37(4), 1748–1764.

Jain, S., & Wallace, B. C. (2019). Attention is not explanation. Proceedings of NAACL-HLT 2019, 3543–3556.

López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley.


All code is provided for educational and research purposes. Validate thoroughly before any production deployment. Past backtest performance does not predict future live results.

State-Space Models for Market Microstructure: Can Mamba Replace Transformers in High-Frequency Finance?

In my recent piece on Kronos, I explored how foundation models trained on K-line data are reshaping time series forecasting in finance. That discussion naturally raises a follow-up question that several readers have asked: what about the architecture itself? The Transformer has dominated deep learning for sequence modeling over the past seven years, but a new class of models — State-Space Models (SSMs), particularly the Mamba architecture — is gaining serious attention. In high-frequency trading, where computational efficiency and latency are everything, the claimed O(n) versus O(n²) complexity advantage is more than academic. It’s a potential competitive edge.

Let me be clear from the outset: I’m skeptical of any claim that a new architecture will “replace” Transformers wholesale. The Transformer ecosystem is mature, well-understood, and backed by enormous engineering investment. But in the specific context of market microstructure — where we process millions of tick events, model limit order book dynamics, and make decisions in microseconds — SSMs deserve serious examination. The question isn’t whether they can replace Transformers entirely, but whether they should be part of our toolkit for certain problems.

I’ve spent the better part of two decades building trading systems that push against latency constraints. I’ve watched the industry evolve from simple linear models to gradient boosted trees to deep learning, each wave promising revolutionary improvements. Most delivered incremental gains; some fizzled entirely. What’s interesting about SSMs isn’t the theoretical promise (we’ve seen theoretical promises before) but rather the practical characteristics that might actually matter in a production trading environment. The linear scaling in sequence length, the constant-time per-token inference, and the input-dependent selection mechanism aren’t just academic curiosities. They’re the exact properties that could determine whether a model makes it into a production system or dies in a research notebook.

What Are State-Space Models?

To understand why SSMs have suddenly become interesting, we need to go back to the mathematical foundations — and they’re older than you might think. State-space models originated in control theory and signal processing, describing systems where an internal state evolves over time according to differential equations, with observations emitted from that state. If you’ve used a Kalman filter — and in quant finance, many of us have — you’ve already worked with a simple state-space model, even if you didn’t call it that.

The canonical continuous-time formulation is:

\[x'(t) = Ax(t) + Bu(t)\]

\[y(t) = Cx(t) + Du(t)\]

where \(x(t)\) is the latent state vector, \(u(t)\) is the input, \(y(t)\) is the output, and \(A\), \(B\), \(C\), \(D\) are learned matrices. This looks remarkably like the model behind a Kalman filter because it is the same linear state-space form. The differences are that the matrices are learned from data rather than specified parametrically, and that deep SSMs stack many such linear layers with nonlinearities in between. Instead of assuming variance follows a GARCH(1,1) process, we let the model discover what the underlying state evolution looks like.

The challenge, historically, was that computing these models was intractable for long sequences. The recurrent view requires iterating through each timestep sequentially; the convolutional view requires computing full convolutions that scale poorly. This is where the S4 model (Structured State Space Sequence) changed the game.

S4, introduced by Gu, Goel, and Ré (2022), brought three critical innovations. First, it used the HiPPO (High-order Polynomial Projection Operators) framework to initialize the state matrix \(A\) in a way that preserves long-range dependencies. Without proper initialization, SSMs suffer from the same vanishing gradient problems as RNNs. The HiPPO matrix is specifically designed so that the state holds a compressed polynomial approximation of the entire input history rather than one that decays exponentially. In financial terms, this means last month’s market dynamics can influence today’s predictions, something vanilla RNNs struggle with.

Author’s Take: This is the key innovation that makes SSMs viable for finance. Without HiPPO, you’d face the same vanishing-gradient failure mode that killed RNN research for decades. The HiPPO initialization is essentially a “warm start” that encodes the mathematical insight that recent history matters more than distant history — but distant history still matters. This is perfectly aligned with how financial markets work: last quarter’s regime still influences pricing, even if less than yesterday’s moves.

HiPPO provides a theoretically grounded initialization that allows the model to remember information from thousands of timesteps ago — critical for financial time series where last week’s patterns may be relevant to today’s dynamics. The mathematical insight is that HiPPO projects the input onto a basis of orthogonal polynomials, maintaining a compressed representation of the full history. This is conceptually similar to how we’d use PCA for dimensionality reduction, except it’s learned end-to-end as part of the model’s dynamics.

Second, S4 introduced structured parameterizations that enable efficient computation via diagonalization. Rather than storing full \(N \times N\) matrices where \(N\) is the state dimension, S4 uses structured forms that reduce memory and compute requirements while maintaining expressiveness. The key insight is that the state transition matrix \(A\) can be parameterized as a diagonal-plus-low-rank form that enables fast computation via FFT-based convolution. This is what gives S4 its computational advantage over traditional SSMs — the structured form turns the convolution from \(O(L^2)\) to \(O(L \log L)\).

Third, S4 discretizes the continuous-time model into a discrete-time representation suitable for implementation. The standard approach is zero-order hold (ZOH), which treats the input as constant between timesteps:

\[x_{k} = \bar{A}x_{k-1} + \bar{B}u_k\]

\[y_k = \bar{C}x_k + \bar{D}u_k\]

where \(\bar{A} = e^{A\Delta t}\) and \(\bar{B} = (e^{A\Delta t} - I)A^{-1}B\). The output matrices are unaffected by discretization, so \(\bar{C} = C\) and \(\bar{D} = D\). The bilinear transform is an alternative that can offer better frequency response in some settings:

\[\bar{A} = (I + A\Delta t/2)(I - A\Delta t/2)^{-1}\]

Author’s Take: In practice, I’ve found ZOH (zero-order hold) works well for most tick-level data; it’s robust to the high-frequency microstructure noise that dominates at sub-second horizons. Bilinear can help if you’re modeling at longer horizons (minutes to hours) where you care more about capturing trend dynamics than filtering out tick-by-tick noise. This is another example of where domain knowledge beats blind architecture choices.

Either way, the discretization bridges continuous-time system theory with discrete-time sequence modeling. The choice of discretization matters for financial applications because different discretization schemes have different frequency characteristics — bilinear transform tends to preserve low-frequency behavior better, which may be important for capturing long-term trends.
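For intuition, both discretizations can be written in a few lines when \(A\) is diagonal, which is the case S4-style diagonal variants and Mamba work with. This is a sketch under that diagonal assumption. The bilinear \(\bar{B} = (I - A\Delta t/2)^{-1}\Delta t\,B\) is the standard companion formula and is not given in the text above; the function names and test values are illustrative.

```python
import numpy as np


def discretize_zoh(a, b, dt):
    """Zero-order hold for a diagonal state matrix (a holds the diagonal)."""
    a_bar = np.exp(a * dt)
    b_bar = (a_bar - 1.0) / a * b      # (e^{a dt} - 1) a^{-1} b, elementwise
    return a_bar, b_bar


def discretize_bilinear(a, b, dt):
    """Bilinear (Tustin) transform for a diagonal state matrix."""
    denom = 1.0 - a * dt / 2.0
    a_bar = (1.0 + a * dt / 2.0) / denom
    b_bar = dt / denom * b             # standard bilinear input matrix (assumed form)
    return a_bar, b_bar


a = np.array([-0.5, -2.0])   # stable continuous-time poles (negative real parts)
b = np.array([1.0, 1.0])
dt = 0.01

zoh_a, zoh_b = discretize_zoh(a, b, dt)
bil_a, bil_b = discretize_bilinear(a, b, dt)
print("ZOH      A_bar:", zoh_a, "B_bar:", zoh_b)
print("Bilinear A_bar:", bil_a, "B_bar:", bil_b)
```

At small step sizes the two agree to several decimal places; they diverge as \(\Delta t\) grows relative to the time constants in \(A\), which is when the choice starts to matter.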

Mamba, introduced by Gu and Dao (2023), added a fourth critical innovation: selective state spaces. The core insight is that not all input information is equally relevant at all times. In a financial context, during calm markets, we might want to ignore most order flow noise and focus on price levels; during a news event or volatility spike, we want to attend to everything. Mamba introduces a selection mechanism that allows the model to dynamically weigh which inputs matter:

\[s_t = \text{select}(u_t)\]

\[\bar{B}_t = \text{Linear}_B(s_t)\]

\[\bar{C}_t = \text{Linear}_C(s_t)\]

The select operation is implemented as a learned projection that determines which elements of the input to filter. This is fundamentally different from attention — rather than computing pairwise similarities between all tokens, the model learns a function that decides what information to carry forward. In practice, this means Mamba can learn to “ignore” regime-irrelevant data while attending to regime-critical signals.
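A minimal NumPy sketch of a selective recurrence may help make this concrete. It is a reference loop, not the fused kernel a real implementation would use. The input-dependent step size with a softplus, and the simplified \(\bar{B}_t \approx \Delta_t B_t\), follow the general shape of the Mamba formulation but are assumptions here, since the equations above do not spell them out. All weight names are hypothetical.

```python
import numpy as np


def selective_scan(u, A, W_B, W_C, W_dt):
    """Sequential reference for a selective SSM layer.

    u:    (L, D) input sequence
    A:    (D, N) diagonal state matrix per channel, entries negative
    W_B:  (D, N) projection producing the input-dependent B_t
    W_C:  (D, N) projection producing the input-dependent C_t
    W_dt: (D, D) projection producing the input-dependent step size
    """
    L, D = u.shape
    h = np.zeros_like(A)                        # hidden state, (D, N)
    y = np.zeros((L, D))
    for t in range(L):
        dt = np.logaddexp(0.0, u[t] @ W_dt)     # softplus keeps the step positive, (D,)
        B_t = u[t] @ W_B                        # (N,)
        C_t = u[t] @ W_C                        # (N,)
        A_bar = np.exp(dt[:, None] * A)         # per-step decay, (D, N)
        B_bar = dt[:, None] * B_t[None, :]      # simplified discretization, (D, N)
        h = A_bar * h + B_bar * u[t][:, None]   # carry forward or overwrite state
        y[t] = h @ C_t                          # read out, (D,)
    return y


rng = np.random.default_rng(0)
L, D, N = 12, 3, 4
u = rng.normal(size=(L, D))
A = -np.exp(rng.normal(size=(D, N)))            # negative entries give stable decay
W_B = rng.normal(size=(D, N))
W_C = rng.normal(size=(D, N))
W_dt = rng.normal(size=(D, D))

y = selective_scan(u, A, W_B, W_C, W_dt)
print(y.shape)
```

A small step size leaves the state almost untouched (the input is effectively ignored), while a large one lets the current input overwrite it. That is the sense in which the model chooses what to carry forward.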

This selectivity, combined with a hardware-aware parallel scan (the resulting selective SSM layer is often called S6), gives Mamba its claimed linear-time inference while maintaining the ability to capture complex dependencies. The complexity comparison is stark: Transformers require \(O(L^2)\) attention computations for sequence length \(L\), while Mamba processes each token in \(O(1)\) time with \(O(L)\) total computation. For \(L = 10{,}000\) ticks, a not-unreasonable window for intraday analysis, that’s \(10^8\) versus \(10^4\) sequence-level operations per layer, before constant factors such as the state dimension. The practical implication is either dramatically faster inference or the ability to process much longer sequences for the same compute budget. How much of that shows up as wall-clock latency depends on hardware, batch size, and kernel implementation, so it should be benchmarked rather than assumed; even a few milliseconds per forward pass matters when you’re making hundreds of predictions per second.

Compared to RNNs like LSTMs, SSMs don’t suffer from the same sequential computation bottleneck during training. While LSTMs must process tokens one at a time (true parallelization is limited), time-invariant SSMs such as S4 can be computed as convolutions during training, and Mamba, whose input-dependent parameters rule out a fixed convolution kernel, uses a parallel scan instead; both enable GPU parallelism. During inference, SSMs achieve the constant-time-per-token property that makes them attractive for production deployment. This is the key advantage over LSTMs: you get the sequential processing benefits of RNNs during inference with the parallel training benefits of CNNs.
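The equivalence between the two views is easy to check numerically for a time-invariant SSM (the S4-style case, not the selective Mamba variant). The sketch below assumes a diagonal \(\bar{A}\), sets \(\bar{D} = 0\), and uses the recurrence \(x_k = \bar{A}x_{k-1} + \bar{B}u_k\) from earlier; the function names and random test values are illustrative.

```python
import numpy as np


def ssm_recurrent(a_bar, b_bar, c, u):
    """Inference view: one state update per timestep."""
    x = np.zeros_like(a_bar)
    y = np.empty(len(u))
    for k, u_k in enumerate(u):
        x = a_bar * x + b_bar * u_k
        y[k] = np.dot(c, x)
    return y


def ssm_convolutional(a_bar, b_bar, c, u):
    """Training view: build the kernel once, then convolve via FFT."""
    L = len(u)
    powers = a_bar[None, :] ** np.arange(L)[:, None]   # (L, N), a_bar^j per state
    kernel = powers @ (c * b_bar)                      # K_j = sum_n c_n a_n^j b_n
    n_fft = 2 * L                                      # zero-pad for a linear convolution
    spectrum = np.fft.rfft(kernel, n_fft) * np.fft.rfft(u, n_fft)
    return np.fft.irfft(spectrum, n_fft)[:L]


rng = np.random.default_rng(1)
N, L = 8, 256
a_bar = rng.uniform(0.5, 0.99, size=N)   # stable diagonal transition
b_bar = rng.normal(size=N)
c = rng.normal(size=N)
u = rng.normal(size=L)

y_rec = ssm_recurrent(a_bar, b_bar, c, u)
y_conv = ssm_convolutional(a_bar, b_bar, c, u)
print("max abs difference:", np.max(np.abs(y_rec - y_conv)))
```

The convolutional path touches the whole sequence at once, which is what makes training parallel. The recurrent path needs only the previous state, which is what makes per-token inference constant-time.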

Why HFT and Market Microstructure?

If you’re building trading systems, you’ve likely noticed that most machine learning approaches to finance treat the problem as either (a) predicting returns at some horizon, or (b) classifying market regimes. Neither approach explicitly models the underlying mechanism that generates prices. Market microstructure does exactly that — it models how orders arrive, how limit order books evolve, how informed traders interact with liquidity providers, and how information gets incorporated into prices. Understanding microstructure isn’t just academic — it’s the foundation of profitable execution and market-making strategies.

The data characteristics of market microstructure create unique challenges that make SSMs potentially attractive:

Scale: A single liquid equity can generate millions of messages per day across bid, ask, and depth levels. Consider a highly traded stock like Tesla or Nvidia during volatile periods — you might see 50-100 messages per second, per instrument. A typical algo trading firm’s data pipeline might ingest 50-100GB of raw tick data daily across their coverage universe. Processing this with Transformer models is expensive. The quadratic attention complexity means that doubling your context length quadruples your compute cost. With SSMs, you double context and roughly double compute — a much friendlier scaling curve. This is particularly important when you’re building models that need to see significant historical context to make predictions.

Non-stationarity: Market microstructure is inherently non-stationary. The dynamics of a limit order book during normal trading differ fundamentally from those during a market open, a regulatory halt, or a volatility auction. At market open, you have a flood of overnight orders, wide spreads, and rapid price discovery. During a halt, trading stops entirely and the book freezes. In volatility auctions, you see large price movements with reduced liquidity. Mamba’s selective mechanism is specifically designed to handle this — the model can learn to “switch off” irrelevant inputs when market conditions change. This is conceptually similar to regime-switching models in econometrics, but learned end-to-end. The model learns when to attend to order flow dynamics and when to ignore them based on learned signals.

Latency constraints: In market-making or latency-sensitive strategies, every microsecond counts. A Transformer processing a 512-token sequence computes 512 × 512 = 262,144 pairwise attention scores per head. Mamba processes the same sequence in 512 state updates, a 512x reduction in the number of sequence-level operations. While the constants differ (each state update costs work proportional to the SSM state dimension), the theoretical advantage is substantial. Several practitioners I’ve spoken with report sub-10ms inference times for Mamba models that would be impractical with Transformers at the same context length. That is still far from sufficient for the tightest strategies: a typical market-making strategy might have a 100-microsecond latency budget for the entire decision pipeline, so there inference must be measured in microseconds, not milliseconds.
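The operation counts quoted here are easy to reproduce, and doing so shows how much the state dimension eats into the headline ratio. This is a back-of-envelope sketch only: it counts pairwise scores and state updates, ignores projections and memory traffic, and the state dimension of 16 is an illustrative assumption.

```python
def attention_score_count(seq_len):
    """Pairwise attention scores per head for full self-attention."""
    return seq_len * seq_len


def ssm_update_count(seq_len, state_dim=1):
    """State-element updates per channel for a sequential SSM pass."""
    return seq_len * state_dim


for seq_len in (512, 10_000):
    attn = attention_score_count(seq_len)
    ssm = ssm_update_count(seq_len)
    ssm_16 = ssm_update_count(seq_len, state_dim=16)
    print(
        f"L={seq_len:>6}: attention={attn:>12,}  "
        f"ssm={ssm:>7,} (ratio {attn // ssm}x)  "
        f"ssm with N=16: {ssm_16:>8,} (ratio {attn / ssm_16:.0f}x)"
    )
```

The ratio grows linearly with sequence length, so the advantage is modest at a few hundred tokens once the state dimension is included and becomes large only at tick-scale contexts.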

Long-range dependencies: Consider a statistical arbitrage strategy across 100 stocks. A regulatory announcement at 9:30 AM might affect correlations across the entire universe until midday. Capturing this requires modeling dependencies across thousands of timesteps. The HiPPO initialization in S4 and the selective mechanism in Mamba are specifically designed to maintain information flow over such horizons — something vanilla RNNs struggle with due to gradient decay. In practice, this means you can build models that truly “remember” what happened earlier in the trading session, not just what happened in the last few minutes.

There’s also a subtler point worth mentioning: the order book itself is a form of state. When you look at the bid-ask ladder, you’re seeing a snapshot of accumulated order flow — the current state reflects all historical interactions. SSMs are naturally suited to modeling stateful systems because that’s literally what they are. The latent state \(x(t)\) in the state equation can be interpreted as an embedding of the current market state, learned from data rather than specified by theory. This is philosophically aligned with how we think about market microstructure: the order book is a state variable, and the messages are observations that update that state.

Recent Research and Results

The application of SSMs to financial markets is a rapidly evolving research area. Let me survey what’s been published, with appropriate skepticism about early-stage results. The key papers worth noting span both the SSM methodology and the finance-specific applications.

On the methodology side, S4 (Gu, Goel, and Ré, 2022) established the foundation by demonstrating that structured state spaces could match or exceed Transformers on the Long Range Arena benchmark while scaling near-linearly in sequence length. The Mamba paper (Gu and Dao, 2023) pushed further by introducing selective state spaces and achieving state-of-the-art results on language modeling benchmarks, which was remarkable because it suggested SSMs could compete with Transformers on tasks previously dominated by attention. The follow-up work on Mamba-2 (Dao and Gu, 2024) introduced structured state space duality (SSD), further improving efficiency.

On the application side, CryptoMamba (Shi et al., 2025) applied Mamba to Bitcoin price prediction, demonstrating “effective capture of long-range dependencies” in cryptocurrency time series. The authors report competitive performance against LSTM and Transformer baselines on several prediction horizons. The cryptocurrency market, with its 24/7 trading and higher noise-to-signal ratio than traditional equities, provides an interesting test case for SSMs’ ability to handle extreme non-stationarity. The paper’s methodology section shows that Mamba’s selective mechanism successfully learned to filter out noise during calm periods while attending to significant price movements — exactly what we’d hope to see.

MambaStock (Liu et al., 2024) adapted the Mamba architecture specifically for stock prediction, introducing modifications to handle the multi-dimensional nature of financial features (price, volume, technical indicators). The selective scan mechanism was applied to filter relevant information at each timestep, with results suggesting improved performance over vanilla Mamba on short-term forecasting tasks. The authors also demonstrated that the learned selective weights could be interpreted to some extent, showing which input features the model attended to under different market conditions.

Graph-Mamba (Zhang et al., 2025) combined Mamba with graph neural networks for stock prediction, capturing both temporal dynamics and cross-sectional dependencies between stocks. The hybrid architecture uses Mamba for temporal sequence modeling and GNN layers for inter-stock relationships — an interesting approach for multi-asset strategies where understanding relative value matters. This paper is particularly relevant for quant shops running cross-asset strategies, where the ability to model both time series dynamics and asset correlations is critical.

FinMamba (Chen et al., 2025) took a market-aware approach, using graph-enhanced Mamba at multiple time scales. The paper explicitly notes that “Mamba offers a key advantage with its lower linear complexity compared to the Transformer, significantly enhancing prediction efficiency” — a point that resonates with anyone building production trading systems. The multi-scale approach is interesting because financial data has natural temporal hierarchies: tick data, second/minute bars, hourly, daily, and beyond.

MambaLLM (Zhang et al., 2025) introduced a framework fusing macro-index and micro-stock data through SSMs combined with large language models. This represents an interesting convergence — using SSMs not to replace LLMs but to preprocess financial sequences before LLM analysis. The intuition is that Mamba can efficiently compress long financial time series into representations that a smaller LLM can then interpret. This is conceptually similar to retrieval-augmented generation but for time series data.

Now, how do these results compare to the Transformer-based approaches I discussed in the Kronos piece?

LOBERT (Linna et al., 2025) is a foundation model for limit order book messages, essentially applying the Kronos philosophy to raw order book data rather than K-lines. Trained on massive amounts of LOB messages, LOBERT can be fine-tuned for various downstream tasks like price movement prediction or volatility forecasting. It’s an encoder-only architecture designed specifically for the hierarchical, message-based structure of order book data. The key innovation is treating LOB messages as a “language” with vocabulary for order types, price levels, and volumes.

LiT (Lim et al., 2025), the Limit Order Book Transformer, explicitly addresses the challenge of representing the “deep hierarchy” of limit order books. The Transformer architecture processes the full depth of the order book — multiple price levels on both bid and ask sides — with attention mechanisms designed to capture cross-level dependencies. This is different from treating the order book as a flat sequence; instead, LiT respects the hierarchical structure where Level 1 bid is fundamentally different from Level 10 bid.

The comparison is instructive. LOBERT and LiT are specifically engineered for order book data; the SSM-based approaches (CryptoMamba, MambaStock, FinMamba) are more general sequence models applied to financial data. This means the Transformer-based approaches may have an architectural advantage when the problem structure aligns with their design — but SSMs offer better computational efficiency and may generalize more flexibly to new tasks.

What about direct head-to-head comparisons? The evidence is still thin. Most papers compare SSMs to LSTMs or vanilla Transformers on simplified tasks. We need more rigorous benchmarks comparing Mamba to LOBERT/LiT on identical datasets and tasks. My instinct — and it’s only an instinct at this point — is that SSMs will excel at longer-context tasks where computational efficiency matters most, while specialized Transformers may retain advantages for tasks where the attention mechanism’s explicit pairwise comparison is valuable.

One interesting observation: I’ve seen several papers now that combine SSMs with attention mechanisms rather than replacing attention entirely. This hybrid approach may be the pragmatic path forward for production systems. The SSM handles the efficient sequential processing, while targeted attention layers capture specific dependencies that matter for the task at hand.

Practical Implementation Considerations

For quants considering deployment, several practical issues require attention:

Hardware requirements: Mamba’s selective scan is computationally intensive but scales linearly. A data-center GPU (NVIDIA A100 or equivalent) can handle inference on sequences of 4,000-8,000 tokens at latencies suitable for minute-level strategies. For tick-level strategies requiring sub-millisecond inference, you may need to reduce context length significantly or accept higher latency. The state dimension adds memory overhead: typical configurations use \(N = 64\) to \(N = 256\) state dimensions, which is modest compared to the embedding dimensions in large language models. I’ve found that \(N = 128\) offers a good balance between expressiveness and efficiency for most financial applications.

Inference latency: In my experience, reported latency numbers in papers often understate real-world costs. A model that “runs in 5ms” on a research benchmark may take 20ms when you account for data preprocessing, batching, network overhead, and model ensemble. That said, I’ve seen practitioners report 1-3ms inference times for Mamba models processing 512-token windows — well within the latency budget for many HFT strategies. Compare this to Transformer models at the same context length, which typically require 10-50ms on comparable hardware.
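A minimal sketch of how one might measure end-to-end latency rather than bare model time. The `toy_pipeline` function is a hypothetical stand-in for your own preprocessing plus model call, and the nearest-rank percentile is just one reasonable convention; on a real system you would time the actual production path.

```python
import math
import time


def percentile(values, q):
    """Nearest-rank percentile of a list, with q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def measure_latency_ms(pipeline, samples, warmup=5):
    """Time each end-to-end call (preprocessing included) in milliseconds."""
    for sample in samples[:warmup]:
        pipeline(sample)  # warm caches before timing
    latencies = []
    for sample in samples:
        start = time.perf_counter()
        pipeline(sample)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def toy_pipeline(window):
    """Hypothetical stand-in: centre a window, then 'predict' its mean."""
    mean = sum(window) / len(window)
    centred = [v - mean for v in window]
    return sum(centred) / len(centred)


samples = [[float(i + j) for j in range(512)] for i in range(200)]
latencies = measure_latency_ms(toy_pipeline, samples)
p50, p99 = percentile(latencies, 50), percentile(latencies, 99)
```

Reporting the 99th percentile alongside the median matters here, because it is the tail latency that determines whether a prediction arrives in time to be traded on.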

One practical trick: consider using reduced-precision inference (FP16 or even INT8 quantization) once you’ve validated model quality. The selective scan operations are relatively robust to quantization, and you can often achieve 2x latency improvements with minimal accuracy loss. This is particularly valuable for production systems where every microsecond counts.

Integration with existing systems: Most production trading infrastructure expects simple inference APIs — send features, receive predictions. Mamba requires more care: the stateful nature of SSMs means you can’t simply batch arbitrary sequences without managing hidden states. This is manageable but requires engineering effort. You’ll need to decide whether to maintain per-instrument state (complex but low-latency) or reset state for each prediction (simpler but potentially loses context).

In practice, I’ve found that a hybrid approach works well: maintain state during continuous operation within a trading session, but reset state at session boundaries (market open/close) or after significant gaps (overnight, weekend). This captures the within-session dynamics that matter for most strategies while avoiding state contamination from stale information.
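The reset policy described above can be sketched as a small state manager. This is an illustration under stated assumptions: the class and method names are hypothetical, the hidden state is a plain list standing in for whatever tensor your model carries, and the gap threshold is something you would tune per market.

```python
from dataclasses import dataclass


@dataclass
class InstrumentState:
    hidden: list    # stand-in for the model's recurrent state
    last_ts: float  # timestamp (seconds) of the last update


class SessionStateManager:
    """Keep per-instrument state within a session; reset on gaps or boundaries."""

    def __init__(self, state_dim, max_gap_seconds):
        self.state_dim = state_dim
        self.max_gap_seconds = max_gap_seconds
        self._states = {}

    def get_state(self, symbol, ts):
        """Return the state to condition on, starting fresh if it is stale."""
        state = self._states.get(symbol)
        if state is None or ts - state.last_ts > self.max_gap_seconds:
            state = InstrumentState([0.0] * self.state_dim, ts)
            self._states[symbol] = state
        return state.hidden

    def update(self, symbol, ts, new_hidden):
        """Store the state produced by the latest inference step."""
        self._states[symbol] = InstrumentState(list(new_hidden), ts)

    def reset_all(self):
        """Call at session boundaries such as the market open or close."""
        self._states.clear()


manager = SessionStateManager(state_dim=2, max_gap_seconds=60.0)
fresh = manager.get_state("KO", ts=0.0)
manager.update("KO", ts=0.0, new_hidden=[1.0, 2.0])
carried = manager.get_state("KO", ts=30.0)    # within the gap: state is kept
after_gap = manager.get_state("KO", ts=200.0)  # stale: state is reset
```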

Training data and compute: Fine-tuning Mamba for your specific market and strategy requires labeled data. Unlike Kronos’s zero-shot capabilities (trained on billions of K-lines), you’ll likely need task-specific training. This means GPU compute for training and careful validation to avoid overfitting. The training cost is lower than an equivalent Transformer — typically 2-4x less compute — but still significant.

For most quant teams, I’d recommend starting with pre-trained S4 weights (available from the original authors) and fine-tuning rather than training from scratch. The HiPPO initialization provides a strong starting point for financial time series even without domain-specific pre-training.

Model monitoring: The non-stationary nature of markets means your model’s performance will drift. With Transformers, attention patterns give some interpretability into what the model is “looking at.” With Mamba, the selective mechanism is less transparent. You’ll need robust monitoring for concept drift and regime changes, with fallback strategies when performance degrades.
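One simple form of drift monitoring is a rolling directional hit rate with a fallback threshold. The sketch below is an assumption about how you might set this up, not a prescribed method: the window length, the threshold, and the choice of hit rate as the metric are all illustrative.

```python
from collections import deque


class DriftMonitor:
    """Track a rolling directional hit rate and flag when to fall back."""

    def __init__(self, window, min_hit_rate):
        self.hits = deque(maxlen=window)
        self.min_hit_rate = min_hit_rate

    def record(self, predicted, realized):
        """A hit is a prediction whose sign matches the realized move."""
        self.hits.append(1 if predicted * realized > 0 else 0)

    def hit_rate(self):
        return sum(self.hits) / len(self.hits) if self.hits else None

    def should_fall_back(self):
        """Only judge the model once the window is full."""
        if len(self.hits) < self.hits.maxlen:
            return False
        return self.hit_rate() < self.min_hit_rate


monitor = DriftMonitor(window=4, min_hit_rate=0.5)
monitor.record(0.01, 0.02)    # hit
early_flag = monitor.should_fall_back()
monitor.record(0.01, -0.02)   # miss
monitor.record(-0.01, 0.02)   # miss
monitor.record(0.01, -0.01)   # miss
late_flag = monitor.should_fall_back()
```

In production the window would be far longer than four observations, and you would probably track several metrics at once, but the structure of "rolling statistic plus explicit fallback rule" stays the same.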

I recommend implementing shadow mode deployments where you run the Mamba model in parallel with your existing system, comparing predictions in real-time without actually trading. This lets you validate the model under live market conditions before committing capital.
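A shadow deployment boils down to logging both sets of predictions against realized outcomes and comparing them offline. The following is a minimal sketch with a hypothetical `ShadowRecorder` class; the metrics chosen (mean absolute error and sign disagreement) are assumptions, and you would substitute whatever your strategy actually cares about.

```python
class ShadowRecorder:
    """Log live and shadow predictions side by side; the shadow never trades."""

    def __init__(self):
        self.rows = []

    def record(self, live_pred, shadow_pred, realized):
        self.rows.append((live_pred, shadow_pred, realized))

    def summary(self):
        n = len(self.rows)
        if n == 0:
            return None
        live_mae = sum(abs(l - r) for l, _, r in self.rows) / n
        shadow_mae = sum(abs(s - r) for _, s, r in self.rows) / n
        disagreements = sum(1 for l, s, _ in self.rows if l * s < 0)
        return {
            "n": n,
            "live_mae": live_mae,
            "shadow_mae": shadow_mae,
            "sign_disagreement": disagreements / n,
        }


recorder = ShadowRecorder()
recorder.record(live_pred=0.010, shadow_pred=0.020, realized=0.015)
recorder.record(live_pred=-0.010, shadow_pred=0.010, realized=0.005)
report = recorder.summary()
```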

Implementation libraries: The good news is that Mamba implementations are increasingly accessible. The original paper’s code is available on GitHub, and several optimized implementations exist. The Hugging Face ecosystem now includes Mamba variants, making experimentation straightforward. For production deployment, you’ll likely want to use the optimized CUDA kernels from the Mamba-SSM library, which provide significant speedups over the reference implementation.

Limitations and Open Questions

Let me be direct about what we don’t yet know:

The Quant’s Reality Check: Critical Questions for Production

Hardware Bottleneck: Mamba’s selective scan requires custom CUDA kernels that aren’t as optimized as Transformer attention. In pure C++ HFT environments (where most production trading actually runs), you may need to write custom inference kernels — not trivial. The linear complexity advantage shrinks when you’re already GPU-bound or using FPGA acceleration.

Benchmarking Gap: We lack head-to-head comparisons of Mamba vs LOBERT/LiT on identical LOB data. LOBERT was trained on billions of LOB messages; Mamba hasn’t seen that scale of market data. The “fair fight” comparison hasn’t been run yet.

Interpretability Wall: Attention maps let you visualize what the model “looked at.” Mamba’s hidden states are compressed representations — harder to inspect, harder to explain to your risk committee. When the model blows up, you’ll need better tooling than attention visualization.

Regime Robustness: Show me a Mamba model that was tested through March 2020. I haven’t seen it. We simply don’t know how selective state spaces behave during once-in-a-decade liquidity crises, flash crashes, or central bank interventions.

Empirical evidence at scale: Most SSM papers in finance report results on small-to-medium datasets (thousands to hundreds of thousands of time series). We don’t yet have evidence of SSM performance on the massive datasets that characterize institutional trading: billions of ticks, across thousands of instruments, over decades of history. The pre-training paradigm that made Kronos compelling hasn’t been demonstrated for SSMs at equivalent scale in finance. This is probably the biggest gap in the current research landscape.

Interpretability: For risk management and regulatory compliance, understanding why a model makes a prediction matters. Transformers give us attention weights that (somewhat) illuminate which historical tokens influenced the prediction. Mamba’s hidden states are less interpretable. When your risk system asks “why did the model predict a volatility spike,” you’ll need more sophisticated explanation methods than attention visualization. Research on SSM interpretability is nascent, and tools for understanding hidden state dynamics are far less mature than attention visualization.

Regime robustness: Financial markets experience regime changes — sudden shifts in volatility, liquidity, and correlation structure. SSMs are designed to handle non-stationarity via selective mechanisms, but empirical evidence that they handle extreme regime changes better than Transformers is limited. A model trained during 2021-2022 might behave unpredictably during a 2020-style volatility spike, regardless of architecture. We need stress tests that specifically evaluate model behavior during crisis periods.

Regulatory uncertainty: As with all ML models in trading, regulatory frameworks are evolving. The combination of SSMs’ black-box nature and HFT’s regulatory scrutiny creates potential compliance challenges. Make sure your legal and compliance teams are aware of the model’s architecture before deployment. The explainability requirements for ML models in trading are becoming more stringent, and SSMs may face additional scrutiny due to their novelty.

Competitive dynamics: If SSMs become widely adopted in HFT, their computational advantages may disappear as the market arbitrages away alpha. The transformer’s dominance in NLP wasn’t solely due to performance — it was the ecosystem, the tooling, the understanding. SSMs are early in this curve. By the time SSMs become mainstream in finance, the competitive advantage may have shifted elsewhere.

Architectural maturity: Let’s not forget that Transformers have been refined over seven years of intensive research. Attention mechanisms have been optimized, positional encodings have evolved, and the entire ecosystem — from libraries to hardware acceleration — is mature. SSMs are at version 1.0. The Mamba architecture may undergo significant changes as researchers discover what works and what doesn’t in practice.

Benchmarking: The financial ML community lacks standardized benchmarks for SSM evaluation. Different papers use different datasets, different evaluation windows, and different metrics. This makes comparison difficult. We need something akin to the M4 forecasting competition, but designed for financial data and deep learning architectures.

Conclusion: A Pragmatic Hybrid View

The question “Can Mamba replace Transformers?” is the wrong frame. The more useful question is: what does each architecture do well, and how do we combine them?

My current thinking — formed through both literature review and hands-on experimentation — breaks down as follows:

SSMs (Mamba-style) for efficient session-long state maintenance: When you need to model how market state evolves over hours or days of continuous trading, SSMs offer a compelling efficiency-accuracy tradeoff. The selective mechanism lets the model naturally ignore regime-irrelevant noise while maintaining a compressed representation of everything that’s mattered. For session-level predictions — end-of-day volatility, overnight gap risk, correlation drift — SSMs are worth exploring.

Transformers for high-precision attention over complex LOB hierarchies: When you need to understand the exact structure of the order book at a moment in time — which price levels are absorbing liquidity, where informed traders are stacking orders — the attention mechanism’s explicit pairwise comparisons remain valuable. Models like LOBERT and LiT are specifically engineered for this, and I suspect they’ll retain advantages for order-book-specific tasks.

The hybrid future: The most promising path isn’t replacement but combination. Imagine a system where Mamba maintains a session-level state representation — the “market vibe” if you will — while Transformer heads attend to specific LOB dynamics when your signals trigger regime switches. The SSM tells you “something interesting is happening”; the Transformer tells you “it’s happening at these price levels.”

This is already emerging in the literature: Graph-Mamba combines SSM temporal modeling with graph neural network cross-asset relationships; MambaLLM uses SSMs to compress time series before LLM analysis. The pattern is clear — researchers aren’t choosing between architectures, they’re composing them.

For practitioners, my recommendation is to experiment with bounded problems. Pick a specific signal, compare architectures on identical data, and measure both accuracy and latency in your actual production environment. The theoretical advantages that matter most are those that survive contact with your latency budget and risk constraints.

The post-Transformer era isn’t about replacement — it’s about selection. Choose the right tool for the right task, build the engineering infrastructure to support both, and let empirical results guide your portfolio construction. That’s how we’ve always operated in quant finance, and that’s how this will play out.

I’m continuing to experiment. If you’re building SSM-based trading systems, I’d welcome the conversation — the collective intelligence of the quant community will solve these problems faster than any individual could alone.

References

  1. Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv preprint arXiv:2312.00752. https://arxiv.org/abs/2312.00752
  2. Gu, A., Goel, K., & Ré, C. (2022). Efficiently Modeling Long Sequences with Structured State Spaces. In International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=uYLFoz1vlAC
  3. Linna, E., et al. (2025). LOBERT: Generative AI Foundation Model for Limit Order Book Messages. arXiv preprint arXiv:2511.12563. https://arxiv.org/abs/2511.12563
  4. (2025). LiT: Limit Order Book Transformer. Frontiers in Artificial Intelligence. https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2025.1616485/full
  5. Avellaneda, M., & Stoikov, S. (2008). High-frequency trading in a limit order book. Quantitative Finance, 8(3), 217–224. (Manuscript PDF) https://people.orie.cornell.edu/sfs33/LimitOrderBook.pdf

Statistical Arbitrage with Synthetic Data

In my last post I mapped out how one could test the reliability of a single stock strategy (for the S&P 500 Index) using synthetic data generated by the new algorithm I developed.

Developing Trading Strategies with Synthetic Data

As this piece of research follows a similar path, I won’t repeat all those details here. The key point addressed in this post is that not only are we able to generate consistent open/high/low/close prices for individual stocks, we can do so in a way that preserves the correlations between related securities. In other words, the algorithm not only replicates the time series properties of individual stocks, but also the cross-sectional relationships between them. This has important applications for the development of portfolio strategies and portfolio risk management.

KO-PEP Pair

To illustrate this I will use synthetic daily data to develop a pairs trading strategy for the KO-PEP pair.

The two price series are highly correlated, which potentially makes them a suitable candidate for a pairs trading strategy.

There are numerous ways to trade a pairs spread such as dollar neutral or beta neutral, but in this example I am simply going to look at trading the price difference. This is not a true market neutral approach, nor is the price difference reliably stationary. However, it will serve the purpose of illustrating the methodology.

Historical price differences between KO and PEP

Obviously it is crucial that the synthetic series we create behave in a way that replicates the relationship between the two stocks, so that we can use it for strategy development and testing. Ideally we would like to see high correlations between the synthetic and original price series as well as between the pairs of synthetic price data.

We begin by using the algorithm to generate 100 synthetic daily price series for KO and PEP and examine their properties.

Correlations

As we saw previously, the algorithm is able to generate synthetic data with correlations to the real price series ranging from below zero to close to 1.0:

Distribution of correlations between synthetic and real price series for KO and PEP

The crucial point, however, is that the algorithm has been designed to also preserve the cross-sectional correlation between the pairs of synthetic KO-PEP data, just as in the real data series:

Distribution of correlations between synthetic KO and PEP price series

Some examples of highly correlated pairs of synthetic data are shown in the plots below:

In addition to correlation, we might also want to consider the price differences between the pairs of synthetic series, since the strategy will be trading that price difference, in the simple approach adopted here. We could, for example, select synthetic pairs for which the divergence in the price difference does not become too large, on the assumption that the series difference is stationary. While that approach might well be reasonable in other situations, here an assumption of stationarity would be perhaps closer to wishful thinking than reality. Instead we can use a selection of synthetic pairs with high levels of cross-correlation, as well as high levels of correlation with the real price data. We can also select for high correlation between the price differences for the real and synthetic price series.
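The selection criteria just described can be sketched in a few lines of plain Python. This is an illustration only: the function names, the thresholds and the tiny price lists are hypothetical, and it says nothing about how the synthetic series themselves are generated.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(vx * vy)
    return cov / denom if denom > 0 else 0.0


def select_pairs(real_a, real_b, synthetic_pairs,
                 min_cross=0.8, min_real=0.8, min_diff=0.8):
    """Return indices of synthetic pairs passing all three correlation filters."""
    real_diff = [a - b for a, b in zip(real_a, real_b)]
    selected = []
    for idx, (syn_a, syn_b) in enumerate(synthetic_pairs):
        syn_diff = [a - b for a, b in zip(syn_a, syn_b)]
        cross = pearson(syn_a, syn_b)                      # synthetic A vs synthetic B
        to_real = min(pearson(syn_a, real_a), pearson(syn_b, real_b))
        diff_corr = pearson(syn_diff, real_diff)           # spread vs real spread
        if cross >= min_cross and to_real >= min_real and diff_corr >= min_diff:
            selected.append(idx)
    return selected


# Toy data standing in for real and synthetic closing prices.
real_a = [10.0, 11.0, 12.0, 13.0, 15.0]
real_b = [20.0, 21.0, 23.0, 24.0, 25.0]
synthetic_pairs = [
    ([11.0, 12.0, 13.0, 14.0, 16.0], [21.0, 22.0, 24.0, 25.0, 26.0]),  # tracks both
    ([16.0, 14.0, 13.0, 12.0, 11.0], [21.0, 22.0, 24.0, 25.0, 26.0]),  # A inverted
]
selected = select_pairs(real_a, real_b, synthetic_pairs)
```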

Strategy Development & WFO Testing

Once again we follow the procedure for strategy development outlined in the previous post, except that, in addition to a selection of synthetic price difference series, we also include 14-day correlations between the pairs. We use synthetic daily data from 1999 to 2012 to build the strategy and use the data from 2013 onwards for testing/validation. Eventually, after 50 generations we arrive at the result shown in the figure below:

As before, the equity curves for the individual synthetic pairs are shown towards the bottom of the chart, while the aggregate equity curve, which is a composition of the results for all of the synthetic pairs, is shown above in green. Clearly the results appear encouraging.

As a final step we apply the WFO analysis procedure described in the previous post to test the performance of the strategy on the real data series, using a variable number of in-sample and out-of-sample periods of differing size. The results of the WFO cluster test are as follows:

The results are not so unequivocal as for the strategy developed for the S&P 500 index, but would nonetheless be regarded as acceptable, since the strategy passes the great majority of the tests (in addition to the tests on synthetic pairs data).

The final results appear as follows:

Conclusion

We have demonstrated how the algorithm can be used to generate synthetic price series that preserve not only the important time series properties, but also the cross-sectional properties between series for correlated securities. This important feature has applications in the development of statistical arbitrage strategies, portfolio construction methodology and in portfolio risk management.

Developing Trading Strategies With Synthetic Data

One of the main criticisms levelled at systematic trading over the last few years is that the over-use of historical market data has tended to produce curve-fitted strategies that perform poorly out of sample in a live trading environment. This is indeed a valid criticism – given enough attempts one is bound to arrive eventually at a strategy that performs well in backtest, even on a holdout data sample. But that by no means guarantees that the strategy will continue to perform well going forward.

The solution to the problem has been clear for some time: what is required is a method of producing synthetic market data that can be used to build a strategy and test it under a wide variety of simulated market conditions. A strategy built in this way is more likely to survive the challenge of live trading than one that has been developed using only a single historical data path.

The problem, however, has been in implementation. Up until now all the attempts to produce credible synthetic price data have failed, for one reason or another, as I described in an earlier post:

I have been able to devise a completely new algorithm for generating artificial price series that meet all of the key requirements, as follows:

  • Computational simplicity & efficiency. Important if we are looking to mass-produce synthetic series for a large number of assets, for a variety of different applications. Some deep learning methods would struggle to meet this requirement, even supposing that transfer learning is possible.
  • The ability to produce price series that are internally consistent (i.e. High > Low, etc.) in every case.
  • Should be able to produce a range of synthetic series that vary widely in their correspondence to the original price series. In some cases we want synthetic price series that are highly correlated to the original; in other cases we might want to test our investment portfolio or risk control systems under extreme conditions never before seen in the market.
  • The distribution of returns in the synthetic series should closely match the historical series, being non-Gaussian and with “fat-tails”.
  • The ability to incorporate long memory effects in the sequence of returns.
  • The ability to model GARCH effects in the returns process.
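The internal-consistency requirement in the list above is easy to check mechanically. Here is a minimal sketch, assuming each bar is an (open, high, low, close) tuple and that consistency means the high and low bracket the open and close with all prices positive; the exact rules enforced by the algorithm itself are not specified in this post.

```python
def is_consistent_bar(open_, high, low, close):
    """High must be the bar's maximum, low its minimum, and prices positive."""
    return low > 0 and high >= max(open_, close) and low <= min(open_, close)


def find_inconsistent_bars(bars):
    """Return the indices of bars that violate the OHLC ordering rules."""
    return [i for i, bar in enumerate(bars) if not is_consistent_bar(*bar)]


# Toy bars: the second one has a high below its open, so it should be flagged.
bars = [
    (10.0, 10.5, 9.8, 10.2),
    (10.2, 10.1, 9.9, 10.0),
    (10.0, 10.3, 10.0, 10.3),
]
bad_bars = find_inconsistent_bars(bars)
```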

This means that we are now in a position to develop trading strategies without any direct reference to the underlying market data. Consequently we can then use all of the real market data for out-of-sample back-testing.

Developing a Trading Strategy for the S&P 500 Index Using Synthetic Market Data

To illustrate the procedure I am going to use daily synthetic price data for the S&P 500 Index over the period from Jan 1999 to July 2022. Details of the characteristics of the synthetic series are given in the post referred to above.

Because we want to create a trading strategy that will perform under market conditions close to those currently prevailing, I will downsample the synthetic series to include only those that correlate quite closely, i.e. with a minimum correlation of 0.75, with the real price data.

Why do this? Surely if we want to make a strategy as robust as possible we should use all of the synthetic data series for model development?

The reason is that I believe that some of the more extreme adverse scenarios generated by the algorithm may occur quite rarely, perhaps once in every few decades. However, I am principally interested in a strategy that I can apply under current market conditions and I am prepared to take my chances that the worst-case scenarios are unlikely to come about any time soon. This is a major design decision, one that you may disagree with. Of course, one could make use of every available synthetic data series in the development of the trading model and by doing so it is likely that you would produce a model that is more robust. But the training could take longer and the performance during normal market conditions may not be as good.

Having generated the price series, the process I am going to follow is to use genetic programming to develop trading strategies that will be evaluated on all of the synthetic data series simultaneously. I will then use the performance of the aggregate portfolio, i.e. the outcome of all of the trades generated by the strategy when applied to all of the synthetic series, to assess the overall performance. In order to be considered, candidate strategies have to perform well under all of the different market scenarios, or at least the great majority of them. This ensures that the strategy is likely to prove more robust across different types of market conditions, rather than on just the single type of market scenario observed in the real historical series.

As usual in these cases I will reserve a portion (10%) of each data series for testing each strategy, and a further 10% sample for out-of-sample validation. This isn’t strictly necessary: since the real data series has not been used directly in the development of the trading system, we can later test the strategy on all of the historical data and regard this as an out-of-sample backtest.
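For readers reproducing this outside Builder, the split amounts to slicing each series chronologically. The sketch below assumes the test and validation segments are taken from the end of the series in that order, which is a common convention but not something the software's internals are confirmed to do.

```python
def split_series(series, test_frac=0.10, validation_frac=0.10):
    """Chronological split into build, test and validation segments."""
    n = len(series)
    n_test = round(n * test_frac)
    n_validation = round(n * validation_frac)
    n_build = n - n_test - n_validation
    build = series[:n_build]
    test = series[n_build:n_build + n_test]
    validation = series[n_build + n_test:]
    return build, test, validation


prices = list(range(100))  # placeholder for one synthetic price series
build, test, validation = split_series(prices)
```

Keeping the segments in time order, rather than sampling at random, avoids leaking later price information into the build period.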

To implement the procedure I am going to use Mike Bryant’s excellent Adaptrade Builder software.

This is an exemplar of outstanding software engineering and provides a broad range of features for generating trading strategies of every kind. One feature of Builder that is particularly useful in this context is its ability to construct strategies and test them on up to 20 data series concurrently. This enables us to develop a strategy using all of the synthetic data series simultaneously, showing the performance of each individual strategy as well as for the aggregate portfolio.

After evolving strategies for 50 generations we arrive at the following outcome:

The equity curve for the aggregate portfolio is shown in blue, while the equity curves for the strategy applied to individual synthetic data series are shown towards the bottom of the chart. Of course, the performance of the aggregate portfolio appears much superior to any of the individual strategies, because it is effectively the arithmetic sum of the individual equity curves. And just because the aggregate portfolio appears to perform well both in-sample and out-of-sample, that doesn’t imply that the strategy works equally well for every individual market scenario. In some scenarios it performs better than in others, as can be observed from the individual equity curves.

But, in any case, our objective here is not to create a stock portfolio strategy, but rather to trade a single asset – the S&P 500 Index. The role of the aggregate portfolio is simply to suggest that we may have found a strategy that is sufficiently robust to work well across a variety of market conditions, as represented by the various synthetic price series.

Builder generates code for the strategies it evolves in a number of different languages and in this case we take the EasyLanguage code for the fittest strategy #77 and apply it to a daily chart for the S&P 500 Index – i.e. the real data series – in Tradestation, with the following results:

The strategy appears to work well “out-of-the-box”, i.e. without any further refinement. So our quest for a robust strategy appears to have been quite successful, given that none of the 23-year span of real market data on which the strategy was tested was used in the development process.

We can take the process a little further, however, by “optimizing” the strategy. Traditionally this would mean finding the optimal set of parameters that produces the highest net profit on the test data. But this would be curve fitting in the worst possible sense, and is not at all what I am suggesting.

Instead we use a procedure known as Walk Forward Optimization (WFO), as described in this post:

The goal of WFO is not to curve-fit the best parameters, which would entirely defeat the object of using synthetic data. Instead, its purpose is to test the robustness of the strategy. We accomplish this by using a sequence of overlapping in-sample and out-of-sample periods to evaluate how well the strategy stands up, assuming the parameters are optimized on in-sample periods of varying size and start date and tested on similarly varying out-of-sample periods. A strategy that fails a cluster of such tests is unlikely to prove robust in live trading. A strategy that passes a test cluster at least demonstrates some capability to perform well in different market regimes.
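To show what a cluster of walk-forward tests looks like in terms of data windows, here is a minimal sketch that generates rolling in-sample and out-of-sample index ranges for a given OOS percentage and number of runs. The windowing scheme (each window steps forward by one OOS length, and the last OOS period ends at the final bar) is an assumption for illustration and is not taken from Tradestation's implementation.

```python
def walk_forward_windows(n_bars, oos_fraction, n_runs):
    """Return (start, in_sample_end, out_of_sample_end) index triples."""
    # Solve for the window length so the final OOS period ends at n_bars.
    window = n_bars / ((1.0 - oos_fraction) + n_runs * oos_fraction)
    oos_len = int(window * oos_fraction)
    is_len = int(window) - oos_len
    windows = []
    for run in range(n_runs):
        start = run * oos_len            # each run steps forward by one OOS period
        is_end = start + is_len
        windows.append((start, is_end, is_end + oos_len))
    return windows


def cluster_grid(n_bars, oos_fractions, run_counts):
    """One walk-forward schedule per (OOS fraction, run count) combination."""
    return {
        (frac, runs): walk_forward_windows(n_bars, frac, runs)
        for frac in oos_fractions
        for runs in run_counts
    }


windows = walk_forward_windows(n_bars=1200, oos_fraction=0.25, n_runs=5)
grid = cluster_grid(1200, oos_fractions=[0.10, 0.25], run_counts=[5, 10])
```

In each window the parameters would be optimized on bars `start` to `is_end` and then evaluated, untouched, on bars `is_end` to `oos_end`. The cluster is simply the same exercise repeated over every cell of the grid.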

To some extent we might regard such a test as unnecessary, given that the strategy has already been observed to perform well under several different market conditions, encapsulated in the different synthetic price series, in addition to the real historical price series. Nonetheless, we conduct a WFO cluster test to further evaluate the robustness of the strategy.

As the goal of the procedure is not to maximize the theoretical profitability of the strategy, but rather to evaluate its robustness, we select a criterion other than net profit as the factor to optimize. Specifically, we select the sum of the areas of the strategy drawdowns as the quantity to minimize (by maximizing the inverse of the sum of drawdown areas, which amounts to the same thing). This requires a little explanation.

If we look at the strategy drawdown periods of the equity curve, we observe several periods (highlighted in red) in which the strategy was underwater:

The area of each drawdown represents the length and magnitude of the drawdown and our goal here is to minimize the sum of these areas, so that we reduce both the total duration and severity of strategy drawdowns.
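As a rough sketch of the calculation, the total drawdown area can be approximated by summing, bar by bar, the distance between the equity curve and its running peak. The function names are hypothetical, and the choice to measure depth in currency units with one unit of time per bar is an assumption; a percentage-based depth would work equally well.

```python
from itertools import accumulate


def drawdown_area(equity):
    """Sum of the areas of all drawdowns, in (currency units x bars).

    Depth at each bar is the running peak minus current equity, so longer
    and deeper drawdowns both increase the total.
    """
    running_peak = accumulate(equity, max)
    return float(sum(peak - value for peak, value in zip(running_peak, equity)))


def fitness(equity):
    """Quantity to maximize: the inverse of the total drawdown area."""
    area = drawdown_area(equity)
    return float("inf") if area == 0 else 1.0 / area


curve = [100, 110, 105, 108, 120, 115]
print(drawdown_area(curve))  # underwater depths are 0, 0, 5, 2, 0, 5
```

Maximizing `fitness` is equivalent to minimizing the summed drawdown area, so the optimizer is rewarded for shortening drawdowns as well as for making them shallower.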

In each WFO test we use different % of OOS data and a different number of runs, assessing the performance of the strategy on a battery of different criteria:


These criteria not only include overall profitability, but also factors such as parameter stability, profit consistency in each test, the ratio of in-sample to out-of-sample profits, etc. In other words, this WFO cluster analysis is not about profit maximization, but robustness evaluation, as assessed by these several different metrics. And in this case the strategy passes every test with flying colors:

Other than validating the robustness of the strategy’s performance, the overall effect of the procedure is to slightly improve the equity curve by diminishing the magnitude and duration of the drawdown periods:

Conclusion

We have shown how, by using synthetic price series, we can build a robust trading strategy that performs well under a variety of different market conditions, including on previously “unseen” historical market data. Further analysis using cluster WFO tests strengthens the assessment of the strategy’s robustness.

Backtest vs. Trading Reality

Kris Sidial, whose Twitter posts are often interesting, recently posted about the reality of trading profitability vs backtest performance, as follows:

While I certainly agree that the latter example is more representative of a typical trader’s P&L, I don’t concur that the first P&L curve is necessarily “99.9% garbage”. There are many strategies that have equity curves that are smoother and more monotonic than those of Kris’s Skeleton Case V2 strategy. Admittedly, most of these lie in the area of high frequency, which is not Kris’s area of expertise. But there are also lower frequency strategies that produce results which are not dissimilar to those shown in the first chart.

As a case in point, consider the following strategy for the S&P 500 E-Mini futures contract, described in more detail below. The strategy was developed using 15-minute bar data from 1999 to 2012, and traded live thereafter. The live and backtest performance characteristics are almost indistinguishable, not only in terms of rate of profit, but also in regard to strategy characteristics such as the number of trades, % win rate and profit factor.

Just in case you think the picture is a little too rosy, I would point out that the average profit factor is 1.25, which means that the strategy is generating only 25% more in profits than losses. There will be big losing trades from time to time and long sequences of losses during which the strategy appears to have broken down. It takes discipline to resist the temptation to “fix” the strategy during extended drawdowns and instead rely on reversion to the mean rate of performance over the long haul. One source of comfort to the trader through such periods is that the 60% win rate means that the majority of trades are profitable.
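It is worth working through what these two numbers imply together. The sketch below assumes the usual definition of profit factor as gross profit divided by gross loss and ignores breakeven trades; the function names are my own.

```python
def implied_win_loss_ratio(profit_factor, win_rate):
    """Average winning trade divided by average losing trade.

    From profit_factor = (win_rate * avg_win) / ((1 - win_rate) * avg_loss).
    """
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return profit_factor * (1.0 - win_rate) / win_rate


def expectancy_in_loss_units(profit_factor, win_rate):
    """Expected profit per trade, expressed as a multiple of the average loss."""
    ratio = implied_win_loss_ratio(profit_factor, win_rate)
    return win_rate * ratio - (1.0 - win_rate)


ratio = implied_win_loss_ratio(1.25, 0.60)
edge = expectancy_in_loss_units(1.25, 0.60)
print(f"avg win / avg loss = {ratio:.2f}, expectancy = {edge:.2f} x avg loss")
```

With a profit factor of 1.25 and a 60% win rate, the average winner is only about 0.83 times the size of the average loser, and the expected profit per trade is roughly one tenth of the average loss. The edge is real but thin, which is why long losing sequences are to be expected.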

As you read through the replies to Kris’s post, you will see that several of his readers make the point that strategies with highly attractive equity curves and performance characteristics are typically capital constrained. This is true in the case of this strategy, which I trade with a very modest amount of (my own) capital. Even trading one-lots in the E-Mini futures I occasionally experience missed trades, either on entry or exit, due to limit orders not being filled at the high or low of a bar. In scaling the strategy up to something more meaningful such as a 10-lot, there would be multiple partial fills to deal with. But I think it would be a mistake to abandon a high performing strategy such as this just because of an apparent capacity constraint. There are several approaches one can explore to address the issue, which may be enough to make the strategy scalable.

Where (as here) the issue of scalability relates to the strategy fill rate on limit orders, a good starting point is to compute the extreme hit rate, which is the proportion of trades that take place at the high or low of the bar. As a rule of thumb, for strategies running on typical low frequency infrastructure an extreme hit rate of 10% or less is manageable; anything above that level quickly becomes problematic. If the extreme hit rate is very high, e.g. 25% or more, then you are going to have to pay a great deal of attention to the issues of latency and order priority to make the strategy viable in practice. Ultimately, for a high frequency market making strategy, most orders are filled at the extreme of each “bar”, so almost all of the focus is on minimizing latency and maintaining a high queue priority, with all of the attendant concerns regarding trading hardware, software and infrastructure.
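The extreme hit rate is simple to compute from a backtest trade list. The sketch below assumes each simulated fill is recorded together with the high and low of the bar in which it occurred; the tuple layout and function name are illustrative, and the default tick size of 0.25 is that of the E-Mini contract.

```python
def extreme_hit_rate(fills, tick_size=0.25):
    """Proportion of fills that occur at the high or low of their bar.

    fills: iterable of (fill_price, bar_high, bar_low) tuples.
    Prices within half a tick of the bar extreme count as a hit, which
    guards against floating-point noise in the price data.
    """
    fills = list(fills)
    if not fills:
        raise ValueError("no fills supplied")
    tolerance = tick_size / 2.0
    hits = sum(
        1
        for price, high, low in fills
        if abs(price - high) < tolerance or abs(price - low) < tolerance
    )
    return hits / len(fills)


sample_fills = [
    (4500.25, 4503.00, 4498.50),  # inside the bar
    (4498.50, 4503.00, 4498.50),  # at the bar low: unlikely to fill live
    (4510.00, 4512.75, 4507.25),  # inside the bar
    (4511.50, 4512.75, 4507.25),  # inside the bar
]
print(f"extreme hit rate: {extreme_hit_rate(sample_fills):.0%}")
```

A fill recorded exactly at the bar extreme means the market touched the limit price without trading through it, which is precisely the case in which a live order resting in the queue may go unfilled.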

Next, you need a strategy for handling missed trades. You could, for example, decide to skip any entry trades that are missed, while manually entering unfilled exit trades at the market. Or you could post market orders for both entry and exit trades if they are not filled. An extreme solution would be to substitute market-if-touched orders for limit orders in your strategy code. But this would affect all orders generated by the system, not just the 10% at the high or low of the bar, and is likely to have a very adverse effect on overall profitability, especially if the average trade is low (because you are paying an extra tick on entry and exit of every trade).

The above suggests that you are monitoring the strategy manually, running simulation and live versions side by side, so that you can pick up any trades that the strategy should have taken, but which have been missed. This may be practical for a strategy that trades during regular market hours, but not for one that also trades the overnight session.

An alternative approach, one that is commonly applied by systematic traders, is to automate the handling of missed trades. Typically the trader will set a parameter that converts a limit order to a market order X seconds after a limit price has been traded but not filled. Of course, this will result in paying up an extra tick (or more) to enter trades that perhaps would have been filled if one had waited longer than X seconds. It will have some negative impact on strategy profitability, but not too much if the extreme hit rate is low. I tend to use this method for exit trades, preferring to skip any entry trades that don’t get filled at the limit price.
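The logic of such a rule can be sketched in a few lines. Everything named here (the `WorkingLimitOrder` class, the action strings, and the decision to cancel unfilled entries after the same timeout) is a hypothetical illustration of the approach rather than the code I run, and a real implementation would sit behind whatever order API your platform provides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkingLimitOrder:
    side: str                           # "buy" or "sell"
    limit_price: float
    is_exit: bool                       # True for exits, False for entries
    touched_at: Optional[float] = None  # time the market first traded at the limit
    filled: bool = False


def record_trade(order, trade_price, now):
    """Note the first time the market trades at (or through) the limit price."""
    if order.filled or order.touched_at is not None:
        return
    if order.side == "buy":
        touched = trade_price <= order.limit_price
    else:
        touched = trade_price >= order.limit_price
    if touched:
        order.touched_at = now


def next_action(order, now, timeout_s):
    """Decide what to do with a working order X = timeout_s seconds after a touch."""
    if order.filled:
        return "none"
    if order.touched_at is None or now - order.touched_at < timeout_s:
        return "wait"
    # Exits are converted to market orders; missed entries are skipped.
    return "convert_to_market" if order.is_exit else "cancel"
```

The timeout is the parameter to tune: a short value limits the risk of being left in a position, while a longer one gives the queue more time to reach your order before you pay the extra tick.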

Beyond these simple measures, there are several other ways to extend the capacity of the strategy. An obvious place to start is by evaluating strategy performance on different session times and bar lengths. So, in this case, we might look at deploying the strategy on both the day and night sessions. We can also evaluate performance on bars of different length. This will give different entry and exit points for individual trades, and trades that are at the extreme of a bar on one timeframe may not be at the high or low of a bar on the other timeframe. For example, here is the (simulated) performance of the strategy on 13-minute bars:

There is a reason for choosing a bar interval such as 13 minutes, rather than the more commonplace 5 or 10 minutes, as explained in this post:

Finally, it is worth exploring whether the strategy can be applied to other related markets such as NQ futures, for example. Typically this will entail some change to the strategy code to reflect the difference in price levels, but the thrust of the strategy logic will be similar. Another approach is to use the signals from the current strategy as inputs – i.e. alpha generators – for a derivative strategy, such as trading the SPY ETF based on signals from the ES strategy. The performance of the derived strategy may not be as good, but in a product like SPY the capacity might be larger.

Master’s in High Frequency Finance

I have been discussing with some potential academic partners the concept for a new graduate program in High Frequency Finance.  The idea is to take the concept of the Computational Finance program developed in the 1990s and update it to meet the needs of students in the 2010s.

The program will offer a thorough grounding in the modeling concepts, trading strategies and risk management procedures currently in use by leading investment banks, proprietary trading firms and hedge funds in US and international financial markets.  Students will also learn the necessary programming and systems design skills to enable them to make an effective contribution as quantitative analysts, traders, risk managers and developers.

I would be interested in feedback and suggestions as to the proposed content of the program.

The Hedged Volatility Strategy

Being short regular Volatility ETFs or long Inverse Volatility ETFs are winning strategies…most of the time. The challenge is that when the VIX spikes or when the VIX futures curve is downward sloping instead of upward sloping, very significant losses can occur. Many people have built and back-tested models that attempt to move from long to short to neutral positions in the various Volatility ETFs, but almost all of them have one or both of these very significant flaws: 1) Failure to use “out of sample” back-testing and 2) Failure to protect against “black swan” events.

In this strategy a position and weighting in the appropriate Volatility ETFs are established based on a multi-factor model which always uses out of sample back-testing to determine effectiveness. Volatility Options are always used to protect against significant short-term moves which left unchecked could result in the total loss of one’s portfolio value; these options will usually lose money, but that is a small price to pay for the protection they provide. (Strategies should be scaled at a minimum of 20% to ensure options protection.)

This is a good strategy for IRA accounts in which short selling is not allowed. Long positions in Inverse Volatility ETFs are typically held. Suggested minimum capital: $26,000 (using 20% scaling).