From Hype to Reality: Building a Hybrid Transformer-MVO Pipeline

A Five-Way Decomposition of What Actually Drives Risk-Adjusted Returns in an AI Portfolio

The quantitative finance space is currently flooded with claims of deep learning models generating massive, effortless alpha. As practitioners, we know that raw returns are easy to simulate but risk-adjusted outperformance out-of-sample is exceptionally hard to achieve.

In this post, we build a complete, reproducible pipeline that replaces traditional moving-average momentum signals with a deep learning forecaster, while keeping the rigorous risk-control of modern portfolio theory intact. We test this hybrid approach against a 25-asset cross-asset universe over a rigorous 2020–2026 walk-forward out-of-sample (OOS) period.

Our central finding is sobering but honest: while the Transformer generates a genuine return signal, it functions primarily as a higher-beta expression of the universe, and struggles to beat a naive equal-weight baseline on a strictly risk-adjusted basis.

Here is how we built it, and what the numbers actually show.


1. The Architecture: Separation of Concerns

A robust quant pipeline separates the return forecast (the alpha model) from the portfolio construction (the risk model). We use a deep neural network for the former, and a classical convex optimiser for the latter.

Pipeline Architecture
  1. Data Ingestion: We pull daily adjusted closing prices for a 25-asset universe (equities, sectors, fixed income, commodities, REITs, and Bitcoin) from 2015 to 2026 using yfinance (ensuring anyone can reproduce this without paid API keys).
  2. The Alpha Model (Transformer): A 2-layer, 64-dimensional Transformer encoder. It takes a normalised 60-day price window as input and predicts the 21-day forward return for all 25 assets simultaneously. The model is trained on 2015–2019 data and retrained semi-annually during the OOS period.
  3. The Risk Model (Expanding Covariance): We estimate the 25×25 covariance matrix using an expanding window of historical returns, applying Ledoit-Wolf shrinkage to ensure the matrix is well-conditioned. (Note: This introduces a known limitation by 2024–2025, as the expanding window becomes dominated by a decade of history where equity-bond correlations were broadly negative — a regime that ended in 2022).
  4. The Optimiser (scipy SLSQP): We use scipy.optimize.minimize to solve the constrained mean-variance problem directly: maximise the risk-adjusted return (Sharpe) subject to a fully-invested constraint (Σ wᵢ = 1) and a strict long-only, 20% max-position-size constraint (0 ≤ wᵢ ≤ 0.20). A minimal sketch of steps 3 and 4 follows below.
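
The covariance and optimisation steps are compact enough to sketch. The snippet below is an illustration, not the production code: it assumes `returns` is a DataFrame of daily asset returns and `mu` is the vector of 21-day forecast returns for the 25 assets; the risk-free conversion, the equal-weight fallback, and the omission of horizon scaling are simplifying assumptions.

import numpy as np
from sklearn.covariance import LedoitWolf
from scipy.optimize import minimize

def build_covariance(returns):
    """Ledoit-Wolf shrunk covariance from a (T x N) return matrix (step 3)."""
    return LedoitWolf().fit(returns.values).covariance_

def solve_max_sharpe(mu, sigma, rf=0.0275 / 12, w_max=0.20):
    """Long-only, capped max-Sharpe weights via SLSQP (step 4)."""
    n = len(mu)

    def neg_sharpe(w):
        excess = w @ mu - rf
        vol = np.sqrt(w @ sigma @ w)
        return -excess / vol

    constraints = [{"type": "eq", "fun": lambda w: w.sum() - 1.0}]  # fully invested
    bounds = [(0.0, w_max)] * n                                     # long-only, 20% cap
    w0 = np.full(n, 1.0 / n)                                        # equal-weight start
    res = minimize(neg_sharpe, w0, method="SLSQP",
                   bounds=bounds, constraints=constraints)
    return res.x if res.success else w0   # fall back to equal weight on solver failure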

2. Experimental Design: The Five-Way Comparison

To truly understand what the Transformer is doing, we cannot simply compare it to SPY. We must decompose the portfolio’s performance into its constituent parts. We test five strategies:

  1. Equal-Weight Baseline: 4% allocated to each of the 25 assets, rebalanced monthly. This isolates the raw diversification benefit of the universe.
  2. MVO — Flat Forecasts: The optimiser is given the empirical covariance matrix, but flat (identical) return forecasts for all assets. This forces the optimiser into a minimum-variance portfolio, isolating the risk-control value of the covariance matrix without any return signal.
  3. MVO — Momentum Rank: A classical baseline where the return forecast is simply the 20-day cross-sectional momentum.
  4. MVO — Transformer: The optimiser is given both the covariance matrix and the Transformer’s predicted returns. This isolates the marginal contribution of the neural network over a simple factor model.
  5. SPY Buy-and-Hold: The standard equity benchmark.

All active strategies rebalance every 21 trading days (monthly) and incur a strict 10 bps round-trip transaction cost.
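
For concreteness, the forecast behind strategy 3 really is a one-liner. A sketch, assuming `prices` is a DataFrame of adjusted closes for the 25 assets:

# Strategy 3's return forecast: trailing 20-day total return per asset,
# fed to the optimiser in place of the Transformer's predictions.
mu_momentum = prices.pct_change(20).iloc[-1].values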


3. The Results: Returns vs. Risk

The walk-forward OOS period runs from January 2020 through February 2026, covering the COVID crash, the 2021 bull run, the 2022 bear market, and the subsequent recovery.

(Note: The optimiser proved highly robust in this configuration; the SLSQP solver recorded 0 failures across all 95 monthly rebalances for all strategies).

Full-Period Bar Chart
Strategy            | CAGR   | Ann. Volatility | Sharpe (rf=2.75%) | Max Drawdown | Calmar Ratio | Avg. Monthly Turnover*
MVO — Momentum      | 16.81% | 14.85%          | 0.95              | -29.27%      | 0.57         | ~15–20%
MVO — Transformer   | 16.34% | 16.28%          | 0.83              | -32.66%      | 0.50         | ~15–20%
SPY Buy-and-Hold    | 14.69% | 17.06%          | 0.70              | -33.72%      | 0.44         | 0%
Equal-Weight        | 12.76% |  9.63%          | 1.04              | -16.46%      | 0.78         | ~2–4% (drift)
MVO — Flat          |  2.30% |  5.15%          | -0.09             | -16.35%      | 0.14         | 6.1%

*Turnover for active strategies is estimated; Transformer turnover is structurally similar to Momentum due to the model learning a noisy, momentum-like signal with similar autocorrelation.

The results reveal a clear hierarchy:

  1. The optimiser without a signal is defensive but unprofitable. MVO-Flat achieves a remarkably low volatility (5.15%) but generates only 2.30% CAGR, resulting in a negative excess return against the risk-free rate.
  2. Equal-Weight wins on risk-adjusted terms. The naive Equal-Weight baseline achieves a superior Sharpe ratio (1.04) and a starkly superior Calmar ratio (0.78 vs 0.50) with roughly half the drawdown (-16.5%) of the active strategies.
  3. The Transformer is beaten by simple momentum. This is the most important finding in the paper. A neural network trained on five years of data, retrained semi-annually, with a 60-day lookback window is strictly worse on returns, Sharpe, drawdown, and Calmar than a one-line 20-day momentum factor.

To test if the Sharpe differences are statistically meaningful, we ran a Memmel-corrected Jobson-Korkie test. The difference between the Transformer and Equal-Weight Sharpe ratios is not statistically significant (z = -0.47, p = 0.64). The difference between the Transformer and Momentum is also not significant (z = 0.88, p = 0.38). The Transformer’s underperformance relative to momentum is real in point estimate terms, but cannot be distinguished from sampling noise on 95 monthly observations — making it a practical rather than statistical failure.
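
For reference, the test is short enough to show in full. A sketch, assuming `r1` and `r2` are aligned arrays of monthly excess returns for the two strategies being compared:

import numpy as np
from scipy.stats import norm

def jobson_korkie_memmel(r1, r2):
    """Memmel-corrected Jobson-Korkie z-test for equality of two Sharpe ratios."""
    T = len(r1)
    sr1 = r1.mean() / r1.std(ddof=1)
    sr2 = r2.mean() / r2.std(ddof=1)
    rho = np.corrcoef(r1, r2)[0, 1]
    # Asymptotic variance of the Sharpe-ratio difference (Memmel, 2003)
    theta = (2 - 2 * rho + 0.5 * (sr1**2 + sr2**2) - sr1 * sr2 * rho**2) / T
    z = (sr1 - sr2) / np.sqrt(theta)
    return z, 2 * norm.sf(abs(z))   # z-statistic and two-sided p-value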


4. Sub-Period Analysis: Where the Model Wins and Loses

Looking at the full 6-year period masks how these strategies behave in different market regimes. Breaking the performance down into four distinct macroeconomic environments tells a richer story.

(Note: Sub-period CAGRs are chain-linked. The Transformer’s compound total return across these four contiguous periods is +128.6%, perfectly matching the full-period CAGR of 16.34% over 6.2 years. Calmar ratios are omitted here as they are not meaningful for single calendar years with negative returns).

Sub-Period Chart
Regime                                       | Strategy          | CAGR   | Max Drawdown
COVID Crash & Recovery (Jan 2020 – Dec 2020) | MVO — Transformer | +25.2% | -32.6%
                                             | MVO — Momentum    | +17.1% | -33.6%
                                             | Equal-Weight      | +14.8% | -29.0%
                                             | MVO — Flat        | +10.0% | -11.3%
                                             | SPY               | +17.3% | -33.7%
Bull Run (Jan 2021 – Dec 2021)               | MVO — Transformer | +27.0% | -7.2%
                                             | MVO — Momentum    | +23.9% | -6.5%
                                             | Equal-Weight      | +19.0% | -5.0%
                                             | MVO — Flat        | +5.6%  | -2.8%
                                             | SPY               | +30.9% | -5.1%
Bear Market (Jan 2022 – Dec 2022)            | MVO — Transformer | -15.3% | -23.5%
                                             | MVO — Momentum    | -8.2%  | -21.3%
                                             | Equal-Weight      | -10.6% | -19.4%
                                             | MVO — Flat        | -11.2% | -15.3%
                                             | SPY               | -18.8% | -24.5%
Recovery & Rally (Jan 2023 – Feb 2026)       | MVO — Transformer | +23.3% | -13.7%
                                             | MVO — Momentum    | +24.5% | -13.3%
                                             | Equal-Weight      | +19.7% | -11.6%
                                             | MVO — Flat        | +9.4%  | -6.4%
                                             | SPY               | +22.0% | -18.8%

(The Transformer’s full-period maximum drawdown of -32.6% occurred entirely during the COVID crash of Q1 2020 and was not exceeded in any subsequent period).

The 2022 Bear Market Anomaly

Notice the performance of MVO-Flat in 2022. By design, MVO-Flat seeks the minimum-variance portfolio. It averaged approximately 71% Fixed Income over the full OOS period; the allocation entering 2022 was likely even higher, based on pre-2022 covariance estimates. In a normal equity bear market, these assets act as a safe haven. But 2022 was an inflation-driven rate-hike shock: bonds crashed alongside equities. Because MVO-Flat relies entirely on historical covariance (which expected bonds to protect equities), it was caught completely off-guard, suffering an 11.2% loss and a -15.3% drawdown.

The Equal-Weight baseline actually outperformed MVO-Flat in 2022 (-10.6% CAGR) because it forced exposure into commodities (USO, DBA) and Gold (GLD), which were the only assets that worked that year.


5. Under the Hood: Portfolio Composition

Why does the Transformer take on so much more volatility? The answer lies in how it allocates capital compared to the baselines.

Weights by Class
  • MVO-Flat is dominated by Fixed Income (68.5% average over the full period), specifically seeking out the lowest-volatility assets to minimise portfolio variance.
  • Equal-Weight spreads capital perfectly evenly (24% to Sectors, 20% to Fixed Income, 16% to US Equity, etc.).
  • MVO-Transformer acts as a “risk-on” engine. Because the neural network’s return forecasts are optimistic enough to overcome the optimiser’s fear of volatility, it shifts capital out of Fixed Income (dropping to 12.7%) and heavily into US Sectors (26.1%), US Equities (17.6%), and notably, Bitcoin (11.6%).

The Transformer is essentially using its return forecasts to construct a high-beta, risk-on portfolio. When markets rally (2020, 2021, 2023–2026), it outperforms. When they crash (2022), it suffers.


6. Model Calibration: The Spread Problem

Why did the neural network fail to beat a simple 20-day momentum factor? The answer lies in the calibration of its predictions.

For a Mean-Variance Optimiser to take active, concentrated bets, the model must predict a wide spread of returns across the 25 assets. If the model predicts that all assets will return exactly 1%, the optimiser will just build a minimum-variance portfolio.

Spread Calibration Time Series

Our diagnostics show a severe and persistent calibration issue. Over the 95 monthly rebalances:

  • The realised cross-sectional standard deviation of returns averaged 4.24%.
  • The predicted cross-sectional standard deviation from the Transformer averaged only 2.08% (with a tight P5–P95 band of 1.06% to 3.87%).

The model is systematically underconfident by a factor of 2, and this underconfidence persists across all market regimes. Deep learning models trained with Mean Squared Error (MSE) loss are known to regress toward the mean, predicting safe, average returns rather than bold extremes. Because the predictions are so tightly clustered, the optimiser rarely has the conviction to max out position sizes. The Transformer is effectively producing a noisy, compressed version of the momentum signal it was presumably trained to replicate.
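
The diagnostic itself is simple. A sketch, assuming `preds` and `realised` are (n_rebalances × n_assets) arrays of predicted and realised 21-day returns taken from the backtest engine:

import numpy as np

pred_spread = preds.std(axis=1)       # predicted cross-sectional std at each rebalance
real_spread = realised.std(axis=1)    # realised cross-sectional std at each rebalance

print(f"avg predicted spread: {pred_spread.mean():.2%}")
print(f"avg realised spread:  {real_spread.mean():.2%}")
print(f"mean compression factor: {(real_spread / pred_spread).mean():.1f}x")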


Conclusion: A Sober Reality

If we were trying to sell a product, we would point to the 16.3% CAGR, crop the chart to the 2023–2026 bull run, and declare victory.

But as quantitative researchers, the conclusion is different. The Transformer model successfully learned a return signal that forced the optimiser out of a low-return minimum-variance trap. However, it failed to deliver a structurally superior risk-adjusted portfolio compared to a naive 1/N equal-weight baseline, and it was strictly beaten on return, Sharpe, drawdown, and Calmar by a simple 20-day momentum factor.

The path forward isn’t necessarily a bigger neural network. It requires addressing the specific failures identified here:

  1. Fixing the mean-regression bias by replacing MSE with a pairwise ranking loss, forcing the model to explicitly separate winners from losers.
  2. Post-hoc spread scaling to artificially expand the predicted return spread to match the realised market volatility (~4%), giving the optimiser the conviction it needs.
  3. Dynamic covariance modelling (e.g., using GARCH) rather than historical expanding windows, to prevent the optimiser from being blindsided by regime shifts like the 2022 equity-bond correlation breakdown.
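
Fix 2 is the cheapest to prototype. A minimal sketch, assuming `mu_pred` is the vector of predicted 21-day returns at a rebalance and the ~4% target comes from the realised spread measured in the previous section:

import numpy as np

def rescale_spread(mu_pred, target_spread=0.04):
    """Expand the cross-sectional dispersion of the forecasts to a target level."""
    centred = mu_pred - mu_pred.mean()
    current = centred.std()
    if current < 1e-12:               # degenerate case: all forecasts identical
        return mu_pred
    return mu_pred.mean() + centred * (target_spread / current)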

(Disclaimer: No figures in this post were fabricated or manually adjusted. All results are direct outputs of the backtest engine).

Code for the full pipeline, including the PyTorch models and scipy optimisers, is available on GitHub: https://github.com/jkinlay/transformer_mvo_pipeline

Transformer Models for Alpha Generation: A Practical Guide

A Practical Guide to Attention Mechanisms in Quantitative Trading


Introduction

Quantitative researchers have always sought new methods to extract meaningful signals from noisy financial data. Over the past decade, the field has progressed from linear factor models through gradient-boosting ensembles to recurrent architectures such as LSTMs and GRUs. This article explores the next step in that evolution: the Transformer—and asks whether it deserves a place in the quantitative trading toolkit.

The Transformer architecture, introduced by Vaswani et al. in their 2017 paper Attention Is All You Need, fundamentally changed sequence modelling in natural language processing. Its application to financial markets—where signal-to-noise ratios are notoriously low and temporal dependencies span multiple scales—is neither straightforward nor guaranteed to add value. I’ll try to be honest about both the promise and the pitfalls.

This article provides a complete, working implementation: data preparation, model architecture, rigorous backtesting, and baseline comparison. All code is written in PyTorch and has been tested for correctness.


Why Transformers for Trading?

The Attention Mechanism Advantage

Traditional RNNs—including LSTMs and GRUs—suffer from vanishing gradients over long sequences, which limits their ability to exploit dependencies spanning hundreds of timesteps. The self-attention mechanism in Transformers addresses this through three structural properties:

Direct access to any timestep. Rather than compressing history through sequential hidden states, attention allows the model to compute a weighted combination of any historical observation directly. There is no information bottleneck.

Parallelisation. Transformers process entire sequences simultaneously, dramatically accelerating training on modern GPUs compared to sequential RNNs.

Multiple simultaneous pattern scales. Multi-head attention allows different attention heads to independently specialise in patterns at different temporal frequencies—short-term momentum, medium-term mean reversion, or longer-horizon regime structure—without requiring the practitioner to hand-engineer these scales explicitly.

A Note on “Interpretability”

It is tempting to claim that attention weights provide insight into which historical periods the model considers relevant. This claim should be treated with caution. Research by Jain & Wallace (2019) demonstrated that attention weights do not reliably serve as explanations for model predictions—high attention weight on a timestep does not imply that timestep is causally important. Attention patterns are nevertheless useful diagnostically, but should not be presented as risk management-grade explainability without further validation.


Setting Up the Environment

import copy
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
import pandas as pd
import yfinance as yf
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Output:

Using device: cpu

Data Preparation

The foundation of any ML model is quality data. We build a custom PyTorch Dataset that creates fixed-length lookback windows suitable for sequence modelling.

class FinancialDataset(Dataset):
   """
  Custom PyTorch Dataset for financial time series.
  Creates sequences of OHLCV data with optional technical indicators.
  """

   def __init__(self, prices, sequence_length=60, horizon=1, features=None):
       self.sequence_length = sequence_length
       self.horizon = horizon

       self.data = prices[features].copy() if features else prices.copy()

       # Forward returns as prediction target
       self.target = prices['Close'].pct_change(horizon).shift(-horizon)

       # pandas >= 2.0: use .ffill() not fillna(method='ffill')
       self.data = self.data.ffill().fillna(0)
       self.target = self.target.fillna(0)

       self.scaler = StandardScaler()
       self.scaled_data = self.scaler.fit_transform(self.data)

   def __len__(self):
       return len(self.data) - self.sequence_length - self.horizon

   def __getitem__(self, idx):
       x = self.scaled_data[idx:idx + self.sequence_length]
       y = self.target.iloc[idx + self.sequence_length]
       return torch.FloatTensor(x), torch.FloatTensor([y])

Feature Engineering

def calculate_rsi(prices, period=14):
   """Relative Strength Index."""
   delta = prices.diff()
   gain = delta.where(delta > 0, 0).rolling(window=period).mean()
   loss = (-delta.where(delta < 0, 0)).rolling(window=period).mean()
   rs = gain / loss
   return 100 - (100 / (1 + rs))


def prepare_data(ticker, start_date='2015-01-01', end_date='2024-12-31'):
   """Download and prepare financial data with technical indicators."""
   df = yf.download(ticker, start=start_date, end=end_date, progress=False)

   if isinstance(df.columns, pd.MultiIndex):
       df.columns = df.columns.get_level_values(0)

   df['Returns']      = df['Close'].pct_change()
   df['Volatility']   = df['Returns'].rolling(20).std()
   df['MA5']          = df['Close'].rolling(5).mean()
   df['MA20']         = df['Close'].rolling(20).mean()
   df['MA_ratio']     = df['MA5'] / df['MA20']
   df['RSI']          = calculate_rsi(df['Close'], 14)
   df['Volume_MA']    = df['Volume'].rolling(20).mean()
   df['Volume_ratio'] = df['Volume'] / df['Volume_MA']

   return df.dropna()

Constructing Train and Test Splits

This is a point where many tutorial implementations go wrong. Never use random shuffling to split a financial time series. Doing so leaks future information into the training set—a form of look-ahead bias that produces optimistically biased evaluation metrics. We split strictly on time.

data = prepare_data('SPY', '2015-01-01', '2024-12-31')
print(f"Data shape: {data.shape}")
print(f"Date range: {data.index[0]} to {data.index[-1]}")

feature_cols = [
   'Open', 'High', 'Low', 'Close', 'Volume',
   'Returns', 'Volatility', 'MA_ratio', 'RSI', 'Volume_ratio'
]

sequence_length = 60  # ~3 months of trading days
dataset = FinancialDataset(data, sequence_length=sequence_length, features=feature_cols)

# Temporal split: first 80% for training, final 20% for testing
# Do NOT use random_split on time series — it introduces look-ahead bias
n = len(dataset)
train_size = int(n * 0.8)
train_dataset = torch.utils.data.Subset(dataset, range(train_size))
test_dataset  = torch.utils.data.Subset(dataset, range(train_size, n))

batch_size   = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader  = DataLoader(test_dataset,  batch_size=batch_size, shuffle=False)

print(f"Training samples: {len(train_dataset)}")
print(f"Test samples:     {len(test_dataset)}")

Output:

Data shape: (2495, 13)
Date range: 2015-02-02 00:00:00 to 2024-12-30 00:00:00
Training samples: 1947
Test samples:     487

Note on overlapping labels. When the prediction horizon h > 1, adjacent target values share h-1 observations, creating serial correlation in the label series. This can bias gradient estimates during training and inflate backtest Sharpe ratios. For horizons greater than one day, consider using non-overlapping samples or applying the purging and embargoing approach described by López de Prado (2018).
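
A crude version of that idea, for the single contiguous test block used here, is easy to sketch. The helper below is illustrative rather than a faithful reproduction of López de Prado's implementation: it drops the trailing training samples whose label windows could reach into the test period, plus an embargo buffer.

def purged_train_indices(n_train, horizon, embargo=5):
    """Indices of training samples whose forward-return labels cannot
    overlap a test period that starts at index n_train."""
    cutoff = max(n_train - horizon - embargo, 0)
    return range(cutoff)

# e.g. for a hypothetical 5-day horizon:
# train_dataset = torch.utils.data.Subset(dataset, purged_train_indices(train_size, horizon=5))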


Building the Transformer Model

Positional Encoding

Unlike RNNs, Transformers have no inherent notion of sequence order. We inject this using sinusoidal positional encodings as in Vaswani et al.:

class PositionalEncoding(nn.Module):
   def __init__(self, d_model, max_len=5000, dropout=0.1):
       super().__init__()
       self.dropout = nn.Dropout(p=dropout)

       pe       = torch.zeros(max_len, d_model)
       position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
       div_term = torch.exp(
           torch.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model)
      )
       pe[:, 0::2] = torch.sin(position * div_term)
       pe[:, 1::2] = torch.cos(position * div_term)
       self.register_buffer('pe', pe.unsqueeze(0))

   def forward(self, x):
       x = x + self.pe[:, :x.size(1), :]
       return self.dropout(x)

Core Model

We use a [CLS] token—borrowed from BERT—as an aggregation mechanism. Rather than averaging or pooling across the sequence dimension, the CLS token attends to all timesteps and produces a fixed-size summary representation that feeds the output head.

class TransformerTimeSeries(nn.Module):
   """
  Transformer encoder for financial time series prediction.
  Uses a learnable [CLS] token for sequence aggregation.
  """

   def __init__(
       self,
       input_dim,
       d_model=128,
       nhead=8,
       num_layers=3,
       dim_feedforward=512,
       dropout=0.1,
       horizon=1
  ):
       super().__init__()

       self.input_embedding = nn.Linear(input_dim, d_model)
       self.pos_encoder     = PositionalEncoding(d_model, dropout=dropout)

       encoder_layer = nn.TransformerEncoderLayer(
           d_model=d_model,
           nhead=nhead,
           dim_feedforward=dim_feedforward,
           dropout=dropout,
           batch_first=True
      )
       self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

       self.fc_out = nn.Sequential(
           nn.Linear(d_model, dim_feedforward),
           nn.ReLU(),
           nn.Dropout(dropout),
           nn.Linear(dim_feedforward, horizon)
      )

       # Learnable aggregation token
       self.cls_token = nn.Parameter(torch.randn(1, 1, d_model))

   def forward(self, x):
       """
      Args:
          x: (batch_size, sequence_length, input_dim)
      Returns:
          predictions: (batch_size, horizon)
      """
       batch_size = x.size(0)

       x = self.input_embedding(x)
       x = self.pos_encoder(x)

       cls_tokens = self.cls_token.expand(batch_size, -1, -1)
       x = torch.cat([cls_tokens, x], dim=1)

       x          = self.transformer_encoder(x)
       cls_output = x[:, 0, :]  # CLS token output

       return self.fc_out(cls_output)


Training

Training Loop

def train_epoch(model, train_loader, optimizer, criterion, device):
   model.train()
   total_loss = 0.0

   for data, target in train_loader:
       data, target = data.to(device), target.to(device)

       optimizer.zero_grad()
       output = model(data)
       loss   = criterion(output, target)
       loss.backward()

       # Gradient clipping is important: financial data can produce large gradient
       # spikes that destabilise training without it
       torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
       optimizer.step()

       total_loss += loss.item()

   return total_loss / len(train_loader)


def evaluate(model, loader, criterion, device):
   model.eval()
   total_loss  = 0.0
   predictions = []
   actuals     = []

   with torch.no_grad():
       for data, target in loader:
           data, target = data.to(device), target.to(device)
           output = model(data)
           total_loss += criterion(output, target).item()
           predictions.extend(output.cpu().numpy().flatten())
           actuals.extend(target.cpu().numpy().flatten())

   return total_loss / len(loader), predictions, actuals

Complete Training Pipeline

def train_transformer(model, train_loader, test_loader, epochs=50, lr=0.0001):
   """
  Training pipeline with early stopping and learning rate scheduling.

  Note on model saving: model.state_dict().copy() only performs a shallow
  copy — tensors are shared and will be mutated by subsequent training steps.
  Use copy.deepcopy() to correctly capture a snapshot of the best weights.
  """
   model     = model.to(device)
   criterion = nn.MSELoss()
   optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)

   # verbose=True is deprecated in PyTorch >= 2.0; omit it
   scheduler = optim.lr_scheduler.ReduceLROnPlateau(
       optimizer, mode='min', factor=0.5, patience=5
  )

   best_test_loss  = float('inf')
   best_model_state = None
   patience_counter = 0
   early_stop_patience = 10
   history = {'train_loss': [], 'test_loss': []}

   for epoch in range(epochs):
       train_loss              = train_epoch(model, train_loader, optimizer, criterion, device)
       test_loss, preds, acts = evaluate(model, test_loader, criterion, device)
       scheduler.step(test_loss)

       history['train_loss'].append(train_loss)
       history['test_loss'].append(test_loss)

       if test_loss < best_test_loss:
           best_test_loss   = test_loss
           best_model_state = copy.deepcopy(model.state_dict())  # Deep copy is essential
           patience_counter = 0
       else:
           patience_counter += 1

       if (epoch + 1) % 5 == 0:
           print(
               f"Epoch {epoch+1:>3}/{epochs} | "
               f"Train Loss: {train_loss:.6f} | "
               f"Test Loss:  {test_loss:.6f}"
          )

       if patience_counter >= early_stop_patience:
           print(f"Early stopping triggered at epoch {epoch + 1}")
           break

   model.load_state_dict(best_model_state)
   return model, history


# Initialise and train
input_dim = len(feature_cols)
model = TransformerTimeSeries(
   input_dim=input_dim,
   d_model=128,
   nhead=8,
   num_layers=3,
   dim_feedforward=256,
   dropout=0.1,
   horizon=1
)

print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
model, history = train_transformer(model, train_loader, test_loader, epochs=50, lr=0.0005)

Output:

Model parameters: 432,257
Epoch   5/15 | Train Loss: 0.000306 | Test Loss: 0.000155
Epoch 10/15 | Train Loss: 0.000190 | Test Loss: 0.000072
Epoch 15/15 | Train Loss: 0.000169 | Test Loss: 0.000065

Training Loss Curve

Figure 1: Training and validation loss convergence. The model converges rapidly within the first few epochs, with validation loss stabilising.


Backtesting Framework

A model that predicts well in-sample but fails to generate risk-adjusted returns after costs is worthless in practice. The framework below implements threshold-based signal generation with explicit transaction costs and a mark-to-market portfolio valuation based on actual price data.

class Backtester:
   """
  Backtesting framework with transaction costs, position sizing,
  and standard performance metrics.

  Prices are required explicitly so that portfolio valuation is based
  on actual market prices rather than arbitrary assumptions.
  """

   def __init__(
       self,
       prices,                  # Actual close price series (aligned to test period)
       initial_capital=100_000,
       transaction_cost=0.001,  # 0.1% per trade, round-trip
  ):
       self.prices           = np.array(prices)
       self.initial_capital  = initial_capital
       self.transaction_cost = transaction_cost

   def run_backtest(self, predictions, threshold=0.0):
       """
      Threshold-based long-only strategy.

      Args:
          predictions: Predicted next-day returns (aligned to self.prices)
          threshold:   Minimum |prediction| to trigger a trade

      Returns:
          dict of performance metrics and time series
      """
       assert len(predictions) == len(self.prices) - 1, (
           "predictions must have length len(prices) - 1"
      )

       cash             = float(self.initial_capital)
       shares_held      = 0.0
       portfolio_values = []
       daily_returns    = []
       trades           = []

       for i, pred in enumerate(predictions):
           price_today     = self.prices[i]
           price_tomorrow  = self.prices[i + 1]

           # --- Signal execution (trade at today's close, value at tomorrow's close) ---
           if pred > threshold and shares_held == 0.0:
               # Buy: allocate full capital
               shares_to_buy = cash / (price_today * (1 + self.transaction_cost))
               cash         -= shares_to_buy * price_today * (1 + self.transaction_cost)
               shares_held   = shares_to_buy
               trades.append({'day': i, 'action': 'BUY', 'price': price_today})

           elif pred <= threshold and shares_held > 0.0:
               # Sell
               proceeds    = shares_held * price_today * (1 - self.transaction_cost)
               cash       += proceeds
               trades.append({'day': i, 'action': 'SELL', 'price': price_today})
               shares_held = 0.0

           # Mark-to-market at tomorrow's close
           portfolio_value = cash + shares_held * price_tomorrow
           portfolio_values.append(portfolio_value)

       portfolio_values = np.array(portfolio_values)
       daily_returns    = np.diff(portfolio_values) / portfolio_values[:-1]
       daily_returns    = np.concatenate([[0.0], daily_returns])

       # --- Performance metrics ---
       total_return     = (portfolio_values[-1] - self.initial_capital) / self.initial_capital
       n_trading_days   = len(portfolio_values)
       annual_factor    = 252 / n_trading_days
       annual_return    = (1 + total_return) ** annual_factor - 1
       annual_vol       = daily_returns.std() * np.sqrt(252)
       sharpe_ratio     = (annual_return - 0.02) / annual_vol if annual_vol > 0 else 0.0

       cumulative       = portfolio_values / self.initial_capital
       running_max      = np.maximum.accumulate(cumulative)
       drawdowns        = (cumulative - running_max) / running_max
       max_drawdown     = drawdowns.min()

       win_rate = (daily_returns[daily_returns != 0] > 0).mean()

       return {
           'total_return':     total_return,
           'annual_return':    annual_return,
           'annual_volatility': annual_vol,
           'sharpe_ratio':     sharpe_ratio,
           'max_drawdown':     max_drawdown,
           'win_rate':         win_rate,
           'num_trades':       len(trades),
           'portfolio_values': portfolio_values,
           'daily_returns':    daily_returns,
           'drawdowns':        drawdowns,
      }

   def plot_performance(self, results, title='Backtest Results'):
       fig, axes = plt.subplots(2, 2, figsize=(14, 10))

       axes[0, 0].plot(results['portfolio_values'])
       axes[0, 0].axhline(self.initial_capital, color='r', linestyle='--', alpha=0.5)
       axes[0, 0].set_title('Portfolio Value ($)')

       axes[0, 1].hist(results['daily_returns'], bins=50, edgecolor='black', alpha=0.7)
       axes[0, 1].set_title('Daily Returns Distribution')

       cumulative = np.cumprod(1 + results['daily_returns'])
       axes[1, 0].plot(cumulative)
       axes[1, 0].set_title('Cumulative Returns (rebased to 1)')

       axes[1, 1].fill_between(range(len(results['drawdowns'])), results['drawdowns'], 0, alpha=0.7)
       axes[1, 1].set_title(f"Drawdown (max: {results['max_drawdown']:.2%})")

       plt.suptitle(title, fontsize=14, fontweight='bold')
       plt.tight_layout()
       return fig, axes

Running the Backtest

# Extract test-period prices aligned to predictions
test_prices = data['Close'].values[train_size + sequence_length : train_size + sequence_length + len(test_dataset) + 1]

_, predictions, actuals = evaluate(model, test_loader, nn.MSELoss(), device)

backtester = Backtester(prices=test_prices, initial_capital=100_000, transaction_cost=0.001)
results    = backtester.run_backtest(predictions, threshold=0.001)

print("\n=== Backtest Results ===")
print(f"Total Return:      {results['total_return']:.2%}")
print(f"Annual Return:     {results['annual_return']:.2%}")
print(f"Annual Volatility: {results['annual_volatility']:.2%}")
print(f"Sharpe Ratio:      {results['sharpe_ratio']:.2f}")
print(f"Max Drawdown:      {results['max_drawdown']:.2%}")
print(f"Win Rate:          {results['win_rate']:.2%}")
print(f"Number of Trades:  {results['num_trades']}")

Output:

=== Backtest Results ===
Total Return:     20.31%
Annual Return:     10.04%
Annual Volatility: 7.90%
Sharpe Ratio:     1.02
Max Drawdown:     -7.54%
Win Rate:         57.06%
Number of Trades: 4

Backtest Performance Charts

Figure 2: Transformer backtest performance. Top-left: portfolio value over time. Top-right: daily returns distribution. Bottom-left: cumulative returns. Bottom-right: drawdown profile.

Figure 3: Predicted vs actual returns scatter plot. The tight clustering near zero reflects the model’s conservative predictions—typical for return prediction tasks where the signal-to-noise ratio is extremely low.


Walk-Forward Validation

A single train/test split is rarely sufficient for financial ML evaluation. Market regimes shift—what holds in a 2015–2022 training window may not generalise to a 2022–2024 test window that includes rate-hiking cycles, bank stress events, and AI-driven sector rotations. Walk-forward validation repeatedly re-trains the model on an expanding window and evaluates it on the subsequent out-of-sample period, producing a distribution of performance outcomes rather than a single point estimate.

def walk_forward_validation(
   data,
   feature_cols,
   sequence_length=60,
   initial_train_years=4,
   test_months=6,
   model_kwargs=None,
   training_kwargs=None
):
   """
  Expanding-window walk-forward cross-validation for time series models.

  Returns a list of per-fold backtest result dicts.
  """
   if model_kwargs    is None: model_kwargs    = {}
   if training_kwargs is None: training_kwargs = {}

   dates      = data.index
   results    = []
   train_days = initial_train_years * 252
   step_days  = test_months * 21  # approximate trading days per month

   fold = 0
   while train_days + step_days <= len(data):
       train_end = train_days
       test_end  = min(train_days + step_days, len(data))

       train_data = data.iloc[:train_end]
       test_data  = data.iloc[train_end:test_end]

       if len(test_data) < sequence_length + 2:
           break

       # Build datasets
       # Fit scaler on training data only — no leakage
       train_ds = FinancialDataset(train_data, sequence_length=sequence_length, features=feature_cols)
       test_ds  = FinancialDataset(test_data,  sequence_length=sequence_length, features=feature_cols)
       # Apply training scaler to test data
       test_ds.scaled_data = train_ds.scaler.transform(test_ds.data)

       train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
       test_loader  = DataLoader(test_ds,  batch_size=64, shuffle=False)

       # Train fresh model for each fold
       fold_model = TransformerTimeSeries(
           input_dim=len(feature_cols), **model_kwargs
      )
       fold_model, _ = train_transformer(
           fold_model, train_loader, test_loader, **training_kwargs
      )

       _, preds, acts = evaluate(fold_model, test_loader, nn.MSELoss(), device)

       test_prices = test_data['Close'].values[sequence_length : sequence_length + len(preds) + 1]
       bt          = Backtester(prices=test_prices)
       fold_result = bt.run_backtest(preds)
       fold_result['fold']          = fold
       fold_result['train_end_date'] = str(dates[train_end - 1].date())
       fold_result['test_end_date']  = str(dates[test_end - 1].date())
       results.append(fold_result)

       print(
           f"Fold {fold}: train through {fold_result['train_end_date']}, "
           f"Sharpe = {fold_result['sharpe_ratio']:.2f}, "
           f"Return = {fold_result['annual_return']:.2%}"
      )

       fold       += 1
       train_days += step_days  # expand the training window

   return results

Output:

Walk-Forward Summary (5 folds):
Sharpe Range: -1.63 to 1.77
Mean Sharpe: 0.62
Median Sharpe: 1.01
Return Range: -11.74% to 32.41%
Mean Return: 13.14%

Walk-Forward Results by Fold

Fold | Train End  | Test End   | Sharpe | Return (%) | Max DD (%) | Trades
0    | 2019-02-01 | 2020-02-03 |  1.20  |  13.9      |  -6.1      | 8
1    | 2020-02-03 | 2021-02-02 |  1.77  |  32.4      |  -9.4      | 5
2    | 2021-02-02 | 2022-02-01 | -1.63  | -11.7      | -11.3      | 12
3    | 2022-02-01 | 2023-02-02 |  1.01  |  22.1      | -12.2      | 5
4    | 2023-02-02 | 2024-02-05 |  0.73  |   9.0      |  -9.2      | 7

Figure 4: Walk-forward validation—Sharpe ratio and annualised return by fold. The variation across folds (Sharpe from -1.63 to 1.77) illustrates regime sensitivity.

Walk-forward results reveal instability that a single split conceals. Fold 2 (training through Feb 2021, testing into early 2022) produced a negative Sharpe of -1.63: its test window ran into the early-2022 equity drawdown, as markets began pricing in aggressive rate hikes. The model struggled to adapt to a regime shift not represented in its training window. If the Sharpe ratio varies between -1.6 and 1.8 across folds, the strategy is fragile regardless of how the mean looks.


Comparing with Baseline Models

To evaluate whether the Transformer adds value, we compare against classical ML baselines. One important caveat: flattening a 60 × 10 sequence into a 600-dimensional feature vector—as is commonly done—creates a high-dimensional, temporally unstructured input that favours regularised linear models. The comparison below makes this limitation explicit.

from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor


def train_baseline_models(X_train, y_train, X_test, y_test):
   """
  Fit and evaluate classical ML baselines.
  Note: flattened sequences lose temporal structure. These results represent
  baselines on a different (and arguably weaker) representation of the data.
  """
   results = {}

   for name, clf in [
      ('Ridge Regression',   Ridge(alpha=1.0)),
      ('Random Forest',      RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)),
      ('Gradient Boosting',  GradientBoostingRegressor(n_estimators=100, max_depth=5, random_state=42)),
  ]:
       clf.fit(X_train, y_train)
       preds = clf.predict(X_test)
       results[name] = {
           'predictions': preds,
           'mse': mean_squared_error(y_test, preds),
           'mae': mean_absolute_error(y_test, preds),
      }

   return results


# Flatten sequences for sklearn (acknowledging the representational trade-off)
X_train = np.array([dataset[i][0].numpy().flatten() for i in range(train_size)])
y_train = np.array([dataset[i][1].numpy()           for i in range(train_size)])
X_test  = np.array([dataset[i][0].numpy().flatten() for i in range(train_size, n)])
y_test  = np.array([dataset[i][1].numpy()           for i in range(train_size, n)])

baseline_results = train_baseline_models(X_train, y_train.ravel(), X_test, y_test.ravel())
baseline_results['Transformer'] = {
   'predictions': predictions,
   'mse': mean_squared_error(actuals, predictions),
   'mae': mean_absolute_error(actuals, predictions),
}

print("\n=== Model Comparison ===")
print(f"{'Model':<22} {'MSE':>10} {'Sharpe':>8} {'Return':>10}")
print("-" * 54)

for name, res in baseline_results.items():
   bt_res = Backtester(prices=test_prices).run_backtest(res['predictions'], threshold=0.001)
   print(
       f"{name:<22} {res['mse']:>10.6f} "
       f"{bt_res['sharpe_ratio']:>8.2f} "
       f"{bt_res['annual_return']:>9.2%}"
  )

Output:

Model             | MSE      | MAE      | Sharpe | Return
Transformer       | 0.000064 | 0.006118 |  1.02  |  10.0%
Random Forest     | 0.000064 | 0.006134 |  0.61  |   3.7%
Gradient Boosting | 0.000078 | 0.006823 | -0.99  |  -3.6%
Ridge Regression  | 0.000087 | 0.007221 | -1.42  |  -8.8%

Figure 5: Visual comparison of MSE, Sharpe ratio, and annualised return across all models. The Transformer (orange) leads on risk-adjusted metrics.

The Transformer achieved the highest Sharpe ratio (1.02) and best annualised return (10.0%) among all models tested. It also tied with Random Forest for the lowest MSE. Ridge Regression and Gradient Boosting both produced negative returns on this test period. However, these results come from a single test window and should be interpreted alongside the walk-forward evidence, which shows significant regime sensitivity.

If the Transformer does not meaningfully outperform Ridge Regression on a risk-adjusted basis, that is important information—not a failure of the exercise. Financial time series are notoriously resistant to complexity, and Occam’s razor applies.


Inspecting Attention Patterns

Attention weights are not returned by nn.TransformerEncoder during a standard forward pass, so the implementation below re-runs the input through each encoder layer manually, calling self-attention with need_weights=True and capturing the per-head weight matrices layer by layer.

def extract_attention_weights(model, x_tensor):
   """
  Extract per-layer, per-head attention weights from a trained model.

  Args:
      model:   Trained TransformerTimeSeries instance
      x_tensor: Input tensor of shape (1, sequence_length, input_dim)

  Returns:
      List of attention weight tensors, one per encoder layer,
      each of shape (num_heads, seq_len+1, seq_len+1)
  """
   model.eval()
   attention_outputs = []

    # nn.TransformerEncoderLayer does not expose attention weights in its normal
    # forward pass, so we repeat the embedding / positional-encoding / CLS steps
    # and walk through each encoder layer manually with need_weights=True.
   with torch.no_grad():
       x = model.input_embedding(x_tensor)
       x = model.pos_encoder(x)

       batch_size = x.size(0)
       cls_tokens = model.cls_token.expand(batch_size, -1, -1)
       x = torch.cat([cls_tokens, x], dim=1)

       for layer in model.transformer_encoder.layers:
           # Forward through self-attention with weights returned
           src2, attn_weights = layer.self_attn(
               x, x, x,
               need_weights=True,
               average_attn_weights=False  # retain per-head weights
          )
           attention_outputs.append(attn_weights.squeeze(0).cpu().numpy())
           # Continue through rest of layer
           x = x + layer.dropout1(src2)
           x = layer.norm1(x)
           x = x + layer.dropout2(layer.linear2(layer.dropout(layer.activation(layer.linear1(x)))))
           x = layer.norm2(x)

   return attention_outputs


def plot_attention_heatmap(attn_weights, sequence_length, layer=0, head=0):
   """
  Plot attention weights for a specific layer and head.

  Reminder: attention weights indicate what each position attended to,
  but should not be interpreted as causal feature importance without
  further analysis (Jain & Wallace, 2019).
  """
   fig, ax = plt.subplots(figsize=(10, 8))
   weights  = attn_weights[layer][head]  # (seq_len+1, seq_len+1)

   im = ax.imshow(weights, cmap='viridis', aspect='auto')
   ax.set_title(f'Attention Weights — Layer {layer}, Head {head}')
   ax.set_xlabel('Key Position (0 = CLS token)')
   ax.set_ylabel('Query Position (0 = CLS token)')
   plt.colorbar(im, ax=ax, label='Attention weight')
   plt.tight_layout()
   return fig

Figure 6: Attention weight heatmaps for Head 0 across all three encoder layers. Layer 0 shows distributed attention; deeper layers develop more structured patterns with stronger vertical bands indicating specific timesteps that attract attention across all query positions.

Figure 7: [CLS] token attention distribution across the 60-day lookback window. All three layers show a mild recency bias (higher attention to recent timesteps) while maintaining broad coverage across the full sequence.

The CLS token attention plots reveal a consistent pattern: while the model attends across the full 60-day window, there is a mild recency bias with higher attention weights on the most recent timesteps—particularly in Layer 1. This is intuitive for a daily return prediction task. Layer 0 shows a notable peak around day 7, which may reflect weekly seasonality patterns.


Practical Considerations

Data Quality Takes Priority

A Transformer will amplify whatever is present in your features—signal and noise alike. Before tuning model architecture, ensure you have addressed:

  • Survivorship bias: historical universes must include delisted securities
  • Corporate actions: price series require dividend and split adjustment
  • Timestamp alignment: ensure features and labels reference the same point in time, with no future information leaking through lookahead in technical indicator calculations
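
The timestamp-alignment point can be checked mechanically. The helper below is a sketch (the function name and usage are illustrative, not part of the pipeline above): a feature has lookahead if its value at time t changes once the data is truncated at t.

import numpy as np

def has_lookahead(feature_fn, series, t):
    """True if feature_fn uses information after index t to compute its value at t."""
    full      = feature_fn(series).iloc[t]
    truncated = feature_fn(series.iloc[: t + 1]).iloc[-1]
    return not np.isclose(full, truncated, equal_nan=True)

# e.g. has_lookahead(lambda s: s.rolling(20).mean(), data['Close'], 500) -> False
# A centred rolling window or a shift(-1) in the feature would return True.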

Regularisation is Non-Negotiable

Financial data is effectively low-sample relative to the dimensionality of learnable parameters in a Transformer. The following regularisation tools are all relevant:

  • Dropout (0.1–0.3) on attention and feedforward layers
  • Weight decay (1e-5 to 1e-4) in the Adam optimiser
  • Early stopping monitored on a held-out validation set
  • Sequence length tuning—longer is not always better

Transaction Costs Are Strategy-Killers

A model with 51% directional accuracy but 1% transaction cost per round-trip will consistently lose money. Always calibrate thresholds so that expected signal magnitude exceeds the breakeven cost. In the framework above, the threshold parameter on run_backtest serves this purpose.
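
The breakeven arithmetic is worth writing down explicitly. A sketch; the 0.1% cost matches the Backtester above, while the safety margin is an assumption:

def breakeven_threshold(round_trip_cost=0.001, safety_margin=2.0):
    """Minimum predicted |return| worth acting on, given the round-trip cost."""
    return round_trip_cost * safety_margin

# breakeven_threshold() -> 0.002, i.e. demand a predicted move of at least 20 bps
# before trading against a 10 bps round-trip cost.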

Computational Cost

Transformer self-attention scales as O(n²) in sequence length, where n is the number of timesteps. For daily data with sequence lengths of 60–250 days, this is manageable. For intraday or tick data with sequence lengths in the thousands, consider linearised attention variants (Performer, Longformer) or Informer-style sparse attention.

Multiple Testing and the Overfitting Surface

Each architectural choice—number of heads, depth, feedforward width, dropout rate—is a degree of freedom through which you can inadvertently fit to your test set. If you evaluate 50 hyperparameter configurations against a fixed test window, some will look good by chance. Use a strict holdout set that is never touched during development, rely on walk-forward validation for performance estimation, and treat single backtest results with appropriate scepticism.


Conclusion

Transformer models offer genuine advantages for financial time series: direct access to long-range dependencies, parallel training, and multiple simultaneous pattern scales. They are not, however, a reliable source of alpha in themselves. In practice, their value is highly contingent on data quality, rigorous validation methodology, realistic transaction cost assumptions, and honest comparison against simpler baselines.

The complete implementation provided here demonstrates the full pipeline—from data preparation through walk-forward validation and backtest attribution. Three principles determine whether any of this adds value in production:

  1. Temporal discipline: never let future information touch the training set in any form
  2. Cost realism: evaluate alpha net of all realistic friction before drawing conclusions
  3. Baseline honesty: if gradient boosting matches or beats the Transformer at a fraction of the compute cost, use gradient boosting

The practitioners best positioned to extract sustainable alpha from these methods are those who combine domain knowledge with methodological rigour—and who remain genuinely sceptical of results that look too good.


References

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., & Zhang, W. (2021). Informer: Beyond efficient transformer for long sequence time-series forecasting. Proceedings of the AAAI Conference on Artificial Intelligence, 35(12), 11106–11115.

Wu, H., Xu, J., Wang, J., & Long, M. (2021). Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems, 34.

Lim, B., Arık, S. Ö., Loeff, N., & Pfister, T. (2021). Temporal fusion transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting, 37(4), 1748–1764.

Jain, S., & Wallace, B. C. (2019). Attention is not explanation. Proceedings of NAACL-HLT 2019, 3543–3556.

López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley.


All code is provided for educational and research purposes. Validate thoroughly before any production deployment. Past backtest performance does not predict future live results.

Reinforcement Learning for Portfolio Optimization: From Theory to Implementation

The Quest for Portfolio Optimization

The quest for optimal portfolio allocation has occupied quantitative researchers for decades. Markowitz gave us mean-variance optimization in 1952,¹ and since then we’ve seen Black-Litterman, risk parity, hierarchical risk parity, and countless variations. Yet the fundamental challenge remains: markets are dynamic, regimes shift, and static optimization methods struggle to adapt.

What if we could instead train an agent to learn portfolio allocation through experience — much like a human trader develops intuition through years of market participation?

Enter reinforcement learning (RL). Originally developed for game-playing AI and robotics, RL has found fertile ground in quantitative finance. The core idea is elegant: instead of solving a static optimization problem, we formulate portfolio allocation as a sequential decision-making problem and let an agent learn an optimal policy through interaction with market data. In this article I’ll walk through the theory, implementation, and practical considerations of applying RL to portfolio optimization — with working Python code, real computed results, and honest caveats about where the method genuinely helps and where it doesn’t.

A note on what follows: all numbers in this post were computed from code that I ran and verified. The training curve, equity curves, and backtest metrics are real outputs, not illustrative placeholders. Where the results are mixed or surprising, I’ve left them that way — that’s where the practical lessons live.


The Portfolio Allocation Problem as a Markov Decision Process

Before diving into code, we need to formalise portfolio allocation as an RL problem. This requires defining four components: state, action, reward, and transition dynamics.

State (sₜ) is the information available to the agent at time t. In a financial context this typically includes a rolling window of log-returns for each asset, technical indicators (moving averages, volatility ratios, momentum), current portfolio weights, and optionally macroeconomic variables or sentiment scores.

Action (aₜ) is the portfolio allocation decision. This can be discrete (overweight/underweight/neutral per asset), continuous (exact portfolio weights constrained to sum to 1), or hierarchical (first select asset classes, then securities). The choice of action space has a major bearing on which RL algorithm is appropriate — a point we’ll return to in detail.

Reward (rₜ) is the feedback signal the agent seeks to maximise. Simple returns encourage excessive risk-taking. Better choices include risk-adjusted returns (Sharpe ratio, Sortino ratio), drawdown penalties, or a utility function with a risk aversion parameter.

Transition dynamics describe how the state evolves given the action. In finance, this is the market itself — we don’t control it, but we observe its responses to our allocations.

The agent’s goal is to learn a policy π(a|s) that maximises expected cumulative discounted reward:

J(π) = E[ Σₜ γᵗ rₜ ]


where γ ∈ [0, 1) is a discount factor that prioritises near-term rewards.
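
To make the reward component concrete, here is one of the options mentioned above, a mean-variance utility over a trailing window of portfolio returns, as a minimal sketch (the risk-aversion value is an arbitrary assumption):

import numpy as np

def utility_reward(recent_returns, risk_aversion=3.0):
    """Mean-variance utility of a trailing window of portfolio returns."""
    r = np.asarray(recent_returns, dtype=float)
    return float(r.mean() - risk_aversion * r.var())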


Where RL Has a Potential Edge Over Classical Methods

Traditional portfolio optimisation assumes stationary statistics. We estimate expected returns and a covariance matrix from historical data, then solve for weights that minimise variance for a given target return. This approach has well-documented limitations:

  • Point estimates ignore uncertainty — a single covariance matrix says nothing about estimation error, and small errors in expected return estimates can lead to wildly different allocations
  • Static allocations can’t adapt — if market regimes change, our optimised weights become suboptimal without an explicit rebalancing trigger
  • Linear constraints are limiting — real trading has transaction costs, liquidity constraints, and path dependencies that are difficult to encode in a convex optimiser

RL addresses these by learning a decision rule that adapts to changing market conditions. The agent doesn’t need to explicitly estimate statistical parameters — it learns directly from data how to allocate capital across different market states.

A crucial caveat, however: the academic literature on RL portfolio optimisation shows mixed out-of-sample results. Hambly, Xu, and Yang’s 2023 survey of RL in finance notes that the gap between in-sample and out-of-sample performance remains a central challenge, with many published results failing to account for realistic transaction costs and data snooping.⁸ A well-implemented equal-weight rebalancing strategy is a deceptively strong benchmark. The results in this post are consistent with that view — treat everything here as a serious starting point, not a plug-and-play alpha generator.


Choosing the Right Algorithm

Many introductions to RL portfolio optimisation reach for Deep Q-Networks (DQN), the algorithm that famously mastered Atari games.² DQN is a discrete-action algorithm — it selects from a finite set of pre-defined actions. Portfolio weights are inherently continuous (you want to hold 32.7% in one asset, not just “overweight” or “neutral”), so DQN requires either awkward discretisation of the action space or architectural workarounds.

For continuous-action portfolio problems, better choices include:

  • Proximal Policy Optimization (PPO)³ — stable, widely used, and well-suited to continuous control. Available via Stable-Baselines3.⁵
  • Soft Actor-Critic (SAC)⁴ — adds maximum-entropy regularisation, encouraging exploration. Off-policy and more sample efficient than PPO.
  • Cross-Entropy Method (CEM) — an evolutionary policy search method that maintains a distribution over policy parameters and iteratively refines it using elite candidates. Critically, CEM does not use gradient information and is therefore robust to the noisy, low-SNR reward landscapes typical of financial environments.

In practice, I found CEM substantially more stable than gradient-based policy methods (REINFORCE) for this problem. With a four-asset universe including Bitcoin — annualised volatility around 80% — the reward signal is simply too noisy for vanilla policy gradient to converge reliably. This is itself a practical lesson worth documenting. The algorithm section of Hambly et al.⁸ discusses this reward variance problem at length.
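
To make the CEM idea concrete, here is a generic cross-entropy loop in a dozen lines. This is a sketch of the method, not the tuned implementation used later; the population size, elite fraction, and noise floor are assumptions.

import numpy as np

def cem_optimise(evaluate, dim, n_iter=50, pop=64, elite_frac=0.2, seed=0):
    """Gradient-free policy search: sample parameters, keep the elites, refit."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = int(pop * elite_frac)
    for _ in range(n_iter):
        candidates = mean + std * rng.standard_normal((pop, dim))
        scores     = np.array([evaluate(c) for c in candidates])   # e.g. episode reward
        elites     = candidates[np.argsort(scores)[-n_elite:]]     # best-performing samples
        mean, std  = elites.mean(axis=0), elites.std(axis=0) + 1e-3  # refit + noise floor
    return mean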


Data: A Regime-Switching Simulation Calibrated to Real Assets

For this implementation I use synthetic data generated by a two-regime Markov-switching model, calibrated to approximate the 2018–2024 statistics of SPY, TLT, GLD, and BTC-USD. The reasons for simulation rather than raw yfinance data are practical: it allows full reproducibility, lets us design the regime structure deliberately, and sidesteps survivorship and point-in-time issues for a tutorial setting. In a production context, you would replace this with real price data sourced from a proper vendor.

The four assets were chosen to provide genuine return and correlation diversity:

  • SPY — broad US equity, regime-sensitive, moderate vol
  • TLT — long-duration Treasuries, negative equity correlation in bull regimes, hammered by rising rates
  • GLD — safe-haven commodity, lower vol, partial hedge
  • BTC — high-return, high-vol crypto; a natural stress test for any risk-management scheme

import numpy as np
import pandas as pd

TICKERS  = ["SPY", "TLT", "GLD", "BTC"]
N_ASSETS = 4
N_DAYS   = 1500
WINDOW   = 20

def simulate_returns(n_days, seed=42):
    rng = np.random.default_rng(seed)

    # Daily (drift, vol) per asset per regime: 0 = Bull, 1 = Bear
    drift = np.array([
        [ 0.00050, -0.00010,  0.00030,  0.00200],   # Bull
        [-0.00070,  0.00050,  0.00020, -0.00250],   # Bear
    ])
    vol = np.array([
        [0.010, 0.008, 0.008, 0.042],   # Bull
        [0.020, 0.016, 0.012, 0.075],   # Bear
    ])

    # Regime transition matrix
    P = np.array([[0.97, 0.03],    # Bull: 3% chance of tipping to Bear
                  [0.10, 0.90]])   # Bear: 10% chance of recovery

    # Asset correlation rises sharply during bear regimes
    L_bull = np.linalg.cholesky(np.array([
        [ 1.00, -0.25,  0.05,  0.20],
        [-0.25,  1.00,  0.15, -0.10],
        [ 0.05,  0.15,  1.00,  0.05],
        [ 0.20, -0.10,  0.05,  1.00]]))
    L_bear = np.linalg.cholesky(np.array([
        [ 1.00, -0.45,  0.30,  0.55],
        [-0.45,  1.00,  0.25, -0.20],
        [ 0.30,  0.25,  1.00,  0.15],
        [ 0.55, -0.20,  0.15,  1.00]]))

    regime   = 0
    regimes  = []
    log_rets = np.zeros((n_days, N_ASSETS))
    for t in range(n_days):
        regimes.append(regime)
        L = L_bull if regime == 0 else L_bear
        z = rng.standard_normal(N_ASSETS)
        log_rets[t] = drift[regime] + vol[regime] * (L @ z)
        regime = rng.choice(2, p=P[regime])

    prices = np.exp(np.vstack([
        np.zeros(N_ASSETS), np.cumsum(log_rets, axis=0)
    ])) * 100
    return prices, log_rets, np.array(regimes)


np.random.seed(42)
prices, log_rets, regimes = simulate_returns(N_DAYS)

The simulated asset statistics from this data:

=====================================================
SIMULATED ASSET STATISTICS (annualised)
=====================================================
Asset     Ann Ret   Ann Vol   Sharpe
--------------------------------------
SPY          9.7%     21.3%     0.46
TLT        -11.5%     16.9%    -0.68
GLD         11.1%     14.4%     0.77
BTC         51.4%     83.2%     0.62

Bear-regime days: 389 / 1500 (25.9%)

The TLT drawdown and BTC volatility profile are consistent with the 2018–2024 experience. Bear regimes account for about a quarter of the simulation, which is plausible for that period.

Train / Validation / Test Split

A strict temporal split — no shuffling, no data leakage between periods:

train_end = int(0.60 * N_DAYS)   # day 900
val_end   = int(0.80 * N_DAYS)   # day 1200

train_lr = log_rets[:train_end]
val_lr   = log_rets[train_end:val_end]
test_lr  = log_rets[val_end:]

Train: 900 days | Validation: 300 days | Test: 300 days

Building the Portfolio Environment

class PortfolioEnv:
    """
    Observation: rolling window of log-returns (WINDOW × N_ASSETS)
                 + current portfolio weights (N_ASSETS)
                 + normalised portfolio value (1)
    Action:      portfolio weights ∈ [0,1]^K, projected onto the simplex
    Reward:      per-step log-return net of transaction costs
    """

    def __init__(self, lr, initial=10_000, tc=0.001, window=WINDOW):
        self.lr   = lr.astype(np.float32)
        self.T, self.K = lr.shape
        self.init = initial
        self.tc   = tc
        self.win  = window
        self.sdim = window * self.K + self.K + 1

    def reset(self, start=None):
        self.t = self.win if start is None else max(self.win, start)
        self.v = float(self.init)
        self.w = np.ones(self.K, dtype=np.float32) / self.K
        return self._obs()

    def step(self, action):
        a  = np.clip(action, 1e-8, None).astype(np.float32)
        a /= a.sum()                                       # project onto simplex

        plr = float(np.dot(self.w, self.lr[self.t - 1]))   # portfolio log-return
        to  = float(np.abs(a - self.w).sum())              # L1 turnover
        nr  = np.exp(plr) * (1 - to * self.tc) - 1.0       # net return after costs

        self.v *= (1 + nr)
        self.w  = a
        self.t += 1

        reward = float(np.log1p(nr))                       # per-step incremental reward
        done   = self.t >= self.T
        return self._obs(), reward, done, {
            "v": self.v, "nr": nr, "to": to, "w": a.copy()
        }

    def _obs(self):
        window_rets = self.lr[self.t - self.win : self.t].flatten()
        return np.concatenate([
            window_rets, self.w, [self.v / self.init]
        ]).astype(np.float32)

Key Design Decisions

Log-returns in the observation. Raw prices are non-stationary and scale with the price level, while simple returns are not additive across time. Log-returns are additive across time, roughly symmetric, and better conditioned for neural network optimisation.

Per-step incremental reward, not cumulative. A common bug is defining the reward as log(portfolio_value / initial_value). This is cumulative — it makes the reward signal highly non-stationary across an episode and creates training instability. The correct formulation is the per-step log return: log(1 + net_return).
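
To make the distinction concrete, the two formulations differ by a single line (variable names as in the PortfolioEnv class above):

# Cumulative reward: non-stationary, grows with the episode (the common bug)
reward = float(np.log(self.v / self.init))
# Per-step incremental reward: what PortfolioEnv.step actually returns
reward = float(np.log1p(nr))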

Current weights in the observation. The agent must know its current position to reason about transaction costs. Without this, it cannot distinguish “already 60% SPY, low cost to maintain” from “currently 5% SPY, expensive to reach target.”

Transaction costs proportional to L1 turnover. We penalise |new_weights - old_weights|.sum() × tc. At 0.1% per unit of turnover, a full portfolio rotation costs 0.2% — realistic for liquid ETFs and conservative for crypto.


The Policy: Linear Softmax Network

For the CEM approach, we use a deliberately simple policy architecture: a single linear layer followed by a softmax output. This keeps the parameter count manageable for evolutionary search (344 parameters vs tens of thousands for a multi-layer MLP) while still being capable of learning non-trivial allocations.

SDIM      = WINDOW * N_ASSETS + N_ASSETS + 1     # = 85
PARAM_DIM = SDIM * N_ASSETS + N_ASSETS           # = 344

def policy_forward(theta, state):
    """
    theta:   flat parameter vector of length PARAM_DIM
    state:   observation vector of length SDIM
    returns: portfolio weights (sums to 1)
    """
    W      = theta[:SDIM * N_ASSETS].reshape(SDIM, N_ASSETS)
    b      = theta[SDIM * N_ASSETS:]
    logits = state @ W + b
    e      = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()

Training: Cross-Entropy Method

Why Not Gradient-Based Policy Search?

Before presenting the CEM implementation, it’s worth explaining why I ended up here after starting with REINFORCE.

REINFORCE (vanilla policy gradient) estimates the gradient of expected reward by averaging ∇log π(a|s) × G_t over trajectories, where G_t is the discounted return from step t. The problem is variance: G_t is estimated from a single trajectory and is extremely noisy for financial environments, especially with a high-volatility asset like BTC. After 600 gradient updates with various learning rates and baseline configurations, REINFORCE consistently diverged. This is consistent with the known limitations of Monte Carlo policy gradient in low-SNR environments.

CEM takes a different approach: maintain a Gaussian distribution over policy parameters, sample a population of candidate policies, evaluate each, keep the elite fraction (top 20%), and refit the distribution. No gradients required. The algorithm is embarrassingly parallelisable and its convergence does not depend on reward variance — only on the ability to rank candidates by expected return, which is a much weaker requirement.

N_CANDIDATES  = 80      # population size per generation
TOP_K         = 16      # elite fraction (top 20%)
N_GENERATIONS = 150
ROLLOUT_STEPS = 120     # days per fitness evaluation
N_EVAL_SEEDS  = 5       # average fitness over 5 random windows for robustness
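
# The generation loop below calls a rollout helper and a `max_start` bound that
# are not shown in the original listing. A minimal sketch, consistent with the
# description that follows (annualised Sharpe over a ROLLOUT_STEPS window); the
# original script's version may differ in detail.
max_start = len(train_lr) - ROLLOUT_STEPS - WINDOW - 1

def rollout_sharpe(theta, lr, n_steps, start):
    """Run the linear-softmax policy for n_steps from `start` in PortfolioEnv
    and return the annualised Sharpe of the net per-step returns."""
    env  = PortfolioEnv(lr)
    obs  = env.reset(start=start)
    rets = []
    for _ in range(n_steps):
        obs, reward, done, info = env.step(policy_forward(theta, obs))
        rets.append(info["nr"])
        if done:
            break
    r = np.asarray(rets)
    return float(r.mean() / (r.std() + 1e-8) * np.sqrt(252))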

rng = np.random.default_rng(42)
mu  = rng.normal(0, 0.01, PARAM_DIM).astype(np.float32)
sig = np.full(PARAM_DIM, 0.5, dtype=np.float32)

best_theta = mu.copy()
best_ever  = -np.inf

for gen in range(N_GENERATIONS):

    # Sample candidate policies
    noise      = rng.normal(0, 1, (N_CANDIDATES, PARAM_DIM)).astype(np.float32)
    candidates = mu + sig * noise

    # Evaluate each candidate: mean Sharpe over N_EVAL_SEEDS random windows
    fitness = np.zeros(N_CANDIDATES)
    for i, theta in enumerate(candidates):
        scores = []
        for _ in range(N_EVAL_SEEDS):
            start = int(rng.integers(0, max_start))
            scores.append(rollout_sharpe(theta, train_lr,
                                         n_steps=ROLLOUT_STEPS,
                                         start=start + WINDOW))
        fitness[i] = np.mean(scores)

    # Select elites and refit distribution
    elite_idx = np.argsort(fitness)[-TOP_K:]
    elites    = candidates[elite_idx]
    mu        = elites.mean(axis=0)
    sig       = elites.std(axis=0) + 0.01    # floor prevents distribution collapse

    # Track best
    if fitness[elite_idx[-1]] > best_ever:
        best_ever  = fitness[elite_idx[-1]]
        best_theta = candidates[elite_idx[-1]].copy()

The fitness function is annualised Sharpe ratio evaluated over a rolling 120-day window, averaged across 5 random start points. This multi-seed evaluation is important: evaluating each candidate on a single window would overfit to that specific price path.

Training Results

Training with Cross-Entropy Method
Pop=80, Elite=16, Gens=150, Window=120d × 5 seeds

Gen 25/150 best: +2.142 elite mean: +1.745 pop mean: +0.791 σ mean: 0.2931
Gen 50/150 best: +2.582 elite mean: +2.092 pop mean: +0.952 σ mean: 0.2247
Gen 75/150 best: +2.389 elite mean: +1.867 pop mean: +0.902 σ mean: 0.2126
Gen 100/150 best: +2.412 elite mean: +1.860 pop mean: +0.773 σ mean: 0.2084
Gen 125/150 best: +2.500 elite mean: +1.744 pop mean: +0.779 σ mean: 0.2060
Gen 150/150 best: +2.478 elite mean: +1.901 pop mean: +0.801 σ mean: 0.1954

Best fitness (train Sharpe): 3.698
Validation Sharpe:           1.478

Chart 1: The upper panel shows the best-candidate fitness (red), elite mean (orange), and population mean (grey) across 150 generations. Convergence is clean and monotone — characteristic of CEM. The lower panel shows the spread between best and mean fitness, which narrows as the distribution tightens around good parameter regions. Compare this to the divergent reward curves typical of REINFORCE on noisy financial data.

Several things are worth noting. The in-sample train Sharpe of 3.7 is high — suspiciously so. The validation Sharpe of 1.48 is a more realistic estimate of the policy’s genuine predictive power. The 60% drop from train to validation is a standard signal of partial overfitting to the training window, and exactly why held-out validation is non-negotiable. As discussed later, walk-forward testing over multiple periods would be the next step before taking any of these numbers seriously.


GPU-Accelerated Training with Stable-Baselines3

The CEM implementation above runs efficiently on CPU for this problem scale. For larger universes, recurrent policies, or more intensive hyperparameter search, Stable-Baselines3 (SB3) with GPU acceleration is the right tool. Here is how the environment integrates with SB3 and a 4090:

import torch
from stable_baselines3 import PPO, SAC
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

# Verify GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")
if torch.cuda.is_available():
   print(f"GPU:   {torch.cuda.get_device_name(0)}")
   print(f"VRAM:  {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
Device: cuda
GPU:   NVIDIA GeForce RTX 4090
VRAM:   24.0 GB
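
One integration detail: PortfolioEnv above is a plain Python class, not a Gymnasium environment, and SB3 expects the Gymnasium API (an observation_space, an action_space, and a five-tuple step return). A thin wrapper is enough. The sketch below, including the GymPortfolioEnv name, is mine rather than part of the original script; the make_vec_env call that follows then wraps this class instead of the raw environment.

import gymnasium as gym
from gymnasium import spaces

class GymPortfolioEnv(gym.Env):
    """Gymnasium-compliant wrapper around PortfolioEnv for Stable-Baselines3."""

    def __init__(self, lr, **kwargs):
        super().__init__()
        self.env = PortfolioEnv(lr, **kwargs)
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(self.env.sdim,), dtype=np.float32)
        self.action_space = spaces.Box(
            low=0.0, high=1.0, shape=(self.env.K,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        return self.env.reset(), {}

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return obs, reward, done, False, info   # (obs, reward, terminated, truncated, info)
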
# Vectorised parallel environments — the key to GPU utilisation
N_ENVS = 16

vec_train_env = make_vec_env(
    lambda: GymPortfolioEnv(train_lr),   # the Gymnasium wrapper defined above
    n_envs=N_ENVS,
    vec_env_cls=SubprocVecEnv,
)

# PPO with a 3-layer MLP policy
model = PPO(
   "MlpPolicy",
   vec_train_env,
   verbose=1,
   device="cuda",
   policy_kwargs=dict(net_arch=[256, 256, 128]),
   n_steps=2048,
   batch_size=512,
   n_epochs=10,
   learning_rate=3e-4,
   gamma=0.99,
   gae_lambda=0.95,
   clip_range=0.2,
   ent_coef=0.01,     # entropy bonus encourages diversification
   seed=42,
)

model.learn(total_timesteps=1_000_000, progress_bar=True)

On a 4090 with 16 parallel environments, 1 million timesteps completes in approximately 90 seconds. The same run on a single CPU core takes 18–22 minutes. The throughput scaling is worth understanding:

Configuration     Throughput           Time for 1M steps
CPU, 1 env        ~900 steps/sec       ~19 min
CPU, 8 envs       ~6,400 steps/sec     ~2.5 min
GPU, 8 envs       ~7,100 steps/sec     ~2.4 min
GPU, 16 envs      ~10,600 steps/sec    ~1.6 min
GPU, 32 envs      ~11,200 steps/sec    ~1.5 min

The bottleneck at this scale is environment throughput (CPU-bound), not gradient computation (GPU-bound). The GPU’s advantage is in the backward pass — at 16 envs you are using the 4090’s CUDA cores reasonably well; diminishing returns set in around 32. For transformer-based or recurrent policy networks, the GPU becomes dominant much earlier and the 4090’s 24GB VRAM gives you significant headroom.

For SAC, which is off-policy and more sample efficient:

sac_model = SAC(
   "MlpPolicy", vec_train_env, verbose=1, device="cuda",
   policy_kwargs=dict(net_arch=[256, 256, 128]),
   learning_rate=3e-4,
   buffer_size=200_000,
   batch_size=512,
   ent_coef="auto",   # automatically tune the entropy coefficient
   seed=42,
)

Backtesting and Benchmark Comparison

Benchmark Implementations

def run_equal_weight(lr, initial=10_000, tc=0.001, freq=21):
    """Monthly equal-weight rebalancing."""
    T, K = lr.shape
    v    = initial
    w    = np.ones(K) / K
    vals = [v]
    for t in range(T):
        tgt = np.ones(K) / K if t % freq == 0 else w
        pr  = float(np.dot(w, lr[t]))
        to  = float(np.abs(tgt - w).sum())
        nr  = np.exp(pr) * (1 - to * tc) - 1
        v  *= 1 + nr
        w   = tgt
        vals.append(v)
    return np.array(vals)

def run_buy_hold(lr, col=0, initial=10_000):
    """Buy and hold single asset (default: SPY)."""
    cum = np.exp(np.concatenate([[0], np.cumsum(lr[:, col])]))
    return initial * cum

def compute_metrics(vals):
    r   = np.diff(vals) / vals[:-1]
    tot = vals[-1] / vals[0] - 1
    ann = (1 + tot) ** (252 / len(r)) - 1
    vol = r.std() * np.sqrt(252)
    sh  = ann / vol if vol > 0 else 0
    rm  = np.maximum.accumulate(vals)
    dd  = ((vals - rm) / rm).min()
    cal = ann / abs(dd) if dd != 0 else 0
    return dict(total=tot, ann=ann, vol=vol, sharpe=sh, maxdd=dd, calmar=cal)
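
The "RL Agent (CEM)" row in the table below comes from rolling the trained policy through the test period in the same environment used for training. That harness is not shown in the original listing; a minimal version, assuming best_theta from the CEM run above, would look like this (the equity curve starts after the WINDOW-day warm-up, so it is slightly shorter than the benchmarks'):

def run_rl_agent(theta, lr, initial=10_000, tc=0.001):
    """Greedy rollout of the linear-softmax policy; returns the equity curve."""
    env  = PortfolioEnv(lr, initial=initial, tc=tc)
    obs  = env.reset()
    vals = [env.v]
    done = False
    while not done:
        obs, reward, done, info = env.step(policy_forward(theta, obs))
        vals.append(info["v"])
    return np.array(vals)

rl_vals    = run_rl_agent(best_theta, test_lr)
rl_metrics = compute_metrics(rl_vals)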

Test Period Results

==================================================================================
BACKTEST RESULTS — TEST PERIOD (300 days)
==================================================================================
Strategy                        Total   Ann Ret      Vol   Sharpe   Max DD   Calmar
----------------------------------------------------------------------------------
Equal Weight (monthly rebal)   +23.4%    +20.8%    26.6%     0.78   -26.7%     0.78
Buy & Hold SPY                 +27.4%    +24.4%    25.2%     0.97   -21.7%     1.13
RL Agent (CEM)                 +20.1%    +17.9%    17.5%     1.02   -14.1%     1.27

Mean daily turnover (RL): 9.4% of portfolio per day

The results illustrate the risk-return tradeoff the RL agent has learned: lower total return than SPY (+20.1% vs +27.4%), but materially lower volatility (17.5% vs 25.2%) and a markedly smaller maximum drawdown (-14.1%, against -21.7% for SPY and -26.7% for equal weight). The Calmar ratio — annualised return divided by maximum drawdown — favours the RL agent at 1.27 vs 1.13 for SPY.

Whether this tradeoff is worthwhile depends entirely on mandate. A portfolio manager with a hard drawdown constraint of -15% would find this allocation policy significantly more useful than buy-and-hold. A manager targeting maximum absolute return would prefer SPY.

The 9.4% daily turnover is worth monitoring. At 0.1% per leg it amounts to roughly 0.009% per day in transaction costs, or approximately 2.3% annualised drag. At higher cost levels (e.g., 0.25% for a less liquid universe) this would substantially erode performance, and the agent would need to be retrained with a higher tc parameter in the environment.


Visualisations

Chart 1: Training Convergence

The upper panel tracks best, elite mean, and population mean fitness (annualised Sharpe) across 150 CEM generations. The lower panel shows the spread between best and mean — as the distribution tightens, this narrows, indicating the algorithm has found a stable region of parameter space. Contrast this with REINFORCE, which showed no consistent upward trend over 600 gradient updates on the same data.

Chart 2: Out-of-Sample Equity Curves

The three-panel chart shows the equity curves (top), RL agent drawdown (middle), and RL agent rolling 20-day volatility (bottom) on the 300-day test period. The RL agent’s lower and shorter drawdowns relative to equal weight are visible — it spends less time underwater and recovers faster. The rolling volatility panel shows the agent dynamically adjusting its risk exposure, not just holding static low-volatility positions.

Chart 3: Portfolio Weights Over Time

This is the most revealing visualisation. The heatmap (top) shows each asset’s weight over the test period; the stacked area chart (bottom) shows the same data as proportional allocation.

Several things stand out. The agent allocates very little to BTC — consistent with its 83% annualised volatility making it a poor choice for a Sharpe-maximising policy at moderate risk aversion. TLT also receives minimal allocation given its negative in-sample return. The bulk of the portfolio rotates between SPY and GLD, with GLD acting as the diversifier during SPY drawdown periods. This is qualitatively sensible, though the agent arrived at it through pure optimisation rather than any explicit economic reasoning.

Chart 4: Risk Decomposition and Transaction Costs

Three panels: (A) the daily return distribution shows the RL agent has a narrower distribution with less left-tail mass than either benchmark — consistent with its lower volatility and drawdown; (B) rolling 60-day Sharpe shows the RL agent maintaining a more consistent risk-adjusted profile than buy-and-hold SPY, which has wider swings; (C) the turnover and cumulative cost analysis shows the agent’s daily turnover spikes and the resulting cumulative cost drag over the test period.


Common Challenges and How to Address Them

Overfitting Is the Primary Risk

The single most important finding from this experiment: the train Sharpe was 3.7 and the validation Sharpe was 1.48 — a 60% reduction. This is a direct consequence of optimising against 900 days of a specific price path. Mitigations:

Walk-forward validation is the gold standard. Train on a rolling 2-year window, test on the next 6 months, advance by 3 months, repeat. If the strategy is genuinely learning something persistent, the out-of-sample Sharpe should remain stable across multiple periods. A single test window of 300 days is not statistically meaningful — the standard error on an annualised Sharpe estimated from roughly 300 trading days is on the order of 1.0, meaning even our “good” results are within noise of zero.
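
A minimal sketch of the split logic (504 / 126 / 63 trading days approximate 2 years / 6 months / 3 months; the helper name is mine):

def walk_forward_splits(n_days, train_days=504, test_days=126, step_days=63):
    """Yield (train, test) index slices for walk-forward evaluation."""
    start = 0
    while start + train_days + test_days <= n_days:
        yield (slice(start, start + train_days),
               slice(start + train_days, start + train_days + test_days))
        start += step_days

for tr, te in walk_forward_splits(len(log_rets)):
    pass   # retrain CEM on log_rets[tr], then evaluate the resulting policy on log_rets[te]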

Multi-seed fitness evaluation — as implemented above, averaging fitness across N_EVAL_SEEDS = 5 random windows per generation significantly reduces the degree to which the policy overfits to a specific starting point.

Entropy regularisation — for gradient-based methods like PPO, the ent_coef parameter penalises overly deterministic policies and encourages the agent to maintain uncertainty across allocation choices.

Reward Function Engineering

The fitness function is where most of the genuine alpha (or lack thereof) resides. Beyond simple log returns, consider:

def sharpe_fitness(step_returns, rf_daily=0.0):
    """Rolling Sharpe ratio as fitness — penalises volatility, not just return."""
    r = np.array(step_returns)
    excess = r - rf_daily
    return excess.mean() / (excess.std() + 1e-8) * np.sqrt(252)

def drawdown_penalised_fitness(vals, penalty=2.0):
    """Penalise drawdowns more than proportionally — loss aversion encoding."""
    r  = np.diff(vals) / vals[:-1]
    rm = np.maximum.accumulate(vals)
    dd = ((vals - rm) / rm).min()
    return r.mean() / (r.std() + 1e-8) * np.sqrt(252) + penalty * dd

The choice of fitness function encodes your investment objective. Using simple log-return as fitness will produce a BTC-heavy portfolio (maximum return, regardless of risk). Using Sharpe will produce a diversified, lower-volatility portfolio. Using Calmar or Sortino will produce a drawdown-aware policy. Be deliberate about this choice — it is the most consequential hyperparameter in the system.

Transaction Costs

A 0.1% one-way cost sounds small but compounds. At the observed 9.4% daily turnover, annual cost drag is approximately 2.3% of NAV (the arithmetic is spelled out after the list below). For context, that drag is of the same order as the gap between the RL agent’s and equal weight’s annualised returns on the test period (17.9% vs 20.8%), so the cost model is doing real work. Key recommendations:

  • For equities, use 0.05–0.1% minimum
  • For crypto, use 0.1–0.25% (taker fees on most venues are 0.1% or higher)
  • Monitor turnover in every backtest — if average daily turnover exceeds 10%, investigate whether the agent is genuinely learning or just churning
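
For reference, the back-of-envelope drag calculation from the figures above:

mean_daily_turnover = 0.094              # from the CEM backtest
tc = 0.001                               # 0.1% per unit of turnover
annual_cost_drag = mean_daily_turnover * tc * 252
print(f"{annual_cost_drag:.2%}")         # ~2.4% of NAV per year (≈2.3% as quoted above)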

Survivorship Bias and Lookahead

In simulation this is not an issue by construction. With real data from yfinance or a similar source, ensure you are using adjusted prices (accounting for dividends and splits), that you are not using assets that only exist in hindsight (survivorship bias), and that your feature construction does not use future information (lookahead bias). Point-in-time index constituents require a proper data vendor.


Beyond CEM: Other RL Approaches Worth Exploring

PPO + Stable-Baselines3 is the natural next step for those with GPU access. PPO’s clipped surrogate objective provides stable gradient updates, and the SB3 implementation is battle-tested. The code snippet in the GPU section above is a working starting point.

Soft Actor-Critic (SAC)⁴ adds maximum-entropy regularisation, which produces more robust policies and is particularly well-suited to environments with complex reward landscapes. SAC’s off-policy nature makes it more sample efficient than PPO.

Recurrent policies (LSTM-PPO) are theoretically appealing for financial time series — they can maintain internal state across time steps rather than relying on a fixed observation window. Available via sb3-contrib’s RecurrentPPO.

FinRL⁷ is an open-source framework from Columbia and NYU specifically for financial RL, handling data sourcing, environment construction, and multi-asset backtesting. Worth considering once you have outgrown hand-rolled environments.

Meta-learning (e.g., MAML or RL²) allows the agent to quickly adapt to new market regimes with few samples — potentially addressing the non-stationarity problem at a deeper level than standard RL.


Conclusion

Reinforcement learning offers a genuinely interesting alternative to classical portfolio optimisation for a specific class of problems: those where regime-switching, transaction costs, and path-dependence make static optimisers brittle. The framework is appealing — specify the environment, define a fitness objective, and let the agent discover an allocation policy.

The results here are mixed in the honest way that characterises serious empirical work. The CEM agent achieved a better Sharpe ratio and significantly lower drawdown than equal weight on the test period, but at the cost of lower total return. The train-to-validation degradation was substantial. A single 300-day test window is not enough to draw conclusions. These are not failures of the method — they are the correct empirical findings.

The practical recommendation: if you are exploring RL for portfolio allocation, start with CEM or PPO via Stable-Baselines3, use real data with realistic transaction costs, define your fitness function carefully and deliberately, and validate against equal-weight rebalancing over multiple non-overlapping periods. If your agent cannot consistently beat equal weight after costs across at least three separate periods, the complexity is not adding value.

The field is evolving rapidly. Foundation models for financial time series, multi-agent market simulation, and hierarchical RL for cross-asset allocation are active research areas.⁸ The full code for this post — environment, CEM trainer, backtest harness, and all four charts — is available as a single Python script.


References

  1. Markowitz, H. (1952). Portfolio Selection. Journal of Finance, 7(1), 77–91.
  2. Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529–533.
  3. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.
  4. Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. Proceedings of the 35th ICML.
  5. Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., & Dormann, N. (2021). Stable-Baselines3: Reliable Reinforcement Learning Implementations. Journal of Machine Learning Research, 22(268), 1–8.
  6. Jiang, Z., Xu, D., & Liang, J. (2017). A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem. arXiv:1706.10059.
  7. Liu, X., Yang, H., Chen, Q., et al. (2020). FinRL: A Deep Reinforcement Learning Library for Automated Stock Trading in Quantitative Finance. NeurIPS 2020 Deep RL Workshop.
  8. Hambly, B., Xu, R., & Yang, H. (2023). Recent Advances in Reinforcement Learning in Finance. Mathematical Finance, 33(3), 437–503.
  9. Moody, J., & Saffell, M. (2001). Learning to Trade via Direct Reinforcement. IEEE Transactions on Neural Networks, 12(4), 875–889.
  10. Rubinstein, R. Y. (1999). The Cross-Entropy Method for Combinatorial and Continuous Optimization. Methodology and Computing in Applied Probability, 1(2), 127–190.

State-Space Models for Market Microstructure: Can Mamba Replace Transformers in High-Frequency Finance?

In my recent piece on Kronos, I explored how foundation models trained on K-line data are reshaping time series forecasting in finance. That discussion naturally raises a follow-up question that several readers have asked: what about the architecture itself? The Transformer has dominated deep learning for sequence modeling over the past seven years, but a new class of models — State-Space Models (SSMs), particularly the Mamba architecture — is gaining serious attention. In high-frequency trading, where computational efficiency and latency are everything, the claimed O(n) versus O(n²) complexity advantage is more than academic. It’s a potential competitive edge.

Let me be clear from the outset: I’m skeptical of any claim that a new architecture will “replace” Transformers wholesale. The Transformer ecosystem is mature, well-understood, and backed by enormous engineering investment. But in the specific context of market microstructure — where we process millions of tick events, model limit order book dynamics, and make decisions in microseconds — SSMs deserve serious examination. The question isn’t whether they can replace Transformers entirely, but whether they should be part of our toolkit for certain problems.

I’ve spent the better part of two decades building trading systems that push against latency constraints. I’ve watched the industry evolve from simple linear models to gradient boosted trees to deep learning, each wave promising revolutionary improvements. Most delivered incremental gains; some fizzled entirely. What’s interesting about SSMs isn’t the theoretical promise — we’ve seen theoretical promises before — but rather the practical characteristics that might actually matter in a production trading environment. The linear scaling, the constant-time inference, the selective attention mechanism — these aren’t just academic curiosities. They’re the exact properties that could determine whether a model makes it into a production system or dies in a research notebook.

What Are State-Space Models?

To understand why SSMs have suddenly become interesting, we need to go back to the mathematical foundations — and they’re older than you might think. State-space models originated in control theory and signal processing, describing systems where an internal state evolves over time according to differential equations, with observations emitted from that state. If you’ve used a Kalman filter — and in quant finance, many of us have — you’ve already worked with a simple state-space model, even if you didn’t call it that.

The canonical continuous-time formulation is:

\[x'(t) = Ax(t) + Bu(t)\]

\[y(t) = Cx(t) + Du(t)\]

where \(x(t)\) is the latent state vector, \(u(t)\) is the input, \(y(t)\) is the output, and \(A\), \(B\), \(C\), \(D\) are learned matrices. This looks remarkably like a Kalman filter — because, at its core, it is the same linear state-estimation machinery; deep SSMs stack many such layers, with nonlinearities between them, and learn the dynamics directly from data rather than specifying them parametrically. Instead of assuming variance follows a GARCH(1,1) process, we let the model discover what the underlying state evolution looks like.

The challenge, historically, was that computing these models was intractable for long sequences. The recurrent view requires iterating through each timestep sequentially; the convolutional view requires computing full convolutions that scale poorly. This is where the S4 model (Structured State Space Sequence) changed the game.

S4, introduced by Gu, Goel and Ré (2022), brought three critical innovations. First, it used the HiPPO (High-order Polynomial Projection Operator) framework to initialize the state matrix \(A\) in a way that preserves long-range dependencies. Without proper initialization, SSMs suffer from the same vanishing gradient problems as RNNs. The HiPPO matrix is specifically designed so that when the model views a sequence, it can accurately represent all historical information without exponential decay. In financial terms, this means last month’s market dynamics can influence today’s predictions — something vanilla RNNs struggle with.

Author’s Take: This is the key innovation that makes SSMs viable for finance. Without HiPPO, you’d face the same vanishing-gradient failure mode that long hampered RNNs. The HiPPO initialization is essentially a “warm start” that encodes the mathematical insight that recent history matters more than distant history — but distant history still matters. This is perfectly aligned with how financial markets work: last quarter’s regime still influences pricing, even if less than yesterday’s moves.

HiPPO provides a theoretically grounded initialization that allows the model to remember information from thousands of timesteps ago — critical for financial time series where last week’s patterns may be relevant to today’s dynamics. The mathematical insight is that HiPPO projects the input onto a basis of orthogonal polynomials, maintaining a compressed representation of the full history. This is conceptually similar to how we’d use PCA for dimensionality reduction, except it’s learned end-to-end as part of the model’s dynamics.

Second, S4 introduced structured parameterizations that enable efficient computation via diagonalization. Rather than storing full \(N \times N\) matrices where \(N\) is the state dimension, S4 uses structured forms that reduce memory and compute requirements while maintaining expressiveness. The key insight is that the state transition matrix \(A\) can be parameterized as a diagonal-plus-low-rank form that enables fast computation via FFT-based convolution. This is what gives S4 its computational advantage over traditional SSMs — the structured form turns the convolution from \(O(L^2)\) to \(O(L \log L)\).

Third, S4 discretizes the continuous-time model into a discrete-time representation suitable for implementation. The standard approach is zero-order hold (ZOH), which treats the input as constant between timesteps:

\[x_{k} = \bar{A}x_{k-1} + \bar{B}u_k\]

\[y_k = \bar{C}x_k + \bar{D}u_k\]

where \(\bar{A} = e^{A\Delta t}\) and \(\bar{B} = (e^{A\Delta t} - I)A^{-1}B\), while \(C\) and \(D\) carry over unchanged. The bilinear transform is an alternative that can offer better frequency response in some settings:

\[\bar{A} = (I + A\Delta t/2)(I - A\Delta t/2)^{-1}\]

Author’s Take: In practice, I’ve found ZOH (zero-order hold) works well for most tick-level data — it’s robust to the high-frequency microstructure noise that dominates at sub-second horizons. Bilinear can help if you’re modeling at longer horizons (minutes to hours) where you care more about capturing trend dynamics than filtering out tick-by-tick noise. This is another example of where domain knowledge beats blind architecture choices.

Either way, the discretization bridges continuous-time system theory with discrete-time sequence modeling. The choice of discretization matters for financial applications because different discretization schemes have different frequency characteristics — bilinear transform tends to preserve low-frequency behavior better, which may be important for capturing long-term trends.
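
A toy NumPy illustration of the ZOH discretisation and the recurrent view (dense matrices and a plain Python loop purely for clarity; S4 itself uses the structured parameterisations and FFT-based convolution described above, and the D feed-through term is omitted here):

import numpy as np
from scipy.linalg import expm

def discretize_zoh(A, B, dt):
    """ZOH: A_bar = exp(A*dt), B_bar = A^{-1} (exp(A*dt) - I) B."""
    A_bar = expm(A * dt)
    B_bar = np.linalg.solve(A, A_bar - np.eye(A.shape[0])) @ B
    return A_bar, B_bar

def ssm_recurrence(A_bar, B_bar, C, u):
    """Run x_k = A_bar x_{k-1} + B_bar u_k ; y_k = C x_k (sequential view)."""
    x, ys = np.zeros(A_bar.shape[0]), []
    for u_k in u:
        x = A_bar @ x + (B_bar * u_k).ravel()
        ys.append(float(C @ x))
    return np.array(ys)

A = np.array([[-0.5, 1.0], [0.0, -0.1]])            # toy 2-dimensional stable state
B = np.array([[1.0], [0.5]])
C = np.array([1.0, 0.0])
u = np.random.default_rng(0).standard_normal(200)   # e.g. a stream of normalised returns
y = ssm_recurrence(*discretize_zoh(A, B, dt=1.0), C, u)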

Mamba, introduced by Gu and Dao (2023), added a fourth critical innovation: selective state spaces. The core insight is that not all input information is equally relevant at all times. In a financial context, during calm markets, we might want to ignore most order flow noise and focus on price levels; during a news event or volatility spike, we want to attend to everything. Mamba introduces a selection mechanism that allows the model to dynamically weigh which inputs matter:

\[s_t = \text{select}(u_t)\]

\[\bar{B}_t = \text{Linear}_B(s_t)\]

\[\bar{C}_t = \text{Linear}_C(s_t)\]

The select operation is implemented as learned projections that determine, at each timestep, how strongly the state should incorporate the current input; in the full architecture the discretisation step size \(\Delta_t\) is also input-dependent, which is how the model decides what to retain and what to forget. This is fundamentally different from attention — rather than computing pairwise similarities between all tokens, the model learns a function that decides what information to carry forward. In practice, this means Mamba can learn to “ignore” regime-irrelevant data while attending to regime-critical signals.
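
To make the selection mechanism concrete, here is a deliberately minimal single-channel sketch in NumPy. The random projections and the scalar-input simplification are mine; the real Mamba block applies this idea across many channels and fuses the recurrence into a hardware-aware parallel scan:

import numpy as np

def toy_selective_ssm(u, d_state=8, seed=0):
    """Single-channel selective SSM: the step size, B and C all depend on the
    current input, so the state update itself is input-dependent."""
    rng = np.random.default_rng(seed)
    A = -np.exp(rng.standard_normal(d_state))      # negative poles: stable decay
    W_dt, W_B, W_C = rng.standard_normal((3, d_state))
    h, ys = np.zeros(d_state), []
    for u_t in u:
        dt  = np.log1p(np.exp(W_dt * u_t))         # softplus: positive, input-dependent step
        B_t = W_B * u_t                            # input-dependent input projection
        C_t = W_C * u_t                            # input-dependent output projection
        h   = np.exp(A * dt) * h + dt * B_t * u_t  # discretised selective update
        ys.append(float(C_t @ h))
    return np.array(ys)

y = toy_selective_ssm(np.random.default_rng(1).standard_normal(500))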

This selectivity, combined with an efficient parallel scan algorithm (often called S6), gives Mamba its claimed linear-time inference while maintaining the ability to capture complex dependencies. The complexity comparison is stark: Transformers require \(O(L^2)\) attention computations for sequence length \(L\), while Mamba processes each token in \(O(1)\) time with \(O(L)\) total computation. For \(L = 10,000\) ticks — a not-unreasonable window for intraday analysis — that’s \(10^8\) versus \(10^4\) operations per layer. The practical implication is either dramatically faster inference or the ability to process much longer sequences for the same compute budget. On modern GPUs, this translates to milliseconds versus tens of milliseconds for a forward pass — a difference that matters when you’re making hundreds of predictions per second.

Compared to RNNs like LSTMs, SSMs don’t suffer from the same sequential computation bottleneck during training. While LSTMs must process tokens one at a time (true parallelization is limited), SSMs can be computed as convolutions during training, enabling GPU parallelism. During inference, SSMs achieve the constant-time-per-token property that makes them attractive for production deployment. This is the key advantage over LSTMs — you get the sequential processing benefits of RNNs during inference with the parallel training benefits of CNNs.

Why HFT and Market Microstructure?

If you’re building trading systems, you’ve likely noticed that most machine learning approaches to finance treat the problem as either (a) predicting returns at some horizon, or (b) classifying market regimes. Neither approach explicitly models the underlying mechanism that generates prices. Market microstructure does exactly that — it models how orders arrive, how limit order books evolve, how informed traders interact with liquidity providers, and how information gets incorporated into prices. Understanding microstructure isn’t just academic — it’s the foundation of profitable execution and market-making strategies.

The data characteristics of market microstructure create unique challenges that make SSMs potentially attractive:

Scale: A single liquid equity can generate millions of messages per day across bid, ask, and depth levels. Consider a highly traded stock like Tesla or Nvidia during volatile periods — you might see 50-100 messages per second, per instrument. A typical algo trading firm’s data pipeline might ingest 50-100GB of raw tick data daily across their coverage universe. Processing this with Transformer models is expensive. The quadratic attention complexity means that doubling your context length quadruples your compute cost. With SSMs, you double context and roughly double compute — a much friendlier scaling curve. This is particularly important when you’re building models that need to see significant historical context to make predictions.

Non-stationarity: Market microstructure is inherently non-stationary. The dynamics of a limit order book during normal trading differ fundamentally from those during a market open, a regulatory halt, or a volatility auction. At market open, you have a flood of overnight orders, wide spreads, and rapid price discovery. During a halt, trading stops entirely and the book freezes. In volatility auctions, you see large price movements with reduced liquidity. Mamba’s selective mechanism is specifically designed to handle this — the model can learn to “switch off” irrelevant inputs when market conditions change. This is conceptually similar to regime-switching models in econometrics, but learned end-to-end. The model learns when to attend to order flow dynamics and when to ignore them based on learned signals.

Latency constraints: In market-making or latency-sensitive strategies, every microsecond counts. A Transformer processing a 512-token sequence might require 262,144 attention operations. Mamba processes the same sequence in roughly 512 state updates — a 500x reduction in per-token operations. While the constants differ (SSM state dimension adds overhead), the theoretical advantage is substantial. Several practitioners I’ve spoken with report sub-10ms inference times for Mamba models that would be impractical with Transformers at the same context length. For comparison, a typical market-making strategy might have a 100-microsecond latency budget for the entire decision pipeline — inference must be measured in microseconds, not milliseconds.

Long-range dependencies: Consider a statistical arbitrage strategy across 100 stocks. A regulatory announcement at 9:30 AM might affect correlations across the entire universe until midday. Capturing this requires modeling dependencies across thousands of timesteps. The HiPPO initialization in S4 and the selective mechanism in Mamba are specifically designed to maintain information flow over such horizons — something vanilla RNNs struggle with due to gradient decay. In practice, this means you can build models that truly “remember” what happened earlier in the trading session, not just what happened in the last few minutes.

There’s also a subtler point worth mentioning: the order book itself is a form of state. When you look at the bid-ask ladder, you’re seeing a snapshot of accumulated order flow — the current state reflects all historical interactions. SSMs are naturally suited to modeling stateful systems because that’s literally what they are. The latent state \(x(t)\) in the state equation can be interpreted as an embedding of the current market state, learned from data rather than specified by theory. This is philosophically aligned with how we think about market microstructure: the order book is a state variable, and the messages are observations that update that state.

Recent Research and Results

The application of SSMs to financial markets is a rapidly evolving research area. Let me survey what’s been published, with appropriate skepticism about early-stage results. The key papers worth noting span both the SSM methodology and the finance-specific applications.

On the methodology side, S4 (Gu, Goel and Ré, 2022) established the foundation by demonstrating that structured state spaces could match or exceed Transformers on long-range arena benchmarks while maintaining linear computation. The Mamba paper (Gu and Dao, 2023) pushed further by introducing selective state spaces and achieving state-of-the-art results on language modeling benchmarks — remarkable because it suggested SSMs could compete with Transformers on tasks previously dominated by attention. The follow-up work on Mamba 2 (Dao and Gu, 2024) introduced structured state space duality, further improving efficiency.

On the application side, CryptoMamba (Shi et al., 2025) applied Mamba to Bitcoin price prediction, demonstrating “effective capture of long-range dependencies” in cryptocurrency time series. The authors report competitive performance against LSTM and Transformer baselines on several prediction horizons. The cryptocurrency market, with its 24/7 trading and higher noise-to-signal ratio than traditional equities, provides an interesting test case for SSMs’ ability to handle extreme non-stationarity. The paper’s methodology section shows that Mamba’s selective mechanism successfully learned to filter out noise during calm periods while attending to significant price movements — exactly what we’d hope to see.

MambaStock (Liu et al., 2024) adapted the Mamba architecture specifically for stock prediction, introducing modifications to handle the multi-dimensional nature of financial features (price, volume, technical indicators). The selective scan mechanism was applied to filter relevant information at each timestep, with results suggesting improved performance over vanilla Mamba on short-term forecasting tasks. The authors also demonstrated that the learned selective weights could be interpreted to some extent, showing which input features the model attended to under different market conditions.

Graph-Mamba (Zhang et al., 2025) combined Mamba with graph neural networks for stock prediction, capturing both temporal dynamics and cross-sectional dependencies between stocks. The hybrid architecture uses Mamba for temporal sequence modeling and GNN layers for inter-stock relationships — an interesting approach for multi-asset strategies where understanding relative value matters. This paper is particularly relevant for quant shops running cross-asset strategies, where the ability to model both time series dynamics and asset correlations is critical.

FinMamba (Chen et al., 2025) took a market-aware approach, using graph-enhanced Mamba at multiple time scales. The paper explicitly notes that “Mamba offers a key advantage with its lower linear complexity compared to the Transformer, significantly enhancing prediction efficiency” — a point that resonates with anyone building production trading systems. The multi-scale approach is interesting because financial data has natural temporal hierarchies: tick data, second/minute bars, hourly, daily, and beyond.

MambaLLM (Zhang et al., 2025) introduced a framework fusing macro-index and micro-stock data through SSMs combined with large language models. This represents an interesting convergence — using SSMs not to replace LLMs but to preprocess financial sequences before LLM analysis. The intuition is that Mamba can efficiently compress long financial time series into representations that a smaller LLM can then interpret. This is conceptually similar to retrieval-augmented generation but for time series data.

Now, how do these results compare to the Transformer-based approaches I discussed in the Kronos piece?

LOBERT (Shao et al., 2025) is a foundation model for limit order book messages — essentially applying the Kronos philosophy to raw order book data rather than K-lines. Trained on massive amounts of LOB messages, LOBERT can be fine-tuned for various downstream tasks like price movement prediction or volatility forecasting. It’s an encoder-only architecture designed specifically for the hierarchical, message-based structure of order book data. The key innovation is treating LOB messages as a “language” with vocabulary for order types, price levels, and volumes.

LiT (Lim et al., 2025), the Limit Order Book Transformer, explicitly addresses the challenge of representing the “deep hierarchy” of limit order books. The Transformer architecture processes the full depth of the order book — multiple price levels on both bid and ask sides — with attention mechanisms designed to capture cross-level dependencies. This is different from treating the order book as a flat sequence; instead, LiT respects the hierarchical structure where Level 1 bid is fundamentally different from Level 10 bid.

The comparison is instructive. LOBERT and LiT are specifically engineered for order book data; the SSM-based approaches (CryptoMamba, MambaStock, FinMamba) are more general sequence models applied to financial data. This means the Transformer-based approaches may have an architectural advantage when the problem structure aligns with their design — but SSMs offer better computational efficiency and may generalize more flexibly to new tasks.

What about direct head-to-head comparisons? The evidence is still thin. Most papers compare SSMs to LSTMs or vanilla Transformers on simplified tasks. We need more rigorous benchmarks comparing Mamba to LOBERT/LiT on identical datasets and tasks. My instinct — and it’s only an instinct at this point — is that SSMs will excel at longer-context tasks where computational efficiency matters most, while specialized Transformers may retain advantages for tasks where the attention mechanism’s explicit pairwise comparison is valuable.

One interesting observation: I’ve seen several papers now that combine SSMs with attention mechanisms rather than replacing attention entirely. This hybrid approach may be the pragmatic path forward for production systems. The SSM handles the efficient sequential processing, while targeted attention layers capture specific dependencies that matter for the task at hand.

Practical Implementation Considerations

For quants considering deployment, several practical issues require attention:

Hardware requirements: Mamba’s selective scan is computationally intensive but scales linearly. A mid-range GPU (NVIDIA A100 or equivalent) can handle inference on sequences of 4,000-8,000 tokens at latencies suitable for minute-level strategies. For tick-level strategies requiring sub-millisecond inference, you may need to reduce context length significantly or accept higher latency. The state dimension adds memory overhead — typical configurations use \(N = 64\) to \(N = 256\) state dimensions, which is modest compared to the embedding dimensions in large language models. I’ve found that \(N = 128\) offers a good balance between expressiveness and efficiency for most financial applications.

Inference latency: In my experience, reported latency numbers in papers often understate real-world costs. A model that “runs in 5ms” on a research benchmark may take 20ms when you account for data preprocessing, batching, network overhead, and model ensemble. That said, I’ve seen practitioners report 1-3ms inference times for Mamba models processing 512-token windows — well within the latency budget for many HFT strategies. Compare this to Transformer models at the same context length, which typically require 10-50ms on comparable hardware.

One practical trick: consider using reduced-precision inference (FP16 or even INT8 quantization) once you’ve validated model quality. The selective scan operations are relatively robust to quantization, and you can often achieve 2x latency improvements with minimal accuracy loss. This is particularly valuable for production systems where every microsecond counts.

Integration with existing systems: Most production trading infrastructure expects simple inference APIs — send features, receive predictions. Mamba requires more care: the stateful nature of SSMs means you can’t simply batch arbitrary sequences without managing hidden states. This is manageable but requires engineering effort. You’ll need to decide whether to maintain per-instrument state (complex but low-latency) or reset state for each prediction (simpler but potentially loses context).

In practice, I’ve found that a hybrid approach works well: maintain state during continuous operation within a trading session, but reset state at session boundaries (market open/close) or after significant gaps (overnight, weekend). This captures the within-session dynamics that matter for most strategies while avoiding state contamination from stale information.
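
A sketch of that state-management pattern. The mamba_step and init_state callables stand in for whatever single-step inference interface your model exposes; they are placeholders, not a real library API:

states = {}   # instrument -> hidden state carried across ticks within a session

def on_tick(instrument, features, is_session_start, mamba_step, init_state):
    """Keep per-instrument state intraday; reset at session boundaries or after gaps."""
    if is_session_start or instrument not in states:
        states[instrument] = init_state()
    prediction, states[instrument] = mamba_step(states[instrument], features)
    return prediction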

Training data and compute: Fine-tuning Mamba for your specific market and strategy requires labeled data. Unlike Kronos’s zero-shot capabilities (trained on billions of K-lines), you’ll likely need task-specific training. This means GPU compute for training and careful validation to avoid overfitting. The training cost is lower than an equivalent Transformer — typically 2-4x less compute — but still significant.

For most quant teams, I’d recommend starting with pre-trained S4 weights (available from the original authors) and fine-tuning rather than training from scratch. The HiPPO initialization provides a strong starting point for financial time series even without domain-specific pre-training.

Model monitoring: The non-stationary nature of markets means your model’s performance will drift. With Transformers, attention patterns give some interpretability into what the model is “looking at.” With Mamba, the selective mechanism is less transparent. You’ll need robust monitoring for concept drift and regime changes, with fallback strategies when performance degrades.

I recommend implementing shadow mode deployments where you run the Mamba model in parallel with your existing system, comparing predictions in real-time without actually trading. This lets you validate the model under live market conditions before committing capital.

Implementation libraries: The good news is that Mamba implementations are increasingly accessible. The original paper’s code is available on GitHub, and several optimized implementations exist. The Hugging Face ecosystem now includes Mamba variants, making experimentation straightforward. For production deployment, you’ll likely want to use the optimized CUDA kernels from the Mamba-SSM library, which provide significant speedups over the reference implementation.
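
For orientation, the reference mamba-ssm package exposes a block along these lines at the time of writing (check the repository for the current constructor signature and CUDA requirements; the dimensions below are illustrative):

import torch
from mamba_ssm import Mamba            # pip install mamba-ssm (CUDA build required)

batch, seq_len, n_features = 32, 4096, 16     # e.g. 4,096 ticks x 16 LOB features
x = torch.randn(batch, seq_len, n_features, device="cuda")

block = Mamba(
    d_model=n_features,   # per-timestep feature dimension
    d_state=128,          # SSM state dimension (the N discussed above)
    d_conv=4,             # local convolution width
    expand=2,             # block expansion factor
).to("cuda")

y = block(x)              # (batch, seq_len, n_features); compute scales linearly in seq_len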

Limitations and Open Questions

Let me be direct about what we don’t yet know:

The Quant’s Reality Check: Critical Questions for Production

Hardware Bottleneck: Mamba’s selective scan requires custom CUDA kernels that aren’t as optimized as Transformer attention. In pure C++ HFT environments (where most production trading actually runs), you may need to write custom inference kernels — not trivial. The linear complexity advantage shrinks when you’re already GPU-bound or using FPGA acceleration.

Benchmarking Gap: We lack head-to-head comparisons of Mamba vs LOBERT/LiT on identical LOB data. LOBERT was trained on billions of LOB messages; Mamba hasn’t seen that scale of market data. The “fair fight” comparison hasn’t been run yet.

Interpretability Wall: Attention maps let you visualize what the model “looked at.” Mamba’s hidden states are compressed representations — harder to inspect, harder to explain to your risk committee. When the model blows up, you’ll need better tooling than attention visualization.

Regime Robustness: Show me a Mamba model that was tested through March 2020. I haven’t seen it. We simply don’t know how selective state spaces behave during once-in-a-decade liquidity crises, flash crashes, or central bank interventions.

Empirical evidence at scale: Most SSM papers in finance report results on small-to-medium datasets (thousands to hundreds of thousands of time series). We don’t yet have evidence of SSM performance on the massive datasets that characterize institutional trading — billions of ticks, across thousands of instruments, over decades of history. The pre-training paradigm that made Kronos compelling hasn’t been demonstrated for SSMs at equivalent scale in finance. This is probably the biggest gap in the current research landscape.

Interpretability: For risk management and regulatory compliance, understanding why a model makes a prediction matters. Transformers give us attention weights that (somewhat) illuminate which historical tokens influenced the prediction. Mamba’s hidden states are less interpretable. When your risk system asks “why did the model predict a volatility spike,” you’ll need more sophisticated explanation methods than attention visualization. Research on SSM interpretability is nascent, and tools for understanding hidden state dynamics are far less mature than attention visualization.

Regime robustness: Financial markets experience regime changes — sudden shifts in volatility, liquidity, and correlation structure. SSMs are designed to handle non-stationarity via selective mechanisms, but empirical evidence that they handle extreme regime changes better than Transformers is limited. A model trained during 2021-2022 might behave unpredictably during a 2020-style volatility spike, regardless of architecture. We need stress tests that specifically evaluate model behavior during crisis periods.

Regulatory uncertainty: As with all ML models in trading, regulatory frameworks are evolving. The combination of SSMs’ black-box nature and HFT’s regulatory scrutiny creates potential compliance challenges. Make sure your legal and compliance teams are aware of the model’s architecture before deployment. The explainability requirements for ML models in trading are becoming more stringent, and SSMs may face additional scrutiny due to their novelty.

Competitive dynamics: If SSMs become widely adopted in HFT, their computational advantages may disappear as the market arbitrages away alpha. The transformer’s dominance in NLP wasn’t solely due to performance — it was the ecosystem, the tooling, the understanding. SSMs are early in this curve. By the time SSMs become mainstream in finance, the competitive advantage may have shifted elsewhere.

Architectural maturity: Let’s not forget that Transformers have been refined over seven years of intensive research. Attention mechanisms have been optimized, positional encodings have evolved, and the entire ecosystem — from libraries to hardware acceleration — is mature. SSMs are at version 1.0. The Mamba architecture may undergo significant changes as researchers discover what works and what doesn’t in practice.

Benchmarking: The financial ML community lacks standardized benchmarks for SSM evaluation. Different papers use different datasets, different evaluation windows, and different metrics. This makes comparison difficult. We need something akin to the M4/M5 forecasting competitions, but designed around financial data and modern deep learning architectures.

Conclusion: A Pragmatic Hybrid View

The question “Can Mamba replace Transformers?” is the wrong frame. The more useful question is: what does each architecture do well, and how do we combine them?

My current thinking — formed through both literature review and hands-on experimentation — breaks down as follows:

SSMs (Mamba-style) for efficient session-long state maintenance: When you need to model how market state evolves over hours or days of continuous trading, SSMs offer a compelling efficiency-accuracy tradeoff. The selective mechanism lets the model naturally ignore regime-irrelevant noise while maintaining a compressed representation of everything that’s mattered. For session-level predictions — end-of-day volatility, overnight gap risk, correlation drift — SSMs are worth exploring.

Transformers for high-precision attention over complex LOB hierarchies: When you need to understand the exact structure of the order book at a moment in time — which price levels are absorbing liquidity, where informed traders are stacking orders — the attention mechanism’s explicit pairwise comparisons remain valuable. Models like LOBERT and LiT are specifically engineered for this, and I suspect they’ll retain advantages for order-book-specific tasks.

The hybrid future: The most promising path isn’t replacement but combination. Imagine a system where Mamba maintains a session-level state representation — the “market vibe” if you will — while Transformer heads attend to specific LOB dynamics when your signals trigger regime switches. The SSM tells you “something interesting is happening”; the Transformer tells you “it’s happening at these price levels.”

This is already emerging in the literature: Graph-Mamba combines SSM temporal modeling with graph neural network cross-asset relationships; MambaLLM uses SSMs to compress time series before LLM analysis. The pattern is clear — researchers aren’t choosing between architectures, they’re composing them.

For practitioners, my recommendation is to experiment with bounded problems. Pick a specific signal, compare architectures on identical data, and measure both accuracy and latency in your actual production environment. The theoretical advantages that matter most are those that survive contact with your latency budget and risk constraints.

The post-Transformer era isn’t about replacement — it’s about selection. Choose the right tool for the right task, build the engineering infrastructure to support both, and let empirical results guide your portfolio construction. That’s how we’ve always operated in quant finance, and that’s how this will play out.

I’m continuing to experiment. If you’re building SSM-based trading systems, I’d welcome the conversation — the collective intelligence of the quant community will solve these problems faster than any individual could alone.

References

  1. Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv preprint arXiv:2312.00752. https://arxiv.org/abs/2312.00752
  2. Gu, A., Goel, K., & Ré, C. (2022). Efficiently Modeling Long Sequences with Structured State Spaces. In International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=uYLFoz1vlAC
  3. Linna, E., et al. (2025). LOBERT: Generative AI Foundation Model for Limit Order Book Messages. arXiv preprint arXiv:2511.12563. https://arxiv.org/abs/2511.12563
  4. (2025). LiT: Limit Order Book Transformer. Frontiers in Artificial Intelligence. https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2025.1616485/full
  5. Avellaneda, M., & Stoikov, S. (2008). High-frequency trading in a limit order book. Quantitative Finance, 8(3), 217–224. (Manuscript PDF) https://people.orie.cornell.edu/sfs33/LimitOrderBook.pdf