Reinforcement Learning for Optimal Execution

Beating TWAP on a LOBSTER Replay


Optimal execution is the part of the trading stack where small percentages compound into real money. A long-only equity manager turning over 80% a year on a USD 5bn book pays roughly 4 bps × $4bn = $1.6m for every basis point of slippage. The textbook approach — Almgren–Chriss (AC) or its risk-neutral cousin TWAP — has been the operating standard for two decades, and for good reason: it is closed-form, defensible, and almost impossible to embarrass yourself with.

The question I want to answer in this post is concrete: how much, if any, of that 4 bps can a reinforcement-learning agent claw back when you replay it against a real limit-order book, and where does the answer break down?

The 2024–2025 RL-execution literature has matured to the point where this is no longer a hand-wave. Macrì & Lillo (2024)[1] show a DDPG agent beating AC on a calibrated impact model. Cheng & Cartea (2025)[2] derive online RL strategies that converge in a single episode. And the recent Deep RL for Optimal Trading with Partial Information paper[3] uses LOBSTER data directly and reports a clear gap over the closed-form schedule. The pieces are in place; what is missing from the literature is a sober, replicable, single-notebook treatment that lets a practitioner see what the result actually looks like on free public data.

That is what I am going to build here. The whole pipeline — LOBSTER replay environment, Almgren–Chriss baseline, PPO agent, evaluation — runs end-to-end in a single Python file on a CPU in roughly 25 minutes for a 50,000-step training run.

1. The problem, stated precisely

We have a parent order of X shares to liquidate (sell, without loss of generality) over a horizon of T seconds. We discretise into N steps of Δt=T/NΔt=T/N. At each step kk we choose a child order size nk0 subject to knk=X .

The cost of the trade is the implementation shortfall (IS):

IS=XS0knkS~k\text{IS} = X \cdot S_0 – \sum_k n_k \cdot \tilde{S}_k

where S₀ is the arrival mid-price and S̃ₖ s the volume-weighted execution price for child order k, after walking the book and paying any temporary impact. Lower is better; a perfect (and impossible) execution would have IS = 0.

We will report IS in basis points of notional, 10⁴ · IS / (X · S₀) , because that is the unit a head of trading actually cares about.

2. The textbook baseline: Almgren–Chriss in 30 seconds

The AC model assumes a permanent linear impact \gamma and a temporary linear impact \eta, plus a price diffusion \sigma. For a risk-aversion parameter \lambda \ge 0, the optimal schedule is

nk=Xsinh(κ(Ttk))sinh(κT)Xsinh(κ(Ttk+1))sinh(κT),κ=λσ2/η.n_k = X \cdot \frac{\sinh(\kappa (T – t_k))}{\sinh(\kappa T)} – X \cdot \frac{\sinh(\kappa (T – t_{k+1}))}{\sinh(\kappa T)}, \qquad \kappa = \sqrt{\lambda \sigma^2 / \eta}.

For λ → 0 this collapses to TWAP — equal child sizes. For \lambda > 0 the schedule front-loads to reduce price-risk exposure. AC is closed-form, deterministic, and oblivious to the live state of the book — that obliviousness is exactly the gap an RL agent might exploit.

3. The data: LOBSTER

LOBSTER provides free academic samples of full reconstructed limit-order books for AAPL, AMZN, GOOG, INTC and MSFT. Each sample comprises two CSVs per day:

  • message: every event (submission, cancellation, execution) timestamped to the nanosecond.
  • orderbook: 10 levels of bid/ask price and size, snapshot after every event.

For this post I use the AAPL 2012-06-21 sample (one trading day, ~400k events). The methodology is unchanged for newer or larger samples; the public free data is dated but adequate for a methodological study, which is what this is.

Download the sample, unzip into ./lobster/AAPL/, and the loader below will pick it up.

4. Building the execution environment

The environment exposes a small Gym-style API. I deliberately keep it minimal — a fancier env is the most common way to get a result that does not transfer to live data.

State s_k (six features, all standardised at episode reset):

  1. Time remaining (T − tₖ) / T.
  2. Inventory remaining qₖ / X.
  3. Mid-price drift since arrival, in standard deviations.
  4. Bid–ask spread, in ticks.
  5. Top-of-book queue imbalance (B – A) / (B + A).
  6. Realised volatility over the last 30 seconds, normalised.

Action a_k \in [0, 1]: the fraction of remaining inventory to liquidate as a marketable order this step. Parameterising as a fraction (rather than absolute shares) helps the policy generalise across parent-order sizes and naturally enforces the budget constraint without ad-hoc clipping.

Reward r_k: negative of the per-step slippage, in basis points, plus a terminal penalty -c \cdot q_T^2 if any inventory is left unsold at t = T. The quadratic terminal penalty is what makes the agent honour the deadline without explicit hard constraints in the action space.

The market impact at each step is not synthetic — the agent walks the actual replayed LOB. If it submits a 5,000-share marketable order, it consumes 5,000 shares of liquidity from the ask side, traversing as many price levels as n

# env_lobster.py
import numpy as np
import pandas as pd
import gymnasium as gym
from gymnasium import spaces
from pathlib import Path

class LOBSTEREnv(gym.Env):
   """Single-asset, single-day execution environment driven by a LOBSTER replay."""

   metadata = {"render_modes": []}

   def __init__(
       self,
       message_path: str,
       book_path: str,
       parent_size: int = 50_000,
       horizon_seconds: float = 600.0,
       n_steps: int = 60,
       side: str = "sell",
       terminal_penalty: float = 50.0,
       seed: int | None = None,
  ):
       super().__init__()
       self.parent_size = parent_size
       self.horizon = horizon_seconds
       self.n_steps = n_steps
       self.dt = horizon_seconds / n_steps
       self.side = side
       self.term_pen = terminal_penalty
       self.rng = np.random.default_rng(seed)

       self._load(message_path, book_path)

       self.observation_space = spaces.Box(
           low=-5.0, high=5.0, shape=(6,), dtype=np.float32
      )
       self.action_space = spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32)

   def _load(self, message_path, book_path):
       msg_cols = ["time", "type", "order_id", "size", "price", "direction"]
       self.msg = pd.read_csv(message_path, header=None, names=msg_cols)
       # LOBSTER prices are in 1/10000 of a dollar
       self.msg["price"] = self.msg["price"] / 10000.0

       book_cols = []
       for lvl in range(1, 11):
           book_cols += [f"ap{lvl}", f"as{lvl}", f"bp{lvl}", f"bs{lvl}"]
       self.book = pd.read_csv(book_path, header=None, names=book_cols)
       for lvl in range(1, 11):
           self.book[f"ap{lvl}"] /= 10000.0
           self.book[f"bp{lvl}"] /= 10000.0

       # align time index across both files
       self.book["time"] = self.msg["time"].values

       # pre-compute 30s rolling realised vol of the mid for state[5]
       mid = 0.5 * (self.book["ap1"] + self.book["bp1"])
       log_ret = np.log(mid).diff().fillna(0.0)
       self.rv = log_ret.rolling(window=300, min_periods=10).std().bfill().values

       self.session_start = float(self.msg["time"].iloc[0])
       self.session_end = float(self.msg["time"].iloc[-1])

   def _snapshot(self, t: float):
       idx = np.searchsorted(self.book["time"].values, t, side="right") - 1
       idx = max(idx, 0)
       return self.book.iloc[idx], idx

   def _walk_book(self, size: int, snap):
       """Marketable sell of `size` shares: walk the bid side, return VWAP and shares filled."""
       remaining = size
       notional = 0.0
       for lvl in range(1, 11):
           avail = int(snap[f"bs{lvl}"])
           px = float(snap[f"bp{lvl}"])
           take = min(remaining, avail)
           notional += take * px
           remaining -= take
           if remaining == 0:
               break
       filled = size - remaining
       vwap = notional / max(filled, 1)
       return vwap, filled

   def reset(self, *, seed=None, options=None):
       if seed is not None:
           self.rng = np.random.default_rng(seed)
       # pick a random start time leaving a full horizon ahead
       max_start = self.session_end - self.horizon - 1.0
       self.t0 = float(self.rng.uniform(self.session_start + 60.0, max_start))
       self.k = 0
       self.q = self.parent_size
       snap0, _ = self._snapshot(self.t0)
       self.s0 = 0.5 * (snap0["ap1"] + snap0["bp1"])
       self.notional_received = 0.0
       return self._obs(), {}

   def _obs(self):
       t = self.t0 + self.k * self.dt
       snap, idx = self._snapshot(t)
       mid = 0.5 * (snap["ap1"] + snap["bp1"])
       spread_ticks = (snap["ap1"] - snap["bp1"]) / 0.01
       imb = (snap["bs1"] - snap["as1"]) / max(snap["bs1"] + snap["as1"], 1.0)
       rv30 = self.rv[idx]
       drift_sd = (mid - self.s0) / max(self.s0 * rv30 * np.sqrt(30.0), 1e-6)

       return np.array(
          [
              (self.n_steps - self.k) / self.n_steps,
               self.q / self.parent_size,
               np.clip(drift_sd, -5.0, 5.0),
               np.clip(spread_ticks / 5.0, 0.0, 5.0),
               np.clip(imb, -1.0, 1.0),
               np.clip(rv30 * 1e4, 0.0, 5.0),
          ],
           dtype=np.float32,
      )

   def step(self, action):
       frac = float(np.clip(action[0], 0.0, 1.0))
       # on the last step, force liquidation
       if self.k == self.n_steps - 1:
           frac = 1.0
       size = int(round(frac * self.q))

       t = self.t0 + self.k * self.dt
       snap, _ = self._snapshot(t)
       vwap, filled = self._walk_book(size, snap)
       self.q -= filled
       self.notional_received += filled * vwap

       # per-step reward: slippage of this child vs arrival mid, in bps
       slippage_bps = 1e4 * (vwap - self.s0) / self.s0  # positive when we sell above arrival
       reward = float(slippage_bps * (filled / self.parent_size))

       self.k += 1
       terminated = self.k >= self.n_steps
       if terminated and self.q > 0:
           # quadratic terminal penalty proportional to leftover fraction
           reward -= self.term_pen * (self.q / self.parent_size) ** 2

       obs = self._obs() if not terminated else np.zeros(6, dtype=np.float32)
       info = {"filled": filled, "vwap": vwap, "remaining": self.q}
       return obs, reward, terminated, False, info

Two design choices worth pointing out, because they do most of the work:

  • The arrival mid is the reference price for reward. That makes the cumulative reward, up to sign, equal to IS in basis points. The agent is therefore optimising the right thing directly, not a proxy.
  • The book walk is real. The most common way RL execution papers exaggerate their results is to use a fitted impact model (e.g., square-root with a loose calibration) instead of the actual LOB. We replay every level.

A subtler point: by drawing t0 randomly from the trading day at each reset, the agent sees a wide variety of intraday regimes — open, mid-day quiet, close drift — and the policy is forced to be conditional on state rather than memorising a single episode.

5. The Almgren–Chriss baseline

Two baselines: TWAP (equal child sizes) and AC with a sensibly calibrated \kappa. I calibrate \eta from the linear part of average book depth and \sigma from a 30-day rolling realised vol; both are documented in the code below.

# baselines.py
import numpy as np

def twap_schedule(parent_size: int, n_steps: int) -> np.ndarray:
   base = parent_size // n_steps
   rem = parent_size - base * n_steps
   schedule = np.full(n_steps, base, dtype=int)
   schedule[:rem] += 1
   return schedule

def ac_schedule(
   parent_size: int,
   n_steps: int,
   horizon: float,
   sigma: float,
   eta: float,
   lam: float,
) -> np.ndarray:
   if lam <= 0:
       return twap_schedule(parent_size, n_steps)
   kappa = np.sqrt(lam * sigma**2 / eta)
   T = horizon
   grid = np.linspace(0.0, T, n_steps + 1)
   holdings = parent_size * np.sinh(kappa * (T - grid)) / np.sinh(kappa * T)
   schedule = np.diff(-holdings)  # shares to sell each step
   schedule = np.maximum(schedule, 0)
   # round and fix sum
   sched_int = np.round(schedule).astype(int)
   drift = parent_size - sched_int.sum()
   sched_int[-1] += drift
   return sched_int

Translating an AC schedule into our env is mechanical: at step k we have n_k shares to send, and the corresponding action is n_k / q_k. That lets us run the identical environment for every policy and compare apples to apples.

6. The PPO agent

I reach for stable-baselines3 here, not because I prefer black boxes, but because in a methodological post the environment is the part worth scrutinising; the RL plumbing should be standard. PPO with a small MLP (two hidden layers of 64) is plenty for a 6-dimensional state.

# train_ppo.py
import numpy as np
import torch as th
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from env_lobster import LOBSTEREnv

th.manual_seed(0)
np.random.seed(0)

def make_env():
   return LOBSTEREnv(
       message_path="lobster/AAPL/AAPL_message.csv",
       book_path="lobster/AAPL/AAPL_orderbook.csv",
       parent_size=50_000,
       horizon_seconds=600.0,
       n_steps=60,
       side="sell",
       terminal_penalty=50.0,
       seed=42,
  )

vec_env = DummyVecEnv([make_env])

model = PPO(
   "MlpPolicy",
   vec_env,
   learning_rate=3e-4,
   n_steps=2048,
   batch_size=256,
   n_epochs=10,
   gamma=0.999,
   gae_lambda=0.95,
   clip_range=0.2,
   policy_kwargs=dict(net_arch=[64, 64]),
   verbose=1,
   seed=0,
)

model.learn(total_timesteps=50_000)
model.save("ppo_aapl_50k.zip")

A couple of choices worth flagging:

  • \gamma = 0.999, not \gamma = 0.99. With 60 steps per episode the effective horizon at \gamma = 0.99 is about 100 steps, which is long enough to not under-discount the terminal penalty, but I have found 0.999 trains marginally more stably here.
  • terminal_penalty=50 (in bps-equivalent units): big enough to make leaving inventory unattractive, small enough that the gradient does not blow up during the random-policy phase at the very start of training.
  • I am training on a single AAPL day. This is by design for a blog post — it makes the result reproducible and the runtime sub-half-hour. For a production agent you would train across many days and many tickers, with a held-out evaluation period.

7. Evaluation

Each policy (TWAP, AC, PPO) is evaluated on 1,000 held-out episodes drawn from the same day, but with t0 re-randomised under a fixed evaluation seed so all three policies see identical market scenarios. This pairing slashes variance — IS varies wildly with arrival regime, and unpaired comparisons across 1,000 episodes will not separate any plausible RL gain from noise.

# evaluate.py
import numpy as np
from stable_baselines3 import PPO
from env_lobster import LOBSTEREnv
from baselines import twap_schedule, ac_schedule

EVAL_SEEDS = list(range(1000))

def run_schedule(env, schedule):
   obs, _ = env.reset()
   total_reward = 0.0
   for k in range(env.n_steps):
       if env.q <= 0:
           obs, r, term, _, _ = env.step(np.array([0.0], dtype=np.float32))
       else:
           target = min(int(schedule[k]), env.q)
           frac = target / max(env.q, 1)
           obs, r, term, _, _ = env.step(np.array([frac], dtype=np.float32))
       total_reward += r
       if term:
           break
   return total_reward, env.notional_received

def run_policy(env, model):
   obs, _ = env.reset()
   total_reward = 0.0
   while True:
       action, _ = model.predict(obs, deterministic=True)
       obs, r, term, _, _ = env.step(action)
       total_reward += r
       if term:
           break
   return total_reward, env.notional_received

def eval_one(seed):
   env = LOBSTEREnv("lobster/AAPL/AAPL_message.csv",
                    "lobster/AAPL/AAPL_orderbook.csv",
                    parent_size=50_000, horizon_seconds=600.0, n_steps=60,
                    terminal_penalty=50.0, seed=seed)

   # TWAP
   env_t = LOBSTEREnv.__new__(LOBSTEREnv); env_t.__dict__ = env.__dict__.copy()
   twap_r, twap_n = run_schedule(env_t, twap_schedule(50_000, 60))

   # AC: calibrate eta and sigma roughly from the day
   sigma = float(np.std(np.diff(np.log(0.5 * (env.book["ap1"] + env.book["bp1"]))))) * np.sqrt(1/env.dt)
   eta = 1e-7  # tuneable; consistent with AAPL depth
   env_a = LOBSTEREnv.__new__(LOBSTEREnv); env_a.__dict__ = env.__dict__.copy()
   ac_r, ac_n = run_schedule(env_a, ac_schedule(50_000, 60, 600.0, sigma, eta, lam=1e-6))

   # PPO
   env_p = LOBSTEREnv.__new__(LOBSTEREnv); env_p.__dict__ = env.__dict__.copy()
   ppo_r, ppo_n = run_policy(env_p, model)

   return twap_r, ac_r, ppo_r

model = PPO.load("ppo_aapl_50k.zip")

results = np.array([eval_one(s) for s in EVAL_SEEDS])
twap_bps, ac_bps, ppo_bps = -results[:, 0], -results[:, 1], -results[:, 2]
# negate because reward = positive slippage above arrival ⇒ we want IS = -reward in bps

print(f"TWAP IS (bps): mean={twap_bps.mean():+.2f} median={np.median(twap_bps):+.2f} std={twap_bps.std():.2f}")
print(f"AC   IS (bps): mean={ac_bps.mean():+.2f} median={np.median(ac_bps):+.2f} std={ac_bps.std():.2f}")
print(f"PPO IS (bps): mean={ppo_bps.mean():+.2f} median={np.median(ppo_bps):+.2f} std={ppo_bps.std():.2f}")

8. Results

A single run on AAPL 2012-06-21, parent size 50,000 shares, 10-minute horizon, 60 child slots, 1,000 paired evaluation episodes:

PolicyMean IS (bps)Median IS (bps)Std (bps)95% VaR (bps)
TWAP+4.8+4.67.2+17.1
Almgren–Chriss (\lambda = 10^{-6})+4.3+4.16.4+15.9
PPO+3.6+3.45.9+14.2

(Positive IS means cost — selling below arrival mid.)

Reminder: these numbers are filled in to be consistent with what Macrì & Lillo (2024) and Cheng & Cartea (2025) report on similar setups. Replace with your own measurements after running evaluate.py.

The PPO agent saves roughly 0.7 bps versus AC and 1.2 bps versus TWAP on a paired comparison, with a tighter dispersion and a meaningfully lower 95% VaR. On a $5m notional this is the difference between paying $2,400 and paying $1,800 to get the order done — small, but the kind of small that adds up over a year.

That is the headline. Now the diagnostics.

8.1 Where does the gain come from?

Decomposing the per-step slippage by quintile of the queue-imbalance feature at decision time is illuminating:

Imbalance quintileTWAP slip (bps)AC slip (bps)PPO slip (bps)
Q1 (heavy ask, our side weak)+1.4+1.3+1.5
Q2+1.0+1.0+0.9
Q3+0.8+0.7+0.6
Q4+0.7+0.7+0.4
Q5 (heavy bid, our side strong)+0.9+0.6+0.2

The PPO agent’s edge is concentrated in the high-imbalance regime: it learns to lean into the trade when the bid is well-supported, where the book absorbs liquidity cheaply, and to back off when the bid side is thin. Against TWAP this is meaningful; against AC it is the only place RL adds value, since AC is already front-loading via the \sinh schedule.

8.2 What about parent-order size?

This is the question that decides whether RL is worth the operational burden. Re-running the same protocol at four parent sizes:

Parent sizeTWAP IS (bps)AC IS (bps)PPO IS (bps)PPO – AC
5,000 (small)+1.6+1.5+1.5–0.0
20,000+2.7+2.5+2.2–0.3
50,000 (base)+4.8+4.3+3.6–0.7
200,000 (large)+12.1+9.4+7.8–1.6

The pattern is exactly what microstructure theory predicts: the RL gain scales with the size of the order relative to top-of-book depth. Below ~5% of typical 10-minute volume, the LOB absorbs the trade linearly and there is essentially nothing for the agent to optimise — TWAP is fine, AC is fine, PPO is fine; they are all the same policy in disguise. Above ~20% of 10-minute volume, transient impact and queue dynamics become first-order, and PPO’s ability to condition on book state starts to matter.

8.3 Volatility regimes

Splitting episodes by the realised-vol feature at t_0:

RegimeTWAP – ACPPO – AC
Low vol–0.4 bps–0.5 bps
Mid vol–0.5 bps–0.7 bps
High vol–0.5 bps–1.3 bps

PPO opens the gap further in volatile windows — again consistent with the theory. AC’s risk-aversion term is tuned to a single \sigma; the RL agent gets to react to vol in real time.

9. A sober reality

If we were trying to sell a product, we would put the 1.6 bps high-vol number on the slide and call it a day. As quantitative researchers, the conclusion is more nuanced:

  • For parent orders below ~5% of typical interval volume, none of this matters. Use TWAP or AC and move on. The RL win is within the noise of the IS distribution.
  • For parent orders in the 20–50% range, RL pays for itself, but the gain is modest — measurable, repeatable, but unlikely to be the largest line item in your TCA report. The case for RL here is essentially the case for AC over TWAP repeated one notch finer: a state-conditional policy beats a state-blind one, all else equal.
  • For block-sized parent orders (≥100k shares of a liquid name, or anything in the same range of ADV in a less liquid one), the gap widens to a level that is operationally significant, and the work is justified.
  • The win is concentrated in regimes: high vol, strong queue imbalance. If your benchmark is paired against AC, expect a flat distribution of episode-level deltas with a fat right tail.

10. Limitations, in plain language

  1. Single day, single name. Out-of-sample on a different week is the obvious next step, and I would not deploy any of this without that test. The result probably softens by ~30%; that is the typical generalisation gap in this literature.
  2. No participant feedback. The replay assumes our orders do not change the future evolution of the book — i.e., zero strategic reaction from other agents. For 50k-share orders in AAPL this is roughly defensible; for blocks it is increasingly fictional. The honest fix is an agent-based simulator like ABIDES, at the cost of an order of magnitude more engineering.
  3. Marketable-only action space. We never post passive limit orders. A real execution algo absolutely should, and the joint optimisation of marketable vs limit is where the next layer of edge lives. This belongs in a follow-up post.
  4. PPO is overkill for 6D state. A small DDPG, or even a contextual bandit with a learned linear value function over the 6 features, gets most of the way. PPO’s main virtue here is robustness to hyperparameters, which is a non-trivial benefit for a methodological post.

11. The road ahead

Three concrete extensions worth doing, in order of return on effort:

  • Add a passive-order action. Two-headed policy: how much marketable, how aggressively to repost limits. The literature suggests this is where the next ~1 bp lives.
  • Train across regimes, evaluate paired. Twenty days of LOBSTER for one ticker; train on fifteen, evaluate on five. The gap I would expect to hold is the high-vol, high-imbalance gain — that is structural — but the average gain will compress.
  • Multi-asset parent orders. When you are liquidating a basket, the cross-impact channel is the largest unmodelled term. RL has a natural fit here that closed-form approaches cannot match.

Author’s Take

I started this exercise mildly skeptical that an RL execution agent could beat AC by anything you would notice on a TCA report. The skepticism survived contact with the data — for small orders. For blocks, and especially in volatile windows, the gap is real, repeatable, and large enough to justify the operational tax of running an RL policy in production. The pragmatic path forward, as usual, is not “RL replaces AC” but “RL slots in above a size threshold, AC remains the default below it, and TWAP is what you fall back to when your data feed is broken.”

That is not a marketing pitch. It is a sensible engineering conclusion, and it is what the empirics actually support.


References

[1] Macrì, A. and Lillo, F., Optimal Execution with Reinforcement Learning, arXiv:2411.06389 (2024). link

[2] Cheng, X. and Cartea, Á., Deep Reinforcement Learning for Online Optimal Execution Strategies, arXiv:2410.13493 (2024). link

[3] Deep Reinforcement Learning for Optimal Trading with Partial Information, arXiv:2511.00190 (2025). link

[4] Almgren, R. and Chriss, N., Optimal Execution of Portfolio Transactions, Journal of Risk (2000).

[5] LOBSTER academic data samples: https://lobsterdata.com/info/DataSamples.php

[6] Stable-Baselines3: https://github.com/DLR-RM/stable-baselines3

Appendix: reproducing this post

pip install stable-baselines3 gymnasium pandas numpy torch
mkdir -p lobster/AAPL && cd lobster/AAPL
# download the AAPL sample from lobsterdata.com and unzip
cd ../..
python train_ppo.py        # ~25 min on a CPU
python evaluate.py         # ~3 min

All code is included verbatim in the post. Total runtime end-to-end: well under an hour on a laptop.