Deep Learning for Volatility Surface Repair

A self-contained synthetic benchmark of a small mask-conditional CNN against calendar-projected linear interpolation and a per-slice SVI fit.

A market volatility surface is rarely a clean rectangle of quotes. Strikes go unobserved during illiquid hours, wings get crossed and then erased, broker stripes drop out across an entire maturity, and weeklies arrive at the desk with random missingness on top of base quote noise. Anyone calibrating an SVI surface or running an SSVI fit operationally is doing it on top of an upstream repair step, whether that step is explicit or not. The repair step is usually some flavour of local interpolation, sometimes followed by a no-arbitrage projection, sometimes pre-empted by a model-based smoother.

The question I want to put a number on is whether a small learned model can compete with the local approach in this repair role. The reason to ask is that a learned model, in principle, knows something about the joint structure of plausible volatility surfaces that a local interpolator does not — vol surfaces are not arbitrary functions on a (k, T) grid, they have term-structure shape, characteristic skew patterns, ATM smoothness — and a model that has seen a thousand surfaces should be able to use that prior to improve on local interpolation, especially where local data is thin.

The reason to be sceptical is that local methods are very strong at what they do. Linear interpolation in (T, k) is unbiased, has no parameters to overfit, costs nothing operationally, and is hard to beat on smooth surfaces with reasonable observation density. Per-slice SVI gets you smile shape correctly even when only a handful of strikes are observed, provided the slice has enough quotes to fit. Beating both of those baselines requires that the learned prior contributes something local methods cannot — and the most plausible places for that to happen are when local data is too sparse for SVI to fit and too irregular for interpolation to fill the gap cleanly.

This note runs that experiment on synthetic data. It is deliberately a small CNN rather than a U-Net or a VAE, partly because that is the smallest interesting architecture for this problem, and partly because if a small model cannot establish a foothold here, the question of whether to build something larger has a clearer answer than if it can.

The full code is in vol_surface_repair.py; it runs on CPU in roughly four minutes.

Setup

Grid. Maturities and log-moneyness on a 13 × 17 grid: T \in [0.08, 2.0] (years), k \in [-0.45, 0.45]. All surfaces are stored as total variance w(k, T) = \sigma^2(k, T)\, T; evaluation metrics are computed in implied-vol units (\sigma) for interpretability.
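The storage and evaluation conventions are a two-line conversion; this mirrors the `vol_to_w` / `w_to_vol` helpers in the script:

```python
import numpy as np

# Grid from the setup: 13 maturities x 17 log-moneyness points.
T_GRID = np.linspace(0.08, 2.0, 13)
K_GRID = np.linspace(-0.45, 0.45, 17)
TT, KK = np.meshgrid(T_GRID, K_GRID, indexing="ij")

def vol_to_w(vol):
    """Implied vol -> total variance: w(k, T) = sigma^2(k, T) * T."""
    return vol * vol * TT

def w_to_vol(w):
    """Total variance -> implied vol, floored for numerical safety."""
    return np.sqrt(np.maximum(w / TT, 1e-8))
```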

Training surfaces. 1600 SSVI surfaces drawn from a fairly tight parameter range:

theta0 = U(0.010, 0.032)        # ATM total variance at front
theta_slope = U(0.020, 0.070)   # linear term in maturity
theta_curve = U(-0.006, 0.010)  # quadratic term
rho = U(-0.72, -0.18)           # skew
eta = U(0.55, 1.55)             # SSVI eta
gamma = U(0.18, 0.62)           # SSVI gamma

Calendar monotonicity is enforced on the clean target via cumulative maximum along T so the model is never rewarded for learning calendar arbitrage from the simulator. A 200-surface validation set, drawn from the same parameter range with different seeds, is held out for best-checkpoint selection.
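The cumulative-maximum enforcement is essentially a one-liner; a minimal sketch:

```python
import numpy as np

def calendar_project(w):
    """Enforce calendar monotonicity: total variance w(k, T) must be
    non-decreasing in T at every fixed k, so take the running maximum
    along the maturity axis (axis 0 of a (N_T, N_K) array)."""
    return np.maximum.accumulate(w, axis=0)
```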

Test surfaces. Two test families, each with 200 surfaces per cell and seeds independent of training:

  • Shifted SSVI — same generator with widened parameter ranges and an occasional maturity-localised bump. The generator is in-distribution in form but stress-tests the boundaries.
  • SABR-style event — a separate generator deliberately not identical to SSVI, with square-root maturity decay in ATM vol, stochastic skew term-structure, asymmetric wings, and occasional event-maturity bumps and kinks. This is the out-of-distribution test: smile structure that the model has never seen.

Both families are then perturbed by realistic quote noise (~18 bps in vol space, with wing- and front-end inflation) and masked according to one of two missingness regimes:

  • Regular — about 50% observation density, modest wing deletions, occasional broker-style stripes.
  • Adversarial — about 18% observation density, large contiguous wing or maturity holes, with a thin ATM spine and a few scattered anchors restored so the problem remains solvable but is genuinely sparse.

The result is a 2 × 2 evaluation: {shifted SSVI, SABR-event} × {regular, adversarial}. The motivation is to disentangle two different ways the repair task can be hard: hard because the model has not seen this surface family before, and hard because the available data is too sparse for any local method to cope.

Baselines. Two of them.

  • Calendar-projected linear interpolation: triangulated linear interpolation in (T, k) on observed total variance, with a nearest-neighbour fallback for points outside the convex hull, followed by a cumulative-maximum projection along T to enforce calendar monotonicity. Unparameterised, fast, hard to beat on smooth surfaces.
  • Per-slice SVI fit: at each maturity, a coarse raw-SVI grid search over (\rho, m, \sigma) with the best (a, b) solved in closed form by least squares. Faster than nonlinear least squares, adequate as a baseline. Followed by the same calendar projection as interpolation. This is the published-textbook approach to surface repair when you have enough quotes per slice.
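The first baseline can be sketched in a few lines, assuming `w_obs` holds the observed total variance and `mask` is the binary observation mask on the (T, k) grid (the script's version is equivalent in structure):

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator, NearestNDInterpolator

def interp_baseline(w_obs, mask, TT, KK):
    """Calendar-projected linear interpolation: triangulated linear
    interpolation of observed total variance in (T, k), a nearest-
    neighbour fallback outside the convex hull, then a cumulative-max
    projection along T for calendar monotonicity."""
    obs = mask > 0.5
    pts = np.column_stack([TT[obs], KK[obs]])
    vals = w_obs[obs]
    query = np.column_stack([TT.ravel(), KK.ravel()])
    w_hat = LinearNDInterpolator(pts, vals)(query)
    hole = np.isnan(w_hat)  # points outside the convex hull
    w_hat[hole] = NearestNDInterpolator(pts, vals)(query[hole])
    return np.maximum.accumulate(w_hat.reshape(TT.shape), axis=0)
```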

Both baselines are evaluated on the same noisy/masked inputs as the CNN.
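The closed-form step in the SVI baseline is worth spelling out: for a fixed raw-SVI shape, the slice is linear in the remaining parameters, so the grid search only ever solves a small least-squares problem. A sketch:

```python
import numpy as np

def fit_ab(k_obs, w_obs, rho, m, sig):
    """For a fixed raw-SVI shape (rho, m, sigma), the slice is
    w(k) = a + b * f(k) with f(k) = rho (k - m) + sqrt((k - m)^2 + sigma^2),
    which is linear in (a, b): an ordinary least-squares solve."""
    x = k_obs - m
    f = rho * x + np.sqrt(x * x + sig * sig)
    A = np.column_stack([np.ones_like(f), f])
    (a, b), *_ = np.linalg.lstsq(A, w_obs, rcond=None)
    return a, b
```

The outer grid search over (ρ, m, σ) then just scores each shape by its residual and keeps the best.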

Model

The repair network is a four-layer 32-channel convolutional network with two output heads — a softplus mean head and a clipped log-variance head:

class RepairCNN(nn.Module):
    def __init__(self, in_channels=4, width=32):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, width, 3, padding=1), nn.SiLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
        )
        self.mean_head = nn.Conv2d(width, 1, 1)
        self.logvar_head = nn.Conv2d(width, 1, 1)

    def forward(self, x):
        z = self.backbone(x)
        mean = F.softplus(self.mean_head(z)) + 1e-5  # w / W_SCALE
        log_var = torch.clamp(self.logvar_head(z), -8.0, 1.5)
        return mean, log_var

The four input channels are the masked observed total variance (normalised by W_SCALE = 0.08), the binary observation mask, and two normalised coordinate channels for maturity and log-moneyness. The mean head outputs normalised total variance; the log-variance head produces a heteroscedastic uncertainty estimate.
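Assembling that input is a small stacking operation; a sketch, with `t_norm` and `k_norm` the normalised coordinate grids described above:

```python
import numpy as np
import torch

W_SCALE = 0.08  # total-variance normalisation scale from the text

def make_input(w_obs, mask, t_norm, k_norm):
    """Stack the four channels into a (1, 4, N_T, N_K) float tensor:
    masked w / W_SCALE, observation mask, and the two normalised
    coordinate grids for maturity and log-moneyness."""
    chans = np.stack([w_obs / W_SCALE, mask, t_norm, k_norm], axis=0)
    return torch.from_numpy(chans.astype(np.float32)).unsqueeze(0)
```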

The output space is total variance rather than implied vol. Both choices are defensible — a controlled experiment with the same training pipeline run in vol space gives essentially identical missing-point RMSE — and total variance has the practical advantage that calendar-arbitrage and butterfly-arbitrage diagnostics are natural in this space.

The loss is a weighted reconstruction term in the normalised w-space, plus a heteroscedastic Gaussian NLL (turned on after a 10-epoch warmup), plus a calendar-arbitrage penalty (penalising negative differences along T in real-w space), plus a small smoothness regulariser:

def repair_loss(mean_norm, log_var, target_norm, mask, cfg, use_nll):
    weights = 1.0 + (cfg.missing_weight - 1.0) * (1.0 - mask)
    sq = (mean_norm - target_norm)**2
    mse = torch.mean(weights * sq)
    loss = mse
    if use_nll and cfg.nll_weight:
        inv_var = torch.exp(-log_var)
        nll = 0.5 * torch.mean(weights * (sq * inv_var + log_var))
        loss = loss + cfg.nll_weight * nll
    if cfg.calendar_weight:
        loss = loss + cfg.calendar_weight * calendar_penalty_w(mean_norm)
    if cfg.smoothness_weight:
        loss = loss + cfg.smoothness_weight * smoothness_penalty(mean_norm)
    return loss

The missing-cell weight is 5× the observed-cell weight, so the model is explicitly graded on its repair quality rather than its ability to denoise observed quotes. The calendar weight is set to 80, which is high enough to drive raw calendar-violation rates into single digits at evaluation time but not so high that it dominates the reconstruction loss during training. The smoothness term is small (0.05 weight) and exists mainly to discourage high-frequency artefacts at the wings.

Training runs for 60 epochs in a single process with AdamW (lr 1e-3, wd 1e-4) and a cosine annealing schedule. The first 10 epochs run pure MSE; the NLL term turns on afterwards to give the variance head an MSE-stabilised mean to train against. A 200-surface validation set is used for best-checkpoint selection by missing-point RMSE in vol units. Batch size 128, gradient clipping at norm 1.0.
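In skeleton form, with `loss_fn` standing in for `repair_loss` above (a sketch of the schedule and clipping, not the script's exact loop):

```python
import torch

def train(model, loader, loss_fn, epochs=60, warmup=10):
    """AdamW (lr 1e-3, wd 1e-4) with cosine annealing. The first
    `warmup` epochs run pure MSE; the NLL term switches on afterwards.
    Gradients are clipped at norm 1.0. Returns the last batch loss."""
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    last = None
    for epoch in range(epochs):
        use_nll = epoch >= warmup
        for x, target, mask in loader:
            mean, log_var = model(x)
            loss = loss_fn(mean, log_var, target, mask, use_nll)
            opt.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            opt.step()
            last = float(loss)
        sched.step()
    return last
```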

I will say a word about training duration because the result is more sensitive to it than I would like. With only 8 training epochs, the model’s missing-point RMSE on shifted SSVI / regular missing is 0.049 in vol units; at 60 epochs it is 0.018. The convergence is slow and the validation curve is still improving slightly at epoch 60. Sixty epochs is therefore a deliberate choice rather than a generous one — at that point the validation curve has flattened enough that further training mostly trades off in-distribution refinement against out-of-distribution generalisation, with no clear winner.

The 2 × 2 result

Headline numbers: missing-point RMSE in implied-vol units, mean ± standard error across 200 test surfaces per cell. Bold marks the best estimator per row, with ties (within one SE) bolded together.

| Case | Obs % | CNN | Interp | SVI |
| --- | --- | --- | --- | --- |
| Shifted SSVI / regular | 50.5% | 0.0184 ± 0.0010 | **0.0131 ± 0.0008** | 0.0191 ± 0.0048 |
| Shifted SSVI / adversarial | 18.1% | **0.0527 ± 0.0024** | **0.0506 ± 0.0026** | 0.0540 ± 0.0123 |
| SABR-event / regular | 50.4% | 0.0671 ± 0.0013 | 0.0248 ± 0.0011 | **0.0189 ± 0.0021** |
| SABR-event / adversarial | 17.7% | 0.0960 ± 0.0011 | **0.0679 ± 0.0019** | 0.0784 ± 0.0089 |
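For concreteness, the headline metric in code — a sketch assuming (n, N_T, N_K) arrays of true and repaired total variance, the observation mask, and the (N_T, N_K) maturity grid `TT`:

```python
import numpy as np

def missing_rmse_vol(w_true, w_hat, mask, TT):
    """Per-surface RMSE over missing cells, converted to implied-vol
    units, plus the mean and standard error across surfaces."""
    vt = np.sqrt(np.maximum(w_true / TT, 1e-8))
    vh = np.sqrt(np.maximum(w_hat / TT, 1e-8))
    miss = mask < 0.5
    per = np.array([
        np.sqrt(np.mean((vt[i][miss[i]] - vh[i][miss[i]]) ** 2))
        for i in range(len(w_true))
    ])
    return per.mean(), per.std(ddof=1) / np.sqrt(len(per))
```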

Four cells, four different stories.

Shifted SSVI / regular. Calendar-projected linear interpolation wins outright: 0.013 versus the CNN’s 0.018. The surfaces here are smooth, the parameter shifts from the training distribution are modest, and roughly half the grid is observed. There is little for a learned prior to add: the local data is dense enough that triangulated interpolation captures essentially all the recoverable structure. The CNN is 40% worse, well outside one SE.

Shifted SSVI / adversarial. The CNN and interpolation are statistically tied (0.053 vs 0.051, within one SE of each other). With observation density at 18% and large contiguous holes, neither method has a clean run, but the CNN’s learned prior on smile shape brings it back into the same neighbourhood as interpolation. The SVI fit is also competitive here, although noisier across surfaces because individual maturity slices sometimes have too few quotes to fit reliably.

SABR-event / regular. SVI wins narrowly (0.019 vs interpolation’s 0.025), the CNN comes in third at 0.067. This is the cell that distinguishes baselines: SVI fits the local smile structure correctly slice-by-slice and pays no penalty for the SABR family being out-of-distribution because it is a per-slice model with no cross-slice prior to mislead it. The CNN, trained only on SSVI surfaces, has learned a prior that does not transfer cleanly to the asymmetric-wings, event-kink SABR family. It is 2.5× worse than SVI here.

SABR-event / adversarial. Interpolation wins (0.068 vs the CNN’s 0.096), with SVI in the middle at 0.078 and noisy because slice-level data is too sparse to fit consistently. The dominant error source for the CNN here is generalisation, not data scarcity. Even with adversarial missingness — exactly the case where one might hope a learned prior contributes most — the OOD penalty dominates.

The pattern across cells is consistent. The CNN is competitive only where its learned prior matches the test distribution and local methods are operating at their weakest. It loses materially when either of those conditions fails. Calendar-projected linear interpolation is the most consistent baseline of the three: it is the best estimator in two cells, statistically tied for best in a third, and the second-best in the fourth.

The diagnostic figure below shows a single SABR-event adversarial-missingness example. The observed input has lost a substantial chunk of the wings, the entire long-maturity tail, and the front-maturity strip; what remains is a thin ATM spine and a handful of scattered anchors. The CNN repair is smooth and plausible, with errors concentrated at the front-maturity wings — exactly where the input is most aggressively masked. The uncertainty head correctly flags that region as high-uncertainty. Calendar-projected interpolation produces the characteristic “shelf” artefact at the maturities where the cumulative-max projection has had to adjust the raw output.

Diagnostics

A repair model that minimises missing-point RMSE while producing arbitrageable surfaces and miscalibrated uncertainty is not a usable estimator. The diagnostics below are not the headline; they are what you have to report for the headline number to carry any weight.

| Case | raw cal % | post-proj cal % | g(k)<0 % | cov80 | cov95 | err–sd corr | stale AUC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SSVI / regular | 2.65 | 0.00 | 5.32 | 0.69 | 0.85 | 0.69 | 0.66 |
| SSVI / adversarial | 9.75 | 0.00 | 7.71 | 0.47 | 0.63 | 0.64 | 0.64 |
| SABR / regular | 2.19 | 0.00 | 8.73 | 0.30 | 0.43 | 0.76 | 0.55 |
| SABR / adversarial | 11.05 | 0.00 | 11.08 | 0.28 | 0.41 | 0.86 | 0.55 |

The honest summary, line by line:

Calendar arbitrage. The raw CNN output violates calendar monotonicity in 2–11% of (k, T) edges across the four cells. The cumulative-maximum projection drives this to zero everywhere. The calendar-projection step is therefore doing real work, and the model should not be deployed without it. The training-time calendar penalty is doing partial work — without it, raw violation rates would be substantially higher — but it is not on its own sufficient to produce calendar-monotone output reliably.

Butterfly arbitrage. Even after calendar projection, 5–11% of grid points exhibit g(k) < 0 under the discrete Gatheral–Roper diagnostic, with the worst rates on the SABR cells where the CNN is least confident. The smoothness penalty in the loss does not buy enough convexity to fix this. A real butterfly-arbitrage projection — one that actually projects onto the no-arbitrage manifold along k rather than just regularising — would be the right next step. I have not done it here, and the post-projection g(k)<0 rate is the most concerning single number in this set of diagnostics.
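The diagnostic itself, in finite-difference form — a sketch of the Gatheral–Roper condition on one maturity slice (the script's discrete version may differ in boundary handling):

```python
import numpy as np

def butterfly_g(w_slice, k):
    """Discrete Gatheral-Roper butterfly diagnostic for a total-variance
    slice w(k); g(k) < 0 flags butterfly arbitrage.
    g = (1 - k w'/(2w))^2 - (w'^2 / 4)(1/w + 1/4) + w''/2,
    with derivatives taken by central finite differences."""
    wp = np.gradient(w_slice, k)
    wpp = np.gradient(wp, k)
    return (1.0 - k * wp / (2.0 * w_slice)) ** 2 \
        - 0.25 * wp ** 2 * (1.0 / w_slice + 0.25) + 0.5 * wpp
```

A flat slice (constant total variance) is arbitrage-free and gives g ≡ 1, which makes a convenient sanity check.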

Uncertainty calibration. The heteroscedastic head undercovers. Nominal 80% intervals deliver 28–69% empirical coverage; nominal 95% intervals deliver 41–85%. The error-versus-predicted-sd correlation is positive everywhere (0.64–0.86), so the model is at least directionally aware of where its output is unreliable, but it is overconfident about how unreliable it is — particularly on the OOD SABR cells, where the variance head has nothing to recalibrate against. This is the standard limit of in-training Gaussian NLL: under distribution shift, the variance head is as miscalibrated as the mean head, in the same direction. A held-out conformal step is the obvious fix and would be the cheapest single change to make the uncertainty channel operationally useful.
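The cheapest version of that fix — split conformal scaling of the predicted sd — looks like this (a sketch; `residuals` and `sds` would come from a held-out calibration set, not the training data):

```python
import numpy as np

def conformal_scale(residuals, sds, level=0.80):
    """Return a multiplier s such that |err| <= s * sd holds for at
    least `level` of the calibration set; applying s to future
    predicted sds restores (finite-sample) nominal coverage."""
    scores = np.abs(residuals) / sds
    n = len(scores)
    # finite-sample-adjusted quantile rank for split conformal
    q = min(1.0, np.ceil((n + 1) * level) / n)
    return np.quantile(scores, q)
```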

Stale-quote AUC. A synthetic test: inject stale errors into 8% of observed quotes, run the model on the stale-injected surface, and compute the AUC of the residual |obs – \mu| as a stale-quote score. Numbers come in at 0.55–0.66. Better than chance, but weak — particularly on the OOD SABR cells where the AUC sits just above 0.55. This says the model’s residual is not, on its own, a strong stale-quote detector. A more useful operational stale-detection pipeline would combine the model residual with quote-time and quote-source signals, and the model is contributing a useful but limited fraction of the discriminative signal.
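The scoring step is a one-liner on top of the model residual; a sketch with hypothetical `obs`, `mu`, and injected-stale labels:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def stale_auc(obs, mu, stale_labels):
    """AUC of |obs - mu| as a score for detecting stale quotes, where
    stale_labels marks the quotes that were injected as stale."""
    return roc_auc_score(stale_labels, np.abs(obs - mu))
```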

Downstream SVI projection

Surface repair is a means to an end. What the calibration desk usually wants is a smooth, arbitrage-projected SVI surface, not the raw repair output. A repair that is more accurate in the missing-point RMSE sense but less amenable to clean SVI projection might be a worse operational deliverable than the reverse. The right question is not just “how accurate is the repair” but “how good is the SVI fit to the repaired full surface”.

To get at this, I run a per-slice SVI projection on the full repaired surface (CNN or calendar-projected interpolation), then re-score the SVI fit against the held-out missing cells:

| Case | SVI after CNN | SVI after interp |
| --- | --- | --- |
| Shifted SSVI / regular | 0.0171 ± 0.0027 | 0.0159 ± 0.0071 |
| Shifted SSVI / adversarial | 0.0401 ± 0.0080 | 0.0426 ± 0.0113 |
| SABR-event / regular | 0.0697 ± 0.0050 | 0.0295 ± 0.0062 |
| SABR-event / adversarial | 0.0913 ± 0.0057 | 0.0634 ± 0.0065 |

The downstream metric does not change the qualitative ranking: CNN-then-SVI narrowly beats interp-then-SVI on the cell where the headline RMSE was already tied (SSVI / adversarial), and loses everywhere else. The CNN is not producing surfaces that are pathologically uncooperative under SVI projection — the SVI residuals follow the missing-point RMSE residuals reasonably faithfully. That is mildly reassuring from a deployment perspective: the choice between estimators is not being secretly arbitraged away by the projection step.

What this experiment shows and does not show

A small mask-conditional CNN, trained on 1600 synthetic SSVI surfaces under explicit calendar and smoothness penalties, with 200 validation surfaces for checkpoint selection, can repair sparse and noisy total-variance surfaces under a tight enough discipline that it:

  • produces calendar-monotone output after a cumulative-maximum post-projection (which both baselines also need);
  • matches calendar-projected linear interpolation, within statistical noise, on adversarial in-distribution missingness;
  • loses to interpolation by roughly 40% on benign in-distribution missingness, where the local-data density is high enough that triangulated interpolation captures essentially all the recoverable structure;
  • loses by a factor of 1.4–2.7 on out-of-distribution SABR-style smiles, depending on observation density;
  • carries a heteroscedastic uncertainty estimate whose direction is right (positive correlation with error) but whose magnitude is undercalibrated, particularly under distribution shift.

What this experiment does not show, and what I want to be plain about:

It does not show that this kind of CNN-based repair is useful on real data. The synthetic surfaces have no calibration drift, no quote-time-of-day noise, no microstructure asymmetries, no realistic smile dynamics, no hard-to-fit weeklies or single-name idiosyncrasies. The repair task here is pristine compared to anything one would do on production market data. Whether the small relative gap between CNN and interpolation on the adversarial cell survives a real-data test is an open question that this experiment cannot answer.

It does not show that a CNN is the right architecture for this task. A four-layer 32-channel CNN on a 13 × 17 grid is the smallest interesting model for this problem; a U-Net, an attention-conditioned masked decoder, or a conditional VAE all have published precedents in the volatility-repair literature and would be plausible candidates for materially better performance. The choice here was deliberate — keep the model small and the comparison clean — but it is not the architecture I would deploy if I were trying to win the headline number.

It does not show that the SABR-event family is the right test for OOD generalisation. The CNN is being asked to handle smiles whose convexity term-structure and wing asymmetry have a different functional form from anything in its training set. That is a hard test by design, and the gap to SVI on SABR / regular says that what the CNN has learned is closer to “the SSVI smile family” than “smile structure in general”. A more useful experiment would mix multiple smile families during training and re-test on a held-out one.

It does not, on its own, justify a production system. Before this estimator went anywhere near a market-making book it would need a real-data study, downstream P&L attribution, a much more serious calibration of the uncertainty head, a butterfly-projection step, and a comparison against more competitive learned baselines.

Where to take this next

Roughly in priority order:

  1. Real-data replication. Run the same 2 × 2 on an index-options panel across a year — in-sample dates against out-of-sample dates, regular trading days against unusually sparse ones — and see whether the conditional pattern survives. This is the single biggest credibility step. Everything in this note is conditional on the synthetic setup being a reasonable proxy for production data, and that conditioning is not free.
  2. No-arbitrage projection. Add a full butterfly-arbitrage projection alongside the calendar cumulative-max, and report whether forcing the CNN onto the no-arbitrage manifold during evaluation changes the ranking. The 5–11% post-projection g(k)<0 rate is the most uncomfortable number in the diagnostics.
  3. Calibrated uncertainty. Replace the in-training heteroscedastic NLL with a conformal wrapper trained on a held-out residual set, or with a deep ensemble across seeds. The current undercoverage on OOD cells is bad enough that the uncertainty channel is more decorative than useful.
  4. Mixture training. Train on a mixture of smile families (SSVI plus SABR-event plus a Heston-like family) and re-test on a held-out family. The SABR loss is dominantly a generalisation failure, and the cheapest single fix is to broaden the training distribution.
  5. Generative baseline. Compare a small VAE on the same harness, conditioned on the missingness pattern, as the published baseline for learned vol-surface repair. The conditional-on-mask deterministic CNN here is probably not the right architecture in the limit.

None of these requires a deep architectural rethink. They are mostly questions of where the experiment runs and what it gets compared against.

Code and references

The full script (vol_surface_repair.py) is self-contained, CPU-friendly, and reproducible: it generates the training and test data, trains the CNN, evaluates against both baselines, computes the diagnostics, runs the downstream SVI projection, and writes the figure and the numeric results to disk. Run with python vol_surface_repair.py. Approximately four minutes on CPU.

Selected references for context (full bibliographic details should be checked against the published versions):

  • Gatheral, J. The Volatility Surface: A Practitioner’s Guide. Wiley, 2006.
  • Gatheral, J. and Jacquier, A. Arbitrage-free SVI volatility surfaces. Quantitative Finance, 2014.
  • Roper, M. Arbitrage free implied volatility surfaces. Working paper, 2010.
  • Bergeron, M., Fung, N., Hull, J., Poulos, Z., and Veneris, A. Variational autoencoders: a hands-off approach to volatility. Journal of Financial Data Science, 2022.
  • Ning, B., Jaimungal, S., Zhang, X., and Bergeron, M. Arbitrage-free implied volatility surface generation with variational autoencoders. arXiv:2108.04941.
  • Cont, R. and Vuletic, M. Simulation of arbitrage-free implied volatility surfaces. Applied Mathematical Finance, 2023.
"""
Deep Learning for Volatility Surface Repair.

Self-contained, CPU-friendly PyTorch script that trains a small mask-conditional
CNN on synthetic SSVI total-variance surfaces and evaluates it against
calendar-projected linear interpolation and a per-slice SVI fit on a 2x2 design:

    {shifted SSVI, SABR-style event} x {regular missingness, adversarial missingness}

Reports missing-point RMSE in implied-vol units with standard errors, calendar
violation rates before and after isotonic projection, butterfly arbitrage
violations, uncertainty calibration coverage, a stale-quote residual AUC, and
a downstream SVI-projection metric.

Run:
    python vol_surface_repair.py

Dependencies: numpy, scipy, scikit-learn, matplotlib, torch.
"""

from __future__ import annotations

import math
import os
import random
import sys
import warnings
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Tuple

import matplotlib

matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from scipy.interpolate import LinearNDInterpolator, NearestNDInterpolator
from sklearn.metrics import roc_auc_score

warnings.filterwarnings("ignore", category=RuntimeWarning)

# -----------------------------
# Reproducibility and settings
# -----------------------------
SEED = 17
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.set_num_threads(2)
try:
    torch.set_num_interop_threads(2)
except RuntimeError:
    pass

OUT_DIR = Path(os.environ.get("OUT_DIR", "./out"))
OUT_DIR.mkdir(parents=True, exist_ok=True)
FIG_PATH = OUT_DIR / "vol_surface_repair_diagnostic.png"
SVI_EVAL_SURFACES = 10  # CNN/interp use all test surfaces; SVI-grid/projection use first 10 for runtime

DEVICE = torch.device("cpu")
DTYPE = torch.float32

# Grid: maturities and log-moneyness.
N_T = 13
N_K = 17
T_GRID = np.linspace(0.08, 2.0, N_T).astype(np.float64)
K_GRID = np.linspace(-0.45, 0.45, N_K).astype(np.float64)
TT, KK = np.meshgrid(T_GRID, K_GRID, indexing="ij")
T_NORM = ((TT - TT.min()) / (TT.max() - TT.min()) * 2.0 - 1.0).astype(np.float32)
K_NORM = ((KK - KK.min()) / (KK.max() - KK.min()) * 2.0 - 1.0).astype(np.float32)

# Total-variance normalisation scale for stable CNN training. The model targets
# w / W_SCALE; outputs are converted back to total variance and then to implied
# vol for evaluation. Vol-space and w-space training give essentially identical
# missing-point RMSE under this pipeline, so w-space is preferred for the cleaner
# arbitrage-diagnostic geometry.
W_SCALE = 0.08

# Coarse raw-SVI shape library for fast per-slice benchmarking.
def _make_svi_shape_library() -> np.ndarray:
    rhos = np.linspace(-0.90, 0.40, 9)
    ms = np.linspace(-0.22, 0.22, 7)
    sigmas = np.array([0.04, 0.07, 0.11, 0.17, 0.26, 0.40])
    shapes = []
    for rho in rhos:
        for m in ms:
            for sig in sigmas:
                x = K_GRID - m
                f = rho * x + np.sqrt(x * x + sig * sig)
                shapes.append(f)
    return np.asarray(shapes, dtype=np.float64)


SVI_SHAPES = _make_svi_shape_library()


# -----------------------------
# Surface generators
# -----------------------------

def ssvi_total_variance(n: int, shifted: bool = False, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    out = np.empty((n, N_T, N_K), dtype=np.float64)
    for i in range(n):
        theta0 = rng.uniform(0.010, 0.032) if not shifted else rng.uniform(0.006, 0.045)
        theta_slope = rng.uniform(0.020, 0.070) if not shifted else rng.uniform(0.012, 0.095)
        theta_curve = rng.uniform(-0.006, 0.010) if not shifted else rng.uniform(-0.015, 0.020)
        rho = rng.uniform(-0.72, -0.18) if not shifted else rng.uniform(-0.88, -0.05)
        eta = rng.uniform(0.55, 1.55) if not shifted else rng.uniform(0.35, 2.10)
        gamma = rng.uniform(0.18, 0.62) if not shifted else rng.uniform(0.08, 0.78)
        t_scaled = T_GRID / T_GRID.max()
        theta = theta0 + theta_slope * t_scaled + theta_curve * t_scaled**2
        theta = np.maximum.accumulate(np.maximum(theta, 0.004))
        phi = eta / np.maximum(theta, 1e-4) ** gamma
        for j, th in enumerate(theta):
            ph = phi[j]
            x = ph * K_GRID + rho
            out[i, j, :] = 0.5 * th * (1.0 + rho * ph * K_GRID + np.sqrt(x * x + 1.0 - rho * rho))
        if shifted and rng.random() < 0.45:
            event_T = rng.choice(np.arange(2, N_T - 2))
            bump = rng.uniform(0.0015, 0.0045) * np.exp(-0.5 * ((np.arange(N_T) - event_T) / 0.8) ** 2)
            out[i] += bump[:, None] * (1.0 + 0.4 * np.tanh(4.0 * K_GRID))[None, :]
        out[i] = np.maximum(out[i], 1e-4)
    return out.astype(np.float32)


def sabr_event_total_variance(n: int, seed: int = 1) -> np.ndarray:
    rng = np.random.default_rng(seed)
    out = np.empty((n, N_T, N_K), dtype=np.float64)
    for i in range(n):
        base = rng.uniform(0.13, 0.29)
        decay = rng.uniform(0.03, 0.12)
        long = rng.uniform(0.05, 0.13)
        skew0 = rng.uniform(-0.52, -0.10)
        skew_decay = rng.uniform(0.1, 1.2)
        convex0 = rng.uniform(0.25, 0.95)
        wing_asym = rng.uniform(-0.12, 0.18)
        event_idx = rng.choice(np.arange(1, N_T - 2)) if rng.random() < 0.75 else None
        event_amp = rng.uniform(0.015, 0.055) if event_idx is not None else 0.0
        for j, T in enumerate(T_GRID):
            atm = long + base * np.exp(-decay * 4.0 * T) + rng.normal(0, 0.002)
            skew = skew0 * np.exp(-skew_decay * T) + rng.normal(0, 0.025)
            convex = convex0 / np.sqrt(T + 0.30) + rng.normal(0, 0.03)
            event = 0.0
            if event_idx is not None:
                event = event_amp * np.exp(-0.5 * ((j - event_idx) / 0.65) ** 2)
            vol = atm + event + skew * K_GRID + convex * K_GRID**2 + wing_asym * np.maximum(K_GRID, 0) ** 3
            if event_idx is not None and abs(j - event_idx) <= 1:
                kink_loc = rng.uniform(-0.12, 0.12)
                vol += rng.uniform(0.010, 0.030) * np.maximum(0.0, 1.0 - np.abs(K_GRID - kink_loc) / 0.12)
            vol = np.clip(vol, 0.04, 1.20)
            out[i, j, :] = vol * vol * T
        out[i] = np.maximum.accumulate(out[i], axis=0)
        out[i] = np.maximum(out[i], 1e-4)
    return out.astype(np.float32)


# -----------------------------
# Missingness and noise
# -----------------------------

def make_mask(kind: str, n: int, seed: int) -> np.ndarray:
    rng = np.random.default_rng(seed)
    masks = np.zeros((n, N_T, N_K), dtype=np.float32)
    center = N_K // 2
    for i in range(n):
        if kind == "regular":
            p = rng.uniform(0.34, 0.46)
            m = (rng.random((N_T, N_K)) < p).astype(np.float32)
            atm_band = slice(center - 1, center + 2)
            m[:, atm_band] = np.maximum(m[:, atm_band], (rng.random((N_T, 3)) < 0.74).astype(np.float32))
            for row in rng.choice(N_T, size=rng.integers(1, 4), replace=False):
                cols = rng.choice(N_K, size=rng.integers(4, 8), replace=False)
                m[row, cols] = 1.0
            for col in rng.choice(N_K, size=rng.integers(1, 3), replace=False):
                rows = rng.choice(N_T, size=rng.integers(4, 9), replace=False)
                m[rows, col] = 1.0
            if rng.random() < 0.55:
                wing = slice(0, rng.integers(2, 5)) if rng.random() < 0.5 else slice(rng.integers(N_K - 5, N_K - 2), N_K)
                rows = rng.choice(N_T, size=rng.integers(3, 8), replace=False)
                m[rows, wing] = 0.0
        elif kind == "adversarial":
            p = rng.uniform(0.16, 0.25)
            m = (rng.random((N_T, N_K)) < p).astype(np.float32)
            atm_keep = rng.random(N_T) < rng.uniform(0.55, 0.82)
            m[atm_keep, center] = 1.0
            near = rng.choice([center - 2, center - 1, center + 1, center + 2], size=rng.integers(1, 3), replace=False)
            for col in near:
                rows = rng.choice(N_T, size=rng.integers(3, 7), replace=False)
                m[rows, col] = 1.0
            if rng.random() < 0.5:
                m[:, : rng.integers(4, 7)] = 0.0
            else:
                m[:, rng.integers(N_K - 7, N_K - 4) :] = 0.0
            if rng.random() < 0.65:
                m[: rng.integers(2, 5), :] = 0.0
            if rng.random() < 0.45:
                m[rng.integers(N_T - 5, N_T - 2) :, :] = 0.0
            for _ in range(rng.integers(6, 12)):
                m[rng.integers(0, N_T), rng.integers(0, N_K)] = 1.0
            m[:, center] = np.maximum(m[:, center], (rng.random(N_T) < 0.35).astype(np.float32))
        else:
            raise ValueError(f"unknown mask kind {kind}")
        if m.sum() < 18:
            flat = rng.choice(N_T * N_K, size=18, replace=False)
            m.flat[flat] = 1.0
        masks[i] = m
    return masks


def corrupt_surfaces_w(w: np.ndarray, mask: np.ndarray, seed: int, vol_noise_bps: float = 18.0) -> np.ndarray:
    """Add realistic quote noise in implied vol space, return masked total variance.

    Returns total variance with unobserved cells set to zero.
    """
    rng = np.random.default_rng(seed)
    vol = np.sqrt(np.maximum(w / TT[None, :, :], 1e-8))
    noise = rng.normal(0.0, vol_noise_bps / 10000.0, size=vol.shape)
    wing_factor = 1.0 + 1.5 * (np.abs(KK)[None, :, :] / np.max(np.abs(K_GRID)))
    front_factor = 1.0 + 0.6 * (T_GRID.max() - TT)[None, :, :] / (T_GRID.max() - T_GRID.min())
    noisy_vol = np.clip(vol + noise * wing_factor * front_factor, 0.03, 2.00)
    noisy_w = (noisy_vol**2) * TT[None, :, :]
    return (noisy_w * mask).astype(np.float32)


def w_to_vol(w: np.ndarray) -> np.ndarray:
    """Convert total variance to implied vol. Handles 2D (T,K) or 3D (n,T,K) input."""
    if w.ndim == 2:
        return np.sqrt(np.maximum(w / TT, 1e-8)).astype(np.float32)
    return np.sqrt(np.maximum(w / TT[None, :, :], 1e-8)).astype(np.float32)


def vol_to_w(vol: np.ndarray) -> np.ndarray:
    """Convert implied vol to total variance. Handles 2D (T,K) or 3D (n,T,K) input."""
    if vol.ndim == 2:
        return (vol * vol * TT).astype(np.float32)
    return (vol * vol * TT[None, :, :]).astype(np.float32)


# -----------------------------
# Model
# -----------------------------

class RepairCNN(nn.Module):
    """Small mask-conditional CNN that predicts normalised total variance.

    Output: w / W_SCALE via softplus head (positivity).
    Log-variance head predicts uncertainty in the same w/W_SCALE space.

    Single-process training for ~60 epochs with preserved AdamW state is
    important: with shorter training, the model materially underperforms even
    classical baselines in this setup.
    """

    def __init__(self, in_channels: int = 4, width: int = 32):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, width, 3, padding=1), nn.SiLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
        )
        self.mean_head = nn.Conv2d(width, 1, 1)
        self.logvar_head = nn.Conv2d(width, 1, 1)
        with torch.no_grad():
            self.logvar_head.bias.fill_(-3.0)

    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        z = self.backbone(x)
        mean = F.softplus(self.mean_head(z)) + 1e-5
        log_var = torch.clamp(self.logvar_head(z), -8.0, 1.5)
        return mean, log_var


def make_inputs(observed_w: np.ndarray, mask: np.ndarray) -> torch.Tensor:
    """Build input tensor.

    Channel 0: observed w / W_SCALE (zero where missing).
    Channel 1: observation mask.
    Channel 2: maturity coordinate.
    Channel 3: log-moneyness coordinate.
    """
    n = observed_w.shape[0]
    coords_t = np.broadcast_to(T_NORM[None, :, :], (n, N_T, N_K))
    coords_k = np.broadcast_to(K_NORM[None, :, :], (n, N_T, N_K))
    x = np.stack([observed_w / W_SCALE, mask, coords_t, coords_k], axis=1).astype(np.float32)
    return torch.tensor(x, dtype=DTYPE, device=DEVICE)


# -----------------------------
# Loss
# -----------------------------

def calendar_penalty_w(mean_norm: torch.Tensor) -> torch.Tensor:
    """Calendar arbitrage penalty: total variance must be non-decreasing in T.

    Operates on the actual w (not normalised) for correct scaling of the penalty
    relative to the MSE term in normalised space.
    """
    w = mean_norm * W_SCALE
    dw = w[:, :, 1:, :] - w[:, :, :-1, :]
    return torch.mean(F.relu(-dw))


def smoothness_penalty(mean_norm: torch.Tensor) -> torch.Tensor:
    """Smoothness regulariser on normalised w (not a butterfly proxy)."""
    d2k = mean_norm[:, :, :, 2:] - 2.0 * mean_norm[:, :, :, 1:-1] + mean_norm[:, :, :, :-2]
    d2t = mean_norm[:, :, 2:, :] - 2.0 * mean_norm[:, :, 1:-1, :] + mean_norm[:, :, :-2, :]
    return torch.mean(d2k * d2k) + torch.mean(d2t * d2t)


@dataclass
class LossConfig:
    missing_weight: float = 5.0
    calendar_weight: float = 80.0
    smoothness_weight: float = 0.05
    nll_weight: float = 0.10


def repair_loss(
    mean_norm: torch.Tensor,
    log_var: torch.Tensor,
    target_norm: torch.Tensor,
    mask: torch.Tensor,
    cfg: LossConfig,
    use_nll: bool,
) -> torch.Tensor:
    """Loss in w/W_SCALE space."""
    weights = 1.0 + (cfg.missing_weight - 1.0) * (1.0 - mask)
    sq = (mean_norm - target_norm) ** 2
    mse = torch.mean(weights * sq)
    loss = mse
    if use_nll and cfg.nll_weight:
        inv_var = torch.exp(-log_var)
        nll = 0.5 * torch.mean(weights * (sq * inv_var + log_var))
        loss = loss + cfg.nll_weight * nll
    if cfg.calendar_weight:
        loss = loss + cfg.calendar_weight * calendar_penalty_w(mean_norm)
    if cfg.smoothness_weight:
        loss = loss + cfg.smoothness_weight * smoothness_penalty(mean_norm)
    return loss


# -----------------------------
# Training
# -----------------------------

def train_model(
    train_w: np.ndarray,
    train_mask: np.ndarray,
    val_w: np.ndarray,
    val_mask: np.ndarray,
    cfg: LossConfig,
    epochs: int = 60,
    seed: int = SEED,
    verbose: bool = True,
) -> RepairCNN:
    """Train the CNN in a single process with optimiser state preserved.

    Targets and predictions are in w/W_SCALE space. Validation RMSE is reported
    in implied-vol units for interpretability and best-checkpoint selection.
    The first 10 epochs run pure MSE; the heteroscedastic NLL term turns on
    afterwards to avoid early variance head instabilities.
    """
    torch.manual_seed(seed)

    obs_w = corrupt_surfaces_w(train_w, train_mask, seed=seed + 101)
    x_train = make_inputs(obs_w, train_mask)
    y_train = torch.tensor((train_w / W_SCALE)[:, None, :, :], dtype=DTYPE, device=DEVICE)
    m_train = torch.tensor(train_mask[:, None, :, :], dtype=DTYPE, device=DEVICE)

    obs_w_val = corrupt_surfaces_w(val_w, val_mask, seed=seed + 202)
    x_val = make_inputs(obs_w_val, val_mask)
    y_val_vol = w_to_vol(val_w)

    model = RepairCNN().to(DEVICE)
    opt = torch.optim.AdamW(model.parameters(), lr=1.0e-3, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)

    batch_size = 128
    n = train_w.shape[0]
    idx = np.arange(n)

    best_val = float("inf")
    best_state: Dict[str, torch.Tensor] | None = None

    for ep in range(epochs):
        np.random.shuffle(idx)
        model.train()
        running = 0.0
        n_batches = 0
        use_nll = ep >= 10
        for start in range(0, n, batch_size):
            batch = idx[start : start + batch_size]
            mean_norm, logv = model(x_train[batch])
            loss = repair_loss(mean_norm, logv, y_train[batch], m_train[batch], cfg, use_nll=use_nll)
            opt.zero_grad(set_to_none=True)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            opt.step()
            running += float(loss.detach())
            n_batches += 1
        scheduler.step()
        train_loss = running / max(n_batches, 1)

        model.eval()
        with torch.no_grad():
            mean_val_norm, _ = model(x_val)
            mu_w_val = (mean_val_norm[:, 0].cpu().numpy() * W_SCALE).astype(np.float32)
        mu_v_val = w_to_vol(mu_w_val)
        miss = val_mask < 0.5
        if miss.sum() > 0:
            val_rmse = float(np.sqrt(np.mean((mu_v_val[miss] - y_val_vol[miss]) ** 2)))
        else:
            val_rmse = float("nan")

        if val_rmse < best_val:
            best_val = val_rmse
            best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}

        if verbose and (ep < 5 or ep % 10 == 9 or ep == epochs - 1):
            print(f"  epoch {ep+1:02d}/{epochs}  train_loss {train_loss:.5f}  val_RMSE_vol {val_rmse:.5f}  best {best_val:.5f}", flush=True)

    if best_state is not None:
        model.load_state_dict(best_state)
    model.eval()
    return model


# -----------------------------
# Baselines
# -----------------------------

def interpolate_surface(obs: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Linear interpolation in (T, k) on total variance, with nearest fallback."""
    points = np.column_stack([TT[mask > 0.5].ravel(), KK[mask > 0.5].ravel()])
    values = obs[mask > 0.5].ravel()
    full_points = np.column_stack([TT.ravel(), KK.ravel()])
    if len(values) < 4:
        fill = np.nanmean(values) if len(values) else 0.02
        return np.full((N_T, N_K), fill, dtype=np.float32)
    lin = LinearNDInterpolator(points, values, fill_value=np.nan)
    pred = lin(full_points).reshape(N_T, N_K)
    nan_mask = np.isnan(pred)
    if nan_mask.any():
        near = NearestNDInterpolator(points, values)
        pred[nan_mask] = near(full_points).reshape(N_T, N_K)[nan_mask]
    return np.maximum(pred, 1e-5).astype(np.float32)


def fit_svi_slice(k_obs: np.ndarray, w_obs: np.ndarray) -> np.ndarray:
    # Map each observed log-moneyness to its nearest grid column.
    idx = np.zeros(N_K, dtype=bool)
    for ko in k_obs:
        idx[np.argmin(np.abs(K_GRID - ko))] = True
    y = np.asarray(w_obs, dtype=np.float64)
    good = np.isfinite(y) & (y > 0)
    y = y[good]
    # If every observation landed on a distinct column, keep only the columns
    # whose quotes survived the finiteness filter; otherwise fall back to the
    # first len(y) occupied columns (defensive; on-grid data cannot misalign).
    obs_idx = np.where(idx)[0][good] if idx.sum() == len(good) else np.where(idx)[0]
    if len(y) != len(obs_idx):
        obs_idx = np.where(idx)[0][: len(y)]
    if len(y) < 3:
        fill = float(np.nanmean(y)) if len(y) else 0.02
        return np.full(N_K, fill, dtype=np.float64)
    Fobs = SVI_SHAPES[:, obs_idx]
    f_mean = Fobs.mean(axis=1)
    y_mean = y.mean()
    f_center = Fobs - f_mean[:, None]
    y_center = y - y_mean
    denom = np.sum(f_center * f_center, axis=1) + 1e-10
    b = np.sum(f_center * y_center[None, :], axis=1) / denom
    b = np.clip(b, 1e-6, 10.0)
    a = y_mean - b * f_mean
    a = np.clip(a, 1e-6, 5.0)
    pred_obs = a[:, None] + b[:, None] * Fobs
    sse = np.mean((pred_obs - y[None, :]) ** 2, axis=1)
    best = int(np.argmin(sse))
    pred = a[best] + b[best] * SVI_SHAPES[best]
    return np.maximum(pred, 1e-5)


def svi_fit_surface(obs: np.ndarray, mask: np.ndarray, enforce_calendar: bool = True) -> np.ndarray:
    pred = np.empty((N_T, N_K), dtype=np.float64)
    for j in range(N_T):
        idx = mask[j] > 0.5
        pred[j] = fit_svi_slice(K_GRID[idx], obs[j, idx])
    pred = np.maximum(pred, 1e-5)
    if enforce_calendar:
        pred = calendar_project(pred)
    return pred.astype(np.float32)


# -----------------------------
# Projections, metrics, evaluation
# -----------------------------

def calendar_project(w: np.ndarray) -> np.ndarray:
    """Strict isotonic projection along maturity via cumulative maximum."""
    return np.maximum.accumulate(np.maximum(w, 1e-5), axis=-2 if w.ndim == 3 else 0)


def calendar_violation_rate(w: np.ndarray) -> float:
    dw = np.diff(w, axis=-2 if w.ndim == 3 else 0)
    return float(np.mean(dw < -1e-8))


def butterfly_g_violation_rate(w: np.ndarray) -> float:
    """Discrete Gatheral-Roper g(k) >= 0 diagnostic on total variance surfaces."""
    dk = K_GRID[1] - K_GRID[0]
    wi = np.maximum(w[..., 1:-1], 1e-6)
    wp = (w[..., 2:] - w[..., :-2]) / (2.0 * dk)
    wpp = (w[..., 2:] - 2.0 * w[..., 1:-1] + w[..., :-2]) / (dk * dk)
    k = K_GRID[1:-1]  # shape (N_K - 2,); broadcasts against the trailing strike axis
    g = (1.0 - k * wp / (2.0 * wi)) ** 2 - (wp * wp / 4.0) * (1.0 / wi + 0.25) + 0.5 * wpp
    return float(np.mean(g < -1e-6))


def rmse_missing_w(pred_w: np.ndarray, target_w: np.ndarray, mask: np.ndarray) -> Tuple[float, float]:
    """Per-surface RMSE in implied vol units on missing cells, returned as (mean, SE)."""
    pred_vol = w_to_vol(pred_w)
    target_vol = w_to_vol(target_w)
    per = []
    for i in range(target_w.shape[0]):
        miss = mask[i] < 0.5
        if miss.sum() == 0:
            continue
        per.append(math.sqrt(float(np.mean((pred_vol[i][miss] - target_vol[i][miss]) ** 2))))
    arr = np.array(per)
    return float(arr.mean()), float(arr.std(ddof=1) / math.sqrt(len(arr)))


def predict_model(
    model: RepairCNN, true_w: np.ndarray, mask: np.ndarray, seed: int
) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Run model on noisy/sparse surfaces. Returns (mu_w, sd_vol, observed_w).

    The model's mean output is in w/W_SCALE space; we convert to actual w and
    derive an approximate vol-space standard deviation via delta-method
    sd_vol ~ sd_w / (2 * vol * T) where vol = sqrt(w / T).
    """
    obs_w = corrupt_surfaces_w(true_w, mask, seed=seed)
    x = make_inputs(obs_w, mask)
    model.eval()
    with torch.no_grad():
        mean_norm, logv = model(x)
    mu_w = (mean_norm[:, 0].cpu().numpy() * W_SCALE).astype(np.float32)
    sd_w = (np.sqrt(np.exp(logv[:, 0].cpu().numpy())) * W_SCALE).astype(np.float32)
    # Delta-method conversion to vol-space sd.
    vol_pred = w_to_vol(mu_w)
    sd_vol = sd_w / (2.0 * np.maximum(vol_pred, 1e-3) * TT[None, :, :])
    return mu_w, sd_vol.astype(np.float32), obs_w


def predict_model_from_obs(model: RepairCNN, obs_w: np.ndarray, mask: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """Run the model on a pre-corrupted observed_w (used by stale-quote AUC)."""
    x = make_inputs(obs_w, mask)
    model.eval()
    with torch.no_grad():
        mean_norm, logv = model(x)
    mu_w = (mean_norm[:, 0].cpu().numpy() * W_SCALE).astype(np.float32)
    sd_w = (np.sqrt(np.exp(logv[:, 0].cpu().numpy())) * W_SCALE).astype(np.float32)
    return mu_w, sd_w


def uncertainty_coverage(mu_w: np.ndarray, sd_vol: np.ndarray, target_w: np.ndarray, mask: np.ndarray) -> Dict[str, float]:
    """Coverage of nominal Gaussian intervals, evaluated in vol units."""
    mu_vol = w_to_vol(mu_w)
    target_vol = w_to_vol(target_w)
    miss = mask < 0.5
    err = np.abs(mu_vol[miss] - target_vol[miss])
    s = np.maximum(sd_vol[miss], 1e-6)
    cov80 = float(np.mean(err <= 1.28155 * s))
    cov95 = float(np.mean(err <= 1.95996 * s))
    if err.size > 5:
        corr = float(np.corrcoef(err.ravel(), s.ravel())[0, 1])
    else:
        corr = float("nan")
    return {"coverage80": cov80, "coverage95": cov95, "err_sd_corr": corr, "avg_sd": float(np.mean(s))}


def stale_quote_auc(model: RepairCNN, true_w: np.ndarray, mask: np.ndarray, seed: int, stale_frac: float = 0.08) -> float:
    """Stale-quote AUC: model sees the stale-injected input.

    1. Generate a noisy observed surface.
    2. Inject stale errors at a random subset of observed cells.
    3. Run the model on the stale-injected surface.
    4. Score residuals at observed cells; AUC of stale label vs |obs - mu|.
    """
    rng = np.random.default_rng(seed)
    noisy_w = corrupt_surfaces_w(true_w, mask, seed=seed + 77)
    labels = np.zeros_like(mask, dtype=np.int32)
    corrupted = noisy_w.copy()
    for i in range(mask.shape[0]):
        obs_idx = np.argwhere(mask[i] > 0.5)
        if len(obs_idx) == 0:
            continue
        n_stale = max(1, int(len(obs_idx) * stale_frac))
        chosen = obs_idx[rng.choice(len(obs_idx), size=n_stale, replace=False)]
        for r, c in chosen:
            labels[i, r, c] = 1
            vol = math.sqrt(max(corrupted[i, r, c] / T_GRID[r], 1e-8))
            vol += rng.choice([-1.0, 1.0]) * rng.uniform(0.0050, 0.0110)
            corrupted[i, r, c] = max(vol, 0.03) ** 2 * T_GRID[r]
    # Run model on the stale-injected input so residuals reflect detection capability.
    mu_w, _ = predict_model_from_obs(model, corrupted, mask)
    scores = np.abs(corrupted - mu_w)
    observed = mask > 0.5
    y = labels[observed].ravel()
    s = scores[observed].ravel()
    if len(np.unique(y)) < 2:
        return float("nan")
    return float(roc_auc_score(y, s))


def svi_projection_metric(pred_w: np.ndarray, target_w: np.ndarray, mask: np.ndarray, max_surfaces: int) -> Tuple[float, float]:
    n = min(max_surfaces, pred_w.shape[0])
    projected = np.empty_like(pred_w[:n])
    full_mask = np.ones((N_T, N_K), dtype=np.float32)
    for i in range(n):
        projected[i] = svi_fit_surface(pred_w[i], full_mask, enforce_calendar=True)
    return rmse_missing_w(projected, target_w[:n], mask[:n])


def evaluate_case(name: str, model: RepairCNN, true_w: np.ndarray, mask: np.ndarray, seed: int) -> Dict[str, object]:
    mu_w_raw, sd_vol, obs_w = predict_model(model, true_w, mask, seed)
    mu_w = calendar_project(mu_w_raw)

    interp_w = np.stack([interpolate_surface(obs_w[i], mask[i]) for i in range(true_w.shape[0])])
    interp_w = calendar_project(interp_w)

    n_svi = min(SVI_EVAL_SURFACES, true_w.shape[0])
    svi_w = np.full_like(interp_w, np.nan)
    for i in range(n_svi):
        svi_w[i] = svi_fit_surface(obs_w[i], mask[i], enforce_calendar=True)

    cnn_m, cnn_se = rmse_missing_w(mu_w, true_w, mask)
    interp_m, interp_se = rmse_missing_w(interp_w, true_w, mask)
    svi_m, svi_se = rmse_missing_w(svi_w[:n_svi], true_w[:n_svi], mask[:n_svi])

    unc = uncertainty_coverage(mu_w, sd_vol, true_w, mask)
    auc = stale_quote_auc(model, true_w, mask, seed=seed + 401)
    svi_after_cnn, svi_after_cnn_se = svi_projection_metric(mu_w, true_w, mask, SVI_EVAL_SURFACES)
    svi_after_interp, svi_after_interp_se = svi_projection_metric(interp_w, true_w, mask, SVI_EVAL_SURFACES)

    return {
        "case": name,
        "observed_fraction": float(mask.mean()),
        "cnn_missing_rmse": cnn_m, "cnn_missing_se": cnn_se,
        "interp_missing_rmse": interp_m, "interp_missing_se": interp_se,
        "svi_missing_rmse": svi_m, "svi_missing_se": svi_se,
        "calendar_raw_rate": calendar_violation_rate(mu_w_raw),
        "calendar_projected_rate": calendar_violation_rate(mu_w),
        "butterfly_g_rate": butterfly_g_violation_rate(mu_w),
        "unc_cov80": unc["coverage80"],
        "unc_cov95": unc["coverage95"],
        "unc_corr": unc["err_sd_corr"],
        "unc_avg_sd": unc["avg_sd"],
        "stale_auc": auc,
        "svi_after_cnn_rmse": svi_after_cnn, "svi_after_cnn_se": svi_after_cnn_se,
        "svi_after_interp_rmse": svi_after_interp, "svi_after_interp_se": svi_after_interp_se,
        "_mu_w": mu_w, "_sd_vol": sd_vol, "_obs_w": obs_w, "_interp_w": interp_w, "_svi_w": svi_w,
    }


def print_results_table(results: List[Dict[str, object]]) -> None:
    print("\n2x2 missing-point RMSE in implied vol units (mean +/- SE)")
    print("-" * 110)
    print(f"{'Case':40s} {'Obs%':>6s}   {'CNN':>16s}   {'Interp':>16s}   {'SVI':>16s}")
    for r in results:
        print(
            f"{r['case']:40s} {100*r['observed_fraction']:5.1f}%  "
            f"{r['cnn_missing_rmse']:.4f} ± {r['cnn_missing_se']:.4f}  "
            f"{r['interp_missing_rmse']:.4f} ± {r['interp_missing_se']:.4f}  "
            f"{r['svi_missing_rmse']:.4f} ± {r['svi_missing_se']:.4f}"
        )
    print("-" * 110)

    print("\nDiagnostics after calendar projection")
    print("-" * 112)
    print(f"{'Case':40s} {'raw cal%':>9s} {'post cal%':>10s} {'g(k)<0%':>9s} {'cov80':>7s} {'cov95':>7s} {'corr':>7s} {'AUC':>7s}")
    for r in results:
        print(
            f"{r['case']:40s} {100*r['calendar_raw_rate']:8.2f}% {100*r['calendar_projected_rate']:9.2f}% "
            f"{100*r['butterfly_g_rate']:8.2f}% {r['unc_cov80']:7.3f} {r['unc_cov95']:7.3f} {r['unc_corr']:7.3f} {r['stale_auc']:7.3f}"
        )
    print("-" * 112)

    print("\nDownstream SVI projection: missing-point RMSE in implied vol units")
    print("-" * 96)
    print(f"{'Case':40s} {'SVI after CNN':>22s}     {'SVI after interp':>22s}")
    for r in results:
        print(
            f"{r['case']:40s} "
            f"{r['svi_after_cnn_rmse']:.4f} ± {r['svi_after_cnn_se']:.4f}      "
            f"{r['svi_after_interp_rmse']:.4f} ± {r['svi_after_interp_se']:.4f}"
        )
    print("-" * 96)


def make_diagnostic_plot(result: Dict[str, object], true_w: np.ndarray, mask: np.ndarray, idx: int = 0) -> None:
    """Diagnostic 2x3 plot in implied vol units."""
    target_vol = w_to_vol(true_w[idx])
    mu_vol = w_to_vol(result["_mu_w"][idx])
    obs_vol = w_to_vol(result["_obs_w"][idx])
    interp_vol = w_to_vol(result["_interp_w"][idx])
    sd_vol = result["_sd_vol"][idx]

    panels = [
        target_vol,
        np.where(mask[idx] > 0.5, obs_vol, np.nan),
        mu_vol,
        interp_vol,
        sd_vol,
        np.abs(mu_vol - target_vol),
    ]
    titles = [
        "Clean implied vol",
        "Observed sparse quotes",
        "CNN repair (vol)",
        "Calendar-projected interp",
        "CNN uncertainty (vol)",
        "Absolute CNN error",
    ]
    fig, axes = plt.subplots(2, 3, figsize=(13, 7), constrained_layout=True)
    vmin = float(np.nanmin(target_vol))
    vmax = float(np.nanmax(target_vol))
    for ax, data, title in zip(axes.ravel(), panels, titles):
        if "uncertainty" in title.lower() or "error" in title.lower():
            im = ax.imshow(data, aspect="auto", origin="lower", extent=[K_GRID.min(), K_GRID.max(), T_GRID.min(), T_GRID.max()])
        else:
            im = ax.imshow(data, aspect="auto", origin="lower", extent=[K_GRID.min(), K_GRID.max(), T_GRID.min(), T_GRID.max()], vmin=vmin, vmax=vmax)
        ax.set_title(title)
        ax.set_xlabel("log-moneyness k")
        ax.set_ylabel("maturity T")
        fig.colorbar(im, ax=ax, fraction=0.046, pad=0.04)
    fig.suptitle(f"Diagnostic: {result['case']}", fontsize=13)
    fig.savefig(FIG_PATH, dpi=150)
    plt.close(fig)


# -----------------------------
# Main
# -----------------------------

def main() -> None:
    n_train = 1600
    n_val = 200
    n_test = 200
    epochs = 60

    print(f"Training surfaces: {n_train}; validation: {n_val}; test/cell: {n_test}; epochs: {epochs}; grid: {N_T}x{N_K}")

    train_w = ssvi_total_variance(n_train, shifted=False, seed=100)
    train_mask = make_mask("regular", n_train, seed=200)
    val_w = ssvi_total_variance(n_val, shifted=False, seed=300)
    val_mask = make_mask("regular", n_val, seed=400)

    cfg = LossConfig(missing_weight=5.0, calendar_weight=80.0, smoothness_weight=0.05, nll_weight=0.10)
    print("Training CNN ...", flush=True)
    model = train_model(train_w, train_mask, val_w, val_mask, cfg, epochs=epochs, seed=SEED, verbose=True)

    test_sets = {
        "Shifted SSVI / regular missing": (ssvi_total_variance(n_test, shifted=True, seed=1001), make_mask("regular", n_test, seed=2001)),
        "Shifted SSVI / adversarial missing": (ssvi_total_variance(n_test, shifted=True, seed=1002), make_mask("adversarial", n_test, seed=2002)),
        "SABR-event / regular missing": (sabr_event_total_variance(n_test, seed=1003), make_mask("regular", n_test, seed=2003)),
        "SABR-event / adversarial missing": (sabr_event_total_variance(n_test, seed=1004), make_mask("adversarial", n_test, seed=2004)),
    }

    results = []
    for i, (name, (w, m)) in enumerate(test_sets.items()):
        print(f"Evaluating {name} ...", flush=True)
        results.append(evaluate_case(name, model, w, m, seed=3000 + i))
    print_results_table(results)

    # Save numeric summary.
    res_path = OUT_DIR / "vol_surface_repair_results.txt"
    with res_path.open("w", encoding="utf-8") as f:
        f.write("Results: 2x2 missing-point RMSE (vol units), diagnostics, downstream SVI\n")
        for r in results:
            f.write(
                f"{r['case']}, obs={r['observed_fraction']:.4f}, "
                f"cnn={r['cnn_missing_rmse']:.6f}+/-{r['cnn_missing_se']:.6f}, "
                f"interp={r['interp_missing_rmse']:.6f}+/-{r['interp_missing_se']:.6f}, "
                f"svi={r['svi_missing_rmse']:.6f}+/-{r['svi_missing_se']:.6f}, "
                f"raw_cal={r['calendar_raw_rate']:.6f}, post_cal={r['calendar_projected_rate']:.6f}, g_rate={r['butterfly_g_rate']:.6f}, "
                f"cov80={r['unc_cov80']:.4f}, cov95={r['unc_cov95']:.4f}, corr={r['unc_corr']:.4f}, auc={r['stale_auc']:.4f}, "
                f"svi_after_cnn={r['svi_after_cnn_rmse']:.6f}, svi_after_interp={r['svi_after_interp_rmse']:.6f}\n"
            )

    # Diagnostic plot from the SABR adversarial case.
    hard_w, hard_m = test_sets["SABR-event / adversarial missing"]
    make_diagnostic_plot(results[-1], hard_w, hard_m, idx=3)
    print(f"\nSaved diagnostic figure: {FIG_PATH}")
    print(f"Saved results summary: {res_path}")


if __name__ == "__main__":
    main()
    sys.stdout.flush()
    sys.stderr.flush()

Implied Volatility in Merton’s Jump Diffusion Model

The “implied volatility” corresponding to an option price is the value of the volatility parameter for which the Black-Scholes model gives the same price. A well-known phenomenon in market option prices is the “volatility smile”, in which the implied volatility increases for strike values away from the spot price. The jump diffusion model is a generalization of Black-Scholes in which the stock price has randomly occurring jumps in addition to the random walk behavior. One of the interesting properties of this model is that it displays the volatility smile effect. In this Demonstration, we explore the Black-Scholes implied volatility of option prices (equal for both put and call options) in the jump diffusion model. The implied volatility is modeled as a function of the ratio of option strike price to spot price.
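The price-then-invert loop described above can be sketched in a few lines. The snippet below is a minimal illustration, not the Demonstration's own code: it prices a call under Merton's jump-diffusion model as a Poisson-weighted mixture of Black-Scholes prices, then inverts the Black-Scholes formula by bisection to recover the implied volatility. All parameter values are hypothetical.

```python
import math

def _norm_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call(S: float, K: float, T: float, r: float, sigma: float) -> float:
    """Black-Scholes call price."""
    if sigma <= 0.0 or T <= 0.0:
        return max(S - K * math.exp(-r * T), 0.0)
    d1 = (math.log(S / K) + (r + 0.5 * sigma * sigma) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return S * _norm_cdf(d1) - K * math.exp(-r * T) * _norm_cdf(d2)

def merton_call(S, K, T, r, sigma, lam, mu_j, s_j, n_terms: int = 60) -> float:
    """Merton (1976) call: Poisson-weighted mixture of Black-Scholes prices.

    lam is the jump intensity; mu_j and s_j are the mean and std of the log jump size.
    """
    k_bar = math.exp(mu_j + 0.5 * s_j * s_j) - 1.0  # expected proportional jump size
    lam_p = lam * (1.0 + k_bar)
    price = 0.0
    for n in range(n_terms):
        # Conditional on n jumps: higher effective variance, drift-adjusted rate.
        sigma_n = math.sqrt(sigma * sigma + n * s_j * s_j / T)
        r_n = r - lam * k_bar + n * (mu_j + 0.5 * s_j * s_j) / T
        weight = math.exp(-lam_p * T) * (lam_p * T) ** n / math.factorial(n)
        price += weight * bs_call(S, K, T, r_n, sigma_n)
    return price

def implied_vol(price, S, K, T, r, lo=1e-4, hi=3.0, tol=1e-8) -> float:
    """Invert Black-Scholes by bisection (the call price is monotone in sigma)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bs_call(S, K, T, r, mid) < price:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

Sweeping the strike with a negative mean jump size produces implied volatilities that rise for strikes below spot, reproducing the smile/skew effect the text describes.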


Covered Writes, Covered Wrongs

What is a Covered Call?

A covered call (or covered write, or buy-write) is a long position in a security and a short position in a call option on that security.  The diagram below constructs the covered call payoff at expiration, including the option premium, for a call written at a $100 strike for a $25 premium.

(Figure: covered call payoff at expiration)
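The payoff construction above can be written as a one-line function per leg. This is an illustrative sketch using the numbers from the text ($100 strike, $25 premium); the stock's cost basis is assumed to be $100, since the text does not state the entry price.

```python
def covered_call_payoff(s_t: float, strike: float, premium: float, cost_basis: float) -> float:
    """P&L at expiration of long stock plus a short call, including the premium received."""
    stock_pnl = s_t - cost_basis                       # linear stock leg
    short_call_pnl = premium - max(s_t - strike, 0.0)  # short call: keep premium, pay out intrinsic
    return stock_pnl + short_call_pnl
```

Above the strike the payoff is capped at strike - cost_basis + premium = $25; below the strike the position loses dollar-for-dollar with the stock, cushioned only by the $25 premium.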

Equity index covered calls are an attractive strategy to many investors because they have realized returns not much lower than those of the equity market, but with much lower volatility.  But investors often do the trade for the wrong reasons: there are a number of myths about covered writes that persist even amongst professional options traders.  I have heard most, if not all, of them professed by seasoned floor traders on the American Stock Exchange and, I confess, I have even used one or two of them myself.  Roni Israelov and Lars Nielsen of AQR Capital Management, LLC have done a fine job of elucidating and then dispelling these misunderstandings about the strategy in their paper Covered Call Strategies: One Fact and Eight Myths, Financial Analysts Journal, Vol. 70, No. 6, 2014.


The Covered Call Strategy and its Benefits for Investors

The covered call strategy has generated attention due to its attractive historical risk-adjusted returns. For example, the CBOE S&P 500 BuyWrite Index, the industry-standard covered call benchmark, is commonly described as providing average returns comparable to the S&P 500 Index with approximately two-thirds the volatility, supported by statistics such as those shown below.

Table 1

The key advantages of the strategy (compared to an outright, delta-one position) include lower volatility, beta and tail risk.  As a consequence, the strategy produces a higher risk-adjusted rate of return (Sharpe ratio).  Note, too, the beta convexity of the strategy, a topic I cover in this post:

http://jonathankinlay.com/2017/05/beta-convexity/

Although the BuyWrite Index has historically demonstrated similar total returns to the S&P 500, it does so with a reduced beta to the S&P 500 Index. However, it is important to also understand that the BuyWrite Index is more exposed to negative S&P 500 returns than positive returns. This asymmetric relationship to the S&P 500 is consistent with its payoff characteristics and results from the fact that a covered call strategy sells optionality. What this means in simple terms is that while drawdowns are somewhat mitigated by the revenue associated with call writing, the upside is capped by those same call options.

Understandably, a strategy that produces equity-like returns with lower beta and lower risk attracts considerable attention from investors.  According to Morningstar, growth in assets under management in covered call strategies has been over 25% per year over the 10 years through June 2014, with over $45 billion currently invested.

Myths about the Covered Call Strategy

Many option strategies are the subject of investor myths, partly, I suppose, because option strategies are relatively complicated and entail risks in several dimensions.  So it is quite easy for investors to become confused.  Simple anecdotes are attractive because they appear to cut through the complexity with an easily understood metaphor, but they can often be misleading.  An example is the widely held view, even amongst professional option traders, that selling volatility via strangles is a less risky approach than selling at-the-money straddles.  Intuitively, this makes sense: why wouldn't selling strangles, with strike prices (far) away from the current spot price, be less risky than selling straddles, with strike prices close to the spot price?  But, in fact, it turns out that selling straddles is the less risky of the two strategies; see this post for details:

http://jonathankinlay.com/2016/11/selling-volatility/

Likewise, the covered call strategy is subject to a number of “urban myths” that turn out to be unfounded:

Myth 1: Risk exposure is described by the payoff diagram

That is only true at expiration.  Along the way, the position will be marked to market and may produce a substantially different payoff if the trade is terminated early.  The same holds true for a zero-coupon bond: we know the terminal value for certain, but there can be considerable variation in the value of the asset from day to day.

Myth 2: Covered calls provide downside protection

This is partially true, but only in a very limited sense.  Unlike a long option hedge, the “protection” in a buy-write strategy is limited to the premium collected on the option sale, a relatively modest amount in most cases.  Consider a covered call position on a $100 stock with a $10 at-the-money call premium.  The covered call can potentially lose $90, while a long position in the same call can lose at most its $10 premium.  Each position has the same 50% delta exposure to the stock, but the covered call's downside risk is disproportionate to its stock exposure.  This is consistent with the covered call's realized upside and downside betas discussed earlier.
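The asymmetry in the $100-stock example is a one-line calculation; the sketch below just makes the arithmetic of the paragraph above explicit, with the worst case at expiration taken to be the stock going to zero.

```python
def worst_case_losses(spot: float, call_premium: float) -> tuple:
    """Worst-case expiration losses when the stock goes to zero."""
    covered_call_loss = -(spot - call_premium)  # lose the stock, keep the premium
    long_call_loss = -call_premium              # a purchased call can lose only its premium
    return covered_call_loss, long_call_loss
```

With the text's numbers, worst_case_losses(100.0, 10.0) gives a $90 worst-case loss for the covered call against $10 for the long call, despite both positions carrying the same initial delta.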

Myth 3: Covered calls generate income.

Remember that income is revenue minus costs.

It is true that option selling generates positive cash flow, but this incorrectly leads investors to conclude that covered calls generate investment income.  Just as with bond issuance, the revenue generated from selling the call option is not income (though, like income, the cash flows received from selling options are taxable for many investors).  In order for there to be investment income or earnings, the option must be sold at a favorable price: the option's implied volatility needs to be higher than the stock's expected realized volatility.

Myth 4: Covered calls on high-volatility stocks and/or shorter-dated options provide higher yield.

It is true that high-volatility stocks and short-dated options command higher annualized premiums: insurance on riskier assets should rationally command a higher premium, and selling insurance more often per year should produce higher annual premiums.  However, these do not equate to higher net income or yield.  For instance, if options are fairly priced (e.g., according to the Black-Scholes pricing model), then selling twelve monthly at-the-money options will generate approximately 3.5 times the cash flow of selling a single annual option, but this does not translate into higher net profit, for the reasons discussed earlier.  Assuming fairly priced options, higher revenue is not a mechanism for increasing investment income.
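The roughly 3.5x figure follows from premium scaling with the square root of time to expiry. A sketch using the standard Brenner-Subrahmanyam approximation for an at-the-money premium (roughly 0.4 · S · sigma · sqrt(T) at zero rates; the spot and vol inputs here are hypothetical and cancel in the ratio):

```python
import math

def atm_premium_approx(spot: float, sigma: float, t_years: float) -> float:
    # Brenner-Subrahmanyam approximation to an ATM Black-Scholes premium (zero rates).
    return 0.4 * spot * sigma * math.sqrt(t_years)

annual = atm_premium_approx(100.0, 0.20, 1.0)          # one 12-month option
monthly = atm_premium_approx(100.0, 0.20, 1.0 / 12.0)  # one 1-month option
ratio = 12 * monthly / annual  # = 12 * sqrt(1/12) = sqrt(12), about 3.46
```

Twelve monthly sales gross sqrt(12), about 3.46, times the premium of one annual sale, which is where the "approximately 3.5 times" comes from; it is extra revenue, not extra expected profit.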

The key point here is that what matters is value, not price. In other words, expected investment profits are generated by the option’s richness, not the option’s price. For example, if you want to short a stock with what you consider to be a high valuation, then the goal is not to find a stock with a high price, but rather one that is overpriced relative to its fundamental value. The same principle applies to options. It is not appropriate to seek an option with a high price or other characteristics associated with high prices. Investors must instead look for options that are expensive relative to their fundamental value.  Put another way, the investor should seek out options trading at a higher implied volatility than the likely futures realized volatility over the life of the option.

Myth 5: Time decay of options written works in your favor.

While it is true that the value of an option declines over time as it approaches expiration, that is not the whole story.  In fact an option’s expected intrinsic value increases as the underlying security realizes volatility.  What matters is whether the realized volatility turns out to be lower than the volatility baked into the option price – the implied volatility.  In truth, an option’s time decay only works in the seller’s favor if the option is initially priced expensive relative to its fundamental value. If the option is priced cheaply, then time decay works very much against the seller.

Myth 6: Covered calls are appropriate if you have a neutral to moderately bullish view.

This myth is an over-simplification.  In selling a call option you are expressing a view, not only on the future prospects for the stock, but also on its likely future volatility.  It is entirely possible that the stock could stall (or even decline) and yet the value of the option you have sold rises due, say, to takeover rumors.  A neutral view on the stock may imply a belief that the security price will not move far from its current price, rather than that its expected return is zero. If so, then a short straddle position, not a covered call, is the way to express that view, because in this case no active position should be taken in the security.

Myth 7: Overwriting pays you for doing what you were going to do anyway

This myth is typically posed as the following question: if you have a price target for selling a stock you own, why not get paid to write a call option struck at that price target?

In fact this myth exposes the critical difference between a plan and a contractual obligation. In the former case, suppose that the stock hits your target price much more quickly than you had anticipated, perhaps as a result of a new product announcement that was unforeseeable at the time you set the target.  In those circumstances you might very well choose to maintain your long position and revise your price target upwards. This is an example of a plan – a successful one – that can be adjusted to suit circumstances as they change.

A covered call strategy is an obligation, rather than a plan.  You have pre-sold the stock at the target price and, in the above scenario, you cannot change your mind in order to benefit from additional potential upside in the stock.

In other words, with a covered call strategy you have monetized the optionality that is inherent in any plan and turned it into a contractual obligation in exchange for a fee.

Myth 8: Overwriting allows you to buy a stock at a discounted price.

Here is how this myth is typically framed: suppose a stock that you would like to own is currently priced at $100, a level you consider expensive. You can act on that opinion by selling a naked put option at a $95 strike price and collecting a premium of, say, $1. If the price subsequently declines below the strike price, the option will likely be exercised, requiring you to buy the stock for $95. Including the $1 premium, you effectively buy the stock at a 6% discount. If the option is not exercised, you keep the premium as income. This type of outcome for selling naked put options may lead you to conclude that the equivalent covered call strategy makes sense and is valuable.

But this argument is really a sleight of hand.  In our example above, if the option is exercised, then when you buy the stock for $95 you won’t care what the stock price was when you sold the option. What matters is the stock price on the date the option was exercised. If the stock price dropped all the way down to $80, the $95 purchase price no longer seems like a discount. Your P&L will show a mark-to-market loss of $14 ($95 – $80 – $1). The initial stock price is irrelevant and the $1 premium hardly helps.
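The arithmetic is worth making explicit. Using the hypothetical numbers from the example above:

```python
# Hypothetical numbers from the example: stock at $100 when the $95-strike put
# is sold for a $1 premium; the stock is at $80 when the put is exercised.
initial_price, strike, premium, exercise_price = 100.0, 95.0, 1.0, 80.0

effective_cost = strike - premium                        # 94.0
headline_discount = 1 - effective_cost / initial_price   # "6% off" the ORIGINAL price
mtm_loss = strike - exercise_price - premium             # 14.0 vs the price that matters

print(effective_cost, round(headline_discount, 2), mtm_loss)
```

The "discount" is measured against a price that no longer exists; the loss is measured against the price that does.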

Conclusion: How to Think About the Covered Call Strategy

Investors should ignore the misleading storytelling about obtaining downside buffers and generating income. A covered call strategy only generates income to the extent that any other strategy generates income, by buying or selling mispriced securities or securities with an embedded risk premium. Avoid the temptation to overly focus on payoff diagrams. If you believe the index will rise and implied volatilities are rich, a covered call is a step in the right direction towards expressing that view.

If you have no view on implied volatility, there is no reason to sell options, or covered calls.


Riders on the Storm

The Worst Volatility Scare for Years

February 2018 was an insane month for stocks, wrote CNN:

A profound inflation scare. Not one but two 1,000-point plunges for the Dow. And a powerful comeback that almost went straight back up.

The CNN story-line continues:

The Dow plummeted more than 3,200 points, or 12%, in just two weeks. Then stocks raced back to life, at one point recovering about three-quarters of those losses.

Fittingly, February ended with more drama. The Dow tumbled 680 points during the month’s final two days, leaving it down about 1,600 points from the record high in late January.

The headline in the Financial Times was a little more nuanced, focusing on the impact of the market turmoil on quant hedge funds:

 


Quant Funds Get Trashed

The FT reported:

Computer-driven, trend-following hedge funds are heading for their worst month in nearly 17 years after getting whipsawed when the stock market’s steady soar abruptly reversed into one of the quickest corrections in history earlier in February.

The carnage amongst hedge funds was widespread, according to the article:

Société Générale’s CTA index is down 5.55 per cent this month, even after the recent market rebound, making it the worst period for these systematic hedge funds since November 2001.
Man AHL’s $1.1bn Diversified fund lost almost 10 per cent in the month to February 16, while the London investment firm’s AHL Evolution and Alpha funds were down about 4-5 per cent over the same period. The flagship funds of GAM’s Cantab Capital, Systematica and Winton lost 9.5 per cent, 7.2 per cent and 4.6 per cent* respectively between the start of the month and February 16. Aspect Capital’s Diversified Fund dropped 9.5 per cent in the month to February 20, while a trend-following fund run by Lynx Asset Management slumped 12.7 per cent. A leveraged version of the same fund tumbled 18.8 per cent. One of the other big victims is Roy Niederhoffer, whose fund lost 21.1 per cent in the month to February 20.

Painful reading, indeed.

 

Traders conditioned to a state of somnambulance were shocked by the ferocity of the volatility spike, as the CBOE VIX index soared by over 200% in a single day, reaching a high of over 38 on Feb 5th:

 

VIX Index

 

Indeed, this turned out to be the largest ever two-day increase in the history of the index:


This Quant Strategy Made 27% In February Alone

So, for a quant-driven options strategy that is typically a premium seller, February must surely have been a disaster, if not a total wipe-out.  Not quite.  On the contrary, our Option Trader strategy made a massive gain of 27% for the month.  As a result strategy performance is now running at over 55% for 2018 YTD, while maintaining a Sharpe Ratio of 2.23.

Option Trader

You can tell that the strategy has a tendency to collect option premiums, not only because the strategy description says as much, but also from the observation that over 90% of strategy trades have been profitable – one of the defining characteristics of volatility strategies that are short-Vega, long-Theta.  The theory is that such strategies make money most of the time, but then give it all back (and more) when volatility inevitably spikes.  While that is generally true in my experience, it clearly didn’t occur here.  So what’s the story?

One of the advantages of our Algo Trading Platform is that it not only reports in detail the live performance of our strategies, but also reveals the actual trades on the site (typically delayed by 24-72 hours).  A review of the trades made by the Option Trader strategy from the end of January through early February indicates a strongly bullish bias, with short put trades in stocks such as Netflix, Inc. (NFLX), Shopify Inc. (SHOP), The Goldman Sachs Group, Inc. (GS) and Facebook, Inc. (FB), coupled with short call trades in VIX ETF products such as ProShares Ultra VIX Short-Term Futures (UVXY) and iPath S&P 500 VIX ST Futures ETN (VXX).  As volatility began to spike on 2/5, more calls were sold at increasingly fat premiums in several of the VIX Index ETFs.  These short volatility positions were later hedged with long trades in the underlying ETFs and, over time, both the hedges and the original option sales proved highly profitable. In other words, the extremely high levels of volatility enabled the strategy to profit on both legs of the trade, a highly unusual occurrence.

Meanwhile, while it was hedging its bets in the VIX ETF option trades, the strategy became increasingly aggressive in the single stocks sector, taking outright long positions in Baidu, Inc. (BIDU), Align Technology, Inc. (ALGN), Netflix, Inc. (NFLX) and others, just as they began trading off their lows in the second week of the month.  By around Feb 12th the strategy recognized that the volatility shock had begun to subside and took advantage of the inflated option premia, selling puts across the board, in particular in the technology (Tesla, Inc. (TSLA), NVIDIA Corporation (NVDA)) and retail sectors (GrubHub Inc. (GRUB), Alibaba Group Holding Limited (BABA)) that had suffered especially heavy declines.  Many of these trades were closed at a substantial profit within a span of just a few days as the market stabilized and volatility subsided.

The strategy broadened the scope of its option selling as the month progressed, first recovering the entirety of the drawdown it had initially suffered, then going on to register substantial profits on almost every trade.

To summarize:

  1. Like many other market players, the Option Trader strategy was initially caught on the wrong side of the volatility spike and suffered a significant drawdown.
  2. Instead of liquidating positions, the strategy began hedging aggressively in sectors holding the greatest danger – VIX ETFs, in particular.  These trades ultimately proved profitable on both option and hedge legs as the market turned around and volatility collapsed.
  3. As soon as volatility showed signs of easing, the strategy began making aggressive bets on market stabilization and recovery, taking long positions in some of the most beaten-down stocks and selling puts across the board to capture inflated option premia.

Lesson Learned:  Aggressive Defense is the best Options Strategy in a Volatile Market

If there is one lesson above all others to be learned from this case study it is this: a period of market turmoil is a time of opportunity for option traders, but only if they play aggressively, both in defense and offense.  Many traders run scared at times like this and liquidate positions, taking heavy losses in the process that can prove impossible to recover from if, as here, the drawdown is severe.  This study shows that by holding one’s nerve, hedging rather than liquidating loss-making positions, and then moving aggressively to capitalize on inflated option prices, a trader can not only weather the storm but, as in this case, produce exceptional returns.

The key take-away is this: in order to play aggressively you have to have sufficient reserves in the tank to enable you to hold positions rather than liquidate them and, later on, to transition to selling expensive option premiums.  The mistake many option traders make is to trade too close to the line in terms of margin limits, resulting in the forced liquidation of positions that would otherwise have been profitable.

You can trade the Option Trader strategy live in your own brokerage account – go here for details.

 

 

Investing in Leveraged ETFs – Theory and Practice

Summary

Leveraged ETFs suffer from decay, or “beta slippage.” Researchers have attempted to exploit this effect by shorting pairs of long and inverse leveraged ETFs.

The results of these strategies look good if you assume continuous compounding, but are often poor when less frequent compounding is assumed.

In reality, the trading losses incurred in rebalancing the portfolio, which requires you to sell low and buy high, overwhelm any benefit from decay, making the strategies unprofitable in practice.

A short levered ETF strategy has similar characteristics to a short straddle option position, with positive Theta and negative Gamma, and will experience periodic, large drawdowns.

It is possible to develop leveraged ETF strategies producing high returns and Sharpe ratios with relative value techniques commonly used in option trading strategies.


Decay in Leveraged ETFs

Leveraged ETFs continue to be much discussed on Seeking Alpha.

One aspect in particular that has caught analysts’ attention is the decay, or “beta slippage” that leveraged ETFs tend to suffer from.

Seeking Alpha contributor Fred Picard in a 2013 article (“What You Need To Know About The Decay Of Leveraged ETFs“) described the effect using the following hypothetical example:

To understand what is beta-slippage, imagine a very volatile asset that goes up 25% one day and down 20% the day after. A perfect double leveraged ETF goes up 50% the first day and down 40% the second day. On the close of the second day, the underlying asset is back to its initial price:

(1 + 0.25) x (1 – 0.2) = 1

And the perfect leveraged ETF?

(1 + 0.5) x (1 – 0.4) = 0.9

Nothing has changed for the underlying asset, and 10% of your money has disappeared. Beta-slippage is not a scam. It is the normal mathematical behavior of a leveraged and rebalanced portfolio. In case you manage a leveraged portfolio and rebalance it on a regular basis, you create your own beta-slippage. The previous example is simple, but beta-slippage is not simple. It cannot be calculated from statistical parameters. It depends on a specific sequence of gains and losses.
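Fred's two-day example can be reproduced, and generalized, in a few lines. A minimal sketch assuming a perfectly tracking, daily-rebalanced leveraged fund with no fees:

```python
import numpy as np

def terminal_value(returns, leverage):
    """Growth of $1 in a perfectly tracking, daily-rebalanced leveraged fund."""
    return float(np.prod(1.0 + leverage * np.asarray(returns)))

# The two-day example: +25% then -20% leaves the underlying flat...
example = [0.25, -0.20]
print(terminal_value(example, 1))   # 1.0 (to floating-point precision)
print(terminal_value(example, 2))   # 0.9 -> the 2x fund has lost 10%

# ...and the slippage compounds over longer, choppier paths
rng = np.random.default_rng(0)
choppy = rng.normal(0.0, 0.03, 60)  # an arbitrary volatile return sequence
print(terminal_value(choppy, 2) / terminal_value(choppy, 1) ** 2)  # < 1
```

The last line makes Fred's point quantitative: the 2x fund always lags the squared underlying on a volatile path, since each daily factor (1+2r)/(1+r)² = 1 − r²/(1+r)² is below one whenever r ≠ 0.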

Fred goes on to make the point that is the crux of this article, as follows:

At this point, I’m sure that some smart readers have seen an opportunity: if we lose money on the long side, we make a profit on the short side, right?

Shorting Leveraged ETFs

Taking his cue from Fred’s article, Seeking Alpha contributor Stanford Chemist (“Shorting Leveraged ETF Pairs: Easier Said Than Done“) considers the outcome of shorting pairs of leveraged ETFs, including the Market Vectors Gold Miners ETF (NYSEARCA:GDX), the Direxion Daily Gold Miners Bull 3X Shares ETF (NYSEARCA:NUGT) and the Direxion Daily Gold Miners Bear 3X Shares ETF (NYSEARCA:DUST).

His initial finding appears promising:

Therefore, investing $10,000 each into short positions of NUGT and DUST would have generated a profit of $9,830 for NUGT, and $3,900 for DUST, good for an average profit of 68.7% over 3 years, or 22.9% annualized.

At first sight, this appears to be a nearly risk-free strategy; after all, you are shorting both the 3X leveraged bull and 3X leveraged bear funds, which should result in a market-neutral position. Is there easy money to be made?

Source: Stanford Chemist

Not so fast! Stanford Chemist applies the same strategy to another ETF pair, with a very different outcome:

“What if you had instead invested $10,000 each into short positions of the Direxion Russell 1000 Financials Bullish 3X ETF (NYSEARCA:FAS) and the Direxion Russell 1000 Financials Bearish 3X ETF (NYSEARCA:FAZ)?

The $10,000 short position in FAZ would have gained you $8,680. However, this would have been dwarfed by the $28,350 loss that you would have sustained in shorting FAS. In total, you would be down $19,670, which translates into a loss of 196.7% over three years, or 65.6% annualized.

No free lunch there.

The Rebalancing Issue

Stanford Chemist puts his finger on one of the key issues: rebalancing. He explains as follows:

So what happened to the FAS-FAZ pair? Essentially, what transpired was that as the underlying asset XLF increased in value, the two short positions became unbalanced. The losing side (short FAS) ballooned in size, making further losses more severe. On the other hand, the winning side (short FAZ) shrunk, muting the effect of further gains.

To counteract this effect, the portfolio needs to be rebalanced. Stanford Chemist looks at the implications of rebalancing a short NUGT-DUST portfolio whenever the market value of either ETF deviates by more than N% from its average value, where he considers N% in the range from 10% to 100%, in increments of 10%.

While the annual portfolio return was positive in all but one of these scenarios, there was very considerable variation in the outcomes, with several of the rebalanced portfolios suffering very large drawdowns of as much as 75%:

Source: Stanford Chemist

The author concludes:

The results of the backtest showed that profiting from this strategy is easier said than done. The total return performances of the strategy over the past three years was highly dependent on the rebalancing thresholds chosen. Unfortunately, there was also no clear correlation between the rebalancing period used and the total return performance. Moreover, the total return profiles showed that large drawdowns do occur, meaning that despite being ostensibly “market neutral”, this strategy still bears a significant amount of risk.

Leveraged ETF Pairs – Four Case Studies

Let’s press pause here and review a little financial theory. As you recall, it is possible to express a rate of return in many different ways, depending on how interest is compounded. The most typical case is daily compounding:

R = (Pt – Pt-1) / Pt-1

Where Pt is the price on day t, and Pt-1 is the price on day t-1, one day prior.

Another commonly used alternative is continuous compounding, also sometimes called log-returns:

R = Ln(Pt) – Ln(Pt-1)

Where Ln(Pt) is the natural log of the price on day t.

When a writer refers to a rate of return, he should make clear what compounding basis the return rate is quoted on, whether continuous, daily, monthly or some other frequency. Usually, however, the compounding basis is clear from the context. Besides, it often doesn’t make a large difference anyway. But with leveraged ETFs, even microscopic differences can produce substantially different outcomes.
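For leveraged ETF pairs the choice of compounding basis is anything but a rounding detail. A stylized sketch, assuming perfect ±3x daily tracking and ignoring borrow and transaction costs: with daily rebalancing the two legs' simple returns cancel exactly each day, so the short pair earns nothing, whereas summing log returns books the decay of both legs as an apparent profit to the short position.

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.normal(0.0, 0.02, 252)   # assumed daily returns of the underlying index

bull = 3.0 * u                    # perfect 3x daily-rebalanced ETF returns
bear = -3.0 * u                   # perfect -3x daily-rebalanced ETF returns

# Daily compounding: the short pair's daily return is minus the average of the
# legs' simple returns -- which cancel exactly, so the pair earns nothing.
daily = np.prod(1.0 - 0.5 * (bull + bear)) - 1.0

# Continuous compounding: summing the legs' log returns counts the decay of
# both legs as profit to the short position.
cont = -0.5 * (np.sum(np.log1p(bull)) + np.sum(np.log1p(bear)))

print(round(daily, 6))   # 0.0
print(cont > 0)          # True: the "decay harvest" exists only on paper
```

The gap between the two numbers is precisely the profit that continuous rebalancing would have to capture, and, as discussed below, cannot capture in practice.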

I will illustrate the effect of compounding by reference to examples of portfolios comprising short positions in the following representative pairs of leveraged ETFs:

  • Direxion Daily Energy Bull 3X Shares ETF (NYSEARCA:ERX)
  • Direxion Daily Energy Bear 3X Shares ETF (NYSEARCA:ERY)
  • Direxion Daily Gold Miners Bull 3X ETF
  • Direxion Daily Gold Miners Bear 3X ETF
  • Direxion Daily S&P 500 Bull 3X Shares ETF (NYSEARCA:SPXL)
  • Direxion Daily S&P 500 Bear 3X Shares ETF (NYSEARCA:SPXS)
  • Direxion Daily Small Cap Bull 3X ETF (NYSEARCA:TNA)
  • Direxion Daily Small Cap Bear 3X ETF (NYSEARCA:TZA)

The findings in relation to these pairs are mirrored by results for other leveraged ETF pairs.

First, let’s look at the returns of the ETF portfolios measured using continuous compounding.

Source: Yahoo! Finance

The portfolio returns look very impressive, with CAGRs ranging from around 20% for the short TNA-TZA pair, to over 124% for the short NUGT-DUST pair. Sharpe ratios, too, appear abnormally large, ranging from 4.5 for the ERX-ERY short pair to 8.4 for NUGT-DUST.

Now let’s look at the performance of the same portfolios measured using daily compounding.

Source: Yahoo! Finance

It’s an altogether different picture. None of the portfolios demonstrates an attractive performance record, and indeed in two cases the CAGR is negative.

What’s going on?

Stock Loan Costs

Before providing the explanation, let’s just get one important detail out of the way. Since you are shorting both legs of the ETF pairs, you will be faced with paying stock borrow costs. Borrow costs for leveraged ETFs can be substantial and depending on market conditions amount to as much as 10% per annum, or more.

In computing the portfolio returns in both the continuous and daily compounding scenarios I have deducted annual stock borrow costs based on recent average quotes from Interactive Brokers, as follows:

  • ERX-ERY: 14%
  • NUGT-DUST: 16%
  • SPXL-SPXS: 8%
  • TNA-TZA: 8%

It’s All About Compounding and Rebalancing

The implicit assumption in the computation of the daily compounded returns shown above is that you are rebalancing the portfolios each day. That is to say, it is assumed that at the end of each day you buy or sell sufficient quantities of shares of each ETF to maintain an equal $ value in both legs.

In the case of continuously compounded returns the assumption you are making is that you maintain an equal $ value in both legs of the portfolio at every instant. Clearly that is impossible.

Ok, so if the results from low frequency rebalancing are poor, while the results for instantaneous rebalancing are excellent, it is surely just a question of rebalancing the portfolio as frequently as is practically possible. While we may not be able to achieve the ideal result from continuous rebalancing, the results we can achieve in practice will reflect how close we can come to that ideal, right?

Unfortunately, not.

Because, while we have accounted for stock borrow costs, what we have ignored in the analysis so far are transaction costs.

Transaction Costs

With daily rebalancing transaction costs are unlikely to be a critical factor – one might execute a single trade towards the end of the trading session. But in the continuous case, it’s a different matter altogether.

Let’s use the SPXL-SPXS pair as an illustration. When the S&P 500 index declines, the value of the SPXL ETF will fall, while the value of the SPXS ETF will rise. In order to maintain the same $ value in both legs you will need to sell more shares in SPXL and buy back some shares in SPXS. If the market trades up, SPXL will increase in value, while the price of SPXS will fall, requiring you to buy back some SPXL shares and sell more SPXS.

In other words, to rebalance the portfolio you will always be trying to sell the ETF that has declined in price, while attempting to buy the inverse ETF that has appreciated. It is often very difficult to execute a sale in a declining stock at the offer price, or buy an advancing stock at the inside bid. To be sure of completing the required rebalancing of the portfolio, you are going to have to buy at the ask price and sell at the bid price, paying the bid-offer spread each time.

Spreads in leveraged ETF products tend to be large, often several pennies. The cumulative effect of repeatedly paying the bid-ask spread, while taking trading losses on shares sold at the low or bought at the high, will be sufficient to overwhelm the return you might otherwise hope to make from the ETF decay.
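The effect can be sketched with a toy simulation. The cost number is an assumption (5 basis points one way), not a measured spread; the point is only the direction of the result: the daily-rebalanced short pair earns zero gross return here, so any spread paid on the rebalancing turnover pushes the P&L negative.

```python
import numpy as np

rng = np.random.default_rng(2)
u = rng.normal(0.0, 0.02, 252)       # assumed daily returns of the underlying

half_spread = 0.0005                  # assumed one-way cost per $ traded
wealth = 1.0                          # account equity
bull = bear = 0.5                     # $ liability short in each 3x leg

for r in u:
    bull_new = bull * (1 + 3 * r)     # short-bull liability moves with the market
    bear_new = bear * (1 - 3 * r)
    # P&L of a short position is minus the change in the liability;
    # from a balanced start the two changes cancel exactly.
    wealth -= (bull_new - bull) + (bear_new - bear)
    # Rebalance back to equal $ value in each leg, paying the spread on turnover.
    target = 0.5 * (bull_new + bear_new)
    turnover = abs(bull_new - target) + abs(bear_new - target)
    wealth -= half_spread * turnover
    bull = bear = target

print(round(wealth, 4))               # below 1.0: costs erode a zero gross return
```

Shorten the rebalancing interval and the gross return approaches the continuous-compounding figure, but the turnover, and hence the cost drag, grows with it.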

And that’s assuming the best case scenario that shares are always available to short. Often they may not be: so that, if the market trades down and you need to sell more SPXL, there may be none available and you will be unable to rebalance your portfolio, even if you were willing to pay the additional stock loan costs and bid-ask spread.

A Lose-Lose Proposition

So, in summary: if you rebalance infrequently you will avoid excessive transaction costs; but the $ imbalance that accrues over the course of a trading day will introduce a market bias in the portfolio. That can hurt portfolio returns very badly if you get caught on the wrong side of a major market move. The results from daily rebalancing for the illustrative pairs shown above indicate that this is likely to happen all too often.

On the other hand, if you try to maintain market neutrality in the portfolio by rebalancing at high frequency, the returns you earn from decay will be eaten up by transaction costs and trading losses, as you continuously sell low and buy high, paying the bid-ask spread each time.

Either way, you lose.

Ok, what about if you reverse the polarity of the portfolio, going long both legs? Won’t that avoid the very high stock borrow costs and put you in a better position as regards the transaction costs involved in rebalancing?

Yes, it will. Because you will be selling when the market trades up and buying when it falls, it is much easier to avoid paying the bid-ask spread. You will also tend to make short-term trading profits by selling high and buying low. Unfortunately, you may not be surprised to learn, these advantages are outweighed by the cost of the decay incurred in both legs of the long ETF portfolio.

In other words: you can expect to lose if you are short; and lose if you are long!

An Analogy from Option Theory

To anyone with a little knowledge of basic option theory, what I have been describing should sound like familiar territory.

Being short the ETF pair is like being short an option (actually a pair of call and put options, called a straddle). You earn decay, or Theta, for those familiar with the jargon, by earning the premium on the options you have sold; but at the risk of being short Gamma – which measures your exposure to a major market move.

Source: Interactive Brokers

You can hedge out the portfolio’s Gamma exposure by trading the underlying securities – the ETF pair in this case – and when you do that you find yourself always having to sell at the low and buy at the high. If the options are fairly priced, the option decay is enough, but not more, to compensate for the hedging cost involved in continuously trading the underlying.

Conversely, being long the ETF pair is like being long a straddle on the underlying pair. You now have positive Gamma exposure, so your portfolio will make money from a major market move in either direction. However, the value of the straddle, initially the premium you paid, decays over time at a rate Theta (also known as the “bleed”).

Source: Interactive Brokers

You can offset the bleed by performing what is known as Gamma trading. When the market trades up your portfolio delta becomes positive, i.e. an excess $ value in the long ETF leg, enabling you to rebalance your position by selling excess deltas at the high. Conversely, when the market trades down, your portfolio delta becomes negative and you rebalance by buying the underlying at the current, reduced price. In other words, you sell at the high and buy at the low, typically making money each time. If the straddle is fairly priced, the profits you make from Gamma trading will be sufficient to offset the bleed, but not more.

Typically, the payoff from being short options – being short the ETF pair – will show consistent returns for sustained periods, punctuated by very large losses when the market makes a significant move in either direction.

Conversely, if you are long options – long the ETF pair – you will lose money most of the time due to decay and occasionally make a very large profit.

In an efficient market in which securities are fairly priced, neither long nor short option strategy can be expected to dominate the other in the long run. In fact, transaction costs will tend to produce an adverse outcome in either case! As with most things in life, the house is the player most likely to win.

Developing a Leveraged ETF Strategy that Works

Investors shouldn’t be surprised that it is hard to make money simply by shorting leveraged ETF pairs, just as it is hard to make money by selling options, without risking blowing up your account.

And yet, many traders do trade options and often manage to make substantial profits. In some cases traders are simply selling options, hoping to earn substantial option premiums without taking too great a hit when the market explodes. They may get away with it for many years, before blowing up. Indeed, that has been the case since 2009. But who would want to be an option seller here, with the market at an all-time high? It’s simply far too risky.

The best option traders make money by trading both the long and the short side. Sure, they might lean in one direction or the other, depending on their overall market view and the opportunities they find. But they are always hedged, to some degree. In essence what many option traders seek to do is what is known as relative value trading – selling options they regard as expensive, while hedging with options they see as being underpriced. Put another way, relative value traders try to buy cheap Gamma and sell expensive Theta.

This is how one can thread the needle in leveraged ETF strategies. You can’t hope to make money simply by being long or short all the time – you need to create a long/short ETF portfolio in which the decay in the ETFs you are short is greater than in the ETFs you are long. Such a strategy is, necessarily, tactical: your portfolio holdings and net exposure will likely change from long to short, or vice versa, as market conditions shift. There will be times when you will use leverage to increase your market exposure and occasions when you want to reduce it, even to the point of exiting the market altogether.

If that sounds rather complicated, I’m afraid it is. I have been developing and trading arbitrage strategies of this kind since the early 2000s, often using sophisticated option pricing models. In 2012 I began trading a volatility strategy in ETFs, using a variety of volatility ETF products, in combination with equity and volatility index futures.

I have reproduced the results from that strategy below, to give some indication of what is achievable in the ETF space using relative value arbitrage techniques.

Source: Systematic Strategies, LLC

Source: Systematic Strategies LLC

Conclusion

There are no free lunches in the market. The apparent high performance of strategies that engage systematically in shorting leveraged ETFs is an illusion, based on a failure to quantify the full costs of portfolio rebalancing.

The payoff from a short leveraged ETF pair strategy will be comparable to that of a short straddle position, with positive decay (Theta) and negative Gamma (exposure to market moves). Such a strategy will produce positive returns most of the time, punctuated by very large drawdowns.

The short Gamma exposure can be mitigated by continuously rebalancing the portfolio to maintain dollar neutrality. However, this will entail repeatedly buying ETFs as they trade up and selling them as they decline in value. The transaction costs and trading losses involved in continually buying high and selling low will eat up most, if not all, of the value of the decay in the ETF legs.

A better approach to trading ETFs is relative value arbitrage, in which ETFs with high decay rates are sold and hedged by purchases of ETFs with relatively low rates of decay.

An example is given of how this approach has been applied successfully in volatility ETFs since 2012.