Backtesting Archives - QUANTITATIVE RESEARCH AND TRADING

May 17, 2026

Agentic Workflows for Alpha Research

A 12-Week Practitioner Case Study

There is by now a small mountain of vendor material claiming that AI agents will run hedge funds. The reality on the ground — for those of us who actually do the work — is more interesting and more useful. Agentic workflows, properly constructed, materially accelerate the parts of quant research that consume the most time. They also fail in specific, predictable ways that you can defend against if you take them seriously and ignore if you don’t.

This post is a write-up of an architecture I have been using for the last four months on an FX-carry research project, and what it changed about my throughput. The headline finding is that the right unit of measurement is not “ideas per hour” — which is misleading — but ideas that survive a human-grade critique per month. On that metric the lift, on this single workstream, is on the order of 2× rather than 10×, and it comes from a very specific allocation of work between the human and the agent.

The single most important thing to internalise before reading further is that the architecture is the load-bearing piece — not the prompts, not the model choice. Most of what makes this stack work would still work if you swapped Claude for any other frontier model; very little of it would work if you swapped the typed handoffs, the research log, and the human gates for a single conversational thread. The recent multi-agent literature converges on the same conclusion from the software-engineering side — AutoGen [1] frames LLM applications as configurable agents with structured interaction, and MetaGPT [2] argues explicitly that encoding standard operating procedures into role-specialised pipelines is what produces reliable outputs. The point of this post is to make the same argument for the quant-research side, and to instrument the claim with measured numbers rather than vibes.

1. What alpha research actually consists of

Before discussing what to automate, it helps to be honest about what the day-to-day is.

A reasonable decomposition of the time I spend on a single research idea, end-to-end:

Literature triage and replication — finding the three papers that matter out of the thirty that cite the relevant phenomenon, and reproducing their core result. 20–25%.
Hypothesis specification — stating the economic claim precisely enough that a backtest can falsify it. 5%.
Data wrangling — sourcing, aligning, point-in-time correctness, handling holidays and corporate actions. 25–30%.
Implementation — writing the signal, the portfolio construction, the cost model, the evaluator. 10–15%.
Diagnostic and ablation work — by-regime, by-subsample, by-feature, transaction-cost sensitivity, parameter stability. 20%.
Judgment and synthesis — deciding whether what you have is real, whether it adds to the existing book, and whether to risk it. 10%.

The last category is the one that actually distinguishes a senior researcher from a junior one, and it is the category that AI agents are worst at. The first four are the categories where they are dramatically better than the alternative of doing it yourself.

The architecture I will describe is built around that asymmetry: aggressively delegate the first four, keep judgment human, and instrument the boundary between the two so failures are visible early.

2. The naive loop and why it fails

The seductive thing to do — and the thing every demo on Twitter shows — is to wire a single capable LLM up to a Python sandbox and a price-history database and tell it “find me alpha in EM FX”. I tried this. So has everyone.

What you get back, reliably, is a strategy with an in-sample Sharpe of 2.4 that does the following four things:

Uses some flavour of recent-return signal with a lookback chosen to fit the sample.
Sizes positions inversely proportional to realised volatility, with the volatility window also chosen to fit the sample.
Quietly references a feature whose construction has a one-step look-ahead bug.
Reports backtest statistics over a period that conveniently excludes the 2022 carry drawdown.

The agent is not malicious. It is doing exactly what you asked. The objective you wrote — “maximise Sharpe on this dataframe” — has no concept of out-of-sample, of economic prior, or of regime. An agent with code execution and a permissive objective is a specification-gaming machine, and the result is the alpha-research equivalent of a model that achieves 99% accuracy on MNIST by memorising the test set.

This is a textbook case of the failure modes formalised in Amodei et al. [3]: reward hacking when the objective is misspecified, distributional shift between training and deployment regimes, and absence of scalable oversight when the supervisor is the same LLM doing the optimisation. The lesson is that the single-agent, single-objective loop is the wrong abstraction. Quant research has more than one objective, and the objectives are partly adversarial.

3. The architecture: separated roles, instrumented handoffs

The setup that has worked for me has four roles, each instantiated as a separate LLM call with its own system prompt, tool access, and — importantly — its own context window. They communicate via a structured research-log database rather than by sharing memory directly.

Proposer. Reads recent literature and the current research log, and emits a single falsifiable hypothesis in a fixed schema: economic claim, dependent variable, predictor(s), sample, null. No code. Read access to a curated paper corpus and to the research log; no access to price data. Forcing the hypothesis through a schema is the single most important constraint in the whole stack — it makes “interesting-sounding but unfalsifiable” outputs impossible.

Implementer. Takes a single approved hypothesis and produces a notebook that tests it. Has read access to data and write access to a sandboxed compute environment. Critically, has no access to the results of prior implementations — this prevents the agent from anchoring on prior backtest numbers and tuning the new implementation to match.

Critic. Reads only the implementer’s notebook and its output. Its prompt is to produce an adversarial list of reasons the result might be spurious: look-ahead bugs, multiple-testing inflation, regime cherry-picking, cost-model optimism, feature contamination. Outputs a checklist with severity. The Critic does not get to fix anything; it only files findings.

Replicator. Takes the Critic’s findings and the original notebook and produces a panel of robustness tests: alternative samples, alternative cost assumptions, leave-one-out by feature, and deliberate ablations of any flagged components. Outputs a single comparison table.

Replicator independence at promotion stage. For any candidate that has cleared the Critic and is being considered for the second human gate, the Replicator is not allowed to reuse the Implementer’s feature-generation code. It receives only the hypothesis schema and a frozen data contract, and reimplements the signal independently. This turns the Replicator from a robustness-script generator into a genuine independent check, and catches at least one class of bug — silent feature-construction errors — that the Critic structurally cannot detect from reading the Implementer’s notebook alone.

The human (me) sits as a gate at two points: between Proposer and Implementer (does this hypothesis deserve compute?) and between Replicator and “promotion to candidate” (is the robustness panel convincing?). Everything in between runs without supervision.

What this is, and what it is not. The stack is autonomous only inside pre-specified rails. It is a controlled batch pipeline with LLM modules, not an autonomous research scientist. It does not choose its own data permissions, change its own validation criteria, redefine the promotion threshold, or promote its own results. That is by design — and it is the design feature that separates this from the “AI hedge fund” pitch. The fully autonomous research agent is, as far as I can tell, not yet a viable target; what is a viable target is making each non-judgment step of the research pipeline an order of magnitude cheaper, while leaving the judgment steps untouched.

The key invariant is that no role sees its own prior outputs as ground truth. Each handoff is a fresh context with the schema-typed artifact and nothing else. This is what kills the most common failure mode of single-agent loops, which is that the agent quietly accumulates evidence in favour of its earlier guesses.

Schematically:

                   ┌──────────────────────┐
                   │  Research-log DB     │
                   │  (typed artifacts)   │
                   └─────────┬────────────┘
                             │
   ┌─────────┐   hypothesis  │   notebook   ┌────────┐
   │Proposer ├───────────────┴──────────────┤Impl.   │
   └────┬────┘            ▲                 └───┬────┘
        │                 │                     │
     human gate           │                     │
        │                 │   findings          ▼
        │             ┌───┴────┐           ┌────────┐
        └────────────►│Critic  │◄──────────┤notebook│
                      └───┬────┘           │+ output│
                          │                └────────┘
                       robustness
                          ▼
                      ┌──────────┐
                      │Replicator│──► comparison table ──► human gate
                      └──────────┘
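To make the "schema-typed artifact and nothing else" handoff concrete, here is a minimal sketch of how the Proposer's output could be typed and validated before the next role sees it. The field names come from the hypothesis schema described above; the class name, `from_json`, and `implementer_context` are hypothetical names for illustration, not the code in the repository.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Hypothesis:
    """Schema-typed artifact handed from Proposer to Implementer."""
    economic_claim: str
    dependent_variable: str
    predictor: str
    sample: str
    null: str

    @classmethod
    def from_json(cls, raw: str) -> "Hypothesis":
        """Parse model output, rejecting missing, extra, or empty fields."""
        data = json.loads(raw)
        expected = {f.name for f in fields(cls)}
        if not isinstance(data, dict) or set(data) != expected:
            raise ValueError(f"expected exactly the keys {sorted(expected)}")
        for key, value in data.items():
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"field {key!r} must be a non-empty string")
        return cls(**data)


def implementer_context(h: Hypothesis) -> str:
    """Build the fresh context for the next role: the artifact and nothing else."""
    return json.dumps({f.name: getattr(h, f.name) for f in fields(h)}, indent=2)
```

Anything that fails `from_json` never reaches the human gate, which is one way to enforce the rule that unfalsifiable or half-specified output cannot enter the log.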

4. The objective function, written down

It is worth being explicit about what the system as a whole is optimising. A single Sharpe number is not it. The composite I use is:

U \;=\; \mathrm{IR}_{\text{oos}} \;-\; \lambda_1 \,\big|\mathrm{IR}_{\text{is}} - \mathrm{IR}_{\text{oos}}\big| \;-\; \lambda_2 \, k_{\text{eff}} \;-\; \lambda_3 \, S_{\text{tc}} \;-\; \lambda_4 \log\!\big(1 + N_{\text{trials}}\big) \;-\; \lambda_5 \, C_{\text{frag}}

Term by term:

Out-of-sample IR. The information ratio of the strategy on data the Implementer has not seen. The sample boundary is fixed by the Proposer in the hypothesis schema, not chosen by the Implementer.
Overfitting drift. The absolute gap between in-sample and out-of-sample IR. A strategy with a 2.0 in-sample IR and 0.4 out-of-sample IR is worse than one at 0.9 / 0.7. The penalty weight is calibrated ex ante and frozen before any candidate is evaluated.
Effective parameters, k-eff. A degrees-of-freedom proxy that counts lookback choices, thresholds, feature inclusions, regime switches, and any other knob whose value was set after seeing data. The count is generated by the Implementer at submission time as part of the notebook schema, not estimated post hoc. A strategy with three tuned knobs is preferred over an empirically-equal strategy with eleven.
Transaction-cost sensitivity, S-tc. The slope of net returns with respect to a 1 bp shift in assumed cost. A strategy that goes from a 0.8 IR at 2 bps assumed cost to 0.0 at 3 bps is fragile to a part of the world we do not know well, and the objective should say so.
Search-intensity penalty. A logarithmic penalty in the effective number of trials the stack has run on related hypotheses in the same workstream. This is the term that explicitly links the objective to the multiple-testing literature: White’s Reality Check [4] on data-snooping, Bailey, Borwein, López de Prado and Zhu [5] on the probability of backtest overfitting (the Deflated Sharpe Ratio, from related work by Bailey and López de Prado, gives a usable correction in the same spirit), and Harvey, Liu and Zhu [6] on inflated significance in factor research. Without it, an agentic stack that runs 38 hypotheses in 12 weeks will mechanically look better than a human who runs 11, even when the marginal hypothesis is no better, which is exactly the dynamic those papers warn against. The effective trial count is incremented every time the Implementer commits a notebook touching the same dependent variable, regardless of whether the result is positive.
Fragility penalty, C-frag. Captures dependence on one date range, one currency, one regime, one cost assumption, or one feature family. Computed as the maximum proportional loss in IR when any single such dimension is ablated. A strategy whose IR collapses when 2022 is excluded scores poorly regardless of headline performance.
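The composite can be sketched in a few lines. The weights below are placeholders chosen only so the example runs; they are not the calibrated λ values. Two further assumptions: S-tc is passed as a positive magnitude (IR lost per 1 bp of extra cost), and a strategy with a non-positive headline IR is treated as maximally fragile.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    """Penalty weights. Placeholder values, NOT a real calibration."""
    drift: float = 0.5    # lambda_1: in-sample vs out-of-sample gap
    k_eff: float = 0.02   # lambda_2: per tuned knob
    tc: float = 0.5       # lambda_3: transaction-cost sensitivity
    trials: float = 0.1   # lambda_4: search intensity
    frag: float = 0.5     # lambda_5: fragility


def fragility(ir_full: float, ir_ablated: list[float]) -> float:
    """Maximum proportional loss in IR when any single dimension is ablated."""
    if ir_full <= 0:
        return 1.0  # assumption: no positive IR to lose, treat as fully fragile
    worst = max(((ir_full - ir) / ir_full for ir in ir_ablated), default=0.0)
    return max(0.0, worst)


def composite_u(ir_oos: float, ir_is: float, k_eff: int, s_tc: float,
                n_trials: int, c_frag: float, w: Weights = Weights()) -> float:
    """Evaluate U term by term, in the order of the formula above."""
    return (ir_oos
            - w.drift * abs(ir_is - ir_oos)
            - w.k_eff * k_eff
            - w.tc * s_tc
            - w.trials * math.log(1 + n_trials)
            - w.frag * c_frag)
```

With everything else held equal, the 0.9 / 0.7 strategy from the overfitting-drift example scores above the 2.0 / 0.4 one for any positive drift weight, because it has both the higher out-of-sample IR and the smaller gap.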

The Proposer, Implementer, and Critic all see this composite. The Implementer is not told to maximise it — that would re-introduce the specification-gaming problem. It is told to test the hypothesis. The composite is used by the Critic to flag any result where any term contributes negatively beyond a fixed threshold, and by the human gate to compare candidates.

This is the same idea that underlies penalised regression: you write your taste explicitly into the objective rather than relying on the optimiser to share it. The λ weights are not magic; they are chosen so that — on a held-out historical set of strategies whose ex-post five-year outcomes are known — the ranking produced by U correlates with realised forward performance. The calibration is done once, before any candidate from the current workstream is evaluated, and is not re-tuned during the run.

5. The tooling, concretely

For practitioners who want to assemble something equivalent, the components I am using:

LLM: Claude Opus for Proposer and Critic (better at synthesis, more skeptical reading); Claude Sonnet for Implementer and Replicator (faster, sufficient for code). All calls go through the standard Anthropic SDK with prompt caching on the role system prompts — this matters for cost, since the role prompts are long and reused on every turn.
Execution sandbox: a pinned Docker image with pandas, numpy, statsmodels, scikit-learn, and a vendored copy of the data layer. No network. The sandbox is rebuilt nightly to keep dependencies fresh; the image hash is stored in every research-log entry so any result is exactly reproducible.
Research-log DB: SQLite with five tables — hypotheses, implementations, results, critiques, robustness. Every artifact has a UUID, a parent UUID, a timestamp, the image hash of the sandbox at the time, and the git commit of the data layer. This is the single most-valuable component and the one most people skip.
Data layer: a thin wrapper over the price store that enforces point-in-time correctness by construction. Any access by date t can only return data available at or before t. The wrapper raises if asked for anything later. This single guardrail prevents the most common look-ahead bug.
Human-gate UI: a tiny Streamlit app that surfaces (hypothesis, notebook, critique, robustness) as a single page with approve / reject / send-back-with-comment buttons. The friction here matters; if the gate is cumbersome you start waving things through.

A simplified version of the Proposer call, just to make it concrete:

# proposer.py
import anthropic, json
from research_log import recent_hypotheses, recent_critiques

client = anthropic.Anthropic()

SYSTEM = """You are the Proposer in a four-role alpha-research loop.
You produce ONE testable hypothesis in the schema below. You do not
write code. You do not run backtests. You do not propose hypotheses
that have been tested in the last 60 days (see prior list).

Schema (JSON):
{
  "economic_claim":      str,   # one sentence, mechanism stated
  "dependent_variable":  str,   # what we're trying to predict
  "predictor":           str,   # the signal, defined precisely
  "sample":              str,   # universe + date range, including OOS
  "null":                str    # what would falsify the claim
}

Rejection criteria you must apply to your own output before emitting:
- If the mechanism is "factor X has predicted Y" with no economic
  story, reject and try again.
- If the predictor's definition references information that would
  not have been available at decision time, reject and try again.
- If the sample omits a regime the claim should hold in, reject
  and try again.
"""

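def propose(literature_excerpts: list[str]) -> dict:
    """Ask the Proposer for one hypothesis and return it as a parsed dict."""
    # The Proposer sees only the research log and the literature, never price data.
    user_msg = {
        "recent_hypotheses": recent_hypotheses(days=60),
        "recent_critiques":  recent_critiques(days=60),
        "literature":        literature_excerpts,
    }
    resp = client.messages.create(
        model="claude-opus-4-7",
        # Cache the long role prompt, which is reused on every call.
        system=[{"type": "text", "text": SYSTEM,
                 "cache_control": {"type": "ephemeral"}}],
        max_tokens=1024,
        messages=[{"role": "user",
                   "content": json.dumps(user_msg)}],
    )
    # Raises json.JSONDecodeError if the reply is not bare JSON.
    return json.loads(resp.content[0].text)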

The Critic and Replicator are structurally similar — different system prompts, different tool access, same JSON-in / JSON-out discipline. The full set of prompts is on my GitHub; I will not paste all four here because the post would double in length and the prompts are not the load-bearing piece.

6. Validating the Critic

The Critic is a control on the rest of the pipeline. A reader is entitled to ask how I know it works, since using one LLM to validate another LLM’s output is exactly the circularity Amodei et al. [3] flag under scalable oversight.

The answer is a small but explicit validation suite. I seeded 25 notebooks with known defects across six categories: one-step look-ahead in a feature, sample-boundary drift, omitted transaction cost, regime cherry-picking, an unstable to-be-tuned parameter, and silent feature-name collision. Each defect was injected at a severity calibrated to a plausible human error, not an obvious one. The Critic was run blind on each notebook, alongside 25 syntactically-similar clean controls.

Defect class	Seeded	Caught	Missed	False positives (on clean controls)
Look-ahead	5	5	0	0
Sample-boundary drift	5	4	1	1
Cost omission	5	5	0	0
Regime cherry-picking	5	3	2	2
Unstable parameter	3	2	1	1
Feature-name collision	2	1	1	0
Total	25	20	5	4

An 80% catch rate on its own is not good enough — five missed severe defects across 25 notebooks would, if unaddressed, ship five strategies built on broken foundations. That is why the point-in-time data wrapper, the Implementer’s feature-schema requirement, the Replicator’s independent reimplementation, and the human gate exist alongside the Critic. Each catches a different defect class, and the failures are largely uncorrelated. The validation exercise is repeated whenever the Critic’s prompt is materially changed.
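The arithmetic behind the headline rate, and the per-class rates it hides, can be reproduced directly from the table. The numbers below are copied from it; the variable names are illustrative.

```python
# (seeded, caught, false positives on clean controls), copied from the table above
RESULTS = {
    "Look-ahead": (5, 5, 0),
    "Sample-boundary drift": (5, 4, 1),
    "Cost omission": (5, 5, 0),
    "Regime cherry-picking": (5, 3, 2),
    "Unstable parameter": (3, 2, 1),
    "Feature-name collision": (2, 1, 0),
}
N_CLEAN_CONTROLS = 25

# Per-class catch rate: caught / seeded
rates = {name: caught / seeded for name, (seeded, caught, _) in RESULTS.items()}

total_seeded = sum(seeded for seeded, _, _ in RESULTS.values())
total_caught = sum(caught for _, caught, _ in RESULTS.values())
total_fp = sum(fp for _, _, fp in RESULTS.values())

overall_catch_rate = total_caught / total_seeded      # 20 / 25
false_positive_rate = total_fp / N_CLEAN_CONTROLS     # 4 / 25
weakest_class = min(rates, key=rates.get)

print(f"overall catch rate:  {overall_catch_rate:.0%}")
print(f"false-positive rate: {false_positive_rate:.0%}")
print(f"weakest class:       {weakest_class} ({rates[weakest_class]:.0%})")
```

The per-class view matters more than the total: the classes with the lowest catch rates are the ones that most need a second, independent control. With only two to five seeded notebooks per class, each of these rates is a rough indication rather than a precise estimate.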

Two caveats. First, this exercise probably understates real-world false-positive rates, because syntactically-clean controls do not have the idiosyncrasies of real notebooks. Second, it does not test the most dangerous failure mode (confidently wrong synthesis); that is governed by the quote-the-cell-output constraint discussed in §8.

7. What it changed: 12 weeks on FX carry

Before the numbers, the operational definition of “promoted to candidate” — the endpoint that does the work in the table below. A candidate is a strategy that has cleared all of the following gates:

Positive net-of-cost out-of-sample IR over the full Proposer-defined sample.
No unresolved severe finding from the Critic (severity-1 issues must be fixed and re-run; severity-2 issues must be explicitly waived in writing with reasoning).
Stable sign of IR in at least six of the eight rows of the Replicator’s robustness panel.
No single regime contributes more than 40% of total backtest P&L.
Independent reimplementation by the Replicator (see §3) produces an IR within ±15% of the original.
A human-written one-paragraph economic rationale that the candidate’s mechanism is plausible, written before viewing the final composite-U score.

A candidate is not a deployed strategy. It is a strategy that has earned the right to a further month of paper trading and live-data review before being considered for any risk allocation. In the period under discussion, neither of the two candidates has yet been promoted to risk; that is a separate decision on a separate timescale.
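Because the promotion criteria are written down, they can also be checked mechanically before the human looks at anything. The sketch below encodes the six gates as stated; the dataclass and its field names are hypothetical, and "stable sign" is interpreted here as "same sign as the headline out-of-sample IR", which is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateEvidence:
    ir_oos_net: float                     # net-of-cost OOS IR, full sample
    open_severity1: int                   # unfixed severity-1 findings
    unwaived_severity2: int               # severity-2 findings without a written waiver
    panel_irs: tuple[float, ...]          # one IR per robustness-panel row
    regime_pnl_shares: tuple[float, ...]  # each regime's share of total P&L
    ir_original: float                    # Implementer's IR
    ir_reimplemented: float               # Replicator's independent IR
    rationale_written_first: bool         # rationale written before seeing U


def failed_gates(e: CandidateEvidence) -> list[str]:
    """Return the gates a candidate fails; an empty list means it clears all six."""
    failures = []
    if e.ir_oos_net <= 0:
        failures.append("net-of-cost OOS IR is not positive")
    if e.open_severity1 > 0 or e.unwaived_severity2 > 0:
        failures.append("unresolved Critic findings")
    same_sign = sum(1 for ir in e.panel_irs if ir * e.ir_oos_net > 0)
    if same_sign < 6:
        failures.append("IR sign unstable across robustness panel")
    # Missing regime data counts as a failure (default share of 1.0).
    if max(e.regime_pnl_shares, default=1.0) > 0.40:
        failures.append("one regime exceeds 40% of P&L")
    if abs(e.ir_reimplemented - e.ir_original) > 0.15 * abs(e.ir_original):
        failures.append("independent reimplementation outside +/-15%")
    if not e.rationale_written_first:
        failures.append("no prior economic rationale")
    return failures
```

A check like this does not replace the human gate; it only guarantees that nothing reaches the gate with a criterion silently skipped.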

I ran this stack against an FX-carry research workstream from late January through mid-April 2026, alongside a personal baseline of comparable hours from the equivalent period in 2025. The work was on conditional carry — under what regimes does the standard high-minus-low carry portfolio in G10 actually pay, and can we identify the regime ex ante.

Metric	Baseline (2025)	Agentic stack (2026)	Ratio
Hypotheses formally tested	11	38	3.5×
Time from hypothesis to first backtest	~2 days	~3 hours	~5×
Hypotheses that survived Critic	n/a	14 of 38 (37%)	—
Survived robustness panel	n/a	4 of 14 (29%)	—
Promoted to candidate (human gate)	1	2	2×
Researcher hours / week	~22	~18	0.8×
API spend / week (USD)	~0	~$340	—
Sandbox compute / week (USD)	~$15	~$25	1.7×

Measurement caveats. The comparison is not a randomised productivity experiment. It is a within-person case study with obvious confounds: different calendar periods, different available frontier models, possible learning effects on my part, a different specific workstream, and a subjective promotion threshold (whose criteria are at least now written down). I report it because the direction and magnitude were large enough to matter operationally, not because it proves a general law about agentic research productivity. The 2× candidate-yield figure should be read as an order of magnitude, not a point estimate; if the same exercise produces a 1.4× or 3× result on a different workstream, I would not be surprised. The cost figures above are included so a reader can judge total spend, not just throughput — a 2× lift at 10× spend is a different proposition from 2× at 1.2×.

What the stack visibly bought me, beyond raw throughput:

More diverse hypotheses. With a low cost per hypothesis I tested several that I would normally have ruled out at the back-of-the-envelope stage. One of the two promoted candidates came from this bucket.
Better robustness coverage. The Replicator runs the same eight-row sensitivity panel on every survivor. I almost never did this by hand for marginal-looking ideas; now it is free.
Better research log. I have a typed, searchable record of 38 hypotheses, their results, their critiques, and the exact code. The log itself has caught two cases where I started to re-propose something I had already rejected.

What it did not buy me:

Better economic intuition. The Proposer’s hypotheses are competent but unsurprising; they correspond closely to what a thoughtful junior would produce. The novel angle in one of the two promoted candidates came from a conversation I had at a conference, not from the stack.
Faster judgment at the human gate. The gate took roughly the same time per candidate as before — perhaps slightly longer, because I was reviewing better-documented work.

The first of these is, I think, fundamental to the current generation of models. The second is fine — judgment should be slow.

8. Failure modes I actually saw

Three of these came up repeatedly enough to deserve naming.

Plausible-feature contamination. The Implementer would invent a feature, name it something innocuous like carry_zscore_lookback, and quietly construct it using a rolling window that included the contemporaneous observation. The Critic caught most of these. The point-in-time data wrapper caught the rest. Without both layers, I would have shipped at least one of these.

Backtest period drift. The Implementer, given freedom over the sample, would sometimes anchor the start date a few months after a known drawdown. Never the full move — that would have been obvious — but enough to materially flatter the result. The fix was to require the Proposer to fix the sample as part of the hypothesis schema, and to have the Critic flag any deviation. After this change the failure stopped.

Confident wrong synthesis. The Critic, on long notebooks, would occasionally produce a confident-sounding summary that contradicted the actual numbers in the notebook. This is the single failure mode that scared me most, because it is the hardest to catch by glance. The mitigation is to require the Critic to quote specific cell outputs verbatim in its findings, with line references. After that change, hallucinated summaries dropped to roughly zero — the constraint of having to cite a concrete output is, empirically, enough to keep the model honest.
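The verbatim-quote requirement is cheap to enforce mechanically. A minimal sketch, assuming each Critic finding is a dict with a `cell` index and a `quote` string (both field names are hypothetical) and that notebook outputs are available as plain text keyed by cell index:

```python
def unverified_quotes(findings: list[dict], cell_outputs: dict[int, str]) -> list[dict]:
    """Return the findings whose quote does not appear verbatim in the cited cell."""
    bad = []
    for finding in findings:
        output = cell_outputs.get(finding["cell"], "")
        quote = finding["quote"]
        if not quote or quote not in output:
            bad.append(finding)
    return bad


cell_outputs = {
    7: "IR_oos = 0.42\nIR_is = 1.10",
    9: "max drawdown = -0.18",
}
findings = [
    {"cell": 7, "quote": "IR_oos = 0.42"},          # matches the output
    {"cell": 9, "quote": "max drawdown = -0.08"},   # misquoted number
    {"cell": 12, "quote": "Sharpe = 2.4"},          # cites a cell that does not exist
]
for finding in unverified_quotes(findings, cell_outputs):
    print("reject finding citing cell", finding["cell"])
```

Any finding that fails this check can be bounced back to the Critic automatically, so the human only reads findings that are anchored to a real output.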

I do not claim these are the only failure modes. They are the ones that showed up at a rate I could measure.

9. What this means in practice

If you take only one thing from this post, take this: the value of agentic workflows in quant research is mostly in the structure, not the models. The exact LLM matters at the margin. The role separation, the typed handoffs, the research log, the point-in-time data wrapper, the search-intensity term in the objective, and the human gate at the right two points — those are what convert raw model capability into research that actually deserves to be looked at twice.

The fully autonomous research agent — Proposer to deployed strategy with no human in the loop — is, as far as I can tell, not yet a viable target. The judgment step is where the value-add of the senior researcher lives, and the current generation of models is not close to substituting for it. They are close enough to substitute for the work that surrounds it, and that is a meaningful change.

What I would do if I were standing up this stack from scratch, in order:

Build the point-in-time data wrapper first. Everything downstream depends on it.
Build the research-log DB second. Typed artifacts are the single biggest determinant of quality.
Write the Proposer / Implementer / Critic / Replicator prompts third. Iterate them against your own taste; expect to rewrite them three times.
Build the Critic validation suite fourth — before relying on the Critic as a control. If you cannot measure its catch rate, you do not know what it is doing.
Build the human-gate UI last, and make it pleasant to use. If the gate is cumbersome, you will start waving things through, and the whole system collapses.
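For the first item in that list, a minimal sketch of a point-in-time wrapper is below. It assumes each observation becomes available on its own date, which ignores publication lags and revisions that a production data layer would need to model; the class and method names are hypothetical.

```python
from bisect import bisect_right
from datetime import date


class LookAheadError(Exception):
    """Raised when code asks for data dated after the decision date."""


class PointInTimeStore:
    def __init__(self, observations: dict[date, float], as_of: date):
        self._obs = dict(observations)
        self._dates = sorted(self._obs)
        self.as_of = as_of  # the decision date t

    def _check(self, d: date) -> None:
        if d > self.as_of:
            raise LookAheadError(f"{d} is after decision date {self.as_of}")

    def get(self, d: date) -> float:
        """Value observed on date d; raises if d is after the decision date."""
        self._check(d)
        return self._obs[d]

    def history(self, end: date) -> list[tuple[date, float]]:
        """All observations dated at or before `end`, oldest first."""
        self._check(end)
        cutoff = bisect_right(self._dates, end)
        return [(d, self._obs[d]) for d in self._dates[:cutoff]]
```

The design choice that matters is that the wrapper raises rather than silently truncating: a look-ahead request is a bug in the caller, and it should fail loudly in the sandbox instead of producing a plausible number.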

The repository accompanying this post — prompts, sandbox image, log schema, gate UI, and the seeded-defect notebook set — is at the usual place. As always, the system is set up so you can run the entire loop against the free FRED and AlphaVantage data tiers; you do not need to subscribe to anything to reproduce the structural conclusions, only the FX-carry specifics.

References

[1] Wu, Q. et al. (2023). “AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation.” arXiv:2308.08155.

[2] Hong, S. et al. (2023). “MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework.” arXiv:2308.00352.

[3] Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mané, D. (2016). “Concrete Problems in AI Safety.” arXiv:1606.06565.

[4] White, H. (2000). “A Reality Check for Data Snooping.” Econometrica 68(5), 1097–1126.

[5] Bailey, D. H., Borwein, J., López de Prado, M., and Zhu, Q. J. (2016). “The Probability of Backtest Overfitting.” Journal of Computational Finance 20(4), 39–69.

[6] Harvey, C. R., Liu, Y., and Zhu, H. (2016). “…and the Cross-Section of Expected Returns.” Review of Financial Studies 29(1), 5–68.

July 25, 2022

A New Approach to Generating Synthetic Market Data

The Importance of Synthetic Market Data

The principal argument in favor of using synthetic data is that it addresses one of the major concerns about using real data series for modelling purposes: i.e. that models designed to fit the historical data produce test results that are unlikely to be replicated, going forward. Such models are not robust to changes that are likely to occur in any dynamical statistical process and will consequently perform poorly out of sample.

By using multiple synthetic data series following a wide range of different price paths, one can hope to build models – both for risk management and investment purposes – that can accommodate a variety of different market scenarios, making them more likely to perform robustly in a live market context.

Producing authentic synthetic data is a significant challenge, one that has eluded researchers for many years. Generating artificial returns series is a considerably simpler task, but even here there are difficulties. For many applications it is simply not sufficient to sample from the empirical distribution, because we want to produce a sequence of returns that closely mirrors the pattern of real returns sequences. In particular, there may be long memory effects (non-zero autocorrelations at long lags) or GARCH effects, in which dependency is introduced into the returns process via the square (or absolute value) of returns. These have the effect of inducing “shocks” to the returns process that persist for some time, causing autocorrelation in the associated volatility process.
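To illustrate what a GARCH effect looks like in practice, here is a minimal simulation of a textbook GARCH(1,1) process using only the standard library. The parameter values are arbitrary illustrative choices, and this is not the algorithm developed later in the post; it only shows the mechanism by which a shock to returns feeds into subsequent variance.

```python
import math
import random


def simulate_garch(n: int, omega: float = 1e-6, alpha: float = 0.10,
                   beta: float = 0.85, seed: int = 0) -> list[float]:
    """Simulate n returns from a GARCH(1,1): var_t = omega + alpha*r_{t-1}^2 + beta*var_{t-1}."""
    rng = random.Random(seed)
    var = omega / (1.0 - alpha - beta)   # start at the unconditional variance
    returns = []
    for _ in range(n):
        r = math.sqrt(var) * rng.gauss(0.0, 1.0)
        returns.append(r)
        var = omega + alpha * r * r + beta * var   # today's shock raises tomorrow's variance
    return returns


def autocorr(x: list[float], lag: int) -> float:
    """Sample autocorrelation of x at the given lag."""
    m = sum(x) / len(x)
    num = sum((x[i] - m) * (x[i - lag] - m) for i in range(lag, len(x)))
    den = sum((v - m) ** 2 for v in x)
    return num / den


r = simulate_garch(5000)
print("lag-1 autocorrelation of returns:        ", autocorr(r, 1))
print("lag-1 autocorrelation of squared returns:", autocorr([v * v for v in r], 1))
```

In a process like this the returns themselves are serially uncorrelated by construction, while their squares are positively autocorrelated. Simple resampling from the empirical distribution destroys exactly that dependence.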

But producing a set of synthetic stock price data is even more of a challenge because not only do the above requirements apply, but we also need to ensure that the open, high, low and closing prices are internally consistent, i.e. that on any given bar the High >= {Open, Low, Close} and the Low <= {Open, Close}. These basic consistency checks have been overlooked in the research thus far.
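The consistency requirement is simple to state in code. A minimal sketch, assuming each bar is an (open, high, low, close) tuple; the function names are illustrative:

```python
def is_consistent_bar(o: float, h: float, l: float, c: float) -> bool:
    """High must be the largest of the four prices and Low the smallest."""
    return h >= max(o, l, c) and l <= min(o, c)


def failure_rate(bars: list[tuple[float, float, float, float]]) -> float:
    """Fraction of bars that violate the basic OHLC consistency check."""
    if not bars:
        raise ValueError("no bars supplied")
    failures = sum(1 for bar in bars if not is_consistent_bar(*bar))
    return failures / len(bars)


bars = [
    (10.0, 11.0, 9.0, 10.5),   # consistent
    (10.0, 9.5, 9.0, 9.8),     # High below Open
    (10.0, 12.0, 10.5, 11.0),  # Low above Open
]
print(f"failure rate: {failure_rate(bars):.0%}")
```

Running a check like this over every generated series is the kind of test a market data specialist would apply before looking at any distributional metric.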

Econometric Methods

One classical approach to the problem would be to create a Vector Autoregression Model, in which lagged values of the Open, High, Low and Close prices are used to predict the current values (see here for a detailed exposition of the VAR approach). A compelling argument in favor of such models is that, almost by definition, O/H/L/C prices are necessarily cointegrated.

While a VAR model potentially has the ability to model long memory and even GARCH effects, it is unable to produce stock prices that are guaranteed to be consistent, in the sense defined above. Indeed, a failure rate of 35% or higher for basic consistency checks is typical for such a model, making the usefulness of the synthetic prices series highly questionable.

Another approach favored by some researchers is to stitch together sub-samples of the real data series in a varying time-order. This is applicable only to return series and, in any case, can introduce spurious autocorrelations, or overlook important dependencies in the data series. Besides these defects, it is challenging to produce a synthetic series that looks substantially different from the original – both the real and synthetic series exhibit common peaks and troughs, even if they occur in different places in each series.

Deep Learning Generative Adversarial Networks

In a previous post I looked in some detail at TimeGAN, one of the more recent methods for producing synthetic data series, introduced in a 2019 paper by Yoon et al. (link here).

Generating Synthetic Market Data

TimeGAN, which applies deep learning Generative Adversarial Networks to create synthetic data series, appears to work quite well for certain types of time series. But in my research I found it to be inadequate for the purpose of producing synthetic stock data, for three reasons:

(i) The model produces synthetic data of fixed window lengths and stitching these together to form a single series can be problematic.

(ii) The prices fail a significant percentage of the basic consistency tests, regardless of the number of epochs used to train the model.

(iii) The methodology introduces spurious correlations in the associated returns process that do not correspond to anything found in real stock return series and which get more pronounced as training continues.

Another GAN model, DoppelGANger, introduced by Lin et al. in 2020 (paper here), seeks to improve on TimeGAN and claims “up to 43% better fidelity than baseline models”, including TimeGAN. However, in my research I found that, while DoppelGANger trains much more quickly than TimeGAN, it produces a consistency test failure rate exceeding 30%, even after training for 500,000 epochs.

For both TimeGAN and DoppleGANger, the researchers have tended to benchmark performance using classical data science metrics such as TSNE plots rather than the more prosaic consistency checks that a market data specialist would be interested in, while the more advanced requirements such as long memory and GARCH effects are passed by without a mention.

The conclusion is that current methods fail to provide an adequate means of generating synthetic price series for financial assets that are consistent and sufficiently representative to be practically useful.

The Ideal Algorithm for Producing Synthetic Data Series

What are we looking for in the ideal algorithm for generating stock prices? The list would include:

(i) Computational simplicity & efficiency. Important if we are looking to mass-produce synthetic series for a large number of assets, for a variety of different applications. Some deep learning methods would struggle to meet this requirement, even supposing that transfer learning is possible.

(ii) The ability to produce price series that are internally consistent (i.e. High > Low, etc.) in every case.

(iii) The ability to produce a range of synthetic series that vary widely in their correspondence to the original price series. In some cases we want synthetic price series that are highly correlated to the original; in other cases we might want to test our investment portfolio or risk control systems under extreme conditions never before seen in the market.

(iv) The distribution of returns in the synthetic series should closely match the historical series, being non-Gaussian and with “fat-tails”.

(v) The ability to incorporate long memory effects in the sequence of returns.

(vi) The ability to model GARCH effects in the returns process.

After researching the problem over the course of many years, I have at last succeeded in developing an algorithm that meets these requirements. Before delving into the mechanics, let me begin by illustrating its application.

Application of the Ideal Algorithm

In this demonstration I am using daily O/H/L/C prices for the S&P 500 index for the period from Jan 1999 to July 2022, comprising four price series over 5,297 daily periods.

Synthetic Price Series

Generating ten synthetic series using the algorithm takes around 2 seconds with parallelization. I chose to generate series of the same length as the original, although I could just as easily have produced shorter, or longer sequences.

The first task is to confirm that the synthetic data are internally consistent, and indeed they are guaranteed to be so because of the way the algorithm is designed. For example, here are the first few daily bars from the first synthetic series:

This means, of course, that we can immediately plot the synthetic series in a candlestick chart, just as we did with the real data series, above.
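For readers who want to run the same kind of check on their own generated data, here is a minimal sketch of a bar-level consistency test. The exact rules are my assumption of what "basic consistency" usually means for O/H/L/C bars (positive prices, High at or above Open and Close, Low at or below Open and Close); the function names are hypothetical.

```python
def is_consistent(bar):
    """Return True if an (open, high, low, close) bar passes basic checks.

    Assumed rules: all prices positive, high >= max(open, close) and
    low <= min(open, close). Together these also imply high >= low.
    """
    o, h, l, c = bar
    return min(o, h, l, c) > 0 and h >= max(o, c) and l <= min(o, c)


def failure_rate(bars):
    """Fraction of bars that fail the basic consistency checks."""
    bars = list(bars)
    if not bars:
        raise ValueError("no bars supplied")
    return sum(not is_consistent(b) for b in bars) / len(bars)
```

A generator that is consistent by construction should return a failure rate of exactly zero on every series it produces, whereas the failure rates quoted earlier for VAR and GAN models correspond to values of 0.3 or more.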

While the real and synthetic series are clearly different, the pattern of peaks and troughs somehow looks recognizably familiar. So, too, is the upward drift in the series, which in this case carries the synthetic S&P 500 Index to a high above 10,000 in 2022. Obviously this is a much more bullish scenario than we have seen in reality. But in fact this is just one example taken from the more “optimistic” end of the spectrum of possibilities. An illustration from the opposite end of the spectrum is shown in the chart below, in which the Index moves sideways over the entire 23-year span, with several very large drawdowns of -20% or more:

A more typical scenario might look something like our third chart, below. Here, too, we see several very large drawdowns, especially in the period from 2010-2011, but there is also a general upward drift in the process that enables the Index to reach levels comparable to those achieved by the real series:

Price Correlations

Reflecting these very different price path evolutions, we observe large variation in the correlations between the real and synthetic price series. For example:

As these tables indicate, the algorithm is capable of producing replica series that either mimic the original, real price series very closely, or which show completely different behavior, as in the second example.
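A comparison of this kind can be reproduced with a plain Pearson correlation computed field by field. The sketch below assumes each series is held as a dictionary of equal-length price lists keyed by "open", "high", "low" and "close"; that layout and the function names are illustrative choices of mine.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


def ohlc_correlations(real, synth):
    """Correlation of each price field between a real and a synthetic series."""
    fields = ("open", "high", "low", "close")
    return {f: pearson(real[f], synth[f]) for f in fields}
```

Note that correlations between price levels are dominated by shared trend, so a synthetic series with a similar drift will score highly even if its day-to-day returns are unrelated to the original.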

Dimensionality Reduction

For completeness, as have previous researchers, we apply t-SNE dimensionality reduction and plot the two-factor weightings for both real (yellow) and synthetic data (blue). We observe that while there is considerable overlap in reduced dimensional space, it is not as pronounced as for the synthetic data produced by TimeGAN, for instance. However, as previously explained, we are less concerned by this than we are about the tests previously described, which in our view provide a more appropriate analysis benchmark, so far as market data is concerned. Furthermore, for the reasons previously given, we want synthetic market data that in some cases tracks well beyond the range seen in historical price series.

Returns Distributions

Moving on, we next consider the characteristics of the returns in the synthetic series in comparison to the real data series, where returns are measured as the differences in the Log-Close prices, in the usual way.

Histograms of the returns for the most “optimistic” and “pessimistic” scenarios charted previously are shown below:

In both cases the distribution of returns in the synthetic series closely matches that of the real returns process and is clearly non-Gaussian, with an over-weighting in the distribution tails. A more detailed look at the distribution characteristics for the first four synthetic series indicates that there is a very good match to the real returns process in each case (the results for other series are very similar):

We observe that the minimum and maximum returns of the synthetic series sometimes exceed those of the real series, which can be a useful characteristic for risk management applications. The median and mean of the real and synthetic series are broadly similar, sometimes higher, in other cases lower. Only for the standard deviation of returns do we observe a systematic pattern, in which returns volatility in the synthetic series is consistently higher than in the real series.
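The comparison above rests on simple summary statistics of log returns, which can be computed as in the sketch below. The minimum, maximum, median, mean and standard deviation are the quantities discussed in the text; skewness and excess kurtosis are additions of mine as a compact measure of fat tails, and I use population (not sample) moments throughout as a simplifying assumption.

```python
import math
import statistics


def log_returns(closes):
    """Differences in log-close prices."""
    return [math.log(b / a) for a, b in zip(closes, closes[1:])]


def summary(returns):
    """Basic distribution statistics for a return series (population moments)."""
    n = len(returns)
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)
    if std == 0:
        raise ValueError("returns are constant; higher moments are undefined")
    skew = sum((r - mean) ** 3 for r in returns) / n / std ** 3
    # Excess kurtosis is zero for a Gaussian and positive for fat tails.
    excess_kurt = sum((r - mean) ** 4 for r in returns) / n / std ** 4 - 3.0
    return {
        "min": min(returns),
        "max": max(returns),
        "median": statistics.median(returns),
        "mean": mean,
        "std": std,
        "skew": skew,
        "excess_kurtosis": excess_kurt,
    }
```

Running `summary(log_returns(closes))` on the real series and on each synthetic series gives a side-by-side table of the kind discussed here.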

This feature, I would argue, is both appropriate and useful. Standard deviations should generally be higher, because there is indeed greater uncertainty about the prices and returns in artificially generated synthetic data, compared to the real series. Moreover, this characteristic is useful, because it will impose a greater stress-test burden on risk management systems compared to simply drawing from the distribution of real returns using Monte Carlo simulation. Put simply, there will be a greater number of more extreme tail events in scenarios using synthetic data, and this will cause risk control parameters to be set more conservatively than they otherwise might. This same characteristic – the greater variation in prices and returns – will also pose a tougher challenge for AI systems that attempt to create trading strategies using genetic programming, meaning that any such strategies are more likely to perform robustly in a live trading environment. I will be returning to this issue in a follow-up post.

Returns Process Characteristics

In the following plot we take a look at the autocorrelations in the returns process for a typical synthetic series. These compare closely with the autocorrelations in the real returns series up to 50 lags, which means that any long memory effects are likely to be conserved.

Finally, when we come to consider the autocorrelations in the square of the returns, we observe slowly decaying coefficients over long lags – evidence of so-called GARCH effects – for both real and synthetic series:
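Both diagnostics use the same sample autocorrelation function, applied first to the returns and then to their squares. A minimal sketch, using the standard biased estimator and hypothetical function names:

```python
def autocorr(x, max_lag):
    """Sample autocorrelations of x for lags 1..max_lag (biased estimator)."""
    n = len(x)
    if not 0 < max_lag < n:
        raise ValueError("max_lag must be between 1 and len(x) - 1")
    mean = sum(x) / n
    d = [v - mean for v in x]
    denom = sum(v * v for v in d)
    if denom == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    return [
        sum(d[t] * d[t - k] for t in range(k, n)) / denom
        for k in range(1, max_lag + 1)
    ]


def squared_autocorr(returns, max_lag):
    """Autocorrelations of squared returns, a simple check for GARCH effects."""
    return autocorr([r * r for r in returns], max_lag)
```

Calling `autocorr(returns, 50)` and `squared_autocorr(returns, 50)` on both the real and synthetic returns gives the two sets of coefficients being compared; slowly decaying positive values in the second set are the signature of volatility clustering.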

Summary

Overall, we observe that the algorithm is capable of generating consistent stock price series that correlate highly with the real price series. It is also capable of generating price series that have low, or even negative, correlation, a feature that may have important applications in the context of risk management. The distribution of returns in the synthetic series closely matches that of the real returns process, and moreover retains important features such as long memory and GARCH effects.

Objections to the Use of Synthetic Data

Criticism of synthetic market data (including from myself) has hitherto focused on the inadequacy of such data in terms of representing important characteristics of real data series. Now that such technical issues have been addressed, I will try to anticipate some of the additional concerns that are likely to surface, going forward.

1. The Synthetic Data is “Unrealistic”

What is meant here is that there is no plausible set of real, economic factors that would be likely to combine in a way to produce the pattern of prices shown in some of the synthetic data series. The idea that, as observed in one of the artificial scenarios above, the Fed would stand idly by while the market plunged by 50% to 60%, seems highly implausible. Equally unlikely is a scenario in which the market moves sideways for an extended period of a decade, or longer.

To a limited extent, I would agree with this. However, just because such scenarios are currently unlikely doesn’t mean they can never happen. For instance, take a look at the performance of the S&P 500 Index over the period from 1966 through 1979:

The market index barely made any progress throughout the entire 13-year period, which was characterized by a vicious bout of stagflation. Note, too, the precipitous drop in the index following the oil shock in 1973.

So to say that such scenarios – however implausible they may appear to be – can never happen is simply mistaken.

Finally, let’s not forget that, while the focus of this article is on the US market index, there are many economies, such as Mexico, Brazil or Argentina, for which such adverse developments are much more credible than they might currently be for the United States. We may wish to produce synthetic data for the markets in such economies for modelling purposes, in which case we will want to generate synthetic data capturing the full range of possible market outcomes, including some of the worst-case scenarios.

2. Extreme Scenarios Occur Too Frequently in Synthetic Data

Actually this is not the case – the generator tends to produce extreme scenarios with a frequency that is plausible, given the history and characteristics of the underlying, real price process. But there can be good reasons for wanting to control the frequency of such scenarios.

For instance, an investment manager may be looking to develop a “long-only” investment portfolio because, given his investment remit, that is the only type of investment strategy permitted. He would likely want to limit his focus to the more benign market outcomes for two reasons: (i) his investment thesis is that the market is likely to perform well, going forward (or else how does he pitch his strategy to investors?) and (ii) while he accepts that he may be wrong, it is not his job to hedge a possible market downturn – the responsibility for dealing with an adverse outcome falls to his risk manager, or to the investor.

Conversely, a risk manager is much more likely to be interested in adverse scenarios and, if anything, is likely to want to see such outcomes over-represented in a sample of synthetic data.

The point is, there is no “correct” answer: one has to decide which types of scenarios best suit the application one has in mind and sample the data accordingly. This can be done in a variety of ways such as setting a minimum required correlation between the synthetic and real price series, or designing a system of stratified sampling in which the desired outcomes are sampled according to a stipulated frequency distribution.
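Both sampling schemes are straightforward to sketch. In the example below each scenario is represented as a `(series_id, correlation_with_real)` pair and the strata are correlation buckets with relative weights; these representations, and the function names, are assumptions made for illustration rather than a description of any particular implementation.

```python
import random


def filter_by_min_corr(scenarios, min_corr):
    """Keep scenarios whose correlation with the real series is >= min_corr."""
    return [s for s in scenarios if s[1] >= min_corr]


def stratified_sample(scenarios, buckets, n, seed=None):
    """Draw roughly n scenarios according to a stipulated bucket distribution.

    buckets: list of (low, high, weight); a scenario falls in a bucket when
    low <= correlation < high. A bucket with too few scenarios simply
    contributes everything it has.
    """
    rng = random.Random(seed)
    total_weight = sum(w for _, _, w in buckets)
    picked = []
    for low, high, weight in buckets:
        pool = [s for s in scenarios if low <= s[1] < high]
        k = min(len(pool), round(n * weight / total_weight))
        picked.extend(rng.sample(pool, k))
    return picked
```

A long-only manager might use `filter_by_min_corr` with a high threshold, while a risk manager might give the low and negative correlation buckets a larger weight so that adverse paths are over-represented.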

3. Synthetic Data Does Not Prevent Data Snooping and Curve Fitting

A critic might argue that, in fact, the real market data is “unseen” only in a theoretical sense, since its essential attributes have been baked into the synthetic series produced by the generator. This applies to an even greater extent if the synthetic series are sampled in some way, as described above.

I think this is a fair point. To take an extreme scenario, one could choose to select only synthetic series for which the correlation with the real data is 99.9%, or higher. Clearly this runs counter to the spirit of what one is trying to achieve with synthetic data and one might just as well use real data for modelling purposes. In practice, of course, even where a sampling methodology is applied, it is unlikely to be as crudely biased as in this example.

But, in any case, what is the alternative? The only option I can see is one in which a pure mathematical model is used to produce synthetic data, without any reference to the underlying real series. But, in that case, how would one assess the validity of the model assumptions, or how representative the synthetic series it produces might be?

There is no alternative but to have recourse to the real data at some point in the modelling process. In this procedure, however, the impact of snooping bias or curve fitting, even though it can never be totally extinguished, is very much diminished and it plays a less central role in model development.

Conclusion

It is now possible to produce synthetic data series that have all of the hallmark characteristics of real price data. This permits the analyst to investigate market models without direct recourse to the real price series, thereby minimizing data snooping and curve fitting bias. Models developed using synthetic data describing many different price path evolutions are more likely to prove robust across a wider range of plausible market scenarios in the real world.

In the next, follow-up post I will illustrate the application of synthetic data to the development of a robust investment strategy.

June 25, 2022June 26, 2022

Backtest vs. Trading Reality

Kris Sidial, whose Twitter posts are often interesting, recently posted about the reality of trading profitability vs backtest performance, as follows:

While I certainly agree that the latter example is more representative of a typical trader’s P&L, I don’t concur that the first P&L curve is necessarily “99.9% garbage”. There are many strategies that have equity curves that are smoother and more monotonic than those of Kris’s Skeleton Case V2 strategy. Admittedly, most of these lie in the area of high frequency, which is not Kris’s domain expertise. But there are also lower frequency strategies that produce results which are not dissimilar to those shown in the first chart.

As a case in point, consider the following strategy for the S&P 500 E-Mini futures contract, described in more detail below. The strategy was developed using 15-minute bar data from 1999 to 2012, and traded live thereafter. The live and backtest performance characteristics are almost indistinguishable, not only in terms of rate of profit, but also in regard to strategy characteristics such as the number of trades, % win rate and profit factor.

Just in case you think the picture is a little too rosy, I would point out that the average profit factor is 1.25, which means that the strategy is generating only 25% more in profits than losses. There will be big losing trades from time to time and long sequences of losses during which the strategy appears to have broken down. It takes discipline to resist the temptation to “fix” the strategy during extended drawdowns and instead rely on reversion to the mean rate of performance over the long haul. One source of comfort to the trader through such periods is that the 60% win rate means that the majority of trades are profitable.
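It is worth spelling out what these two numbers imply when taken together. Profit factor is gross profit divided by gross loss, so with win rate w, average win W and average loss L we have PF = wW / ((1 - w)L). The sketch below, with hypothetical function names and assuming every trade is either a winner or a loser (no scratch trades), backs out the ratio W/L.

```python
def profit_factor(pnls):
    """Gross profit divided by gross loss for a list of trade P&Ls."""
    gross_profit = sum(p for p in pnls if p > 0)
    gross_loss = -sum(p for p in pnls if p < 0)
    if gross_loss == 0:
        raise ValueError("profit factor is undefined with no losing trades")
    return gross_profit / gross_loss


def implied_payoff_ratio(pf, win_rate):
    """Average win / average loss implied by a profit factor and win rate.

    Rearranges PF = w * W / ((1 - w) * L) to give W / L.
    """
    if not 0 < win_rate < 1:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return pf * (1 - win_rate) / win_rate
```

Plugging in a profit factor of 1.25 and a win rate of 60% gives a ratio of roughly 0.83: the average winner is smaller than the average loser, and the edge comes from winning more often rather than from winning bigger.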

As you read through the replies to Kris’s post, you will see that several of his readers make the point that strategies with highly attractive equity curves and performance characteristics are typically capital constrained. This is true in the case of this strategy, which I trade with a very modest amount of (my own) capital. Even trading one-lots in the E-Mini futures I occasionally experience missed trades, either on entry or exit, due to limit orders not being filled at the high or low of a bar. In scaling the strategy up to something more meaningful such as a 10-lot, there would be multiple partial fills to deal with. But I think it would be a mistake to abandon a high performing strategy such as this just because of an apparent capacity constraint. There are several approaches one can explore to address the issue, which may be enough to make the strategy scalable.

Where (as here) the issue of scalability relates to the strategy fill rate on limit orders, a good starting point is to compute the extreme hit rate, which is the proportion of trades that take place at the high or low of the bar. As a rule of thumb, for strategies running on typical low frequency infrastructure an extreme hit rate of 10% or less is manageable; anything above that level quickly becomes problematic. If the extreme hit rate is very high, e.g. 25% or more, then you are going to have to pay a great deal of attention to the issues of latency and order priority to make the strategy viable in practice. Ultimately, for a high frequency market making strategy, most orders are filled at the extreme of each “bar”, so almost all of the focus is on minimizing latency and maintaining a high queue priority, with all of the attendant concerns regarding trading hardware, software and infrastructure.
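The extreme hit rate is easy to compute from a backtest trade list. The sketch below assumes each fill is recorded as a `(fill_price, bar_high, bar_low)` tuple and counts entries and exits alike; the record layout, the tolerance parameter and the function name are my own choices for illustration.

```python
def extreme_hit_rate(fills, tol=1e-9):
    """Proportion of fills occurring at the high or low of their bar.

    fills: iterable of (fill_price, bar_high, bar_low).
    tol: absolute tolerance used when comparing prices.
    """
    fills = list(fills)
    if not fills:
        raise ValueError("no fills supplied")
    hits = sum(
        1
        for price, high, low in fills
        if abs(price - high) <= tol or abs(price - low) <= tol
    )
    return hits / len(fills)
```

Measured this way, a result at or below 0.10 falls in the manageable range described above, while 0.25 or more signals that latency and queue priority will dominate.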

Next, you need a strategy for handling missed trades. You could, for example, decide to skip any entry trades that are missed, while manually entering unfilled exit trades at the market. Or you could post market orders for both entry and exit trades if they are not filled. An extreme solution would be to substitute market-if-touched orders for limit orders in your strategy code. But this would affect all orders generated by the system, not just the 10% at the high or low of the bar, and is likely to have a very adverse effect on overall profitability, especially if the average trade is low (because you are paying an extra tick on entry and exit of every trade).

The above suggests that you are monitoring the strategy manually, running simulation and live versions side by side, so that you can pick up any trades that the strategy should have taken, but which have been missed. This may be practical for a strategy that trades during regular market hours, but not for one that also trades the overnight session.

An alternative approach, one that is commonly applied by systematic traders, is to automate the handling of missed trades. Typically the trader will set a parameter that converts a limit order to a market order X seconds after a limit price has been traded but not filled. Of course, this will result in paying up an extra tick (or more) to enter trades that perhaps would have been filled if one had waited longer than X seconds. It will have some negative impact on strategy profitability, but not too much if the extreme hit rate is low. I tend to use this method for exit trades, preferring to skip any entry trades that don’t get filled at the limit price.
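The timeout rule can be expressed as a small state machine, sketched below. This is not tied to any broker or platform API: the class, its method names and the returned action strings are all hypothetical, and a real implementation would hook the `on_trade` and `on_fill` callbacks into whatever market data and execution feeds are in use.

```python
class LimitWithTimeout:
    """Track a resting limit order and decide when to convert it to market.

    The order is converted once the market has traded at (or through) the
    limit price and timeout_secs have elapsed without a fill.
    """

    def __init__(self, side, limit_price, timeout_secs):
        if side not in ("buy", "sell"):
            raise ValueError("side must be 'buy' or 'sell'")
        self.side = side
        self.limit_price = limit_price
        self.timeout_secs = timeout_secs
        self.touched_at = None  # time the limit price first traded
        self.filled = False

    def on_trade(self, ts, price):
        """Record the first market trade at or through our limit price."""
        if self.side == "buy":
            touched = price <= self.limit_price
        else:
            touched = price >= self.limit_price
        if touched and self.touched_at is None:
            self.touched_at = ts

    def on_fill(self):
        """Mark the order as filled."""
        self.filled = True

    def action(self, now):
        """Return 'done', 'wait' or 'convert_to_market' at time now."""
        if self.filled:
            return "done"
        if self.touched_at is not None and now - self.touched_at >= self.timeout_secs:
            return "convert_to_market"
        return "wait"
```

To mirror the preference described above, one would attach this logic to exit orders only and simply cancel unfilled entry orders.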

Beyond these simple measures, there are several other ways to extend the capacity of the strategy. An obvious place to start is by evaluating strategy performance on different session times and bar lengths. So, in this case, we might look at deploying the strategy on both the day and night sessions. We can also evaluate performance on bars of different length. This will give different entry and exit points for individual trades, and trades that are at the extreme of a bar on one timeframe may not be at the high or low of a bar on the other timeframe. For example, here is the (simulated) performance of the strategy on 13-minute bars:

There is a reason for choosing a bar interval such as 13 minutes, rather than the more commonplace 5 or 10 minutes, as explained in this post:

Trading Prime Market Cycles
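Testing a non-standard interval such as 13 minutes usually means rebuilding the bars from finer-grained data. The sketch below assumes a list of consecutive 1-minute `(open, high, low, close)` bars from a single session and ignores session-boundary alignment and volume; the function name is hypothetical.

```python
def resample_bars(bars, n):
    """Aggregate consecutive 1-minute OHLC bars into n-minute bars.

    bars: list of (open, high, low, close) tuples in time order.
    Any incomplete group at the end is dropped.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    out = []
    for i in range(0, len(bars) - n + 1, n):
        chunk = bars[i:i + n]
        out.append((
            chunk[0][0],                # open of the first bar
            max(b[1] for b in chunk),   # highest high
            min(b[2] for b in chunk),   # lowest low
            chunk[-1][3],               # close of the last bar
        ))
    return out
```

Because the highs and lows of the aggregated bars fall at different times than on the 15-minute chart, the same strategy logic will generally produce a different set of limit prices and therefore a different extreme hit rate.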

Finally, it is worth exploring whether the strategy can be applied to other related markets such as NQ futures, for example. Typically this will entail some change to the strategy code to reflect the difference in price levels, but the thrust of the strategy logic will be similar. Another approach is to use the signals from the current strategy as inputs – i.e. alpha generators – for a derivative strategy, such as trading the SPY ETF based on signals from the ES strategy. The performance of the derived strategy may not be as good, but in a product like SPY the capacity might be larger.