Developing Trading Strategies With Synthetic Data

One of the main criticisms levelled at systematic trading over the last few years is that the over-use of historical market data has tended to produce curve-fitted strategies that perform poorly out of sample in a live trading environment. This is indeed a valid criticism – given enough attempts one is bound to arrive eventually at a strategy that performs well in backtest, even on a holdout data sample. But that by no means guarantees that the strategy will continue to perform well going forward.

The solution to the problem has been clear for some time: what is required is a method of producing synthetic market data that can be used to build a strategy and test it under a wide variety of simulated market conditions. A strategy built in this way is more likely to survive the challenge of live trading than one that has been developed using only a single historical data path.

The problem, however, has been in implementation. Up until now all the attempts to produce credible synthetic price data have failed, for one reason or another, as I described in an earlier post:

I have been able to devise a completely new algorithm for generating artificial price series that meet all of the key requirements, as follows:

  • Computational simplicity & efficiency. Important if we are looking to mass-produce synthetic series for a large number of assets, for a variety of different applications. Some deep learning methods would struggle to meet this requirement, even supposing that transfer learning is possible.
  • The ability to produce price series that are internally consistent (i.e High > Low, etc) in every case .
  • Should be able to produce a range of synthetic series that vary widely in their correspondence to the original price series. In some case we want synthetic price series that are highly correlated to the original; in other cases we might want to test our investment portfolio or risk control systems under extreme conditions never before seen in the market.
  • The distribution of returns in the synthetic series should closely match the historical series, being non-Gaussian and with “fat-tails”.
  • The ability to incorporate long memory effects in the sequence of returns.
  • The ability to model GARCH effects in the returns process.

This means that we are now in a position to develop trading strategies without any direct reference to the underlying market data. Consequently we can then use all of the real market data for out-of-sample back-testing.

Developing a Trading Strategy for the S&P 500 Index Using Synthetic Market Data

To illustrate the procedure I am going to use daily synthetic price data for the S&P 500 Index over the period from Jan 1999 to July 2022. Details of the the characteristics of the synthetic series are given in the post referred to above.

This image has an empty alt attribute; its file name is Fig3-12.png

Because we want to create a trading strategy that will perform under market conditions close to those currently prevailing, I will downsample the synthetic series to include only those that correlate quite closely, i.e. with a minimum correlation of 0.75, with the real price data.

Why do this? Surely if we want to make a strategy as robust as possible we should use all of the synthetic data series for model development?

The reason is that I believe that some of the more extreme adverse scenarios generated by the algorithm may occur quite rarely, perhaps once in every few decades. However, I am principally interested in a strategy that I can apply under current market conditions and I am prepared to take my chances that the worst-case scenarios are unlikely to come about any time soon. This is a major design decision, one that you may disagree with. Of course, one could make use of every available synthetic data series in the development of the trading model and by doing so it is likely that you would produce a model that is more robust. But the training could take longer and the performance during normal market conditions may not be as good.

Having generated the price series, the process I am going to follow is to use genetic programming to develop trading strategies that will be evaluated on all of the synthetic data series simultaneously. I will then use the performance of the aggregate portfolio, i.e. the outcome of all of the trades generated by the strategy when applied to all of the synthetic series, to assess the overall performance. In order to be considered, candidate strategies have to perform well under all of the different market scenarios, or at least the great majority of them. This ensures that the strategy is likely to prove more robust across different types of market conditions, rather than on just the single type of market scenario observed in the real historical series.

As usual in these cases I will reserve a portion (10%) of each data series for testing each strategy, and a further 10% sample for out-of-sample validation. This isn’t strictly necessary: since the real data series has not be used directly in the development of the trading system, we can later test the strategy on all of the historical data and regard this as an out-of-sample backtest.

To implement the procedure I am going to use Mike Bryant’s excellent Adaptrade Builder software.

This is an exemplar of outstanding software engineering and provides a broad range of features for generating trading strategies of every kind. One feature of Builder that is particularly useful in this context is its ability to construct strategies and test them on up to 20 data series concurrently. This enables us to develop a strategy using all of the synthetic data series simultaneously, showing the performance of each individual strategy as well for as the aggregate portfolio.

After evolving strategies for 50 generations we arrive at the following outcome:

The equity curve for the aggregate portfolio is shown in blue, while the equity curves for the strategy applied to individual synthetic data series are shown towards the bottom of the chart. Of course, the performance of the aggregate portfolio appears much superior to any of the individual strategies, because it is effectively the arithmetic sum of the individual equity curves. And just because the aggregate portfolio appears to perform well both in-sample and out-of-sample, that doesn’t imply that the strategy works equally well for every individual market scenario. In some scenarios it performs better than in others, as can be observed from the individual equity curves.

But, in any case, our objective here is not to create a stock portfolio strategy, but rather to trade a single asset – the S&P 500 Index. The role of the aggregate portfolio is simply to suggest that we may have found a strategy that is sufficiently robust to work well across a variety of market conditions, as represented by the various synthetic price series.

Builder generates code for the strategies it evolves in a number of different languages and in this case we take the EasyLanguage code for the fittest strategy #77 and apply it to a daily chart for the S&P 500 Index – i.e. the real data series – in Tradestation, with the following results:

The strategy appears to work well “out-of-the-box”, i,e, without any further refinement. So our quest for a robust strategy appears to have been quite successful, given that none of the 23-year span of real market data on which the strategy was tested was used in the development process.

We can take the process a little further, however, by “optimizing” the strategy. Traditionally this would mean finding the optimal set of parameters that produces the highest net profit on the test data. But this would be curve fitting in the worst possible sense, and is not at all what I am suggesting.

Instead we use a procedure known as Walk Forward Optimization (WFO), as described in this post:

The goal of WFO is not to curve-fit the best parameters, which would entirely defeat the object of using synthetic data. Instead, its purpose is to test the robustness of the strategy. We accomplish this by using a sequence of overlapping in-sample and out-of-sample periods to evaluate how well the strategy stands up, assuming the parameters are optimized on in-sample periods of varying size and start date and tested of similarly varying out-of-sample periods. A strategy that fails a cluster of such tests is unlikely to prove robust in live trading. A strategy that passes a test cluster at least demonstrates some capability to perform well in different market regimes.

To some extent we might regard such a test as unnecessary, given that the strategy has already been observed to perform well under several different market conditions, encapsulated in the different synthetic price series, in addition to the real historical price series. Nonetheless, we conduct a WFO cluster test to further evaluate the robustness of the strategy.

As the goal of the procedure is not to maximize the theoretical profitability of the strategy, but rather to evaluate its robustness, we select a criterion other than net profit as the factor to optimize. Specifically, we select the sum of the areas of the strategy drawdowns as the quantity to minimize (by maximizing the inverse of the sum of drawdown areas, which amounts to the same thing). This requires a little explanation.

If we look at the strategy drawdown periods of the equity curve, we observe several periods (highlighted in red) in which the strategy was underwater:

The area of each drawdown represents the length and magnitude of the drawdown and our goal here is to minimize the sum of these areas, so that we reduce both the total duration and severity of strategy drawdowns.

In each WFO test we use different % of OOS data and a different number of runs, assessing the performance of the strategy on a battery of different criteria:

x

These criteria not only include overall profitability, but also factors such as parameter stability, profit consistency in each test, the ratio of in-sample to out-of-sample profits, etc. In other words, this WFO cluster analysis is not about profit maximization, but robustness evaluation, as assessed by these several different metrics. And in this case the strategy passes every test with flying colors:

Other than validating the robustness of the strategy’s performance, the overall effect of the procedure is to slightly improve the equity curve by diminishing the magnitude and duration of the drawdown periods:

Conclusion

We have shown how, by using synthetic price series, we can build a robust trading strategy that performs well under a variety of different market conditions, including on previously “unseen” historical market data. Further analysis using cluster WFO tests strengthens the assessment of the strategy’s robustness.

A New Approach to Generating Synthetic Market Data

The Importance of Synthetic Market Data

The principal argument in favor of using synthetic data is that it addresses one of the major concerns about using real data series for modelling purposes: i.e. that models designed to fit the historical data produce test results that are unlikely to be replicated, going forward. Such models are not robust to changes that are likely to occur in any dynamical statistical process and will consequently perform poorly out of sample.

By using multiple synthetic data series following a wide range of different price paths, one can hope to build models – both for risk management and investment purposes – that can accommodate a variety of different market scenarios, making them more likely to perform robustly in a live market context.

Producing authentic synthetic data is a significant challenge, one that has eluded researchers for many years. Generating artificial returns series is a considerably simpler task, but even here there are difficulties. For many applications it is simply not sufficient to sample from the empirical distribution, because we want to produce a sequence of returns that closely mirrors the pattern of real returns sequences. In particular, there may be long memory effects (non-zero autocorrelations at long lags) or GARCH effects, in which dependency is introduced into the returns process via the square (or absolute value) of returns. These have the effect of inducing “shocks” to the returns process that persist for some time, causing autocorrelation in the associated volatility process in the process.

But producing a set of synthetic stock price data is even more of a challenge because not only do the above do the above requirements apply, but we also need to ensure that the open, high, low and closing prices are internally consistent, i.e. that on any given bar the High >= {Open, Low and Close) and that the Low <= {Open, Close}. These basic consistency checks have been overlooked in the research thus far.

Econometric Methods

One classical approach to the problem would be to create a Vector Autoregression Model, in which lagged values of the Open, High, Low and Close prices are used to predict the current values (see here for a detailed exposition of the VAR approach). A compelling argument in favor of such models is that, almost by definition, O/H/L/C prices are necessarily cointegrated.

While a VAR model potentially has the ability to model long memory and even GARCH effects, it is unable to produce stock prices that are guaranteed to be consistent, in the sense defined above. Indeed, a failure rate of 35% or higher for basic consistency checks is typical for such a model, making the usefulness of the synthetic prices series highly questionable.

Another approach favored by some researchers is to stitch together sub-samples of the real data series in a varying time-order. This is applicable only to return series and, in any case, can introduce spurious autocorrelations, or overlook important dependencies in the data series. Besides these defects, it is challenging to produce a synthetic series that looks substantially different from the original – both the real and synthetic series exhibit common peaks and troughs, even if they occur in different places in each series.

Deep Learning  Generative Adversarial Networks

In a previous post I looked in some detail at TimeGAN, one of the more recent methods for producing synthetic data series introduced in a paper in 2019 by Yoon, et al (link here).

TimeGAN, which applies deep learning  Generative Adversarial Networks to create synthetic data series, appears to work quite well for certain types of time series. But in my research I found it be inadequate for the purpose of producing synthetic stock data, for three reasons:

(i) The model produces synthetic data of fixed window lengths and stitching these together to form a single series can be problematic.

(ii) The prices fail a significant percentage of the basic consistency tests, regardless of the number of epochs used to train the model

(iii) The methodology introduces spurious correlations in the associated returns process that do not correspond to anything found in real stock return series and which get more pronounced as training continues.

Another GAN model, DoppleGANger, introduced by Lin, et. al. in 2020 (paper here) seeks to improve on TimeGAN and claims “up to 43% better fidelity than baseline models”, including TimeGAN. However, in my research I found that, while DoppleGANger trains much more quickly than TimeGAN, it produces a consistency test failure rate exceeding 30%, even after training for 500,000 epochs.

For both TimeGAN and DoppleGANger, the researchers have tended to benchmark performance using classical data science metrics such as TSNE plots rather than the more prosaic consistency checks that a market data specialist would be interested in, while the more advanced requirements such as long memory and GARCH effects are passed by without a mention.

The conclusion is that current methods fail to provide an adequate means of generating synthetic price series for financial assets that are consistent and sufficiently representative to be practically useful.

The Ideal Algorithm for Producing Synthetic Data Series

What are we looking for in the ideal algorithm for generating stock prices? The list would include:

(i) Computational simplicity & efficiency. Important if we are looking to mass-produce synthetic series for a large number of assets, for a variety of different applications. Some deep learning methods would struggle to meet this requirement, even supposing that transfer learning is possible.

(ii) The ability to produce price series that are internally consistent (i.e High > Low, etc) in every case .

(iii) Should be able to produce a range of synthetic series that vary widely in their correspondence to the original price series. In some case we want synthetic price series that are highly correlated to the original; in other cases we might want to test our investment portfolio or risk control systems under extreme conditions never before seen in the market.

(iv) The distribution of returns in the synthetic series should closely match the historical series, being non-Gaussian and with “fat-tails”.

(v) The ability to incorporate long memory effects in the sequence of returns.

(vi) The ability to model GARCH effects in the returns process.

After researching the problem over the course of many years, I have at last succeeded in developing an algorithm that meets these requirements. Before delving into the mechanics, let me begin by illustrating its application.

Application of the Ideal Algorithm

In this demonstration I am using daily O/H/L/C prices for the S&P 500 index for the period from Jan 1999 to July 2022, comprising four price series over 5,297 daily periods.

Synthetic Price Series

Generating ten synthetic series using the algorithm takes around 2 seconds with parallelization. I chose to generate series of the same length as the original, although I could just as easily have produced shorter, or longer sequences.

The first task is to confirm that the synthetic data are internally consistent, and indeed is guaranteed to be so because of the way the algorithm is designed. For example, here are the first few daily bars from the first synthetic series:

This means, of course, that we can immediately plot the synthetic series in a candlestick chart, just as we did with the real data series, above.

While the real and synthetic series are clearly different, the pattern of peaks and troughs somehow looks recognizably familiar. So, too, is the upward drift in the series, which is this case carries the synthetic S&P 500 Index to a high above 10,000 in 2022. Obviously this is a much more bullish scenario that we have seen in reality. But in fact this is just one example taken from the more “optimistic” end of the spectrum of possibilities. An illustration from the opposite end of the spectrum is shown in the chart below, in which the Index moves sideways over the entire 23 year span, with several very large drawdowns of -20% or more:

A more typical scenario might look something like our third chart, below. Here, too, we see several very large drawdowns, especially in the period from 2010-2011, but there is also a general upward drift in the process that enables the Index to reach levels comparable to those achieved by the real series:

Price Correlations

Reflecting these very different price path evolutions, we observe large variation in the correlations between the real and synthetic price series. For example:

As these tables indicate, the algorithm is capable of producing replica series that either mimic the original, real price series very closely, or which show completely different behavior, as in the second example.

Dimensionality Reduction

For completeness, as have previous researchers, we apply t-SNE dimensionality reduction and plot the two-factor weightings for both real (yellow) and synthetic data (blue). We observe that while there is considerable overlap in reduced dimensional space, it is not as pronounced as for the synthetic data produced by TimeGAN, for instance. However, as previously explained, we are less concerned by this than we are about the tests previously described, which in our view provide a more appropriate analysis benchmark, so far as market data is concerned. Furthermore, for the reasons previously given, we want synthetic market data that in some cases tracks well beyond the range seen in historical price series.

Returns Distributions

Moving on, we next consider the characteristics of the returns in the synthetic series in comparison to the real data series, where returns are measured as the differences in the Log-Close prices, in the usual way.

Histograms of the returns for the most “optimistic” and “pessimistic” scenarios charted previously are shown below:

In both cases the distribution of returns in the synthetic series closely matches that of the real returns process and are clearly non-Gaussian, with an over-weighting in the distribution tails. A more detailed look at the distribution characteristics for the first four synthetic series indicates that there is a very good match to the real returns process in each case (the results for other series are very similar):

We observe that the minimum and maximum returns of the synthetic series sometimes exceed those of the real series, which can be a useful characteristic for risk management applications. The median and mean of the real and synthetic series are broadly similar, sometimes higher, in other cases lower. Only for the standard deviation of returns do we observe a systematic pattern, in which returns volatility in the synthetic series is consistently higher than in the real series.

This feature, I would argue, is both appropriate and useful. Standard deviations should generally be higher, because there is indeed greater uncertainty about the prices and returns in artificially generated synthetic data, compared to the real series. Moreover, this characteristic is useful, because it will impose a greater stress-test burden on risk management systems compared to simply drawing from the distribution of real returns using Monte Carlo simulation. Put simply, there will be a greater number of more extreme tail events in scenarios using synthetic data, and this will cause risk control parameters to be set more conservatively than they otherwise might. This same characteristic – the greater variation in prices and returns – will also pose a tougher challenge for AI systems that attempt to create trading strategies using genetic programming, meaning that any such strategies are more likely to perform robustly in a live trading environment. I will be returning to this issue in a follow-up post.

Returns Process Characteristics

In the following plot we take a look at the autocorrelations in the returns process for a typical synthetic series. These compare closely with the autocorrelations in the real returns series up to 50 lags, which means that any long memory effects are likely to be conserved.

Finally, when we come to consider the autocorrelations in the square of the returns, we observe slowly decaying coefficients over long lags – evidence of so-called GARCH effects – for both real and synthetic series:

Summary

Overall, we observe that the algorithm is capable of generating consistent stock price series that correlate highly with the real price series. It is also capable of generating price series that have low, or even negative, correlation, a feature that may have important applications in the context of risk management. The distribution of returns in the synthetic series closely match those of the real returns process, and moreover retain important features such as long memory and GARCH effects.

Objections to the Use of Synthetic Data

Criticism of synthetic market data (including from myself) has hitherto focused on the inadequacy of such data in terms of representing important characteristics of real data series. Now that such technical issues have been addressed, I will try to anticipate some of the additional concerns that are likely to surface, going forward.

  1. The Synthetic Data is “Unrealistic”

What is meant here is that there is no plausible set of real, economic factors that would be likely to combine in a way to produce the pattern of prices shown in some of the synthetic data series. The idea that, as observed in one of the artificial scenarios above, the Fed would stand idly by while the market plunged by 50% to 60%, seems highly implausible. Equally unlikely is a scenario in which the market moves sideways for an extended period of a decade, or longer.

To a limited extent, I would agree with this. However, just because such scenarios are currently unlikely doesn’t mean they can never happen. For instance, take a look at the performance of the S&P 500 Index over the period from 1966 through 1979:

The market index barely made any progress throughout the entire 13-year period, which was characterized by a vicious bout of stagflation. Note, too, the precipitous drop in the index following the oil shock in 1973.

So to say that such scenarios – however implausible they may appear to be – can never happen is simply mistaken.

Finally, let’s not forget that, while the focus of this article is on the US market index, there are many economies, such as Mexico, Brazil or Argentina, for which such adverse developments are much more credible than they might currently be for the United States. We may wish to produce synthetic data for the markets in such economies for modelling purposes, in which case we will want to generate synthetic data capturing the full range of possible market outcomes, including some of the worst-case scenarios.

2. Extreme Scenarios Occur Too Frequently in Synthetic Data

Actually this is not the case – the generator tends to produce extreme scenarios with a frequency that is plausible, given the history and characteristics of the underlying, real price process. But there can be good reasons for wanting to control the frequency of such scenarios.

For instance, an investment manager may be looking to develop a “long-only” investment portfolio because, given his investment remit, that is the only type of investment strategy permitted. He would likely want to limit his focus to the more benign market outcomes for two reasons: (i) his investment thesis is that the market is likely to perform well, going forward (or else how does he pitch his strategy to investors?) and (ii) while he accepts that he may be wrong, it is not his job to hedge a possible market downturn – the responsibility for dealing with an adverse outcome falls to his risk manager, or to the investor.

Conversely, a risk manager is much more likely to be interested in adverse scenarios and, if anything, is likely to want to see such outcomes over-represented in a sample of synthetic data.

The point is, there is no “correct” answer: one has to decide which types of scenarios best suit the application one has in mind and sample the data accordingly. This can be done in a variety of ways such as setting a minimum required correlation between the synthetic and real price series, or designing a system of stratified sampling in which the desired outcomes are sampled according to a stipulated frequency distribution.

3. Synthetic Data Does Not Prevent Data Snooping and Curve Fitting

A critic might argue that, in fact, the real market data is “unseen” only in a theoretical sense, since its essential attributes have been baked into the synthetic series produced by the generator. This applies to an even greater extent if the synthetic series are sampled in some way, as described above.

I think this is a fair point. To take an extreme scenario, one could choose to select only synthetic series for which the correlation with the real data is 99.9%, or higher. Clearly this runs counter to the spirit of what one is trying to achieve with synthetic data and one might just as well use real data for modelling purposes. In practice, of course, even where a sampling methodology is applied, it is unlikely to be as crudely biased as in this example.

But, in any case, what is the alternative? The only option I can see is one in which a pure mathematical model is used to produce synthetic data, without any reference to the underlying real series. But, in that case, how would one assess the validity of the model assumptions, or how representative the synthetic series it produces might be?

There is no alternative but to have recourse to the real data at some point in the modelling process. In this procedure, however, the impact of snooping bias or curve fitting, even though it can never be totally extinguished, is very much diminished and it plays a less central role in model development.

Conclusion

It is now possible to produce synthetic data series that have all of the hallmark characteristics of real price data. This permits the analyst to investigate market models without direct recourse to the real price series, thereby minimizing data snooping and curve fitting bias. Models developed using synthetic data describing many different price path evolutions are more likely to prove robust across a wider range of plausible market scenarios in the real world.

In the next, follow-up post I will illustrate the application of synthetic data to the development of a robust investment strategy.

Generating Synthetic Market Data

Why Synthetic Data?

Synthetic market data has great potential for applications in financial research. Examples include testing the risk characteristics of a trading book or investment portfolio, developing trading strategies using previously unseen data, or simulating high frequency trading activity in a limit order book. It provides an answer to the criticism of curve fitting that is routinely levelled at existing approaches that use the single, observed historical path followed by an asset to construct investment and risk models. Such models, critics argue, are usually over-fitted to the historical data and are consequently unlikely to prove robust, going forward.

What is required is a model of the underlying asset processes that can then be used to generate a large number of price paths for all of the constituents of an investment portfolio. This should provide a more realistic assessment of the range of possible behaviours of the portfolio under a wide variety of market conditions, including during tail events.

Existing Methodology

Current approaches to modelling asset processes are often rudimentary and fail to capture the interplay of market dynamics that impact the evolution of the process. So, for example, we might begin by modelling the process of asset returns using a Gaussian or Student-T distribution. This immediately runs into the issue of under-representing the “fat tails” of empirical asset distributions, where tail events occur much more frequently than standard distributions would suggest. We might move on consider using the empirical distribution itself, and this might be sufficient for some applications.

But in many cases we want to generate a sequence of returns, or perhaps a time series of Open/High/Low/Close prices for modelling purposes. This is a challenge that is at least an order of magnitude more difficult. We not only have to ensure that the returns and/or prices at each individual time step are consistent (e.g. that the High > Low, in the case of prices, for example), but also that the sequence of returns is representative of known characteristics of financial assets such as serial autocorrelation, cross correlation and volatility clustering. GARCH models serve reasonably well in this context, but, for example, fail to capture long memory effects, amongst other deficiencies.

Deep Learning Models

Generative Adversarial Networks have become ubiquitous in the generation of “deep fakes” – .synthesised images generated by deep learning models that are close to indistinguishable from the real thing, whether it be the image of a human face, or medical images such X-ray scans. In 2019 Jinsung Yoon, Daniel Jarrett, and Mihaela van der Schaar published a paper on Time-series Generative Adversarial Networks (“TimeGAN”) in Neural Information Processing Systems (link to paper here) , a deep learning model that can be used to generate synthetic time series data.

An important characteristic of time series data is that it extends regular tabular data in the third dimension (i.e time):

As the authors note:

“A good generative model for time-series data should preserve temporal dynamics, in the sense that new sequences respect the original relationships between variables across time. Existing methods that bring generative adversarial networks (GANs) into the sequential setting do not adequately attend to the temporal correlations unique to time-series data. At the same time, supervised models for sequence prediction – which allow finer control over network dynamics – are inherently deterministic.”

They continue:

“[TimeGAN is a] novel framework for generating realistic time-series data that combines the flexibility of the unsupervised paradigm with the control afforded by supervised training. Through a learned embedding space jointly optimized with both supervised and adversarial objectives, we encourage the network to adhere to the dynamics of the training data during sampling”.

This sounds very promising and indeed the authors claim that “Qualitatively and quantitatively, we find that the proposed framework consistently and significantly outperforms state-of-the-art benchmarks with respect to measures of similarity and predictive ability” for several different types of time series dataset, including stock data.

A Brief Interlude on Generative Adversarial Networks

In the GAN architecture we implement two models: one to generate artificial data and another to distinguish artificial from real data. For example, a GAN model to generate artificial images of handwritten numbers would look approximately like this:

There are many architectures to consider for building the discriminator and the generator. We could build a deep neural network or Convolutional Neural Network (CNN) as well as other options.

TimeGAN

In the context of time series we face not only the problem of matching the features of synthetic and real data sequences, but also calibrating the time dynamics of the underlying generation process. TimeGAN addresses these challenges by using an unsupervised adversarial loss on both real and synthetic sequences, coupled with a stepwise supervised loss using the original data as supervision, thereby explicitly encouraging the model to capture the stepwise conditional distributions in the data. This takes advantage of the fact that there is more information in the training data than simply whether each datum is real or synthetic; we can expressly learn from the transition dynamics from real sequences.

A further innovative feature of the TimeGAN model in the introduction of an embedding network to provide a reversible mapping between features and latent representations, thereby reducing the high-dimensionality of the adversarial learning space. This capitalizes on the fact the temporal dynamics of even complex systems are often driven by fewer and lower-dimensional factors of variation.

Importantly, the supervised loss is minimized by jointly training both the embedding and generator networks, such that the latent space not only serves to promote parameter efficiency—it is specifically conditioned to facilitate the generator in learning temporal relationships.

The figure below shows how the various components are arranged and how the information flows between them during training in TimeGAN.

Further details of the TimeGAN model can be found in the paper and in the accompanying GitHub repository, which is found here.

Evaluating the Performance of TimeGAN

The researchers test the TimeGAN methodology using several different datasets, such as daily stock data for the period 2004 to 2019 downloaded from Google, including as features the volume and high, low, opening, and closing prices.

The TimeGAN model is trained for 50,000 epochs with a batch size of 128, using a 24-period rolling window, which the authors found to be the optimal window size. The trained synthesizer produces samples comprising a (128 x 24 x 5) dataframe of price and volume data which can then be compared to the original stock series. It is worth noting that the starting prices of each 24-period window are generated independently, meaning that, for example, the opening price in one sample window might be 10x larger than in another window. This immediately indicates one of the drawbacks of the TimeGAN approach: i.e. that the window length of the generated data is fixed and it can be challenging to stitch windows together to create a longer synthetic series, given that the initial prices for each vary considerably from window to window.

The data visualization methods chosen by the authors to evaluate the performance of the synthetic series in reproducing the features of the original series is problematic, at least as far as stock data is concerned. Both TSNE and PCA plots of the real vs. synthetic data appear to indicate a very close match:

This illustrates how misleading it can be to rely on data visualization for inference purposes. For stock data, there are some very basic tests that should first be performed to ensure the consistency of the synthetic output. In particular, in each row of the window, the High should exceed the Open, Low and Close prices, with the Low price falling below the Open, High and Close prices.

In my experimentation I found that after training the model for 50,000 epochs, the synthetic data failed these basic tests in around 15% of the sample. Further training rounds up to 100,000 epochs reduced the error rate to only 5% and it should be possible to eliminate almost all of these basic data issues with further rounds of training.

However, another basic problem with the synthetic data rapidly becomes apparent: the period to period (in this case, daily) returns have a strong tendency to diminish over time, typically being an order of magnitude larger at the start of each window than towards the end. This pattern of behavior is bound to introduce spurious autocorrelation and volatility-decay effects that are nowhere to be found in the real data series.

Finally, the fixed, limited window size and the independence of each window sample of synthetic data make it impossible to account for important characteristics such as volatility clustering or long memory effects in any adequate way.

Taken together, these flaws render the synthetic stock data produced by TimeGAN significantly unrepresentative and highly unreliable for modelling purposes.

Conclusion

TimeGAN is an important innovation in the field of synthetic data generation, with particular relevance to time series data. However it has significant limitations that make its application to financial time series problematic, in regard to the fixed window length, inconsistencies in the price data and spurious autocorrelation in the returns of the synthetic series it generates.

Backtest vs. Trading Reality

Kris Sidial, whose Twitter posts are often interesting, recently posted about the reality of trading profitability vs backtest performance, as follows:

While I certainly agree that the latter example is more representative of a typical trader’s P&L, I don’t concur that the first P&L curve is necessarily “99.9% garbage”. There are many strategies that have equity curves that are smoother and more monotonic than those of Kris’s Skeleton Case V2 strategy. Admittedly, most of these lie in the area of high frequency, which is not Kris’s domain expertise. But there are also lower frequency strategies that produce results which are not dissimilar to those shown the first chart.

As a case in point, consider the following strategy for the S&P 500 E-Mini futures contract, described in more detail below. The strategy was developed using 15-minute bar data from 1999 to 2012, and traded live thereafter. The live and backtest performance characteristics are almost indistinguishable, not only in terms of rate of profit, but also in regard to strategy characteristics such as the no. of trades, % win rate and profit factor.

Just in case you think the picture is a little too rosy, I would point out that the average profit factor is 1.25, which means that the strategy is generating only 25% more in profits than losses. There will be big losing trades from time to time and long sequences of losses during which the strategy appears to have broken down. It takes discipline to resist the temptation to “fix” the strategy during extended drawdowns and instead rely on reversion to the mean rate of performance over the long haul. One source of comfort to the trader through such periods is that the 60% win rate means that the majority of trades are profitable.

As you read through the replies to Kris’s post, you will see that several of his readers make the point that strategies with highly attractive equity curves and performance characteristics are typically capital constrained. This is true in the case of this strategy, which I trade with a very modest amount of (my own) capital. Even trading one-lots in the E-Mini futures I occasionally experience missed trades, either on entry or exit, due to limit orders not being filled at the high or low of a bar. In scaling the strategy up to something more meaningful such as a 10-lot, there would be multiple partial fills to deal with. But I think it would be a mistake to abandon a high performing strategy such as this just because of an apparent capacity constraint. There are several approaches one can explore to address the issue, which may be enough to make the strategy scalable.

Where (as here) the issue of scalability relates to the strategy fill rate on limit orders, a good starting point is to compute the extreme hit rate, which is the proportion of trades that take place at the high or low of the bar. As a rule of thumb, for strategies running on typical low frequency infrastructure an extreme hit rate of 10% or less is manageable; anything above that level quickly becomes problematic. If the extreme hit rate is very high, e.g. 25% or more, then you are going to have to pay a great deal of attention to the issues of latency and order priority to make the strategy viable in practise. Ultimately, for a high frequency market making strategy, most orders are filled at the extreme of each “bar”, so almost all of the focus in on minimizing latency and maintaining a high queue priority, with all of the attendant concerns regarding trading hardware, software and infrastructure.

Next, you need a strategy for handling missed trades. You could, for example, decide to skip any entry trades that are missed, while manually entering unfilled exit trades at the market. Or you could post market orders for both entry and exit trades if they are not filled. An extreme solution would be to substitute market-if-touched orders for limit orders in your strategy code. But this would affect all orders generated by the system, not just the 10% at the high or low of the bar and is likely to have a very adverse affect on overall profitability, especially if the average trade is low (because you are paying an extra tick on entry and exit of every trade).

The above suggests that you are monitoring the strategy manually, running simulation and live versions side by side, so that you can pick up any trades that the strategy should have taken, but which have been missed. This may be practical for a strategy that trades during regular market hours, but not for one that also trades the overnight session.

An alternative approach, one that is commonly applied by systematic traders, is to automate the handling of missed trades. Typically the trader will set a parameter that converts a limit order to a market order X seconds after a limit price has been traded but not filled. Of course, this will result in paying up an extra tick (or more) to enter trades that perhaps would have been filled if one had waited longer than X seconds. It will have some negative impact on strategy profitability, but not too much if the extreme hit rate is low. I tend to use this method for exit trades, preferring to skip any entry trades that don’t get filled at the limit price.

Beyond these simple measures, there are several other ways to extend the capacity of the strategy. An obvious place to start is by evaluating strategy performance on different session times and bar lengths. So, in this case, we might look at deploying the strategy on both the day and night sessions. We can also evaluate performance on bars of different length. This will give different entry and exit points for individual trades and trades that are at the extreme of a bar on one timeframe may not be at the high or low of a bar on the other timescale. For example, here is the (simulated) performance of the strategy on 13 minute bars:

There is a reason for choosing a bar interval such as 13 minutes, rather than the more commonplace 5- or 10 minutes, as explained in this post:

Finally, it is worth exploring whether the strategy can be applied to other related markets such as NQ futures, for example. Typically this will entail some change to the strategy code to reflect the difference in price levels, but the thrust of the strategy logic will be similar. Another approach is to use the signals from the current strategy as inputs – i.e. alpha generators – for a derivative strategy, such as trading the SPY ETF based on signals from the ES strategy. The performance of the derived strategy may not be as good, but in a product like SPY the capacity might be larger.

Intraday Stock Index Forecasting

In a previous post I discussed modelling stock prices processes as Geometric brownian Motion processes:

Understanding Stock Price Range Forecasts

To recap briefly, we assume a process of the form:

Where S0 is the initial stock price at time t = 0.

The mean of such a process is:

and standard deviation:

In the post I showed how to estimate such a process with daily stock prices, using these to provide a forecast range of prices over a one-month horizon. This is potentially useful, for example, in choosing which strikes to select in an option hedge.

Of course, there is nothing to prevent you from using the same technique over different timescales. Here I use the MATH-TWS package to connect Mathematica to the IB TWS platform via the C++ api, to extract intraday prices for the S&P 500 Index at 1-minute intervals. These are used to estimate a short-term GBM process, which provides forecasts of the mean and variance of the index at the 4 PM close.

We capture the data using:

then create a time series of the intraday prices and plot them:

If we want something a little fancier we can create a trading chart, including technical indicators of our choice, for instance:

The charts can be updated in real time from IB, using MATHTWS.

From there we estimate a GBM process using 1-minute close prices:

and then simulate a number of price paths towards the 4 PM close (the mean price path is shown in black):

This indicates that the expected value of the SPX index at the close will be around 4450, which we could estimate directly from:

Where u is the estimated drift of the GBM process.

Similarly we can look at the projected terminal distribution of the index at 4pm to get a sense of the likely range of closing prices, which may assist a decision to open or close certain option (hedge) positions:

Of course, all this is predicated on the underlying process continuing on its current trajectory, with drift and standard deviation close to those seen in the process in the preceding time interval. But trends change, as do volatilities, which means that our forecasts may be inaccurate. Furthermore, the drift in asset processes tends to be dominated by volatility, especially at short time horizons.

So the best way to think of this is as a conditional expectation, i.e. “If the stock price continues on its current trajectory, then our expectation is that the closing price will be in the following range…”.

For more on MATH-TWS see:

MATH-TWS: Connecting Wolfram Mathematica to IB TWS

Transfer Learning

Building a deep learning neural network is a complicated and time-consuming task. For some problems it can be easier to re-purpose an existing deep learning model, making minor adjustments to a small number of layers to achieve desired architecture.

In his new book Introduction to Machine Learning, Etienne Bernard gives a nice example of the technique.

We begin with a dataset comprising images of two types of mushroom, with the aim of building a specialized image classifier.

Using a Neural Network for Feature Extraction

We start from the Wolfram ImageIdentify net, which is a 24-layer convolutional neural network trained on around 4000 image types.

net=NetModel[“Wolfram ImageIdentify Net V1”]

In its current form the net is too generalized for our task:

Mushroom

So instead of using the net in raw form, we use the first 22 layers as a feature extractor. This will preprocess each image to obtain features that are semantically richer than simple pixel values. We can then train a classifier on top of these new features, producing a logistic regression model:

c=Classify[dsTrain,FeatureExtractor->NetTake[net,22]]

The classifier can now distinguish between different types of mushrooms:

The classifier achieves around 88% accuracy on a test set constructed from a web image search:

dsTest = AssociationMap[
WebImageSearch[#, “Thumbnails”, 10] &, {“Morel”,
“Bolete”}] /. $Failed -> Nothing // Quiet;
ClassifierMeasurements[c, test, “Accuracy”,
ComputeUncertainty -> True]

0.88 +/- 0.09

This is much better than training directly on the underlying pixel values, which would result in an accuracy of around 44%, worse than random guessing.

Neural Network Transfer Learning

Moving on from Etienne’s feature extraction approach, we next try creating a new neural network by re-using the first 22 layers of the ImageIdentify network, adding two new layers (a LinearLayer and SoftmaxLayer) for mushroom classification:

changedNet=Drop[net,-2];
newNet=NetJoin[changedNet,NetChain[Association[“classifier”->LinearLayer[],”probabilities”->SoftmaxLayer[]],”Output”->NetDecoder[{“Class”,{“Bolete”,”Morel”}}]]]

We re-train the new network using the mushroom training dataset, holding the weights for the first 22 layers at their existing values (i.e. only training the last two, new layers):

trainedNet=NetTrain[newNet,Normal@dsTrain,LearningRateMultipliers->{“classifier”->1,_->0},BatchSize->16,TargetDevice->”GPU”]

Next, create a new test dataset and test the new neural network’s accuracy:

dsTest=AssociationMap[WebImageSearch[#,”Thumbnails”,10] &,{“Morel”,”Bolete”}] /. $Failed->Missing[] //Quiet;

The new net achieves 100% accuracy on the test dataset:

trainedNet[DeleteMissing@dsTest[“Bolete”]]

{“Bolete”, “Bolete”, “Bolete”, “Bolete”, “Bolete”, “Bolete”, \
“Bolete”, “Bolete”, “Bolete”}

trainedNet[DeleteMissing@dsTest[“Morel”]]

{“Morel”, “Morel”, “Morel”, “Morel”, “Morel”, “Morel”, “Morel”, \
“Morel”}

DataScience| Handling Big Data

Handling Large Files in CSV format with NumPy and Pandas

One of the major challenges that users face when trying to do data science is how to handle big data. Leaving aside the important topic of database connectivity/functionality and the handling of data too large to fit in memory, my concern here is with the issue of how to handle large data files, which are often in csv format, but which are not too large to fit into available memory.

It is well known that, due to their generality, Mathematica’s Import and Export functions are horribly slow when handling large csv files. For example, writing out a list of 10 million 64-bit reals takes almost 5 minutes:

No alt text provided for this image

and reading is also unacceptably slow:

No alt text provided for this image

Performance results like these create the impression that Mathematica is suitable for handling only “toy” problems, rather than the kind of large and complex data challenges faced by data scientists in the real world.

Sure, you can speed this up with ReadLine, but not by much, after doing all the string processing. And while the mx binary file format speeds up data handling enormously, it doesn’t address the issue of how to get the data into the requisite file format, other than via the WL DumpSave function – in other words, the data already has to be in a Mathematica notebook in order to write an mx file.

With purely numerical data once way to address this by using non-proprietary binary file formats. For example, in Python we create a NumPy array and use the tofile() method to output the data in real64 binary format, in less then 2 seconds:

No alt text provided for this image

Then in Mathematica the read process is equally fast when processing a file of numerical data in binary format, around 50x faster than the time taken to process the same file in csv format:

No alt text provided for this image

The procedure is just as fast in the reverse direction, with binary data exports from Mathematica taking a fraction of the time required to process the same data in csv format (around 200x faster!):

No alt text provided for this image

And the data is extremely fast read back in Python using the numpy fromfile method:

No alt text provided for this image

This procedure is robust enough to accommodate missing data. For instance, let’s replace some of the values in our data array with np.nan values and export the file once again in binary format:

No alt text provided for this image

Reading the binary file into Mathematica, we find no reduction in speed, as the np.nan values are stored as decimals, which are replaced by the value Indeterminate in the imported Mathematica array:

No alt text provided for this image

So, for purely numerical data we have a fast and reliable procedure for transferring data between Python, R and Mathematica using binary format. This means that we can load very large csv files in Python, do some pre-processing in pandas and export the massaged data in binary format for further analysis in Mathematica, if required.

More Complex Data Structures: the HDF5 Format

A major step in the right direction has been achieved through the significant effort that WR has put into implementing the HDF5 binary file format standard in the Wolfram Language. This serves two purposes: firstly, it can speed up the storage and retrieval of large datasets, by orders of magnitudes (depending on the data type); secondly, unlike Wolfram’s proprietary mx file format, HDF5 is an open source format that can store large, complex & hierarchical datasets that are accessible via Python, R and MatLab, as well as other languages/platforms, including Mathematica. So, working with the same dataset as before, but using HDF5 format, we get an speed-up of around 500x on the file write and around 270x on the file read:

No alt text provided for this image

Another major benefit of working in binary format is the enormous saving in disk storage, compared to csv:

No alt text provided for this image

So it becomes feasible to envisage a workflow in which some pre-processing of a very large dataset in csv format takes place initially in e.g. Python Pandas, the results of which are exported to a HDF5 or binary format file for further processing in Mathematica.

This advance does a great deal to address some of the major concerns about using Mathematica for large data science projects. And I am not sure that users are necessarily aware of its significance, given all the hoopla over more glamorous features that tend to get all the attention in new version releases.

Machine Learning Based Statistical Arbitrage

Previous Posts

I have written extensively about statistical arbitrage strategies in previous posts, for example:

Applying Machine Learning in Statistical Arbitrage

In this series of posts I want to focus on applications of machine learning in stat arb and pairs trading, including genetic algorithms, deep neural networks and reinforcement learning.

Pair Selection

Let’s begin with the subject of pairs selection, to set the scene. The way this is typically handled is by looking at historical correlations and cointegration in a large universe of pairs. But there are serious issues with this approach, as described in this post:

Instead I use a metric that I call the correlation signal, which I find to be a more reliable indicator of co-movement in the underlying asset processes. I wont delve into the details here, but you can get the gist from the following:

The search algorithm considers pairs in the S&P 500 membership and ranks them in descending order of correlation information. Pairs with the highest values (typically of the order of 100, or greater) tend to be variants of the same underlying stock, such as GOOG vs GOOGL, which is an indication that the metric “works” (albeit that such pairs offer few opportunities at low frequency). The pair we are considering here has a correlation signal value of around 14, which is also very high indeed.

Trading Strategy Development

We begin by collecting five years of returns series for the two stocks:

The first approach we’ll consider is the unadjusted spread, being the difference in returns between the two series, from which we crate a normalized spread “price”, as follows.

This methodology is frowned upon as the resultant spread is unlikely to be stationary, as you can see for this example in the above chart. But it does have one major advantage in terms of implementation: the same dollar value is invested in both long and short legs of the spread, making it the most efficient approach in terms of margin utilization and capital cost – other approaches entail incurring an imbalance in the dollar value of the two legs.

But back to nonstationarity. The problem is that our spread price series looks like any other asset price process – it trends over long periods and tends to wander arbitrarily far from its starting point. This is NOT the outcome that most statistical arbitrageurs are looking to achieve. On the contrary, what they want to see is a stationary process that will tend to revert to its mean value whenever it moves too far in one direction.

Still, this doesn’t necessarily determine that this approach is without merit. Indeed, it is a very typical trading strategy amongst futures traders for example, who are often looking for just such behavior in their trend-following strategies. Their argument would be that futures spreads (which are often constructed like this) exhibit clearer, longer lasting and more durable trends than in the underlying futures contracts, with lower volatility and market risk, due to the offsetting positions in the two legs. The argument has merit, no doubt. That said, spreads of this kind can nonetheless be extremely volatile.

So how do we trade such a spread? One idea is to add machine learning into the mix and build trading systems that will seek to capitalize on long term trends. We can do that in several ways, one of which is to apply genetic programming techniques to generate potential strategies that we can backtest and evaluate. For more detail on the methodology, see:

I built an entire hedge fund using this approach in the early 2000’s (when machine learning was entirely unknown to the general investing public). These days there are some excellent software applications for generating trading systems and I particularly like Mike Bryant’s Adaptrade Builder, which was used to create the strategies shown below:

Builder has no difficulty finding strategies that produce a smooth equity curve, with decent returns, low drawdowns and acceptable Sharpe Ratios and Profit Factors – at least in backtest! Of course, there is a way to go here in terms of evaluating such strategies and proving their robustness. But it’s an excellent starting point for further R&D.

But let’s move on to consider the “standard model” for pairs trading. The way this works is that we consider a linear model of the form

Y(t) = beta * X(t) + e(t)

Where Y(t) is the returns series for stock 1, X(t) is the returns series in stock 2, e(t) is a stationary random error process and beta (is this model) is a constant that expresses the linear relationship between the two asset processes. The idea is that we can form a spread process that is stationary:

Y(t) – beta * X(t) = e(t)

In this case we estimate beta by linear regression to be 0.93. The residual spread process has a mean very close to zero, and the spread price process remains within a range, which means that we can buy it when it gets too low, or sell it when it becomes too high, in the expectation that it will revert to the mean:

In this approach, “buying the spread” means purchasing shares to the value of, say, $1M in stock 1, and selling beta * $1M of stock 2 (around $930,000). While there is a net dollar imbalance in the dollar value of the two legs, the margin impact tends to be very small indeed, while the overall portfolio is much more stable, as we have seen.

The classical procedure is to buy the spread when the spread return falls 2 standard deviations below zero, and sell the spread when it exceeds 2 standard deviations to the upside. But that leaves a lot of unanswered questions, such as:

  • After you buy the spread, when should you sell it?
  • Should you use a profit target?
  • Where should you set a stop-loss?
  • Do you increase your position when you get repeated signals to go long (or short)?
  • Should you use a single, or multiple entry/exit levels?

And so on – there are a lot of strategy components to consider. Once again, we’ll let genetic programming do the heavy lifting for us:

What’s interesting here is that the strategy selected by the Builder application makes use of the Bollinger Band indicator, one of the most common tools used for trading spreads, especially when stationary (although note that it prefers to use the Opening price, rather than the usual close price):

Ok so far, but in fact I cheated! I used the entire data series to estimate the beta coefficient, which is effectively feeding forward-information into our model. In reality, the data comes at us one day at a time and we are required to re-estimate the beta every day.

Let’s approximate the real-life situation by re-estimating beta, one day at a time. I am using an expanding window to do this (i.e. using the entire data series up to each day t), but is also common to use a fixed window size to give a “rolling” estimate of beta in which the latest data plays a more prominent part in the estimation. The process now looks like this:

Here we use OLS to produce a revised estimate of beta on each trading day. So our model now becomes:

Y(t) = beta(t) * X(t) + e(t)

i.e. beta is now time-varying, as can be seen from the chart above.

The synthetic spread price appears to be stationary (we can test this), although perhaps not to the same degree as in the previous example, where we used the entire data series to estimate a single, constant beta. So we might anticipate that out ML algorithm would experience greater difficulty producing attractive trading models. But, not a bit of it – it turns out that we are able to produce systems that are just as high performing as before:

In fact this strategy has higher returns, Sharpe Ratio, Sortino Ratio and lower drawdown than many of the earlier models.

Conclusion

The purpose of this post was to show how we can combine the standard approach to statistical arbitrage, which is based on classical econometric theory, with modern machine learning algorithms, such as genetic programming. This frees us to consider a very much wider range of possible trade entry and exit strategies, beyond the rather simplistic approach adopted when pairs trading was first developed. We can deploy multiple trade entry levels and stop loss levels to manage risk, dynamically size the trade according to current market conditions and give emphasis to alternative performance characteristics such as maximum drawdown, or Sharpe or Sortino ratio, in addition to strategy profitability.

The programatic nature of the strategies developed in the way also make them very amenable to optimization, Monte Carlo simulation and stress testing.

This is but one way of adding machine learning methodologies to the mix. In a series of follow-up posts I will be looking at the role that other machine learning techniques – such as deep learning and reinforcement learning – can play in improving the performance characteristics of the classical statistical arbitrage strategy.