
Backtest overfitting in financial markets

Published in Automated Trader Magazine Issue 39 Q2 2016

Systematic traders are cursed by the tendency of strategies - and indeed even simple estimators - to overfit historical data. A group of university researchers provide an online tool to estimate the propensity to overfit, even for very parsimonious strategies.

In the context of mathematical finance, backtest overfitting refers to the use of historical market data (a 'backtest') to develop an investment strategy when many variations of the strategy are tried on the same dataset. Backtest overfitting is now thought to be a primary reason why quantitative investment models and strategies that look good on paper - based on backtests - often disappoint in practice. Models suffering from this condition target the specific idiosyncrasies of a limited dataset, rather than any general behavior, and, as a result, often perform poorly when presented with new data.

Backtest overfitting is an instance of the more general phenomenon of multiple testing in scientific research, where a large number of variations of a model are tested on the same data, without accounting for the increase in false positive rates. Standard techniques for detecting overfitting, such as the hold-out method, fail to identify this problem, because they are designed to evaluate the complexity of a model relative to the dataset and still assume that a single test has taken place.


David H. Bailey
Lawrence Berkeley National Laboratory (retired), USA

Jonathan M. Borwein
CARMA, University of Newcastle, Australia

Amir Salehipour
CARMA, University of Newcastle, Australia

Marcos López de Prado
Lawrence Berkeley National Laboratory, USA

Qiji Zhu
Department of Mathematics, Western Michigan University, USA

An example will clarify this difference: Suppose that a new compound XYZ is developed to treat headaches. We wish to test the hypothesis that XYZ is actually effective. A false positive occurs when we incorrectly conclude that XYZ has been effective. This can occur for a variety of reasons: the patient was misdiagnosed, the pain associated with the headache oscillated close to the threshold level necessary to declare the condition, etc. Suppose that the probability of a false positive is only 5%. We could test variations of the compound by changing an irrelevant characteristic (the color, the taste, the shape of the pill), and we would expect about 1 in 20 of those variations to be (falsely) declared effective.

The problem does not lie with biology or the complexity of the compound. Instead, the researcher has conducted multiple tests while treating each variation individually, not realizing that in doing so she has incurred an increasing probability of false positives. Full body scans and other current technology-driven medical diagnoses and methods are often compromised for the same reason.
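The arithmetic behind this example can be sketched in a few lines. This is a minimal illustration that assumes the tests are independent and each has the same 5% false positive rate; the function name is hypothetical.

```python
def prob_at_least_one_false_positive(alpha: float, n_tests: int) -> float:
    """Chance that at least one of n independent tests is a false positive."""
    return 1.0 - (1.0 - alpha) ** n_tests


ALPHA = 0.05  # per-test false positive rate from the headache example

for n in (1, 5, 20, 100):
    p_any = prob_at_least_one_false_positive(ALPHA, n)
    expected = ALPHA * n  # expected number of falsely "effective" variations
    print(f"{n:>3} variations: P(at least one false positive) = {p_any:.3f}, "
          f"expected false positives = {expected:.2f}")
```

Under these assumptions, the chance of at least one spurious "discovery" grows quickly with the number of variations tried, even though each individual test still has only a 5% error rate.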

Likewise, in finance it is common to conduct millions, if not billions, of tests on the same data. Authors do not typically provide the number of experiments involved in a particular discovery, and as a result it is likely that many published investment theories or models are false positives. For example, we have shown previously that if only five years of daily stock market data are available as a backtest, then no more than 45 variations of a strategy should be tried on this data, or the resulting strategy will be overfit, in the specific sense that the strategy's Sharpe Ratio (SR) is likely to be 1.0 or greater just by chance (even though the true SR may be zero or negative).
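One way to arrive at a figure of this kind is sketched below. It assumes an extreme-value approximation for the expected maximum of N independent standard normal variables, and it assumes that the annualized SR estimated from y years of data on a skill-less strategy has a standard deviation of roughly 1/sqrt(y). Neither formula is stated in this article, so the function names and the approximation itself should be read as assumptions rather than as the published method.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_of_normals(n_trials: int) -> float:
    """Approximate expected maximum of n_trials (> 1) independent N(0, 1) draws."""
    z = NormalDist().inv_cdf
    return ((1 - EULER_GAMMA) * z(1 - 1 / n_trials)
            + EULER_GAMMA * z(1 - 1 / (n_trials * math.e)))


def min_backtest_length(n_trials: int, target_sr: float = 1.0) -> float:
    """Approximate years of data below which the best of n_trials skill-less
    strategies is expected to reach an annualized SR of target_sr by chance."""
    return (expected_max_of_normals(n_trials) / target_sr) ** 2


for n in (10, 45, 100, 1000):
    print(f"{n:>5} trials -> about {min_backtest_length(n):.1f} years of data")
```

With 45 trials and a target SR of 1.0, this approximation gives a length of about five years, which is in line with the figure quoted above.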

The Sharpe Ratio and similar metrics are used to allocate capital to the best performing strategy. SR quantifies the performance of an investment strategy: it is the ratio between the average return in excess of the return on a risk-free asset and the standard deviation of those excess returns. Thus, the higher the ratio, the greater the return relative to the risk involved.
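As a minimal sketch of this definition, the function below computes an SR from a list of periodic returns. The annualization by the square root of 252 trading days and the constant per-period risk-free rate are common conventions assumed here, not details given in the article.

```python
import math
import statistics


def sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualized Sharpe Ratio of periodic returns.

    risk_free_rate is expressed per period (assumed constant here).
    """
    excess = [r - risk_free_rate for r in returns]
    sd = statistics.stdev(excess)  # sample standard deviation
    if sd == 0:
        raise ValueError("Sharpe Ratio is undefined for zero-variance returns")
    return statistics.mean(excess) / sd * math.sqrt(periods_per_year)


daily_returns = [0.01, -0.01, 0.02, 0.0]
print(f"per-period SR: {sharpe_ratio(daily_returns, periods_per_year=1):.4f}")
print(f"annualized SR: {sharpe_ratio(daily_returns):.4f}")
```

Four observations are far too few for a meaningful estimate; the sample is only there to make the arithmetic easy to follow.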

Anyone who develops or even just invests in a systematic investment strategy (or in an exchange traded fund based on such a strategy) needs to understand the degree to which strategies can be overfit, in order to avoid unexpected financial losses. For this reason, we have developed two online tools: the Backtest Overfitting Demonstration Tool (BODT) and the Tenure Maker Simulation Tool (TMST). The major goal of the tools is to demonstrate how easy it is to overfit an investment strategy, and how this overfitting may impact the financial bottom-line performance. These two tools stem from two broad types of investment strategies:

Figure 01: In-Sample Optimization Result


Those based on general trading rules, e.g. seasonal opportunities (BODT targets this type)

Those based on forecasting equations, e.g. econometric models (TMST targets this type)

BODT employs a simplified version of the process many financial analysts use to create investment strategies, namely to use a computer program to find the optimal strategy based on historical market data (often termed 'in-sample' (IS) data) by adjusting variables such as the holding period, the profit-taking and stop-loss levels, etc. Similarly, TMST applies forecasting and econometric equations in order to find the 'optimal' strategy. If care is not taken to avoid backtest overfitting, such strategies may look great on paper, based on tests using historical market data, but then give rather disappointing results when actually deployed on a different dataset (often termed 'out-of-sample' (OOS) data). Figures 01 and 02 illustrate this phenomenon. Figure 01 shows how an optimal strategy (the blue line) can be developed from a historical, or IS, dataset (the yellow line, which in this case is merely a pseudo-randomly generated set of daily closing prices) by varying the entry day, holding period, stop loss and side parameters (we discuss these parameters in more detail later on). This optimal strategy has a Sharpe Ratio of 1.59 on the IS dataset. Figure 02, on the other hand, shows that the same optimal strategy performs poorly on the OOS dataset, with an SR of -0.18: the strategy has been overfit on the IS data and, in fact, actually lost money here.

The online BODT and TMST focus on demonstrating the impact of overfitting. We have also developed more technical versions. For the single testing case, we proposed the Minimum Backtest Length (MinBTL) as a metric to avoid selecting a strategy with a high SR on IS data, but zero or less on OOS data. We also proposed a Probabilistic Sharpe Ratio (PSR) to calculate the probability of an estimated SR being greater than a benchmark SR. And for the multiple testing case, we developed the Deflated Sharpe Ratio (DSR) to provide a more robust performance statistic, in particular when the returns follow a non-normal distribution. Interested readers may want to consult the references section for additional reading.

Figure 02: Out-of-Sample Result
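The PSR and DSR statistics mentioned above can be sketched as follows. The formulas are written from the general form in which these statistics are usually presented and are not given in this article, so treat them, and the function and argument names, as assumptions. SR values are per observation (not annualized), kurtosis is 3 for normal returns, and the number of trials must be greater than one.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant
_NORMAL = NormalDist()


def probabilistic_sharpe_ratio(sr, sr_benchmark, n_obs, skew=0.0, kurt=3.0):
    """Probability that the true SR exceeds sr_benchmark, given an estimate
    sr from n_obs returns with the stated skewness and kurtosis."""
    sr_std_scale = math.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return _NORMAL.cdf((sr - sr_benchmark) * math.sqrt(n_obs - 1) / sr_std_scale)


def deflated_sharpe_ratio(sr, sr_variance_across_trials, n_trials, n_obs,
                          skew=0.0, kurt=3.0):
    """PSR against the SR expected from the best of n_trials skill-less trials."""
    expected_max = ((1 - EULER_GAMMA) * _NORMAL.inv_cdf(1 - 1 / n_trials)
                    + EULER_GAMMA * _NORMAL.inv_cdf(1 - 1 / (n_trials * math.e)))
    sr_threshold = math.sqrt(sr_variance_across_trials) * expected_max
    return probabilistic_sharpe_ratio(sr, sr_threshold, n_obs, skew, kurt)


# Hypothetical inputs: a daily SR of 0.1 estimated from 1,000 observations.
for trials in (10, 100, 1000):
    dsr = deflated_sharpe_ratio(0.1, 0.0025, trials, 1000, skew=-3.0, kurt=10.0)
    print(f"{trials:>5} trials: DSR = {dsr:.3f}")
```

In this sketch the statistic falls as the number of trials grows, and negative skewness and high kurtosis lower it further, which is the qualitative behavior the text describes.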

The Backtest Overfitting Demonstration Tool

Seasonal strategies are very popular among investors, and are marketed every day in TV shows, business publications and academic journals. In this section we illustrate how trivial it is to overfit a backtest involving a seasonal strategy. The Backtest Overfitting Demonstration Tool (BODT) finds optimal strategies on random (unpredictable) and on real-world stock market data, and demonstrates that high Sharpe Ratios on backtest in-sample data are meaningless unless investors control for the number of trials.

BODT has two modules: the optimization module, which is the core of BODT (coded in the programming language Python), and the communication module, which is an online interface providing a bridge between the user and the optimization module. In particular, the online interface collects and/or sets the parameter values, supplies them to the optimization program, and reports the outcomes from the optimization program. BODT performs the following four steps:

Importing data and setting parameters. This includes importing or setting the parameters, and either importing real-world S&P 500 stock market data or generating pseudo-random data, depending on the type of experiment chosen by the user. If a pseudo-random experiment is chosen, three parameters are needed: the sample length (number of days or the length of the time series), the standard deviation and the seed. From these, daily closing prices of a stock are simulated by drawing returns from a Gaussian distribution with mean zero. If the real-world experiment is chosen, the data values are daily closing prices of the S&P 500 Index between January 1962 and February 2014. In each case, the sample data is equally divided into two sets: the in-sample (IS) dataset (also known as the 'training set'), and the out-of-sample (OOS) dataset (also known as the 'testing set').

Obtaining the 'optimal' strategy. BODT generates all candidate investment strategies by successively adjusting four parameters: the holding period, the stop loss, the entry day, and the side. It performs a brute-force search, trying all combinations of the four parameters. Every strategy is evaluated by calculating its Sharpe Ratio on the IS data, and the trading strategy with the highest SR is chosen as optimal.

Evaluating the optimal strategy on the OOS data. The 'optimal' strategy obtained above is then applied to the OOS data and the SR statistic is computed. That is, strategies are evaluated over the IS set in Step 2; once the best performing strategy has been identified, it is evaluated over the OOS set. Note that the OOS set is not used in the design of the strategy. A backtest is said to be realistic when the IS performance is consistent with the OOS performance, after controlling for the number of experiments that have taken place.

Visualization. The outcomes of BODT include three plots, a movie and a summary of the numerical values. The first two graphs in the online tool, which are similar to Figure 01 and Figure 02, show results on the IS data (i.e. the backtest) and on the OOS data, respectively. In these two graphs, the yellow line is the underlying time series, and the blue line shows the performance of the strategy. In most runs, the SR of the right graph (i.e., the final strategy on the OOS data) is either negative or at the very least much lower than the SR of the left graph (i.e., the final strategy on the IS data), evidencing that the strategy has been overfit on the IS data.

Figure 03 shows, as a blue line, how the advanced Deflated Sharpe Ratio (DSR) statistic changes with the number of trials. The same is displayed as a red line for a benchmark setting (skewness: -3 and kurtosis: 10), only to give an idea of how the behavior differs when the values of skewness and kurtosis change. Finally, BODT outputs a set of numerical values in a table similar to Table 01. These include the parameters used as well as the values of the SR and DSR statistics.

Figure 03: DSR change with respect to ‘Number of Trials’


The execution time of BODT is typically less than two minutes. The values for the maximum holding period, the stop loss and the sample length significantly affect the number of iterations performed by the program; the larger these values are, the longer the program will run. BODT is available to the public for free and can be accessed through the hyperlink at the end of this article.
A more detailed explanation and a tutorial are also available.
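The first three steps described above can be sketched in plain Python. This is an illustrative stand-in, not BODT's actual code. It assumes that the standard deviation is a daily percentage, that every trading month has 22 days, that the holding-period and stop-loss grids start at 1, and that the stop loss is measured as a percentage loss since entry. All function names are hypothetical, and the random number generator will not reproduce the tool's own series even with the same seed.

```python
import itertools
import math
import random


def simulate_prices(n_days, sigma_pct, seed, start=100.0):
    """Closing prices from zero-mean Gaussian daily returns (sigma in percent)."""
    rng = random.Random(seed)
    prices = [start]
    for _ in range(n_days - 1):
        prices.append(prices[-1] * (1 + rng.gauss(0.0, sigma_pct) / 100))
    return prices


def strategy_returns(prices, entry_day, holding, stop_loss_pct, side, month_len=22):
    """Daily returns of a monthly seasonal rule; days out of the market return 0."""
    rets = [0.0] * (len(prices) - 1)  # rets[t]: return from day t to day t + 1
    for start in range(entry_day - 1, len(prices) - 1, month_len):
        entry_price = prices[start]
        for t in range(start, min(start + holding, len(prices) - 1)):
            rets[t] = side * (prices[t + 1] / prices[t] - 1)
            loss_pct = -side * (prices[t + 1] / entry_price - 1) * 100
            if loss_pct >= stop_loss_pct:  # stop loss hit: close the position
                break
    return rets


def sharpe(rets, periods_per_year=252):
    """Annualized Sharpe Ratio with a zero risk-free rate."""
    n = len(rets)
    mean = sum(rets) / n
    var = sum((r - mean) ** 2 for r in rets) / (n - 1)
    return 0.0 if var == 0 else mean / math.sqrt(var) * math.sqrt(periods_per_year)


def best_in_sample(prices_is, max_holding, max_stop):
    """Brute-force search over entry day, holding period, stop loss and side."""
    grid = itertools.product(range(1, 23), range(1, max_holding + 1),
                             range(1, max_stop + 1), (-1, 1))
    return max(grid, key=lambda p: sharpe(strategy_returns(prices_is, *p)))


# Step 1: generate data and split it equally into IS and OOS sets.
prices = simulate_prices(n_days=600, sigma_pct=2.0, seed=308)
half = len(prices) // 2
prices_is, prices_oos = prices[:half], prices[half:]

# Step 2: pick the strategy with the highest SR on the IS data.
best = best_in_sample(prices_is, max_holding=5, max_stop=10)
sr_is = sharpe(strategy_returns(prices_is, *best))

# Step 3: apply the same strategy, unchanged, to the OOS data.
sr_oos = sharpe(strategy_returns(prices_oos, *best))
print(f"best (entry day, holding, stop loss %, side): {best}")
print(f"IS SR = {sr_is:.2f}, OOS SR = {sr_oos:.2f}")
```

The sample length and grid are kept small so that the pure-Python search finishes quickly; larger maxima increase the number of trials and the running time, as noted above.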

Table 01: Sample Result from BODT

Parameter                                 Value
Maximum Holding Period                    20
Maximum Stop Loss                         23
Sample Length                             1,152
Standard Deviation                        2
Seed                                      308
Real-World Stock Market Data Used         No
Sharpe Ratio (SR) of OOS Data             -0.2260
Deflated Sharpe Ratio (DSR) of IS Data    0.3744


Table 02 shows the parameters of BODT. The user has no control over some of these parameters, which are denoted by '●' in column 'Fixed Value'; for these parameters, BODT uses the default values as shown in column 'Default'. Note, if the user does not enter a value or enters a value that is outside the permissible ranges, a default value will be used. The reason for these feasible ranges is to place an upper limit to the number of trials (or optimization iterations) conducted. Such a limit does not imply a loss of generality with regards to the analysis. On the contrary, we show that overfitting can deliver significantly high performance (in-sample) even for a relatively small number of iterations. The parameters of BODT are:

Maximum holding period: the number of days that a stock can be held before it is liquidated (sold). It is given in a whole number of trading days. BODT tries all integer values less than or equal to the maximum given by the user.

Maximum stop loss: the percentage of invested capital that can be lost before the position is liquidated (closed). BODT only tries integer percentages up to the maximum given by the user.

Sample length: the number of observations used in-sample

Standard deviation: the standard deviation of random returns used to generate daily prices

Seed: a seed for the pseudo-random numbers used to generate the random returns

Entry day: the day that one enters into the market in each trading month. Every trading month is assumed to have 22 entry days. All 22 possibilities are tried by BODT.

Side: the side of the held positions, either long, which is to make profits when stock prices are rising, or short, which is to make profits when stock prices are falling. Both options are evaluated by BODT.
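The parameter definitions above determine how many strategies a single run tries. The count below is a rough sketch that assumes the holding-period and stop-loss grids both start at 1 and move in steps of 1; the article states only that integer values up to the user's maximum are tried, so the lower bounds are an assumption.

```python
def bodt_trial_count(max_holding_period, max_stop_loss, entry_days=22, sides=2):
    """Number of parameter combinations in the brute-force search,
    assuming both integer grids run from 1 up to the given maximum."""
    return max_holding_period * max_stop_loss * entry_days * sides


# Values of 20 and 23 are the Experiment 1 settings; 20 and 40 are the upper limits.
print(bodt_trial_count(20, 23))
print(bodt_trial_count(20, 40))
```

Under these assumptions, even the largest permitted settings produce only a few tens of thousands of trials, far fewer than the millions of tests mentioned earlier.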

Four types of experiments

To study the impact of overfitting, BODT performs four different types of experiments, which are explained below. The first three are based on randomly generated data (daily closing prices) from the Gaussian distribution with the standard deviation and the seed values/ranges as given in Table 02. The last experiment is based on S&P 500 data.

Experiment 1: Replicating a specific example

The first experiment replicates the specific example shown in Figures 01 and 02 (the same plots are displayed on the BODT webpage as well). The user can replicate this experiment by loading the pre-set parameter values.

Experiment 2: Generating parameters randomly

The second experiment uses randomly generated integer parameters, from the ranges allowed for each parameter.

Experiment 3: User-defined parameter values

The third experiment asks the user to enter parameters. The user may enter any values from the specified ranges for the first five parameters of Table 02. If any parameter is left blank, then a random value is generated from the feasible ranges by BODT. In this experiment, the user has the option to impact the data generation by choosing the standard deviation and the seed values.

Experiment 4: Using actual stock market data

The fourth experiment asks the user to enter parameters for real financial data, i.e. for S&P 500 stock market data, where daily closing prices are taken from January 1962 to February 2014. Our preference for this index is motivated by its wide acceptance as a benchmark and financial instrument. The standard deviation is implied by the data, and the seed parameter is not relevant in this experiment. Note that, due to the size of the S&P 500 Index data, the range for the sample length parameter has changed.

Table 02: Sample Parameters for BODT

                                                  Random Data Experiments                       Real-World S&P 500
Parameter               Fixed Value  Default      Experiment 1  Experiment 2   Experiment 3     Experiment 4
Maximum Holding Period  X            7            20            [5, 20]        [5, 20]          [5, 20]
Maximum Stop Loss       X            10           23            [10, 40]       [10, 40]         [10, 40]
Sample Length           X            1,000        1,152         [1000, 2000]   [1000, 2000]     [5000, 6000]
Standard Deviation      X            1            2             any positive integer            From Data
Seed                    X            1            308                                           Not Relevant
Entry Day               ●            1,...,22
Side                    ●            -1,+1

The Tenure Maker Simulation Tool

The section above illustrated how easy it is to overfit a backtest involving a seasonal strategy. But what about other types of strategies? Are strategies based on academic econometric or statistical methods easy to overfit as well? Unfortunately, the answer is that these pseudo-mathematical investments are even easier to overfit. The Tenure Maker Simulation Tool (TMST) looks for econometric specifications that maximize the predictive power (in-sample) of a random, unpredictable time series. The resulting Sharpe Ratios tend to be even higher than in the 'seasonal' counterpart. The implication is that most scientific strategies published in rigorous academic journals are likely to be overfit. These publications are the basis on which academics are granted tenure, hence the tool's name.

Similar to BODT, the core of the Tenure Maker Simulation Tool is an optimization program coded in the programming language Python (the optimization module), which communicates with the user via an online interface (the communication module). The online interface collects and/or sets the parameter values, supplies them to the optimization program, and reports the outcomes from the optimization program. Like BODT, TMST is a free tool. More details are documented on the web. Interested readers can consult the hyperlink section at the end of this article.

TMST performs the following four steps:

Generating returns. A series of IID (independent, identically distributed) normal returns are generated. This sample data is considered to be the in-sample (IS) set.

Generating time series models. A set of time series models is generated, in which the series is forecast as a function of past realizations of that same series; the forecast series is considered the out-of-sample (OOS) set. The time series models include:

• Rolling sums of the past series;

• Polynomials of the past series;

• Lag of the past series; and

• Cross-products of the above.

Strategy evaluation. A forward-selection algorithm evaluates the generated strategies, in terms of optimizing SR, and selects the improved model.

Visualization. TMST outputs two graphs, which are shown in Figure 04 and Figure 05. Figure 04 shows the backtest, i.e. how the 'optimal' strategy is obtained. In this graph, the blue line represents the trading strategy behavior, and the yellow line represents the market behavior. Figure 05 shows the progressive 'inflation' of the annualized Sharpe Ratio (aSR).

Figure 04: Example of highly optimized Sharpe Ratio


As the program continues to optimize, the blue line in Figure 04 gets more and more profitable over time as the program fits historical data. In a matter of seconds or minutes the program creates what appears to be a very profitable equity curve (with a very high Sharpe Ratio) based on the input dataset. In fact, we are predicting future realizations of the series by using past realizations, which is of course impossible by construction. The Sharpe Ratio is even more inflated than in the 'seasonal' counterpart (those based on general trading rules). One explanation is that econometric specifications are so flexible that it is even easier to generate a large number of independent trials.
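The mechanism can be illustrated with a deliberately simplified stand-in for TMST, which is not the tool's actual algorithm. Instead of fitting regressions, the sketch below greedily adds signed, rescaled predictors (a lag, a rolling sum, their powers and cross-products) to a forecast and keeps each addition only if the in-sample SR rises. The predictor set, the rescaling, the greedy rule and all names are assumptions made for illustration.

```python
import itertools
import math
import random


def sharpe(rets, periods_per_year=252):
    """Annualized Sharpe Ratio with a zero risk-free rate."""
    n = len(rets)
    mean = sum(rets) / n
    var = sum((r - mean) ** 2 for r in rets) / (n - 1)
    return 0.0 if var == 0 else mean / math.sqrt(var) * math.sqrt(periods_per_year)


def build_predictors(returns, width=3, degree=3, n_lags=1):
    """Candidate predictors for day t that use only returns before day t."""
    n = len(returns)
    base = {f"lag{k}": [returns[t - k] if t >= k else 0.0 for t in range(n)]
            for k in range(1, n_lags + 1)}
    base[f"rollsum{width}"] = [sum(returns[max(0, t - width):t]) for t in range(n)]
    preds = {}
    for name, x in base.items():  # polynomial terms of each base series
        for d in range(1, degree + 1):
            preds[f"{name}^{d}"] = [v ** d for v in x]
    for (na, xa), (nb, xb) in itertools.combinations(base.items(), 2):
        for da, db in itertools.product(range(1, degree + 1), repeat=2):
            preds[f"{na}^{da}*{nb}^{db}"] = [u ** da * v ** db for u, v in zip(xa, xb)]
    for name, x in preds.items():  # rescale each predictor to unit root mean square
        rms = math.sqrt(sum(v * v for v in x) / n)
        preds[name] = [v / rms for v in x]
    return preds


def forward_select(returns, predictors, max_steps=10):
    """Greedily add signed predictors while the in-sample SR keeps rising."""
    forecast = [0.0] * len(returns)
    best_sr, path = 0.0, []
    for _ in range(max_steps):
        best_step = None
        for name, x in predictors.items():
            for sign in (1.0, -1.0):
                trial = [f + sign * v for f, v in zip(forecast, x)]
                sr = sharpe([pos * r for pos, r in zip(trial, returns)])
                if sr > best_sr:
                    best_sr, best_step = sr, (name, sign, trial)
        if best_step is None:  # no candidate improves the SR any further
            break
        name, sign, forecast = best_step
        path.append((name, sign, best_sr))
    return path


rng = random.Random(0)
returns = [rng.gauss(0.0, 0.01) for _ in range(1250)]  # IID, unpredictable
predictors = build_predictors(returns)
path = forward_select(returns, predictors)
for step, (name, sign, sr) in enumerate(path, start=1):
    print(f"step {step}: add {'+' if sign > 0 else '-'}{name}, in-sample SR = {sr:.2f}")
```

Because each step is accepted only when it raises the in-sample SR, the reported SR can only go up, even though the returns are unpredictable by construction. None of that apparent performance should be expected to carry over to fresh data.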

TMST has six parameters. Five of these parameters are not available to the user (those denoted by '●' in column 'Fixed Value' in Table 03); for these parameters, TMST sets pre-specified values as shown in column 'Default'.

The six parameters are:

Sample length: the number of observations (IID returns) generated

Width: sample length used as the look-back period in the rolling sum regression models

Polynomial degree: degrees of the polynomial fit used in the polynomial regression model

Number of lags: number of lagged variables included in the lagged regression model

Number of cross products: size of the cross product regressors

Maximum computational time: this is the one parameter that is available to the user. It represents the total computational time in seconds, that the optimization module is allowed to generate the strategies for. The range is 30-900 seconds, and the default value is 90 seconds. Only integer values are allowed. Moreover, if the user does not enter any value for this parameter or if the value is out of the specified range, the default value will be used.

Figure 05: Inflation in Annualized Sharpe Ratio


The following two options are available:

Experiment 1: Full

The program stops when all the strategies are generated, which may take up to 10 minutes.

Experiment 2: Limited

The user limits generation of the strategies (in the optimization module) by setting the maximum computational time.

Table 03: Sample Parameters for TMST

Parameter                   Fixed Value  Default  Experiment 1  Experiment 2
Maximum Computational Time  X            90       Not Relevant  [30, 900]
Sample Length               ●            1,250    1,250         1,250
Width                       ●            3        3             3
Polynomial Degree           ●            3        3             3
Number of Lags              ●            1        1             1
Number of Cross Products    ●            3        3             3


Financial research is increasingly reliant on computational techniques to simulate a large number of alternative investment strategies on a given dataset. One problem with this approach is that the standard Neyman-Pearson hypothesis testing framework was designed for individual experiments. In other words, when multiple trials are attempted, the significance level (i.e. probability of a false positive) is higher than the value set by the researcher.

Academic articles and investment proposals almost never disclose the number of trials involved in a particular discovery. Consequently, it is highly likely that many published findings are just statistical flukes. The practical implication is that investors are being lured into allocating capital to irrelevant discoveries, financial theories or investment products.

The Backtest Overfitting Demonstration Tool (BODT) and the Tenure Maker Simulation Tool (TMST) are, to our knowledge, the first scientific software to illustrate how overfitting impacts financial investment strategies and decisions in practice. In particular, they show how the optimal strategy identified by backtesting the in-sample data almost always leads to disappointing performance when applied to the out-of-sample data. Our main goal with BODT and TMST is to raise awareness regarding the problem of backtest overfitting in the world of financial research.