Developing a short-term machine learning strategy

Published in Automated Trader Magazine Issue 42 Q1 2017

Open-source software for machine learning is widely available for standard data analysis packages. We examine how a stack built on these can be used for time series prediction.


Johann Lotter

Johann Christian Lotter has a degree in Physics and founded a software firm that started out developing computer games. For several years now he has been working with algorithms for data analysis and machine learning. His company has developed more than 350 trading systems for institutions and private traders.

To me, there are two main methods for developing automated trading strategies: the model-based and the machine learning approach. A model-based strategy starts with a model of a market inefficiency that results from trader psychology, the economy, market microstructure or any other force affecting price. This inefficiency produces an anomaly in the price time series - a deviation from the random walk - that can sometimes be used for price or trend prediction. Examples of model-based trading methods are trend following, mean reversion, price cycles, price clusters, statistical arbitrage and seasonality.

A machine learning strategy works the other way around. It data mines price time series or other data sources, looks for patterns that repeat and attempts to fit a trading algorithm to them. The market forces behind those patterns are of no interest. The only assumption is that patterns of the past might reappear in the future. Some popular machine learning algorithms are multivariate regression, naïve Bayes, k-means clustering, support vector machines, decision trees and various kinds of neural networks, especially 'deep learning' networks.

The advantage of pure machine learning is that it does not rely on any assumptions, models or market hypotheses. The disadvantage: Machine learning algorithms tend to detect a vast number of random patterns and generate a vast number of worthless strategies. Distinguishing real patterns - caused by real market inefficiencies - from random patterns can be a challenging task. Still, machine learning is evolving fast and will probably dominate trading strategy development within the next decade. It is time to take a closer look.

In this article I will describe the typical development process of a machine learning strategy. I will use a deep learning algorithm for trading 'price action' - predicting short-term price moves from short-term price history. With today's software tools, only about 20 lines of script code are needed for this. You can easily reproduce the strategy since all steps are described, all code is available and only open-source standard software packages - R and Zorro - are used. Zorro is used for tests and live trading, while R is needed for the deep learning algorithm.

The strategy is mainly intended as an experiment for answering two questions: Does a more complex algorithm - such as adding more neurons to an artificial brain - produce a better prediction? And can 'price action' trading work at all?

For this we will collect information from the last candles of a price time series, feed it into a deep learning neural network and use its output to predict the next candles. My working hypothesis: Price action trading is irrational since a few candles do not contain any useful predictive information. Of course, a nonpredictive outcome of the experiment will not necessarily mean that I am right, since I could have used the wrong parameters or prepared the data badly. But a predictive outcome would be a hint that I am wrong and price action trading can indeed be made to work.

Machine learning principles

A machine learning algorithm is trained for predicting something. In trading, this is usually the direction of price or the outcome of the next trade. For this purpose it is fed with data samples, derived in some way from historical price time series or other data sources. Each sample consists of a number n of variables x_1, x_2, ..., x_n, named predictors or features. The predictors can be the price returns of the previous n bars, a collection of classical indicators, any other imaginable functions of the price time series or external data such as a Twitter feed. I have even seen the pixels of a price chart image used as predictors for a trading strategy. The main requirement for predictors is that when combined they carry information sufficient to predict the outcome with some accuracy.

This outcome is represented by a target variable y. It might contain the return of the next trade or the next price movement. In a training process, the algorithm learns to predict the target y from the predictors x_1, x_2, ..., x_n. The learned 'memory' is stored in a data structure called a 'model' (not to be confused with a financial model). This data structure is specific to the algorithm. So the machine learning algorithm works in two modes: in training mode for generating the model and in prediction mode for predicting the target:

Training: x_1, x_2, ..., x_n, y ---> model
Prediction: x_1, x_2, ..., x_n, model ---> y

Neural networks

We will use a deep learning neural network for our experiment. A neural network consists of artificial neurons that are usually connected in an array of layers. There are other network structures but they are not of interest to us here. Every neuron can be understood as a sort of electrical device with many inputs and a single output. The signal at the output is a weighted sum of the signals at the inputs, modified by a mathematical function known as the 'activation function'. The input weights constitute the 'memory' of the neural network. The output of every neuron is connected to the inputs of all neurons of the next layer as shown in Figure 01.
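
The neuron described above can be sketched in a few lines of Python. This is only an illustration, not code from the article: the function names, the bias term and all weights are assumptions made for the example.

```python
import math

def neuron_output(inputs, weights, bias=0.0, activation=math.tanh):
    """Weighted sum of the inputs, passed through the activation function.
    The bias term is an assumption; the text only mentions input weights."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)

def layer_output(inputs, weight_rows, biases, activation=math.tanh):
    """Every neuron of a layer receives all outputs of the previous layer."""
    return [neuron_output(inputs, weights, bias, activation)
            for weights, bias in zip(weight_rows, biases)]

# Hypothetical network: 3 inputs -> 2 hidden neurons -> 1 output neuron
features = [0.5, -0.2, 0.1]
hidden = layer_output(features, [[0.4, 0.1, -0.3], [-0.2, 0.7, 0.5]], [0.0, 0.1])
output = layer_output(hidden, [[0.6, -0.8]], [0.0])
```

Training then means adjusting the numbers in the weight rows. The structure of the calculation stays the same.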

Figure 01: Basic structure of a neural network

A neural network learns by determining the weights that minimize the error between sample prediction and sample target. This requires an approximation process, normally by backpropagating the error from the output to the inputs and optimizing the weights along the way. This process imposes two restrictions. First, the neuron outputs must be continuously differentiable functions - binary on/off won't do. Second, a standard neural network cannot be too deep; it must not have too many 'hidden layers' of neurons between inputs and output. Otherwise, the usual training process will fail. This second restriction limits the complexity of problems that a standard neural network can tackle.

More complex problems that require neural networks with many hidden layers and thousands of neurons can be approached with 'deep learning' methods. They became popular in recent years for various tasks, such as beating the world's best human Go player. Deep learning applies various methods for pre-training the hidden neuron layers and this way achieves a more effective learning process. For our experiment we will use such a deep learning method: a stacked autoencoder. But whether you are using a complex or a simple algorithm for machine learning, developing a machine learning strategy is usually the same process that I have broken down into eight steps.

Step 1: The training target

First, we have to determine the target for training and prediction. A popular target, used in most papers, is the sign of the price return at the next bar. Better suited for prediction, as it is less susceptible to randomness, is the price difference to a more distant prediction horizon, such as three bars from now or the same day next week. Like almost anything in trading systems, the prediction horizon is a compromise between the effects of randomness (fewer bars are worse) and predictability (fewer bars are better).
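
To make the two kinds of target concrete, here is a minimal Python sketch. The function names and the sample prices are invented for illustration, and the three-bar horizon is just the example mentioned above.

```python
def sign_of_next_return(prices):
    """Target variant 1: +1, -1 (or 0) for the price change at the next bar."""
    targets = []
    for now, nxt in zip(prices, prices[1:]):
        diff = nxt - now
        targets.append((diff > 0) - (diff < 0))
    return targets

def horizon_difference(prices, horizon=3):
    """Target variant 2: price difference between `horizon` bars ahead and now."""
    return [prices[i + horizon] - prices[i]
            for i in range(len(prices) - horizon)]

# Hypothetical closing prices
closes = [1.10, 1.12, 1.11, 1.15, 1.14, 1.18]
next_bar_signs = sign_of_next_return(closes)
three_bar_moves = horizon_difference(closes, horizon=3)
```

Note that the last bars of a series have no target, since their future prices are not yet known.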

Sometimes you are not interested in directly predicting price but in predicting some other parameter - such as the current leg of a Zigzag indicator - that could otherwise only be determined in hindsight. Or you want to know if a certain market inefficiency will be present again - especially when you are using machine learning not directly for trading, but for filtering trades in a model-based system. Or you want to predict something entirely different, for instance, the probability of a market crash tomorrow. All this is often easier to predict than tomorrow's return.

In our price action experiment we will use the return of a short-term trade as target variable. Once the target is determined, the next step is selecting the features for predicting it.

Step 2: The predictors

A price time series is the worst case for any machine learning algorithm. Not only does it carry little signal and mostly noise, it is also nonstationary - meaning that its mean and standard deviation change over time - and the signal/noise ratio changes constantly. The exact ratio of signal and noise depends on what is meant by 'signal', but it is normally too low for any known machine learning algorithm to produce anything useful. We must derive features from the price time series that contain more signal and less noise. 'Signal', in that context, is any information that can be used to predict the target, whatever it is. All the rest is noise.

Feature selection is thus critical for success - much more so than deciding which machine learning algorithm you are going to use. There are two approaches for selecting features. The first and most common is extracting as much information from the price time series as possible. Since you do not know where the information is hidden, you just generate a broad collection of indicators with a wide range of parameters and hope that at least a few of them will contain the information that the algorithm needs. This is the approach that you normally find in the literature. The problem with this method: Any machine learning algorithm is easily confused by non-predictive predictors, so it won't help to just throw 150 indicators at it. You need some pre-selection algorithm that determines which of them carry useful information and which can be omitted. Without reducing the number of features this way to around eight or ten, even the deepest learning algorithm will not produce anything useful.

The other approach, normally for experiments and research, is using only limited information from the price time series. This is the case here: Since we want to look into price action trading, we only use the last few prices as inputs and must discard all the rest of the curve. This has the advantage that we don't need any pre-selection algorithm since the number of features is limited anyway.

Feature variables must fulfill two formal requirements for most machine learning algorithms. First, all values should be in the same range, like -1, ..., 1 or -100, ..., 100, so you need to normalize them in some way before sending them to the neural network. Second, the samples should be balanced, which means equally distributed over all values of the target variable. When you use the outcome of a trade as target variable, there should be about as many winning as losing trades. If you do not observe these two requirements, you won't get good results.
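
Both requirements can be sketched in Python as follows. This is one simple way of doing it, under stated assumptions: the normalization centers on the mean and divides by the largest deviation, and the balancing discards random samples of the larger class. A trading platform may implement either step differently, and all names here are hypothetical.

```python
import random

def normalize(values):
    """Center the values on their mean and compress them into -1..+1."""
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    largest = max(abs(c) for c in centered) or 1.0  # avoid division by zero
    return [c / largest for c in centered]

def balance(samples, rng):
    """Keep equally many winning and losing samples.
    Each sample is a (features, target) pair; target > 0 counts as a win.
    Assumption: the larger class is reduced by random undersampling."""
    wins = [s for s in samples if s[1] > 0]
    losses = [s for s in samples if s[1] <= 0]
    n = min(len(wins), len(losses))
    balanced = rng.sample(wins, n) + rng.sample(losses, n)
    rng.shuffle(balanced)
    return balanced

scaled = normalize([1.0, 2.0, 3.0, 10.0])
```

Undersampling throws data away; duplicating samples of the smaller class is the usual alternative when samples are scarce.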

Listing 01 shows the two simple predictor functions that we use in our experiment (in lite-C).

Listing 01: Zorro predictor functions (in lite-C)

The two functions are supposed to carry the necessary information for price action: per-bar movement and volatility. The change function is simply the difference between the current price and the price n bars earlier, in relation to the current price. The range function is the high-low distance of the last n candles, also in relation to the current price. And the scale function - a standard function of the trading platform we will use - centers and compresses the values to the ±100 range. We divide them by 100 in order to normalize them to ±1.
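
For readers without the platform at hand, the two calculations can be sketched in Python. This is not the lite-C code of Listing 01: the function names and the list-based price series are assumptions, and the platform's scale function is deliberately left out.

```python
def change(closes, n):
    """Price change over the last n bars, relative to the current price."""
    return (closes[-1] - closes[-1 - n]) / closes[-1]

def price_range(highs, lows, closes, n):
    """High-low distance of the last n candles, relative to the current price."""
    return (max(highs[-n:]) - min(lows[-n:])) / closes[-1]

# Hypothetical candles, oldest first
highs = [1.01, 1.03, 1.02, 1.05]
lows = [0.99, 1.00, 1.00, 1.02]
closes = [1.00, 1.02, 1.01, 1.04]

one_bar_change = change(closes, 1)                     # (1.04 - 1.01) / 1.04
two_bar_range = price_range(highs, lows, closes, 2)    # (1.05 - 1.00) / 1.04
```

The raw values are small fractions of the price, which is why a centering and compressing step is still needed before they go into the network.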

Step 3: Pre-selection

It is not a good idea to use lots of predictors, since this simply causes overfitting and failure in out-of-sample operation. Therefore, machine learning strategies often apply a pre-selection algorithm that determines a small number of predictors out of a pool of many. There are many methods for reducing the number of features, for instance:

  • Determine the correlations between the signals. Remove those with a strong correlation to other signals, since they do not contribute to the information.
  • Compare the information content of signals directly, using algorithms like information entropy or decision trees.
  • Determine the information content indirectly by comparing the signals with randomized signals. There are some software libraries for this, such as the R Boruta package.
  • Use an algorithm like Principal Component Analysis (PCA) for generating a new signal set with reduced dimensionality.
  • Use genetic optimization for determining the most important signals based on the most profitable results from the prediction process. Great for curve fitting if you want to publish impressive results in a research paper.
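
The first method in the list, removing strongly correlated signals, can be sketched in Python as follows. The 0.9 threshold, the function names and the sample signals are assumptions chosen for the example.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equally long series."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / math.sqrt(var_a * var_b)

def drop_correlated(signals, threshold=0.9):
    """Keep a signal only if it is not strongly correlated
    with any signal that was kept before it."""
    kept = {}
    for name, values in signals.items():
        if all(abs(pearson(values, other)) < threshold
               for other in kept.values()):
            kept[name] = values
    return kept

# Hypothetical signals: "b" is almost a multiple of "a"
signals = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],
    "c": [3, -1, 4, -2, 0],
}
selected = drop_correlated(signals)
```

The result depends on the order of the signals, since the first of two correlated signals is the one that survives.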

For our experiment we do not need to pre-select or pre-process the features, but you can find useful information about this in the Further Reading section at the end of the article; see in particular Sisodiya (2015), Longmore (2016), Perervenko (2015) and Aronson and Masters (2013).

Step 4: The machine learning algorithm

Our tool of choice for machine learning is R, the most popular data analysis software. R is a script interpreter. It offers many different machine learning packages and each of them offers many different algorithms with many different parameters. Even if you have already decided on a method - in this case, deep learning - you still have the choice of different approaches and different R packages. Most are quite new and you cannot find much empirical information to guide your decision. You have to try them and gain experience. For our experiment we have chosen the deepnet package, which is probably the simplest and easiest-to-use deep learning library. This keeps our code short. We are using its Stacked Autoencoder (SAE) algorithm for pre-training the network. There are other, more complex deep learning packages for R and you can spend a lot of time exploring them.

How pre-training works is easily explained, but why it works is a different matter. To my knowledge, no one has yet come up with a solid mathematical proof that it works at all. Training a neural network means setting up the connection weights between the neurons. The usual method is error backpropagation. But it turns out that the more hidden layers you have, the worse it works. The backpropagated error terms become smaller and smaller from layer to layer, causing the initial layers of the net to learn almost nothing. This means that the predicted result becomes increasingly dependent on the random initial state of the weights. This severely limits the complexity of layer-based neural nets and the tasks that they can solve - at least it did until ten years ago.
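
The shrinking of the backpropagated error can be illustrated with a deliberately simplified Python sketch: a chain of layers with one tanh neuron each. The weight of 0.5 and the input value are arbitrary assumptions. Real networks have many neurons per layer, so this shows the mechanism only.

```python
import math

def gradient_through_chain(depth, weight=0.5, x=0.5):
    """Chain of `depth` one-neuron tanh layers.
    Returns the derivative of the final output with respect to the input."""
    grad = 1.0
    signal = x
    for _ in range(depth):
        signal = math.tanh(weight * signal)
        # chain rule: each layer contributes weight * tanh'(weighted input)
        grad *= weight * (1.0 - signal ** 2)
    return grad

shallow = gradient_through_chain(1)
deep = gradient_through_chain(10)
```

Each layer multiplies the gradient by a factor below 0.5 here, so after ten layers less than a thousandth of it is left for adjusting the first weights.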

In 2006, scientists in Toronto first published the idea of pre-training the weights with an unsupervised learning algorithm, specifically a restricted Boltzmann machine (Hinton and Salakhutdinov, 2006). This turned out to be a revolutionary concept. It boosted the development of artificial intelligence and allowed all sorts of new applications, from Go-playing machines to self-driving cars. In the case of a stacked autoencoder as we use here, it works this way:

  • Select the hidden layer to pre-train; begin with the first hidden layer. Connect its outputs to a temporary output layer that has the same structure as the network's input layer.
  • Feed the network with the training samples, but without the targets. Train it so that the first hidden layer reproduces the input signal, the features, at its outputs as exactly as possible. The rest of the network is ignored. During training, apply a 'weight penalty term' so that as few connection weights as possible are used for reproducing the signal.
  • Now feed the outputs of the trained hidden layer to the inputs of the next untrained hidden layer and repeat the training process so that the input signal is now reproduced at the outputs of the next layer.
  • Repeat this process until all hidden layers are trained. We now have a 'sparse network' with very few layer connections that does nothing but reproduce the input signals.
  • Now train the network with conventional error backpropagation for learning the target variable, using the pre-trained weights of the hidden layers as a starting point.
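
The pre-training steps above can be sketched in plain Python. This is a strongly reduced illustration and not the deepnet implementation: it omits bias terms, the weight penalty term and the final supervised backpropagation, and all names, layer sizes and training parameters are assumptions.

```python
import math
import random

def train_autoencoder(data, n_hidden, epochs=200, lr=0.05, rng=None):
    """Train one tanh hidden layer plus a temporary linear output layer
    to reproduce `data`. Returns encoder weights and the error per epoch."""
    rng = rng or random.Random(0)
    n_in = len(data[0])
    enc = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    dec = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    errors = []
    for _ in range(epochs):
        total = 0.0
        for x in data:
            h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in enc]
            y = [sum(w * hj for w, hj in zip(row, h)) for row in dec]
            err = [yi - xi for yi, xi in zip(y, x)]
            total += sum(e * e for e in err) / n_in
            # backpropagate the reconstruction error to the hidden layer
            delta_h = [(1.0 - h[j] ** 2) * sum(err[i] * dec[i][j] for i in range(n_in))
                       for j in range(n_hidden)]
            for i in range(n_in):
                for j in range(n_hidden):
                    dec[i][j] -= lr * err[i] * h[j]
            for j in range(n_hidden):
                for k in range(n_in):
                    enc[j][k] -= lr * delta_h[j] * x[k]
        errors.append(total / len(data))
    return enc, errors

def encode(data, enc):
    """Outputs of a trained hidden layer, used as inputs for the next one."""
    return [[math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in enc]
            for x in data]

def pretrain_stack(data, hidden_sizes, **kwargs):
    """Greedy layer-wise pre-training: each hidden layer learns to
    reproduce the outputs of the layer before it."""
    weights, history = [], []
    for size in hidden_sizes:
        enc, errors = train_autoencoder(data, size, **kwargs)
        weights.append(enc)
        history.append(errors)
        data = encode(data, enc)
    return weights, history
```

A call such as pretrain_stack(samples, [50, 100, 50]) would return one weight matrix per hidden layer. These matrices would then serve as the starting point for the supervised training in the last step.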

The hope is that the unsupervised pre-training process produces an internal noise-reduced abstraction of the input signals that can then be used to learn the target more easily. And this indeed appears to work. No one really knows why, but several theories (see Erhan et al., 2010) try to explain the phenomenon.

Step 5: Generate a test data set

We first need to produce a data set with features and targets so that we can test our prediction process and try out parameters. The features must be based on the same price data as in live trading and for the target we must simulate a short-term trade. It makes sense to generate the data not with R, but with our trading platform. We will use the Zorro platform since it is very fast, easy to script and it has a direct connection to R. Listing 02 is a small Zorro script for producing the test data (DeepSignals.c).

Listing 02: Zorro script to produce test data (DeepSignals.c)

We are generating about two years of data with features calculated by our previously defined change and range functions. Our target is the return of a trade with a lifetime of three bars. Trading costs are set to zero, so in this case the return is equivalent to the price difference at three bars in the future. The adviseLong function is described in the Zorro manual (see Lotter, 2017). It is a powerful function that automatically handles training and predicting and allows us to use any R-based machine learning algorithm just as if it were a simple indicator.

In our code, the function uses the next trade return as target, and the price changes and ranges of the last four bars as features. The SIGNALS flag tells it not to train the data, but to export it to a CSV file. The BALANCED flag makes sure that we get as many positive as negative returns; this is important for most machine learning algorithms. Run the script in Zorro's 'Train' mode with the asset EUR/USD selected. EUR/USD is good for testing algorithms since it contains almost any imaginable market inefficiency and price anomaly. The script generates a spreadsheet file named DeepSignalsEURUSD_L.csv that contains the features in the first eight columns and the trade return in the last column.

Step 6: Calibrate the algorithm

Complex machine learning algorithms have many parameters to adjust. Some of them offer great opportunities to curve-fit the algorithm for producing more impressive publications. Still, we must calibrate parameters since the algorithm rarely works well with its default settings. For this, Listing 03 is an R script that reads the previously created data set and processes it with the deep learning algorithm (DeepSignal.r).

Listing 03: R script to read and process data (DeepSignal.r)

We have defined three functions neural.train, neural.predict, and neural.init for training, predicting, and initializing the neural net. The function names are not arbitrary, but follow the convention used by Zorro's adviseLong() function. It won't matter now, but will matter later when we use the same R script for training, testing and trading the deep learning strategy. A fourth function, TestOOS, is used for out-of-sample testing of our setup.

The function neural.init seeds the R random generator with a fixed value (365 is my personal lucky number). Otherwise we would get a slightly different result every time, since the neural net is initialized with random weights. It also creates a global R list named Models. Most R variable types do not need to be created beforehand, but some do (don't ask me why). The '<

The function neural.train takes as inputs a model number and the data set to be trained. The model number identifies the trained model in the Models list. A list is not really needed for this test, but we will need it for more complex strategies that train more than one model. The matrix containing the features and target is passed to the function as second parameter. If the XY data is not a proper matrix, which frequently happens in R depending on how you generated it, it is converted to one. It is then split into the features (X) and the target (Y) and finally the target is converted to 1 for a positive trade outcome and 0 for a negative outcome.

The network parameters are then set. Some are obvious, others are free to play around with:

  • The network structure is given by the vector hidden: c(50,100,50) defines three hidden layers, the first with 50, the second with 100 and the third with 50 neurons. This is the parameter that we will later modify for determining whether deeper is better.
  • The activation function converts the sum of neuron input values to the neuron output. The functions most often used are sigmoid, which saturates to 0 or 1, and tanh, which saturates to -1 or +1. These are plotted in Figure 02. Here, we use tanh since our signals are also in the ±1 range. The output of the network is a sigmoid function since we want a prediction in the 0, ..., 1 range. But the SAE output must be 'linear' so that the Stacked Autoencoder can reproduce the analog input signals on the outputs.
  • The learning rate controls the step size for the gradient descent in training; a lower rate means finer steps and possibly more precise prediction, but longer training time.
  • Momentum adds a fraction of the previous step to the current one. It helps prevent the gradient descent from getting stuck at a local minimum or saddle point.
  • The learning rate scale is a multiplication factor for changing the learning rate after each iteration (I am not sure what this is good for, but there may be tasks where a lower learning rate on higher epochs improves the training).
  • An epoch is a training iteration over the entire data set. Training will stop once the number of epochs is reached. More epochs mean better prediction, but longer training.
  • The batch size is the number of random samples - a mini batch - taken out of the data set for a single training run. Splitting the data into mini batches speeds up training since the weight gradient is then calculated from fewer samples. The higher the batch size, the better the training, but the more time it will take.
  • The dropout is the fraction of randomly selected neurons that are disabled during a mini batch. This way the net learns only with a part of its neurons. This seems a strange idea, but can effectively reduce overfitting.

Figure 02: Sigmoid and hyperbolic tangent functions

All these parameters are common for neural networks. Play around with them and check their effect on the result and the training time. Properly calibrating a neural network is non-trivial and interested readers should investigate the topic separately. The parameters are stored in the model together with the matrix of trained connection weights. Therefore they need not be entered again in the prediction function, neural.predict. It takes the model and a vector X of features, runs it through the layers and returns the network output, the predicted target Y. Compared with training, prediction is pretty fast since it only needs a couple of thousand multiplications. If X is a row vector, it is transposed and this way converted to a column vector, otherwise the nn.predict function would not accept it.
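
How learning rate, momentum and learning rate scale interact can be sketched in Python on a toy problem with a single weight. The parameter names and values are assumptions for the example and are not those of the R script.

```python
def descend(gradient, start, learning_rate=0.1, momentum=0.5,
            learning_rate_scale=1.0, epochs=100):
    """Gradient descent with momentum on a single weight.
    After every step the learning rate is multiplied by learning_rate_scale."""
    w, step = start, 0.0
    for _ in range(epochs):
        # a fraction of the previous step is added to the current one
        step = momentum * step - learning_rate * gradient(w)
        w += step
        learning_rate *= learning_rate_scale
    return w

# Toy error function (w - 3)^2 with gradient 2 * (w - 3); its minimum is at w = 3
minimum = descend(lambda w: 2.0 * (w - 3.0), start=0.0)
```

With a learning rate scale below 1, the steps become finer in later epochs. In a real network the same update is applied to every connection weight, with the gradient computed from a mini batch.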

I recommend using RStudio or a similar environment for conveniently working with R. Edit the path to the CSV data in the file above, source it, install the required R packages (deepnet, e1071 and caret), then call the TestOOS function from the command line. If everything works, it should print something like the content of Listing 04.

Listing 04: Output of the TestOOS() function

You might get slightly different values when you use different historical price data. TestOOS first reads our data set from Zorro's data folder. It splits the data into 80% for training and 20% for out-of-sample testing (XY.ts). The training set is trained and the result is stored in the Models list at index 1. The test set is further split into features (X) and targets (Y). Y is converted to binary 0 or 1 and stored in Y.ob, our vector of observed targets. We then predict the targets from the test set, convert them again to binary 0 or 1 and store them. For comparing the observation with the prediction, we use the confusionMatrix function from the caret package.

A confusion matrix of a binary classifier is simply a 2×2 matrix that tells how many 0's and how many 1's were predicted wrongly and correctly. A lot of metrics are derived from the matrix and printed in the lines in Listing 04. The most important at the moment is the ~60% prediction accuracy. This may hint that I bashed price action trading a little prematurely. But of course, the 60% might have been just luck. We will see that later when we run a WFO test.
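
The calculation behind the accuracy figure can be sketched in Python. This is not the caret function; the row and column layout, the names and the sample numbers are assumptions made for the example.

```python
def to_binary(values, threshold):
    """1 for values above the threshold, 0 otherwise."""
    return [1 if v > threshold else 0 for v in values]

def confusion_matrix(observed, predicted):
    """2x2 table of counts, indexed as matrix[observed][predicted]."""
    matrix = [[0, 0], [0, 0]]
    for obs, pred in zip(observed, predicted):
        matrix[obs][pred] += 1
    return matrix

def accuracy(matrix):
    """Share of correct predictions: the diagonal divided by the total."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical trade returns and network outputs
returns = [0.4, -0.2, 0.1, -0.5, 0.3]
outputs = [0.7, 0.4, 0.3, 0.2, 0.6]

observed = to_binary(returns, 0.0)    # positive return -> 1
predicted = to_binary(outputs, 0.5)   # network output above 0.5 -> 1
matrix = confusion_matrix(observed, predicted)
```

In this made-up example four of five samples are predicted correctly. Because the samples were balanced, an accuracy near 50% would indicate no predictive power.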

A final piece of advice: R packages are occasionally updated, with the possible consequence that previous R code suddenly might work differently or not at all. Test carefully after any update.

Step 7: The strategy

Now that we have tested our algorithm and got prediction accuracy above 50% with a test data set, we can finally code our machine learning strategy. In fact, we have already coded most of it; we just need to add a few lines to the above Zorro script that exported the data set. The final script for training, testing and (theoretically) trading the system (DeepLearn.c) is shown in Listing 05.

Listing 05: Zorro script to run the strategy (DeepLearn.c)

Walk forward optimization (WFO) is a method for using almost the whole test data for an out-of-sample test. It applies a rolling window for separating in-sample and out-of-sample data. Here we are using a WFO cycle of one year, split into a 90% training and a 10% out-of-sample test period. You might ask why I earlier used two years of data and a different split, 80/20, for calibrating the network in steps 5 and 6. This is to use differently composed data for calibrating and for walk forward testing. If we used the exact same data, the calibration might overfit it and compromise the test.

The selected WFO parameters mean that the system is trained with about 225 days of data, followed by a 25-day test or trade period with no training. In the literature you will find recommendations to retrain a machine learning system after every trade, or at least every day. But this does not make much sense to me. When you have used almost one year of data for training a system, it obviously cannot deteriorate after a single day. Or if it did, and only produced positive test results with daily retraining, I would strongly suspect that the results are artifacts of a bug in the code.
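
The rolling window can be sketched in Python as a generator of index ranges. This is an illustration under assumptions, not the platform's implementation: bars are counted in trading days, a year is taken as 250 days, and the window advances by one test period per cycle.

```python
def wfo_cycles(n_bars, cycle_length, train_fraction=0.9):
    """Yield (train_start, test_start, test_end) bar indices.
    Training uses [train_start, test_start), testing [test_start, test_end)."""
    train_length = round(cycle_length * train_fraction)
    test_length = cycle_length - train_length
    train_start = 0
    while train_start + cycle_length <= n_bars:
        test_start = train_start + train_length
        yield train_start, test_start, test_start + test_length
        train_start += test_length  # consecutive test periods, no gaps

# Hypothetical example: 300 daily bars, one-year cycle of 250 days
cycles = list(wfo_cycles(300, 250))
```

Each cycle trains on 225 days and tests on the following 25. Since the test periods follow each other without overlap, stringing them together gives one continuous out-of-sample equity curve.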

Training a deep network takes a long time, in our case about ten minutes for a network with three hidden layers and 200 neurons. Since this is repeated in every WFO cycle, using multiple cores is recommended for training many cycles in parallel. The NumCores variable at -1 activates all CPU cores but one.

In the script we now train both long and short trades. For this we have to allow hedging in Training mode, since long and short positions are open at the same time. Entering a position is now dependent on the return value from the advise function, which in turn calls either the neural.train or the neural.predict function from the R script. So we are entering positions when the neural net predicts a result above 0.5. Trading costs, like slippage, spread, rollover and commission, are set to zero for the experiment.

The R script is now controlled by the Zorro script (for this it must have the same name, DeepLearn.r, only with a different extension). It is identical to our R script above since we are using the same network parameters. Only one additional function is needed for supporting a WFO test (Listing 06).

Listing 06: The R function

The function stores the Models list in Zorro's data folder after every training run. The list now contains two separate models for long and for short trades. Since the models are stored for later use, we do not need to train them again for repeated test runs. Figure 03 shows the WFO equity curve generated with the script above (EUR/USD, without trading costs).

Figure 03: EUR/USD equity curve with 50-100-50 network structure

Although not all WFO cycles get a positive result, it seems that there is some predictive effect. The curve is equivalent to an annual return of 89%, achieved with a 50-100-50 hidden layer structure. We will check in the next step how different network structures affect the results.

Since neural.init, neural.train, neural.predict and the other functions of the R script are automatically called by Zorro's adviseLong/adviseShort functions, there are no R functions directly called in the Zorro script. Thus, the script can remain unchanged when using a different machine learning method. Only the DeepLearn.r script must be modified and the neural network, for instance, replaced by a support vector machine. Theoretically, you could now trade this machine learning system live, since the script contains all necessary commands. This is not recommended for the simple experiment here. For running such a machine learning system on a trading server, make sure that Zorro, R and the required R packages are installed and the path to the R terminal is set up in Zorro's INI file. Otherwise you will receive an error message when starting the trading session.

Step 8: The experiment

If our goal had been developing a real strategy, the next steps would be a reality check, risk and money management and preparing for live trading. For our experiment, we will just run a series of tests, increasing the number of neurons per layer from 10 to 100 in three steps, and one, two or three hidden layers (deepnet does not support more than three). So we are looking into the following nine network structures: c(10), c(10,10), c(10,10,10), c(30), c(30,30), c(30,30,30), c(100), c(100,100), c(100,100,100). For this experiment you need an afternoon even with a fast PC running in multiple core mode.

Table 01 shows the resulting Sharpe ratios and R2 coefficients. We can see that a simple net with only ten neurons in one single hidden layer will not work well for short-term prediction. Network complexity seems to improve the performance, though only up to a certain point. A good result for our system is already achieved with 3 layers × 30 neurons. Adding more neurons won't help much and can even produce a worse result. This is no real surprise, since for processing only eight inputs, 300 neurons likely cannot do a better job than 100.

Hidden layers   × 10 neurons            × 30 neurons            × 100 neurons
1               SR = 0.55, R2 = 0.00    SR = 1.02, R2 = 0.51    SR = 1.18, R2 = 0.84
2               SR = 0.98, R2 = 0.57    SR = 1.22, R2 = 0.70    SR = 0.84, R2 = 0.60
3               SR = 1.24, R2 = 0.79    SR = 1.28, R2 = 0.87    SR = 1.33, R2 = 0.83

Table 01: Sharpe ratios (SR) and R2 coefficients for various neural network configurations
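
For orientation, a common textbook definition of the annualized Sharpe ratio can be sketched in Python. The conventions used here are assumptions: a risk-free rate of zero, 252 periods per year and the sample standard deviation. The trading platform's own calculation may differ in detail.

```python
import math

def sharpe_ratio(returns, periods_per_year=252):
    """Mean return divided by the standard deviation of returns, annualized.
    Assumes a risk-free rate of zero."""
    mean = sum(returns) / len(returns)
    variance = sum((r - mean) ** 2 for r in returns) / (len(returns) - 1)
    if variance == 0:
        return 0.0  # undefined for constant returns; 0.0 is a placeholder
    return mean / math.sqrt(variance) * math.sqrt(periods_per_year)

# Hypothetical daily returns
daily_returns = [0.01, -0.005, 0.02, 0.0, 0.015]
ratio = sharpe_ratio(daily_returns)
```

The ratio rewards steady profits over erratic ones, which is why two configurations with similar total profit can score quite differently.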


Our goal was to determine if a few candles can have predictive power and how the results are affected by the complexity of the prediction algorithm. The results seem to suggest that short-term price movements can indeed be predicted by analyzing the changes and ranges of the last four candles. The prediction is not very accurate - it is in the 58-60% range and most systems of the test series become unprofitable when trading costs are included. Still, I have to reconsider my opinion about price action trading. The fact that the prediction improves with network complexity is also an argument in favor of short-term price predictability.

To look at the long-term stability of predictive price patterns I had to run another series of experiments and modify the training period (WFOPeriod in the script in Listing 05) and the 90% IS/OOS split. This takes longer since more historical data is required. I have done a few tests and so far found that a year of data seems to be a good training period. The system deteriorates with periods longer than a few years. Predictive price patterns, at least of EUR/USD, have a limited lifetime.

Where can we go from here? There is a plethora of possibilities for further machine learning experiments. For instance:

  • Use inputs from more candles and process them with far bigger networks with thousands of neurons.
  • Use oversampling for expanding the training data. Prediction always improves with more training samples.
  • Compress time series, for example with spectral analysis, and analyze not the candles, but their frequency representation with machine learning methods.
  • Use inputs from more candles and pre-process adjacent candles with one-dimensional convolutional network layers.
  • Use recurrent networks. In particular, long short-term memory, or LSTM, could be very interesting for analyzing time series - and to my knowledge, has rarely been used for financial prediction.
  • Use an ensemble of neural networks for prediction, such as the "oracles" and "committees" described in Aronson and Masters (2013).