Once the data have been preprocessed, we obtain a matrix in which each row is a different day (since we work with daily data) and each column is one of the possible variables (close, volume, technical indicators, combinations of indicators, etc.). I started with decision trees rather than more “trendy” neural networks or support vector machines because I prefer to begin with simple methods and move to more complex ones only if necessary.
One big advantage of decision trees is that one can understand the model simply by looking at the tree. It is very valuable to understand why, on a given day, MSFT (the ticker for Microsoft) was predicted to increase or decrease. However, this readability is only useful as a pre-study in the project. Indeed, since the project makes one prediction a day (over the whole backtesting period) for each selected stock, there are too many different models for a human being to understand them all.
The high number of models comes from the following nested loops:
For each year to backtest
  For each open day in the year
    For each stock that has been selected
      For each hyper-parameter value of the tree
        For each fold of the cross-validation
          Build a decision tree and evaluate it
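The loops above can be sketched in Python to make the nesting explicit. This is only an illustration of the structure, not the actual backtesting code: the function name `enumerate_tree_fits`, the second ticker and the tiny configuration are all hypothetical, and the tree building itself is left as a comment.

```python
def enumerate_tree_fits(years, days_per_year, stocks, hyper_params, n_folds):
    """Yield one (year, day, stock, hyper_param, fold) tuple per tree to build."""
    for year in years:
        for day in range(days_per_year):
            for stock in stocks:
                for hp in hyper_params:
                    for fold in range(n_folds):
                        # A real system would build and evaluate a tree here.
                        yield (year, day, stock, hp, fold)


# Tiny hypothetical configuration, only to make the structure visible.
fits = list(
    enumerate_tree_fits(
        years=[2001, 2002],
        days_per_year=3,
        stocks=["MSFT", "AAPL"],  # "AAPL" is an illustrative second ticker
        hyper_params=[2, 4, 8],   # e.g. candidate tree depths (assumed)
        n_folds=5,
    )
)
n_fits = len(fits)  # 2 * 3 * 2 * 3 * 5 trees
```

Because every level multiplies the count, even this toy setup already requires 180 trees.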
If we consider that building a decision tree takes 1 second, then a backtest on 100 stocks from 2001 to 2008 (8 years of about 252 trading days, 10*10 hyper-parameter values and 10 folds) requires:
8 * 252 * 100 * (10*10) * 10 = 201'600'000 seconds
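The product above can be checked with a few lines of Python. The variable names are mine, and reading `(10*10)` as 100 hyper-parameter combinations is an assumption; the 1 second per tree is the rough figure used in this post.

```python
SECONDS_PER_TREE = 1   # rough figure assumed in the text

n_years = 8            # 2001 to 2008 inclusive
trading_days = 252     # open days per year
n_stocks = 100
n_hyper = 10 * 10      # hyper-parameter values (read as 100 combinations)
n_folds = 10           # cross-validation folds

total_trees = n_years * trading_days * n_stocks * n_hyper * n_folds
total_seconds = total_trees * SECONDS_PER_TREE

# Convert to years, using 365.25 days per year.
total_years = total_seconds / (365.25 * 24 * 3600)

print(total_trees, "trees,", round(total_years, 1), "years")
```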
This means more than 6 years (about 6.4) of computation on a 4-CPU computer. At this stage, there are mainly two options:
- Grid computing
- Computing the trees each month instead of each day
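The two options combine naturally, because each (year, month, stock) retraining job is independent of the others and can be sent to any machine. Below is a minimal sketch of how such jobs could be split across workers. The helper `split_round_robin`, the placeholder stock names and the choice of six workers are illustrative assumptions, not the setup actually used.

```python
from itertools import product


def split_round_robin(tasks, n_workers):
    """Deal tasks out to workers one by one, like cards."""
    return [tasks[i::n_workers] for i in range(n_workers)]


years = range(2001, 2009)                         # 2001 to 2008 inclusive
months = range(1, 13)                             # retrain once per month
stocks = [f"STOCK_{i:03d}" for i in range(100)]   # placeholder tickers

# One retraining job per (year, month, stock).
tasks = list(product(years, months, stocks))

# Hypothetical grid of six workers.
chunks = split_round_robin(tasks, 6)

print(len(tasks), "jobs,", len(chunks[0]), "per worker")
```

Each job would still run its own hyper-parameter search and cross-validation; only the outer loops are distributed here.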
By combining these two ideas, it is possible to bring the processing time down to around 3 hours of calculation (with a 6 x 4 CPU grid of computers). The next post of the series will discuss the risk management of the system.