Decision Tree for Stock Prediction: Data Preprocessing
Once the stocks have been filtered, a list of stocks is available for every month of the shifting-window system. Two steps then remain: calculating the technical indicators and standardizing the data. First, the data is separated into training and test sets. One year is taken as training data and the following month as test data. For example:
Training set: August 31st, 2004 -> August 31st, 2005
Test set: September 1st, 2005 -> October 1st, 2005
In fact, the end date of August 31st, 2005 is not exact. Since we predict n days ahead, the class of a training day depends on the price n days later, so the last n days of the window have to be removed from the training set.
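The split described above can be sketched in a few lines of Python. This is only an illustration, not the system's actual code: the function name `split_train_test`, the `horizon` parameter (the n days of the prediction) and the toy calendar are all assumptions made for the example.

```python
from datetime import date, timedelta

def split_train_test(days, train_start, train_end, test_start, test_end, horizon):
    """Return (train_days, test_days) from a sorted list of trading days.

    The last `horizon` training days are dropped, because their outcome
    lies `horizon` days ahead and is not known inside the training window.
    """
    train = [d for d in days if train_start <= d <= train_end]
    test = [d for d in days if test_start <= d <= test_end]
    if horizon > 0:
        train = train[:-horizon]
    return train, test

# Toy calendar (hypothetical): 20 consecutive days, each treated as a trading day.
first = date(2005, 1, 1)
days = [first + timedelta(days=i) for i in range(20)]

train, test = split_train_test(
    days,
    train_start=date(2005, 1, 1), train_end=date(2005, 1, 15),
    test_start=date(2005, 1, 16), test_end=date(2005, 1, 20),
    horizon=3,
)
print(train[-1], len(train), len(test))
```

With a horizon of 3 days, the training window of the toy calendar stops three days before its nominal end, while the test window is left untouched.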
Next, the technical indicators are calculated. Most of them need data older than August 31st, 2004: a 20-day simple moving average (SMA), for example, needs the 19 preceding trading days to produce its first value. Below is a non-exhaustive list of the basic and technical indicators used in the system:
- Close price
- Simple Moving Average (SMA)
- Relative Strength Index (RSI)
- Rate Of Change (ROC)
- On Balance Volume (OBV)
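To make the indicators above concrete, here is a minimal pure-Python sketch of the four technical ones. The function names, the window lengths and the toy prices are assumptions for illustration. The RSI shown uses simple averages of gains and losses, which is only one of several common variants and may differ from the one used in the system.

```python
def sma(values, window):
    """Simple moving average; None until `window` values are available."""
    out = [None] * len(values)
    for i in range(window - 1, len(values)):
        out[i] = sum(values[i - window + 1 : i + 1]) / window
    return out

def roc(values, period):
    """Rate of change in percent versus the value `period` days earlier."""
    out = [None] * len(values)
    for i in range(period, len(values)):
        out[i] = 100.0 * (values[i] - values[i - period]) / values[i - period]
    return out

def rsi(values, period):
    """RSI from simple averages of gains and losses over `period` daily changes."""
    out = [None] * len(values)
    for i in range(period, len(values)):
        changes = [values[j] - values[j - 1] for j in range(i - period + 1, i + 1)]
        gain = sum(c for c in changes if c > 0) / period
        loss = sum(-c for c in changes if c < 0) / period
        # With no losses in the window the RSI is set to 100 (a convention).
        out[i] = 100.0 if loss == 0 else 100.0 - 100.0 / (1.0 + gain / loss)
    return out

def obv(closes, volumes):
    """On Balance Volume: running volume total, signed by the close-to-close move."""
    out = [0]
    for i in range(1, len(closes)):
        if closes[i] > closes[i - 1]:
            out.append(out[-1] + volumes[i])
        elif closes[i] < closes[i - 1]:
            out.append(out[-1] - volumes[i])
        else:
            out.append(out[-1])
    return out

# Toy data (hypothetical prices and volumes).
closes = [10, 11, 12, 11, 13]
volumes = [100, 150, 120, 130, 170]

print(sma(closes, 3))
print(roc(closes, 2))
print(rsi(closes, 3))
print(obv(closes, volumes))
```

The leading `None` values show why data older than the start of the training window is needed: without it, the first days of the window would have no indicator value.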
In addition to these indicators, combinations of them are used. We thus obtain one matrix for the training data and one for the test data, where each row is a day of the period and each column is one of the possible indicators. For the training data matrix, an additional column representing the output class (-1 or +1) is added.
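The construction of these matrices could look roughly like the sketch below. The labelling rule is an assumption, since the post does not spell it out: here a day gets +1 if the close n days later is higher than that day's close, and -1 otherwise. The names `make_labels`, `build_matrix` and `horizon` are hypothetical as well.

```python
def make_labels(closes, horizon):
    """Assumed rule: +1 if the close `horizon` days later is higher, else -1.

    The last `horizon` days get None because their outcome is not known.
    """
    labels = []
    for t in range(len(closes)):
        if t + horizon < len(closes):
            labels.append(1 if closes[t + horizon] > closes[t] else -1)
        else:
            labels.append(None)
    return labels

def build_matrix(columns, labels=None):
    """Turn indicator columns into rows (one per day), skipping incomplete days."""
    rows = []
    for t in range(len(columns[0])):
        row = [col[t] for col in columns]
        if labels is not None:
            row.append(labels[t])  # output class goes in the last column
        if all(v is not None for v in row):
            rows.append(row)
    return rows

# Toy data (hypothetical): close price plus a 3-day moving average as second column.
closes = [10, 11, 12, 11, 13, 12]
sma3 = [None, None] + [sum(closes[i - 2 : i + 1]) / 3 for i in range(2, len(closes))]

labels = make_labels(closes, horizon=2)
train_matrix = build_matrix([closes, sma3], labels)  # indicators + class column
test_matrix = build_matrix([closes, sma3])           # indicators only

print(labels)
print(train_matrix)
```

Days whose indicators or class cannot be computed are simply left out, which matches the removal of the last n days from the training set.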
Once this is done, the data is standardized to zero mean and unit standard deviation. This lets the decision tree choose correctly among parameters that are not expressed in the same unit (e.g. close prices and volumes). The next post will discuss the main part of the system: the classification tree process.
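As a rough illustration of the standardization step described above, the sketch below rescales each indicator column to zero mean and unit standard deviation. Two choices here are assumptions rather than details from the post: the mean and standard deviation are computed on the training matrix and reused for the test matrix, and the population standard deviation is used. The class column is assumed to be kept apart and is not rescaled.

```python
from statistics import mean, pstdev

def fit_scaler(rows):
    """Compute the per-column mean and standard deviation of a matrix."""
    cols = list(zip(*rows))
    means = [mean(c) for c in cols]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(c) or 1.0 for c in cols]
    return means, stds

def standardize(rows, means, stds):
    """Apply (value - mean) / std to every column of every row."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Toy matrices (hypothetical): column 0 is a close price, column 1 a volume.
train_rows = [[10.0, 1000.0], [12.0, 3000.0], [14.0, 2000.0]]
test_rows = [[13.0, 2500.0]]

means, stds = fit_scaler(train_rows)
train_std = standardize(train_rows, means, stds)
test_std = standardize(test_rows, means, stds)

print(train_std)
print(test_std)
```

After this step, prices and volumes are expressed on the same scale, whatever their original units.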