# Stock Prediction using Decision Tree: Classification Tree

This is the fourth post in a series on using decision trees for stock prediction. For more information, feel free to read post 1, post 2 and post 3 of the series.

Once the data have been preprocessed, we obtain a matrix in which each row is a different day (since we work with daily data) and each column is one of the candidate variables (close, volume, technical indicators, combinations of indicators, etc.). I started with decision trees rather than more “trendy” neural networks or support vector machines because I prefer to begin with simple methods and move to a more complex one only if necessary.

One big advantage of decision trees is that one can understand the model by looking at the tree. It is very useful to see why, on a given day, MSFT (the ticker for Microsoft) has been predicted to increase or decrease. However, this readability only helps during the preliminary study of the project. Since the project makes one prediction a day (over the whole backtesting period) for each selected stock, there are far too many models for a human being to inspect them all.

The high number of models comes from the following nested loops, all of which have to be run:

```
For each year to backtest
  For each open day in the year
    For each stock that has been selected
      For each hyper-parameter value of the tree
        For each fold of the cross-validation
          Build a decision tree and evaluate it
```
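The two innermost loops amount to a grid search with cross-validation. The following is a minimal sketch of that step, not the actual code of the project. The grids (10 depths and 10 minimum split sizes), the 252-row training set and the `build_and_score` callback are assumptions made for illustration; a dummy scorer stands in for the real tree so that the sketch runs with the standard library only.

```python
from itertools import product
from statistics import mean


def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs for contiguous, near-equal folds."""
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def grid_search(n_samples, depths, min_split_sizes, n_folds, build_and_score):
    """Return (best_params, best_score, n_trees_built).

    build_and_score(depth, min_split, train_idx, test_idx) is a hypothetical
    callback: it should fit a tree on the training rows and return its
    accuracy on the test rows.
    """
    best_params, best_score, n_built = None, float("-inf"), 0
    for depth, min_split in product(depths, min_split_sizes):
        scores = []
        for train_idx, test_idx in k_fold_indices(n_samples, n_folds):
            scores.append(build_and_score(depth, min_split, train_idx, test_idx))
            n_built += 1
        avg = mean(scores)  # cross-validated score of this combination
        if avg > best_score:
            best_params, best_score = (depth, min_split), avg
    return best_params, best_score, n_built


# Dummy scorer (made up): a real one would build and evaluate a decision tree.
def dummy_score(depth, min_split, train_idx, test_idx):
    return 0.5 + 0.01 * depth - 0.001 * abs(min_split - 20)


params, score, n_built = grid_search(
    n_samples=252,                    # assumed: one year of daily rows
    depths=range(1, 11),              # assumed grid: 10 depths
    min_split_sizes=range(5, 55, 5),  # assumed grid: 10 minimum split sizes
    n_folds=10,
    build_and_score=dummy_score,
)
print(params, n_built)
```

With these assumed grids, a single stock on a single day already requires 10 x 10 x 10 = 1000 trees. Libraries such as scikit-learn provide ready-made grid search and cross-validation utilities that could replace this hand-written loop.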

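If we assume that building a decision tree takes 1 second, then a backtest on 100 stocks from 2001 to 2008 (8 years of about 252 open days each, with a 10 x 10 grid of hyper-parameter values and 10 cross-validation folds) requires: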

`8 * 252 * 100 * (10*10) * 10 = 201'600'000 seconds`
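As a quick check, the same estimate can be reproduced in a few lines of Python. The constant names are mine, and the conversion assumes 365-day years.

```python
YEARS = 8                    # 2001 to 2008 inclusive
OPEN_DAYS_PER_YEAR = 252     # approximate number of trading days per year
STOCKS = 100
HYPER_PARAM_COMBOS = 10 * 10
FOLDS = 10
SECONDS_PER_TREE = 1         # build time assumed in the text

total_trees = YEARS * OPEN_DAYS_PER_YEAR * STOCKS * HYPER_PARAM_COMBOS * FOLDS
total_seconds = total_trees * SECONDS_PER_TREE
total_years = total_seconds / (365 * 24 * 3600)

print(f"{total_trees:,} trees, about {total_years:.1f} years of computation")
```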

This means more than 6 years of computation on a 4 CPU computer. At this stage, there are two main ways to reduce the cost:

• Grid computing
• Computing the trees each month instead of each day
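The second idea can be sketched as follows: the trees are rebuilt only on the first open day of each month, and the most recent trees are reused for the daily predictions in between. This is only an illustration under my own assumptions. The function name is hypothetical, and the calendar below simply takes every weekday of 2008, whereas a real exchange calendar would also exclude holidays.

```python
from datetime import date, timedelta


def first_open_day_of_each_month(open_days):
    """Return the rebuild days: the first open day of every month."""
    retrain_days = []
    seen_months = set()
    for day in sorted(open_days):
        key = (day.year, day.month)
        if key not in seen_months:
            seen_months.add(key)
            retrain_days.append(day)
    return retrain_days


# Stand-in calendar: every Monday-to-Friday of 2008 (holidays are ignored).
start = date(2008, 1, 1)
open_days = [start + timedelta(days=i) for i in range(366)
             if (start + timedelta(days=i)).weekday() < 5]

retrain_days = first_open_day_of_each_month(open_days)
print(len(open_days), "open days,", len(retrain_days), "tree rebuilds")
```

Going from about 252 rebuild days per year to 12 divides the number of trees to build by roughly 21.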

By applying these two ideas, it is possible to bring the processing time down to around 3 hours (with a grid of 6 computers of 4 CPUs each). The next post of the series will discuss the risk management of the system.

#### 10 Comments on Stock Prediction using Decision Tree: Classification Tree

1. Shane Butler on Wed, 5th Nov 2008 7:04 am

   Hi Sandro,
   Could you elaborate on the hyper-parameter step please?
   Shane

2. Sandro Saitta on Thu, 6th Nov 2008 11:18 am

   Shane,
   Thanks for asking the question. To set the hyper-parameters of the decision tree (i.e. the model selection step), I use a simple grid search.

   The two parameters I set are the depth of the tree and the minimum number of elements in a node for a split to occur. Each has a range of possible values, and I try every possible combination of them.

   I hope it is clearer now. Feel free to ask for more details.

3. Shane Butler on Thu, 6th Nov 2008 11:20 pm

   Thanks Sandro, that makes perfect sense! Cheers, Shane

4. shailesh bohra on Wed, 14th Jan 2009 7:52 pm

   Hi Sandro,

   I am also doing research work on data mining applications in the stock market, and I am looking for a relevant dataset. Which dataset are you using? Can you please help me get a relevant dataset for any stock market?

   Thanks & cheers,
   shailesh

5. Sandro Saitta on Thu, 15th Jan 2009 10:02 am

   Hello shailesh,

   I’m using data from our own internal database. We get data from Bloomberg.

   If you want free data, I think you can use Google Finance and Yahoo! Finance.

   Hope it helps!

   Also feel free to keep in touch with me since our work may be very close.

6. James on Mon, 7th Sep 2009 10:58 am

   Dear Sandro,

   What is the classification error of your decision tree model, and how many days/years of trading data are you using to train the model?

   Financial data is very volatile and hence the model might just be time inconsistent.

7. Sandro Saitta on Fri, 11th Sep 2009 5:26 pm

   @James: I was using a one year training set (based on daily data) and a one month moving window (only for computational reasons). As is often the case in finance, the correct classification rate was very low, I think on the order of 55 to 60%.

8. Richard on Wed, 25th Nov 2009 7:41 am

   Hi Sandro:
   Have you studied any conferences before doing this research?
   Thanks

9. Richard on Wed, 25th Nov 2009 7:47 am

   Sorry, “reference”

10. Sandro Saitta on Sat, 28th Nov 2009 11:47 am

    @Richard: Yes, but I don’t have them in mind. I just remember an interesting book named Data Mining in Finance.
