Quantitative Trading, a blog by Ernest Chan, has a pessimistic post about data mining. In short, the author argues that data mining is often useless for financial purposes, and gives a stock-picking example. In his view, the reasons are the lack of historical data and the amount of noise. To the denoising methods that exist, he responds that the overfitting problem persists. He is certainly right that mining a small quantity of data containing many relationships is not easy. Difficult as it is, though, I do believe there are techniques, such as cross-validation and dimensionality reduction, that can produce good results. Since I have no experience in finance, I would like your opinion on this question, so feel free to share your thoughts by posting a comment.

## Recommended Reading

## 3 comments found on “Data mining useless in finance?”


what did you expect from a Ph.D. in physics – it’s obvious that he does not believe in some esoteric data mining crap.

you really can’t take it very seriously or have a long discussion if one’s statement is that a good property for a successful predicting system is that “They involve linear regression only, and not fitting to some esoteric nonlinear functions”.

What I found interesting about the post is the principles he uses. He writes that the methods that work for him are characterized by:

• They are based on a sound econometric or rational basis, and not on random discovery of patterns;

• They have few or even no parameters that need to be fitted to past data;

• They involve linear regression only, and not fitting to some esoteric nonlinear functions;

• They are conceptually simple.

I don’t have a dog in any algorithm fight, so these comments are not driven by that kind of bias. But there are several interesting observations he makes here. To say that linear regression is conceptually simple (combining bullets 3 and 4) is not necessarily so, nor is it necessarily true that a nonlinear model is complex. For example, a model (linear regression no less!) containing x and x^2 as the inputs, just 2 inputs, is much simpler than another model with 20 inputs even though the latter is linear and the former is nonlinear (in the inputs).
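Dean’s point can be made concrete with a short sketch (my own illustration, not from the post): a model that is nonlinear in the input x, yet is fitted by ordinary linear least squares in its coefficients, with only three degrees of freedom.

```python
import numpy as np

# Toy data: a quadratic relationship with a little noise (illustrative only).
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0, 0.1, size=x.shape)

# Design matrix with just two inputs, x and x^2, plus an intercept.
# The model is nonlinear in x but still an ordinary linear regression
# in its coefficients: only 3 parameters to estimate.
X = np.column_stack([np.ones_like(x), x, x**2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # approximately [1.0, 2.0, -0.5]
```

The fitted model is “nonlinear” in the sense Chan dismisses, yet it is far more parsimonious than a linear model with 20 inputs.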

The question then is related to the degrees of freedom in the model, not the complexity of the patterns. Any data miner worth his pay would guard against overfitting data by having a holdout sample, or if there isn’t enough data, using cross-validation or bootstrap sampling to assess the model performance and hence prevent overfit. While it is true that small data sets and noisy data sets lend themselves to simpler, linear models, I would never make such a blanket statement as the author does about other methods.
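The guard against overfitting that Dean mentions can be sketched in a few lines. This is a minimal k-fold cross-validation routine of my own (the helper name `k_fold_mse` is hypothetical, not from the post): the model is fitted on the training folds only and scored on the held-out fold, so in-sample optimism never enters the performance estimate.

```python
import numpy as np

def k_fold_mse(X, y, k=5):
    """Estimate out-of-sample MSE of an OLS fit by k-fold cross-validation."""
    n = len(y)
    idx = np.arange(n)
    folds = np.array_split(idx, k)
    errors = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        # Fit ordinary least squares on the training folds only.
        coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        pred = X[fold] @ coef
        errors.append(np.mean((y[fold] - pred) ** 2))
    # Averaging held-out errors estimates generalization performance
    # and exposes overfitting that the in-sample fit would hide.
    return float(np.mean(errors))
```

Comparing this held-out error across candidate models (say, 2 inputs versus 20) is exactly how a data miner would decide whether the extra degrees of freedom are earning their keep.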

Dean

By the way, here is an abstract of a recent paper (2006) dealing with financial forecasting:

In this paper, I propose a genetic algorithm (GA) approach to instance selection in artificial neural networks (ANNs) for financial data mining. ANN has preeminent learning ability, but often exhibits inconsistent and unpredictable performance for noisy data. In addition, it may not be possible to train ANN, or the training task cannot be effectively carried out without data reduction, when the amount of data is so large. In this paper, the GA optimizes simultaneously the connection weights between layers and a selection task for relevant instances. The globally evolved weights mitigate the well-known limitations of the gradient descent algorithm. In addition, genetically selected instances shorten the learning time and enhance prediction performance. This study applies the proposed model to stock market analysis. Experimental results show that the GA approach is a promising method for instance selection in ANN.

Interested people can write to me and I will send them the pdf of the paper.