Stupid Data Miner Tricks
I recently read an interesting article about data mining entitled Stupid Data Miner Tricks: Overfitting the S&P 500. The paper is written in a somewhat provocative manner, as its title suggests. Since it is old (1995), you may already have heard of it.
It was written by David J. Leinweber, who was a PhD at Caltech. In his article, Leinweber mentions the bad connotation that the expression “data mining” carried a few decades ago. He then warns of the many dangers of data mining when it is misused. By the way, if you want examples of reliable uses of data mining nowadays (I mean outside universities), take a look at Super Crunchers (I will soon post a review of this interesting book).
As Leinweber writes, his paper gives an example of “[…] totally bogus application of data mining in finance.” To do so, he shows the strong statistical correlation between the annual changes of the S&P 500 stock index and butter production in Bangladesh (!). The focus of the paper is thus the problem of overfitting, which happens when a model fits a training set too well. Such a model has captured the noise in the training data rather than a real relationship, so it generalizes poorly when evaluated on a test set. I will conclude by citing a sentence from the article which is, in my opinion, always an issue in data mining: “When doing this kind of analysis [regression] it’s important to be very careful what you ask for, because you’ll get it.”
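To make the overfitting point concrete, here is a minimal sketch in Python. It is not Leinweber's data or his method (he uses regression; this uses a plain correlation for brevity). Everything in it is an assumption made for illustration: the target and the candidate "predictors" are pure random noise, and the names pearson and mine_best_predictor are hypothetical. The idea is simply that if you search enough unrelated series, one of them will correlate strongly with your target on the training period, and that correlation says nothing about the following period.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def mine_best_predictor(n_candidates=1000, n_train=10, n_test=10, seed=0):
    """Search random series for the one most correlated with a random target.

    Illustrative assumption: the target and every candidate are independent
    Gaussian noise, so any correlation found is spurious by construction.
    Returns (train_correlation, test_correlation) for the candidate that
    looked best on the training period.
    """
    rng = random.Random(seed)
    n = n_train + n_test
    target = [rng.gauss(0, 1) for _ in range(n)]

    best_train, best_test = 0.0, 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n)]
        r_train = pearson(candidate[:n_train], target[:n_train])
        if abs(r_train) > abs(best_train):
            # Keep the winner's score on the held-out period as well.
            best_train = r_train
            best_test = pearson(candidate[n_train:], target[n_train:])
    return best_train, best_test


if __name__ == "__main__":
    r_train, r_test = mine_best_predictor()
    print(f"best correlation on the training period: {r_train:+.2f}")
    print(f"same series on the test period:          {r_test:+.2f}")
```

The selected series looks impressive on the training period only because it was picked for that reason among many candidates. Its test-period correlation is just one more draw from noise: it may be small, of the opposite sign, or occasionally large by luck, which is why a single good fit should not be trusted without held-out data.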