How to cheat with data mining

Usually data miners don’t cheat. The reason is simple: you cannot cheat the future, because a model that only looks good on past data will be exposed by new data. In reality, it’s a bit more complicated: a data miner may be cheating without knowing it. Here are a few examples.

First, one may cheat by learning the training set by heart (overfitting). If you cheat in this way on your training set, it will show up on the test set: performance there will be much worse than on the data the model has already seen.
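To make this concrete, here is a minimal sketch of a model that learns the training set by heart. Everything in it is an assumption made for illustration: the data is synthetic, the labels are pure coin flips, and `MemorizingModel` is a hypothetical lookup table, not a real library class.

```python
import random


class MemorizingModel:
    """A 'model' that learns the training set by heart: a plain lookup table."""

    def fit(self, rows):
        self.table = {x: y for x, y in rows}
        return self

    def predict(self, x):
        # Inputs never seen during training fall back to a constant guess.
        return self.table.get(x, 0)


def accuracy(model, rows):
    return sum(model.predict(x) == y for x, y in rows) / len(rows)


rng = random.Random(0)

# 2000 distinct feature values; labels are coin flips unrelated to the features,
# so there is nothing real to learn.
ids = rng.sample(range(10**6), 2000)
rows = [(x, rng.randrange(2)) for x in ids]
train, test = rows[:1000], rows[1000:]

model = MemorizingModel().fit(train)
train_acc = accuracy(model, train)
test_acc = accuracy(model, test)

print(f"accuracy on training set: {train_acc:.2f}")
print(f"accuracy on test set:     {test_acc:.2f}")
```

The training score is perfect by construction, while the test score can only hover around chance. That gap is the signature of overfitting.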

Another way of cheating is to use predictor variables that you don’t have the right to use. For example, you don’t have the right to use tomorrow’s value of the euro to predict tomorrow’s value of the dollar. This case is straightforward, but the same thing can happen in more subtle ways, believe me.
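One way to guard against this is to record, for every predictor, the day on which its value becomes known, and to drop anything that is not known before the day being predicted. The sketch below assumes a hypothetical `Feature` record and made-up exchange rates; it shows the idea, not a complete pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    available_day: int  # day on which the value becomes known
    value: float


def usable_features(features, prediction_day):
    """Keep only the features known strictly before the day being predicted."""
    return [f for f in features if f.available_day < prediction_day]


# Hypothetical numbers, for illustration only. Today is day 9 and we want
# to predict the dollar on day 10.
features = [
    Feature("eur_close", available_day=9, value=1.31),   # today's euro: allowed
    Feature("usd_close", available_day=9, value=1.00),   # today's dollar: allowed
    Feature("eur_close", available_day=10, value=1.33),  # tomorrow's euro: forbidden
]

allowed = usable_features(features, prediction_day=10)
for f in allowed:
    print(f.name, f.available_day, f.value)
```

The subtle cases are the ones where the availability date is not obvious, for example a value that is published later than the period it describes.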

The next one is much more insidious. You use past data to predict the evolution of some given stocks. You train and test your model on past years (backtesting). For that, you choose a set of stocks as of the end of 2010 that were also present in 2001. This way, the backtest can be performed over 10 years. Results are good over the last 10 years, so the strategy can be launched.

But you are cheating, in a subtle way. You are not overfitting and neither are you using forbidden predictor variables. So, what’s the problem? You are using information you are not allowed to have. While backtesting, you should not know in 2001 which companies will still be there at the end of 2010. By using that knowledge, you select only the companies that survived, and thus you cheat on your predicted return. Subtle, isn’t it?
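A small simulation shows how large this effect can be. The sketch below is built entirely on assumptions: companies follow a random walk with no edge at all, and a company is delisted (losing all its value) once it drops below a made-up threshold. None of the parameters come from real market data.

```python
import random


def simulate(n_companies=2000, n_years=10, volatility=0.3, delist_below=0.3, seed=0):
    """Return the final value of each company, starting from 1.0 in year 0."""
    rng = random.Random(seed)
    final_values = []
    for _ in range(n_companies):
        value = 1.0
        for _ in range(n_years):
            # Yearly return with zero mean: no stock is better than another.
            value *= 1.0 + rng.gauss(0.0, volatility)
            if value < delist_below:
                value = 0.0  # the company disappears and investors lose everything
                break
        final_values.append(value)
    return final_values


values = simulate()
survivors = [v for v in values if v > 0.0]

# Return of buying every company that existed at the start.
universe_return = sum(values) / len(values) - 1.0
# Return of buying only the companies known to still exist at the end.
survivor_return = sum(survivors) / len(survivors) - 1.0

print(f"companies at start: {len(values)}, still present at end: {len(survivors)}")
print(f"return on the full starting universe: {universe_return:+.1%}")
print(f"return on survivors only:             {survivor_return:+.1%}")
```

The survivors-only figure is always the more flattering one here, although no stock picking skill is involved. A backtest should therefore use the universe as it was known at each point in time, including the companies that later disappeared.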

5 comments on “How to cheat with data mining”

  1. Any list of classic traps would certainly include these three!

    There is a spin-off of the first that doesn’t get as much attention. Let’s say that you build 100 models (different inputs/different techniques) in your quest for the optimal model. On what basis will you choose the top performers? We use the test data set because we know that we can’t use the training data set. Well, we are getting an awful lot of use out of that test set. We might be basing decisions on it for several days of intense modeling work. Not a bad idea on that last modeling day to have a third data set to ensure that the top performers on the test data set continue to be the top performers on the third (validation) data set.

    The second one will eventually get a name that we can all agree on. It is a classic error, and it seems everyone has to make it a few times before it gets trained out of them. I have met folks that like to call them “clairvoyant” variables because they are never wrong. I tend to like “anachronistic” because the variables are literally that, but when I call them that in lectures the term doesn’t always go over well.

    Nice job reminding us all of the last one – it is indeed easy to miss.

  2. Hi Sandro
    Analysts often misuse some precious theories like Bayesian posterior probability… building models based on observations of the future.
    But the most common mistake concerns the domain: you train a model and test it properly in one domain, ignoring the fact that the domain is often not static but dynamic.
    …How many comments can you read blaming the chosen algorithm 🙂
    Nevertheless, making a mistake is the best way to learn. …the important thing is:
    don’t act in bad faith 🙂
    cristian

  3. Hi all
    Thanks all for your comments and contributions. I’m a bit new to data mining and I’m looking for a good PhD topic in the field. Can anyone assist, please? If anybody has anything for me, please send it to gachiricn@gmail.com. Thanks
