Discussion on data mining pitfalls

After a few comments on the post Garbage in, garbage out, I find it interesting to discuss in more detail the pitfalls that arise when applying data mining techniques. I warmly encourage you to share your ideas. Here are two possible pitfalls that I currently have in mind:

  • Underfitting/overfitting
  • Data preparation (e.g. normalization)

Feel free to add items, discuss the existing ones, and perhaps share your personal experience with some of them.
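As a concrete illustration of the data-preparation pitfall above, here is a minimal sketch (the data and helper are made up for illustration) of a classic mistake: computing normalization statistics on the whole dataset before splitting, which silently leaks information about the test set into training.

```python
# Sketch of a common data-preparation pitfall: computing normalization
# statistics on the *entire* dataset before splitting leaks information
# about the test set into the training data. Data here is illustrative.

def mean_std(values):
    """Return (mean, std) of a list of numbers."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, var ** 0.5

data = [1.0, 2.0, 3.0, 4.0, 100.0]   # the last point is an extreme test value
train, test = data[:4], data[4:]

# Wrong: statistics computed on train + test together
m_all, s_all = mean_std(data)
# Right: statistics computed on the training split only
m_tr, s_tr = mean_std(train)

# The two sets of statistics differ wildly, so data "normalized" with the
# first pair silently encodes knowledge of the test outlier.
print(m_all, m_tr)   # 22.0 vs 2.5
```

The fix is simply to fit any preprocessing step on the training split alone and then apply it, frozen, to the test split.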


5 comments on “Discussion on data mining pitfalls”

  1. Much of what constitutes expertise in data mining amounts to awareness of the many subtle hazards facing the analyst and an understanding of how to contend with them. Managers and clients frequently have no awareness of this, and novices, when they do recognize the problems, often employ inappropriate strategies.

    Some of the issues are: sampling, missing values, outliers, very small data, very large data, imbalanced classes, model validation and variable selection.

  2. A “pitfall” is a hidden hazard, not simply a hurdle or challenge to be overcome. I think this is an important distinction since mere hurdles are not as dangerous to novices as true pitfalls. A hurdle either stops the process or leads to a less than optimal result. This is usually not a disaster, just disappointing.

    In my opinion, things like “very large data”, “imbalanced data” or “variable selection” are hurdles. They may be difficult problems, but you can pretty much see them coming.

    On the other hand, over-fitting can be properly characterized as a pitfall. Someone could train and deploy a model without ever being aware that it was overfit during training.

    In my experience, most but not all pitfalls can be overcome with technology (I’ve worked for KXEN for over 5 years, so I’ve seen a lot of evidence of things like automated over-fitting control).

    So the things that are interesting to me are pitfalls that can’t be solved with technology. These seem to be primarily associated with domain and process knowledge, not statistics knowledge.

    For example, the pitfall of creating an input variable that contains information about the target or dependent variable (I call these “leaks”). Other than the perfect leak that is an exact replica of the target, I know of no technology to detect this pitfall. Only a human who understands where that variable came from, how it was created, and how it does (or doesn’t) relate to the target can identify it.

    I’ve found that I can successfully train novices in a few days to pick off these kinds of pitfalls, as long as the technology is handling the others. And by “novice” I mean a reasonably intelligent domain expert with no previous data mining experience, not “Cletus the slack-jawed yokel”.
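The comment above notes that only the perfect leak, an exact replica of the target, is detectable by technology. A minimal sketch of that detectable case follows; the column names and data are made up, and this crude check catches only blatant leaks, not the subtle ones that require domain knowledge.

```python
# Hedged sketch: flag input columns that are (nearly) exact replicas of
# the target. This only catches the blatant case; subtler leaks need a
# human who knows how each variable was created. Names are illustrative.

def match_rate(column, target):
    """Fraction of rows where the column value equals the target value."""
    hits = sum(1 for c, t in zip(column, target) if c == t)
    return hits / len(target)

target = [0, 1, 1, 0, 1, 0]
features = {
    "churned_flag": [0, 1, 1, 0, 1, 0],   # perfect replica of the target
    "age_band":     [1, 1, 0, 0, 1, 1],   # ordinary, imperfect predictor
}

# Flag any column that agrees with the target on almost every row.
suspicious = [name for name, col in features.items()
              if match_rate(col, target) >= 0.95]
print(suspicious)   # ['churned_flag']
```

A check like this is cheap to run before modeling, but a clean result proves nothing: a leak laundered through an aggregate or a timestamp will sail straight past it.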

  3. Thanks to Innar and Will, we now have a fairly complete (hopefully) list of data mining… hmm, let’s say difficulties for non-specialists.

    Until today, I didn’t know about KXEN. After Robert’s comment, though, I think it may be interesting to take a look at this company and what they do.

  4. I understand Robert’s point, though I’ll offer the following responses.

    -For the record, I labeled these things “issues”. By Robert’s definition, these contain potential pitfalls for anyone who is not familiar with them (in my experience, a large fraction of people who attempt data mining). The pitfalls come in the form of inappropriate or inefficient responses to those issues.

    -Some of these items remain open problems in the research community. Provost and co-authors, for instance, recently published new findings on dealing with imbalanced classes.

    -Many commercial tools are still of the train-and-test variety (no cross-validation, little or no fitting control).

    -Hey, I like Cletus! “Your carpeted floor, feels good between mah toes!”
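On the train-and-test point in the comment above: a single split yields one noisy estimate of model quality, while k-fold cross-validation yields k estimates whose average is more stable. A minimal sketch, using a trivial mean predictor purely for illustration (the data and fold scheme are made up):

```python
# Sketch of k-fold cross-validation with contiguous folds. The "model"
# is a trivial mean predictor, just enough to show the fit/score loop;
# any real learner would slot into the same structure.

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold = n // k
    for i in range(k):
        test = list(range(i * fold, (i + 1) * fold))
        train = [j for j in range(n) if j not in test]
        yield train, test

ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

scores = []
for train, test in k_fold_indices(len(ys), 3):
    pred = sum(ys[j] for j in train) / len(train)             # "fit"
    mse = sum((ys[j] - pred) ** 2 for j in test) / len(test)  # "score"
    scores.append(mse)

print(scores)                       # one error estimate per fold
print(sum(scores) / len(scores))    # the averaged cross-validated estimate
```

Note how the per-fold scores vary; a tool that reports only one train-and-test split would hand you a single one of these numbers and no hint of the spread.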
