How many times have you read a data mining paper that does not illustrate its results on a common database available on the web? This is a provocative question, of course. However, judging from my own reading in data mining, I estimate that 7 out of 10 papers use common data sets available on the web, such as those in the UCI repository.
These data sets were clearly assembled for the precise purpose of testing new data mining or machine learning algorithms, and they are meant to represent real-world problems. The main drawback is that people keep using the same few databases and assume they cover a good proportion of real-world problems. Seriously, these databases certainly represent less than 0.1% of the existing real-life problems that can take advantage of data mining methodologies.
I agree with the paper by Lavrac et al. (1), which states that “[…] its existence (UCI database) has indirectly promoted a very narrow view of real-world data mining”. I recently spoke with a professor at Carnegie Mellon University who uses data mining in civil engineering. He told me that for each application he needed to adapt an existing data mining algorithm, or develop a new one, to fit his task. This is an excellent example of how standard data mining algorithms are often not suited to real-world problems without adaptation.
Finally, again according to (1), data mining algorithms may be “overfitting the UCI repository”. This is certainly true, and the best thing to do would be to collect more data from real-world problems and see how difficult it is to apply standard algorithms to them.
(1) Lavrac N., Motoda H. and Fawcett T., Data Mining Lessons Learned, Machine Learning, 57, 5-11, 2004.