UCI/NIST Databases
How many times have you read a data mining paper that does not illustrate its results using a common database available on the web? This is a provocative question, of course. However, based on my own reading in data mining, I estimate that 7 out of 10 papers use common data available on the web, such as the UCI and NIST databases.
These data sets were clearly built for the precise purpose of testing new data mining or machine learning algorithms. They are meant to represent real-world problems. The main drawback is that people keep using these few databases and assume they cover a good proportion of real-world problems. Seriously, these databases certainly represent less than 0.1% of the existing real-life problems that can take advantage of data mining methodologies.
I agree with the paper by Lavrac et al. (1), which states that “[…] its existence (UCI database) has indirectly promoted a very narrow view of real-world data mining”. I recently spoke with a professor at Carnegie Mellon University. He uses data mining in civil engineering, and he told me that for each application he needed to adapt or develop a new data mining algorithm that fitted his task. This is an excellent example of the fact that standard data mining algorithms are often not suited, as they stand, to real-world problems.
Finally, still according to (1), data mining algorithms may be “overfitting the UCI repository”. This is certainly true, and the best thing to do would be to collect more data from real-world problems and see how difficult it is to apply standard algorithms to them.
(1) Lavrac N., Motoda H. and Fawcett T., Data Mining Lessons Learned, Machine Learning, 57, 5-11, 2004.