UCI/NIST Databases

August 14, 2006 by Sandro Saitta
Filed under: NIST, UCI, database 

How many times have you read a paper about data mining which does not illustrate its results using a common database available on the web? This is a provocative question, of course. However, from my personal reading experience in data mining, I estimate that 7 out of 10 papers use common data available on the web, for example the UCI and NIST databases.

These data sets were clearly built for the precise purpose of testing new data mining or machine learning algorithms, and they are meant to represent real-world problems. The main drawback is that people keep using these few databases and assume they cover a good proportion of real-world problems. Seriously, these databases certainly represent less than 0.1% of the existing real-life problems that can take advantage of data mining methodologies.

I agree with the paper of Lavrac (1) stating that “[...] its existence (UCI database) has indirectly promoted a very narrow view of real-world data mining”. I recently spoke with a professor at Carnegie Mellon University. He uses data mining in civil engineering, and he told me that for each application he needed to adapt an existing data mining algorithm, or develop a new one, to fit his task. This is an excellent example of the fact that standard data mining algorithms, as they stand, are often not suited to real-world problems.

Finally, still according to (1), data mining algorithms may be “overfitting the UCI repository”. This is certainly true, and the best thing to do would be to collect more data from real-world problems and see how difficult it is to apply standard algorithms to them.
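To make the idea of "overfitting the repository" concrete, here is a minimal simulation sketch. It is not taken from (1); every name and number in it is a hypothetical choice for illustration. It assumes a small fixed benchmark whose labels are pure coin flips, and a pool of candidate "algorithms" that are nothing more than random lookup tables. Picking the candidate that scores best on the benchmark gives a flattering accuracy, even though no candidate can really do better than chance on new data.

```python
import random


def make_dataset(rng, n, n_feature_values=1000):
    # Each sample is (feature, label). Labels are coin flips that do not
    # depend on the feature, so the true accuracy of any model is 50%.
    return [(rng.randrange(n_feature_values), rng.randrange(2)) for _ in range(n)]


def make_candidate(rng, n_feature_values=1000):
    # A hypothetical "algorithm": a random lookup table from feature to label.
    table = [rng.randrange(2) for _ in range(n_feature_values)]
    return lambda x: table[x]


def accuracy(model, data):
    # Fraction of samples whose predicted label matches the true label.
    return sum(model(x) == y for x, y in data) / len(data)


def benchmark_selection_gap(seed=0, n_candidates=200, n_benchmark=50, n_fresh=5000):
    rng = random.Random(seed)
    benchmark = make_dataset(rng, n_benchmark)  # the small, reused data set
    fresh = make_dataset(rng, n_fresh)          # stands in for new real-world data
    candidates = [make_candidate(rng) for _ in range(n_candidates)]
    # Keep whichever candidate happens to look best on the benchmark.
    best = max(candidates, key=lambda m: accuracy(m, benchmark))
    return accuracy(best, benchmark), accuracy(best, fresh)


if __name__ == "__main__":
    bench_acc, fresh_acc = benchmark_selection_gap()
    print(f"accuracy on the benchmark: {bench_acc:.2f}")
    print(f"accuracy on fresh data:    {fresh_acc:.2f}")
```

The gap between the two printed numbers comes only from selecting on the same small data set again and again. Real algorithms tuned on a handful of public databases are of course not random tables, but the same selection effect can inflate their reported results.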

(1) Lavrac N., Motoda H. and Fawcett T., Data Mining Lessons Learned, Machine Learning, 57, 5-11, 2004.
