Recently, I was reading the EPFL magazine and was surprised to see an article in which they interviewed my master's thesis adviser, Francois Fleuret. He explained data mining and used our project as an example.
The goal of my master's project was to predict the pollen concentration in the air for the following days. For that, we used different kinds of weather data (temperature, wind, sun, rain, etc.), all available as daily measurements. The most interesting part was not the data mining technique used, but rather the results obtained.
We used various techniques such as linear regression and decision trees. At the end, we also computed the auto-correlation to study how the quantity of pollen on one day affects the following days. As the value was quite high, we plotted our predictions and saw that they were very close to the previous day's pollen concentration. We could thus conclude with a sentence such as “tomorrow’s pollen concentration is very likely to be similar to today’s concentration”.
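To make the auto-correlation idea concrete, here is a minimal sketch in Python. It is not the code from the project: the function name `autocorrelation` and the pollen values are made up for illustration, and the estimator is the usual sample auto-correlation (lagged covariance divided by the variance of the whole series).

```python
def autocorrelation(series, lag):
    """Sample auto-correlation of `series` at the given lag (in days)."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    mean = sum(series) / n
    deviations = [x - mean for x in series]
    variance_sum = sum(d * d for d in deviations)
    if variance_sum == 0:
        raise ValueError("auto-correlation is undefined for a constant series")
    # Multiply each day's deviation with the deviation `lag` days later.
    lagged_sum = sum(deviations[t] * deviations[t + lag] for t in range(n - lag))
    return lagged_sum / variance_sum


# Hypothetical daily pollen counts, invented for this example.
pollen = [12, 15, 19, 24, 30, 33, 31, 27, 22, 18, 14, 11]

for lag in (1, 2, 3):
    print(f"lag {lag}: {autocorrelation(pollen, lag):.2f}")
```

A value close to 1 at lag 1 means that today's concentration already tells you most of what you need to know about tomorrow's, which is the situation we ended up in.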
The lesson that I learned from this project is to first try simpler methods (such as cross- and auto-correlation in this case) before using other, more complex data mining techniques. This idea is related to Occam’s razor, which is often summarized by the well-known quote “Entities must not be multiplied beyond necessity”. Of course, as is recommended in data mining, you should always try more than one technique to make predictions.
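One way to put this lesson into practice is to score a trivial baseline next to the other candidates on the same held-out days. The sketch below is only an illustration under my own assumptions: the helper names (`persistence`, `moving_average`, `fit_lag1_regression`, `evaluate`), the mean absolute error as the score, and the pollen values are all invented here and are not taken from the project.

```python
from statistics import mean


def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def persistence(history):
    """Simplest baseline: tomorrow equals today."""
    return history[-1]


def moving_average(history, window=3):
    """Average of the last `window` observed days."""
    return mean(history[-window:])


def fit_lag1_regression(train):
    """Least-squares line predicting a day's value from the previous day's."""
    x, y = train[:-1], train[1:]
    mean_x, mean_y = mean(x), mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda history: intercept + slope * history[-1]


def evaluate(series, n_train, forecaster):
    """One-day-ahead MAE on the days after the first `n_train` days."""
    actual = series[n_train:]
    predicted = [forecaster(series[:t]) for t in range(n_train, len(series))]
    return mae(actual, predicted)


# Hypothetical daily pollen counts, invented for this example.
pollen = [12, 15, 19, 24, 30, 33, 31, 27, 22, 18, 14, 11, 13, 17, 21, 26]
n_train = 10

candidates = {
    "persistence": persistence,
    "moving average (3 days)": moving_average,
    "lag-1 linear regression": fit_lag1_regression(pollen[:n_train]),
}
for name, forecaster in candidates.items():
    print(f"{name}: MAE = {evaluate(pollen, n_train, forecaster):.2f}")
```

If a more complex model cannot clearly beat the persistence line on data it has not seen, the simple explanation is probably the one to keep.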