Yet another battle between companies mining personal data and privacy advocates is under way. Microsoft is playing the bad guy in this story, in which the personal data consist of health records. According to Government Health IT, Microsoft plans to develop "a clinical data warehouse (CDW) that provides predefined queries of interest to clinicians and analysts". The next step is to apply data mining techniques such as clustering or supervised…
Continue reading...
Although often used in practice, K-means has several drawbacks. The number of clusters has to be defined in advance, and the result depends on the starting centroid locations. More details on how to handle these issues can be found on Data Mining Research (search for clustering in the upper bar). A weakness, which is common to clustering in general, concerns the visualization of the obtained clusters. A possible solution…
Continue reading... | 6 Comments
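The dependence on starting centroids mentioned in the K-means excerpt above can be illustrated with a minimal sketch in plain Python. The function names (`k_means`, `within_cluster_sse`) and the four toy points are assumptions made for illustration, not code from the original post. The number of clusters is fixed in advance by the number of starting centroids passed in.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, start_centroids, max_iter=100):
    """Plain Lloyd iterations from the given starting centroids.

    Returns (centroids, labels). The number of clusters is simply
    len(start_centroids), so it must be chosen beforehand.
    """
    centroids = [tuple(float(v) for v in c) for c in start_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return centroids, labels


def within_cluster_sse(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(squared_distance(p, centroids[lab])
               for p, lab in zip(points, labels))


# Four corners of a wide rectangle: the natural split is left vs. right.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Start 1: one centroid on each side of the rectangle.
c1, l1 = k_means(points, [(0, 0), (4, 0)])
# Start 2: both centroids in the middle, one below and one above.
c2, l2 = k_means(points, [(2, 0), (2, 1)])

print("start 1:", l1, within_cluster_sse(points, c1, l1))
print("start 2:", l2, within_cluster_sse(points, c2, l2))
```

Both runs converge, but the second start settles on a top/bottom split with a much larger within-cluster error than the left/right split found by the first. A common workaround is to run the algorithm from several starts and keep the solution with the lowest error.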

The Auto Industry website has a short article summarizing a data mining application in the car industry. More precisely, Ford is using data mining for early warning of supplier failure. After North American suppliers, they plan to cover Europe and Latin America. One example application is tracking late shipment…
Continue reading... | 1 Comment
Unlike most of the data mining books discussed on this blog, Java Data Mining is written for data mining practitioners. Even though the word Java appears in the title, practitioners using other languages or software may be interested in the first part of the book (Strategy), which is really worth reading. The other parts of the book focus on the JDM API itself (Standards), problem solving with…
Continue reading...
When working in data mining, we often have to switch from one technique to another depending on the task at hand. Andrew Moore's webpage contains several tutorials about data mining. His presentations cover most of the standard data mining algorithms. It's a good starting point when dealing with a new technique…
I have noticed that comments are now regularly posted on Data Mining Research. Moreover, people often comment on old posts (i.e. more than 10 days old). People usually find these posts using a search engine. For these two reasons, I have added a list of recent posts to the right-hand side of the blog. I hope it will help readers to know which topics are currently being discussed on…
Continue reading...
Correlation is often used as a preliminary technique to discover relationships between variables. More precisely, the correlation is a measure of the linear relationship between two variables. Pearson's correlation coefficient is defined as:
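In its usual sample form, for n paired observations (x_i, y_i) with sample means x̄ and ȳ (the notation is assumed here), it reads:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
          \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```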

As written above, the main drawback of correlation is its restriction to linear relationships. If the correlation between two variables is zero, they may…
Continue reading... | 6 Comments
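The restriction to linear relationships noted in the correlation excerpt above can be seen in a small sketch in plain Python. The function name `pearson_r` and the toy data are assumptions chosen for illustration. The function computes the standard sample form of Pearson's coefficient.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [-2, -1, 0, 1, 2]

# A perfectly linear relationship: the coefficient is 1 (up to rounding).
linear = [2 * x + 1 for x in xs]
print("linear:   ", pearson_r(xs, linear))

# y is fully determined by x, yet the relationship is not linear:
# on this symmetric sample the coefficient comes out as zero.
quadratic = [x ** 2 for x in xs]
print("quadratic:", pearson_r(xs, quadratic))
```

In the second case the variables are completely dependent, yet the coefficient does not detect it, which is why a zero correlation should not be read as evidence of independence.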