It’s a great pleasure for me to introduce Dominic Pouzin, from Data Applied. He kindly agreed to write a guest post for Data Mining Research. I found it really interesting and I hope you will enjoy it as well. Happy reading, and thanks, Dominic.
When I was invited to write a blog post, I initially thought I would write about something fairly technical. Then it struck me. As data miners, we are intimately familiar with data manipulation algorithms and statistics. We enjoy tweaking predictive models to reduce classification error rates by a mere 2%. We understand the principles behind stochastic gradient boosting or the Bayesian information criterion. But the reality is that it’s a pretty lonely place.
The promise to bring the full power of data mining techniques to the world has been broken. Ideally, a small dentist’s office with several years of customer records should be able to determine which dental services are frequently purchased together. The owner of a large restaurant should be able to find out which types of customers spend more (and why, and when). In these two cases, market basket analysis or decision trees might provide the answer. These examples might look a bit marginal, yet they represent what the data mining community should strive for. Unfortunately, data mining techniques are still presented in a fairly unintelligible form. It doesn’t help that existing data mining solutions cost so much (see “cost areas”). As a result, only large banks, insurers, and retailers have had the expertise and budget to leverage these techniques and improve their ROI. It should therefore not be very surprising that public interest in data mining is fading away (see this chart).
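To make the dentist example a little more concrete, here is a minimal sketch of the simplest form of market basket analysis: counting how often two services show up in the same visit. The visit records and service names are made up for illustration, and a real tool would go further (association rules, confidence, lift), so treat this as a rough idea rather than a full method.

```python
from collections import Counter
from itertools import combinations

# Hypothetical visit records: each set lists the services billed in one visit.
visits = [
    {"cleaning", "x-ray"},
    {"cleaning", "x-ray", "filling"},
    {"cleaning", "whitening"},
    {"filling", "x-ray"},
]

def pair_support(baskets):
    """Return, for each pair of items, the fraction of baskets containing both."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

support = pair_support(visits)

# Most frequent pairs first; ties are broken alphabetically.
top_pairs = sorted(support.items(), key=lambda kv: (-kv[1], kv[0]))
```

With this toy data, cleaning and x-ray appear together in half of the visits, which is the kind of plain-language answer a small office could act on.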
Paper or spreadsheet?
There was a time when all accounting was done on paper. Then spreadsheet software came along. Accountants intimidated by the complexity refused to adopt it. Ultimately, spreadsheet software evolved to become more user-friendly. Suddenly, paper-based accounting was no match. Data mining might represent the next link in this chain of evolution. Analysts currently spend way too much time looking at dashboards and pie charts in the hope of discovering useful facts. Clearly, there are a huge number of permutations to consider, even for low-dimensional data. Automated analysis is the only way to solve this problem. But to unleash the power of data mining for those users, we must make it a lot more affordable and more accessible. We also need to take the time to better explain the concepts behind automated analysis, or accept our fate as experts in a shrinking field.
Kohonen maps – huh?
As an example, self-organizing maps are both powerful and reasonably easy to understand. We can describe them as “Kohonen maps”, and explain that they project records onto a 2D map while trying to preserve the topological properties of the high-dimensional data. Or perhaps we can call them “similarity maps”, and explain that they make it easy to visualize which customers are similar to others. Better still, since we are in the Age of Google, it makes sense to let users search the map and see where matching records land. Yet how many data mining packages present self-organizing maps this way? From what I’ve seen, most are more interested in offering different variants of Ward’s clustering method. I know what it means, but wouldn’t mention it at my wife’s Christmas party.
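For readers who want to see the “similarity map” idea in code, here is a minimal sketch of a self-organizing map in plain Python. The customer names, the two features, the 4x4 grid, and the learning-rate and radius schedules are all assumptions chosen for illustration; this is not how any particular product implements it.

```python
import math
import random

# Hypothetical customers: (visits per year, yearly spend), both scaled to 0..1.
customers = {
    "ann": (0.10, 0.12), "bob": (0.15, 0.08), "cat": (0.08, 0.15),
    "dan": (0.90, 0.88), "eve": (0.85, 0.92), "fay": (0.92, 0.85),
}

def best_cell(grid, rec):
    """Return the map cell whose weight vector is closest to the record."""
    return min(grid, key=lambda cell: sum((w - x) ** 2
                                          for w, x in zip(grid[cell], rec)))

def train_som(records, rows=4, cols=4, epochs=200, seed=0):
    """Train a tiny map; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(records[0])
    grid = {(r, c): [rng.random() for _ in range(dim)]
            for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs                # goes from 1 down towards 0
        rate = 0.5 * frac                          # learning rate shrinks over time
        radius = 0.5 + max(rows, cols) / 2 * frac  # neighbourhood shrinks too
        for rec in rng.sample(records, len(records)):
            br, bc = best_cell(grid, rec)
            for (r, c), weights in grid.items():
                # Cells near the winning cell on the grid are pulled harder.
                d2 = (r - br) ** 2 + (c - bc) ** 2
                pull = rate * math.exp(-d2 / (2 * radius ** 2))
                for i in range(dim):
                    weights[i] += pull * (rec[i] - weights[i])
    return grid

grid = train_som(list(customers.values()))

# "Searching the map": look up the cell where each customer lands.
placement = {name: best_cell(grid, vec) for name, vec in customers.items()}
```

Printing `placement` shows where each customer sits on the grid. The intent is that low-spend and high-spend customers end up in different regions of the map, which is the picture a business user actually cares about.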
There is a revolution ahead. It will take time to find the right way to present data mining techniques to regular business users. It will also be difficult for us to put aside our love for and knowledge of algorithms to better educate business users about the concepts and benefits. But it’s only then that we will be able to grow data mining into something more than what it is today: a highly specialized field meaningful only to a few priests. What do you think? Can data mining become accessible to a broader base of users? And what will it take to get there?
Dominic Pouzin works for Data Applied, a business intelligence and data mining company. You can contact him at dominicp[at]data-applied[.]com.