Can we automate data mining?

That’s a big question! Back in 2006, we started the discussion on Data Mining Research with a post about the book Java Data Mining. We were fortunate to get opinions from experts and one of the book’s authors. In 2010, we continued the discussion about specific aspects of data mining which could be automated.

Recently, I re-launched the debate on the Swiss Association for Analytics. However, I think it is worth a dedicated blog post. In order to answer this big question, we need to analyze the different phases of data mining and estimate which ones can be automated. For this purpose, I have chosen the CRISP-DM methodology (I guess any other data mining process would lead to similar conclusions).

Business understanding

In this critical step, we transform a business problem into a data mining one. We need to understand what should be solved and why. Answers will lead to the following steps. It is clear that this step cannot be automated for a new project. The data miner has to interact with experts to define the data mining problem to solve.

Data understanding

This step consists of understanding the data, the way they have been collected, their particularities, etc. Again, the data miner works in collaboration with field experts to derive knowledge useful for preparing the data (next step). This is a manual task that cannot be automated.

Data preparation

In this step, we transform raw data into meaningful information to mine. An example is outlier detection (and removal). Some companies argue that their tools can automate this step. This is true to a certain extent, but there are limitations. Here is a simple example: what is the threshold for the variable “age” to be an outlier? 100, 110, 150 years old? This is problem dependent. The same issue happens for missing values. Detecting them is often straightforward, but deciding on the action to take needs manual intervention.
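To make the “age” example concrete, here is a minimal sketch in Python. The function name `label_ages` and the parameter `max_age` are hypothetical and chosen only for illustration. The detection itself is trivial to automate; the threshold, and what to do with the flagged records, is not.

```python
from typing import List, Optional, Sequence


def label_ages(ages: Sequence[Optional[float]], max_age: float) -> List[str]:
    """Label each age as "ok", "missing" or "outlier" for a given threshold."""
    labels = []
    for age in ages:
        if age is None:
            # Detecting a missing value is easy; choosing the action is not.
            labels.append("missing")
        elif age < 0 or age > max_age:
            labels.append("outlier")
        else:
            labels.append("ok")
    return labels


ages = [34, 101, None, 67, 120, -1]

# The same records give different outliers depending on the chosen threshold.
print(label_ages(ages, max_age=100))
print(label_ages(ages, max_age=110))
```

With a threshold of 100 the person aged 101 is flagged, while with 110 the record is kept. Which of the two is right depends on the problem, for example a retirement home versus an online shop.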

Another important aspect of data preparation is feature selection and extraction. While selection can be automated, extraction (through aggregation) needs understanding of the data. Finally, any data mining tool can automate target variable detection. However, the final choice is left to the data miner, who knows the business problem to solve.

Modeling

This step is where we apply modeling algorithms to processed data. Among others, it involves selecting a data mining algorithm and tuning its parameters. This is certainly the task that can be the most easily automated. Some vendors claim that their tools can automate the model building process. The concept of testing several algorithms with different sets of parameters (tuning) can be automated to a certain extent. However, it supposes that there are enough data, that the choice of the algorithm is not business dependent (which is usually not the case) and that the evaluation criterion is known (see below).
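As an illustration of what “testing several algorithms with different sets of parameters” can look like, here is a minimal sketch in Python using only the standard library. It is not taken from any vendor tool. The candidate models (a naive forecast and moving averages), the window sizes and all function names are assumptions made for the example. Note that the caller has to supply the evaluation criterion: the search loop is automated, the choice of criterion is not.

```python
from statistics import mean
from typing import Callable, Dict, Sequence, Tuple

Forecaster = Callable[[Sequence[float]], float]  # history -> one-step forecast
Criterion = Callable[[Sequence[float], Sequence[float]], float]


def naive_last(history: Sequence[float]) -> float:
    """Forecast the next value as the last observed value."""
    return history[-1]


def moving_average(window: int) -> Forecaster:
    """Build a forecaster that averages the last `window` observations."""
    def forecast(history: Sequence[float]) -> float:
        return mean(history[-window:])
    return forecast


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def evaluate(model: Forecaster, series: Sequence[float], n_test: int,
             criterion: Criterion) -> float:
    """Forecast each of the last n_test points from the history before it."""
    actual, predicted = [], []
    for i in range(len(series) - n_test, len(series)):
        predicted.append(model(series[:i]))
        actual.append(series[i])
    return criterion(actual, predicted)


def search(series: Sequence[float], n_test: int,
           criterion: Criterion) -> Tuple[str, Dict[str, float]]:
    """Try every candidate and return the name of the one with the lowest score."""
    candidates: Dict[str, Forecaster] = {"naive_last": naive_last}
    for window in (2, 3, 5):  # the parameter grid is itself a human choice
        candidates[f"moving_average(window={window})"] = moving_average(window)
    scores = {name: evaluate(model, series, n_test, criterion)
              for name, model in candidates.items()}
    return min(scores, key=scores.get), scores


series = [10, 12, 11, 13, 12, 14, 13, 15, 14, 16]
best_name, scores = search(series, n_test=4, criterion=mean_absolute_error)
print(best_name)
print(scores)
```

The loop happily picks a winner on ten data points, which is exactly the caveat above: the automation works mechanically, but it says nothing about whether there are enough data or whether the candidates and the criterion fit the business problem.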

Cross Industry Standard Process for Data Mining (CRISP-DM)

Evaluation

In order to validate our data mining results, we need evaluation criteria. Although applying a criterion can be automated and different modeling algorithms can be compared, the choice of the criterion may be business dependent. In the case of forecasting, for example, different evaluation criteria exist such as Root Mean Square Error (RMSE), Mean Absolute Error (MAE) and Mean Absolute Scaled Error (MASE). If we compare different forecasting algorithms on the same time series, we can use RMSE. If the goal is to compare results across different time series, MASE is more appropriate because it does not depend on the scale of the series. This is business dependent and thus difficult to automate.
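The difference between these criteria is easier to see with a small sketch. The Python code below implements RMSE, MAE and the non-seasonal form of MASE (the forecast MAE divided by the in-sample MAE of the naive “previous value” forecast). The toy numbers are made up for illustration.

```python
from math import sqrt
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root Mean Square Error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean Absolute Error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mase(actual: Sequence[float], predicted: Sequence[float],
         training: Sequence[float]) -> float:
    """Mean Absolute Scaled Error (non-seasonal variant).

    The forecast MAE is divided by the MAE that the naive forecast
    (repeat the previous value) achieves on the training series.
    """
    naive_errors = [abs(training[i] - training[i - 1]) for i in range(1, len(training))]
    scale = sum(naive_errors) / len(naive_errors)
    return mae(actual, predicted) / scale


# A toy series, and the same series expressed in units 100 times smaller.
training, actual, predicted = [10, 12, 11, 13], [12, 14], [13, 13]
training_x100 = [100 * v for v in training]
actual_x100 = [100 * v for v in actual]
predicted_x100 = [100 * v for v in predicted]

print(rmse(actual, predicted), rmse(actual_x100, predicted_x100))
print(mase(actual, predicted, training), mase(actual_x100, predicted_x100, training_x100))
```

RMSE grows with the scale of the series, so it is only comparable between algorithms applied to the same series. MASE stays the same when the series is rescaled, which is what makes it usable across different time series.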

Deployment

In this phase, the goal is to transform our proof of concept or prototype into an industrialized solution. This step involves transforming our “one shot” project into a solution that can work with as few manual interventions as possible. Although standards such as Predictive Model Markup Language (PMML) are appearing, this step still requires manual intervention. Questions such as where and how to integrate our data mining process within an overall solution/tool need to be explored.

In conclusion, we have seen that most data mining steps from the CRISP-DM methodology cannot be automated and need manual intervention. Data preparation and modeling, to a certain extent, could be automated. However, as data mining professionals know, most of the effort in a data mining project concerns business and data understanding. Here is an excellent metaphor from Berry and Linoff (re-explained by David S. Coppock):

The camera can relieve the photographer from having to set the shutter speed, aperture and other settings every time a picture is taken. This makes the process easier for expert photographers and makes better photography accessible to people who are not experts. But this is still automating only a small part of the process of producing a photograph. Choosing the subject, perspective and lighting, getting to the right place at the right time, printing and mounting, and many other aspects are all important in producing a good photograph.

What about you? Do you think we can automate data mining?

For your information, here are other posts related to this topic:

Automation Will Change Data Science Beyond Recognition
Data Scientists Will Not Be Replaced by Automation
Data Scientist Scarcity: Automation Is the Answer
Data Mining Automation

14 comments on “Can we automate data mining?”

  1. I agree that the only parts of data mining which are likely to be automated are data preparation and modeling. Many aspects of data mining depend on what problem we want to solve, and that decision requires human intervention.

  2. Hello Sandro.

    As you know, we may disagree on the overall topic, but I see in your post that you often refer to the fact that the ‘data miner’ needs to collaborate with the field experts. So, you write the conclusion yourself: what is absolutely required are the field experts!!! If equipped with the correct tools, the field experts can be self-sufficient and do not need a data miner.
    Where the data miners are absolutely required is when the field experts, rightly equipped with state-of-the-art automated tools, find things that seem to be wrong or counter-intuitive: only then do they need to refer to a data mining guru or black belt who can decipher the origin of the problems.

    Data miners are only required when automation fails, and there are plenty of techniques to recognize when automation fails, so it is a solvable problem ;-).

  3. This post raises a very important question about data mining. Consider the activities involved, from initiation through to prediction: problem understanding, data collection, data preprocessing, algorithm selection, analysis and prediction. None of these phases is trivial enough to automate. At the very start, understanding the problem to be analysed requires a human. What is obvious to a human brain, such as the fact that a crying baby needs to be fed, may not be obvious to a machine. The machine will start analysing the sound, pitch and intensity of the crying signal. Every running company has a goal of increasing its profits and revenues, but not every company’s product or service is the same. Unless someone understands what the company’s business is, it would be unwise to take the data it has and analyse it. However, this business understanding can be assisted by tools for efficiency and speed. For example, the company’s Excel records or metadata can help a data analyst get a good understanding within a specific time frame.

    With current advancements in technology, there are numerous applications that collect data automatically. However, this huge volume of collected data cannot be fed directly into analysis. Even if it were, it would not provide insights and would fail to reveal useful, informative patterns. The automatically collected data needs to be sampled or summarized in a certain manner before analysis. The sampling and summarization techniques involve tuning many parameters, which requires human judgement.

    Once the business objective is known and the data is collected, most data miners put their effort into cleaning the data, which is inherently dirty: data that is missing, incorrect, too sparse, or junk. Identifying these characteristics requires data analysts. Once the data characteristics are known, further data can be cleaned automatically using certain rules, but there is no guarantee that unseen dirty data will not get through to the analysis phase. Imagine the different categories of people who come to a hospital for treatment. They are of different ages, with different diagnoses (maybe multiple), receiving different treatments and holding different medical insurance. For a given analysis, which part of this data is essential and which part can be chopped or cleaned off is a decision that only a data analyst can make.

    With the data in hand, it is also critical to decide which mining algorithm will give the best results. It is not always the accuracy of the algorithm that dictates whether the analysis is successful; it is the problem at hand that drives data mining success. Is it possible to meet the business objective using a certain algorithm? There are so many classification, clustering and prediction algorithms; knowing which one will meet the business expectations requires in-depth knowledge of the data and the predictions it can deliver.

    Of course, there was a time in history when people never imagined that they could talk to or watch others on another continent, and this is possible today. So it may be possible down the line that data mining can be fully automated. But again, computers will need a significant amount of training to get the job done, for which we will surely need human data analysts for many years.

  4. I reckon that data mining can be automated (in fact, I believe that it MUST be automated to reach its original goal of profitability) but the key to success is to use more specific tools instead of more generalistic, flexible, highly configurable tools.

  5. @Erik: As you said a few years ago, it’s not funny if everyone agrees 🙂 I think the risk if you only call a data miner when results are “counter-intuitive” is that you may produce bad results that are not counter-intuitive. Let me take an example in finance. You use a data mining algorithm for stock picking. You want to back-test it over the last 10 years, so you select several stocks for that. No tool is (yet) able to tell you that you should not select current stocks only, but also stocks that were available for stock picking in the past 10 years and have since disappeared (company bankrupt, for example). The results you will obtain are not representative of a real system, yet they are not counter-intuitive, just better than what you could achieve in reality.

    Again, I’m not against data mining automation, I just think we should be careful if non-experts try to solve data mining problems.

    @Meher: Thanks for your input. You highlight an interesting point about the fact that it may be automated in the future: I definitely agree and I just think that we are at the very beginning of data mining automation.

  6. In SPSS my typical stream has 3 steps:
    Import data –> Sample –> Train CHAID Decision Tree –> Check Lift –> Export

    If they add integrated node to Modeler 16 then DM is automated and discussion ends 🙂

  7. Thanks for your comment Jozo. Data aggregation is a key part that is difficult to automate. In your process you assume that data are ready when you obtain them, but this is often not the case. Also, the sampling type will depend on your business problem. Regarding the decision tree, what do you penalize more: false positives or false negatives? This is also business dependent and thus difficult to automate.

  8. Yes, we can and it is there already:

    http://is.gd/H5PyNp

    Self-organizing inductive modeling has a proven history of more than 30 years. It is highly automated and thus solves the core goal of knowledge extraction from data: objectivity of the results. It automates dimension reduction, feature selection and extraction, modeling (i.e., parameter AND model structure identification), noise filtering and validation to avoid overfitted models, and model description (by a self-organized equation).
    Problem description, data collection, and parts of data preprocessing cannot be automated, however.

  9. Whether you need plenty of mining models (e.g. for lots of products) or you suffer from considerable concept drift in the underlying data: it’s just a question of time before you are swamped with work.
    Indeed you can automate most of the standard process: data preparation, model training and retraining, model assessment and score generation. With the right processes in place, it is even possible to generate new models for new products without any user interaction. And these processes can also keep an eye on your existing models, evaluate them regularly and retrain them automatically. Feel free to contact me if you need more information.

  10. I agree that automating the modelling phase is possible, but I do not find many software packages that will automatically select a modelling technique and induce models using that technique. This would be truly automating the modelling phase, in my opinion. Oracle Data Miner comes close but always uses SVM for classification problems, when it might not always be the best technique. I would appreciate pointers to any more research work in this area from commentators.
    Thanks

Comments are closed.