Data Science automation has been a hot topic recently, with several articles published about it. Most of them discuss the so-called “automation” tools. Too often, vendors claim that their tools can automate the Data Science process. This gives the impression that combining these tools with a Big Data architecture can solve any business problem.
The misconception comes from confusing the whole Data Science process with the sub-tasks of data preparation (feature extraction, etc.) and modeling (algorithm selection, hyperparameter tuning, etc.), which I call Machine Learning. This issue is amplified by the recent success of platforms such as Kaggle (www.kaggle.com) and DrivenData (www.drivendata.org). Competitors are provided with a clear problem to solve and clean data. Choosing and tuning a machine learning algorithm is the main task. Participants are evaluated using metrics such as test set accuracy. In industry, data scientists are evaluated on the value they add to the business rather than on algorithm accuracy. A project with 99% classification accuracy that isn’t deployed in production brings no value to the company.
CRISP-DM methodology (source: www.crisp-dm.eu).
I recently read how the winner of a Kaggle competition, Gert Jacobusse, spent his time solving the challenge: “I spent 50% on feature engineering, 40% on feature selection plus model ensembling, and less than 10% on model selection and tuning”. This is very far from what I have experienced in industry, where the split is usually closer to 10% for data preparation and modeling and 90% for the rest. I will explain below what I mean by “the rest”. People with no industry experience who read news about Data Science competitions and tools that automate Data Science may be confused and conclude that Data Science is only modeling and can be fully automated.
On my blog, I listed the different Data Science steps and discussed the ones that can be automated. The most complex and time-consuming tasks, such as defining the problem to solve, getting data, exploring data, deploying the project, debugging and monitoring, can’t be fully automated. This is without mentioning the iterative aspect of the whole process (see the CRISP-DM figure). In a recent study from MIT, researchers said their tool bested more than 600 teams out of 900. What was the benchmark used? Clearly defined, closed-world problems from Kaggle competitions. Such challenges don’t represent the heart of data scientists’ activities. It’s not that the available tools are useless. On the contrary, they can free up time for the data scientist. Still, they don’t automate Data Science.
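To make concrete what kind of step such tools can take over, here is a minimal sketch of automated hyperparameter tuning in plain Python. Everything in it is an assumption chosen for illustration: the synthetic one-dimensional data, the k-nearest-neighbors model, the candidate values of k, and the helper names (`make_data`, `knn_predict`, `tune_k`). It is not the method of any tool mentioned in this article.

```python
import random

def make_data(n, rng):
    """Synthetic 1-D two-class data: class 0 near 0.0, class 1 near 1.0 (illustrative only)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        data.append((rng.gauss(float(label), 0.6), label))
    return data

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (k is assumed odd)."""
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in neighbors)
    return 1 if 2 * votes > k else 0

def accuracy(train, held_out, k):
    """Fraction of held-out points whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in held_out)
    return hits / len(held_out)

def tune_k(train, valid, candidates):
    """The automatable step: score every candidate and keep the best one."""
    scores = {k: accuracy(train, valid, k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores

rng = random.Random(0)  # fixed seed so the sketch is repeatable
train, valid = make_data(200, rng), make_data(100, rng)
best_k, scores = tune_k(train, valid, candidates=[1, 3, 5, 9, 15])
print("best k:", best_k, "validation accuracy:", scores[best_k])
```

The loop in `tune_k` is mechanical, which is why it automates well. Nothing in it decides which problem is worth solving, where the data comes from, or how the result reaches production.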
Don’t get me wrong: Kaggle and the like are really good places to start learning about Machine Learning algorithms, and they will certainly improve your feature engineering and modeling skills. However, you won’t learn the main aspects of Data Science within these competitions: business problem definition, data gathering and cleaning, deployment, stakeholder management, email communications, presentation skills…well, “the rest”. A recent article mentions that Data Science will be automated within a few years. Machine Learning, as defined above, can be automated to a certain extent. A good example is the meta-mining framework described by Phong Nguyen in the Swiss Analytics Magazine #6. However, we are far from automating the whole Data Science process. Even for Machine Learning, we need specialists to develop new algorithms adapted to our business challenges, people who will move the field forward. Here is an interesting metaphor from Berry and Linoff, relating Data Science to photography:
“The camera can relieve the photographer from having to set the shutter speed, aperture and other settings every time a picture is taken. This makes the process easier for expert photographers and makes better photography accessible to people who are not experts. But this is still automating only a small part of the process of producing a photograph. Choosing the subject, perspective and lighting, getting to the right place at the right time, printing and mounting, and many other aspects are all important in producing a good photograph.”
The main reason Data Science is difficult to automate is that business challenges are, by definition, ill-posed, open-world problems. To the often-asked question “Will machines replace Data Scientists?”, my answer is “Yes, just after all the other jobs in the world”.
A first version of this article was published on KDnuggets (www.kdnuggets.com).
 Explained by David S. Coppock