Data Analytics for Internal Audit

This is a guest post from Marcel Baumgartner, Data Analytics Expert at Nestlé S.A.


Large publicly listed companies not only have external auditors who check the books, but often also a large community of internal auditors. These employees provide the company with a sufficient level of assurance in terms of adherence to internal and external rules and guidelines. This covers financial aspects (spend, invoices, investments, …), human resources (working time, payroll, …), and also production-related aspects (e.g. food safety and quality).

One of the strongest trends observed in internal auditing communities is the increasingly widespread use of Data Analytics. The term refers to the use of data, statistical methods and statistical thinking as a way of working, in addition to traditional auditing methods like interviews, document and process reviews, etc. This trend is naturally not limited to audit: many other business processes now rely more and more on data-driven decision making, which manifests itself in the buzz around “Big Data”, “Data Science” and “Business Analytics”.

In this article, we describe different approaches to ensure that Data Analytics is used efficiently in a large company for controlling and internal audit.

The Promise of Data Analytics

The main promise of data analytics is coverage. While 10 or 15 years ago it was necessary to draw a sample of financial documents in order to find potential issues, this is typically not needed anymore. Modern business software solutions (e.g. those from SAP) allow the extraction of all financial documents, and therefore provide the basis for an exhaustive analysis of all of them. The efficiency gain is huge.

Line of Defense

Internal audit is considered to be the 3rd line of defense in most companies[1]. The 2nd line is typically provided by the functions (e.g. Sales, Marketing, Finance, Supply Chain, Manufacturing, …), who build compliance by design into processes and software solutions. In addition, companies may have an Internal Control department, which provides top-down control mechanisms and analyses. The 1st line of defense is operational management: through a cascading structure, they implement the control procedures and supervise the employees. Data Analytics is particularly relevant for the 2nd and 3rd lines of defense.

Top-down vs. Bottom-up

The analytics that are carried out to identify potential issues in business processes can be run top-down or bottom-up. Top-down refers to a global or regional organization that runs scripts on all categories and geographies on a regular basis (monthly, quarterly), for selected, high-risk processes. They then share the outcome with the local process owners and ensure that proper actions are taken. Internal control organizations typically work like this. The advantage is clearly that all the risky processes are covered globally, and therefore the level of assurance is rather high. However, the nature of these checks also makes them somewhat “simple”: the algorithms behind such top-down checks typically don’t use advanced statistical techniques, and are essentially rule-based. The objective is to provide a clearly understandable output, whilst reducing the false positive rate as much as possible.
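As a toy illustration of such a rule-based check (all column names and figures below are made up for the sketch, not taken from an actual system), a script might flag documents that share a vendor and amount — a classic pattern for duplicate payments:

```r
# Hypothetical rule-based check: flag potential duplicate invoice payments
# (same vendor and same amount). The data frame is illustrative only.
invoices <- data.frame(
  doc_id = 1:6,
  vendor = c("V1", "V1", "V2", "V2", "V3", "V3"),
  amount = c(100, 100, 250, 300, 80, 80)
)

# Rule: more than one document with the same vendor/amount pair
key     <- paste(invoices$vendor, invoices$amount)
flagged <- invoices[key %in% key[duplicated(key)], ]

nrow(flagged)  # 4 documents need a closer look
```

The output of such a rule is easy to explain to a local process owner, which is exactly why top-down checks tend to stay this simple.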

Bottom-up data analytics refers to scripts and algorithms that are run by internal auditors, ad hoc, within the scope of their audit mission. With such a framework, a company can develop more sophisticated scripts, using modern statistical methods like clustering and classification, or using graph networks, in order to find issues that nobody has seen before. But clearly, it is difficult to do this at a global or regional level.

At Nestlé, we do both: more top-down for the 2nd line of defense, and more bottom-up for the 3rd line of defense.

From “Data to Analytics” to “Analytics to Data”

The community of statisticians, data miners and data scientists has always worked under the principle “Data to Analytics”. To run, for example, a clustering algorithm in R or SAS, the data first need to be obtained from the source system in the form of a text file, and then imported into the statistical software. This naturally has limitations as soon as the data sources become too large. It also takes time to download, prepare and upload the data, and this is rarely done in one single iteration.

In recent years, new technology has made it possible to run the analytics directly on the source database, with immense performance improvements: in these “in-memory” systems (like SAP HANA), an algorithm that previously ran for 24 hours can complete in a few minutes. At Nestlé, this journey has started: we have proven the feasibility of, for example, moving data in memory from the source SAP HANA system to an R server, and then sending the results back. Very little time is spent on the data transfer, and there is no more need for downloads and interfaces: we have a seamless integration.

However, this generates other issues. Complex analytical algorithms now suddenly run directly on the live system, potentially impacting business operations. Additionally, it is not allowed to develop algorithms directly on the live database: the code first needs to be written and tested on development systems, and then carefully transported into the production environment. However, the test systems don’t contain real data. These algorithms do generate false positives, and their rate cannot be estimated using test data. The truth about the performance of an algorithm is only revealed on real data — but by then it is too late to adapt it, and the development cycle starts again.

Therefore, the IS/IT organizations in such companies will need to develop other processes to ensure that data scientists can develop their code efficiently, running short cycles of development, testing and correction on systems that use real data and have similar computing performance.

Bottom-Up Is Driving Innovation

The internal audit organization at Nestlé has built its data analytics strategy strongly around the bottom-up approach, also referred to as “Self-Service Analytics”: internal auditors are empowered via training, coaching, support and software solutions to run most of their analytics on their own.

The internal audit organization at Nestlé has a Data Analytics team whose mission is to provide the framework that lets internal auditors take full advantage of the data the company has. The team works in close collaboration with the auditors, not only to coach and train them and provide clear documentation, but also to innovate and develop new scripts regularly.

In recent years, we have been able to generate valuable insights into financial and food production processes through the use of statistical and graphical methods. Here are two examples:

  • The financial documents behind the Accounts Payable (we buy materials and services from suppliers, are invoiced, and we pay) and Accounts Receivable (we obtain orders from customers, we invoice them, and we get paid) processes need deep controls and investigations. During an entire year, a mid-sized business can generate hundreds of thousands of such documents. Internal control has developed numerous rule-based methods to identify documents that need investigation. In order to find red flags that are not identified by the rules, we have started to use bottom-up hierarchical clustering. The difficulty was to develop the dissimilarity matrix between the financial documents, as these documents are characterized by both categorical (type of document, who created it, by which process, …) and numerical variables (the amount, debit or credit). Gower’s metric, as available in the function “daisy” of the R library “cluster”, solved this for us. Now internal auditors have a sample of, say, 50 documents out of 100’000 that are markedly different from all the “normal” ones, and they can start building the story of each document and eventually decide whether it is a real red flag or a false positive. If something went wrong, chances are high that the issue will become visible in the sample.
  • We know how much of a specific semi-finished product H1 was consumed in the finished product F1, and we also know how much of the raw material R1 was consumed in H1. We have this information for all pairs of products in a given factory, over a period of, say, one year. The question now is: can we develop a more efficient algorithm to determine which finished materials use a specific raw material? Representing the data as a graph network helped us greatly. The nodes are the materials, and the edges are given by the consumption matrix described above. Within this graph, we can then find the shortest paths between raw and finished materials, and therefore obtain exactly the information we are looking for. In the illustration below, R1 is used in F4, as there is a path from R1 through H1 and H2 to F4:
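The clustering step of the first example can be sketched in R as follows. The data frame below is entirely made up — real financial documents carry far richer attributes — but it shows how “daisy” handles the mix of categorical and numerical variables:

```r
# Hierarchical clustering of (fictitious) financial documents using
# Gower's dissimilarity for mixed-type variables.
library(cluster)

docs <- data.frame(
  doc_type   = factor(c("invoice", "invoice", "credit", "invoice", "credit")),
  created_by = factor(c("batch", "manual", "manual", "batch", "batch")),
  amount     = c(120.5, 98.0, -45.0, 110.0, -5000.0)
)

d  <- daisy(docs, metric = "gower")  # handles categorical + numerical columns
hc <- hclust(d, method = "average")  # agglomerative (bottom-up) clustering

# Cut the tree and inspect the small clusters: candidates for "markedly
# different" documents that deserve a manual review.
groups <- cutree(hc, k = 2)
table(groups)
```

In practice, the cut height and the number of clusters are tuning decisions; the auditors review the outlying clusters, not the bulk.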

The R package “igraph” provides all the necessary functions to build the graphs, and to run the analytics (shortest path, distance, degrees of nodes, …).
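As a minimal sketch of the second example, the path from R1 to F4 described above can be found like this (the edge list is illustrative, standing in for the real consumption matrix):

```r
# Bill-of-materials graph: nodes are materials, directed edges follow
# the consumption relationships (raw -> semi-finished -> finished).
library(igraph)

g <- make_graph(c("R1", "H1",
                  "H1", "H2",
                  "H2", "F4",
                  "H1", "F1"),
                directed = TRUE)

# Is raw material R1 used in finished product F4? Yes, if a path exists.
p <- shortest_paths(g, from = "R1", to = "F4")$vpath[[1]]
as_ids(p)  # "R1" "H1" "H2" "F4"
```

Running the same query over all raw/finished pairs yields the full usage table in one pass.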


Data-driven controlling and auditing will further accelerate, there is no doubt about this. However, there are challenges. One is the needed mindset change of IS/IT organizations, to ensure that analytics can be developed and tested much faster than traditional change management processes allow. The other challenge is the human resources: the statisticians and data scientists who develop the routines. These people not only need a deep knowledge of statistical methods, they also need to understand the business needs, i.e. the ability to translate a controlling idea into a piece of code that works and is relevant. Outsourcing the coding might not be an option, so mastery of programming languages like R, SAS, SQL, … to build the algorithms is also needed. The competition for these people will be fierce in the coming years. Those companies that provide them with interesting problems, broad access to data and an environment where they can collaborate across functions and geographies will succeed.

About the author

Marcel Baumgartner has worked for Nestlé since 1994, in its headquarters in Vevey, Switzerland. Nestlé is the world’s leading Nutrition, Health and Wellness company. He has a diploma in applied mathematics from the École Polytechnique Fédérale de Lausanne (EPFL), Switzerland, and a master’s degree in Statistics from Purdue University in West Lafayette, IN, US. He was the global lead for Demand Planning Performance and Statistical Forecasting, and since November 2014, he focuses on providing Data Analytics capabilities for internal auditors. He is also the president of the Swiss Statistical Society.

[1] See this paper for more details:

This article was originally published in the Swiss Analytics Magazine.


The academic tip: What is Deep Learning?

This is a guest post from Jacques Zuber, Data Science Teacher at HEIG-VD.

Deep learning, also called hierarchical learning, is now a popular trend in machine learning. Recently, during the Swiss Analytics Meeting, Prof. Dr. Sven F. Crone presented how deep learning can be used in industry from a forecasting perspective (beer forecasting for manufacturing, lettuce forecasting in… Continue reading...


Interview of Jerome Berthier, Head of BI and Big Data at ELCA

Data Mining Research (DMR): Can you tell us who you are and how you came to the field of Data Science?

Jerome Berthier (JB): My name is Jerome Berthier, I am an engineer in Computer Science and I have an MBA in management. After 10 years working in different roles for an IT provider (developer, sales representative, managing director), I joined… Continue reading...


Will Data Scientists be Replaced by Machines?

Data Science automation has been a hot topic recently, with several articles about it[1]. Most of them discuss the so-called “automation” tools[2]. Too often, editors claim that their tools can automate the Data Science process. This gives the impression that combining these tools with a Big Data architecture can solve any business problem.

The misconception comes from the confusion between the whole Data Science process… Continue reading...


Data Science Book Review: Statistics Done Wrong

If you read this blog, you are very likely involved in some kind of data collection, manipulation or analysis. When not performed wisely, your analysis will lead you to incorrect conclusions. Alex Reinhart, in his book Statistics Done Wrong, lists several concepts that are key when analysing data, such as statistical power, correlation/causation and publication bias.

Data Science Book Review: Superforecasting

Superforecasting – by Tetlock and Gardner – explains the huge study performed by Tetlock on the ability of people to predict future events (mainly geo-political). The closed questions (i.e. a choice between yes and no) are far from the real numbers you predict in business forecasting. Tetlock discusses skills that have been identified as driving accurate forecasts. The point of the authors is that forecasting… Continue reading...


What Could Big Data Mean for Debt Management?

This is a guest post from Yaakov Smith.

Big data is changing the way the financial world handles client interaction. No matter what sector data analytics is employed in (IT, marketing, sales, etc.), its implications are leading to a new wave of Business Intelligence (BI).

Any company that uses analytics on a daily basis will understand the ability of big data to transform customer relations and optimize management… Continue reading...