Big data is enjoying major popularity at the moment, with algorithms and machine learning at the core of almost every business application. This cutting-edge technology utilises massive datasets to run increasingly complex algorithms to make decisions with far-reaching consequences. Big data culture is becoming the norm as companies strive to acquire the business intelligence that comes from predictive models and statistical analyses.

There is great value in this as it allows companies to draw conclusions and use prescriptive statistics to make intelligent business decisions. But at what point does the data start controlling the business user instead of the business user controlling the data? People seem to accept the power of big data at face value because if it was spewed out of a machine, it must be right. Right? Wrong!

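There are inherent errors and weaknesses in most analytical models. Kurt Gödel’s incompleteness theorems offer a loose analogy: they show that any consistent formal system rich enough to express arithmetic contains true statements it cannot prove, so no such system is complete. Analytical models have limits of their own, and unfortunately with big data, the scope of failure is correspondingly larger.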

Here are the three most common underlying causes for issues with big data.

Phantom data

Most numbers we deal with in daily decision making come from massive databases and have been analysed through complex analytical processes before we eventually see them. At first glance, there is no way to tell if these numbers are accurate.

In most cases, the original numbers are punched into a machine by front-line employees, for instance on the manufacturing shop floor. The input data is therefore subject to human error. At the front end, cashiers are still responsible for ringing up the correct bar codes, and stock personnel are still responsible for counting and placing stock correctly. We haven’t outsourced these necessities to machines yet, and consequently errors at this stage of the process can grow into larger discrepancies further up the line and lead to inappropriate purchasing and marketing decisions.

It is therefore critical to control the numbers going into any system. The GIGO (garbage in, garbage out) principle holds true here: control points and cross-referencing are crucial to any manufacturing or business process to catch human error and ensure the data being fed into the algorithms is accurate. For example, one study found that up to 65 percent of retailers’ inventory records are inaccurate. Phantom stock causes problems in the retail arena: the control system says the stock is there, but for whatever reason (theft, fraud or miscounting) the items are not available when customers request them. This leads to customer dissatisfaction and possible legal problems higher up, as procurement is also affected.
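A cross-referencing control point can be as simple as comparing what the system believes is on the shelf with what a physical count finds. The sketch below is only an illustration: the SKU codes, the quantities and the reconcile function are all hypothetical, and a real process would also need to handle timing differences between the count and the system snapshot.

```python
# Hypothetical figures: what the inventory system records per item...
system_stock = {"SKU-001": 40, "SKU-002": 12, "SKU-003": 7, "SKU-004": 0}
# ...and what a physical shelf count actually found.
counted_stock = {"SKU-001": 40, "SKU-002": 9, "SKU-003": 7, "SKU-005": 3}


def reconcile(system, counted):
    """Return {sku: counted - recorded} for every item where the two disagree."""
    discrepancies = {}
    for sku in sorted(set(system) | set(counted)):
        recorded = system.get(sku, 0)   # items missing from a source count as 0
        physical = counted.get(sku, 0)
        if recorded != physical:
            discrepancies[sku] = physical - recorded
    return discrepancies


discrepancies = reconcile(system_stock, counted_stock)

for sku, difference in discrepancies.items():
    kind = "phantom stock" if difference < 0 else "unrecorded stock"
    print(f"{sku}: {difference:+d} ({kind})")
```

A negative difference is the phantom stock described above: the system says the item is there, but the shelf disagrees.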

Blindly believing the numbers

Data has become so entrenched in our lives that we rely on it almost exclusively for certain functions. A particularly common example is the algorithm governing job performance evaluation. Superiors use the resulting job rating to assess performance, but what they don’t assess is the entire picture. They don’t take extenuating circumstances into consideration, such as an outlier in the data set that has skewed the results. Unfortunately, the powers that be use the end result for evaluation purposes regardless. If it were a person’s judgement, people wouldn’t hesitate to question it, but the results of data analytics often go unchallenged.
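To see how much a single outlier can move a rating, compare the mean with the median for a small set of scores. The scores below are invented for illustration, and the choice of mean versus median is an assumption about how such a rating might be computed, not a description of any particular evaluation system.

```python
from statistics import mean, median

# Hypothetical monthly performance scores; the last month is an outlier
# (for example, a month disrupted by a system outage).
scores = [82, 85, 79, 88, 84, 12]

average_score = mean(scores)    # pulled down sharply by the single outlier
typical_score = median(scores)  # barely affected by it

print(f"mean:   {average_score:.1f}")
print(f"median: {typical_score:.1f}")
```

Anyone judging this employee on the mean alone would see a much weaker performer than five of the six months suggest.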

The school system in America has fallen victim to this type of data misinterpretation. Declining SAT scores sent the school system reeling and prompted dramatic changes to rectify the situation. On closer analysis, the scores were actually subject to Simpson’s paradox. This is a statistical effect in which a trend seen in the combined data differs from, or even reverses, the trend within each subgroup, typically because the mix of subgroups has changed. An overall mean can fall simply because a larger share of the data now comes from a lower-scoring group. In this instance, the causal factor was the growth in the number of disadvantaged children taking the test.
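A small worked example shows how a change in who takes the test can drag the overall average down even when every group improves. All the figures below are invented for illustration; they are not actual SAT data.

```python
# Hypothetical data: (number of test-takers, average score) per group and year.
year_1 = {"advantaged": (900, 520), "disadvantaged": (100, 420)}
year_2 = {"advantaged": (600, 530), "disadvantaged": (400, 430)}


def overall_average(groups):
    """Average across all test-takers, weighting each group by its size."""
    total_students = sum(count for count, _ in groups.values())
    total_points = sum(count * score for count, score in groups.values())
    return total_points / total_students


overall_1 = overall_average(year_1)
overall_2 = overall_average(year_2)

for group in year_1:
    print(f"{group}: {year_1[group][1]} -> {year_2[group][1]}")
print(f"overall: {overall_1:.0f} -> {overall_2:.0f}")
```

In this made-up case each group gains 10 points, yet the overall average drops by 20, purely because the lower-scoring group grew from 10 percent to 40 percent of test-takers.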

It is essential to consider whether your data set has changed in any way before making direct comparisons. A prime example is an apparent increase in the incident rate on the factory floor that turns out to be nothing more than an increase in the reporting of incidents.

Statistical over-fitting

In many businesses, decisions are made based on statistical inferences drawn from past behaviours. This system is inherently flawed, especially where the data set is small enough for a few outliers to skew the outcome considerably.

There is an element of randomness in every data set, which means that the more precisely a predictive model is tailored to past events, the less likely it is to accurately reflect the future.
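The effect is easy to reproduce. The sketch below fits two models to the same invented sales history: a simple straight line, and a polynomial that passes through every past point exactly. The data, the "future" values and the helper functions are all assumptions made for illustration.

```python
# Hypothetical monthly sales: a steady upward trend plus random noise.
months = [1, 2, 3, 4, 5, 6]
sales = [12.5, 13.1, 16.8, 17.2, 20.9, 21.4]

# Assumed later observations, used only to judge the forecasts.
future_months = [7, 8]
future_sales = [24.3, 25.8]


def fit_line(xs, ys):
    """Ordinary least-squares straight line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_exact_polynomial(xs, ys):
    """Lagrange polynomial that passes through every point exactly."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


def mean_abs_error(model, xs, ys):
    return sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(xs)


line = fit_line(months, sales)
poly = fit_exact_polynomial(months, sales)

past_error_line = mean_abs_error(line, months, sales)
past_error_poly = mean_abs_error(poly, months, sales)
future_error_line = mean_abs_error(line, future_months, future_sales)
future_error_poly = mean_abs_error(poly, future_months, future_sales)

print(f"line: past error {past_error_line:.2f}, future error {future_error_line:.2f}")
print(f"poly: past error {past_error_poly:.2f}, future error {future_error_poly:.2f}")
```

The polynomial explains the past perfectly because it has also memorised the noise, and its forecasts for the following months go badly wrong. The straight line fits the past less closely but stays near the assumed future values.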

Don’t fall prey to the numbers game

Even the most complicated, high-end models fail sometimes. And when they do, the results can be catastrophic. With stock market prediction models, for example, there are millions of dollars at stake. As we increasingly rely on machines for the information we use to make decisions, we should do so with open minds. Ask questions, and find out how the figures are calculated and what assumptions are used in the models. To blindly accept numbers is foolhardy. It is imperative to know how the data was collected and how the inferences were drawn, so that informed decisions can be made.

