by Peter Wayner

The 12 dark secrets of data science

Opinion
Apr 30, 2019
Analytics | Big Data | Data Mining

From hidden costs to highly suspect conclusions, data science is not without its drawbacks and limitations, despite the ongoing hype.


Data science is revolutionizing computational fields and providing a foundation for enabling computers to solve problems. From drug design to machine vision, smart algorithms are enriching our lives and sometimes even saving them. But beyond the success stories lies a vast amount of questionable and unreliable work. Everyone who approaches a new collection of data with the job of extracting meaningful insights needs to keep this dark side in mind.

Here are 12 rarely discussed downsides of data science that are obscured by the hype and should be kept in mind when mining data for insights.

Many data science discoveries are obvious

When a bank looked for a way to predict loan defaults, it found that people with no savings were more likely to stop paying their debts. When hospitals looked for causes of doctor error, they found that lack of sleep was a big indicator. Tall people hit their heads more often. Bicyclists die from head injuries more often than couch potatoes.

Many of the problems we study have obvious answers that dominate the analysis. If the goal is to look for causes, well, the analysis is going to produce a mathematical confirmation of what we already know, just with more significant digits. Is that worth the effort?

Statisticians have techniques for controlling for these dominant effects so that smaller ones can be examined, but finding subtle causes can require dramatically more data and study. Is the answer going to be valuable enough to justify this?
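One common way of "controlling for" a dominant effect is simply to include it as a covariate, so the model estimates the smaller effect after accounting for the big one. Here is a minimal sketch using entirely made-up data and variable names (savings as the dominant predictor, commute time as the subtle one); it is an illustration of the idea, not a recipe from any real project.

```python
# Illustrative sketch: estimate a subtle effect after controlling for a
# dominant one. All data and variable names here are invented.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5_000
savings = rng.exponential(10_000, n)      # dominant predictor
commute = rng.normal(30, 10, n)           # subtle predictor
# Default risk driven mostly by savings, weakly by commute time.
risk = -0.0003 * savings + 0.01 * commute + rng.normal(0, 1, n)
default = (risk > np.quantile(risk, 0.9)).astype(int)

# Naive model: the dominant effect swamps everything else.
naive = sm.Logit(default, sm.add_constant(savings)).fit(disp=False)

# Controlled model: include both predictors, so the coefficient on
# commute is estimated *after* accounting for savings.
X = sm.add_constant(np.column_stack([savings, commute]))
controlled = sm.Logit(default, X).fit(disp=False)
print(controlled.summary())
```

Even in this toy example, the subtle coefficient comes with a much wider confidence interval than the dominant one, which is exactly why teasing it out demands so much more data.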

Sometimes there’s nothing there

The human mind is good at finding patterns, even when there are none. Casinos often post the last ten or twenty numbers that came up on the roulette wheel because they know gamblers’ brains love to look for sequences even when results are random. Many of the questions that come to data scientists are meant to validate connections noticed by a human brain. Sometimes there is something there and sometimes there isn’t.

Knowing there is no obvious statistical link is often a valuable result, but it can be unsatisfying. The people who thought there would be an answer think the statisticians missed something and the skeptics can only celebrate an empty victory. Data science can’t prove there are no connections at all, just that the particular analysis could find no pattern that was statistically strong enough. Do you want to spend more to drill more wells looking for a gusher?
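A quick sketch of what the "nothing there" result looks like in practice: test a suspected link on data that is, by construction, pure noise. The variable names and numbers below are hypothetical.

```python
# A minimal sketch of the "nothing there" outcome: test a suspected link
# on data that is actually pure noise. Names and numbers are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
marketing_spend = rng.normal(100, 15, 200)   # suspected cause
weekly_sales = rng.normal(500, 50, 200)      # unrelated by construction

r, p = stats.pearsonr(marketing_spend, weekly_sales)
print(f"r = {r:.3f}, p = {p:.3f}")
# A large p-value doesn't prove there is no connection at all --
# only that this test, on this sample, found no pattern strong enough.
```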

Statistical answers can be harder to find than we think

John Ioannidis used the dramatic title “Why Most Published Research Findings Are False” for his paper explaining how sensitive statistical methods are to noise. When sample sizes are small and bias creeps in, the answers we get are more likely to be outright wrong, he argues.

The solution is more data, sometimes dramatically more. To analyze an effect that isn't obvious, the cost of gathering enough data can skyrocket. But if the effect you're looking for is subtle, the value of understanding it may be just as subtle, or even nonexistent. In large, highly efficient markets such as equity trading, small effects can be valuable, but in many cases they're not worth the effort, given how hard they are to uncover.
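A toy simulation of my own (not from Ioannidis's paper) makes the sample-size point concrete: estimate the same tiny effect with small and large samples and watch how wildly the small-sample estimates swing around the truth.

```python
# Toy illustration of why small, noisy samples mislead: the spread of the
# estimates shrinks only as the sample size grows. Numbers are invented.
import numpy as np

rng = np.random.default_rng(1)
true_effect = 0.05                     # small, real difference in means

def estimate(n):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(true_effect, 1.0, n)
    return treated.mean() - control.mean()

for n in (20, 200, 20_000):
    estimates = [estimate(n) for _ in range(1_000)]
    spread = np.std(estimates)
    print(f"n={n:6d}  typical estimate spread ±{spread:.3f} "
          f"(true effect is {true_effect})")
```

With 20 samples per group, the estimation noise is several times larger than the effect itself, so any single small study is as likely to report nonsense as truth.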

Algorithms imitate the past, not the future

Some fields change so rapidly that data science can't help us predict the future; it can only summarize the past. What can fashion companies do with the knowledge that skinny ties were common in the 1960s, but by the 1970s customers were buying ties as wide as six inches? Smart data scientists can fit a rhythmic function to the oscillation, but that doesn't help once the market fragments, as it had by 2010.

Data science won’t change the underlying dynamics of what we’re studying. It can only reveal what happened before, and we have to guess whether that will help us in the future.
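Here is a hedged sketch of the "rhythmic function" idea above. The tie-width numbers are invented for illustration; the point is only that a curve fit summarizes the old cycle and knows nothing about a market that fragments after the training data ends.

```python
# Fit a periodic curve to made-up historical tie widths, then extrapolate.
# The extrapolation echoes the past cycle; it cannot see a structural break.
import numpy as np
from scipy.optimize import curve_fit

years = np.array([1960, 1965, 1970, 1975, 1980, 1985, 1990, 1995, 2000])
width_in = np.array([2.0, 2.5, 4.0, 5.5, 4.5, 3.5, 3.8, 4.0, 3.75])  # invented

def cycle(t, mean, amp, period, phase):
    return mean + amp * np.sin(2 * np.pi * (t - phase) / period)

params, _ = curve_fit(cycle, years, width_in,
                      p0=[3.5, 1.5, 30.0, 1960.0], maxfev=10_000)
print("fitted 2010 width:", cycle(2010, *params))
```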

Data is often messy, inconsistent or outright corrupted

Financial data may seem like a great fit for analysis because it explicitly involves numerical transactions — but still, it can be messy. On one project, I found that one bank was reporting withdrawals as negative values while another was using positive values and relying on a transaction code to identify the direction. The distinctions between the various fees and monthly charges were even harder to turn into a consistent column in the database.
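A minimal cleanup sketch of the sign problem described above, under assumptions of my own: one feed reports withdrawals as negative amounts, the other reports positive amounts plus a transaction-code column ("D" for debit, "C" for credit). The column names and codes are hypothetical.

```python
# Normalize two inconsistent bank feeds to one convention:
# withdrawals negative, deposits positive. All names/codes are invented.
import pandas as pd

bank_a = pd.DataFrame({"amount": [-40.0, 125.0, -9.99]})
bank_b = pd.DataFrame({"amount": [40.0, 125.0, 9.99],
                       "txn_code": ["D", "C", "D"]})

bank_a_norm = bank_a.assign(signed_amount=bank_a["amount"])
sign = bank_b["txn_code"].map({"D": -1, "C": 1})
bank_b_norm = bank_b.assign(signed_amount=bank_b["amount"] * sign)

combined = pd.concat([bank_a_norm[["signed_amount"]],
                      bank_b_norm[["signed_amount"]]], ignore_index=True)
print(combined)
```

The hard part isn't the code; it's discovering that the two conventions exist in the first place, usually after the first analysis produces nonsense.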

Most topics don’t offer the same simplicity as money. Sensors have glitches. Errors creep into measurements. When even the Olympics can’t build a pool with 8 equal lanes for a fair race despite investing millions of dollars, is there any hope for the rest of us?

The good news is that dramatic effects are easier to find and can overwhelm all the inconsistencies and noise. The bias in the Olympic pool in Rio de Janeiro was large enough and consistent enough that data scientists were able to quantify just how much went wrong with the pool's construction. Alas, that isn't the same as knowing how to fix the pool to make it truly fair, but it's a start.

When data is cheap, filtering is expensive

Some data flows into our computers in endless waves. The log files from web servers overflow with terabytes of information about who wanted which GIF image or which CSS file. Security cameras fill up hard drives with unceasing streams of high-resolution images. When a problem appears, the challenge is not getting the data; it’s finding the right piece of data.

Searching through large collections is something that computers do well — if they begin with a solid model. Building that model is often the job for data scientists. But which comes first? Finding a model for distinguishing a needle from hay? Or finding the needle itself?
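To make the needle-and-hay point concrete, here is a deliberately crude "model" applied to made-up web-server log lines: flag any client whose request count sits far above the median. The log format, threshold, and everything else are assumptions for the sake of illustration.

```python
# A tiny sketch of the "needle from hay" problem: a crude model
# (requests per client far above the median) over invented log lines.
from collections import Counter
import statistics

log_lines = [
    "10.0.0.1 GET /index.html",
    "10.0.0.2 GET /logo.gif",
    "10.0.0.1 GET /style.css",
] + ["10.0.0.9 GET /login"] * 500     # one noisy client

counts = Counter(line.split()[0] for line in log_lines)
baseline = statistics.median(counts.values())

needles = {ip: n for ip, n in counts.items() if n > 10 * baseline}
print(needles)   # the model decides what counts as a needle
```

The filtering itself is cheap; deciding that "ten times the median" is the right definition of a needle is the expensive, human part.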

Human filters are expensive

Several startups have materialized to plow through data and use their human intelligence to build training sets for machine learning algorithms. They classify images, read documents or listen to audio tapes before filling out forms and, hopefully, checking the right boxes in a consistent way. A manager of one firm told me that people in Venezuela were popular gig workers for building AI training sets because they work for pennies.

Data science can’t begin until this preliminary work is done. If you’re lucky, the labeling scheme won’t be too complicated and the humans will produce a good training sample in a manageable amount of time.

Some data is impossible to get

A surprisingly large amount of data is maddeningly elusive. A few months ago, I set out to look at how the population in my neighborhood changed over the past 50 years by downloading U.S. Census data. The bureau shares a staggering amount of data online, but after a week of searching and the help of a good friend who works there, I still couldn’t find how this count has changed over the decades. The numbers are out there. I know it. There are 104 pages of data tables in the bureau’s catalog, but that’s not the same as having them in my spreadsheet.

Many other forms of data just don’t exist. Humans are too busy to fill out surveys, so marketing teams make educated guesses. Cameras seem to be ubiquitous, but their resolution may never be good enough, or they may be pointed the wrong way.

Data science can’t begin until the data is available and often it seems like 99.9 percent of the job is gathering the data in the first place.

Many algorithms teach us nothing

Some of the latest machine learning algorithms can return results with stunning accuracy. If you wonder how they do it, though, no one can tell you. The algorithms stack together thousands or millions of filters and tweak the responses in all of them until the results look good. Understanding what’s going on requires parsing millions of numbers.

These smart classifiers can be useful when the training set is a good representation of the job at hand, but they’re often brittle and unstable. Unless we understand how the algorithms make their decisions, we can’t predict when they might fail as the questions shift.

Hidden biases are everywhere

The world of data science is filled with anecdotes about how bias slipped into a data set despite everyone’s best efforts. In one, a scientist photographed one collection in the morning and the other after lunch. The machine learning classifier ended up locking onto the difference between the morning and afternoon sun and the shadows they cast.

Finding biases like these is difficult, and much of the lab work in science is devoted to isolating experiments from them. But if biases were easy to find and remove, we would do it. Figuring out what to do with the ones that are left is often a bit of an art. Some statistical techniques can correct for biases and remove them from the analysis, but they can’t be counted on to work, nor are they as automatic as we would like.

When we’re done with the work and we’ve identified a signal, we still can’t be sure whether it’s real or an echo of a bias. If the economics are right, we can put the statistical finding to work, where it might be verified by business success; then it won’t matter whether it was a hidden bias or a real truth.

Sometimes there’s always an answer — even if it’s wrong

Nobel Prize-winning physicist Richard Feynman supposedly said, “I saw a car with the license plate ‘ARW 357.’ Can you imagine? Of all the millions of license plates in the state, what was the chance that I would see that particular one tonight?”

Data sets will always yield answers to some questions, like finding the maximum, minimum or average. Most algorithms will generate some solution.

One of the major challenges facing scientists is fighting “p-hacking,” the practice of combing a data set for results that look statistically significant. The nature of randomness means there’s often one somewhere in the data. The tricky part is making sure the answer will stand up over time.
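A small demonstration on pure noise shows why p-hacking is so seductive: test enough random "features" against a random outcome and a handful will clear the conventional p < 0.05 bar by luck alone. Everything below is synthetic.

```python
# P-hacking on noise: count how many of 100 meaningless features look
# "significant" at p < 0.05 purely by chance. All data is synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
outcome = rng.normal(size=500)

significant = 0
for _ in range(100):
    feature = rng.normal(size=500)          # unrelated by construction
    _, p = stats.pearsonr(feature, outcome)
    if p < 0.05:
        significant += 1

print(f"{significant} of 100 meaningless features look 'significant'")
```

Roughly five spurious hits are expected, which is exactly why a single significant-looking result pulled from a long fishing expedition should not be trusted until it survives fresh data.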

Sometimes we’re just curious

Many data science projects yield reports filled with hundreds of pages of charts and graphs examining untold combinations and sub-combinations. This often isn’t a big help to the business managers who asked the question in the first place. They want an answer that will save money.

But sometimes this exploration yields something interesting and maybe even useful. Does it hurt to be curious?