Final Four: March Madness data lessons

Each year, as March Madness narrows to the Final Four and millions of Americans’ brackets go bust, diehards and casual fans alike ask themselves: What can this teach us about data mining?

It’s a natural question. And it’s an especially important one for those who work with large data sets that contain information about inherently non-uniform entities subject to random forces and influences.

You know, like people.

Let’s start at the beginning.

Why we didn’t win a billion dollars from Warren Buffett

As you may recall, in 2014 Warren Buffett offered $1 billion in prize money to anyone who could pick a perfect bracket. Although 15 million of us applied our most diligent (or arbitrary) methodologies, no one successfully predicted the combination of wins and losses across the tournament’s 67 games. In fact, the contest ended long before the Final Four: there were no perfect brackets left in the pool after just 25 games.

Certainly someone could have gotten it all right, whether through clairvoyance or sheer luck…yet the venerable Oracle of Omaha was willing to bet a billion that no one would. And it was a pretty safe bet. According to the NCAA, the longest verified streak of correct picks is 34 games. In their words: “Based on the reporting, we could not find verified brackets that have been perfect into the Sweet 16 (or 48 games) at all.”

More importantly (for our purposes), the odds of randomly picking a perfect bracket are one in 2^67, or roughly one in 147,573,952,589,676,000,000. That matters because, at odds that long, everyone’s methods – no matter how arbitrary or carefully researched – are about equally likely to fall short. Whatever the method, we are all in effect guessing and hoping to get it right.
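The arithmetic is easy to check. Here is a minimal sketch, assuming every game is an independent 50/50 coin flip (a deliberate simplification). The function names and the `ENTRIES` constant are illustrative, not taken from any contest rules.

```python
from fractions import Fraction

GAMES = 67            # games in the tournament, per the figure quoted above
ENTRIES = 15_000_000  # approximate size of the 2014 contest pool (assumed round number)


def perfect_bracket_outcomes(games: int) -> int:
    """Number of distinct win/loss combinations across `games` games."""
    return 2 ** games


def expected_perfect_after(entries: int, games_played: int) -> Fraction:
    """Expected number of coin-flip brackets still perfect after `games_played` games."""
    return Fraction(entries, 2 ** games_played)


if __name__ == "__main__":
    # Exact count of possible brackets; the article rounds this figure.
    print(f"{perfect_bracket_outcomes(GAMES):,}")
    # Expected survivors in a pure coin-flip pool after 25 games.
    print(float(expected_perfect_after(ENTRIES, 25)))
```

Under the coin-flip assumption, a pool of 15 million brackets would be expected to have fewer than one perfect entry left after 25 games, because each game halves the expected number of survivors.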

You could flip a coin for each March Madness matchup and have odds of a perfect streak of, say, 36 games similar to those of someone who made their selections based on teams’ regular-season records and historical Final Four performance. Both of you would have odds similar to those of the person who picked based on the number of letters in a mascot’s name.

Why? Because the individual outcomes we’re trying to predict ahead of March Madness are influenced and determined by factors that inherently defy prediction. A coach’s speech at halftime. The arc of a half-court shot. How well a given player (or referee) slept the night before. Momentum – the undeniable ability of the human psyche to affect performance, either positively or negatively. Innumerable other factors that happen at random, meaning that all that can be known about them is that they may (or may not) occur and may (or may not) materially impact the outcome if they do.

The risks of perceived but unverified patterns in big data

Of course, every year there are brackets that remain perfect far longer than most. An early upset might cut the number from millions to thousands. But a single (and separate) Cinderella story might keep hundreds of those brackets perfect through the next dozen games.

These brackets are outliers – and outliers are always of interest in the sense that they’re rare. But discrete pockets of outliers do not always represent a meaningful or actionable pattern, especially at the scale of large, complex data sets. In many cases they’re not a pattern at all—they’re simply a random concentration of rare instances.

Across millions of March Madness brackets—or millions of healthcare claims and hundreds of thousands of patient health records—there will inevitably be pockets of seemingly similar outliers. Commonalities. Correlations. But how meaningful these connections are can only be determined by validating the model(s) that identify them—through rigorous testing of the hypotheses their existence seems to suggest.
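To make the point concrete, here is a small simulation sketch. It assumes a toy data set in which every feature and every outcome is an independent coin flip, so any "pattern" found in it is noise by construction. All names and sizes are hypothetical, and the approach shown (pick the best-looking feature, then re-test it on fresh records) is only one simple form of validation.

```python
import random


def make_records(rng, n_records, n_features):
    """Return (features, outcomes); every value is an independent coin flip."""
    features = [[rng.getrandbits(1) for _ in range(n_features)]
                for _ in range(n_records)]
    outcomes = [rng.getrandbits(1) for _ in range(n_records)]
    return features, outcomes


def agreement(features, outcomes, j):
    """Fraction of records in which feature j matches the outcome."""
    hits = sum(1 for row, y in zip(features, outcomes) if row[j] == y)
    return hits / len(outcomes)


def discover_and_validate(seed=0, n_features=200, n_discovery=100, n_holdout=5000):
    """Mine a small sample for its best feature, then re-test it on fresh data."""
    rng = random.Random(seed)
    disc_x, disc_y = make_records(rng, n_discovery, n_features)
    # "Discovery": keep whichever feature happens to match the outcome most often.
    best = max(range(n_features), key=lambda j: agreement(disc_x, disc_y, j))
    # "Validation": score that same feature on records it has never seen.
    hold_x, hold_y = make_records(rng, n_holdout, n_features)
    return agreement(disc_x, disc_y, best), agreement(hold_x, hold_y, best)


if __name__ == "__main__":
    found, validated = discover_and_validate()
    print(f"discovery agreement: {found:.2f}  holdout agreement: {validated:.2f}")
```

With hundreds of candidate features and only a hundred records, the best feature will typically look well above chance in the discovery sample. On the fresh records it should drift back toward 50 percent, which is what a random concentration of rare instances looks like once it is tested.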

So for healthcare CIOs and other technology leaders, the question isn’t whether a system or vendor can identify patterns in their data. It’s whether those patterns are subsequently validated using additional, external data and iterative machine learning, techniques that distinguish the random from the real and recurring.
