What March Madness can teach healthcare CIOs and technology leaders about data mining

Each year, as March Madness narrows to the Final Four and millions of Americans' brackets go bust, diehards and casual fans alike ask themselves: What can this teach us about data mining?

It's a natural question. And it's an especially important one for those who work with large data sets that contain information about inherently non-uniform entities subject to random forces and influences. You know, like people.

Let's start at the beginning.

Why we didn't win a billion dollars from Warren Buffett

As you may recall, in 2014 Warren Buffett offered up $1B in prize money to anyone who could pick a perfect bracket. Though 15 million of us applied our most diligent (or arbitrary) methodologies, no one successfully predicted the combination of wins and losses across the tournament's 67 games. In fact, the contest ended long before the Final Four: there were no perfect brackets left in the pool after just 25 games.

Certainly someone could have gotten it all right, whether through clairvoyance or sheer luck… yet the venerable Oracle of Omaha was willing to bet a billion that no one would. And it was a pretty safe bet. According to the NCAA, the longest verified streak of correct picks is 34 games. In their words: "Based on the reporting, we could not find verified brackets that have been perfect into the Sweet 16 (or 48 games) at all."

More importantly (for our purposes), the odds of randomly picking a perfect bracket are one in 2^67, or one in 147,573,952,589,676,412,928. That matters because, when it comes to picking a perfect bracket, everyone's methods, no matter how arbitrary or carefully researched, are about equally likely to be imperfect. We are all, no matter what, attempting to (randomly) get it right.

You could flip a coin for each March Madness matchup, and you would have similar odds of a perfect streak of, say, 36 games as someone who made their selections based on teams' regular-season records and historical Final Four performance. Both of you would have similar odds as the person who picked based on the number of letters in a mascot's name.

Why? Because the individual outcomes we're trying to predict ahead of March Madness are influenced and determined by factors that inherently defy prediction. A coach's speech at halftime. The arc of a half-court shot. How well a given player (or referee) slept the night before. Momentum: the undeniable ability of the human psyche to affect performance, either positively or negatively. And innumerable other factors that happen at random, meaning that all we can know about them is that they may (or may not) occur, and may (or may not) materially affect the outcome if they do.

The risks of perceived but unverified patterns in big data

Of course, every year there are brackets that remain perfect far longer than most. An early upset might cut the number from millions to thousands, while a single (and separate) Cinderella story might keep hundreds of those brackets perfect through the next dozen games. These brackets are outliers, and outliers are always of interest in the sense that they're rare. But discrete pockets of outliers do not always represent a meaningful or actionable pattern, especially at the scale of large, complex data sets. In many cases they're not a pattern at all; they're simply a random concentration of rare instances.
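To make that halving process concrete, here is a minimal Python sketch. It simulates a hypothetical pool of 10 million coin-flip brackets; the pool size and the 50/50 win assumption are illustrative assumptions for this sketch, not the actual mechanics of the 2014 contest:

```python
import random

GAMES = 67            # total tournament games, per the article
ENTRIES = 10_000_000  # hypothetical pool size (the 2014 contest drew ~15 million)

survivors = ENTRIES   # brackets that are still perfect
for game in range(1, GAMES + 1):
    # Under the coin-flip assumption, each still-perfect bracket has a
    # 1-in-2 chance of calling this game correctly, so the survivor
    # count roughly halves with every game played.
    survivors = sum(1 for _ in range(survivors) if random.random() < 0.5)
    if survivors == 0:
        print(f"Last perfect bracket busted at game {game}")
        break
    print(f"After game {game:2d}: {survivors:,} perfect brackets remain")

# The chance that any single 50/50 bracket survives all 67 games:
print(f"One in {2 ** 67:,}")  # one in 147,573,952,589,676,412,928
```

Since log2(10,000,000) is about 23.3, the last perfect bracket in a pool like this typically busts somewhere around game 23 to 25, consistent with the real contest ending after 25 games. The longest-lived brackets look special, but under these assumptions they are simply the random tail of a halving process.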
Across millions of March Madness brackets, or millions of healthcare claims and hundreds of thousands of patient health records, there will inevitably be pockets of seemingly similar outliers. Commonalities. Correlations. But how meaningful these connections are can only be determined by validating the models that identify them, through rigorous testing of the hypotheses their existence seems to suggest.

So for healthcare CIOs and other technology leaders, the question isn't whether a system or vendor can identify patterns in their data. It's whether those patterns are subsequently validated against additional, external data through iterative machine learning: techniques that distinguish the random from the real and recurring.
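The toy Python sketch below illustrates one basic form of that validation, holdout testing; it is an illustration, not any specific vendor's method, and the feature, sample sizes, and seed are all hypothetical. It "mines" a rule from 1,000 purely random games using a spurious feature (the letter count of a mascot's name) and then checks the rule against fresh data:

```python
import random
from collections import Counter

random.seed(7)

# Hypothetical data: each "game" pairs a spurious feature (the number
# of letters in a mascot's name, 4 through 12) with a genuinely random
# win/loss outcome.
def make_games(n):
    return [(random.randint(4, 12), random.random() < 0.5) for _ in range(n)]

train, holdout = make_games(1000), make_games(1000)

# "Mine" the training data: for each feature value, adopt whichever
# outcome happened to be more common. This procedure always finds a
# pattern, whether or not a real one exists.
rule = {}
for value in range(4, 13):
    outcomes = Counter(outcome for v, outcome in train if v == value)
    rule[value] = outcomes[True] >= outcomes[False]

def accuracy(games):
    return sum(rule[v] == outcome for v, outcome in games) / len(games)

print(f"Accuracy on the data the rule was mined from: {accuracy(train):.1%}")
print(f"Accuracy on held-out data:                    {accuracy(holdout):.1%}")
# Typical result: a few points above 50% on the training data, roughly
# 50% on the holdout. The mined "pattern" was a random concentration,
# not a recurring signal.
```

The same discipline scales up: whatever a model finds in claims data or patient records, its patterns earn trust only when they continue to hold on data the model never saw.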