Each year, as March Madness narrows to the Final Four and millions of Americans’ brackets go bust, diehards and casual fans alike ask themselves: What can this teach us about data mining?
It’s a natural question. And it’s an especially important one for those who work with large data sets that contain information about inherently non-uniform entities subject to random forces and influences.
You know, like people.
Let’s start at the beginning.
Why we didn’t win a billion dollars from Warren Buffett
As you may recall, in 2014 Warren Buffett offered up $1B in prize money to anyone who could pick a perfect bracket. Despite the fact that 15 million of us applied our most diligent (or arbitrary) methodologies, no one successfully predicted the combination of wins and losses across the tournament’s 67 games. In fact, the contest ended long before the Final Four: there were no perfect brackets in the pool after just 25 games.
Certainly someone could have gotten it all right, whether through clairvoyance or sheer luck…yet the venerable Oracle of Omaha was willing to bet a billion that no one would. And it was a pretty safe bet. According to the NCAA, the longest verified streak of correct picks is 34 games. In their words: “Based on the reporting, we could not find verified brackets that have been perfect into the Sweet 16 (or 48 games) at all.”
More importantly (for our purposes), the odds of randomly picking a perfect bracket are one in 2^67 – roughly one in 147,573,952,589,676,000,000. That’s important because, when it comes to picking a perfect bracket, everyone’s methods – no matter how arbitrary or carefully researched – are about equally likely to be imperfect.
We are all, no matter what, attempting to (randomly) get it right.
You could flip a coin for each March Madness matchup – and you would have similar odds of a perfect streak of, say, 36 games as someone who made their selections based on teams’ regular-season records and historical Final Four performance. Both of you would have similar odds to the person who picked based on the number of letters in a mascot’s name.
Why? Because the individual outcomes we’re trying to predict ahead of March Madness are influenced and determined by factors that inherently defy prediction. A coach’s speech at halftime. The arc of a half-court shot. How well a given player (or referee) slept the night before. Momentum – the undeniable ability of the human psyche to affect performance, either positively or negatively. Innumerable other factors that happen at random, meaning that all that can be known about them is that they may (or may not) occur and may (or may not) materially impact the outcome if they do.
The risks of perceived but unverified patterns in big data
Of course, every year there are brackets that remain perfect far longer than most. An early upset might cut the number from millions to thousands. But a single (and separate) Cinderella story might keep hundreds of those brackets perfect through the next dozen games.
These brackets are outliers – and outliers are always of interest in the sense that they’re rare. But discrete pockets of outliers do not always represent a meaningful or actionable pattern, especially at the scale of large, complex data sets. In many cases they’re not a pattern at all; they’re simply a random concentration of rare instances.
Across millions of March Madness brackets – or millions of healthcare claims and hundreds of thousands of patient health records – there will inevitably be pockets of seemingly similar outliers. Commonalities. Correlations.
But how meaningful these connections are can only be determined by validating the model(s) that identify them, through rigorous testing of the hypotheses their existence seems to suggest.
So for healthcare CIOs and other technology leaders, the question isn’t whether a system or vendor can identify patterns in their data. It’s whether those patterns are subsequently validated using additional, external data and iterative machine learning – techniques that distinguish the random from the real and recurring.
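For readers who want to see the point in numbers, here is a minimal Python sketch. It rests on one loud assumption: every pick in every bracket is an independent coin flip, which is this article’s simplification and not a model of how real fans choose. The function names (`perfect_bracket_odds`, `discovery_vs_validation`) and the pool sizes are hypothetical, chosen only for illustration. The sketch first computes the 2^67 figure exactly, then finds the brackets that are still perfect after an early set of games and scores those same brackets on later games they have not yet been judged on.

```python
import random


def perfect_bracket_odds(games: int = 67) -> int:
    """Return N, where a pure coin-flip bracket is perfect with probability 1/N."""
    return 2 ** games


def discovery_vs_validation(num_brackets: int = 20_000, games: int = 20,
                            split: int = 10, seed: int = 2014):
    """Simulate coin-flip brackets (an assumption, not real picking behavior).

    Returns (number of brackets perfect through the first `split` games,
             accuracy of those same brackets on the remaining games).
    """
    rng = random.Random(seed)
    # Random game results and random picks: 0 or 1 for each game.
    outcomes = [rng.randrange(2) for _ in range(games)]
    brackets = [[rng.randrange(2) for _ in range(games)]
                for _ in range(num_brackets)]

    # "Discovery": the pocket of outliers that look brilliant so far.
    survivors = [b for b in brackets if b[:split] == outcomes[:split]]

    # "Validation": score the outliers on games held out from the discovery step.
    held_out = games - split
    correct = sum(pick == result
                  for b in survivors
                  for pick, result in zip(b[split:], outcomes[split:]))
    total = len(survivors) * held_out
    accuracy = correct / total if total else float("nan")
    return len(survivors), accuracy


if __name__ == "__main__":
    print(f"Coin-flip odds of a perfect bracket: 1 in {perfect_bracket_odds():,}")
    count, accuracy = discovery_vs_validation()
    print(f"Brackets perfect through the first 10 games: {count}")
    print(f"Their accuracy on the next 10 games: {accuracy:.2f}")
```

Under this assumption, a pool of 20,000 random brackets should leave around 20 still perfect after 10 games (20,000 / 2^10), yet those apparent experts should hover near 50% on the held-out games. That is the pattern-versus-validation gap in miniature: the outliers are real, but the skill they seem to imply does not survive fresh data.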