Just how big are big data? Not the big data hype bubble, mind you—we know that's enormous. Rather, how large do data sets have to be before we can consider them big data?
There is no one answer. Big data is a relative term. It refers to data sets, and the corresponding data challenges, so large that traditional data management and analytics approaches aren't up to the task of squeezing all the value we desire from the information we have. As a result, as our tools and techniques improve, the "bigness" threshold for big data will continue to rise.
This threshold also depends upon the context for the data, which generally aligns with the industry responsible for them. Genomics research, weather prediction and other scientific pursuits push the limit of data set size, but any business that collects information about its customers may also have big data challenges.
Keep in mind Parkinson's Law of Data: the amount of data available expands to fill the available space for it. As our technology for creating, moving and storing data improves, the big data threshold will continue to rise. If anything, it seems the relentless advance of technology is driving the ever-increasing acquisition of information—and this deluge promises to swamp even the most facile of big data strategies.
The central big data challenge, of course, is how to derive value from such immense data sets, essentially recovering those rare gems in the rough—identifying the important, meaningful and insightful nuggets in the onslaught of noise.
Counterintuitively, the more information we have, the less we actually desire, since we only prize the results of careful analysis of our big data, not the data themselves. A mountain containing gold is worthless, regardless of the size of the mountain, if the cost of extracting the precious material exceeds its value.
U.S. Government Sitting on Big Data Goldmine
Today, the U.S. government faces the mother of all big data mountains. From National Oceanic and Atmospheric Administration (NOAA) weather data to earth science information from the U.S. Geological Survey (USGS) to the genomics data at the National Institutes of Health (NIH), the government—and, therefore, the American people—own perhaps the largest collection of big data sets on this planet.
This is extraordinarily valuable in theory, true, but worthless if we're unable to extract the important nuggets. To mine this gold, the Obama Administration announced its Big Data Research and Development Initiative in March 2012. Five agencies made about $200 million in new commitments toward improving big data tools and techniques: the aforementioned NIH and USGS plus the National Science Foundation, the Department of Defense (DOD), and the Department of Energy (DOE). The data challenges these agencies and departments face range from better use of the DOE's supercomputers for crunching scientific data to facilitating "rapidly customizable visual reasoning" for diverse DOD missions.
These are valuable nuggets, to be sure, and, in the grand scheme of things, $200 million is a bargain. But the administration's investments in Big Data don't stop there. In August the White House announced its Presidential Innovation Fellows program, which brings a crack team of innovators together to collaborate on projects with the goal to "improve the lives of the American people, save taxpayer money and fuel job creation." On the initial list of target projects are Blue Button for America, an extension of the Department of Veterans Affairs' Blue Button initiative, as well as an open-ended set of projects the White House calls Open Data Initiatives.
The Open Data Initiatives have a different mandate than the Big Data Initiative, but the synergy between them is obvious. Open Data focuses on "liberating" government data (as well as contributed corporate data) in order to achieve the strategic goals of the Innovation Fellows program.
What does it mean to liberate data? The two examples cited are NOAA weather data (now at the core of every weather report on television) and the Global Positioning System, without which we'd all literally be lost.
Of these examples, NOAA weather data most obviously present big data challenges. The value in such large data sets doesn't lie simply in the weather data themselves, but in the ability to forecast weather based upon those data—a classic big data problem. From the perspective of the American citizen, we value accurate forecasts; the immense quantity of historical weather data that feed the forecasting engines is merely the ore we must mine to find the nuggets we desire.
Such is the challenge facing the Open Data Initiative. The more data we have, the less we value the data sets themselves. The information we truly desire lies buried under increasing quantities of irrelevant or otherwise useless information. The danger is that the more data the government provides us, the better hidden are the nuggets we desire. In other words, in the absence of effective big data solutions, truly open government may be out of reach—or, worse, misapplied to obscure the very information that citizens would find most valuable.
Big Data Challenges Heightened by Citizens' Right to Information
This undesirable outcome is clearly not the intention of President Barack Obama's Open Government Initiative, which calls for a presumption of openness. True, there are types of information that the government may not or should not share, including military secrets, private data about individuals, and information relevant to ongoing criminal investigations. However, the list of such sensitive information categories is explicit and limited. All other government information is up for grabs.
If you want access to such information, typically all you have to do is go to the relevant agency website, as the Obama Administration ordered agencies to proactively make information available to all citizens. If you can't find what you're looking for, you may make a Freedom of Information Act request. The act was passed in 1966, and Congress strengthened FOIA in 1974 in the aftermath of Watergate. Today, the government receives more than 500,000 FOIA requests per year, with a current backlog of more than 80,000 requests.
Typically a citizen makes a FOIA request for a particular document or other information—Steve Jobs's FBI background check, for example. While such documents have a historical as well as human interest value, their worth pales in comparison to the nuggets of gold that Big Data analyses can potentially reveal.
However, it would be all but impossible to submit a FOIA request for a big data analysis conclusion, because there is no way to form such a request ahead of time. Big data analyses typically ask, "What are the important or interesting conclusions I can draw from these large data sets?" They don't request a particular piece of information. The best big data analytics tell you what information you should think is important, rather than expecting you to know what information is important ahead of time.
Government agencies, therefore, face two strategic big data challenges. First, they must avoid swamping relevant information with noise; second, they must let citizens request important information from the government without having to know ahead of time why that information is important. Furthermore, the larger the available data sets become, the greater these challenges will be.
Our government can talk about open data and open government all it wants, but if it doesn't get big data solutions right, then we risk floundering in an ocean of irrelevant information. The Presidential Innovation Fellows have their work cut out for them.