The Challenges of Managing Mountains of Information
Tue, October 18, 2011
Computerworld — If you think the storage systems in your data centers are out of control, imagine having 449 billion objects in your database, or having to add 40 terabytes of new data each week.
The challenges of managing big data include storing huge files, creating long-term archives and, of course, keeping the data accessible.
While data management has always been a key function in corporate IT, "the current frenzy has taken market activity to a whole new level," says Richard Winter, an analyst with Wintercorp Consulting Services, a firm that studies big data trends.
New products appear regularly from established companies and startups alike. Whether it's Hadoop, MapReduce, NoSQL or one of several dozen data warehousing appliances, file systems and new architectures, the data analytics segment is booming, he says.
"We have products to move data, to replicate data and to analyze data on the fly," says Winter. "Scale-out architectures are appearing everywhere as vendors work to address the enormous volumes of data pouring in from social networks, sensors, medical devices and hundreds of other new or greatly expanded data sources."
Some shops know about the challenges inherent in managing really big data all too well. At Amazon.com, Nielsen, Mazda and the Library of Congress, this task has required adopting some innovative approaches to handling billions of objects and petascale storage media, tagging data for quick retrieval and rooting out errors.
Taking a metadata approach
The Library of Congress processes 2.5 petabytes of data each year, which amounts to around 40TB a week. Thomas Youkel, group chief of enterprise systems engineering at the library, estimates the data load will quadruple in the next few years as the library continues to carry out its dual mandates to serve up data for historians and preserve information in all its forms.
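As a rough check on those figures, the short Python sketch below converts a yearly volume into a weekly one. The decimal unit convention (1PB = 1,000TB) and the 52-week year are assumptions made here for illustration, not something the library specified.

```python
# Back-of-the-envelope check of the ingest figures quoted above.
# Assumptions (not from the library): decimal units and a 52-week year.
TB_PER_PB = 1_000
WEEKS_PER_YEAR = 52


def weekly_tb(petabytes_per_year: float) -> float:
    """Convert a yearly volume in petabytes to a weekly volume in terabytes."""
    return petabytes_per_year * TB_PER_PB / WEEKS_PER_YEAR


current = weekly_tb(2.5)        # the library's stated yearly volume
projected = weekly_tb(2.5 * 4)  # if the load quadruples, as Youkel estimates

print(f"current:   {current:.0f} TB/week")
print(f"projected: {projected:.0f} TB/week")
```

Under these assumptions the current rate comes to about 48TB a week, somewhat above the rounded 40TB figure, and a quadrupled load would be roughly 192TB a week.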
The library stores information on 15,000 to 18,000 spinning disks attached to 600 servers in two data centers. Over 90% of the data, or more than 3PB, is stored on a fiber-attached SAN, and the rest is stored on network-attached storage drives.
"The Library of Congress has an interesting model" in that part of the information stored is metadata -- or data about what is stored -- while the rest is the actual content, says Greg Schulz, an analyst at consultancy StorageIO. Although plenty of organizations use metadata, Schulz explains that what sets the Library of Congress apart is the sheer size of its data store and the fact that it tags absolutely everything in its collection, including vintage audio recordings, videos, photos and files on other types of media.