How ComScore Is Using Hadoop to Tame Its Big Data Flow

Market researcher ComScore tames a torrent of data with open-source analytics

Mike Brown, the CTO at ComScore, knows a bit about managing big data. Every day 12 terabytes of information rushes into his cluster of 80 servers running the open-source software Hadoop, which sorts and analyzes the data for a host of clients who want to know things like which online vendor sold the most e-cards or how fast Facebook is growing in Brazil. "We ingest 32 billion new rows of data a day," he says.

The torrent of data is swelling fast enough that Brown plans to be running 200 servers by the end of the year, and he thinks that number could double without the right data-integration software. Brown has been wading through an ocean of information ever since he joined ComScore as its first software engineer in 1999, shortly after the startup landed its original venture financing. Today the Internet market research firm reports $232 million in revenue a year. "Our growth is pretty darn linear and should continue," Brown says.

ComScore started off on a homegrown grid processing stack and in 2000 added Syncsort's data integration software, the current version of which is DMExpress. "We were up and running in weeks," Brown says. "It literally made our software run 5-10 times faster. You're not just adding storage, but you're adding compute as well."

In 2009, ComScore began migrating to Hadoop, becoming an early adopter of the technology, which has recently begun gaining traction in the enterprise market.

"We decided it was better to leverage the community than invest in building our own," Brown says. "In general, Hadoop is harder to bring into an enterprise when you have mixed operating systems. DMExpress, with their connector, is helping to solve this issue."

That's a typical experience, notes James Kobielus in a recent report for Forrester Research, where he was an analyst. Hadoop, he wrote, "lacks some critical enterprise data warehouse features, such as real-time integration and robust high availability. The Hadoop market includes many vendors that have focused on these and other deficiencies in the core Hadoop stack. Vendors have, of necessity, either built proprietary extensions to address these requirements or have leveraged various NoSQL tools and open source code to provide the requisite functionality."

In ComScore's case, Brown found that Syncsort's software made the Hadoop migration a piece of cake. "You don't have to change any code, except the push code," he says. "We use DMExpress in [more than] 30 different apps. It's our tool for any situation [where] we have to adjust the data."

"We can store twice as much data on the cluster," he continues, "and we also use it to improve performance. One big problem it solved was the ability to chunk and split the large files we have into files that fit perfectly into the chunks on Hadoop. This enables us to have a higher rate of parallelism on compressed files while reducing our costs for disk on the cluster."

That, Brown says, translates into saving 75 terabytes of data storage a month. That, too, is big data.
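
For readers curious about what the chunk-and-split step Brown describes might look like, here is a minimal Python sketch. It is not ComScore's or Syncsort's code, and the article does not say how DMExpress does it. The function name split_records and the max_bytes parameter are hypothetical, and the sketch assumes newline-delimited rows. The idea is to group whole rows into pieces that never exceed a target size, so that each piece, once compressed on its own, could sit inside a single Hadoop block and be handled by a separate task.

```python
from typing import Iterable, Iterator, List


def split_records(records: Iterable[bytes], max_bytes: int) -> Iterator[List[bytes]]:
    """Group whole records into pieces of at most max_bytes each.

    Hypothetical helper for illustration only: max_bytes stands in for
    whatever target size matches the cluster's block size.
    """
    piece: List[bytes] = []
    size = 0
    for rec in records:
        if len(rec) > max_bytes:
            # A single row that cannot fit is an error in this sketch.
            raise ValueError("record larger than target piece size")
        if piece and size + len(rec) > max_bytes:
            # Adding this row would overflow the piece, so close it out.
            yield piece
            piece, size = [], 0
        piece.append(rec)
        size += len(rec)
    if piece:
        # Emit the final, possibly smaller, piece.
        yield piece


# Tiny demonstration with a 10-byte target instead of a real block size.
rows = [b"a,1\n", b"bb,2\n", b"ccc,3\n", b"d,4\n"]
pieces = list(split_records(rows, max_bytes=10))
```

In this toy run the four rows end up in two pieces of two rows each, and no row is ever cut in half. A real pipeline would use a target in the tens or hundreds of megabytes and would write and compress each piece as its own file.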
