comScore is no stranger to big data. The digital analytics company was founded in 1999 with the goal of providing intelligence to companies about what's happening online.
In those early days, the volume of data at the company's fingertips was relatively modest. But that wouldn't last long.
"Things got really interesting for us starting back in 2009 from a volume perspective," says Mike Brown, the company's first software engineer, now its CTO. "Prior to that, we had been in the 50 billion to 100 billion events per month space."
Let the big data flow
Starting in the summer of 2009, it was like someone opened the sluice gates on the dam; data volume increased dramatically and has continued to grow ever since. In December of last year, Brown says comScore recorded more than 1.9 trillion events — more than 10 terabytes of data ingested every single day.
[ Related: Clorox CIO discusses the real challenge of big data ]
Back in 2004, before Hadoop was a twinkle in the eyes of Doug Cutting and Mike Carafella, comScore had begun building its own grid processing stack to deal with its data. But in 2009, five years into that project, comScore was struggling to implement its new Unified Digital Measurement (UDM) initiative and the volume of data and processing requirements were growing fast.
[ Related: 4 ways to beat the big data talent shortage ]
"The census has been huge," Brown says. "Ninety percent of the top 100 media properties participate in that program now and every page is sending us a call."
The company now has about 50 different data sources across the census and panel classes, Brown says.
To accommodate the rising tide of data, comScore started a new round of infrastructure upgrades. It became apparent that its custom-built grid processing stack wasn't going to be able to scale with the need. Fortunately, there was a promising new technology gaining steam that might fit the bill: Apache Hadoop.
Putting data on the MapR
After experimenting with Apache Hadoop, the company decided to go with MapR Technologies' distribution.
"I think we were the first production MapR client," Brown says. "Our cluster has grown to a decent size. We have 450 nodes in our production cluster and that has 10 petabytes of addressable disk space, 40 terabytes of RAM and 18,000+ CPUs."
[ Related: 7 principles of data-driven transformation ]
One of the big deciding factors in favor of MapR's distribution is its support for NFS.
"HDFS is great internally, but to get data in and out of Hadoop, you have to do some kind of HDFS export," Brown says. "With MapR, you can just mount [HDFS] as NFS and then use native tools whether they're in Windows, Unix, Linux or whatever. NFS allowed our enterprise systems to easily access the data in the cluster."
CIOs should think small on big data (at first)
So given his long experience with Hadoop in production, what advice does Brown have for CIOs that are just starting to implement big data technologies? First, start small.
"Everybody is thrilled by the notion of big data, but start small," Brown says. "The technology is there to allow you to scale up, but taking a subset of your data and pounding on that for a while and working on it, that will allow you to demonstrate value to the business much faster."
The really important thing, he says, is to get past the proof of concept (PoC) and put your project into production.
"Choose one thing to try to provide value to show that this does work," he says. "Then get that into production. I'm fearful that some places choose to leave their big data projects as the evergreen PoC. It doesn't get real until you've got it in production. It can be hard, but that's the big thing to do. Once you do that, then it's quick to build momentum."
Brown notes that it's also essential to give careful consideration to the hardware you select. One of the things that really helped Hadoop catch on, he says, is that you can scale with commodity hardware. But that doesn't mean you can skimp.
"When we first started out, I think the conventional wisdom out there from a Hadoop perspective was to go with high-density, low-speed drives," he says. "But when you get into analytics, slow drives make the shuffle be kind of slow."
comScore ran into headfirst into that problem when it first began working with Hadoop. In a MapReduce job, there is a 'shuffle and sort' handoff process that occurs after the 'Map' phase and before the 'Reduce' phase. The data from the mapper are sorted by key and partitioned (assuming there are multiple reducers) and then moved to nodes that will run the reducer tasks, where they are written to disk. This is where low-speed drives can become a big bottleneck, Brown says.
"It's worth doing some traditional IOPS testing on what your drives can do," he says.
"IOPS is really driving a lot of this stuff," he adds. "I've heard of some shops putting in all SSDs now."
Another area that's important to focus on, Brown says, is quality assurance — of your data.
Keep up with your algorithms
"I think the big thing, especially in the data area, is you actually need data QA," he says. "Did the algorithm do what it's supposed to do? Algorithms can need maintenance just like software does."
Finally, he says, make sure that you're thinking long-term.
"The biggest thing that's going on with technology right now is making sure you know where you want to be, not just short-term but long-term, and making sure the technologies you choose are something that's going to help you get on that path," he says. "It may require you to reinvent yourself periodically. That's the big thing. What happens if volume grows by 10X or 100X? Those things to help guide your decision-making."