Hadoop Creator Outlines the Future of Big Data Platform
Doug Cutting, creator of Hadoop and founder of the Apache Hadoop Project, says big data is not hype and it's not a bubble. He lays out his vision of how Hadoop will become the Holy Grail of big data systems
Fri, October 26, 2012
CIO — Apache Hadoop, the open source software framework at the heart of big data, is a batch computing engine. It is not well-suited to the online, interactive data processing required for truly real-time data insights. Or is it? Doug Cutting, creator of Hadoop and founder of the Apache Hadoop Project (and chief architect at Cloudera) says he believes Hadoop has a future beyond batch.
"I think batch has its place," Cutting says. "If you're moving bulk amounts of data and you need to really analyze everything, that's not about interactive. But the combination of batch and online computation is what I think people will really appreciate."
"I really see Hadoop becoming the kernel of the mainstream data processing system that businesses will be using," he adds.
Where Hadoop Stands Now
Speaking at the O'Reilly Strata Conference + Hadoop World in New York City, Cutting explains his thoughts on the core themes of the Hadoop stack and where it's heading.
"Hadoop is known as a batch computing engine and indeed that's where we started, with MapReduce," Cutting says. "MapReduce is a wonderful tool. It's a simple programming metaphor that has found many applications. There are books on how to implement a variety of algorithms on MapReduce."
MapReduce is a programming model, designed by Google for batch processing massive datasets in parallel using distributed computing. MapReduce takes an input and breaks it down into many smaller sub-problems, which are distributed to nodes to process in parallel. It then reassembles the answers to those sub-problems to form the output.
"It's also very efficient," Cutting says. "It permits you to move your computation to your data, so you're not copying data around as you're processing it. It also forms a shared platform. Building a distributed system is a complicated process, not something you can do overnight. So we don't want to have to re-implement it again and again. MapReduce has proved itself a solid foundation. We've seen the development of many tools on top of it such as Pig and Hive."
"But, of course, this platform is not just for batch computing," he adds. "It's a much more general platform, I believe."
Defining Characteristics of the Hadoop Platform
To illustrate this, Cutting lays out what he considers the two core themes of Hadoop as it exists today, together with a few other things that he considers matters of "style."
First and foremost, he says, the Hadoop platform is defined by its scalability. It works just fine on small datasets stored in-memory, but is capable of scaling massively to handle huge datasets.