Apache Hadoop, the open source software framework at the heart of big data, is a batch computing engine. It is not well-suited to the online, interactive data processing required for truly real-time data insights. Or is it? Doug Cutting, creator of Hadoop and founder of the Apache Hadoop Project (and chief architect at Cloudera) says he believes Hadoop has a future beyond batch.
"I think batch has its place," Cutting says. "If you're moving bulk amounts of data and you need to really analyze everything, that's not about interactive. But the combination of batch and online computation is what I think people will really appreciate."
"I really see Hadoop becoming the kernel of the mainstream data processing system that businesses will be using," he adds.
Where Hadoop Stands Now
Speaking at the O'Reilly Strata Conference + Hadoop World in New York City, Cutting explains his thoughts on the core themes of the Hadoop stack and where it's heading.
"Hadoop is known as a batch computing engine and indeed that's where we started, with MapReduce," Cutting says. "MapReduce is a wonderful tool. It's a simple programming metaphor that has found many applications. There are books on how to implement a variety of algorithms on MapReduce."
MapReduce is a programming model, designed by Google for batch processing massive datasets in parallel using distributed computing. MapReduce takes an input and breaks it down into many smaller sub-problems, which are distributed to nodes to process in parallel. It then reassembles the answers to those sub-problems to form the output.
"It's also very efficient," Cutting says. "It permits you to move your computation to your data, so you're not copying data around as you're processing it. It also forms a shared platform. Building a distributed system is a complicated process, not something you can do overnight. So we don't want to have to re-implement it again and again. MapReduce has proved itself a solid foundation. We've seen the development of many tools on top of it such as Pig and Hive."
"But, of course, this platform is not just for batch computing," he adds. "It's a much more general platform, I believe."
Defining Characteristics of the Hadoop Platform
To illustrate this, Cutting lays out what he considers the two core themes of Hadoop as it exists today, together with a few other things that he considers matters of "style."
First and foremost, he says, the Hadoop platform is defined by its scalability. It works just fine on small datasets stored in-memory, but is capable of scaling massively to handle huge datasets.
"A big component of scalability that we don't hear a lot talked about is affordability," he says. "We run on commodity hardware because it allows you to scale further. If you can buy 10 times the amount of storage per dollar, then you can store 10 times the amount of data per dollar. So affordability is key, and that's why we use commodity hardware, because it is the most affordable platform."
Just as important, he notes, Hadoop is open source.
"Similarly, open source software is very affordable," he adds. "The core platform that folks develop their applications against is free. You may pay vendors, but you pay vendors for the value they deliver, you don't keep paying them year after year even though you're not getting anything fundamentally new from them. Vendors need to earn your trust and earn your confidence by providing you with value over time."
Beyond that, he says, there are what he considers elements of Hadoop's style.
"There's this notion that you don't need to constrain your data with a strict schema at the time you load it," he says. "Rather, you can afford to save your data in a raw form and then, as you use it, project it to various schemas. We call this schema on read.
Another popular theme in the big data space is that oftentimes simply having more data is a better way to understand your problem than to have a more clever algorithm. It's often better to spend more time gathering data than to fine-tune your algorithm on a smaller data set. Intuitively, this is much like having a higher-resolution image. If you're going to try to analyze it, you'd rather zoom in on the high-resolution image than the low-resolution image."
HBase Is an Example of Online Computing in Hadoop
Batch processing, he notes, is not a defining characteristic of Hadoop. As proof he points to Apache HBase, the highly successful open source, nonrelational distributed database—modeled on Google's BigTable—that is part of the Hadoop stack. HBase is an online computing system, not a batch computing system.
"It performs interactive puts and gets of individual values," Cutting explains. "But it also supports batch. It shares storage with HDFS and with every other component of the stack. And I think that's really what's led to its popularity. It's integrated into the rest of the system. It's not a separate system on the side that you need to move data in and out of. It can share other aspects of the stack: It can share availability, security, disaster recovery. There's a lot of room to permit folks to only have one copy of their data and only one installation of this technology stack."
Looking Ahead to the Hadoop Holy Grail
But if Hadoop is not defined by batch, if it is going to be a more general data processing platform, what will it look like and how will it get there?
"I think there are a number of things we'd like to see in the sort of "Holy Grail" big data system," Cutting says. "Of course we want it to be open source, running on commodity hardware. We also want to see linear scaling: If you need to store ten times the data, you'd like to just buy ten times the hardware and have that work automatically, no matter how big your dataset gets.
Similarly with performance, Cutting says, for both batch performance if you need greater batch throughput or short, smaller batch latency, you'd like to increase the amount of hardware. As for interactive queries, the same thing holds. Increased hardware should give you linear scalability in both performance and magnitude of data process."
"There are other things we'd like to see," he adds. "We'd like to see complex transactions, joins, a lot of technologies which this platform has lacked. I think, classically, folks have believed that they weren't ever going to be present in this platform, that when you adopted a big data platform, you were giving up certain things. I don't think that's the case. I think there's very little that we're going to have to need to give up in the long term."
Google Provided a Map
The reason, Cutting says, is that Google has shown the way to establish these elements in the Hadoop stack.
"Google has given us a map," he says. "We know where we're going. They started out publishing their GFS and MapReduce papers, which we quickly cloned in the Hadoop Project. Through the years, Google has produced a succession of publications that have in many ways inspired the open source stack. The Sawzall system was a precursor to Pig and Hive; BigTable directly inspired HBase, and so on. And I was very excited to see this year Google publish a paper called Spanner about a system that implements transactions in a distributed system—multitable transactions running on a database at a global scale. This is something that I think a lot of us didn't think we'd see anytime soon, and it really helps us to see that the sky's the limit for this platform."
Spanner, Cutting notes, is complicated technology and no one should expect to see it as part of Hadoop next spring. But it provides a route to the Holy Grail, he says. In the meantime, he points to Impala, a new database engine released by Cloudera at the conference this week, which can query datasets stored in HBase using SQL.
"Impala is a huge step down this path toward the Holy Grail," he says. "Now, no longer can you [only] do online puts and gets of values, you can do online queries interactively with Impala. And Impala follows some work from Google, again, that was published a few years ago, and it's very exciting. It's a fundamental new capability in this platform that I think is a tremendously valuable step on its own and will help you build more and better applications on this platform. But also I think it helps to make this point, that this platform isn't a niche. It isn't a one-point technology. It's a general purpose platform."
We know where we're going with it," Cutting says, "and moreover we know how to get there in many cases. So I encourage you to be comfortable adopting it now and know that you can expect more in it tomorrow. We're going to keep this thing advancing."
Thor Olavsrud covers IT Security, Big Data, Open Source, Microsoft Tools and Servers for CIO.com. Follow Thor on Twitter @ThorOlavsrud. Follow everything from CIO.com on Twitter @CIOonline and on Facebook. Email Thor at firstname.lastname@example.org