Enterprise Hadoop: Big Data Processing Made Easier

Amazon, Cloudera, Hortonworks, IBM, and MapR mix simpler setup of Hadoop clusters with proprietary twists and trade-offs

It's been a big year for Apache Hadoop, the open source project that helps you split your workload among a rack of computers. The buzzword is now well known to your boss but still just a vague and hazy concept for your boss's boss. That puts it in the sweet spot when there's plenty of room for experimentation. The list of companies using Hadoop in production work grows longer each day, and it probably won't be long before "Hadoop cluster" takes over the role that the words "crazy supercomputer" used to play in thriller movies. The next version of the WOPR is bound to run Hadoop.

The area is flourishing as the core project attracts a wide collection of helper projects that organize the workload and make it simpler to manage a collection of jobs to run at particular times. There's HDFS, a standard file system that can organize the data spread out around the cluster; Hive, a data warehousing layer for making sense of this data; Mahout, a collection of routines for trying to learn something from said data; and ZooKeeper, a tool for keeping all of the balls in the air. At least a half-dozen or more other open source tools live in a stable orbit around Hadoop.

[ Explore the current trends and solutions in BI with InfoWorld's interactive Business Intelligence iGuide. | Read about InfoWorld's 2012 Technology of the Year Award winners. | Read about InfoWorld's top 10 emerging enterprise technologies. | Discover what's new in business applications with InfoWorld's Technology: Applications newsletter. ]

The open source projects are just the beginning -- a surprisingly large number of companies are emerging with the plan of helping people actually use Hadoop. Some are just selling support, and others are building their own tools that sit alongside Hadoop and make it easier to use.

This kind of competition is usually seen as open source at its best. There is a core collection of packages that serve like a standard to keep everyone in synchrony. Each of the groups is competing to add the right sauce that will attract customers, both paying and nonpaying. There continues to be controversy over just how much is rolled into the central collection, as there can be in any major open source project, but the amount of experimentation is so large that it's hard to be too focused on the amount of sharing.

To get a feel for the excitement, I took four major collections out for a test-drive. I powered up a cluster of nodes on Rackspace, installed the tools, pushed the buttons, and ran some sample jobs. It's getting to be surprisingly easy to spend a few pennies for an hour or two of machine time -- so much so that I found myself debating whether it was worth leaving my cluster idling over lunchtime. Lest anyone doubt the efficiency of cloud computing, I noticed that the rate for my cluster of relatively fat machines with 4GB of RAM was less than the cost to park a car around the corner. The parking meters spin faster.

The not-so-good news is that these collections are far from perfect. None of the tools I tried worked exactly as promised. There were always small glitches. I often found myself reading the log files and paging through endless lists of Java stack dumps. (Someone is going to have to apply Hadoop to analyzing the endless stack dumps. They're getting so involved that I doubt a single machine will be able to parse them for much longer.) After a few seconds, I could usually get things on track again. These tools may not require someone with much experience to use once they're running, but they can't be installed unless you're fairly adept with the ways that the Java stack is organized.

Despite these impediments, I spent most of my time churning through data. The good news is that all of these tools make it pretty easy to get a cluster of computers working together to solve problems. Using these tools is much easier than downloading and configuring the source code yourself. They're designed to be one-button applications, and they come close to achieving that goal.

Amazon Elastic MapReduce

It should be no surprise that Amazon, one of the pioneers of cloud computing, offers a mechanism for spinning up Hadoop clusters on its EC2 cloud. Elastic MapReduce is tightly integrated with all of Amazon's other elastic offerings, and it sits as another tab on the Amazon Web Services main page. You store your data in S3, then fire up a job to churn through it.

The integration is nicely done. Amazon provides a Java-based Web interface that does a great job of hand-holding, taking care of many of the glitches that often occur when you're first trying software. When it wanted to store data in an S3 bucket, it flipped me over to a page for creating the bucket.

If the Web GUI is a bit too babyish, there's also a classic Web service API that's been wrapped up in software by a number of other programmers. I played a bit with a Ruby-based collection of tools that submits the jobs and starts them running. The standard start and end is the S3 cloud.

With Elastic MapReduce, Amazon is essentially offering nicer packaging on top of EC2 for those who are willing to plunge deeper into Amazon Web Services. I could have built my own cluster of machines on EC2 and used any of the Hadoop distros to spin them up, but Elastic MapReduce offers a nice set of shortcuts. Amazon has already built and integrated the infrastructure, and you just push the buttons to choose which version of Hadoop (0.18 or 0.2) you want to use. There's no need to worry about which version of Linux is running underneath.

The infrastructure is quite nice. You can choose to pay a stock price for your machines or just bid for empty machines on the spot market. This is the kind of extra feature that thrills the free-market fans, but I found it confusing. You choose your bid and take your chances. If you bid too little, you could end up waiting a long time, perhaps even forever.

It should be noted that the cloud doesn't respond instantaneously. It took from 5 to 18 minutes to execute tiny jobs that would take microseconds to execute on a fully configured cluster in your own server closet. The overhead wouldn't make a difference for a big job, but it's not the same as having your own cluster waiting patiently for you to push the Start button.

Taking advantage of all of these features means buying into Amazon's storage system. If you're already using S3 for your data, you'll be ready to go. If you're not, you'll have to make some decisions. Some people find that S3 is too expensive for bulk data that's rarely accessed. You're paying for all of the engineering that's been built for people who need a fairly good response time, and that price is built into the retrieval costs.

I think all of Amazon's extra features are good options for two classes of users. If you already have most of the relevant data in Amazon's cloud, Elastic MapReduce makes it easy to spin up jobs to analyze it. The piping is already well in place.

The other group would be those who don't need a cluster most of the time but want to do short, intensive calculations once a week, once a month, or once a quarter. It's not much work to create a full Hadoop cluster using the other tools in this review, but it's kind of silly to request new machines from scratch every now and again. Amazon offers a nice shortcut to uploading a Python script or a JAR file and going straight to computation.

Cloudera CDH, Manager, and Enterprise

Cloudera is a startup that has collected Hadoop experts from all of the major companies using Hadoop. The CTO came from Yahoo, the chief scientist from Facebook, and the CEO from Oracle. The staff is filled with the names of people who learned Hadoop by building it.

The company is selling training, support, professional services, and some tools for managing your cluster. The Cloudera distribution and basic manager are free for clusters with fewer than 50 machines, while the subscription-based enterprise edition offers many more features for handling standard data formats.

The free version is quite useful for starting up a cluster and monitoring the jobs as they flow through the system. The manager takes a list of IP addresses, logs into all of them with SSH, and installs the major tools.

The automation makes it pretty easy to run the Cloudera distro, but I still had to patch a few glitches to install it on CentOS. One component wanted a certain version of zip, and it ground to a halt until I logged into the machines and installed it myself. At another point, the Web-based graphical user interface wouldn't work until I logged in again and installed a widget library, ExtJS. The open source licenses probably weren't compatible.

The logging in reminded me of a small point. The IBM installer can use a different root password for each machine. Cloudera's installer wants to use either the same root password or the same RSA key. This meant I had to log into all of the machines and change the password because I was using a stock version of CentOS to start up the rack.

The fact that I noticed this small point and remembered it says much about what is for sale here. The tools are open source and the companies are selling ease-of-use. Little delays can multiply when you're not running exactly the same code.

I think Cloudera has done a better job of making its tools work with different Linux distros. It lists Ubuntu, Suse, Red Hat, CentOS, and Debian. Although I had to do a bit of patching with CentOS, it was relatively simple.

The difference between the free and enterprise versions is a bit bigger than I often see. The proprietary version will not only handle more than 50 machines, but it also includes plenty of monitoring, reporting, and data analysis tools.

In other words, the free version is a great way to start up a Hadoop cluster and make sure that everything is running, but you'll have to do some poking around to monitor it. The enterprise version includes more tools that automate the poking around and double-checking.

IBM InfoSphere BigInsights

IBM bundles Hadoop into something it calls InfoSphere BigInsights. The word "Hadoop" is on the main page, but the advertising copy clearly suggests that this is a product to help people who want "deep insights" into "big data." It's a tool for data analysis that just happens to use Hadoop for all of the structure.

There are two tiers: basic and enterprise. The basic edition is available completely for free, but you can buy support if you like. The enterprise edition, available through a commercial license, includes a number of extra features like BigSheets, a spreadsheetlike tool for drilling down into the data sitting in the cluster.

The collection includes all of the usual suspects and a few that aren't always mentioned -- such as Lucene. Lucene makes sense because BigInsights includes more than a few mechanisms for taking apart text. There's an entire collection of TextExtractors that will do things like search for addresses and flag certain words. The meat of the text analytics is in the enterprise edition.

IBM's literature says the BigInsights package is for Linux, but I found that it ran smoothly with only Red Hat's Enterprise distribution. The installation script would limp to the finish with a few of the others I tried, but it often reported that it failed to install entire tools like Hive or Pig. Even CentOS wasn't close enough to get much running. I think it may still be possible to get BigInsights running if you're adept with Linux and happy to poke around the log files, but it achieves labor savings only if you're running Red Hat Enterprise.

There are several nice touches in this installation script, by the way. As I plowed along looking for a good distribution, the software was careful to remember all of my inputs, so it wouldn't need to be reconfigured each time. This should be useful in a cloud where people may try to spin up a cluster, then tear it down. The software also includes a number of little features, like the ability to remember a different root password for each node; these can be quite helpful.

The center of the IBM tool is a console that helps you set up some jobs and kick them off. It's completely browser-based -- like the install script -- and you can simply upload your JAR files directly through the Web browser. You can even drill down into the HDFS file system layer and read the results without leaving the browser.

The Web GUI is a big advance over using the command line, but I easily found a number of ways that the console in the basic edition could be improved. As far as I can tell, there's no way to delete the old jobs. The information for each job includes basic details about the start and stop time, but almost everything else is just dumped as raw text. It wouldn't be too hard to parse some of this and do a nicer job displaying the log information.

The monitoring is also rudimentary. You can see that the nodes in your cluster are running and the components have started, but you don't get any cool dials or widgets that show the load or the progress. If you ask for the "details" about a component, you get a popup with some Log4J lines related to that component. A Java programmer won't blink an eye, but others might find it spare and uninviting.

1 2 Page 1
Page 1 of 2
7 secrets of successful remote IT teams