Large Data Set Analysis in the Cloud: Amazon, Cloudera Improve Hadoop

Traditional business intelligence solutions can't scale to the degree necessary in today's data environment. One solution getting a lot of attention recently: Hadoop, an open-source product inspired by Google's search architecture.

Twenty years ago, most companies' data came from fundamental transaction systems: Payroll, ERP, and so on. The amounts of data seemed large, but usually were bounded by well-understood limitations: the overall growth of the company and the growth of the general economy. For those companies that wanted to gain more insight from those systems' data, the related data warehousing systems reflected the underlying systems' structure: regular data schema, smooth growth, well-understood analysis needs. The typical business intelligence constraint was the amount of processing power that could be applied. Consequently, a great deal of effort went into the data design to restrict the amount of processing required to the available processing power. This led to the now time-honored business intelligence data warehouses: fact tables, dimension tables, star schemas.

Today, the nature of business intelligence has totally changed. Computing is far more widespread throughout the enterprise, leading to many more systems generating data. Companies are on the Internet, generating huge torrents of unstructured data: searches, clickstreams, interactions, and the like. And it's much harder, if not impossible, to forecast what kinds of analytics a company might want to pursue.

Today it might be clickstream patterns through the company website. Tomorrow it might be cross-correlating external blog postings with order patterns. The day after it might be something completely different. And the system bottleneck has shifted. While in the past the problem was how much processing power was available, today the problem is how much data needs to be analyzed. At Internet scale, a company might be dealing with dozens or hundreds of terabytes. At that size, the number of drives required to hold the data guarantees frequent drive failures. And centralizing the data is impractical: moving that much data across the network to the processors generates too much traffic.

One thing is clear: the traditional business intelligence solutions can't scale to the degree necessary in today's data environment.

Fortunately, several solutions have been developed. One, in particular, has gotten a lot of attention recently: Hadoop. Essentially, Hadoop is an open source product inspired by Google's search architecture. Interestingly, unlike previous open source products that were usually implementations of previously existing proprietary products, Hadoop has no proprietary predecessor. The innovation in this aspect of big data resides in the open source community, not in a private company.

Hadoop creates a pool of computers, each running the Hadoop file system. A central master node spreads data across the machines in a file structure designed for large block reads and writes. When a job runs, a hash of each record's key routes records that share a key to the same machine, which makes processing large data sets efficient. For robustness, three copies of all data are kept by default, so that hardware failures do not halt processing.
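To make the storage side concrete, here is a minimal sketch in Python of splitting a file into large blocks and keeping three copies of each on different machines. It is not Hadoop code: the 64 MB block size, the node names, and the round-robin placement are all assumptions chosen for illustration, and Hadoop's real placement policy is more sophisticated.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # assumed block size in bytes (illustrative)
REPLICAS = 3                    # three copies of every block


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    return [(offset, min(block_size, file_size - offset))
            for offset in range(0, file_size, block_size)]


def place_replicas(blocks, nodes, replicas=REPLICAS):
    """Assign each block to `replicas` distinct nodes, round-robin.

    This is a toy policy, not the one Hadoop actually uses.
    """
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        index: [nodes[(index + i) % len(nodes)] for i in range(replicas)]
        for index in range(len(blocks))
    }


def surviving_blocks(placement, failed_node):
    """Blocks that still have at least one copy after one node is lost."""
    return [block for block, holders in placement.items()
            if any(node != failed_node for node in holders)]


# Hypothetical five-machine pool and a 200 MB file.
nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
blocks = split_into_blocks(200 * 1024 * 1024)
placement = place_replicas(blocks, nodes)
still_readable = surviving_blocks(placement, "node-a")
```

With three copies on three different machines, losing any single machine leaves every block readable, which is the point of the replication.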

When it comes time to mine the data, the programmer can ignore the details of how the data is laid out. One function, known as the map function, reads through the data set and outputs intermediate results organized as key/value pairs. A second function, known as the reduce function, then goes through the map function's output, grouped by key, and produces the desired results, writing them to a temporary file, organizing them in a table in memory, or even loading them into a data mart to be analyzed with traditional BI tools.
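The two functions can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the Hadoop API: the names map_fn, shuffle, reduce_fn, and run_job are mine, and the tiny clickstream is made up. In a real cluster, the grouping step in the middle is done by the framework across many machines.

```python
from collections import defaultdict


def map_fn(record):
    """Map: emit (key, value) pairs for one input record (a log line)."""
    _user, page = record.split()
    yield page, 1


def shuffle(pairs):
    """Group values by key; the framework does this between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_fn(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map over every record, group by key, then reduce each group."""
    pairs = (pair for record in records for pair in map_fn(record))
    return dict(reduce_fn(key, values)
                for key, values in shuffle(pairs).items())


# Hypothetical clickstream: "<user> <page>" per line.
clicks = ["alice /home", "bob /pricing", "alice /pricing",
          "carol /home", "bob /home"]
page_hits = run_job(clicks)
```

Here page_hits ends up mapping each page to its hit count. The programmer writes only map_fn and reduce_fn; everything else is plumbing.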

The advantage of this approach is that very large sets of data can be managed and processed in parallel across the machine pool managed by Hadoop. The map/reduce approach is sometimes criticized for being inefficient, since the overall data pool is processed each time a new analysis is desired. While it's true that repeated processing is typical, it's also true that in today's world it's impossible to know beforehand what kinds of analyses are going to be desired in the future. This means that optimizing to reduce the "inefficient" repeated processing would also limit the potential for exploring unforeseen ongoing analytical requirements. Moreover, processing is significantly cheaper than data, relatively speaking. A general rule of thumb is to optimize for the least efficient, most costly resource; with Internet-scale data, that means "wasting" processing to optimize storage.
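A small Python sketch shows why the repeated full scan is a fair trade. The data below is invented, the partitions stand in for machines, and threads stand in for the cluster's parallelism (they will not speed up real CPU-bound work in Python). What matters is that the second question needs no schema designed in advance, just another pass over the same raw data.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical clickstream, pre-split into partitions as if each lived on
# a different machine: (user, page, seconds_on_page)
PARTITIONS = [
    [("alice", "/home", 12), ("bob", "/pricing", 40)],
    [("alice", "/pricing", 35), ("carol", "/home", 5)],
    [("bob", "/home", 8), ("carol", "/signup", 60)],
]


def scan(partitions, extract):
    """Full scan: process every partition in parallel, then merge."""
    def process(partition):
        counts = Counter()
        for record in partition:
            key, amount = extract(record)
            counts[key] += amount
        return counts

    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(process, partitions))
    return sum(partials, Counter())


# Today's question: how many hits did each page get?
hits_per_page = scan(PARTITIONS, lambda r: (r[1], 1))

# Tomorrow's question, not anticipated when the data was stored:
# how many seconds did each user spend on the site?
seconds_per_user = scan(PARTITIONS, lambda r: (r[0], r[2]))
```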

While Hadoop may seem like a product you have no need for, don't dismiss it. The changing nature of IT, as well as the rapid evolution of business processes, means that you'll likely face the need for this kind of analytical tool in the very near future.

The power of Hadoop can be seen in how the NY Times used it to convert a 4 TB collection of its pages from one format to another. The programmer assigned the task uploaded the data to a number of Amazon EC2 instances and then ran a Hadoop map/reduce transformation on the pages. Two days later, all of the pages had been converted, thanks to the parallel processing across 20 machines that Hadoop made possible. Attempting to do the same conversion on one machine would have taken well over a month; attempting to perform the parallel processing without Hadoop would have necessitated creation of a large, complex grid fabric. Hadoop abstracted all the "plumbing" (i.e., spreading the data across machines, coordinating the parallel processing, ensuring that the job was executing properly, etc.) away from the programmer, enabling him to focus on the actual task: creating the document conversion routine executed in the reduce phase of the process.
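The "well over a month" figure is simple arithmetic, sketched below in Python. It assumes the work divides evenly across machines and that coordination overhead is negligible, neither of which is exactly true in practice, so treat the result as a rough estimate.

```python
def single_machine_days(machines, elapsed_days):
    """Estimate one-machine run time from a parallel run.

    Assumes perfectly even division of work and no coordination overhead.
    """
    return machines * elapsed_days


# Figures from the example: 20 machines, two days of wall-clock time.
estimate = single_machine_days(machines=20, elapsed_days=2)
```

That works out to 40 machine-days, which is where the estimate for a single machine comes from.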

Hadoop has attracted vendor attention. A new startup, Cloudera, offers certified releases and regular updates, as well as technical support. (Incidentally, Cloudera offers a set of online Hadoop courses, which are excellent.) And just last week, Amazon announced it will offer Hadoop support in its Elastic MapReduce service, making Hadoop available in the cloud.

The interesting thing about the two Hadoop offerings is that they both bring something unique to the table. Amazon removes the need for a Hadoop user to locate spare computing resources, always a tough task to accomplish in a typical corporate data center—I mean, who has 15 or 20 machines sitting around idle, just waiting to be used for Hadoop? On the other hand, Cloudera's offering avoids the need to upload large amounts of data to Amazon—a challenge given the limited bandwidth available to most companies; using Amazon's offering also imposes data movement costs, since Amazon charges for data movement in and out of AWS.
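To get a feel for the bandwidth constraint, a back-of-the-envelope calculation helps. The Python sketch below is mine, and the link speeds are hypothetical; it ignores protocol overhead and assumes the link is dedicated to the upload, so real transfers will take longer. It deliberately leaves out Amazon's transfer charges, since those depend on current pricing.

```python
def upload_days(data_terabytes, link_megabits_per_second, utilization=1.0):
    """Days needed to push a data set through a network link.

    Uses decimal units (1 TB = 10**12 bytes, 1 Mbit = 10**6 bits).
    `utilization` is the fraction of the link actually available.
    """
    bits = data_terabytes * 1e12 * 8
    seconds = bits / (link_megabits_per_second * 1e6 * utilization)
    return seconds / 86_400


# Hypothetical cases: a 4 TB data set over two different links.
fast_link = upload_days(4, 100)   # 100 Mbit/s
slow_link = upload_days(4, 10)    # 10 Mbit/s
```

Under these assumptions, 4 TB takes a little under four days at 100 Mbit/s and more than five weeks at 10 Mbit/s, which is why keeping the data in-house can be the more practical choice.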

I suspect that both offerings will prove popular going forward. Each will be used by companies grappling with the need to analyze Internet-scale data. Depending upon the particular project or company constraints, one solution or the other will end up being preferred. In fact, I would not be surprised to see many companies embrace both approaches to Hadoop, once they begin to understand its power.

Bernard Golden is CEO of consulting firm HyperStratus, which specializes in virtualization, cloud computing and related issues. He is also the author of "Virtualization for Dummies," the best-selling book on virtualization to date.
