Large Data Set Analysis in the Cloud: Amazon, Cloudera Improve Hadoop

Traditional business intelligence solutions can't scale to the degree necessary in today's data environment. One solution getting a lot of attention recently: Hadoop, an open-source product inspired by Google's search architecture.

By Bernard Golden
Thu, April 09, 2009

CIO — Traditional business intelligence solutions can't scale to the degree necessary in today's data environment. One solution getting a lot of attention recently: Hadoop, an open-source product inspired by Google's search architecture. Twenty years ago, most companies' data came from fundamental transaction systems: Payroll, ERP, and so on. The amounts of data seemed large, but usually were bounded by well-understood limitations: the overall growth of the company and the growth of the general economy. For those companies that wanted to gain more insight from those systems' data, the related data warehousing systems reflected the underlying systems' structure: regular data schema, smooth growth, well-understood analysis needs. The typical business intelligence constraint was the amount of processing power that could be applied. Consequently, a great deal of effort went into the data design to restrict the amount of processing required to the available processing power. This led to the now time-honored business intelligence data warehouses: fact tables, dimension tables, star schemas.

Today, the nature of business intelligence is totally changed. Computing is far more widespread throughout the enterprise, leading to many more systems generating data. Companies are on the Internet, generating huge torrents of unstructured data: searches, clickstreams, interactions, and the like. And it's much harder—if not impossible—to forecast what kinds of analytics a company might want to pursue.

Today it might be clickstream patterns through the company website. Tomorrow it might be cross-correlating external blog postings with order patterns. The day after it might be something completely different. And the system bottleneck has shifted. While in the past the problem was how much processing power was available, today the problem is how much data needs to be analyzed. At Internet-scale, a company might be dealing with dozens or hundreds of terabytes. At that size, the number of drives required to hold the data guarantees frequent drive failures. And attempting to centralize the data imposes too much network traffic to conveniently migrate data to processors.

One thing is clear: the traditional business intelligence solutions can't scale to the degree necessary in today's data environment.

Fortunately, several solutions have been developed. One, in particular, has gotten a lot of attention recently: Hadoop. Essentially, Hadoop is an open source product inspired by Google's search architecture. Interestingly, unlike previous open source products that were usually implementations of previously-existing proprietary products, Hadoop has no proprietary predecessor. The innovation in this aspect of big data resides in the open source community, not in a private company.

Hadoop creates a pool of computers, each with a special Hadoop file system. A central master Hadoop node spreads data across each machine in a file structure designed for large block data reads and writes. It uses a clever hash algorithm to cluster data elements that are similar, making processing data sets extremely efficient. For robustness, three copies of all data is kept to ensure that hardware failures do not halt processing.

When it comes time to mine the data, the programmer can avoid all details of how the data is laid out. A single function is used to organize the overall data set by reading through it and outputting an aggregation organized as key/value pairs. This is known as a map function. A second function—known as the reduce function —then goes through the aggregation output by the map function and selects the desired data, outputting it to a temporary file, organizing it in a table in memory, or even putting it into a data mart to be analyzed with traditional BI tools.

The advantage of this approach is that very large sets of data can be managed and processed in parallel across the machine pool managed by Hadoop. The map/reduce approach is sometimes criticized for being inefficient, since the overall data pool is processed each time a new analysis is desired. While it's true that repeated processing is typical, it's also true that in today's world it's impossible to know beforehand what kinds of analyses are going to be desired in the future. This means that optimizing to reduce the "inefficient" repeated processing would also limit the potential for exploring unforeseen ongoing analytical requirements. Moreover, processing is significantly cheaper than data, relatively speaking. A general rule of thumb is to optimize for the least efficient, most costly resource; with Internet-scale data, that means "wasting" processing to optimize storage.

While Hadoop may seem like a product you have no need for, don't dismiss it. The changing nature of IT, as well as the rapid evolution of business processes, means that you'll likely face the need for this kind of analytical tool in the very near future.

The power of Hadoop can be seen in how the NY Times used it to convert a 4 Tb collection of its pages from one format to another. The programmer assigned the task uploaded the data to a number of Amazon EC2 instances and then ran a Hadoop map/reduce transformation on the pages. Two days later, all of the pages had been converted, speeded up by the parallel processing across 20 machines made possible by Hadoop. Attempting to do the same conversion on one machine would have taken well over a month; attempting to perform the parallel processing without Hadoop would have necessitated creation of a large, complex grid fabric. Hadoop abstracted all the "plumbing" (i.e., spreading the data across machines, coordinating the parallel processing, ensuring that the job was executing properly, etc.) away from the programmer, enabling him to focus on the actual task: creating the document conversion routine executed in the reduce phase of the process.

Hadoop has attracted vendor attention. A new startup, Cloudera, offers certified releases and regular updates, as well as technical support. (Incidentally, Cloudera offers an set of online Hadoop courses, which are excellent). And just last week, Amazon announced it will offer Hadoop support in its Elastic MapReduce, making Hadoop available in the cloud.

The interesting thing about the two Hadoop offerings is that they both bring something unique to the table. Amazon removes the need for a Hadoop user to locate spare computing resources, always a tough task to accomplish in a typical corporate data center—I mean, who has 15 or 20 machines sitting around idle, just waiting to be used for Hadoop? On the other hand, Cloudera's offering avoids the need to upload large amounts of data to Amazon—a challenge given the limited bandwidth available to most companies; using Amazon's offering also imposes data movement costs, since Amazon charges for data movement in and out of AWS.

I suspect that both offerings will prove popular going forward. Each will be used by companies grappling with the need to analyze Internet-scale data. Depending upon the particular project or company constraints, one solution or the other will end up being preferred. In fact, I would not be surprised to see many companies embrace both approaches to Hadoop, once they begin to understand its power.

Bernard Golden is CEO of consulting firm HyperStratus, which specializes in virtualization, cloud computing and related issues. He is also the author of "Virtualization for Dummies," the best-selling book on virtualization to date.

In this paper, Forrester Consulting examines the total economic impact and potential return on investment (ROI) realized by three Enterprise organizations as they virtualized mission-critical Oracle databases on the VMware vSphere platform. The purpose of this study is to provide readers with a framework to evaluate the potential financial impact of VMware vSphere on their organizations.
Even though virtualization has brought positive change to enterprise IT over the last decade, some skepticism remains about how valuable virtualization can be in the way companies deliver and run business applications. Uncover the truth about how you can run your business critical applications with confi dence without sacrifi cing
availability or service quality-and at lower costs.
This IDG whitepaper highlights key findings based on the Quickpoll Survey conducted with more than 300 Enterprise and Commercial IT decision makers worldwide about the state of their virtualization of business critical applications. This paper answers such questions as: What drivers are pushing companies to extend virtualization beyond servers? and What value are they realizing? Central to the paper are key results that expose risks of the past (fears of limited ISV support, performance impact) no longer are a factor for companies moving to 80+% virtualized.
The Kelley School of Business at Indiana University deployed VMware Infrastructure which decreases costs, streamlines server deployment, and reduces energy consumption.
New study quantifies how VMware improved TCO and ROI for three companies' IT landscapes.
This IDC white paper explains how much of the Enterprise IT community is at a crossroads in extending their journey to the private cloud: Companies must virtualize their business critical applications in order to reap the benefits of cloud computing. The paper also includes two case studies and a sidebar highlighting the experiences of three enterprises with virtualizing their business-critical applications, which include Oracle and Microsoft SQL databases, SAP and enterprise Java, and a Microsoft Exchange email system.
As greater numbers of datacenter servers transition from the physical to the virtual world, the components of virtualization success come to the fore. What scores of organizations have discovered is that success is derived from an optimal pairing of the right software platform with the right hardware platform.
Virtualizing business-critical applications is an essential step in your journey to the cloud. Microsoft SQL Server, Exchange and SharePoint, and Oracle applications, are often the backbone of business IT. The benefits of virtualizing these applications extend far beyond mere consolidation. Understanding how VMware improves quality of service and agility while reducing costs will help you make the case for taking virtualization to the next level in your company.
Virtualizing business-critical applications has become a key focus for organizations as they move along their virtualization journey. With the launch of VMware vSphere® 5, VMware is helping customers accelerate the deployment of business-critical applications, including Exchange, SQL, SAP and Oracle.
Want to say goodbye to missed SLAs? VMware can help you virtualize mission-critical applications such as Oracle, MS Exchange and SharePoint to achieve dramatic improvements in uptime, performance and responsiveness. In this webcast, we'll discuss the key benefits of virtualizing your agency's most critical applications and Oracle databases as a necessary first step in fulfilling OMB's mandate to move IT services to the cloud. With VMware, you'll be on the way to quick, effective and full compliance.
Federal IT managers are on the forefront of realizing the benefits that a secure, easy-to-manage virtual desktop environment can provide. The key is how to deliver the end-user experience that is comparable to a physical desktop. This webcast will show how the recently released VMware View 5 environment is being used to deploy virtual desktops to provide mission-critical solutions around Disaster Recover/COOP, telework and secure mobile applications to federal organizations. View this webcast and learn how new features and benefits of the VMware View 5 environment meet the needs of Federal customers
This video webcast is designed to help those with little to no virtualization experience understand why virtualization and VMware are so important to driving down both capital and operational costs. The session will start with the introduction of the key concepts and technologies of virtualization, introduce the vSphere Hypervisor, and build up to an overview of VMware vSphere® 5, the world's most robust and complete virtualization platform. This session will also discuss new solutions such as the vSphere Storage Appliance and VMware GO that are making it easier than ever before to get started with virtualization.
Newsletter Sign-Up »

Receive the latest news test, reviews and trends on your favorite technology topics

Choose a newsletter
  1. View all Newsletters | Privacy Policy
Resource Center