by Frank J. Ohlhorst

How to Use Hadoop to Overcome Storage Limitations

Apr 18, 2012 · 6 mins
Business Intelligence, Data Center, Data Management

Big data is all about storing and accessing large amounts of structured and unstructured data. However, where to put that data and how to access it have become the biggest challenges for enterprises looking to leverage the information. If you haven't yet considered the open source Hadoop platform, now's the time.

Storage technology has evolved and matured to the point where it has started to approach commodity status in many data centers. Nevertheless, today’s enterprises are faced with evolving needs that can strain storage technologies—a case in point is the push for big data analytics, an initiative that brings business intelligence (BI) capabilities to large data sets.

However, the big data analytics process demands capabilities that are usually beyond typical storage paradigms. Simply put, traditional storage technologies, such as SANs, NAS and others, cannot natively deal with the terabytes and petabytes of unstructured information that come with the big data challenge.

Success with big data analytics demands something more: a new way to deal with large volumes of data or, in other words, a new storage platform ideology.

Let’s Hear It for Hadoop

Enter Hadoop, an open source project that offers a platform to work with big data. Although Hadoop has been around for some time, more and more businesses are just now starting to leverage its capabilities.

The Hadoop platform is designed to solve problems caused by massive amounts of data, especially data that contain a mixture of complex, unstructured and structured information, which does not lend itself well to being placed in tables. Hadoop works well in situations that require support for analytics that are deep and computationally intensive, like clustering and targeting.

So what exactly does Hadoop mean for IT professionals seeking to leverage big data? The simple answer is that Hadoop solves the most common problem associated with big data: efficiently storing and accessing large amounts of data.

The intrinsic design of Hadoop allows it to run as a platform that is able to work across a large number of machines that don’t share any memory or disks. With that in mind, it becomes easy to see how Hadoop offers additional value—network managers can simply buy a number of commodity servers, place them in a rack and run the Hadoop software on each one.

What’s more, Hadoop helps to remove much of the management overhead associated with large data sets. Operationally, as an organization’s data is being loaded into a Hadoop platform, the software breaks down the data into manageable pieces, which are then automatically spread across different servers.
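The splitting and spreading described above can be pictured with a minimal Python sketch. It is an illustration only, not Hadoop's actual code: the function names, the server names, the tiny block size and the round-robin placement are all assumptions made for the example.

```python
# Illustrative sketch only: cut a data set into fixed-size pieces and
# spread the pieces across servers. All names here are hypothetical.

def split_into_blocks(data, block_size):
    """Cut a byte string into consecutive pieces of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, servers):
    """Assign block indexes to servers in simple round-robin order (an assumption)."""
    placement = {server: [] for server in servers}
    for index in range(len(blocks)):
        placement[servers[index % len(servers)]].append(index)
    return placement


data = b"x" * 1000                                  # stand-in for a large file
blocks = split_into_blocks(data, block_size=256)    # 4 pieces: 256, 256, 256, 232 bytes
placement = place_blocks(blocks, ["node-a", "node-b", "node-c"])
print(placement)  # {'node-a': [0, 3], 'node-b': [1], 'node-c': [2]}
```

The point of the sketch is that the person loading the data never decides where each piece goes; the placement step does that on their behalf.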

The distributed nature of the data means there is no single place to go to access the data. Hadoop keeps track of where the data resides, and further protects that information by storing multiple copies on different servers. Resiliency is enhanced, because if a server goes offline or fails, the data it held can be automatically re-replicated from a known good copy.
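To make the recovery idea concrete, here is a small sketch of a location map that tracks which servers hold each piece of data and restores the copy count after a failure. It is a simplification under stated assumptions: the function names, the server names and the "first live server that lacks a copy" rule are hypothetical and are not taken from Hadoop itself.

```python
# Illustrative sketch only: track copies of each block and restore the
# copy count when a server fails. All names here are hypothetical.

def replicate(block_ids, servers, copies):
    """Record `copies` distinct servers for each block, in round-robin order."""
    locations = {}
    for i, block in enumerate(block_ids):
        locations[block] = {servers[(i + k) % len(servers)] for k in range(copies)}
    return locations


def handle_failure(locations, failed, live_servers, copies):
    """Drop the failed server, then copy each affected block to another live server."""
    for block, holders in locations.items():
        holders.discard(failed)
        if not holders:
            raise RuntimeError("no surviving copy of " + block)
        for candidate in live_servers:
            if len(holders) >= copies:
                break
            if candidate not in holders:
                holders.add(candidate)  # stands in for copying from a surviving replica
    return locations


servers = ["n1", "n2", "n3", "n4"]
locations = replicate(["b0", "b1", "b2"], servers, copies=2)
locations = handle_failure(locations, failed="n2",
                           live_servers=["n1", "n3", "n4"], copies=2)
print(locations)
```

After the simulated failure of `n2`, every block is again held by two live servers, which is the behavior the paragraph above describes.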

How Hadoop Goes Further

The Hadoop paradigm goes several steps further when it comes to working with data. Take, for example, the limitations associated with a traditional, centralized database system, which may consist of a large disk drive connected to a server class system that features multiple processors. In that scenario, analytics is limited by the performance of the disk and, ultimately, the number of processors that can be brought to bear.

With a Hadoop deployment, every server in the cluster can participate in the processing of the data through Hadoop’s capability to spread the work and the data across the cluster. In other words, an indexing job works by sending code to each of the servers in the cluster and each server then operates on its own little piece of the data. Results are then delivered back as a unified whole. With Hadoop, the process is referred to as MapReduce, where the code and processes are mapped to all the servers and the results are reduced into a single set.
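The indexing job described above can be sketched in a few lines of plain Python. This is a single-process imitation of the map, group and reduce steps, not Hadoop's API: the function names, the `shards` layout and the sample documents are assumptions made for illustration.

```python
# Illustrative sketch only: a word-to-document index built in the
# MapReduce style. All names and sample data are hypothetical.
from collections import defaultdict


def map_phase(doc_id, text):
    """Map step: emit a (word, doc_id) pair for every word in one document."""
    return [(word.lower(), doc_id) for word in text.split()]


def shuffle(pairs):
    """Group the emitted values by key so each word is reduced once."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, doc_ids):
    """Reduce step: collapse one word's document ids into a sorted, unique list."""
    return word, sorted(set(doc_ids))


# Each server holds, and maps over, only its own piece of the data.
shards = {
    "node-a": {"doc1": "big data storage"},
    "node-b": {"doc2": "big cluster"},
}

mapped = []
for server, docs in shards.items():
    for doc_id, text in docs.items():
        mapped.extend(map_phase(doc_id, text))

index = dict(reduce_phase(word, ids) for word, ids in shuffle(mapped).items())
print(index["big"])  # ['doc1', 'doc2']
```

In a real cluster the map calls would run in parallel on the servers that store each shard; here they run in a loop so the flow is easy to follow.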

That process is what makes Hadoop so good at dealing with large amounts of data. Hadoop spreads the data out and can handle complex computational questions by harnessing all of the available cluster processors to work in parallel.

Understanding Hadoop and Extract, Transform and Load

However, venturing into the world of Hadoop is not a plug-and-play experience. There are certain prerequisites, hardware requirements and configuration chores that must be met to ensure success. The first step consists of understanding and defining the analytics process.

Luckily, most IT leaders are familiar with business analytics (BA) and BI processes and can relate to the most common process layer used, the extract, transform and load (ETL) layer, and the critical role it plays when building BA/BI solutions.

Big data analytics requires that organizations choose the data to analyze, consolidate it and then apply aggregation methods before it can be subjected to the ETL process. What’s more, that has to occur with large volumes of data, which can be structured, unstructured or from multiple sources, such as social networks, data logs, websites, mobile devices, sensors and other areas.

Hadoop accomplishes that by incorporating pragmatic processes and considerations, such as a fault-tolerant clustered architecture and the capability to move computing power closer to the data and perform parallel and/or batch processing of large data sets. It also provides an open ecosystem that supports enterprise architecture layers from data storage to analytics processes.

Not all enterprises require the capabilities that big data analytics has to offer. However, those that do must consider Hadoop’s capability to meet the challenge. But Hadoop cannot accomplish everything on its own—enterprises will need to consider what additional Hadoop components are needed to build a Hadoop project.

For example, a starter set of Hadoop components may consist of HDFS and HBase for data management, MapReduce and Oozie as a processing framework, Pig and Hive as development frameworks for developer productivity and open source Pentaho for BI.

From a hardware perspective, a pilot project does not require massive amounts of equipment thrown at it. Hardware requirements can be as simple as a pair of servers with multiple cores, 24 or more gigabytes of RAM and a dozen or so hard disk drives of two terabytes each, which should prove sufficient to get a pilot project off the ground.
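A rough back-of-the-envelope check shows what such a pilot cluster could hold. The sketch assumes the dozen drives are the total across both servers, uses a replication factor of 3 (the HDFS default for `dfs.replication`), and assumes that copies of a block must sit on different servers, so two servers can hold at most two copies. It ignores space needed for the operating system and temporary files. The function name is hypothetical.

```python
# Illustrative sketch only: estimate usable storage once copies are
# accounted for. The function name and inputs are assumptions.

def usable_capacity_tb(disks, disk_tb, replication, servers):
    """Raw capacity divided by the number of copies that can actually be kept."""
    raw_tb = disks * disk_tb
    effective_copies = min(replication, servers)  # one copy per server at most
    return raw_tb / effective_copies


pilot_tb = usable_capacity_tb(disks=12, disk_tb=2, replication=3, servers=2)
print(pilot_tb)  # 12.0
```

Under those assumptions, 24 TB of raw disk yields roughly 12 TB of usable space, which is worth keeping in mind when choosing the data set for a pilot.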

However, be forewarned that effective management and implementation of Hadoop require some expertise and experience, and if that expertise is not readily available, IT management should consider partnering with a service provider that can offer full support for the Hadoop project. That expertise proves especially important when it comes to security. Hadoop, HDFS and HBase offer very little in the form of integrated security, so data still need additional protections against compromise or theft.

All things considered, an in-house Hadoop project makes the most sense for a pilot test of big data analytics capabilities. After the pilot, a plethora of commercial or hosted solutions are available to those who want to tread further into the realm of big data analytics.

Frank J. Ohlhorst is a New York-based technology journalist and IT business consultant.