Is there life after Hadoop? The answer is a resounding yes.


Over the past few years, numerous eulogies have been given for Hadoop – the powerful, open-source framework for storing and processing data, named after a toy elephant. Of course, one could argue that Hadoop isn't all dead – just mostly dead. Still, many organizations that invested heavily in the Hadoop ecosystem have found themselves at a crossroads, wondering what life after Hadoop is like and what lies ahead. This article addresses that question and lays out a strategy for organizations entering the post-Hadoop era.

Remembering the elephant

For many organizations, life with Hadoop was pretty good: it provided real muscle for handling large amounts of unstructured data. For some, SQL-on-Hadoop solutions even helped offload work from more complex (and expensive) data warehouses. That said, the care and feeding of Hadoop, like that of other elephants, was, well… not trivial – especially when it came time for baths and cleaning out the pen. I'll not digress any further into personal pet trauma, but suffice it to say that living with large animals had its downsides.

For instance, the 3x replication scheme for data stored in the Hadoop distributed filesystem (HDFS 2.X) incurred a 200% overhead in storage and other resources. Also, the lack of separation between compute and storage in Hadoop clusters resulted in chronic under-utilization of compute (i.e., CPU) resources while constantly maxing out the storage on the servers. This fueled another unfortunate byproduct – Hadoop cluster sprawl. As data growth exploded, organizations found themselves with an ever-growing number of Hadoop clusters, each with its own complex configuration, poor compute utilization, and voracious appetite for storage.
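
As a rough sketch of that overhead (illustrative arithmetic only, not tied to any particular cluster): storing each block three times turns 1 TB of logical data into 3 TB on disk, a 200% overhead, while the Reed-Solomon erasure coding introduced in HDFS 3.x can cut that to roughly 50%:

```python
# Illustrative storage-overhead arithmetic for HDFS replication vs. erasure
# coding. Figures are back-of-the-envelope; real clusters add further overhead
# for metadata, small files, and non-default policies.

def raw_bytes_needed(logical_bytes: float, replication: int = 3) -> float:
    """Raw disk consumed when every block is stored `replication` times."""
    return logical_bytes * replication

def overhead_pct(logical_bytes: float, raw_bytes: float) -> float:
    """Extra storage as a percentage of the logical data size."""
    return (raw_bytes - logical_bytes) / logical_bytes * 100

one_tb = 1.0
print(overhead_pct(one_tb, raw_bytes_needed(one_tb)))  # -> 200.0 (3x replication)

# HDFS 3.x erasure coding with the RS(6,3) policy stores 6 data blocks plus
# 3 parity blocks, i.e. 1.5x raw storage for the same durability class.
rs_raw = one_tb * (6 + 3) / 6
print(overhead_pct(one_tb, rs_raw))                    # -> 50.0
```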

Yes, life with Hadoop wasn't always easy, and Hadoop-based applications were part of the problem. Hadoop MapReduce provided a lot of power for manipulating large amounts of data, but at a price: the lackluster performance of large-scale, disk-based MapReduce applications made them poorly suited to the new wave of data-driven applications. Moreover, the Hadoop market itself went through major upheavals in the past few years that understandably gave organizations pause when thinking about Hadoop's future. Given the challenges and the uncertainty, many organizations concluded that, despite its usefulness, it was time for the elephant to go.

A double-sided dilemma

Of course, for organizations considering life after Hadoop there were two central questions to answer:

  • What do I do with my Hadoop Distributed File System (HDFS) data?
  • What do I do with my Hadoop-based (e.g., MapReduce) applications that consume it?

The answers may have seemed simple and obvious at first, but many organizations learned that they weren't as clear-cut as first thought. Managing mountains of data was never easy, and the challenges didn't disappear with the advent of distributed filesystems like HDFS. Nor did MapReduce magically eliminate the challenges of distributed, data-driven applications.

The essential truth is that Hadoop was designed and optimized for the data needs of another era. Today's data landscape looks very different than it did a decade ago. Still, the two main drivers for data technology adoption remain the same – price and performance – and Hadoop is no longer a leader in either category. This has raised some difficult questions with few easy answers. However, two clear strategies have emerged to help organizations transition into the post-Hadoop era.

1 – Build a better lake

For a long time, the Hadoop data lake was the preferred strategy for managing large amounts of unstructured data. Just pump everything into the lake and let MapReduce applications process it. However, things were never quite that simple and most data lakes still involved a lot of copying and inefficient data movement. Moreover, as emerging technologies challenged key assumptions for data management and gradually supplanted Hadoop services like HDFS and MapReduce, it became clear that a better approach was needed for managing large amounts of data.

Enter the data fabric. Fundamentally, a data fabric is a means of efficiently accessing and sharing data in a distributed environment, bringing together disparate data assets and making them accessible through a managed set of augmented data services. The data fabric picks up where the Hadoop data lake left off, providing efficient, multi-protocol access to extremely large volumes of data while minimizing data movement and helping to provide much-needed separation between compute and storage.

In today’s data landscape, a single-protocol, Hadoop data lake is simply inadequate to meet the current challenges. In contrast, modern data fabrics like the HPE Ezmeral Data Fabric provide greatly enhanced capabilities while still providing HDFS-based access to centrally managed data assets.

2 – Optimize the compute

As mentioned earlier, Hadoop MapReduce applications have provided a lot of muscle over the years for wrangling data, and MapReduce still performs well for certain tasks (e.g., distcp). However, its performance across a wide variety of other use cases has never been great. As a result, newer engines like Spark have emerged to address the shortcomings of MapReduce.

Spark introduced important innovations, moving beyond MapReduce's limited functional vocabulary (map and reduce operating on rows of data) to embrace a columnar approach to data and to represent computations as Directed Acyclic Graphs (DAGs). This approach works well for sophisticated workloads like machine learning and graph analytics. Spark's innovations, combined with its in-memory processing model, have yielded dramatic performance improvements – as much as 100 times faster than MapReduce in some cases.

Spark owes its greatly improved performance to several factors, including:

  • Spark is not I/O bound. With its in-memory processing model, Spark doesn't incur a disk I/O penalty between the steps of a job, whereas MapReduce writes intermediate results to disk after each map and reduce phase.

  • Spark's DAGs enable optimizations across task steps. MapReduce, by contrast, has no notion of the connections between successive steps, so no performance tuning can occur at that level.
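
The lazy, plan-then-execute model behind these two points can be illustrated with a minimal pure-Python sketch (this is not the Spark API, just the idea): transformations are recorded rather than run, and only an action triggers a single in-memory pass over the data, with no intermediate results hitting disk between steps.

```python
# Minimal, pure-Python sketch of DAG-style lazy evaluation, the model Spark
# uses. Names (LazyDataset, plan) are illustrative, not real Spark classes.

class LazyDataset:
    def __init__(self, source, plan=None):
        self._source = source      # in-memory input, read once
        self._plan = plan or []    # recorded transformations (the "DAG")

    def map(self, fn):
        # A "transformation": record the step instead of executing it.
        return LazyDataset(self._source, self._plan + [("map", fn)])

    def filter(self, pred):
        return LazyDataset(self._source, self._plan + [("filter", pred)])

    def collect(self):
        # An "action": run the whole recorded plan in one in-memory pass,
        # writing nothing to disk between steps.
        out = []
        for item in self._source:
            keep = True
            for kind, fn in self._plan:
                if kind == "map":
                    item = fn(item)
                elif kind == "filter" and not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out

squares = LazyDataset(range(10)).map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)
print(evens.collect())  # -> [0, 4, 16, 36, 64]
```

Because the full plan is known before anything runs, an engine like Spark can fuse steps, reorder them, and keep hot data cached in memory – exactly the optimizations unavailable when each MapReduce step is an independent job.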

Moreover, Spark benefits from a flexible deployment architecture: Spark clusters can be deployed using a variety of cluster managers, including Kubernetes as well as Hadoop YARN. Hadoop MapReduce can still be a sensible choice for simple, very large batch jobs, but for most other use cases, Spark is the better choice.

Given Spark’s flexibility, suitability for AI and machine learning, and its vastly improved performance, its adoption has increased dramatically in recent years. Investing in Spark and related technologies is a sound strategy for the future. 

Life after Hadoop

The past few years have seen a dramatic shift towards AI- and data-driven applications as well as more diverse data storage. This shift, combined with Hadoop complexity, performance challenges, and market consolidation, has resulted in a sharp decline in Hadoop use – prompting many organizations to wonder about life after Hadoop.

Going forward, organizations should consider shrinking their Hadoop investment and embracing a Spark-plus-data-fabric strategy as an alternative. HPE Ezmeral software covers both facets: the HPE Ezmeral Data Fabric provides an enterprise data fabric, while the HPE Ezmeral Container Platform provides enhanced support for Spark along with the ability to manage remaining Hadoop assets in containers, using a common control plane for both Spark and Hadoop workloads.

By adopting HPE Ezmeral, organizations can ease their transition into a post-Hadoop era while freeing up time and resources to focus on the emerging data challenges of the business.

____________________________________

About Randy Thomasson

As a Global Solution Architect for HPE Ezmeral software, Randy provides technical leadership, strategy, and architectural guidance spanning a wide range of technologies and disciplines, including application development and modernization, big data and advanced analytics, infrastructure automation, in-memory and NoSQL data technologies, and DevOps.

Copyright © 2021 IDG Communications, Inc.