Lifting big data to the sky: Hadoop-as-a-Service is gaining rapid traction

The cloud is becoming the location of choice for organizations expediting their digital activities. Here's what to consider when evaluating a HaaS provider.

hadoop aws primary
Thinkstock

Hadoop is an open source-based software framework that enables high throughput processing of big data quantities across distributed clusters.

What started as niche market several years ago is now entering the mainstream. With the rapid expansion of the digital universe, Hadoop provides ample use cases allowing big data processing utilizing plain commodity hardware.

It’s also highly scalable from a single server to multiple server farms with each cluster running its own compute and storage. Hadoop provides high availability at the application layer, hence cluster hardware can be off-the-shelf, making the nodes easily interchangeable and cost efficient.

The cloudification trend

While early adopters typically used an on-premise deployment leveraging one of the several Apache distributions, organizations are increasingly taking advantage of the cloud. In contrast, a “do-it-yourself” (DIY) approach can be tedious and time consuming.

As demand outweighs supply, skilled engineers with in-depth Hadoop experience are rare and expensive. Buying hardware is one thing, but building an analytics platform in a trial-and-error attempt can be lengthy and quite costly, too.

As time-to-market matters a great deal in the digital age, an increasing number of companies are taking advantage of Hadoop-as-a-Service (HaaS) offerings that are emerging quickly and enjoying high rates of adoption.

Using the cloud as the preferred destination can make a lot of sense from a user perspective. With lower costs per unit due to economies of scale, organizations gain efficiencies, avoid capital expenditures, and achieve much greater flexibility.

Besides commercial benefits, the cloud most importantly just opens up a completely new array of digital use cases—especially in the context of the IoT and other scenarios requiring real-time data processing. AWS’ Elastic Map Reduce (EMR) was one of the pioneering offerings in this space.

But not only have basically all large service providers meanwhile added a cloud-based Hadoop hosting to their portfolio, but the distro vendors themselves are also making efforts to “cloudify” their frameworks, with Cloudera’s Altus being one of the recent examples. Altus allows users to run data processing jobs leveraging either Hive on MapReduce or Spark on demand. Cloudera already publically announced their intention to extend their services toward other leading public clouds such as Microsoft Azure, with other vendors likely to follow.

Market developments

In strong pursuit toward the cloud, more and more organizations are opting for Hadoop-as-a-Service. HaaS is essentially a Platform-as-a-Service (PaaS) sub-category, comprising virtual storage and compute resources as well as Hadoop-based processing and analytics frameworks. Service providers typically operate a multi-tenant HaaS environment, allowing the hosting of multiple customers on a shared infrastructure.

As organizations are increasingly embracing a “cloud-first” mindset, the HaaS market is projected to garner $16.1 billion in revenues by 2020, registering a stellar compound annual growth rate (CAGR) of 70.8 percent from 2014 to 2020, as reported by Allied Market Research. North America is still the leading region from a revenue perspective, followed by Europe and Asia Pacific.

The outburst of HaaS is expected to overcast the growth of the on-premises Hadoop market through 2020. According to IDC’s research, public cloud deployments already account for 12 percent of the overall worldwide business analytics software market and are expected to grow at a CAGR of 25 percent through 2020. Besides large corporates, small and medium-sized firms are also increasingly opting for HaaS to derive actionable insights and create data-centric business models.

Things to contemplate when considering HaaS

While there are undoubtedly plenty of use cases when leveraging HaaS, there are some drawbacks too. Moving shiploads of data into the cloud might have latency implications and require additional bandwidth. While a highly standardized HaaS environment can be conveniently deployed with just a few clicks, the design authority is at the sole discretion of the service provider. Moreover, data in the cloud unfolds gravity and leads to a lock-in effect. Here are some examples of what else to consider when evaluating a HaaS provider:

Elasticity

Hadoop supports elastic clusters for a wide range of workloads, which is even more important when considering a cloud-based deployment. What are the available compute and storage options to support different use cases? For example, what additional compute blades are available for high I/O workloads? How scalable is the environment and how easily can additional resources (compute, storage) be commissioned?

Persistent use of HDFS

Although HDFS as a persistent data store isn’t required, there are clear benefits when utilizing it. HDFS uses commodity direct attached storage (DAS) and shares the cost of the underlying infrastructure. Furthermore, HDFS seamlessly supports YARN and MapReduce, enabling it to natively process queries and serve as a data warehouse.

Billing

What’s the underlying price metric of the service provider (billed as ordered, as consumed, etc.)? How flexible can services be decommissioned if capacity is underutilized for example? Most importantly, keeping in mind the fast expansions of the data lake, how do prices scale over time?

High availability

Achieving “zero outage” is a delicate but very important matter. What’s the provider’s SLA and fail-over concept? How is redundancy being accomplished? For example, is the provider capable of isolating and restarting a single machine without interrupting an entire job (aka “non-stop operations”)?

Interoperability

Since use cases tend to gain sophistication over time, how easy is it to integrate other services you might already be using or are planning to use? Which data streams and APIs are supported, and how well are they documented?

Need for talent

While there is significantly less manpower needed when setting up a HaaS environment as opposed to a DIY approach, Hadoop doesn’t work entirely out of the box. The nodes will be running with just a few clicks, but this is when the actual work begins. The customization will still require time and effort.

This article is published as part of the IDG Contributor Network. Want to Join?

Participate in the 2019 State of the CIO survey. Make your voice heard.