Spark a New Way of Thinking in Your Second Phase of Digital Transformation

In part one of this two-part series, I described what many organizations have done in their first phase of digital transformation. Here, in part two, I explain how organizations should think differently as they embark on transforming their data and analytics estates – with Spark as the foundation of those changes.

Many organizations have approached the first phase of digital transformation by embracing application modernization, DevOps, and cloud principles. While many have learned valuable lessons from these recent transformation efforts, the concepts within the application modernization space don’t directly translate to the data and analytics space.

In this article, I expand on those ideas and describe how organizations can think differently to successfully transform their data and analytics environments as they embark on the second phase of digital transformation. Thinking differently means considering changes in organizational structure, operational processes, and technologies. An industrialized approach should serve as the foundation for digital transformation efforts in data and analytics. Let’s explore how this can be done.

Post-Hadoop Era

We have entered an era where Hadoop is “past its prime,” as evidenced by the consolidation among Hadoop distribution vendors, the failure to deliver the data unification that was promised, and continued investments in existing enterprise data warehouses. In fact, the enterprise data warehouse market continues to grow, while traditional, on-premises HDFS-Hadoop solutions are being replaced with cloud-based deployments against other file and object systems. As organizations consider their next-generation data and analytics platform, a common theme has emerged for many of the workloads: Spark.

According to Wikipedia: “Apache Spark is an open-source unified analytics engine for large-scale data processing.” The key terms here are “analytics engine” and “data processing,” which means Spark is a highly flexible platform supporting a variety of workloads in the data and analytics space.

As the technical landscape has evolved beyond Hadoop, so has Spark. Within the Hadoop ecosystem, Spark was deployed as one of many components, with its workloads managed within Hadoop via YARN (a Hadoop resource scheduler). Taking a cue from application modernization, organizations are looking to embrace open-source tools and cloud principles (as I mentioned in my previous article) such as on-demand provisioning and elastic scaling of resources. The requirement now is for Spark to run independently of Hadoop and to be managed via Kubernetes. Spark 2.3, released in 2018, introduced support for running on Kubernetes. In the three years since, significant engineering effort has been put into the project, with version 3.1.1 now generally considered ready for enterprise adoption!
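
As a rough illustration of what running Spark on Kubernetes rather than YARN looks like in practice, the sketch below assembles a spark-submit command that points at a Kubernetes API server. The `k8s://` master URL, cluster deploy mode, and the `spark.executor.instances` and `spark.kubernetes.container.image` properties come from the Spark documentation. The API server address, container image, and application path are placeholders I made up for the example, and the helper function itself is hypothetical.

```python
from typing import List


def build_k8s_submit_command(
    api_server: str,
    image: str,
    app_path: str,
    executors: int = 2,
    name: str = "example-job",
) -> List[str]:
    """Return the argument list for a spark-submit run against Kubernetes.

    Only the flag and property names are taken from the Spark docs;
    every value passed in is an assumption chosen for illustration.
    """
    return [
        "spark-submit",
        # The k8s:// prefix tells Spark to schedule on Kubernetes, not YARN.
        "--master", f"k8s://{api_server}",
        # In cluster mode the driver itself runs in a pod.
        "--deploy-mode", "cluster",
        "--name", name,
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        # A local:// path refers to a file already inside the container image.
        app_path,
    ]


# Hypothetical values; substitute the ones for your own cluster.
command = build_k8s_submit_command(
    api_server="https://k8s.example.internal:6443",
    image="registry.example.internal/spark-py:3.1.1",
    app_path="local:///opt/spark/app/etl_job.py",
    executors=4,
)
print(" ".join(command))
```

The point of the sketch is that the job definition barely changes; what changes is the scheduler it targets and the container image it runs in.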

The two primary workloads that Spark has been optimized for are data engineering and data science, each with its own deployment patterns, processes, and workload-specific libraries that provide functionality to the respective users.

Data Engineering

In many organizations that I speak to, data platform teams have assigned data engineers to build out next-generation platforms that extend the lives of existing data platforms (e.g., Teradata, Oracle, and Hadoop) while enabling new capabilities to better exploit the value of data throughout the enterprise.

Although every organization is unique, the common engine and framework in every architecture is Spark. This is because Spark is a familiar framework with consistent APIs coming from the Hadoop ecosystem and has evolved quickly to meet the needs of data engineering workloads. Data engineers are looking for ways to submit jobs via Apache Livy APIs against elastically scaling, serverless compute farms. With the release of Spark 3.1.1, that is now possible; what holds many organizations back is that Spark is now deployed via Kubernetes, not Hadoop.
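
To give a sense of what submitting a job through Livy involves, here is a minimal sketch that builds (but does not send) a batch submission request. The `POST /batches` endpoint and the `file`, `name`, `args`, and `conf` body fields are part of the Apache Livy REST API, and `spark.dynamicAllocation.enabled` is a standard Spark property. The Livy host name, application path, and helper function are assumptions for illustration, as is the premise that a Livy endpoint is available in front of the cluster.

```python
import json
import urllib.request
from typing import Dict, Optional, Sequence


def build_livy_batch_request(
    livy_url: str,
    app_file: str,
    args: Optional[Sequence[str]] = None,
    conf: Optional[Dict[str, str]] = None,
    name: str = "example-batch",
) -> urllib.request.Request:
    """Build a POST /batches request for a Livy server without sending it."""
    payload = {
        "file": app_file,          # application to run
        "name": name,              # display name of the batch session
        "args": list(args or []),  # command-line arguments for the application
        "conf": dict(conf or {}),  # Spark configuration properties
    }
    return urllib.request.Request(
        url=f"{livy_url.rstrip('/')}/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint and job; nothing is sent over the network here.
request = build_livy_batch_request(
    livy_url="http://livy.example.internal:8998",
    app_file="local:///opt/spark/app/etl_job.py",
    args=["2021-06-01"],
    conf={"spark.dynamicAllocation.enabled": "true"},
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would submit the job. I have left that step out so the sketch runs without a live Livy server.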

Kubernetes brings an increased level of complexity to these environments, and I’ve found many data platform teams frustrated. These teams see the successful deployments of cloud-native applications using Kubernetes but are unable to build on those successes: traditional data engineering workloads can’t run Spark, and the data platform teams themselves are unable to deploy and support Kubernetes environments.

Data Science

In data science organizations, I see the same challenges that data platform teams experience. It boils down to the following: data science teams want to iterate quickly using the latest open-source tools, libraries, and frameworks, but IT teams are unable to implement systems to support that due to the lack of cloud-native and Kubernetes solutions for data science workloads. Even if a data scientist can provision a development environment on their laptop or workstation, when it comes time to train or deploy the model, enterprise IT must be involved. That is because under the covers of that Jupyter notebook is usually an environment running, you guessed it, Spark!

Spark serves as the execution layer of all that Python code and abstracts the complexity of the underlying infrastructure so that jobs run quickly and against large data sets. The trouble for IT teams is keeping up with the blistering pace of innovation and development from the open-source community while deploying stable and secure environments on-premises. The ecosystem of Spark-compatible libraries, from Pandas to PyTorch, and workflows like Kubeflow and MLflow has greatly simplified things for data scientists, but unfortunately that isn’t the case for IT teams. Because these environments don’t conform to the cloud principles and practices seen with cloud-native applications, most IT teams are stuck either manually configuring data science environments or writing cumbersome scripts that require constant updates and monitoring.

Considering the above, many organizations are looking to the public cloud for solutions that provide Spark on Kubernetes without being tied to a Hadoop vendor. Although there are maturing and compelling offerings in the cloud, from both the cloud vendors and independent software companies, all the options have drawbacks: the cloud vendors’ offerings are constrained to the environments those vendors manage, while the independent software companies’ offerings are cloud-only, engineered versions of Spark. While organizations are weighing the trade-offs between deploying and managing open-source Spark themselves and getting locked in to an engineered version of Spark, they have failed to consider a third option.

HPE Ezmeral Spark

HPE Ezmeral provides the best of both worlds: A 100% open-source Spark on open-source Kubernetes with the full features, libraries, and workflows that data engineering and data science teams need, with 24/7 support for both the Spark Operator and Kubernetes deployments!

HPE Ezmeral delivers a number of pre-built open-source applications as well as a marketplace of data science and data engineering solutions to meet the needs of any organization. Those solutions are deployed in isolated tenants that come with pre-configured security, data access, source-code integration, and model registries. This allows users to get straight to work without needing to build containers, wire up workflows, or wait for IT tickets to be completed.

By adopting HPE Ezmeral, data engineering and data science teams can increase their productivity and focus on solving the critical data-centric problems of the organization. IT teams go from being the bottleneck to the enabler of the next stage of digital transformation.


About Matt Maccaux

As Global Field CTO for HPE Ezmeral software, Matt brings deep subject-matter expertise in big data analytics and data science, machine learning, application development & modernization, and IoT as well as cloud, virtualization, and containerization technologies.

Copyright © 2021 IDG Communications, Inc.