Revving up data analytics with Apache Spark and Kubernetes

Dell Technologies

The ability to process large amounts of data, whether batch or streaming, is a must-have for organizations that want to leverage data analytics to drive better business decision-making and power the next generation of machine learning applications. Today, Dell Technologies makes this journey easier with a new Dell EMC Ready Solutions for Data Analytics architecture that delivers the capabilities of Apache Spark on Kubernetes.

First, a quick primer on these technologies. Apache Spark is a unified analytics engine that leverages in-memory computing for large-scale data processing. To run Spark processes that are distributed across multiple systems, you need a way to automate deployment and scaling, and containers provide one. Enter Kubernetes, a popular open-source platform that is designed for containerized workloads. Kubernetes orchestrates the creation, placement and lifecycle management of Spark processes across a cluster of x86 servers.
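To make the handoff between Spark and Kubernetes concrete, here is a minimal sketch of how a Spark job might be pointed at a Kubernetes cluster. It only assembles a `spark-submit` command line and does not run it. The helper name `build_spark_submit_args`, the API server address, the container image and the application path are all hypothetical placeholders and are not part of the Dell EMC architecture; the `k8s://` master prefix and the `spark.kubernetes.*` properties are standard Spark settings.

```python
from typing import Dict, List, Optional


def build_spark_submit_args(
    api_server: str,
    image: str,
    app_resource: str,
    app_name: str = "spark-on-k8s-demo",
    executors: int = 2,
    namespace: str = "default",
    extra_conf: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Assemble a spark-submit argument list for a Kubernetes cluster.

    Hypothetical helper for illustration only: it builds the command
    but never executes it.
    """
    conf = {
        # Number of executor pods Kubernetes is asked to create.
        "spark.executor.instances": str(executors),
        # Container image used for the driver and executor pods.
        "spark.kubernetes.container.image": image,
        # Kubernetes namespace in which the pods are created.
        "spark.kubernetes.namespace": namespace,
    }
    conf.update(extra_conf or {})

    args = [
        "spark-submit",
        # The k8s:// prefix tells Spark to use Kubernetes as cluster manager.
        "--master", f"k8s://{api_server}",
        # In cluster mode the driver itself runs in a pod.
        "--deploy-mode", "cluster",
        "--name", app_name,
    ]
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    args.append(app_resource)
    return args


if __name__ == "__main__":
    # All values below are placeholders, not real endpoints or images.
    command = build_spark_submit_args(
        api_server="https://k8s.example.internal:6443",
        image="example/spark-py:latest",
        app_resource="local:///opt/app/job.py",
        executors=3,
    )
    print(" ".join(command))
```

In this sketch, Kubernetes rather than the user decides where the driver and executor pods land; the submitter only states how many executors it wants and which image they should run.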

Dell EMC Ready Solutions for Data Analytics – Spark on Kubernetes brings everything together in a tested, validated architecture that describes the system building blocks for leveraging the growing capabilities of Kubernetes to manage infrastructure for Spark analytics.

With this purpose-built architecture, your data scientists and data engineers can collaborate to build a full analytics pipeline without having to go outside the Spark ecosystem for data ingestion, data cleansing, data merging, model training and API development for inferencing.

The validated architecture includes a demonstration of Jupyter notebooks to enable rapid prototyping and visualization capabilities for your data science team. This method uses the same container or Kubernetes management toolset as all the other Spark-specific services.

Even better, this architecture offers infrastructure guidance for general-purpose data analytics involving all stages of an analytics pipeline using Apache Spark and Kubernetes. Key components of this architecture include Dell EMC PowerEdge servers with 2nd Generation Intel® Xeon® Scalable processors, Isilon H-series storage and Dell EMC PowerSwitch networking. Apache Spark and Red Hat OpenShift, an enterprise Kubernetes-powered platform, have been validated on this hardware stack.

So who can put this solution to work? In short, many organizations across a wide variety of industries. From manufacturing and retail to healthcare and finance, companies can leverage this Apache Spark and Kubernetes-powered data analytics architecture on Dell EMC infrastructure to gain insights from large amounts of batch and streaming data.

Copyright © 2020 IDG Communications, Inc.