7 tools to fire up Spark's big data engine

For data processing, Spark is catching fire. Here's a look at the kindling fueling Spark's big data appeal

7 ways Spark stokes big data's fire

Visual tour of the Spark ecosystem

Apache Spark didn’t merely make big data processing faster; it also made it simpler, more powerful, and more convenient. Spark isn't only one thing; it's a collection of components under a common umbrella. And each component is a work in progress, with new features and performance improvements constantly rolled in.

Here’s an introduction to each of the major components in the Spark ecosystem -- what each piece does, why it matters, how it has evolved, where it might fall short, and where it’s likely to go from here.

Spark Core

At the heart of Spark is the aptly named Spark Core. In addition to coordinating and scheduling jobs, Spark Core provides the basic abstraction for data handling in Spark, known as the Resilient Distributed Dataset (RDD).

RDDs support two kinds of operations on data: transformations and actions. The former applies changes to the data and hands back the result as a newly created RDD; the latter computes a result from an existing RDD (such as an object count).

Spark gets much of its speed from keeping data in memory across transformations and actions. Transformations are lazily evaluated, meaning they're only executed once an action needs their results; a side effect is that it can be hard to pin down which step is running slowly.
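Here's a minimal sketch of that distinction in Scala, run against a local cluster (the app name, master setting, and data are placeholders): filter is a transformation that lazily builds a new RDD, and count is the action that actually triggers the computation.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: one transformation (filter) and one action (count) on an RDD.
object RddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))

    val numbers = sc.parallelize(1 to 1000000)  // base RDD
    val evens   = numbers.filter(_ % 2 == 0)    // transformation: lazy, returns a new RDD
    val count   = evens.count()                 // action: triggers the actual computation

    println(s"even numbers: $count")
    sc.stop()
  }
}
```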

Spark’s speed is a work in progress. Java’s memory management tends to gum up the works for Spark, so Project Tungsten plans to increase its memory efficiency by sidestepping the JVM’s memory and garbage collection subsystems.

Spark APIs

Spark is written mainly in Scala, so the primary APIs for Spark have long been for Scala as well. But three other, far more widely used languages are also supported: Java (upon which Spark also relies), Python, and R.

By and large you’re best off picking the language you’re most comfortable with, since odds are the features you need will be directly supported in the language. One exception: Support for machine learning in SparkR is less robust by comparison, with only a subset of algorithms currently available there. That’s bound to change over time.

Spark SQL

Never underestimate the power or convenience of being able to run a SQL query against a batch of data. Spark SQL provides a common mechanism for performing SQL queries (and requesting columnar DataFrames) on data provided by Spark, including queries piped through ODBC/JDBC connectors. You don’t even need a formal data source. Support for querying flat files in a supported format, à la Apache Drill, was added in Spark 1.6.
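As a rough sketch of both styles (the file path and column names here are hypothetical), a Spark 1.6-era job might register a DataFrame as a temporary table and query it, or query a Parquet file directly with no registration at all:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch of Spark SQL over a Parquet file; the path and columns are made up.
object SqlSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sql-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Register a DataFrame as a temporary table, then query it with SQL.
    val events = sqlContext.read.parquet("/data/events.parquet")
    events.registerTempTable("events")
    sqlContext.sql("SELECT userId, COUNT(*) AS hits FROM events GROUP BY userId").show()

    // Spark 1.6+: query a supported flat file directly, no registration needed.
    sqlContext.sql("SELECT * FROM parquet.`/data/events.parquet` LIMIT 10").show()

    sc.stop()
  }
}
```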

Spark SQL isn’t really for updating data, since that’s orthogonal to the whole point of Spark. It is possible to write resulting data back to a new Spark data source (say, a new Parquet table), but UPDATE queries aren’t supported. Don’t expect features of that ilk anytime soon; most of the improvements in mind for Spark SQL are for increasing its performance since it’s become the underpinning for Spark Streaming as well.

Spark Streaming

Spark’s design makes it possible to support many processing methods, including stream processing -- hence, Spark Streaming. The conventional wisdom about Spark Streaming is that it's still rough around the edges, best used when you don't need split-second latencies and when you aren't already invested in another stream-processing solution -- say, Apache Storm.
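For a sense of what that looks like in practice, here's a bare-bones DStream sketch -- a word count over a socket stream, assuming a text source on localhost:9999; the Seconds(5) batch interval is the micro-batch granularity behind the latency trade-off.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch of a Spark Streaming job: word counts over a socket text stream.
object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))  // 5-second micro-batches

    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()  // print each batch's counts to the driver log

    ssc.start()
    ssc.awaitTermination()
  }
}
```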

But Storm has been losing ground; longtime Storm user Twitter has since moved to its own Heron project. What’s more, Spark 2.0 promises a new “structured streaming” model that allows interactive Spark SQL queries of live data, including queries that use Spark’s machine learning libraries. Whether it will be performant enough to beat the competition remains to be seen, but it’s worth taking seriously.

MLlib (Machine Learning)

Machine learning technology has a reputation for being both miraculous and difficult. MLlib lets you run a number of common machine learning algorithms against data in Spark, making those kinds of analyses a good deal easier and more accessible to Spark users.
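As an illustrative sketch, here's MLlib's k-means clustering run over a tiny in-memory dataset (the points are made up; a real job would build an RDD of vectors from an actual data source):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Sketch of MLlib k-means over a handful of hand-written 2-D points.
object MLlibSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mllib-sketch").setMaster("local[*]"))

    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.2),
      Vectors.dense(9.0, 9.1), Vectors.dense(9.2, 8.9)
    ))

    val model = KMeans.train(points, 2, 20)  // k = 2 clusters, 20 iterations max
    model.clusterCenters.foreach(println)

    sc.stop()
  }
}
```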

The list of algorithms available in MLlib is broad and expanding with each revision of the framework. That said, some types of algorithms aren’t available -- anything involving deep learning, to name one. Third parties are leveraging Spark’s popularity to fill in that gap; for instance, Yahoo can perform deep learning with CaffeOnSpark, which leverages the Caffe deep-learning system through Spark.

GraphX (Graph Computation)

Mapping relationships between thousands or millions of entities typically involves a graph, a mathematical construct that describes how those entities interrelate. Spark’s GraphX API lets you perform graph operations on data using Spark’s methodologies, so the heavy lifting of constructing and transforming such graphs is offloaded to Spark. GraphX also includes several common algorithms for processing the data, such as PageRank or label propagation.
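Here's a minimal sketch of that workflow: build a graph from RDDs of vertices and edges, then let GraphX's built-in PageRank do the heavy lifting (the users and "follows" edges are invented for illustration).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// Sketch of GraphX: a tiny "follows" graph plus PageRank over it.
object GraphXSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("graphx-sketch").setMaster("local[*]"))

    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")
    ))

    val graph = Graph(vertices, edges)
    val ranks = graph.pageRank(0.0001).vertices  // tolerance-based PageRank

    // Join ranks back to user names and print them on the driver.
    ranks.join(vertices).collect().foreach { case (_, (rank, name)) =>
      println(f"$name%-6s $rank%.3f")
    }
    sc.stop()
  }
}
```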

One major limitation to GraphX as it currently stands: It’s best suited to graphs that are static. Processing a graph where new vertices are added severely impacts performance. Also, if you’re already using a full-blown graph database solution, GraphX isn’t likely to replace it -- yet.

SparkR (R on Spark)

The R language provides an environment for statistical analysis, numerical computing, and machine learning work. Spark added support for R in June 2015 to match its existing support for Python and Scala.

Aside from having one more language available to prospective Spark developers, SparkR allows R programmers to do many things they couldn’t previously do, like access data sets larger than a single machine’s memory or easily run analyses in multiple threads or on multiple machines at once.

SparkR also allows R programmers to make use of the MLlib machine learning module in Spark to create generalized linear models. Unfortunately, not all MLlib features are supported in SparkR yet, although the R support gap is being closed with each successive revision of Spark.