Over the past several years, Pentaho Labs, the research arm of business analytics specialist Pentaho, has mapped the big data use cases that organizations were putting into production. The work is part of an effort to provide big data blueprints: a big data stack, if you will.
Recently, Pentaho Labs has pursued the same path with Apache Spark, and today it announced the native integration of Pentaho Data Integration (PDI) with Apache Spark, which will enable the orchestration of Spark jobs.
PDI is essentially a portable ‘data machine’ for ETL, which can be deployed as a stand-alone Pentaho cluster or inside a Hadoop cluster through MapReduce or YARN. Tuesday’s announcement adds Spark to the mix, enabling even faster big data ETL processing. ETL designers can design, test and tune ETL jobs in PDI using its graphical design environment and then run them at scale on Spark.
Apache Spark is a cluster computing framework designed to sit on top of the Hadoop Distributed File System (HDFS) in place of Hadoop MapReduce. With support for in-memory cluster computing, the Spark project claims performance up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Spark can be an excellent compute engine for data processing workflows, advanced analytics, stream processing and business intelligence/visual analytics. But Spark is still young — it only got its v1.0 release 12 months ago — and it’s still pretty tricky to work with.
For one thing, says Pentaho Cofounder and CTO James Dixon, the Spark use cases in production out in the wild are almost all data science use cases.
“That’s what it was created for — a single-user data science tool,” Dixon says. “It wasn’t designed for streaming, but there’s Spark Streaming. It wasn’t designed for SQL, but there’s Spark SQL.”
Memory management with Spark is particularly difficult, he says.
“As a user of Spark, you’re expected to know whether the amount of data you have will fit into memory,” he says. “There are four different memory modes and you have to choose the right one.”
It gets more complicated if you add multiple users. Then you need to understand the memory footprint of everyone who wants to use Spark concurrently.
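To make the sizing problem concrete, here is a minimal Python sketch of the kind of back-of-the-envelope check Dixon describes. It is illustrative only: the function names, the 60 percent usable-memory fraction, the 0.4 serialization ratio and the even split of memory among users are all assumptions, not Spark or Pentaho behavior. The returned strings mirror the names of Spark's RDD storage levels; the article does not say which four "memory modes" Dixon has in mind.

```python
# Hypothetical sizing helper. Thresholds and names are illustrative
# assumptions, not part of any Spark or Pentaho API.

def pick_storage_level(dataset_gb, memory_gb,
                       usable_fraction=0.6, serialized_ratio=0.4):
    """Suggest a Spark-style storage level name for one dataset.

    usable_fraction: assumed share of memory available for cached data.
    serialized_ratio: assumed size of serialized data relative to raw size.
    """
    usable_gb = memory_gb * usable_fraction
    if dataset_gb <= usable_gb:
        return "MEMORY_ONLY"        # fits in memory as plain objects
    if dataset_gb * serialized_ratio <= usable_gb:
        return "MEMORY_ONLY_SER"    # fits only in compact serialized form
    if usable_gb > 0:
        return "MEMORY_AND_DISK"    # keep what fits, spill the rest
    return "DISK_ONLY"              # no memory to spare


def plan_for_users(datasets_gb, cluster_memory_gb):
    """Split cluster memory evenly among users, then pick a level for each."""
    share_gb = cluster_memory_gb / len(datasets_gb)
    return {user: pick_storage_level(size_gb, share_gb)
            for user, size_gb in datasets_gb.items()}


if __name__ == "__main__":
    # One user with 50 GB on a 100 GB cluster.
    print(pick_storage_level(50, 100))
    # Two users with 50 GB each sharing the same cluster.
    print(plan_for_users({"alice": 50, "bob": 50}, 100))
```

Under these assumptions, a single user's 50 GB dataset fits comfortably in a 100 GB cluster, but the same dataset no longer fits uncompressed once a second user claims half the memory, which is the multi-user complication described above.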
That said, Spark SQL is orders of magnitude faster than Hive, Dixon says, and even has significant promise compared with Impala.
“There’s an enormous amount of promise,” he says. “I’m not skeptical of the technology, but I’m skeptical of a lot of the hype out there at the moment. There are people out there saying things about Spark that are very unrealistic.”
Dixon notes that for the past two years Pentaho Labs has been experimenting with possible Spark use cases based on its big data blueprints and sizing the enterprise market opportunity for Spark. In the Hadoop market, Pentaho has seen use cases coalesce around three broad categories in the past several years: data warehouse optimization; streamlining data sources into a ‘data refinery’; and blending operational data sources with big data sources to obtain a 360-degree view of customers.
“For the first five to seven years of Hadoop, we didn’t have these patterns,” he says. “Now throw Spark into the mix and we’re back at square one. [As an industry] we’re not really sure what this technology can be used for, what it should be used for.”
But that’s the reason Pentaho Labs exists, Dixon says. For now, Pentaho Data Integration for Apache Spark is available in Pentaho Labs. Pentaho plans to make it generally available in June 2015.