Calling Apache Spark "the most important new open source project in a decade that is being defined by data," IBM today announced that it will embed the compute engine into its analytics and commerce platforms and offer Spark as a service on IBM Bluemix.
As part of its new commitment to Spark, Big Blue also says it will assign more than 3,500 IBM researchers and developers to work on Spark-related projects at more than a dozen labs worldwide and will donate its IBM SystemML machine learning technology to the Spark open source ecosystem. It also pledged to educate more than one million data scientists and data engineers on Spark.
Adding Spark to Hadoop Distributed File System
Spark is a cluster computing framework designed to sit on top of Hadoop Distributed File System (HDFS) in place of Hadoop MapReduce. With support for in-memory cluster computing, Spark can achieve performance up to 100x faster than Hadoop MapReduce in memory or 10x faster on disk. Spark is a compute engine geared for data processing workflows, advanced analytics, stream processing and business intelligence/visual analytics.
That said, Spark is still young (it only got its v1.0 release about a year ago) and it has challenges. It was originally designed as a single-user data science tool, and memory management can be particularly tricky.
"As a user of Spark, you're expected to know whether the amount of data you have will fit into memory," says James Dixon, cofounder of data integration specialist Pentaho. For the past two years, the company's Pentaho Labs unit has been experimenting with possible Spark use cases based on its big data blueprints and sizing the enterprise market opportunity for Spark. "There are four different memory modes and you have to choose the right one."
It gets more complicated if you add multiple users. Then you need to understand the memory footprint of everyone who wants to use Spark concurrently.
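To make the sizing problem concrete, here is a minimal back-of-the-envelope sketch in plain Python. It does not call Spark and is not how Spark itself accounts for memory. The `Workload` class, the `expansion` factor and the `usable_fraction` parameter are all hypothetical names and illustrative values introduced for this example; real numbers would have to be measured on an actual cluster.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Workload:
    """One user's data set that they want held in memory (hypothetical model)."""
    user: str
    dataset_gb: float        # size of the data on disk, in GB
    expansion: float = 3.0   # ASSUMED in-memory blow-up factor, not a Spark default

    def footprint_gb(self) -> float:
        # Estimated memory needed once the data is loaded.
        return self.dataset_gb * self.expansion


def fits_in_memory(
    workloads: List[Workload],
    cluster_memory_gb: float,
    usable_fraction: float = 0.6,  # ASSUMED share of memory available for cached data
) -> Tuple[bool, float, float]:
    """Return (fits, needed_gb, budget_gb) for all workloads running concurrently."""
    budget = cluster_memory_gb * usable_fraction
    needed = sum(w.footprint_gb() for w in workloads)
    return needed <= budget, needed, budget


if __name__ == "__main__":
    users = [Workload("analyst_a", 10.0), Workload("analyst_b", 20.0)]
    for cluster_gb in (200.0, 100.0):
        ok, needed, budget = fits_in_memory(users, cluster_gb)
        print(f"cluster={cluster_gb:.0f} GB needed={needed:.0f} GB "
              f"budget={budget:.0f} GB fits={ok}")
```

The point of the sketch is only that the check has to sum every concurrent user's footprint, not just one job's, which is why the multi-user case is harder to reason about.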
Even so, Dixon says Spark has enormous promise — Pentaho is pushing forward with native integration of Spark for its Data Integration platform to support the orchestration of Spark jobs. IBM says that its new commitment to Spark will help the open source community rapidly accelerate access to advanced machine learning capabilities and drive speed-to-innovation in the development of smart business apps.
"IBM has been a decades-long leader in open source innovation," Beth Smith, general manager, Analytics Platform, IBM Analytics, said in a statement today. "We believe strongly in the power of open source as the basis to build value for clients, and are fully committed to Spark as a foundational technology platform for accelerating innovation and driving analytics across every business in a fundamental way. Our clients will benefit as we help them embrace Spark to advance their own data strategies to drive business transformation and competitive differentiation."
Lighting a fire under Spark ecosystem
IBM has committed to taking the following actions to accelerate open source innovation in the Spark ecosystem:
- IBM will build Spark into the core of its analytics and commerce platforms.
- IBM will open source its IBM SystemML machine learning technology and collaborate with Databricks to advance Spark's machine learning capabilities.
- IBM will offer Spark as a service on IBM Bluemix to make it possible for any app developer to load data, model it and derive the predictive artifact to use in their app.
- IBM will commit more than 3,500 researchers and developers to work on Spark-related projects at more than a dozen labs worldwide, and open a Spark Technology Center in San Francisco for the Data Science and Developer community to foster design-led innovation in intelligent applications.
- IBM will educate at least one million data scientists and data engineers on Spark through extensive partnerships with AMPLab, DataCamp, MetiStream, Galvanize and Big Data University MOOC.