In his keynote at Spark Summit 2014 in San Francisco today, Databricks CEO Ion Stoica unveiled Databricks Cloud, a cloud platform built around the Apache Spark open source processing engine for big data.
Spark, which got its version 1.0 release just one month ago, is a cluster computing framework designed to sit on top of the Hadoop Distributed File System (HDFS) in place of Hadoop MapReduce. With support for in-memory cluster computing, Spark can achieve performance up to 100x faster than Hadoop MapReduce in memory or 10x faster on disk.
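To see why keeping data in memory matters for workloads that pass over the same data repeatedly, consider the toy Python sketch below. It is not Spark's API: the `ToyDataset` class, its `cache()` and `scan()` methods and the `iterative_sum` helper are hypothetical names invented for illustration. It only counts how many times a job has to go back to the stored copy of the data, and says nothing about actual speedups.

```python
# Toy model (NOT Spark's API): count how often an iterative job re-reads
# its source when the working set is or is not kept in memory.


class ToyDataset:
    """Hypothetical stand-in for a dataset stored on disk."""

    def __init__(self, records):
        self._records = list(records)  # the "on-disk" copy
        self._cached = None            # in-memory copy, filled on first scan after cache()
        self._use_cache = False
        self.disk_reads = 0            # number of full scans of the "on-disk" copy

    def cache(self):
        """Request that the records be kept in memory after the next scan."""
        self._use_cache = True
        return self

    def scan(self):
        """Return all records, going back to "disk" only when necessary."""
        if self._cached is not None:
            return self._cached
        self.disk_reads += 1
        data = list(self._records)
        if self._use_cache:
            self._cached = data
        return data


def iterative_sum(dataset, passes):
    """Make several passes over the same data, as an iterative algorithm would."""
    total = 0
    for _ in range(passes):
        total += sum(dataset.scan())
    return total


uncached = ToyDataset(range(1000))
cached = ToyDataset(range(1000)).cache()

iterative_sum(uncached, passes=10)
iterative_sum(cached, passes=10)

print(uncached.disk_reads)  # 10: one full read per pass
print(cached.disk_reads)    # 1: read once, then served from memory
```

In this sketch the uncached job rereads its input on every pass, while the cached job reads it once and reuses the in-memory copy. That difference is the intuition behind the in-memory figure quoted above.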
Spark can be an excellent compute engine for data processing workflows, advanced analytics, stream processing and business intelligence/visual analytics. But Spark clusters can be difficult beasts, Stoica says. Databricks hopes to change all that with its hosted Databricks Cloud platform as a turnkey solution.
"Getting the full value out of their big data investments is still very difficult for organizations," Stoica says. "Clusters are difficult to set up and manage, and extracting value from your data requires you to integrate a hodgepodge of disparate tools, which are themselves hard to use. Our vision at Databricks is to dramatically simplify big data processing and free users to focus on turning data into value. Databricks Cloud delivers on this vision by combining the power of Spark with a zero-management hosted platform and an initial set of applications built around common workflows."
Databricks Cloud provides support for interactive queries (via Spark SQL), streaming data (Spark Streaming), machine learning (MLlib) and graph computation (GraphX) natively with a single API across the entire data pipeline. Stoica says that provisioning new Spark clusters is a snap: just specify the desired capacity of the cluster and the platform handles everything else, including provisioning servers on the fly, streamlining the import and caching of data, handling security, and patching and updating Spark.
The platform comes with three built-in applications:
- Notebooks. A rich interface for performing data discovery and exploration, Notebooks can plot results interactively, execute entire workflows as scripts and enable advanced collaboration features.
- Dashboards. Dashboards allows users to create and host dashboards by picking any outputs from previously created notebooks. Dashboards then assembles the outputs in a one-page dashboard with a WYSIWYG editor that can be published to a broader audience.
- Job Launcher. The Job Launcher application enables anyone to run arbitrary Apache Spark jobs and trigger their execution, simplifying the process of building data products.
"One of the common complaints we heard from enterprise users was that big data is not a single analysis; a true pipeline needs to combine data storage, ETL, data exploration, dashboards and reporting, advanced analytics and creation of data products," Stoica says. "Doing that with today's technology is incredibly difficult. We built Databricks Cloud to enable the creation of end-to-end pipelines out of the box while supporting a full spectrum of Spark applications for enhanced and additional functionality. It was designed to appeal to a whole new class of users who will adopt big data now that many of the complexities of using it have been eliminated."
Building a Vibrant Spark Big Data Ecosystem
Stoica notes that the built-in applications are only the beginning. Databricks Cloud is built on 100 percent open source Apache Spark, meaning that all current and future "Certified on Spark" applications will run on the platform out of the box — including the more than a dozen Spark applications that Databricks has certified since launching its application certification program in February of this year.
And, Stoica says, you can turn the equation around. Any Spark application developed on Databricks Cloud will work across any "Certified Spark Distribution," meaning users won't be locked in to the hosted platform. Databricks launched its distribution certification program last week and noted that five vendors had already completed the process: DataStax, Hortonworks, IBM, Oracle and Pivotal.
"We are really looking forward to turning Databricks Cloud into a vibrant ecosystem," Stoica says.
Databricks Cloud is currently in closed beta with several users and will open to a limited availability beta in August, Stoica says. He adds that the platform will follow a tiered pricing model based on usage. To start, the platform will be available only on Amazon Web Services (AWS), though Stoica says that expanding to additional cloud providers is on the road map.