by Thor Olavsrud

Databricks lays production data pipelines

Apr 13, 2017
Analytics, Cloud Computing, Open Source

The new Databricks for Data Engineering edition of the Apache Spark-based cloud platform is optimized for combining SQL, structured streaming, ETL and machine learning workloads running on Spark.

Aiming to provide data engineers with new and better tools for creating production data pipelines, Databricks yesterday released Databricks for Data Engineering, a new version of its Apache Spark-based cloud platform optimized specifically for data engineering workloads.

Databricks, founded by the creators of Apache Spark, already provides a version of the cloud platform geared toward supporting data science workloads. But Databricks CEO and Co-founder Ali Ghodsi says the overwhelming majority of the company’s nearly 500 enterprise customers and 50,000 community edition users are seeking to combine SQL, structured streaming, ETL and machine learning workloads running on Spark to deploy data pipelines into production.

Cleaning fuzzy data

“What they really are doing is taking data that is maybe skewed, fuzzy, maybe has errors in it, and they’re using Spark to create a pipeline that cleans the data and puts it in structured form,” Ghodsi says. “That’s really the main use case that we saw. They’re using the interactive APIs to explore their data sets, but once they explore it, they’re turning it into production data pipelines where there’s no human in the loop.”

Ghodsi notes that building these pipelines with Databricks for Data Engineering is much more cost-effective than with the existing Databricks offering, representing 50 percent to 75 percent cost savings.

Features of the new Databricks for Data Engineering offering include the following:

  • Performance optimization. Databricks I/O (DBIO) technology provides a tuned and optimized version of Spark for a wide variety of instance types, in addition to an optimized AWS S3 access layer. Databricks says DBIO accelerates data exploration by up to 10x.
  • Cost management. Cluster management capabilities, such as auto-scaling and AWS Spot instances, reduce operational costs by eliminating the time-consuming work of building, configuring and maintaining complex Spark infrastructure. “It automatically determines the best number of machines to compute your workload,” Ghodsi says. “We’ve seen a lot of people have a lot of machines on all the time. They have a hard time figuring out how many machines they should be using for their workloads.”
  • Optimized integration. The platform provides a set of REST APIs to programmatically launch clusters and jobs and integrate tools or services ranging from Amazon Redshift and Amazon Kinesis to machine learning frameworks like Google’s TensorFlow. An integrated data sources catalog makes the data sources immediately available to Databricks users without duplicating data ingest work.
  • Enterprise security. Databricks for Data Engineering includes turnkey security features: SOC 2 Type 1 certification and HIPAA compliance; end-to-end data encryption; detailed logs accessible in AWS S3 for debugging; and IT admin capabilities such as single sign-on with SAML 2.0 support and role-based access controls for clusters, jobs and notebooks.
  • Collaboration with data science. The platform is integrated with the data science workspaces in Databricks, enabling a seamless transition between data engineering and interactive data science workloads.
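
To make the cluster-management and REST API items in the list above more concrete, here is a minimal Python sketch of what a programmatic request for an auto-scaling cluster on Spot instances might look like. The article does not describe the API itself; the endpoint path and field names below are assumptions based on Databricks' public Clusters API documentation, and the host, token, cluster name and placeholder values are hypothetical. The function only builds the request and does not send it.

```python
import json
import urllib.request


def build_cluster_request(host, token, min_workers, max_workers):
    """Build (but do not send) a request for an auto-scaling cluster.

    Assumption: the path and field names follow Databricks' public
    Clusters API; none of them come from the article itself.
    """
    payload = {
        "cluster_name": "nightly-etl",          # hypothetical name
        "spark_version": "<spark-version>",     # placeholder, fill in for your workspace
        "node_type_id": "<node-type>",          # placeholder, fill in for your workspace
        # Let the platform pick a worker count within these bounds.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        # Prefer AWS Spot instances, falling back to on-demand if unavailable.
        "aws_attributes": {"availability": "SPOT_WITH_FALLBACK"},
    }
    return urllib.request.Request(
        url=f"{host}/api/2.0/clusters/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical workspace URL and token, for illustration only.
request = build_cluster_request(
    "https://example.cloud.databricks.com", "MY_TOKEN", 2, 8
)
```

Passing the resulting object to `urllib.request.urlopen` would submit it. The point of the sketch is the `autoscale` range, which stands in for the fixed, always-on machine count that Ghodsi says customers struggle to choose.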

That last feature is extremely important, Ghodsi says.

“It’s actually really hard to transition between interactive computations and production pipelines,” he says. “I think a lot of people have this mental model that there are two different things you can do: either you’re doing interactive analysis or you’re building data pipelines. That’s not how developers work. While they’re developing a data pipeline, they have to explore the data, debug and test to make sure the data pipeline is actually working. During this process, they need interactive analysis.”

Moving among modes

And while you want your data pipelines to run without humans in the loop, if you do run into problems, you need to be able to seamlessly enter an interactive mode to further develop the pipeline.

“We want to make sure that you can easily and seamlessly move between these two modes,” Ghodsi says.

“Databricks’ latest developments for data engineering make it exceedingly easy to get started with Spark, providing a platform that is apt as both an integrated development environment and deployment pipeline,” Brett Bevers, engineering manager, Data Engineering, at Dollar Shave Club, added in a statement Wednesday. “On our first day using Databricks, we were equipped to grapple with an entirely new class of data challenges.”

The new offering is immediately available. It’s priced based on data engineering workloads such as ETL and automated jobs ($0.20 per Databricks Unit plus the cost of AWS).
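
As a rough illustration of that pricing model, the sketch below adds the Databricks charge to the underlying AWS bill. Only the $0.20 rate comes from the article; the function name and the sample figures are hypothetical, and the number of Databricks Units a real job consumes depends on instance types and runtime that the article does not specify.

```python
DBU_RATE_USD = 0.20  # per Databricks Unit, as quoted in the article


def estimate_job_cost(dbus_consumed, aws_cost_usd):
    """Estimate total cost: Databricks Units at the quoted rate plus AWS charges."""
    return dbus_consumed * DBU_RATE_USD + aws_cost_usd


# Hypothetical job: 100 Databricks Units and $15.00 of AWS instance charges.
total = estimate_job_cost(100, 15.00)
print(f"Estimated cost: ${total:.2f}")
```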