Facebook Releases Query Engine to Open Source Community
Facebook has released Presto -- a distributed SQL query engine optimized for running ad-hoc interactive queries on data sources up to petabytes in size -- to the open source community under the Apache license. The company says Presto is 10x better than Hive in CPU efficiency and latency for most of its queries.
Wed, November 06, 2013
CIO — Do you have a data warehouse that stores more than 300 petabytes of data and struggle with the latency of queries? Well, few companies have data at Facebook's scale, but the performance of queries against your data warehouse is often still a serious productivity issue. Facebook has been developing a solution to that problem and today it offered that answer up to the open source community. It calls the answer Presto.
Today, Facebook released Presto to the open source community under the Apache 2.0 license.
"Facebook's warehouse data is stored in a few large Hadoop/HDFS-based clusters," writes Martin Traverso, software engineer at Facebook, in a blog post Wednesday.
"Hadoop MapReduce and Hive are designed for large-scale, reliable computation, and are optimized for overall system throughput. But as our warehouse grew to petabyte scale and our needs evolved, it became clear that we needed an interactive system optimized for low query latency."
The Magic of Presto
Presto is a distributed SQL query engine -- now an open source distributed SQL query engine -- optimized for running ad-hoc interactive analytic queries against data sources ranging in size from gigabytes to petabytes. It is designed to allow organizations to query data where it lives. A single Presto query can combine data from multiple sources and provide responses in times ranging from sub-second to minutes.
Presto supports standard ANSI SQL, including complex queries, aggregations, joins and window functions. The engine was designed with a simple storage abstraction that, Traverso says, makes it easy to provide SQL query capability against HDFS, other well-known data stores like HBase and even custom systems like the Facebook News Feed backend. Storage plugins, which Facebook calls connectors, provide interfaces for fetching metadata, getting data locations and accessing the data itself.
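To make the connector idea concrete, here is a minimal Python sketch of a storage plugin with the three responsibilities Traverso describes: fetching metadata, getting data locations and accessing the data itself. It is not Presto's actual connector API. The names Connector, Split, InMemoryConnector and scan are hypothetical, chosen only for illustration, and the in-memory store stands in for a real system such as HDFS or HBase.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple


@dataclass(frozen=True)
class Split:
    """A chunk of a table and the host that holds it (hypothetical type)."""
    table: str
    host: str
    rows: Tuple[tuple, ...]


class Connector(ABC):
    """Hypothetical storage plugin; an assumption, not Presto's real API."""

    @abstractmethod
    def get_columns(self, table: str) -> List[str]:
        """Fetch metadata: the column names of a table."""

    @abstractmethod
    def get_splits(self, table: str) -> List[Split]:
        """Get data locations: where each chunk of the table lives."""

    @abstractmethod
    def read(self, split: Split) -> Iterator[tuple]:
        """Access the data itself: yield the rows of one chunk."""


class InMemoryConnector(Connector):
    """Toy connector backed by a dict, standing in for a real data store."""

    def __init__(self, tables: Dict[str, Tuple[List[str], List[Split]]]):
        self._tables = tables

    def get_columns(self, table: str) -> List[str]:
        return self._tables[table][0]

    def get_splits(self, table: str) -> List[Split]:
        return self._tables[table][1]

    def read(self, split: Split) -> Iterator[tuple]:
        return iter(split.rows)


def scan(connector: Connector, table: str) -> List[dict]:
    """Read every split of a table and label each row with its column names."""
    columns = connector.get_columns(table)
    return [
        dict(zip(columns, row))
        for split in connector.get_splits(table)
        for row in connector.read(split)
    ]
```

Because the scan function touches only the three interface methods, the same code would work against any store that implements them, which is the property the storage abstraction is meant to provide.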
"Presto is 10x better than Hive/MapReduce in terms of CPU efficiency and latency for most queries at Facebook," Traverso says.
"Presto is amazing," says Chris Gutierrez, data scientist at Airbnb, which is among the small number of external companies with which Facebook has already shared the Presto code and binaries. "A lead engineer got it into production in just a few days. It's an order of magnitude faster than Hive in most of our use cases. It reads directly from HDFS, so unlike Redshift, there isn't a lot of ETL before you can use it. It just works."
"We're really excited about Presto," adds Fred Wulff, a software engineer at Dropbox, which has also been testing the engine. "We're planning on using it to quickly gain insight about the different ways our users use Dropbox, as well as diagnosing problems they encounter along the way. In our tests so far it's been rock solid and extremely fast when applied to some of our most important ad-hoc use cases."