When you have a data warehouse measured in petabytes, optimizing query performance is essential.
That is what led Facebook to create Presto, a distributed SQL query engine optimized for running ad-hoc interactive analytics queries against data sources ranging in size from gigabytes to petabytes. In late 2013, Facebook released Presto to the open source community under the Apache 2.0 license.
Presto allows you to query data where it lives, whether it's in Hive, Cassandra, Kafka, relational databases or even a proprietary data store. A single Presto query can combine data from multiple sources and provide responses in times ranging from sub-second to minutes.
Presto supports standard ANSI SQL, including complex queries, aggregations, joins and window functions. The engine was designed with a simple storage abstraction that makes it easy to provide SQL query capability against HDFS, other well-known data stores like HBase and even custom systems like the Facebook News Feed backend. Connectors provide interfaces for fetching metadata, getting data locations and accessing the data itself.
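Presto's actual connector SPI is a set of Java interfaces, but the three responsibilities described above can be sketched in a few lines of Python. Everything here — the class, method names and the in-memory store — is a hypothetical illustration, not Presto's real API:

```python
from dataclasses import dataclass
from typing import Iterator, List

@dataclass
class Split:
    """One addressable chunk of a table (e.g. a file block on one host)."""
    table: str
    host: str
    start: int
    length: int

class ExampleConnector:
    """Hypothetical connector over an in-memory store (illustration only)."""

    def __init__(self, tables):
        self._tables = tables  # {table_name: list of row dicts}

    def list_columns(self, table: str) -> List[str]:
        # 1. Fetch metadata: what columns does this table have?
        return sorted(self._tables[table][0].keys())

    def get_splits(self, table: str) -> List[Split]:
        # 2. Get data locations: where can the rows be read from?
        rows = self._tables[table]
        return [Split(table, "localhost", 0, len(rows))]

    def read_split(self, split: Split) -> Iterator[dict]:
        # 3. Access the data itself.
        rows = self._tables[split.table]
        yield from rows[split.start:split.start + split.length]

conn = ExampleConnector({"users": [{"id": 1, "name": "ada"}]})
print(conn.list_columns("users"))  # ['id', 'name']
```

Because the engine only talks to these three entry points, the same query machinery works whether the splits point at HDFS blocks, Cassandra token ranges or a custom backend.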
Why Netflix moved to Presto
Presto was rapidly adopted by companies like Netflix, Airbnb and Dropbox. Netflix, for instance, uses Presto to allow users from different organizations to query diverse data sets across expansive data ranges in its 10 petabyte data warehouse on Amazon S3. Facebook uses it internally to query its 300+ petabyte data warehouse.
Startup Argyle Data, which specializes in fraud detection and fraud analytics, particularly in the mobile networks space, is another example. Argyle uses deep packet inspection and machine learning to monitor live traffic from its customers, on the order of tens of millions of packets per second, says Arshak Navruzyan, vice president of product management at Argyle. Apache Accumulo, an open source distributed key/value store created by the NSA, is used to store the raw data, while Presto is used to query it.
"The level of SQL functionality that's implemented in Presto I think is unprecedented in the Hadoop world," Navruzyan says. "They decoupled the distributed SQL query engine from the underlying data store. This was a really great design decision."
He notes that with a small design team, Argyle was able to integrate Presto with Accumulo in a matter of weeks.
"That says something about the openness and modularity of this architecture," he says.
But while Presto queries are fast, Facebook knew it could get more performance out of the engine.
"We are always pushing the envelope in terms of scale and performance," writes Dain Sundstrom, a Facebook software engineer and a co-creator of Presto. "We have a large number of internal users at Facebook who use Presto on a continuous basis for data analysis. Improving query performance directly improves their productivity, so we thought through ways to make Presto even faster. We ended up focusing on a few elements that could help deliver optimal performance in the Presto query engine."
Those few elements were columnar reads, predicate pushdown and lazy reads.
Presto is a columnar query engine. For optimal performance, the reader should provide columns directly to Presto. But Presto's original Hive-based Optimized Row Columnar (ORC) file format reader provides data in row form, which Presto has to reorganize into columns.
Additionally, in ORC, the minimum and maximum value of each column is recorded per file, per stripe (about 1 million rows) and every 10,000 rows. For optimal performance, the reader should skip segments that can't match the query predicate. When that doesn't work — for instance, if the data has a large number of distinct values and is well shuffled — the reader should employ lazy reads, in which it inspects the columns needed to evaluate the query filter and only reads other columns for segments that match the filter.
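Both techniques can be sketched together. The following is an illustrative Python model, not the Presto reader itself; the segment layout, `read_ts`/`read_payload` callables and column names are all invented for the example:

```python
def make_segment(base, size=100):
    """Build a fake segment with min/max stats and deferred column reads."""
    ts = list(range(base, base + size))
    return {
        "stats": {"ts": (ts[0], ts[-1])},            # per-segment min/max
        "read_ts": lambda: ts,                        # filter column
        "read_payload": lambda: [f"row{t}" for t in ts],  # other column
    }

def scan(segments, lo, hi):
    """Return payload values whose 'ts' falls in [lo, hi]."""
    out = []
    for seg in segments:
        smin, smax = seg["stats"]["ts"]
        # Predicate pushdown: the min/max statistics prove no row in this
        # segment can match, so skip it without reading any data.
        if smax < lo or smin > hi:
            continue
        # Lazy read, step 1: read only the column the filter needs.
        matches = [i for i, t in enumerate(seg["read_ts"]()) if lo <= t <= hi]
        if not matches:
            continue
        # Lazy read, step 2: read the remaining column only for
        # segments that actually matched the filter.
        payload = seg["read_payload"]()
        out.extend(payload[i] for i in matches)
    return out

segments = [make_segment(b) for b in (0, 100, 200)]
print(scan(segments, 150, 152))  # ['row150', 'row151', 'row152']
```

In this run the first and third segments are eliminated by their statistics alone, and the payload column is only materialized for the one segment that survives the filter — the same effect that produces the 30x and 4x speedups reported below.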
Building from the ground up
To address these needs, Facebook initially worked with the vectorized ORC reader introduced in Hive 13, which provides columns instead of rows and can filter segments. Ultimately, though, Facebook engineers decided they needed to write a new ORC reader for Presto from the ground up.
"The decision to write a new ORC reader for Presto was not an easy one," Sundstrom says. "We spent a lot of time working with the Hive 13 vectorized reader and saw very good performance, but the code was not very mature. The vectorized reader did not support structs, maps or lists, and neither the row-oriented nor the column-oriented readers supported lazy reads. Additionally, at Facebook we use a fork of ORC named DWRF. The DWRF reader didn't support columnar reads or predicate pushdown, but it did support lazy reads. To get the speed we were looking for, we needed all three features, and by writing a new reader, we were able to create interfaces that work seamlessly with Presto."
The new Presto ORC reader supports both ORC and DWRF, exposes a columnar interface and supports predicate pushdown and lazy reads.
Sundstrom says that with terabyte-scale ZLIB compressed tables, the new Presto ORC reader achieved:
- 2-4x wall time and CPU time speedup over the old Hive-based ORC reader
- 2-4x wall time and CPU time speedup over the RCFile reader
- 1.3-5.8x wall time speedup (and comparable CPU time) over Impala Parquet
- A 4x+ speedup when lazy reads are utilized
- A 30x+ speedup when predicate pushdown is utilized