by Thor Olavsrud

MapR Aims to Take SQL-on-Hadoop to Next Level

News
Sep 16, 2014
Analytics, Big Data, Hadoop

Seeking to eliminate the need to manage schemas and perform time-consuming ETL tasks on incoming data before exploring it, MapR is adding the Apache Drill distributed ANSI SQL query engine to its Hadoop distribution.

Aiming to eliminate a number of onerous data engineering tasks, MapR today updated its distribution of Hadoop to include Apache Drill 0.5.

Drill is an open source, distributed ANSI SQL query engine for self-service data exploration. It is an open source version of Google’s Dremel system for interactively querying large datasets, the system that powers Google’s BigQuery service. The stated goal of the Apache Drill project is to scale to 10,000 servers or more while processing petabytes of data and trillions of records in seconds.

The Drill query engine provides the capability to do the following:

  • Explore data in its native format (including Parquet, JSON files and HBase tables) without intervention by a database administrator (DBA).
  • Analyze evolving and semi-structured/nested data from NoSQL data stores like MongoDB and online REST APIs.
  • Create queries that simultaneously combine different Hadoop data sources such as files, HBase tables and Hive tables.
  • Reuse existing SQL skill sets, BI tools and Apache Hive deployments.

“We’re excited about this because it really opens up a new era for SQL-on-Hadoop,” says Jack Norris, chief marketing officer at MapR. “The focus is on self-service data exploration on Hadoop that doesn’t require IT involvement.”

Because Drill provides the capability to run SQL queries directly on various formats, it can be used to explore live data as it arrives, without weeks spent preparing and managing schemas and setting up ETL tasks. In this way, it provides instant, self-service data exploration across multiple data sources.
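
To make the idea of querying data without a predefined schema concrete, here is a minimal Python sketch of schema-on-read over JSON records. It is not Drill's implementation or API. The sample records, the `flatten` helper and the `discover_and_select` function are all hypothetical names invented for illustration. The sketch only shows the general technique of discovering columns, including nested ones, at read time and projecting the requested ones.

```python
import json

# Hypothetical newline-delimited JSON records (an assumption for this sketch).
# Note that the records do not share the same fields and contain nested data.
RAW = """\
{"user": {"id": 1, "name": "ana"}, "clicks": 3}
{"user": {"id": 2}, "clicks": 5, "referrer": "search"}
"""


def flatten(record, prefix=""):
    """Flatten nested dicts into dotted column names, e.g. 'user.id'."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat


def discover_and_select(lines, columns):
    """Discover the column set while reading, then project the requested columns.

    Columns missing from a record come back as None instead of raising an error,
    so no schema has to be declared before the data is explored.
    """
    rows = [flatten(json.loads(line)) for line in lines if line.strip()]
    schema = sorted({col for row in rows for col in row})
    selected = [{col: row.get(col) for col in columns} for row in rows]
    return schema, selected


schema, result = discover_and_select(RAW.splitlines(), ["user.id", "referrer"])
print(schema)  # ['clicks', 'referrer', 'user.id', 'user.name']
print(result)  # [{'user.id': 1, 'referrer': None}, {'user.id': 2, 'referrer': 'search'}]
```

In this sketch the second record introduces a `referrer` field that the first lacks, and the projection still works, which is the kind of evolving, semi-structured data the article describes.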

“Organizations want to provide access to data stored in Hadoop and NoSQL databases to a broader set of users with existing SQL analysis skills,” says Matt Aslett, research director, data platforms and analytics, at 451 Research. “Apache Drill’s ability to provide access to data in Hadoop without the need for centralized schemas and also NoSQL datasets with complex data structures including nested and repeated fields differentiates it from traditional approaches to SQL-on-Hadoop.”

“Every other SQL-on-Hadoop solution, whether it’s Hive or Tez or what have you, relies on a fixed schema,” Norris adds. “Whether you’re talking about MapReduce, Hive or some other SQL-on-Hadoop solution, there’s this middleman required to do the modeling, the data transformations, the plumbing to support the analysis. Drill’s ability to discover the data without having to wait for that process to take place gives you speed and agility advantages.”

MapR is packaging Drill with MapR 4.0.1, also released today. The new version of its Hadoop distribution expands its real-time capabilities for use cases including operational applications, interactive queries and stream processing.

The new version includes multiple batch processing frameworks, including MapReduce 1.x and 2.x (YARN-based), as well as Spark (0.9 and 1.0.2). It also supports five SQL-on-Hadoop technologies: Hive (0.11, 0.12, 0.13), Drill (0.5), SparkSQL (1.0.2), Impala (1.3.1) and certified integration with HP Vertica. It adds support for the HBase (0.94.21, 0.98.4) and MapR-DB NoSQL technologies, as well as three machine learning and graph libraries in the form of Mahout (0.8, 0.9), MLlib (0.9, 1.0.2) and GraphX.

Thor Olavsrud covers IT Security, Big Data, Open Source, Microsoft Tools and Servers for CIO.com.