As enterprises deploy Apache Hadoop to store their data and let their users interact with it in various ways, they often run into a glaring problem: Hadoop was designed for the single purpose of Web-scale data processing. Enterprises of all sorts increasingly want to store all incoming data in Hadoop, creating a sort of data lake that their users can then leverage for uses ranging from batch processing to analyzing data streams as they arrive.
Case in point: running SQL on Hadoop. Business analysts have been using SQL as the query language to perform ad-hoc queries against data warehouses for years. If you're creating a data lake using Hadoop, you've got to be able to query that data using SQL.
"But by building SQL access on top of Hadoop, it just highlights the challenge of Hadoop being a single application system," writes Arun Murthy, founder and architect at Hortonworks and former architect of the Yahoo Hadoop Map-Reduce Development Team. "For when I run a SQL query on that data, it could consume all the resources of the cluster and cause performance issues for the other applications and jobs running in the cluster—not a good outcome to say the least."
The answer to that problem is YARN (Yet Another Resource Negotiator), the foundation of the recently released Hadoop 2. Apache Hadoop YARN serves as the Hadoop operating system, taking what was a single-use data platform for batch processing and evolving it into a multi-use platform that enables batch, interactive, online and stream processing.
YARN acts as the primary resource manager and mediator of access to data stored in the Hadoop Distributed File System (HDFS), giving enterprises the capability to store data in a single place and then interact with it in multiple ways, simultaneously, with consistent levels of service.
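To make the resource-sharing idea concrete, here is a minimal toy sketch in Python. It is not YARN's scheduler or API: the `Request` class, the `allocate` function, the application names and the percentage shares are all hypothetical, chosen only to illustrate how guaranteeing each workload a slice of the cluster keeps one large SQL query from starving everything else.

```python
# Toy model only: this is NOT YARN's scheduler or API.
# All names and numbers below are hypothetical illustrations.
from dataclasses import dataclass


@dataclass
class Request:
    app: str        # hypothetical application name
    wanted: int     # containers the application asks for
    share_pct: int  # guaranteed percentage of the cluster (assumed setting)


def allocate(total, requests):
    """Grant each app up to its guaranteed share, then hand out any spare capacity."""
    # First pass: nobody gets more than its guaranteed slice.
    grants = {r.app: min(r.wanted, total * r.share_pct // 100) for r in requests}
    spare = total - sum(grants.values())
    # Second pass: idle capacity goes to apps that still want more, in order.
    for r in requests:
        extra = min(spare, r.wanted - grants[r.app])
        grants[r.app] += extra
        spare -= extra
    return grants


busy = allocate(100, [
    Request("sql_query", wanted=100, share_pct=50),
    Request("batch_job", wanted=40, share_pct=30),
    Request("stream_job", wanted=20, share_pct=20),
])
quiet = allocate(100, [
    Request("sql_query", wanted=100, share_pct=50),
    Request("batch_job", wanted=10, share_pct=30),
])
print(busy)
print(quiet)
```

In the busy case the SQL query asks for the whole cluster but is held to its share while the other jobs keep theirs; in the quiet case it is free to borrow the capacity nobody else is using.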
Hortonworks, provider of the Hortonworks Data Platform (HDP), one of the most popular distributions of Hadoop, was quick to take up the YARN banner today with the announcement of the general availability of HDP 2.0.
HDP 2.0 is the first commercial distribution built on Hadoop 2, delivering the YARN-based architecture and new features from Phase 2 of the Stinger Initiative. The Stinger Initiative is a community-based effort that aims to enhance the speed, scale and breadth of SQL semantics supported by Apache Hive.
"The YARN-based architecture of HDP 2.0 delivers on our mission to enable the modern data architecture by providing one enterprise Hadoop that deeply integrates with existing, and future, data center technologies," says Shaun Connolly, vice president of corporate strategy at Hortonworks.
"In our benchmarking across some of the customers we've been working with, classic MapReduce jobs will just port over from the 1.0 line to the 2.0 line," Connolly adds. "You get twice the performance and you can run twice the jobs. You get a lot more headroom in the cluster."
Meanwhile, the addition of Hive 0.12 (the culmination of Phase 2 of the Stinger Initiative) delivers large performance gains that bring query times in line with "human interactive response time rather than batch response time."
Connolly says queries that previously took 1,400 seconds can now return in less than 10 seconds. Phase 3, targeted for the first quarter of 2014, is expected to improve those response times even more by allowing interim processing to happen in memory.
HDP 2.0 is available for download now. Connolly says HDP 2.0 for Windows will be available next month.
Thor Olavsrud covers IT Security, Big Data, Open Source, Microsoft Tools and Servers for CIO.com.