Two of the greatest achievements in the history of computer data management are:
- The theory and development of relational database management systems (RDBMSs)
- The realization that strong transactional consistency was not required for all data
Let me elaborate…
When the founders of Google were trying to figure out how to assemble information about every Web page and hyperlink on the Internet so they could develop analytics to gauge the relevance of search results to end users, they were intimately aware of how RDBMSs were used for building large data warehouses. They were also certain that the economics of scaling those architectures for their purposes was not going to be sustainable.
The Google engineers decided to move away from traditional data warehousing with RDBMSs and develop a new platform built on a “write once – read many” philosophy. When data changes, the original is deleted and a new copy is written; there is no requirement for fine-grained transactional consistency. Since that software didn’t exist, they invented it, laying the first foundations of the Hadoop platform for “big data”.
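The update-by-rewrite idea can be sketched in a few lines. This is an illustrative sketch only, not Hadoop or HDFS API code; the file layout and the `update_record` helper are assumptions made for the example.

```python
import os
import tempfile

def update_record(path, key, new_value, sep=","):
    """Change one record in a write-once / read-many style store.

    There is no in-place update: the file is read in full, the changed
    record is substituted, and a new copy atomically replaces the
    original, which is then discarded.
    """
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as out, open(path) as src:
        for line in src:
            k, _, v = line.rstrip("\n").partition(sep)
            if k == key:
                v = new_value          # substitute the record in the new copy
            out.write(f"{k}{sep}{v}\n")
    os.replace(tmp_path, path)         # atomic swap: old copy is gone
```

Trading in-place updates for whole-file rewrites is what lets such a system scale out cheaply: readers never see partial writes, so no fine-grained locking or transactional machinery is needed.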
Fast forward 10 years: the Hadoop ecosystem has grown into one of the world’s largest and most successful open source efforts, measured by the number of projects and contributors as well as the number of major releases. On the demand side, some market penetration estimates find that more than half of all enterprises have deployed Apache Hadoop, and another one-third plan to in the future. This is why I seriously think the realization that not all data needs to live in an RDBMS is the second most influential achievement in the history of digital data management.
To get an idea of the rate of innovation consider this high-level timeline of major releases:
- December 2011 – Hadoop 1.0 Released
- October 2013 – Hadoop 2.2 Released
- December 2017 – Hadoop 3.0 Released
The work of the hundreds of developers who contributed to Hadoop 3 has produced an array of new features and enhancements that improve agility, scalability, and availability while lowering total cost of ownership and time to business value. It also opens up many new use cases that can be addressed with a single investment and set of skills.
Support for the Hadoop ecosystem is strong and will continue to grow over time. If you are currently using Hadoop 2.X, it is time to start planning a path to Hadoop 3. If you are already working with Hadoop 3, the resources below will help you broaden your understanding of its power. By taking advantage of new features, including streaming data and IoT deployments, your organization can develop new products, improve customer service, decrease costs, increase efficiency, and gather more valuable insights that could improve the bottom line.
Ready to learn more?
Information Week recently published an overview of how the development of Hadoop 3 can affect the way enterprises manage and derive value from big data, titled Innovation and Opportunity: 3 Hadoop Trends That Will Affect Your Business. The work was sponsored by Intel and Dell EMC.
In the paper, the authors highlight:
- The benefits to application developers from new support for GPUs and Docker containers.
- The importance of Kafka Streams functionality for developing near-real-time, data-driven applications.
- Implementation of new use cases, such as edge analytics for IoT, enabled by decoupled compute and storage.
In addition to the white paper, Dell EMC and Intel hosted a webinar, Hadoop is Cool – Again! A 10-year-old Technology Gets a Fresh New Look. The discussion included the following questions:
- What problems do containers address for both data scientists and IT professionals?
- How can I provide access to GPU resources in a multi-user, multi-node environment with Hadoop 3?
- How can I dramatically reduce the data protection overhead typically incurred in Hadoop environments?
- What improvements have been added for managing high velocity data streams in the Hadoop ecosystem?
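On the data-protection question, the arithmetic behind the savings is easy to show. The sketch below compares classic HDFS 3x replication with Hadoop 3’s HDFS erasure coding, assuming the common Reed-Solomon RS(6,3) layout; the `storage_overhead` helper is my own illustration, not a Hadoop API.

```python
def storage_overhead(data_units, parity_units):
    """Raw units stored per unit of user data."""
    return (data_units + parity_units) / data_units

# Classic HDFS replication: one original block plus two extra copies,
# so every byte of user data consumes three bytes of raw storage.
replication = storage_overhead(1, 2)      # 3.0x raw storage

# Hadoop 3 erasure coding with RS(6,3): 6 data blocks protected by
# 3 parity blocks; any 6 of the 9 blocks can reconstruct the data.
erasure_coded = storage_overhead(6, 3)    # 1.5x raw storage
```

Halving the raw-storage multiplier while still tolerating the loss of multiple blocks is the core of the “dramatically reduce data protection overhead” claim.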
You may view the webinar on-demand here.