By now, most CIOs are aware of Big Data and its promise. But it is an inescapable fact that creating, maintaining and configuring Hadoop clusters is challenging, costly and time-consuming. Doing so with high availability has been next to impossible. Now VMware hopes to change all that by virtualizing the Hadoop cluster and making it ready for the cloud.
"Hadoop is a Big Data processing de facto standard," says Fausto Ibarra, senior director of product management, Cloud Application Platform, at VMware. "One of the biggest challenges in the adoption of Hadoop is the difficulty in deploying Hadoop and the cost associated with that. What we're basically doing is dramatically simplifying what it takes to deploy, configure and manage Hadoop clusters."
Open Source Serengeti Virtualizes Hadoop
VMware today took the wraps off a new open source project dubbed Serengeti, a "one-click" toolkit for deploying highly available Hadoop clusters (and common Hadoop components like Apache Pig and Apache Hive) on VMware's vSphere platform. VMware is leading the Serengeti project in collaboration with key Hadoop distribution vendors like Cloudera, Greenplum, Hortonworks, IBM and MapR.
Currently, Hadoop is primarily deployed on physical infrastructure. Such deployments can take days, weeks or even months depending on the scale, as IT obtains the necessary hardware, installs the distribution on the nodes and then configures the cluster and all the Hadoop components. And if the cluster is incorrectly sized for your needs, resizing it can involve doing much of that work over again.
"With Serengeti you can deploy a Hadoop cluster in as little as 10 minutes without having to learn anything new," Ibarra says. "You have your choice of Hadoop distribution, and you will be able to reuse your existing virtual infrastructure running on vSphere; all while using the same skills and operations requirements as other things on vSphere."
"Hadoop must become friendly with the technologies and practices of enterprise IT if it is to become a first-class citizen within enterprise IT infrastructure," says Tony Baer, principal analyst at research firm OVUM. "The resource-intensive nature of large Big Data clusters make virtualization an important piece that Hadoop must accommodate. VMware's involvement with the Apache Hadoop project and its new Serengeti Apache project are critical moves that could provide enterprises the flexibility that they will need when it comes to prototyping and deploying Hadoop."
Making Hadoop Virtualization Aware
In addition to Serengeti, Ibarra says VMware is working with the Apache Hadoop community to contribute changes to the Hadoop Distributed File System (HDFS) and Hadoop MapReduce projects to make them "virtualization aware." These changes will allow data and compute jobs to be optimally distributed across a virtual infrastructure, giving enterprises the ability to achieve more elastic, secure and highly available Hadoop clusters.
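To make the idea of "virtualization aware" placement concrete, here is a minimal sketch of one thing it could mean: choosing where to put copies of a data block so that no two copies land on virtual machines sharing the same physical host, since a single host failure would otherwise take out several copies at once. This is an illustration only, not VMware's or Apache Hadoop's actual implementation; the article does not describe the contributed changes in detail. The topology map, the VM and host names, and the function `place_replicas` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical topology: each VM (Hadoop node) mapped to the physical host it runs on.
VM_TO_HOST = {
    "vm-a1": "host-a",
    "vm-a2": "host-a",
    "vm-b1": "host-b",
    "vm-b2": "host-b",
    "vm-c1": "host-c",
}


def place_replicas(vm_to_host, replicas=3):
    """Pick VMs for block replicas so that no two replicas share a physical host.

    Raises ValueError if there are fewer physical hosts than requested replicas.
    """
    # Group VMs by the physical host they run on (sorted for determinism).
    vms_by_host = defaultdict(list)
    for vm, host in sorted(vm_to_host.items()):
        vms_by_host[host].append(vm)

    if len(vms_by_host) < replicas:
        raise ValueError("not enough physical hosts for the requested replica count")

    # Take the first VM from each of the first `replicas` hosts.
    return [vms_by_host[host][0] for host in sorted(vms_by_host)[:replicas]]


if __name__ == "__main__":
    print(place_replicas(VM_TO_HOST))
```

A placement policy that saw only the five VMs, without the host mapping, could put two replicas on `vm-a1` and `vm-a2` and lose both if `host-a` failed. Exposing the extra layer of topology is what lets the scheduler avoid that.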
VMware is also making changes to Spring for Apache Hadoop, the open source project it launched in February. Built on the Spring Java application framework, Spring for Apache Hadoop is intended to make it easy for enterprise developers to build distributed processing solutions with Hadoop. Ibarra says the updates will give Spring developers the ability to build applications that integrate with the HBase database, the Cascading library and Hadoop security.
"Hadoop is now ready for prime time with these updates," Ibarra says. "Provisioning a Hadoop cluster is going to be as simple as provisioning a new database or server."
Thor Olavsrud covers IT Security, Big Data, Open Source, Microsoft Tools and Servers for CIO.com.