NEW YORK--Microsoft today used the first day of O'Reilly Strata Conference + Hadoop World in New York City to announce that its Windows Azure HDInsight Service is now generally available after a year in preview.
The HDInsight Service, designed in partnership with Hadoop specialist Hortonworks, makes standard Apache Hadoop available as a service in Microsoft's Azure cloud, allowing you to deploy Hadoop clusters in minutes and shut them down just as easily.
Integration with the Microsoft data platform means that you can access and analyze your data with PowerPivot, Power View and other Microsoft BI tools, like Microsoft SQL Server Analysis Services (SSAS).
"Hadoop is a cornerstone of big data," says Quentin Clark, corporate vice president, Microsoft Data Platform. "The need for the insights and results and transformations from big data is really there. There are companies talking to us about how they don't feel they can even be competitive without embracing the big data phenomenon."
The goal, Clark says, is to bring Hadoop together with the flexibility of cloud deployment and the security that enterprises require to help customers achieve the competitive edge they need.
DNA Sequencing with HDInsight Service
The use cases are many and varied. For instance, Virginia Polytechnic Institute and State University has been using the HDInsight Service to aid its life sciences research in DNA sequencing.
Leveraging a grant from the National Science Foundation, Virginia Tech computer scientists developed an on-demand, cloud computing model using Windows Azure HDInsight Service that helps locate undetected genes in a massive genome database.
"Of the estimated 2,000 DNA sequencers worldwide, they are generating 15 petabytes of genome data every year," says Wu Feng, professor of Computer Science at Virginia Tech. "Many life sciences institutions simply do not have access to the computational and storage resources required to work with data sets of this size. We're generating data faster than we can analyze it."
Feng and his team used the grant to develop two software artifacts: SeqInCloud, a cloud version of a popular genetic variant pipeline called the Genome Analysis Toolkit (GATK), and CloudFlow, a workflow management framework that uses both client and cloud resources.
SeqInCloud generalizes the GATK pipeline, allowing it to run in the cloud using HDInsight and Azure to maximize portability. Meanwhile, CloudFlow, installed on a researcher's PC, aids interactions with the Windows Azure HDInsight Service.
"It allows us to compose flexible MapReduce pipelines that simultaneously utilize both client and cloud resources for running the pipeline and automating data transfers," Feng explains. "This is where the HDInsight resource has been particularly useful."
Using HDInsight to Track and Analyze Social Media
Then there's data services company iTrend, which tracks and analyzes unstructured data generated by social media. It built its new data discovery platform on a hybrid cloud implementation running on Windows Azure that includes an Apache Hadoop cluster for processing raw data and a relational database to work with extracted information. It currently uses an on-premises Apache Hadoop cluster, but plans to migrate it to Windows Azure HDInsight Service.
The platform allows iTrend to provide dynamic reporting tools through a portal that customers can use to track campaigns, brands and individual products. Once they specify what they want to monitor, the tool automatically tracks, analyzes and summarizes potentially millions of conversations from multiple sources. It then provides a dashboard view of the data, and users can drill down for a more detailed view.
"One search term for a relatively obscure topic such as a rare medical condition might return 100,000 results, while a search for a popular celebrity might generate 100 million," says Michael Alatortsev, CEO of iTrend, explaining why a big data solution is necessary for iTrend's business.
"To work with high volumes of unstructured data, which is basically just text, we needed to be able to process it in parallel. Trying to do that with a relational database and all of the necessary infrastructure would be too costly."
"From the technology side, we can deploy faster on Windows Azure and add new modules quickly," Alatortsev says. "And from a business perspective, we're seeing tremendous opportunities even with the first release of our service. We have the tools available to offer features that no one else has."
Thor Olavsrud covers IT Security, Big Data, Open Source, Microsoft Tools and Servers for CIO.com.