Ease Big Data Hiring Pain With Cascading
Finding developers with the skills to create MapReduce jobs in Apache Hadoop is challenging, but you can ease that hiring pain with Cascading, an open source Java application framework for building enterprise Big Data applications on Hadoop.
Wed, June 06, 2012
One of the major challenges companies face when they set out to transform the data they store into actionable insight is finding developers able to create MapReduce jobs to query Hadoop-stored datasets. MapReduce is a complex framework and difficult to use.
"Folks that know MapReduce are a tough find and they're in high demand," says Brandon Mason, CTO of Upstream Software, a specialist in integrated marketing performance management. Upstream analyzes all the marketing data a retailer has (including Coremetrics or Omniture logs, keywords shoppers use, direct mail logs, email logs and so forth) to help retailers properly weight their marketing mix. "To do the secret sauce stuff, we really needed a platform to handle lots of different data sets. Sometimes it's very dirty."
Open Source Cascading Is Alternative to MapReduce
Enter Cascading, a stand-alone open source Java application framework designed as an alternative API to MapReduce. Cascading gives Java developers the ability to build Big Data applications on Hadoop using their existing skillset.
"I created Cascading in anger after having used MapReduce once in my life and vowing never to use it again," explains Chris Wensel, creator of Cascading.
Wensel authored Cascading as an open source project in 2007 and is now CEO of Concurrent, an enterprise Big Data application platform company that continues to drive development of Cascading as its primary commercial sponsor. Concurrent counts companies like Twitter and Etsy, as well as Upstream, among its clients. Twitter has three internal teams that use Cascading to perform sophisticated statistical functions to analyze huge volumes of data from tweet contents, ad campaigns and user activity. Etsy executes more than 65 Cascading applications daily to extract data from its web logs and databases in order to monitor and understand user behavior, support A/B site testing and power new features on its ecommerce site.
On Tuesday, Concurrent released Cascading 2.0 under the Apache License, Version 2.0. Cascading 2.0 adds a number of new features, including an in-memory processing mode that lets users run applications on a local computer to rapidly test Big Data applications in development. Upstream's Mason says his company made the switch to Cascading 2.0 about two months ago. But even as the CTO of a company that lives and dies on its ability to leverage Big Data, Mason is not as excited by the new features of Cascading as he is by the ability to use it to more easily build a team to meet Upstream's needs.