There is a dizzying array of big data reference architectures available today. 2014 may be the year we see a big data stack—similar to the LAMP stack that drove development of dynamic and interactive websites in the dotcom era—begin to coalesce.
Will 2014 see the emergence of a big data equivalent of the LAMP stack?
Richard Daley, one of the founders and chief strategy officer of analytics and business intelligence specialist Pentaho, believes that such a stack will begin to come together this year as consensus begins to develop around certain big data reference architectures—though the upper layers of the stack may have more proprietary elements than LAMP does.
“There’s thousands of big data reference architectures out there,” Daley says. “This is going to be more of a ‘history repeats itself’ kind of thing. We saw the exact same thing happen back with the LAMP stack. It’s driven by pain. Pain is what’s going to drive it initially; pain in the form of cost and scale.”
But organizations that address that pain with big data technologies quickly begin to see the upside of that data, Daley says, particularly those that leverage it for marketing or for network intrusion detection. (According to a CompTIA study, 42 percent of organizations were already engaged in some form of big data initiative in 2013.)
“In the last 12 months, we’ve seen more and more people doing big data for gain,” he says. “There is much more to gain from analyzing and utilizing this big data than just storing it.”
The explosion of dynamic, interactive websites in the late 1990s and early 2000s was driven, at least in part, by the LAMP stack, consisting of Linux, Apache HTTP server, MySQL and PHP (or Perl or Python). These free and open source components are all individually powerful tools developed independently, but come together like Voltron to form a Web development platform that is more powerful than the sum of its parts. The components are readily available and have open licenses with relatively few restrictions. Perhaps most important, the source is available, giving developers a tremendous amount of flexibility.
While the LAMP stack specifies the individual components (though substitutions at certain layers aren’t uncommon), the big data stack Daley envisions has a lot more options at each layer, depending on the application you have in mind.
‘D’ Is for the Data Layer
The bottom layer of the stack, the foundation, is the data layer. This is the layer for the Hadoop distributions, NoSQL databases (HBase, MongoDB, CouchDB and many others), even relational databases and analytical databases like SAS, Greenplum, Teradata and Vertica.
“Any of those technologies can be used for big data applications,” Daley says. “Hadoop and NoSQL are open, more scalable and more cost-effective, but they can’t do everything. That’s where guys like Greenplum and Vertica have a play for doing some very fast, speed-of-thought analytical applications.”
In many ways, this layer of the stack has the most work ahead of it, Daley says. Relational and analytical databases have years of development behind them, but Hadoop and NoSQL technologies are in relatively early days yet.
“Hadoop and NoSQL, I have to say we are early,” Daley says. “We’re over the chasm in terms of adoption—we’re beyond the early adopters. But there’s still a lot that needs to be done in terms of management, services and operational capabilities for both of those environments. Hadoop is a very, very complicated bit of technology and still rough around the edges. If you look at the NoSQL environment, it’s kind of a mess. Every single NoSQL engine has its own query language.”
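To make the query-language fragmentation Daley describes concrete, here is an illustrative sketch (the table, field names and sample data are invented, and none of the query strings is tied to a real database): the same "customers older than 30" lookup written as relational SQL, as Cassandra CQL and as a MongoDB filter document, with a toy evaluator for the Mongo-style form.

```python
# Three dialects for one query. The strings are illustrative only.
sql_query = "SELECT * FROM customers WHERE age > 30"                  # relational SQL
cql_query = "SELECT * FROM customers WHERE age > 30 ALLOW FILTERING"  # Cassandra CQL
mongo_filter = {"age": {"$gt": 30}}                                   # MongoDB filter document

def apply_mongo_filter(docs, flt):
    """Toy evaluator for a single-field, single-operator Mongo-style filter."""
    field, cond = next(iter(flt.items()))
    op, value = next(iter(cond.items()))
    assert op == "$gt"  # only $gt is supported in this sketch
    return [d for d in docs if d.get(field, 0) > value]

customers = [
    {"name": "Ann", "age": 34},
    {"name": "Bo", "age": 28},
    {"name": "Cy", "age": 41},
]

print([c["name"] for c in apply_mongo_filter(customers, mongo_filter)])
```

The point is not the toy evaluator itself but that a developer moving between engines must learn a new syntax each time, which is part of the "mess" Daley is describing.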
‘I’ Is for the Integration Layer
The next layer up is the integration layer. This is where data prep, data cleansing, data transformation and data integration happens.
“Very seldom do we only pull data from one source,” Daley says. “If we’re looking at a customer-360 app, we’re pulling data from three, four or even five sources. When somebody has to do an analytical app or even a predictive app, 70 percent of the time is spent in this layer, mashing the data around.”
While this layer is the “non-glamorous” part of big data, it’s also an area that’s relatively mature, Daley says, with lots of utilities (like Sqoop and Flume) and vendors out there filling the gaps.
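The "mashing the data around" Daley mentions can be sketched in miniature. In this hedged example, the sources, field names and records are all invented: two feeds for a customer-360 app (a CRM export and a web-analytics feed) are joined on a shared customer id, with light cleansing along the way.

```python
# Toy integration-layer step: merge two hypothetical sources on customer id.
crm_records = [
    {"cust_id": 1, "Name": " Ann Lee ", "region": "EMEA"},
    {"cust_id": 2, "Name": "Bo Chan", "region": "APAC"},
]
web_records = [
    {"customer": 1, "visits_30d": 12},
]

def build_customer_360(crm, web):
    # Index the web feed by customer id for a simple hash join.
    visits = {r["customer"]: r["visits_30d"] for r in web}
    merged = []
    for rec in crm:
        cid = rec["cust_id"]
        merged.append({
            "customer_id": cid,
            "name": rec["Name"].strip(),       # cleanse the CRM field
            "region": rec["region"],
            "visits_30d": visits.get(cid, 0),  # default when the feed has no row
        })
    return merged

profiles = build_customer_360(crm_records, web_records)
print(profiles)
```

A real deployment would do this with tools like Sqoop, Flume or a commercial integration product rather than hand-written joins, but the shape of the work, reconciling ids, normalizing fields and filling gaps, is the same.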
‘A’ Is for the Analytics Layer
The next layer up is the analytics layer, where analytics and visualization happen.
“Now I’ve got the data. I’ve got it stored and ready to be looked at,” Daley says. “I take a Tableau or Pentaho or Qlikview and visualize that data. Do I have patterns? This is where people—business users—can start to get some value out of it. This is also where I would include search. It’s not just slice-and-dice or dashboards.”
This area too is relatively mature, though Daley acknowledges there’s a way to go yet.
“We’ve got to figure out as an industry how to squeeze more juice out of Hadoop—methods to get data faster,” he says. “Maybe we acknowledge that it’s a batch environment and we need to put certain data in other data sources? Vendors are working around the clock to make those integrations better and better.”
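The "slice-and-dice" work this layer performs boils down to grouping by a dimension and aggregating a measure. Here is a minimal sketch, with invented event data and field names, of the kind of roll-up a dashboard at this layer would render as a chart.

```python
from collections import defaultdict

# Invented sample events: one dimension (channel) and one measure (revenue).
events = [
    {"channel": "web",    "revenue": 120.0},
    {"channel": "mobile", "revenue": 80.0},
    {"channel": "web",    "revenue": 45.5},
    {"channel": "store",  "revenue": 200.0},
]

def rollup(rows, dimension, measure):
    """Group rows by one dimension and sum one measure."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension]] += row[measure]
    return dict(totals)

print(rollup(events, "channel", "revenue"))
```

Tools like Tableau, Pentaho or Qlikview perform this aggregation at scale and add the visualization on top; the sketch only shows the core operation.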
‘P’ Is for the Predictive/Prescriptive Layer
The top layer of the stack is predictive/prescriptive analytics, Daley says. This is where organizations start to truly recognize the value of big data. Predictive analytics uses data (historical data, external data and real-time data), business rules and machine learning to make predictions and identify risks and opportunities.
One step further along is prescriptive analytics, sometimes considered the holy grail of business analytics, which takes those predictions and offers suggestions for ways to take advantage of future opportunities or mitigate future risks, along with the implications of the various options.
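The predictive-to-prescriptive progression can be sketched in a deliberately tiny example. Everything here is invented for illustration: a least-squares trend line is fit to historical weekly churn counts (predictive), and the forecast is then mapped through business rules to a suggested action (prescriptive).

```python
def fit_trend(ys):
    """Ordinary least squares for y = a + b*x with x = 0..n-1."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

def prescribe(forecast):
    # Prescriptive step: invented business rules mapping prediction to action.
    if forecast > 60:
        return "launch retention campaign"
    if forecast > 40:
        return "flag accounts for outreach"
    return "no action"

weekly_churn = [30, 34, 39, 45, 50, 56]  # invented historical data
a, b = fit_trend(weekly_churn)
next_week = a + b * len(weekly_churn)    # predict the next week
print(round(next_week, 1), prescribe(next_week))
```

Real predictive work would use machine learning over far richer historical, external and real-time data; the sketch only shows the shape of the pipeline Daley describes, where a prediction feeds a recommendation.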
“You have to go through and do predictive to get value out of big data,” he says. “It’s a low likelihood that you’re going to get a lot of value out of just slicing and dicing data. You’ve got to go all the way up the stack.”
“At least 70, maybe even 80 percent of what we see around big data applications is now predictive or even prescriptive analytics,” Daley adds. “That’s necessity, the mother of invention. It starts at the bottom with data technology—storage, data manipulation, transformations, basic analytics. But what’s happening more and more, finally, is predictive, advanced analytics is coming of age. It’s becoming more and more mainstream.”
While predictive analytics technology is relatively mature, it remains an area that only data scientists are equipped to handle.
“I think predictive is a lot farther along than the bottom layer of the stack,” Daley says. “From a technology standpoint, I think it’s mature. But we need to figure out how to get it into the hands of a lot more users. We need to build it into apps that business users can access versus just data scientists.”
What’s That Spell? DIAP? PAID?
Call it the DIAP stack. Or maybe start from the top and call it the PAID stack. The trick now, Daley says, is not just adding more maturity to component technologies like Hadoop and NoSQL, it’s providing integration up and down the stack.
“That’s a very key point,” he says. “To date, all these things are separate. A lot of companies only do one of these things. Hortonworks will only do the data side, they won’t do integration, for example. But customers like to go through and buy an integrated stack. We should at least make sure that our products up and down those stacks are truly integrated. That’s where it’s going to have to get to. In order to really get adopted, products and vendors are going to need to work up and down that stack. I need to support every flavor of Hadoop—at least the commercially favorable ones. And it’s the same thing for NoSQL.”