At last week's SVForum Big Data Event, you could have been forgiven for thinking that it would be all Hadoop, all the time. Even though Hadoop was featured prominently, it was by no means the only topic.
It's no secret that Hadoop and big data are, well, big. The explosion of new data sources such as social media and website clickstreams offers new types of information and, potentially, the opportunity to generate new insights. Hadoop has been front and center in this trend, offering the MapReduce method across distributed data collections to make it possible to analyze far larger amounts of data than traditional approaches can manage.
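For readers who have not seen the MapReduce pattern, here is a minimal single-process sketch of the idea in Python. It is not Hadoop's API: the clickstream records, the partition layout, and the function names are all assumptions made up for illustration, and a real cluster would run the map and reduce steps on separate machines.

```python
from collections import defaultdict

# Hypothetical clickstream records (user_id, page), split into partitions
# the way they might be spread across machines in a cluster.
partitions = [
    [("u1", "/home"), ("u2", "/pricing"), ("u1", "/pricing")],
    [("u3", "/home"), ("u2", "/home"), ("u3", "/signup")],
]


def map_phase(records):
    """Emit a (page, 1) pair for every click in one partition."""
    return [(page, 1) for _user, page in records]


def shuffle(mapped_partitions):
    """Group the emitted values by key across all partitions."""
    grouped = defaultdict(list)
    for mapped in mapped_partitions:
        for key, value in mapped:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def page_counts(parts):
    """Run map, shuffle, and reduce in sequence over all partitions."""
    return reduce_phase(shuffle(map_phase(part) for part in parts))


page_view_counts = page_counts(partitions)
```

Because each partition is mapped independently, the map step can be spread over as many machines as there are partitions, which is what lets the approach scale to very large data collections.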
However, Hadoop, despite its obvious benefits, is no business intelligence magic bullet. A number of speakers noted the shortcomings of its batch nature: a job can take hours to return results that, once examined, turn out not to be what was desired, thus setting off another batch submission.
A number of other speakers promoted real-time BI, which appears to be associated with mining sentiment streams such as Twitter and Facebook. This sort of real-time analysis is undoubtedly valuable, but it's unlikely to be the main source of insight for most companies going forward. It's more likely to be an adjunct to other forms of insight generation, and will perhaps be most valuable for suggesting interesting areas to examine.
Big Data Investments Will Be Open Source
I moderated a panel discussing the investment opportunities in big data. While investors on these kinds of panels are always cagey (after all, they're not likely to trumpet an area they're about to invest in, are they?), their remarks pointed to something very important about the big data space and, more importantly, about the nature of IT infrastructure innovation for the future.
All three panelists agreed that the new infrastructure offerings in big data will not be venture-backed proprietary products but, rather, open source-licensed, shared-development products. This is in part because it is so expensive to bring a proprietary infrastructure product to market ($200 million was mentioned as the right level of funding) and in part because, today, innovation is dispersed, making it difficult to create a fundable entity that can capture and contain this kind of innovation.
Where will big data proprietary investment make sense? The panelists concurred that verticals that leverage big data will be areas of fruitful outcomes, and that they will be offered as SaaS. The only question is whether these verticals will build out their own computing infrastructure or leverage Amazon Web Services.
This aligns with a view I've long held that I characterize as "the migration of margin." Great software companies have been built on proprietary infrastructure (Oracle, most notably), but the day has passed for this kind of company. Open source is going to rule software infrastructure going forward. Where will high-margin opportunity reside? Further up the stack, particularly in verticals, where domain expertise is required and open source is poorly suited to address market requirements.
Algorithms Changing Business Intelligence
The most interesting part of the event for me, however, was the glimpse of the future of analytics—and it's not business intelligence, or at least BI as we've traditionally known it. Both the opening keynote, by Kaggle CEO Anthony Goldbloom, and the closing keynote, by Mike Gualtieri of Forrester, focused on predictive analytics.
You may remember the great Netflix contest, in which the company offered a big prize to anyone who could improve its recommendation engine by 10 percent or more.
The core of that effort was predictive analytics, in which an algorithm—probably dozens or hundreds of algorithms—is unleashed on a subset of a data collection to see if it can discern a pattern of data elements that's associated with some other interesting outcome. When a predictive algorithm is identified, it's set against another subset of the data collection to see if it can predict how that outcome turned out for the records in the second subset.
Gualtieri's example was mobile churn rates. A wireless company could examine marital status, payment pattern (early, on time or late), usage amounts and so on to assess whether an analysis across those elements could predict whether a subscriber was likely to terminate his or her contract. (Cynics will state the obvious: Wouldn't it be easier to offer better service as an inducement to reduce churn?)
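To make the two-subset process concrete, here is a small Python sketch applied to the churn example. It is an illustration under stated assumptions, not anything presented at the event: the subscriber records are made up, the field names are hypothetical, and the "algorithm" is a deliberately simple majority rule on a single field rather than a real predictive model.

```python
from collections import defaultdict

# Made-up subscriber records, for illustration only.
records = [
    {"marital": "single", "payment": "late", "churned": True},
    {"marital": "married", "payment": "on_time", "churned": False},
    {"marital": "single", "payment": "early", "churned": False},
    {"marital": "married", "payment": "late", "churned": True},
    {"marital": "single", "payment": "on_time", "churned": False},
    {"marital": "married", "payment": "early", "churned": False},
    {"marital": "single", "payment": "late", "churned": True},
    {"marital": "married", "payment": "on_time", "churned": False},
    {"marital": "single", "payment": "late", "churned": True},
    {"marital": "married", "payment": "early", "churned": False},
]

# First subset is used to find a pattern, second subset to check it.
train, holdout = records[:6], records[6:]


def fit_rule(rows, feature):
    """Learn, per feature value, whether most training rows churned."""
    counts = defaultdict(lambda: [0, 0])  # value -> [churned, total]
    for row in rows:
        counts[row[feature]][0] += row["churned"]
        counts[row[feature]][1] += 1
    return {value: churned * 2 > total
            for value, (churned, total) in counts.items()}


def accuracy(rule, rows, feature):
    """Share of rows where the rule's prediction matches the outcome."""
    hits = sum(rule.get(row[feature], False) == row["churned"] for row in rows)
    return hits / len(rows)


def best_rule(rows, features):
    """Pick the candidate field whose rule scores best on the training rows."""
    fitted = {f: fit_rule(rows, f) for f in features}
    best = max(features, key=lambda f: accuracy(fitted[f], rows, f))
    return best, fitted[best]


feature, rule = best_rule(train, ["marital", "payment"])
holdout_accuracy = accuracy(rule, holdout, feature)
```

On this toy data the payment pattern wins and predicts the holdout records perfectly, but only because the records were constructed that way. The point is the shape of the process: the rule is chosen using one subset and judged only on records it has never seen.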
An obvious extension of this process is evolving the algorithms to further improve predictive power in a process dubbed "machine learning." Kaggle, by the way, focuses on organizing and running predictive analytics competitions, and Goldbloom offered a fascinating example: Could a machine learning system evaluate student essays better than human teachers? The answer was "Yes," especially since the software had far less variance in evaluation than a pool of teachers would.
Better Analytics, Better Performance, But at What Cost?
It seems clear that this kind of machine learning spells the death of traditional BI. Business intelligence, after all, is built on asserting an insight against a set of data—"I think warm weather makes people want to book cruise vacations. Run a report correlating temperature against cruise bookings."
The problem with this approach is that it depends on a human deciding the right correlation among the data. This is where it gets sticky. You're depending upon the judgment—and prejudices—of a person to decide the relevant data to look at. It's much more compelling to let the data identify what is relevant.
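The contrast between the two approaches can be sketched in a few lines of Python. Everything here is an assumption for illustration: the weekly figures are made up, the column names are hypothetical, and a plain Pearson correlation stands in for whatever a real BI tool or machine learning system would compute.

```python
from math import sqrt

# Made-up weekly figures, for illustration only.
data = {
    "temperature": [10, 14, 18, 22, 26, 30],
    "ad_spend": [5, 9, 4, 8, 6, 7],
    "fuel_price": [30, 28, 25, 20, 18, 15],
}
cruise_bookings = [12, 15, 21, 24, 30, 33]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Traditional BI: the analyst decides which single relationship to test.
analyst_report = pearson(data["temperature"], cruise_bookings)


def rank_columns(columns, outcome):
    """Data-driven scan: score every column, strongest correlation first."""
    scores = {name: pearson(values, outcome) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


ranking = rank_columns(data, cruise_bookings)
```

The analyst's report answers only the question that was asked. The scan also surfaces a strong negative relationship with the hypothetical fuel price column that nobody thought to ask about, though a correlation on its own says nothing about cause.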
That's where this area becomes troubling.
Once you start down the road and say, "Let the data tell me what to do," the natural impulse is to get more data. That mobile company will strike a deal with, say, a consumer products company to get information on purchasing habits of other types of products to enable analysis regarding churn.
You might argue that this is still in the service of offering a customer more compelling reasons to stick with the wireless provider, so it's all good. What this brought to my mind, though, is a more troubling area: Finance and credit. This has been the battleground of data collection and accuracy for years.
Unlike a poorly targeted mobile offering, which can be ignored, an inaccurate credit report has real-world consequences. It's taken political action to force credit agencies to identify their data sources and offer the capability to correct inaccuracies.
A story I saw last week about a new, big data credit analysis company called ZestFinance brought this to mind. At a recent GigaOm conference, the company noted that it is "a new style of underwriting company that uses 70,000 data signals and 10 parallel machine learning algorithms to assess personal loans." The company looks at nontraditional signals such as whether or not a would-be borrower has read a letter on its website to better evaluate whether someone is likely to repay a loan.
The notion is that, by exploiting all these data signals, the company will find people that traditional credit agencies say are unworthy of a loan but, upon more detailed examination with these other data elements, are actually a good credit risk.
That's all well and good, but what happens when the signals say you aren't a good credit risk? Or when you find out you're paying half a percentage point higher interest rate on your mortgage than your neighbor? When you ask why, you'll be pointed to the fact that your neighbor read a letter on the credit company's website—or, more likely, you won't find the reason, either because the person you contact doesn't know, since, hey, it's in the algorithm, or the company won't tell you because the algorithm is a trade secret.
I generally try to avoid being alarmist about technology trends, but this aspect of the big data domain is a big concern. The combination of data and credit analysis has always proved to be explosive, and I don't expect the machine learning and big data version to be any less so. It's going to be especially problematic because of the vague judgment criteria and the likely secrecy the industry will try to impose on its shift to machine learning.
So, BI is dead, and long live BI. We're clearly on the cusp of a new way of generating insight, moving from the inefficient sifting methods driven by humans to new methods that leverage machine learning to identify relevant patterns and outcomes. We're in for a lively next couple of decades as this BI movement plays out.
Bernard Golden is the vice president of Enterprise Solutions for Enstratius Networks, a cloud management software company. He is the author of three books on virtualization and cloud computing, including Virtualization for Dummies. Follow Bernard Golden on Twitter @bernardgolden.