by Elana Varon

Big Data in the Real World Isn’t So Easy

Apr 16, 2012 | 14 mins
Business Intelligence, Data Management, IT Leadership

The pundits make it sound so simple. But CIOs wrestling with overwhelming amounts of unwieldy data know that 'big data' is a big challenge that requires high-quality data, new approaches to data management and more processing power. But the payoff could be a strong competitive advantage.

General Motors’ OnStar service, which provides drivers with remote vehicle diagnostics and responds to emergencies, already manages as much as 3 petabytes of data annually. OnStar CIO Jeffrey Liedel knows there is so much more that can be done to exploit that data, for the benefit of drivers and GM’s business.

For example, GM is pilot-testing a mobile app for its Chevrolet Volt electric car that would help drivers monitor their vehicle batteries and remotely manage charging.

Competitors, including Nissan and Ford, offer similar capabilities to monitor electric vehicles, or plan to. Drivers want manufacturers to alleviate their “range anxiety,” the worry that an electric vehicle is about to run out of juice. But that’s not all. “There’s something about the electric vehicle: you want to be connected to it,” Liedel says. “The customer is more interested in analytics: How well am I driving, driving patterns, what’s my fuel economy.” (The Volt can also run on gasoline.)

Electric vehicle owners aren’t the only ones who want deeper insight from OnStar data. Internal business users and external partners want it, too. It falls to IT, Liedel says, to deliver the data in a way that is reliable, secure and flexible. “The key has been to recognize the importance of data and analytics,” he says. “Even though sometimes it’s not core to running a transactional system, it’s a key part of running the business.”

Not every CIO has to manage petabytes of information. But even companies that collect mere gigabytes of their own data are, increasingly, tapping information from outside their own systems. Having the capacity to process “big data” and the tools to analyze it effectively is becoming a competitive necessity.

“Every organization is trying to leverage the data that it has, or that it has access to, much better than they ever have before,” observes Gavin Michael, chief technology innovation officer at Accenture, yet “a lot of companies have had small analytics groups. They’ve never seen it as an enterprise resource.” CIOs are able to take an enterprise view of data, understand how to integrate it and help colleagues analyze it.

As a result, IT leaders are rethinking many aspects of how they manage and deliver information, from investments in infrastructure and analytics tools to new policies for organizing and accessing data so they can deliver more of it, faster.

Invest for Capacity and Speed

When Epic Advertising merged with Connexus in 2010 to form Epic Media Group, the combined companies were in the midst of a data explosion. Connexus, which at the time was serving ads to websites at a rate of 300 million per day, was running out of storage.

“Their infrastructure was out of date and they were in the process of trying to figure out what to do,” recalls CIO Rick Okin. “They were leaving a lot of data on the table.” If analysts wanted insight into a particular ad campaign, they had to request that IT monitor it; otherwise, details about consumer behavior related to the ad wouldn’t be captured completely.

Not only was this limitation handicapping existing operations at the privately held company, it would hamper growth if it continued. Any company can buy a server and push out ads. The agency’s value to customers comes from being able to figure out how to get maximum impact from each ad, and even generate new ideas for how to reach consumers, Okin says. “You can do analysis that gives you some insights into what works and what doesn’t work. Models might catch things that humans might never see.”

As the company evaluated new infrastructure technologies, scalability, the ability to add both servers and storage without sacrificing network performance, was a crucial factor. “The average query had to come back in a minute,” Okin says. “If we doubled the data that we’re loading, load times would probably double, but if we doubled our hardware [also], it would remain consistent.”
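Okin’s rule of thumb can be expressed as a back-of-envelope model, assuming near-linear scale-out. The function and numbers here are illustrative, not Epic Media’s actual figures; real clusters rarely scale this cleanly.

```python
def load_time(base_minutes, data_factor, hardware_factor):
    """Estimated load time under an idealized linear scale-out model:
    time grows with data volume and shrinks with added hardware."""
    return base_minutes * data_factor / hardware_factor

# Doubling the data alone roughly doubles load time...
print(load_time(1.0, data_factor=2, hardware_factor=1))  # 2.0
# ...but doubling the hardware as well keeps it consistent.
print(load_time(1.0, data_factor=2, hardware_factor=2))  # 1.0
```

This is why scale-out platforms are attractive for workloads like ad serving: capacity can track data growth without query times drifting upward.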

Epic Media decided to push ahead with plans Connexus had developed for a private cloud infrastructure, using a network, computing and storage platform from Joyent; a data warehouse from Vertica; and business intelligence software from MicroStrategy. The company also built a proprietary application for predictive modeling, which runs on top of Vertica.

With more processing power and dedicated tools, Epic Media was able to start providing clients with insights about the audience for their ads and give them information they could use to develop their campaigns, says Okin. “It allows us to stay involved with the agency or advertiser we’re working with.” Client turnover has decreased as a result.

Transactions and Analytics May Not Mix

As hardware and storage costs decline, CIOs may find it cost-effective to add capacity to existing systems to support analytics. But some IT leaders say that when analyzing big data, maintaining separate transaction and analytics systems can be essential to getting both processes right.

That’s because databases set up for transactions may not be structured to perform heavy number crunching, and doing so may impair performance, notes Epic Media’s Okin.

Epic Media’s Vertica and MicroStrategy platforms are separate from the company’s transaction systems, which serve the ads and capture consumer data. The company’s current goal is to load new data into the Vertica system hourly, although eventually, Okin says, data will be available for analysis within a few minutes of being collected.

“We don’t want to impact our transactional systems by executing large queries against them,” he adds, or store as much data in them as would be needed for analytics. The transaction systems are optimized for fast data processing and the analytics systems for handling queries.
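The pattern Okin describes is an incremental, watermark-based load from the transactional store into a separate analytics store. The sketch below is a minimal illustration of that idea, not Epic Media’s actual pipeline; the table and column names are hypothetical, and SQLite stands in for both systems.

```python
import sqlite3

def incremental_load(oltp, olap, last_id):
    """Copy only rows written since the last high-water mark into the
    analytics store, so heavy queries never touch the OLTP system."""
    rows = oltp.execute(
        "SELECT id, ad_id, clicked FROM impressions WHERE id > ?",
        (last_id,),
    ).fetchall()
    olap.executemany("INSERT INTO impressions_fact VALUES (?, ?, ?)", rows)
    olap.commit()
    # Return the new watermark for the next scheduled run.
    return max((r[0] for r in rows), default=last_id)

# Stand-in databases; in production these would be separate systems.
oltp = sqlite3.connect(":memory:")
olap = sqlite3.connect(":memory:")
oltp.execute("CREATE TABLE impressions (id INTEGER, ad_id TEXT, clicked INTEGER)")
olap.execute("CREATE TABLE impressions_fact (id INTEGER, ad_id TEXT, clicked INTEGER)")
oltp.executemany(
    "INSERT INTO impressions VALUES (?, ?, ?)",
    [(1, "ad-a", 0), (2, "ad-a", 1), (3, "ad-b", 0)],
)
watermark = incremental_load(oltp, olap, last_id=0)
```

Run hourly, a job like this keeps the analytics copy fresh while isolating the transactional system from large analytical queries.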

Similarly, the U.S. Department of Veterans Affairs has deployed 25 data warehouses during the past two years to facilitate analysis of big data. The department provides health benefits to 22 million military veterans. CIO Roger Baker says analyzing data directly from veterans’ electronic health records would impede clinicians’ ability to use its EHR system, called VistA, with patients.

But these records can’t be easily compared to those of other patients, or across many years. “We have a treasure trove of information gathered over 20 to 30 years about symptoms, treatment and results,” says Baker. The VA has even launched a program soliciting DNA samples to complement veterans’ health records. Insights hidden in petabytes of clinical and genetic data could point to more effective medical treatments.

VistA includes “hierarchical databases focused on transaction speed,” says Baker, so “they’re very, very quick when a clinician is interacting with a patient in the room.” The analytics databases, on the other hand, are organized according to clinical topic: pharmacy information in one place, hematology data in another. “We want to present the information that’s going to be relevant for researchers, to provide computing power and let researchers figure out” what data they need.

There’s so much data that during the next year Baker will be looking into providing supercomputers to crunch the numbers faster, so it’s even easier for researchers to use. “The more access we can provide, the more value that the information has,” he says.

Not every company with a lot of data to analyze needs to invest in supercomputers, however. Whether you do may depend on how quickly users need the results of their queries. Getting answers from large data sets can take days using conventional servers, rather than hours or minutes.

But not everyone needs every answer fast, says Isaac Kohane, director of informatics at Children’s Hospital in Boston and a professor at Harvard Medical School. He leads collaborations among researchers and physicians at Harvard and its affiliated hospitals to develop and use technologies for crunching clinical and research data. Most users don’t have queries that take “days or weeks” to run, he says. Those that do aren’t doing work that is such a high business priority that it would justify the investment in more processing power to speed up the results.

Focus on the Data

A bigger issue for CIOs is making sure the data itself is available and reliable. Big data complicates data governance, quality and access control challenges. Companies still struggling to break down internal information silos now have to add integrating data from external sources to the agenda.

As anyone who has spent time wrestling with Master Data Management knows, this is political, as well as technical, work, requiring deep relationships with business users. “Bringing an organization around a common data agenda is at times difficult,” says Accenture’s Michael. “Sharing data across an organization is not necessarily natural.” But decisions about what information to include and how to represent it are crucial because they dictate what analysts can do.

Last year, Blue Cross/Blue Shield of Rhode Island reorganized, streamlining its operations. In the process, executives took a hard look at how they were using, and failing to use, corporate data.

“Our data resources were very fragmented. There was an awful lot of home cooking going on in each department,” with financial analysts, underwriters and health care analysts among other groups building their own data sets, says Bill Wray, who served as vice chairman and CIO before becoming COO in September. “There wasn’t a central governance to pull it together.”

The need for a corporate analytics capability became urgent because, spurred by federal health care reform, Blue Cross/Blue Shield wants to shift how it reimburses doctors and hospitals. Instead of merely processing claims (at a rate of 1 million per month), the insurer wants to give health care providers financial incentives to make patients healthier. It also wants to encourage patients to have a closer relationship with their primary care doctors.

“Intuitively you know if you make better use of your primary care physician you’ll be healthier and you’ll cost less,” Wray says. “How do you prove that, though? You have a wide variety of people going to doctors. There’s a lot of longitudinal analysis and trend analysis that makes this very complicated.”

Wray is also exploring ways to integrate patient data from health care providers, a task complicated by health care privacy laws. “It’s completely feasible right now, and it happens in closed systems [such as Kaiser Permanente] all the time. How do you do that in a virtual environment?”

The company had an enterprise data repository, but it was set up primarily to gather data for reporting to a national Blue Cross Blue Shield business intelligence system used for benchmarking. Local analysts didn’t use it. One problem: It was missing information from one of the company’s two claims systems (one of which is now being decommissioned). It also didn’t incorporate external market data. In the process of making it more useful, company leaders had to agree on policies and processes for handling data, setting standards and making decisions about “what a particular field means,” Wray says.

Establishing new data governance practices is a key step for the Colorado Department of Education as it develops its Statewide Longitudinal Data System (SLDS). The project aims to integrate student data from 178 school districts and 28 public colleges and universities with welfare, income and workforce data to create a platform for analyzing student achievement from pre-school through college.

“We even have the Department of Corrections involved,” says CIO Daniel Domagala. Nine of the project’s 25 objectives relate to capturing data, including the establishment of common course and program codes and a way to integrate data collected by preschools.
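One of the standardization objectives Domagala mentions, common course and program codes, amounts to maintaining a crosswalk from each district’s local codes to a shared statewide vocabulary. The sketch below is hypothetical; the district names, local codes and statewide codes are invented for illustration, not Colorado’s actual coding scheme.

```python
# Hypothetical crosswalk mapping (district, local course code) pairs
# to a common statewide code, so records from different districts
# can be compared in one longitudinal system.
CROSSWALK = {
    ("denver", "MATH-101"): "CO-MAT-ALG1",
    ("pueblo", "M1A"): "CO-MAT-ALG1",
    ("denver", "ENG-09"): "CO-ELA-09",
}

def normalize(district, local_code):
    """Return the statewide code, or None if the local code is unmapped
    and needs review by the data governance group."""
    return CROSSWALK.get((district.lower(), local_code))
```

Unmapped codes surfacing as `None` is deliberate: they flag exactly the validation and verification work Domagala says standards should reduce.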

Ultimately, local administrators and classroom teachers would be able to use the system to learn how their students perform over time compared to other students in the state, and to tease out the impact of factors such as income level, preschool attendance and high school coursework on their readiness to attend college or get a job. It’s a long-term effort (the current phase began in 2009) that will eventually involve terabytes of data.

Compared to the oil and gas industry, where Domagala used to work, the scope of the project seems small. “It’s more the breadth of data, connecting different data sources,” he says. But it requires big changes in how the state manages and uses information.

In principle, state agencies want to share data. But across just the school districts, there are multiple ways to report information. “Traditionally, within the education realm, funding would come for a program [and] a system would be built to monitor and track” it with its own data requirements and access controls, Domagala says. Meanwhile, each district has its own systems and priorities.

As a result, says Domagala, large, urban districts such as Denver do a better job than the state at providing information to individual schools, while small, rural districts have “limited capacity or zero capacity.” And few of these many systems can share data.

“It becomes unwieldy for school districts to deal with all this information, to do the validation and verification. The more we can introduce standards, the more we can reduce redundancies,” says Domagala. And IT can focus more on helping educators use their data instead of collecting and policing it.

Control Access, Not Analysis

Organizations that handle health care, financial and education information have laws and regulations to dictate who can access different types of data. But every company restricts data in some way. Doing so may not be challenging technically: identity management technologies let CIOs keep control over how data gets distributed. However, the demands of big data raise the bar for corporate policies that define what type of access is permitted, by whom, and when.

Distributing access to VA’s massive data archive, for example, “goes beyond what most large-scale databases and computing organizations have had to deal with in the past,” says Baker. On one hand, Baker has to ensure that individual veterans have access to all the information the agency has about them through their electronic health records. On the other, he has to ensure no personally identifiable information is released publicly to researchers.

The challenge is “how do you provide massive amounts of de-identified data for research where you can be less concerned about controls because you’ve removed the threat” of disclosure, says Baker.
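De-identification of the kind Baker describes typically means stripping direct identifiers while keeping a stable, non-reversible key so a patient’s records can still be linked across years. The sketch below is a simplified illustration, not the VA’s actual pipeline; the field names are hypothetical, and a production system would follow a formal standard such as HIPAA’s Safe Harbor rules.

```python
import hashlib

# Illustrative list of direct identifiers to remove before release.
PII_FIELDS = {"name", "ssn", "address", "dob"}

def de_identify(record, salt="research-cohort-1"):
    """Drop direct identifiers and replace them with a salted hash of the
    patient identifier, preserving longitudinal linkage without exposing
    who the patient is. Assumes each record carries an 'ssn' field."""
    out = {k: v for k, v in record.items() if k not in PII_FIELDS}
    out["patient_key"] = hashlib.sha256(
        (salt + record["ssn"]).encode()
    ).hexdigest()[:16]
    return out
```

Because the same salt yields the same key for the same patient, researchers can follow symptoms, treatments and results over decades while the release itself carries no personally identifiable information.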

By resolving this tension effectively, CIOs can set analysts loose to do their work and get out of the business of telling them what kinds of reports to run, when to run them, or what tools to use. The IT organization can become more of an adviser and a steward than a gatekeeper.

VA provides some “heavyweight” tools for researchers, but many of them supply their own. “Especially in our research and development area, researchers will use specialized analytical tools designed for the type of research they’re doing,” says Baker.

Wray, at Blue Cross Blue Shield of Rhode Island, has revamped the company’s data analyst teams to create a “community of practice,” making it easier for them to share tools and techniques. “There are apps other people have built that you can take advantage of. The staff tracks and markets them so people know they don’t have to build things for themselves anymore.”

Such flexibility is critical, says Brian Hopkins, an analyst at Forrester Research. The volume of data is growing so fast, and with it, the demand for fresh analysis, that a traditional approach to business intelligence, in which business leaders define what they need and IT builds a system to deliver it, won’t fly. Big data demands a fresh attitude.

“No one stakeholder group has all the answers,” Hopkins says. “In traditional BI environments, you have a set of business analysts and data integration specialists who work closely with the business, but they’re IT people. What we’re finding is for firms that are into big data, that model doesn’t work. They’re having to come together and collaborate.”

To set the right tone, Liedel chose a manager with a business resume to run his data reporting team.

“She doesn’t have an Oracle DBA in her background, which was a qualification we looked at in the past,” he says. “That was a big change for us.”

Elana Varon is a freelance writer and editor based in Massachusetts.