by Arik Hesseldahl

AI is hungry for fresh data…so why are you starving it?

Feb 05, 2018
Artificial Intelligence | IT Strategy | Technology Industry

Companies that fail to gather, secure, and deliver data to those who need it are putting their business at risk.


Businesses spent an awful lot of time talking about artificial intelligence and machine learning in 2017. It appears that 2018 is shaping up to be the year the talk trails off and morphs into reality.

And that’s precisely the moment that CIOs and IT decision-makers at large companies are going to realize that for all the attention they’ve paid to enabling AI in their business, they may have neglected the critical resource that fuels it: Data.

Data is always on the mind of Eric Schrock, CTO at Delphix, the Redwood City, Calif.-based company that since 2008 has worked to make data easier to work with and more accessible to the people who need it. Whatever business you’re in, chances are you’re dreaming of a near future where that business is made better by AI, and Schrock says that means that whether you know it or not, you have an unquenchable thirst for data that’s quickly going to rise to the top of your priority list.

Six months ago, he articulated the principles of DataOps, an emerging set of practices that enable more flexible and secure access to data for the people who need it. Earlier this month we revisited the topic, and I asked him what questions CIOs who are on this AI journey should be asking about their own business data.

1. ‘Is your data diverse enough to train your AI models?’

“How you train your models and the data you use to do it is incredibly important,” Schrock says. “If you’re not able to train your model using data that’s sufficiently diverse, it’s not going to work the way you want it to.”

Big companies like Facebook, Google and Apple have the business scale and reach to build great data sets. Everyone else, Schrock says, typically has to settle for data that’s “good enough.” He calls this the “data gap,” and overcoming it requires a lot of hard work and focus on improving the data’s overall quality.

What defines “great”? One characteristic is better data diversity. Companies too often focus on a single source of data, which prevents them from establishing context or describing useful attributes that can improve the overall data quality and diversity. Even when they do have diverse data sets within the company, data scientists are often starved of access, leaving them no better off than if there were no data at all.

For instance, was the data generated by a user at home or at work? During a weekday or a weekend? In a small town or in a city? Maybe the data is for a voice-interface app that must understand different regional accents. After all, consumers in Boston and Nashville tend to speak differently.

It’s important to decide early what to collect, and chances are you’re going to err on the side of the “collect” option: “If you decide early on not to collect what your customers’ favorite colors are, and then three years later you suddenly wish you had that data, there’s not much you can do about it,” Schrock says.

“There are a lot of different types of data, especially data that adds context by describing multiple attributes that can help train your models,” Schrock says. “The more of these attributes you gather, and the more freely data scientists can select the ones they need, the better your data is, and your models will only improve as a result. Without data diversity, you won’t get the outcomes you’re hoping for. It’s that simple.”
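The idea of tagging training records with contextual attributes and then checking how examples are distributed across them can be sketched in a few lines. This is a hypothetical illustration, not Delphix's tooling; the record fields (`region`, `setting`, `day`) are invented for the accent example above.

```python
from collections import Counter

# Hypothetical training records for a voice-interface app: each utterance
# is tagged with contextual attributes alongside the raw data.
records = [
    {"text": "pahk the cah",       "region": "Boston",    "setting": "home", "day": "weekend"},
    {"text": "y'all ready",        "region": "Nashville", "setting": "work", "day": "weekday"},
    {"text": "turn on the lights", "region": "Boston",    "setting": "work", "day": "weekday"},
]

def diversity_report(records, attribute):
    """Count how the training examples are distributed across one attribute."""
    return Counter(r[attribute] for r in records)

# A heavily skewed distribution is a warning sign that the model will
# underperform for the under-represented groups.
print(diversity_report(records, "region"))  # Counter({'Boston': 2, 'Nashville': 1})
```

Because the attributes were collected up front, the same report can be run on `setting` or `day` without touching the raw data again, which is exactly the optionality Schrock argues for.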

2: ‘Do your security policies help or hinder business decisions?’

We all know the old adage that convenience is the enemy of security and vice versa.

And yet barely a week goes by without news of ever-more shocking data breaches. That has fueled a reflexive desire on the part of many IT leaders to lock down access to their data to the greatest extent possible. It’s an understandable reaction, but a misguided one, especially at a moment when you’re deep in the weeds of building an AI system, Schrock says. Borrowing a football metaphor, he says, “Defense doesn’t win championships.” For those who don’t follow sports: A team with perfect defense but lousy offense rarely scores, and thus never wins against teams that do.

What it means for your data is clear: Data that’s been locked down too well is essentially useless to developers and data scientists who need it. And in an age where both are accustomed to moving fast and iterating, data that’s well-protected but out of reach may as well not exist. “Developers and data scientists are under pressure to run a mile-a-minute,” Schrock says. “If it takes months for security reviews to get access to one data set, they’re never going to be effective,” he says. “If your data scientists are stuck waiting for data, your business is stuck too.”

The answer is to secure the delivery of that data within a single workflow. Once you understand which portions of your data require protection — personal account information, for example — you can then use techniques like data masking and de-identification to deliver it in a useful form.

“If 97 percent of your data usage doesn’t involve credit card numbers, then why deliver credit card numbers if they’re not needed for the work at hand? Why not just give data scientists a copy of the data without the credit card numbers?” Going forward, Schrock says, more companies will see what Facebook and Google can do and discover that they want to enable more uses for data while managing risks that are specific to it. In the end, data will become more flexible while remaining secure.
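The masking approach Schrock describes can be sketched simply: hand data scientists a copy in which the sensitive fields have been replaced with irreversible tokens while everything else stays intact. This is a minimal illustration, not a production de-identification pipeline (real masking tools typically preserve the original format and handle many more field types); the record layout and field names are assumptions.

```python
import hashlib

# Hypothetical customer records containing one sensitive field.
records = [
    {"customer_id": 1, "card_number": "4111111111111111", "purchase_total": 42.50},
    {"customer_id": 2, "card_number": "5500005555555559", "purchase_total": 17.25},
]

SENSITIVE_FIELDS = {"card_number"}

def mask(value):
    """Replace a sensitive value with a stable, irreversible token.
    Identical inputs yield identical tokens, so joins across data sets
    still work, but the original value can't be recovered from the copy."""
    return hashlib.sha256(value.encode()).hexdigest()[:12]

def deliver_copy(records):
    """Produce the copy handed to data scientists: sensitive fields
    masked, all other attributes delivered as-is."""
    return [
        {k: (mask(v) if k in SENSITIVE_FIELDS else v) for k, v in r.items()}
        for r in records
    ]

safe = deliver_copy(records)
```

For the 97 percent of work that doesn’t need real card numbers, the masked copy is just as useful — and a breach of it exposes nothing.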

3. ‘Does data flow where you need it to be?’

Data lives in many places today: Public and private cloud and on-premise environments are just the tip of the iceberg. Most companies have a serious data flow problem with bottlenecks aplenty across their data spectrum.

The future of AI will be brought to you by the cloud. It’s a pretty sure bet that your AI-driven app is going to run, at least in part, on a public cloud. And, to complicate things, or make them interesting depending on who you ask, different cloud vendors — Amazon Web Services, Google Cloud and Microsoft Azure — each have inherent strengths and weaknesses and differing capabilities when it comes to providing a platform for AI.

Digging in a bit: Google has led the way with TensorFlow, an open-source machine learning library; offers access to Tensor Processing Unit chips, which are optimized for use with TensorFlow on its cloud; and has long had a lead in areas like image and video recognition. AWS, on the other hand, is betting big on conversational interfaces like Lex and has its own image and video analysis engine called Rekognition.

Once you’ve sorted through the strengths and weaknesses of each, you need to feed it with data stored on your own private cloud or, in some instances where you’re dealing with sensitive or proprietary information, inside on-premise systems. And if your data is in a public cloud, it may be in a different location or network that makes it impractical to access in the clouds where you need it.

CIOs sometimes get skittish about letting production data leave the private cloud or on-premise environment, but often at the cost of the business. “Moving compute is easy. Moving data is hard. Data often becomes a limiter to making decisions about what’s best for your business,” Schrock says. That’s left many companies scrambling for ways to enable access to their data from wherever they need it.

4. ‘Are you setting your AI up for failed outcomes?’

DataOps is a new set of operating practices that companies are using to help take some of the worries about where and how that data is accessed off the table. “It’s about having the people, processes and technology in place to make the decisions you need without worrying so much about where the data is,” he says.

Companies that fail to gather, secure, and deliver data to those who need it are putting their business at risk. At worst, your AI will do more harm than good: It may guide incorrect diagnoses of chest x-rays or mis-identify entire races of people as gorillas. More likely, the results may simply turn out to be underwhelming. It may fail to deliver the insights you desire, or take so long to get there that by the time you get an outcome, the information in it no longer matters.

And that segues directly into the most important aspect of all this: The time to get on top of all of these questions is now, Schrock says. “You can make all kinds of big plans for an AI initiative and spend months working on it. But nine months in if you find out that your data is stuck, you can expect to spend another nine months or more figuring out what to do,” he says. “And while you’re doing that, your competition may pass you by.”