One of the oldest maxims of the information processing age – garbage in, garbage out (GIGO) – is staging a splashy comeback. The reason? The often messy intersection of big data and artificial intelligence (AI).
As noted in an earlier post, AI solutions typically depend upon massive amounts of data, both during their training stages and once in production. An inability to satiate this data hunger due to bottlenecks in the data infrastructure can severely hamper the functionality of AI technology.
Even more critical than the volume of data fed to AI systems, however, is the quality and relevance of that data. AI algorithms and models can be world-class, but can still fail miserably if the data they consume is suspect.
It’s important to understand that “garbage” in the context of AI operations isn’t just data that is corrupt or inaccurate in some fashion. The data may be perfectly good, but irrelevant to the AI problem at hand. Or it may be directly relevant, but incomplete and, as a result, problematic.
As detailed in a CIO.com article that characterized “data gone wrong” as AI’s biggest risk factor, human biases can also play a role in tripping up AI-driven conclusions. Training AI systems begins by feeding them large data sets that have been labeled by people, sometimes misleadingly. In one example cited, a picture of a man cooking might be identified as a woman cooking by an AI-based image analyzer because all of the cooking images used to train the system featured only female cooks. Biased data, whether introduced intentionally or not, is a major concern in the world of AI.
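The mechanism behind that cooking example can be illustrated with a deliberately simple sketch. The toy “classifier” and its training data below are purely hypothetical, invented for illustration: because every cooking scene in the training set was labeled “woman,” any new cooking scene inherits that label, no matter who is actually in the picture.

```python
from collections import Counter

# Hypothetical, deliberately biased training data: every example containing
# cooking-related features was labeled "woman" by human annotators.
training = [
    ({"kitchen", "stove", "cooking"}, "woman"),
    ({"kitchen", "pan", "cooking"}, "woman"),
    ({"apron", "cooking"}, "woman"),
    ({"office", "laptop"}, "man"),
]

def predict(features):
    """Toy classifier: vote for each training label, weighted by how many
    features the new example shares with that training example."""
    votes = Counter()
    for feats, label in training:
        votes[label] += len(features & feats)  # crude overlap score
    return votes.most_common(1)[0][0]

# A photo of a man cooking: its features overlap only with "woman"-labeled
# training examples, so the biased model confidently answers "woman".
print(predict({"cooking", "stove", "beard"}))  # prints "woman"
```

The flaw here is not in the algorithm, which behaves exactly as designed, but in the training data it was given — the same failure mode the article describes at far larger scale.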
The history of AI is littered with these sometimes-amusing gaffes. But such errors can have serious ramifications when AI systems are enlisted to help inform – to say nothing of make – decisions in business, healthcare, and other real-world settings.
Even more ominously, AI’s big data dependency could make AI systems vulnerable to corruption by cyber criminals and hackers. It’s possible, for instance, that bad actors could insert tiny, difficult-to-detect errors or malware into the data streams feeding AI engines, either causing bad analyses or introducing vulnerabilities for future exploitation.
Indeed, the susceptibility of AI systems to data-based attacks has become a concern of U.S. government defense and security agencies, as noted in a recent article. One concern is that of an adversary feeding false data to a defense or security AI system to make it “learn” what the adversary wants it to learn.
Once the data destined for AI systems has been properly labeled and cleaned, determined to be relevant, and properly secured, it must still be stored in massive volumes and often delivered to AI engines in near real time. But ensuring data quality is a required first step before any AI-based solution can be trusted and, ultimately, successful.
For more information about how Pure Storage can help your organization clean up data, click here.