IT leaders seeking to derive business value from the data their companies collect face myriad challenges. Perhaps the least understood is the lost opportunity of not making good on data that is created, and often stored, but seldom otherwise interacted with.\n\nThis so-called \u201cdark data,\u201d named after the dark matter of physics, is information routinely collected in the course of doing business: It\u2019s generated by employees, customers, and business processes. It\u2019s generated as log files by machines, applications, and security systems. It\u2019s documents that must be saved for compliance purposes, and sensitive data that should never be saved, but still is.\n\nAccording to Gartner, the majority of your enterprise information universe is composed of \u201cdark data,\u201d and many companies don\u2019t even know how much of this data they have. Storing it increases compliance and cybersecurity risks, and, of course, doing so also increases costs.\n\nFiguring out what dark data you have, where it is kept, and what information is in it is an essential step to ensuring the valuable parts of this dark data are secure, and those that shouldn\u2019t be kept are deleted. But the real advantage to unearthing these hidden pockets of data may be in putting them to use to actually benefit the business.\n\nMining dark data, however, is no easy task. 
It comes in a wide variety of formats and can be completely unformatted, locked away in scanned documents or in audio or video files, for example.\n\nHere is a look at how some organizations are transforming dark data into business opportunities, and what advice industry insiders have for IT leaders looking to leverage dark data.\n\nCoded audio from race car drivers\n\nFor five years, Envision Racing has been collecting audio recordings from more than 100 Formula E races, each with more than 20 drivers.\n\n\u201cThe radio streams are available on open frequencies for anyone to listen to,\u201d says Amaresh Tripathy, global leader of analytics at Genpact, a consulting company that helped Envision Racing make use of this data.\n\nPreviously, the UK-based racing team\u2019s race engineers tried to use these audio transmissions in real time during races, but the code names and acronyms drivers used made it difficult to figure out what was being said and how to make use of it. Understanding what other drivers were saying could help Envision Racing\u2019s drivers with their racing strategy, Tripathy says.\n\n\u201cSuch as when to use the attack mode. When to overtake a driver. When to apply brakes,\u201d he says.\n\nEnvision Racing was also collecting sensor data from its own cars, such as from tires, batteries, and brakes, and purchasing external data from vendors, such as wind speed and precipitation.\n\nGenpact and Envision Racing worked together to unlock the value of these data streams, using natural language processing to build deep learning models to analyze them. The process took six months, from preparing the data pipeline, to ingesting the data, to filtering out noise, to deriving meaningful conversations.\n\nTripathy says humans take five to ten seconds to figure out what they\u2019re listening to, a delay that made the radio communications irrelevant. 
Now, thanks to the AI model\u2019s predictions and insights, the race engineers can respond in one to two seconds.\n\nIn July, at the ABB FIA Formula E World Championship in New York, the Envision Racing team took first and third places, a result Tripathy credits to making use of what was previously dark data.\n\nDark data gold: Human-generated data\n\nEnvision Racing\u2019s audio files are an example of dark data generated by humans and intended for consumption by other humans, not by machines. This kind of dark data can be extremely useful for enterprises, says Kon Leong, co-founder and CEO of ZL Technologies, a data archiving platform provider.\n\n\u201cIt is incredibly powerful for understanding every element of the human side of the enterprise, including culture, performance, influence, expertise, and engagement,\u201d he says. \u201cEmployees share absolutely massive amounts of digital information and knowledge every single day, yet to this point it\u2019s been largely untapped.\u201d\n\nThe information contained in emails, messages, and files can help organizations derive insights such as who the most influential people in the organization are. \u201cEighty percent of company time is spent communicating. Yet analytics often deals with data that only reflects 1% of our time spent,\u201d Leong says.\n\nProcessing human-generated unstructured data is uniquely challenging. Data warehouses aren\u2019t typically set up to handle these communications, for example. 
Moreover, collecting these communications can create new compliance, privacy, and legal discovery issues for companies to deal with.\n\n\u201cThese governance capabilities are not present in today\u2019s concept of a data lake, and in fact by collecting data into a data lake, you create another silo which increases privacy and compliance risks,\u201d Leong says.\n\nInstead, companies can leave this data where it currently resides, simply adding a layer of indexing and metadata for searchability. Leaving the data in place will also keep it within existing compliance structures, he says.\n\nEffective governance is key\n\nAnother approach to handling dark data of questionable value and origin is to start with traceability.\n\n\u201cIt\u2019s a positive development in the industry that dark data is now recognized as an untapped resource that can be leveraged,\u201d says Andy Petrella, author of Fundamentals of Data Observability, currently available in pre-release form from O\u2019Reilly. Petrella is also the founder of data observability provider Kensu.\n\n\u201cThe challenge with utilizing dark data is the low levels of confidence in it,\u201d he says, in particular around where and how the data is collected. \u201cObservability can make data lineage transparent, hence traceable. Traceability enables data quality checks that lead to confidence in employing these data to either train AI models or act on the intelligence that it brings.\u201d\n\nChuck Soha, managing director at StoneTurn, a global advisory firm specializing in regulatory, risk, and compliance issues, agrees that the common approach to tackling dark data, throwing everything into a data lake, poses significant risks.\n\nThis is particularly true in the financial services industry, he says, where companies have been sending data into data lakes for years. 
\u201cIn a typical enterprise, the IT department dumps all available data at their disposal into one place with some basic metadata and creates processes to share with business teams,\u201d he says.\n\nThat works for business teams that have the requisite analytics talent in-house or that bring in external consultants for specific use cases. But for the most part these initiatives are only partially successful, Soha says.\n\n\u201cCIOs transformed from not knowing what they don\u2019t know to knowing what they don\u2019t know,\u201d he says.\n\nInstead, companies should begin with data governance to understand what data there is and what issues it might have, data quality chief among them.\n\n\u201cStakeholders can decide whether to clean it up and standardize it, or just start over with better information management practices,\u201d Soha says, adding that investing in extracting insights from data that contains inconsistent or conflicting information would be a mistake.\n\nSoha also advises connecting the dots between good operational data already available inside individual business units. Figuring out these relationships can create rapid and useful insights that might not require looking at any dark data right away, he says. \u201cAnd it might also identify gaps that could prioritize where in the dark data to start to look to fill those gaps in.\u201d\n\nFinally, he says, AI can be very useful in helping make sense of the unstructured data that remains. \u201cBy using machine learning and AI techniques, humans can look at as little as 1% of dark data and classify its relevancy,\u201d he says. 
\u201cThen a reinforcement learning model can quickly produce relevancy scores for the remaining data to prioritize which data to look at more closely.\u201d\n\nUsing AI to extract value\n\nCommon AI-powered solutions for processing dark data include Amazon\u2019s Textract, Microsoft\u2019s Azure Cognitive Services, and IBM\u2019s Datacap, as well as Google\u2019s Cloud Vision, Document, AutoML, and NLP APIs.\n\nIn its partnership with Envision Racing, Genpact coded the machine learning algorithms in-house, Tripathy says. This required knowledge of Docker, Kubernetes, Java, and Python, as well as NLP, deep learning, and machine learning algorithm development, he says, adding that an MLOps architect managed the complete process.\n\nUnfortunately, these skills are hard to come by. In a report released last fall by Splunk, only 10% to 15% of more than 1,300 IT and business decision makers surveyed said their organizations are using AI to solve the dark data problem. Lack of necessary skills was a chief obstacle to making use of dark data, second only to the volume of the data itself.\n\nA problem (and opportunity) on the rise\n\nIn the meantime, dark data remains a mounting trove of both risk and opportunity. Estimates of the portion of enterprise data that is dark vary from 40% to 90%, depending on industry.\n\nAccording to a July report from Enterprise Strategy Group, sponsored by Quest, 47% of all data is dark data, on average, with a fifth of respondents saying more than 70% of their data is dark data. Splunk\u2019s survey showed similar findings, with 55% of all enterprise data, on average, being dark data, and a third of respondents saying that 75% or more of their organization\u2019s data is dark.\n\nAnd the situation is likely to get worse before it gets better, as 60% of respondents say that more than half of the data in their organization is not captured at all, and much of it is not even understood to exist. 
As that data is found and stored, the amount of dark data will continue to grow.\n\nIt\u2019s high time CIOs put together a plan on how to deal with it, with an eye toward making the most of any dark data that shows promise in creating new value for the business.