After exposing personal information of more than 650,000 customers, pub chain Wetherspoon decided to delete almost all the customer information it had been storing to reduce risk. After all, the data you don’t have doesn’t need to be checked for compliance, disclosed in a GDPR subject access request or apologized for after a data breach.
In fact, data can be so toxic that Joshua de Larios-Heiman, chair of the California Lawyers Association Internet & Privacy Law Committee, suggests thinking of it as uranium rather than oil. “What happens to spent uranium rods? They become toxic assets and getting rid of them is really difficult. People will sue you if you dispose of them negligently,” he says.
If you start thinking of risk in those terms, what data is your organization storing that you’d be better off without?
Don’t collect data you don’t need
There’s plenty of human-produced data that you don’t get any value from, and keeping it might be adding to your risk. “I’d be shocked if people didn’t find stuff they didn’t want to have and should be purging for GDPR reasons,” says Julia White, corporate vice president for Azure and enterprise security at Microsoft.
Don’t be fooled by the falling cost of storage into thinking that keeping data around is cheap, says Jon Callas, senior technology fellow at the ACLU.
“The costs of keeping data are higher than you think, and the benefits are lower. There is a chance it will be useful and contribute to analysis. There is a chance it will be harmful — like being lost in a breach or subpoenaed in a lawsuit,” he says. “The chance it will be useful goes down over time, but the harm value stays the same. If you lose the address somebody lived at five years ago, the EU doesn’t care that it was inaccurate data that you didn’t want and wasn’t helping your business; losing it is still losing it. At some point, those lines cross. You should toss data before they cross.”
The costs of a subpoena or a subject access request are higher than the costs of storage media, Callas points out. “The chances that something will happen and you have some data that causes you to get dragged into something else is higher than the value of that data. The procedures you have to put in place when you say, ‘I am only going to keep the data I know I have a reason for,’ put you in a hugely different situation.”
About a third of data stored in your datacenter is likely redundant, obsolete or trivial, Jasmit Sagoo, senior director at Veritas, tells CIO.com.
“This is data that holds little or no business value and should be proactively deleted, especially when considering the data’s exposure and level of risk,” he says. “For example, ex-employee and ex-customer data is very high risk. It can contain personally identifiable information so it’s only worth keeping this data for legal reasons. Financial records are particularly vulnerable to hackers and another example of sensitive data that needs to be managed carefully.”
How do you find the data that you don’t need and should be deleting? “As a starting point, businesses need to be able to identify specific details within data, pinpoint the areas of risk and its potential value,” Sagoo says. “It’s also important to understand what is stored, who is accessing it and how often. Only then is it possible to understand what data exists and start classifying it based on a bespoke data retention policy. Deletion of these files should then happen at least once a quarter.”
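Sagoo's first step, identifying risky details within stored data, can be automated in part. As a minimal, hypothetical sketch (the patterns and risk labels here are illustrative, not part of any product Sagoo describes), a scanner might flag records containing likely PII so they can be routed into a retention policy:

```python
import re

# Illustrative patterns only; a real classifier would cover far more identifiers.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def classify(text: str) -> str:
    """Label a record 'high-risk' if it appears to contain PII, else 'low-risk'."""
    if SSN_RE.search(text) or EMAIL_RE.search(text):
        return "high-risk"
    return "low-risk"
```

Records labeled high-risk would then feed the quarterly deletion pass Sagoo recommends, rather than sitting unreviewed.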
There’s some data you should never store for analysis, says Blair Hanley Frank, a principal analyst at ISG. “Any organization that still stores user passwords in plain text in 2019 is asking for trouble.”
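The alternative to storing plain-text passwords is storing only a salted, slow-to-compute hash. A minimal sketch using Python's standard-library `hashlib.scrypt` (parameter choices here are illustrative defaults, not a definitive recommendation):

```python
import hashlib
import hmac
import os

def hash_password(password: str) -> tuple[bytes, bytes]:
    """Derive a salted scrypt hash; store only the (salt, digest) pair,
    never the password itself."""
    salt = os.urandom(16)
    digest = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return salt, digest

def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    """Recompute the hash and compare in constant time."""
    candidate = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return hmac.compare_digest(candidate, digest)
```

With this scheme a breach exposes only hashes that are expensive to reverse, not the credentials themselves.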
Delete data associated with production systems that are no longer in use. For example, the user data that Wetherspoon leaked was from an old website, so it shouldn’t have still been there. Adobe’s password data breach likewise involved an older, nonproduction system. “Enterprises can’t just ignore systems that are out of date or rarely used just because they’re part of the legacy IT infrastructure,” Frank points out.
Pay particular attention to tracking down copies of customer databases that have been extracted (usually as XLS or CSV files) and handed over to developers to use as sample data.
You should use masked data for this. By masking data, you can retain a relevant statistical distribution of data for use in testing without risk of exposure.
“Nonproduction development and testing environments, vital as they are, pose an enormous increase in the surface area of risk and are often the soft underbelly for GDPR compliance,” notes Benjamin Ross, director at Delphix.
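As a minimal sketch of the masking idea (the field names and pseudonym scheme here are hypothetical, not a specific product's API), direct identifiers are replaced with stable fake values while non-identifying fields keep their original values, so test data retains a realistic shape:

```python
import hashlib

def mask_record(record: dict, secret: str = "rotate-this-secret") -> dict:
    """Replace direct identifiers with deterministic pseudonyms; keep
    non-identifying fields so distributions survive for testing."""
    token = hashlib.sha256((secret + record["email"]).encode()).hexdigest()[:12]
    return {
        "email": f"user_{token}@example.test",  # pseudonym, not the real address
        "name": f"Customer {token[:6]}",
        "age": record["age"],    # retained: needed for realistic statistics
        "plan": record["plan"],  # retained: non-identifying
    }
```

Because the pseudonym is derived from a secret, the same customer masks to the same token across extracts, which keeps joins working in test environments without exposing the real identity.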
Don’t deidentify; delete
Data should only be kept for current business reasons, not the vague hope that a machine learning system could discover something useful in it. Callas notes that even AI startup investors Andreessen Horowitz have called into question the value of collecting large amounts of data. “There’s a mystical belief that there’s a sustainable competitive advantage to having this ‘data moat’ and as investors they have learned that historically, that’s not true,” Callas says. “This thing you might think is going to make you a better business is not likely to.”
That particularly applies to personally identifiable information (PII) in data sets you’re considering using for training machine learning models, says Mary L. Gray, senior researcher at Microsoft Research. “Now that we have GDPR, there are very tight limits on what PII companies can collect, who’s allowed to have access to it, what auditing has to be in place to say where, when and how that PII has been repurposed and sold off to some entity outside the firm that collected it, and how long companies can hold on to it,” she says.
And ‘de-identifying’ data doesn’t make it safe to keep, because with enough data you’ll find you can still identify individuals — even if you didn’t want to. “It’s nonsense to consider any data collected ‘de-identified’ in perpetuity,” she warns.
“The data-centric tech industry hasn’t figured out how to let go of data, let alone identified what they could just stop collecting altogether. The industry landed on agreeing to hash PII: the equivalent of running a black marker across [it],” she says. “But they can collect everything else around what we do. If you’re predictable in what you do and where you do it, you’re still creating a digital footprint that’s not all that different than what you look like with the PII in the picture.”
While it’s trivial to remove obvious identifiers, such as names and dates of birth, data that has been ‘de-identified’ can still have PII in it, such as when users add their full names to fields not marked for names, and so on, she adds.
“That’s why data breaches are hard to plug,” Gray explains. “You could get one data set of email addresses, another of geolocation metadata, and a third set of search queries and run enough combinations of these data to land on a search string that generates a name, a birthdate, and a location to reidentify people associated with a specific email address.”
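The joining attack Gray describes can be sketched in a few lines. This is a toy illustration with invented data, assuming the leaked sets share quasi-identifiers (here, zip code and birth year); when a combination is unique in both sets, the “anonymous” record is reidentified:

```python
# Two "de-identified" datasets that happen to share quasi-identifiers.
searches = [
    {"zip": "94110", "birth_year": 1985, "query": "divorce lawyer"},
    {"zip": "10001", "birth_year": 1990, "query": "weather"},
]
emails = [
    {"zip": "94110", "birth_year": 1985, "email": "alice@example.com"},
]

def reidentify(searches, emails):
    """Join on quasi-identifiers: a unique (zip, birth_year) pair ties an
    'anonymous' search query back to a specific email address."""
    index = {(e["zip"], e["birth_year"]): e["email"] for e in emails}
    return [
        {**s, "email": index[(s["zip"], s["birth_year"])]}
        for s in searches
        if (s["zip"], s["birth_year"]) in index
    ]
```

No single dataset here contains a name or email next to a query, yet the combination does, which is Gray's point about hashing PII and keeping everything else.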
This potentially toxic data could even slow down your data strategy, warns Frank. “Having a whole bunch of essentially useless information can make it harder to analyze useful data by increasing the amount of time people spend building and testing models. To solve this problem, enterprises should be aggressive in judging the value that information brings, and test that data to see if it has predictive value,” he says.
Scott Guthrie, executive vice president of the Microsoft Cloud and AI Group, suggests reducing what data you store and anonymizing as much of it as possible. “If you’ve got telemetry on web searches, are you storing the exact house the person did the web search from? Or do you anonymize it at the street level or at some other unit, so that regardless of whether you have a data breach you do not violate privacy?”
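One simple way to implement Guthrie's street-level idea is to coarsen coordinates before they are ever written to storage. A hedged sketch (the precision chosen is an assumption; the right granularity depends on the analysis you actually need):

```python
def coarsen_location(lat: float, lon: float, decimals: int = 2) -> tuple[float, float]:
    """Round coordinates before storage: two decimal places is roughly
    street/block-level precision (about a kilometer), not an exact address."""
    return round(lat, decimals), round(lon, decimals)
```

Because the precise coordinates are discarded at ingestion, a later breach can only expose the coarse location.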
If you don’t have data, no one can use it inappropriately.
“Don’t ask, ‘Why should I throw this data away?’ Ask, ‘Why should I keep it?’” Callas says. “Unless you know why you want to keep data, you should be getting rid of it because we live in a world in which collecting more data — which is fresher — is relatively cheap.” That could be an opt-in on your website, a reward for filling out a survey or telemetry from a beta software program. (You should immediately delete any data you can’t prove you have consent for.)
Throwing away PII gets you statistics “and that’s what you wanted anyway,” he notes.
“If a transit authority runs a survey because they want to know what people are doing, you really want accurate data, and it makes sense to pay for it, but you want to run it through some data grinder and throw the original data away, and then get rid of the ground-up data in a year,” Callas says. “If you’re trying to figure out which roads to fix, you don’t need data about the road you just fixed even — or especially — if the data shows you should have fixed something else. Every piece of data about the road you just fixed is toxic: there is no upside, only downside.”
Have a clear policy for how long you will keep data, like not keeping log files for more than a week (with exceptions for debugging). Callas suggests establishing some ‘forcing functions’ to make sure those decisions are made. “If I say, ‘Everything you put into my data warehouse I will delete after ten years unless you tell me why you want to keep it,’ then I’ve made you think about why you put things into a data warehouse.”
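A retention policy like the one-week log rule above only works if something actually enforces it. As a minimal sketch (directory layout and cutoff are assumptions; debugging exceptions would be handled before a file reaches this sweep), a scheduled job can purge anything past its age limit:

```python
import time
from pathlib import Path

MAX_AGE_DAYS = 7  # retention policy; exempted files live outside this directory

def purge_old_logs(log_dir: str, max_age_days: int = MAX_AGE_DAYS) -> list[str]:
    """Delete *.log files whose modification time exceeds the retention window.
    Returns the names of deleted files for audit logging."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for path in Path(log_dir).glob("*.log"):
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return removed
```

Run from cron or a scheduler, this is exactly the forcing function Callas describes: data survives only if someone has moved it somewhere with a stated reason to keep it.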