When compared to data warehousing, the data lake paradigm is incredibly appealing. Load all the raw data, model just what you need, when you need it, check the quality and content on the fly, and voilà! Cut the bureaucratic red tape and get to serving up the answers the business demands.
While the new approach has delivered real value, many data lakes have a critical blind spot: exposing sensitive data. Until you start working with the data, you don’t know what you don’t know. And that can be fatal.
Take personally identifiable information (PII), such as Tax IDs, email addresses, and credit card numbers. Many industries have established regulations on how to protect this information to safeguard their customers (and themselves) from fraud. Elaborate encryption and obfuscation methods have been developed to hide sensitive information yet still enable the business to use it for automated processes and analytic insights.
But you can only protect PII if you know you have PII, and it’s harder to detect in the data lake than in traditional databases, for several reasons:
- Load-first approach: Data warehouses required a lot of analysis, transformation, and modeling before data could be loaded. This provided a natural (but slow) gate to identify columns with sensitive data and apply protection on the way in. The agile data lake approach of loading everything and structuring and protecting just the data being used leaves a lot of untouched data fields that could be land mines of PII.
- Schema on read: If it’s not detected, it’s not protected. It’s well known that Hadoop does not detect common data quality issues such as embedded delimiters and control characters. These issues cause “column shifting,” where queries start seeing values from Column C in Column D. When this happens in a query, you get a wrong answer; when it happens with PII, you get a lawsuit. If Column C contains PII and the program is supposed to be encrypting it, column shifting could expose the PII in Column D, and nothing will notify you of the problem.
- Complex PII logic: PII is hard to detect because, by its nature, it consists of millions of distinct values that change frequently. Scanning entire files for common patterns often produces overwhelming numbers of false positives (nine-digit strings could be SSNs, product identifiers, or random keys) and false negatives (such as non-standard address formats or data entry errors). Robust PII logic uses many types of metadata to increase the accuracy of detection, such as source data record layouts, data profiling statistics, and expected data patterns.
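To make the column-shifting risk concrete, here is a minimal Python sketch. The column names, the pipe-delimited layout, and the sample records are all hypothetical; the point is only to show how one unescaped delimiter moves an SSN into a column that no one is encrypting, and how a simple field-count check at load time could catch it.

```python
# Hypothetical record layout for a pipe-delimited extract.
EXPECTED_COLUMNS = ["customer_id", "notes", "ssn", "city"]

# The second record has an unescaped delimiter inside the free-text
# "notes" field, so it splits into five fields instead of four.
RAW = (
    "1001|called about invoice|123-45-6789|Boston\n"
    "1002|prefers email | no calls|987-65-4321|Denver\n"
)


def naive_parse(line, delimiter="|"):
    """Split on the delimiter and map values to columns by position."""
    return dict(zip(EXPECTED_COLUMNS, line.split(delimiter)))


def find_shifted_records(text, delimiter="|"):
    """Return (line_number, field_count) for records with the wrong width."""
    problems = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        field_count = len(line.split(delimiter))
        if field_count != len(EXPECTED_COLUMNS):
            problems.append((line_number, field_count))
    return problems


rows = [naive_parse(line) for line in RAW.splitlines()]

# In the shifted record, the SSN has landed in the unprotected "city"
# column, and the "ssn" column holds leftover note text.
print(rows[1]["ssn"])   # " no calls"
print(rows[1]["city"])  # "987-65-4321"

# A width check at load time flags the record before anyone queries it.
print(find_shifted_records(RAW))  # [(2, 5)]
```

A field-count check only catches shifts that change the record width; it is a first line of defense, not a substitute for the metadata-driven detection described in the last bullet.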
The potential cost of unprotected data in the lake is enormous. In a recent study, the Ponemon Institute* measured the average cost of a data breach at $3.62 million, or $141 per lost or stolen record. For large consumer businesses with millions of customers, unprotected data represents a billion-dollar risk.
Reverting to the old process of comprehensive analysis before loading is not the answer. One executive told me that they could load 10,000 fields of new data daily, but their SME could review about 100 fields per day for PII. Instead, best practices are emerging to retain data lake agility without a PII blind spot:
- Develop a data loading process that leverages known structures in data to automatically populate a metadata catalog. Most business data sources are structured, and that can be used to identify common data quality issues, automatically profile the data, and tag fields with known sensitive data types.
- Use advanced pattern detection rules on the metadata and profiling statistics to identify PII efficiently and accurately. Profiling statistics, such as cardinality and the frequency of the most common values, reveal the patterns that distinguish PII from look-alike fields.
- Establish a data governance zone in the lake to detect and protect PII data before it is shared broadly. The results of automated detection can be reviewed by a small number of SMEs in a tightly controlled environment. The metadata catalog keeps track of all data assets and can automatically promote protected data to the consumption zone in the lake.
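As a rough illustration of the practices above, the following Python sketch profiles each column and combines three kinds of evidence (value pattern, cardinality, and column name) before flagging a field for SME review. The regular expression, the thresholds, the name hints, and the sample columns are assumptions chosen for the example, not a production rule set.

```python
import re
from collections import Counter

# Assumed pattern: nine digits, with or without SSN-style dashes.
SSN_PATTERN = re.compile(r"^\d{3}-?\d{2}-?\d{4}$")
# Assumed column-name hints taken from source record layouts.
SENSITIVE_NAME_HINTS = ("ssn", "tax_id", "social")


def profile_column(values):
    """Compute simple profiling statistics for one column."""
    non_null = [v for v in values if v not in (None, "")]
    total = len(non_null)
    if total == 0:
        return {"count": 0, "distinct_ratio": 0.0,
                "top_value_share": 0.0, "pattern_match_rate": 0.0}
    counts = Counter(non_null)
    matches = sum(1 for v in non_null if SSN_PATTERN.match(v))
    return {
        "count": total,
        "distinct_ratio": len(counts) / total,
        "top_value_share": counts.most_common(1)[0][1] / total,
        "pattern_match_rate": matches / total,
    }


def looks_like_ssn(column_name, stats, min_match=0.9, min_distinct=0.9):
    """Flag a column only when the pattern agrees with other evidence."""
    if stats["count"] == 0:
        return False
    name_hint = any(h in column_name.lower() for h in SENSITIVE_NAME_HINTS)
    pattern_ok = stats["pattern_match_rate"] >= min_match
    # Identifiers tied to people are nearly unique per row; codes reused
    # across many rows (product IDs, status keys) are not.
    high_cardinality = stats["distinct_ratio"] >= min_distinct
    return pattern_ok and (high_cardinality or name_hint)


columns = {
    "tax_id": ["123-45-6789", "987-65-4321", "555-12-3456", "111-22-3333"],
    "product_code": ["100200300", "100200300", "100200301", "100200300"],
    "city": ["Boston", "Denver", "Boston", "Austin"],
}

flagged = [name for name, values in columns.items()
           if looks_like_ssn(name, profile_column(values))]
print(flagged)  # ['tax_id']
```

Here `product_code` matches the nine-digit pattern on every row, but its low distinct ratio keeps it from being flagged, which is the kind of false positive a pattern-only scan would raise. A high-cardinality random key would still be flagged, so the output is best treated as a review queue for the governance zone rather than a final verdict.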
Data lakes have enabled a new approach to data management and delivery that is faster, better, and cheaper than data warehousing. Protecting sensitive data in this environment requires another innovation: automated data profiling combined with multi-dimensional pattern matching that recognizes the fingerprints of PII data.
*Ponemon Institute, “2017 Cost of Data Breach Study,” June 2017