Yash Mehta

How can CIOs protect Personally Identifiable Information (PII) for a new class of data consumers?

Mar 22, 2023 | 10 mins
Data Privacy, Data Science, Machine Learning

Enterprises and data owners must ensure customer data privacy while training their machine learning models. Let us learn how.

Industries increasingly rely on data and AI to enhance processes and decision-making. However, they face a significant challenge in ensuring privacy due to sensitive Personally Identifiable Information (PII) in most enterprise datasets. Safeguarding PII is not a new problem. Conventional IT and data teams query data containing PII, but only a select few require access. Rate-limiting access, role-based access protection, and masking have been widely adopted for traditional BI applications to govern sensitive data access. 
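
As a rough illustration of the masking mentioned above, the following is a minimal sketch rather than a production control. The two regular expressions and the `mask_pii` helper are assumptions made for this example; real PII detection needs far broader coverage than emails and US-style Social Security numbers.

```python
import re

# Hypothetical, deliberately simple patterns for two common PII types.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-(\d{4})\b")


def mask_pii(text: str) -> str:
    """Replace emails with a placeholder and keep only the last 4 SSN digits."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return SSN_RE.sub(r"***-**-\1", text)


masked = mask_pii("Contact jane.doe@example.com, SSN 123-45-6789.")
print(masked)  # Contact [EMAIL], SSN ***-**-6789.
```

Masking of this kind works well when a BI user needs a handful of fields, which is the traditional setting described here.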

Protecting sensitive data in the modern AI/ML pipeline has different requirements. The emerging and ever-growing class of data users consists of ML data scientists and applications requiring larger datasets. Data owners need to walk a tightrope to ensure parties in their AI/ML lifecycle get appropriate access to the data they need while maximizing the privacy of the PII in that data.

Enter the new class 

ML data scientists require large quantities of data to train machine learning models. Then the trained models become consumers of vast amounts of data to gain insights to inform business decisions. Whether before or after model training, this new class of data consumers relies on the availability of large amounts of data to provide business value.

In contrast to conventional users who only need to access limited amounts of data, the new class of ML data scientists and applications requires access to entire datasets to ensure that their models represent the data with precision. And even if conventional protections such as encryption or masking are used, they may not be enough to prevent an attacker from inferring sensitive information by analyzing patterns in the encrypted or masked data.

The new class often uses advanced techniques such as deep learning, natural language processing, and computer vision to analyze and extract insights from the data. These efforts are often slowed down or blocked as they face sensitive PII data entangled within a large proportion of datasets they require. Up to 44% of data is reported to be inaccessible in an organization. This limitation blocks the road to AI’s promised land in creating new and game-changing value, efficiencies, and use cases. 

The new requirements have led to the emergence of techniques such as differential privacy, federated learning, synthetic data, and homomorphic encryption, which aim to protect PII while still allowing ML data scientists and applications to access and analyze the data they need. However, there is still a market need for solutions deployed across the ML lifecycle (before and after model training) to protect PII while accessing vast datasets – without drastically changing the methodology and hardware used today.

Ensuring privacy and security in the modern ML lifecycle

The new breed of ML data consumers needs to implement privacy measures at both stages of the ML lifecycle: ML training and ML deployment (or inference).

In the training phase, the primary objective is to use existing examples to train a model.

The trained model must make accurate predictions, such as classifying data samples it did not see as part of the training dataset. The data samples used for training often have sensitive information (such as PII) entangled in each data record. When this is the case, modern privacy-preserving techniques and controls are needed to protect sensitive information.

In the ML deployment phase, the trained model makes predictions on new data that the model did not see during training, known as inference data. While it is critical to ensure that any PII used to train the ML model is protected and the model’s predictions do not reveal any sensitive information about individuals, it is equally critical to protect any sensitive information and PII within inference data samples as well. Inferencing on encrypted data is prohibitively slow for most applications, even with custom hardware. As such, there is a critical need for viable low-overhead privacy solutions to ensure data confidentiality throughout the ML lifecycle.

The modern privacy toolkit for ML and AI: Benefits and drawbacks

Various modern solutions have been developed to address PII challenges, such as federated learning, confidential computing, and synthetic data, which the new class of data consumers is exploring for privacy in ML and AI. However, each solution differs in efficacy and in the implementation complexity needed to satisfy user requirements.

Federated learning

Federated learning is a machine learning technique that enables training on a decentralized dataset distributed across multiple devices. Instead of sending data to a central server for processing, the training occurs locally on each device, and only model updates are transmitted to a central server.

  • Limitation: Research from the Institute of Electrical and Electronics Engineers (IEEE), conducted in 2020, shows that an attacker could infer private information from model parameters in federated learning. Additionally, federated learning does not address the inference stage, which still exposes data to the ML model during cloud or edge device deployment.
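
The local-training-then-aggregate flow described above can be sketched as follows. This is a minimal, assumed illustration in the style of federated averaging on a one-parameter linear model (y = w * x); the function names, learning rate, and toy client datasets are all hypothetical and not taken from any specific framework.

```python
from typing import List, Tuple

Dataset = List[Tuple[float, float]]  # (x, y) pairs that never leave the device


def local_update(w: float, data: Dataset, lr: float = 0.05, epochs: int = 5) -> float:
    """Run a few gradient-descent steps for y = w * x on one device's data."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(w_global: float, clients: List[Dataset]) -> float:
    """One round: each client trains locally, the server averages the weights."""
    total = sum(len(d) for d in clients)
    # Only the updated weight and the sample count are sent to the server.
    updates = [(local_update(w_global, d), len(d)) for d in clients]
    return sum(w * n for w, n in updates) / total


# Toy data: every client holds samples of the same relationship y = 2x.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(1.5, 3.0), (2.5, 5.0), (0.5, 1.0)],
]

w = 0.0
for _ in range(30):
    w = federated_round(w, clients)
print(round(w, 4))
```

Note that the server still receives model parameters from every client, which is exactly the channel the limitation above refers to.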

Differential privacy

Differential privacy bounds how much any single data record from a training dataset can contribute to a machine learning model. The guarantee is that if a single data record is removed from the dataset, the model’s output should not change beyond a certain threshold.

  • Limitation: While training with differential privacy has benefits, it still requires the data scientist’s access to large volumes of plain-text data. Additionally, it does not address the ML inference stage in any capacity. 
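
To make the "output should not change beyond a threshold" idea concrete, here is a minimal sketch of the Laplace mechanism applied to a counting query. This is a simpler setting than differentially private model training (for example, DP-SGD), and it is swapped in purely for illustration. The `private_count` helper, the toy `ages` list, and the choice of epsilon are assumptions.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return an epsilon-differentially-private count.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)
ages = [23, 35, 41, 29, 52, 60, 38]  # toy data; the true count of age >= 40 is 3
noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5, rng=rng)
print(noisy)
```

A smaller epsilon means more noise and stronger privacy. The code also makes the limitation visible: it reads the raw `ages` list in plain text before any noise is added.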

Homomorphic encryption

Homomorphic encryption is a type of encryption that allows computation to be performed on data while it remains encrypted. For modern users, this means that machine learning algorithms can operate on data that has been encrypted without the need to decrypt it first. This can provide greater privacy and security for sensitive data since the data never needs to be revealed in plain text form. 

  • Limitation: Homomorphic encryption is prohibitively costly because it operates on encrypted data rather than plain-text data, which is computationally intensive. Homomorphic encryption often requires custom hardware to optimize performance, which can be expensive to develop and maintain. Finally, data scientists use deep neural networks in many domains, and these are often difficult or impossible to implement in a homomorphically encrypted fashion.
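
The idea of computing on data while it stays encrypted can be shown with a toy version of the Paillier cryptosystem, which supports addition only (it is not fully homomorphic). The sketch below is an assumption-laden illustration: the primes are far too small to be secure, and the function names are hypothetical. It requires Python 3.8+ for modular inverse via `pow`.

```python
import math
import random


def keygen(p: int = 1789, q: int = 1861):
    """Toy Paillier key pair; these primes are far too small for real use."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; valid for generator g = n + 1
    return n, (lam, mu)


def encrypt(n: int, m: int, rng: random.Random) -> int:
    """Encrypt integer m (0 <= m < n) under public key n with fresh randomness."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n: int, priv, c: int) -> int:
    lam, mu = priv
    x = pow(c, lam, n * n)
    return (((x - 1) // n) * mu) % n


def add_encrypted(n: int, c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)


rng = random.Random(42)
n, priv = keygen()
c_sum = add_encrypted(n, encrypt(n, 20, rng), encrypt(n, 22, rng))
total = decrypt(n, priv, c_sum)
print(total)  # 42, computed without decrypting either input
```

Even this single addition costs several modular exponentiations on large integers, which hints at why running a deep neural network this way is so expensive.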

Synthetic data

Synthetic data is computer-generated data that mimics real-world data. It is often used to train machine learning models and protect sensitive data in sectors such as healthcare and finance. Synthetic data can be generated quickly in large amounts and can reduce privacy risks.

  • Limitation: While synthetic data may help train a predictive model, it only adequately covers some possible real-world data subspaces. This can result in accuracy loss and undermine the model’s capabilities in the inference stage. Also, actual data must be protected in the inference stage, which synthetic data cannot address. 
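
A very simple way to picture synthetic data generation is to fit summary statistics on real records and sample new rows from them. The sketch below assumes independent Gaussian columns, which is a strong simplification chosen for brevity; the column names, values, and helper functions are hypothetical, and real generators model far richer structure.

```python
import random
import statistics
from typing import Dict, List, Tuple


def fit_columns(real: Dict[str, List[float]]) -> Dict[str, Tuple[float, float]]:
    """Estimate a mean and standard deviation for each column independently."""
    return {
        name: (statistics.mean(vals), statistics.stdev(vals))
        for name, vals in real.items()
    }


def sample_synthetic(params, n_rows: int, rng: random.Random) -> List[Dict[str, float]]:
    """Draw rows column by column; correlations between columns are ignored."""
    return [
        {name: rng.gauss(mu, sigma) for name, (mu, sigma) in params.items()}
        for _ in range(n_rows)
    ]


# Toy "real" data (made up for this example).
real = {
    "age": [23, 35, 41, 29, 52, 60, 38],
    "income": [31000, 48000, 52000, 39000, 75000, 81000, 50000],
}

params = fit_columns(real)
synthetic = sample_synthetic(params, 1000, random.Random(7))
print(synthetic[0])
```

Because each column is sampled on its own, the relationship between age and income in the real data is lost. That is a small-scale example of the coverage gap described in the limitation above. Sampling from fitted statistics also does not by itself guarantee privacy.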

Confidential computing

Confidential computing is a security approach that protects data during use. Major companies, including Google, Intel, Meta, and Microsoft, have joined the Confidential Computing Consortium to promote hardware-based Trusted Execution Environments (TEEs). The solution isolates computations to these hardware-based TEEs to safeguard the data. 

  • Limitation: Confidential computing requires companies to incur additional costs to move their ML-based services to platforms that require specialized hardware. The solution is also not entirely risk-free. An attack in May 2021 collected and corrupted data from TEEs that rely on Intel SGX technology.

While these solutions are helpful, their limitations become apparent when training and deploying AI models. The next stage in PII privacy needs to be lightweight and complement existing privacy measures and processes while providing access to datasets entangled with sensitive information. 

Walking the tightrope of PII confidentiality with AI: A new class of PII protection

We’ve examined some modern approaches to safeguard PII and the challenges the new class of data consumers faces. There is a balancing act in which PII can’t be exposed to AI, but the data consumers must use as much data as possible to generate new AI use cases and value. Also, most modern solutions address data protection during the ML training stage without a viable answer for safeguarding real-world data during AI deployments.

Here, we need a future-proof solution to manage this balancing act. One such solution I have used is the Stained Glass Transform, which enables organizations to extract ML insights from their data while protecting against the leakage of sensitive information. The technology, developed by Protopia AI, can transform any data type by identifying what AI models require, eliminating unnecessary information, and transforming the data as much as possible while retaining near-perfect accuracy. To safeguard users’ data while working on AI models, enterprises can use the Stained Glass Transform to expand the data available for ML training and deployment, achieving better predictions and outcomes while worrying less about data exposure.

More importantly, this technology also adds a new layer of protection throughout the ML lifecycle – for training and inference. This solves a significant gap in which privacy was left unresolved during the ML inference stage for most modern solutions.

The latest Gartner AI TRiSM guide for implementing Trust, Risk, and Security Management in AI highlighted the same problem and solution. TRiSM guides analytics leaders and data scientists to ensure AI reliability, trustworthiness, and security.

While there are multiple solutions to protect sensitive data, the end goal is to enable enterprises to leverage their data to the fullest to power AI.

Choosing the right solution(s) 

Choosing the right privacy-preserving solutions is essential for solving your ML and AI challenges. You must carefully evaluate each solution and select the ones that fulfill your unique requirements, whether they complement or augment one another or stand alone. For instance, synthetic data can enhance real-world data, improving the performance of your AI models. You can use synthetic data to simulate rare events that may be difficult to capture, such as natural disasters, and augment real-world data when it’s limited.

Another promising approach is to pair confidential computing with a solution that transforms data before it enters the trusted execution environment. The transformation acts as an additional barrier, minimizing the attack surface on a different axis, and ensures that plaintext data is not compromised even if the TEE is breached. So, choose the privacy-preserving solutions that fit your needs and maximize your AI’s performance without compromising data privacy.

Wrap up

Protecting sensitive data isn’t just a tech issue – it’s an enterprise-wide challenge. As new data consumers expand their AI and ML capabilities, securing Personally Identifiable Information (PII) becomes even more critical. To create high-performance models delivering honest value, we must maximize data access while safeguarding it. Every privacy-preserving solution must be carefully evaluated to solve our most pressing AI and ML challenges. Ultimately, we must remember that PII confidentiality is not just about compliance and legal obligations but about respecting and protecting the privacy and well-being of individuals.

Yash Mehta is an internationally recognized Internet of Things (IoT), machine-to-machine (M2M) communications, and big data technology expert. He has written a number of widely acknowledged articles on data science, IoT, business innovation, tools, security technologies, business strategies, development, etc. His articles have been featured in the most authoritative publications and awarded as some of the most innovative and influential work in the connected technology industry by IBM and the Cisco IoT department. His work has been featured on leading industry platforms that specialize in big data science and M2M. His work was published in the featured category of IEEE Journal (worldwide edition - March 2016) and he was highlighted as a business intelligence expert. The opinions expressed in this blog are those of Yash Mehta and do not necessarily represent those of IDG Communications, Inc., its parent, subsidiary or affiliated companies.
