Industries increasingly rely on data and AI to enhance processes and decision-making. However, they face a significant challenge in ensuring privacy because most enterprise datasets contain sensitive Personally Identifiable Information (PII). Safeguarding PII is not a new problem: conventional IT and data teams query data containing PII, but only a select few require access. Rate limiting, role-based access control, and masking have been widely adopted in traditional BI applications to govern access to sensitive data.

Protecting sensitive data in the modern AI/ML pipeline has different requirements. The emerging and ever-growing class of data users consists of ML data scientists and applications requiring larger datasets. Data owners must walk a tightrope: the parties in their AI/ML lifecycle need appropriate access to data, while the privacy of any PII in that data must be maximized.

Enter the new class

ML data scientists require large quantities of data to train machine learning models. The trained models then become consumers of vast amounts of data to gain insights that inform business decisions. Whether before or after model training, this new class of data consumers relies on the availability of large amounts of data to provide business value.

In contrast to conventional users who only need to access limited amounts of data, the new class of ML data scientists and applications requires access to entire datasets to ensure that their models represent the data with precision. Conventional controls such as encryption and masking, even where they are applied, may not be enough to prevent an attacker from inferring sensitive information by analyzing patterns in the encrypted or masked data.

The new class often uses advanced techniques such as deep learning, natural language processing, and computer vision to analyze and extract insights from the data. These efforts are often slowed down or blocked because sensitive PII is entangled within a large proportion of the datasets they require.
Up to 44% of data in an organization is reported to be inaccessible. This limitation blocks the road to AI's promised land of new, game-changing value, efficiencies, and use cases.

The new requirements have led to the emergence of techniques such as differential privacy, federated learning, synthetic data, and homomorphic encryption, which aim to protect PII while still allowing ML data scientists and applications to access and analyze the data they need. However, there is still a market need for solutions deployed across the ML lifecycle (before and after model training) that protect PII while granting access to vast datasets, without drastically changing the methodology and hardware used today.

Ensuring privacy and security in the modern ML lifecycle

The new breed of ML data consumers needs to implement privacy measures at both stages of the ML lifecycle: ML training and ML deployment (or inference).

In the training phase, the primary objective is to use existing examples to train a model. The trained model must make accurate predictions, such as classifying data samples it did not see as part of the training dataset. The data samples used for training often have sensitive information, such as PII, entangled in each data record. When this is the case, modern privacy-preserving techniques and controls are needed to protect that information.

In the ML deployment phase, the trained model makes predictions on new data that it did not see during training: inference data. It is critical to ensure both that any PII used to train the model is protected and that the model's predictions do not reveal sensitive information about individuals, but it is equally critical to protect any sensitive information and PII within the inference data samples themselves. Inferencing on encrypted data is prohibitively slow for most applications, even with custom hardware.
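The point about encrypted inference rests on a property worth seeing concretely: some encryption schemes let you do arithmetic directly on ciphertexts. As a minimal sketch only (textbook RSA with tiny primes, which is multiplicatively homomorphic but completely insecure in this form, and not a scheme anyone would deploy for ML):

```python
# Toy illustration of computing on encrypted data: unpadded ("textbook") RSA
# is multiplicatively homomorphic -- Enc(a) * Enc(b) mod n == Enc(a * b).
# NOT secure (no padding, tiny primes); it only demonstrates the homomorphic
# property that encrypted-inference schemes rely on, and why the extra
# modular arithmetic per operation makes them slow at scale.

p, q = 61, 53            # toy primes
n = p * q                # public modulus: 3233
phi = (p - 1) * (q - 1)  # 3120
e = 17                   # public exponent, coprime with phi
d = pow(e, -1, phi)      # private exponent (modular inverse, Python 3.8+)

def encrypt(m: int) -> int:
    return pow(m, e, n)

def decrypt(c: int) -> int:
    return pow(c, d, n)

a, b = 7, 12
# A server can multiply the two ciphertexts without ever seeing 7 or 12:
product_of_ciphertexts = (encrypt(a) * encrypt(b)) % n
# Only the key holder can decrypt, and recovers the product of plaintexts:
assert decrypt(product_of_ciphertexts) == (a * b) % n  # 84
```

Even this toy version needs a modular exponentiation per value; production schemes (e.g., lattice-based fully homomorphic encryption) are far heavier, which is the overhead the article refers to.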
As such, there is a critical need for viable low-overhead privacy solutions that ensure data confidentiality throughout the ML lifecycle.

The modern privacy toolkit for ML and AI: Benefits and drawbacks

Various modern solutions have been developed to address PII challenges, such as federated learning, confidential computing, and synthetic data, which the new class of data consumers is exploring for privacy in ML and AI. However, the solutions differ in their efficacy and in the implementation complexity required to satisfy user requirements.

Federated learning

Federated learning is a machine learning technique that enables training on a decentralized dataset distributed across multiple devices. Instead of sending data to a central server for processing, training occurs locally on each device, and only model updates are transmitted to a central server.

Differential privacy

Differential privacy bounds how much any single data record from a training dataset can contribute to a machine learning model. It guarantees that if a single record is added to or removed from the dataset, the model's output should not change beyond a controlled threshold, which limits membership-inference attacks against the training data.

Homomorphic encryption

Homomorphic encryption is a type of encryption that allows computation to be performed on data while it remains encrypted. For modern users, this means machine learning algorithms can operate on encrypted data without decrypting it first. This can provide greater privacy and security for sensitive data, since the data never needs to be revealed in plaintext form.

Synthetic data

Synthetic data is computer-generated data that mimics real-world data. It is often used to train machine learning models and to protect sensitive data in fields such as healthcare and finance. Synthetic data can be generated quickly in large amounts and sidesteps many privacy risks.
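To make the synthetic-data idea concrete, here is a deliberately minimal sketch: fit an independent Gaussian to each numeric column of a (hypothetical) real dataset, then sample new records from those distributions. Real generators (GANs, copulas, diffusion models) also capture cross-column correlations; this toy version preserves only per-column means and spreads.

```python
# Minimal synthetic-data sketch: per-column Gaussian fit-and-sample.
# The "real" records below are invented for illustration.
import random
from statistics import mean, stdev

real_records = [  # hypothetical PII-adjacent records: (age, annual_income)
    (34, 58_000), (29, 61_500), (45, 83_000), (52, 91_250), (38, 72_400),
]

def fit_columns(records):
    """Estimate (mean, stdev) for each column of the dataset."""
    columns = list(zip(*records))
    return [(mean(col), stdev(col)) for col in columns]

def sample_synthetic(params, n, rng):
    """Draw n synthetic records from the fitted per-column Gaussians."""
    return [tuple(rng.gauss(mu, sigma) for mu, sigma in params)
            for _ in range(n)]

params = fit_columns(real_records)
synthetic = sample_synthetic(params, n=1000, rng=random.Random(0))
# Synthetic rows are statistically similar to the originals but
# correspond to no real person.
```

The appeal for the new class of data consumers is volume: once the fit is done, arbitrarily many records can be generated without touching the originals again.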
Confidential computing

Confidential computing is a security approach that protects data while it is in use. Major companies, including Google, Intel, Meta, and Microsoft, have joined the Confidential Computing Consortium to promote hardware-based Trusted Execution Environments (TEEs). The approach isolates computations inside these hardware-based TEEs to safeguard the data.

While these solutions are helpful, their limitations become apparent when training and deploying AI models. The next stage in PII privacy needs to be lightweight, complement existing privacy measures and processes, and still provide access to datasets entangled with sensitive information.

Balancing the tightrope of PII confidentiality with AI: A new class of PII protection

We've examined some modern approaches to safeguarding PII and the challenges the new class of data consumers faces. There is a balancing act: PII can't be exposed to AI, yet data consumers must use as much data as possible to generate new AI use cases and value. Moreover, most modern solutions address data protection during the ML training stage without a viable answer for safeguarding real-world data during AI deployments.

Here, we need a future-proof solution to manage this balancing act. One such solution I have used is the Stained Glass Transform, which enables organizations to extract ML insights from their data while protecting against the leakage of sensitive information. The technology, developed by Protopia AI, can transform any data type by identifying what AI models require, eliminating unnecessary information, and transforming the data as much as possible while retaining near-perfect accuracy. To safeguard users' data while working with AI models, enterprises can adopt the Stained Glass Transform to feed more data into ML training and deployment, achieving better predictions and outcomes while worrying less about data exposure.
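Protopia AI's actual transform is proprietary, and nothing below should be read as its algorithm. Purely as a conceptual sketch of the general family of techniques the paragraph describes (perturbing input features with feature-wise noise so that raw values are hard to recover while the signal a downstream model needs survives), with every record, scale, and model invented for illustration:

```python
# Conceptual sketch ONLY -- not Protopia AI's algorithm. Features the model
# barely uses get heavy noise (hiding raw values); informative features are
# perturbed lightly, so the downstream decision is preserved.
import random

rng = random.Random(42)

# Hypothetical record: [salary, zip-code-as-number, model-relevant score]
record = [72_400.0, 94110.0, 0.83]

# Hypothetical per-feature noise scales: large where the model does not
# need fidelity, small where it does (in practice these would be learned).
noise_scale = [50_000.0, 80_000.0, 0.01]

def transform(features, scales, rng):
    """Add zero-mean Gaussian noise to each feature at its own scale."""
    return [v + rng.gauss(0.0, s) for v, s in zip(features, scales)]

protected = transform(record, noise_scale, rng)

def classify(features):  # toy downstream model: thresholds the score feature
    return "approve" if features[2] > 0.5 else "review"

# The raw salary and zip code are drowned in noise, yet the decision
# on the protected record matches the decision on the original.
assert classify(protected) == classify(record)
```

The design point the sketch illustrates is the one the article makes: the transformation is cheap (pure arithmetic, no special hardware), so it can run at both training and inference time.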
More importantly, this technology adds a new layer of protection throughout the ML lifecycle, for both training and inference. It closes a significant gap in which privacy was left unresolved during the ML inference stage by most modern solutions.

The latest Gartner AI TRiSM guide for implementing Trust, Risk, and Security Management in AI highlights the same problem and solution. TRiSM guides analytics leaders and data scientists to ensure AI reliability, trustworthiness, and security.

While there are multiple solutions for protecting sensitive data, the end goal is to enable enterprises to leverage their data to the fullest to power AI.

Choosing the right solution(s)

Choosing the right privacy-preserving solutions is essential for solving your ML and AI challenges. You must carefully evaluate each solution and select the ones that complement, augment, or stand alone to fulfill your unique requirements. For instance, synthetic data can enhance real-world data, improving the performance of your AI models. You can use synthetic data to simulate rare events that may be difficult to capture, such as natural disasters, and to augment real-world data when it is limited.

Another promising combination pairs confidential computing with transforming the data before it enters the trusted execution environment. The transformation acts as an additional barrier, minimizing the attack surface on a different axis: it ensures that plaintext data is not compromised even if the TEE is breached. So, choose the privacy-preserving solutions that fit your needs and maximize your AI's performance without compromising data privacy.

Wrap up

Protecting sensitive data isn't just a tech issue; it's an enterprise-wide challenge. As new data consumers expand their AI and ML capabilities, securing Personally Identifiable Information (PII) becomes even more critical.
To create high-performance models that deliver real value, we must maximize data access while safeguarding the data itself. Every privacy-preserving solution must be carefully evaluated against our most pressing AI and ML challenges. Ultimately, we must remember that PII confidentiality is not just about compliance and legal obligations but about respecting and protecting the privacy and well-being of individuals.