Introduction to Privacy-Preserving AI
We live in extraordinary times: AI systems can unlock an iPhone with a face, help doctors detect disease earlier than ever before, and translate text between languages automatically. It is essential to keep in mind that these AI systems are usually built on machine learning, and machine learning systems rely on, and are shaped by, data that is increasingly private and sensitive. So there is a need to find a way to unlock all of this power of artificial intelligence while still respecting and protecting data privacy.
Privacy-preserving AI is a broad term. To explore it, we first need to understand the current approaches to privacy in the context of AI and machine learning.
What are the Current Approaches to Privacy and Machine Learning?
There are two significant pillars: user control and data protection.
Consider user control first. Suppose an industry such as retail wants to gather insights from user data. The user may want to know who is collecting their data, for how long, and for what purpose.
For example, a user surfs the web and clicks on an ad for red shoes. Every website the user visits afterwards then shows more ads for red shoes, behavior the user never asked for. Data protection is the other pillar of privacy-preserving AI and machine learning.
Data protection has two components: anonymized data and encrypted data. Both currently have gaps with respect to machine learning that need to be addressed.
- Anonymized Data: Anonymization is not enough; crossing out names and addresses does not prevent data from being tied back to its original owners. In fact, re-identifying "anonymized" data is easier now than ever before.
- Encrypted Data: Similarly, encrypting data at rest or in transit is easy, but machine learning complicates things.
Because of the nature of machine learning, we need to operate on the data. That typically means decrypting it at some point to work on it, which creates a new vulnerability. So additional protection is needed.
Why Do Current Approaches to AI Require Complex Webs of Trust?
Another gap at the intersection of privacy and machine learning is trust. The data we are dealing with, and the models themselves, are digital assets. Whenever a user shares a digital asset with someone, they are effectively handing over information and trusting the recipient not to do something wrong with it.
An additional fact is that machine learning is fundamentally a multi-stakeholder computation. In a machine learning setting, multiple entities need to work together: one entity owns the training data; another set of entities may own the inference data; a third entity may provide the machine learning server on which inference is performed; the model itself may be owned by someone else; and all of it runs on infrastructure that comes from a very long supply chain. Because digital data is so easy to copy and pass along, all of these entities have to trust each other in a very complex chain, and this web of trust is becoming increasingly unmanageable.
What if Untrusted Parties Could do Machine Learning Together?
Imagine a world where parties who don't necessarily trust each other could still come together, do machine learning, and unlock all these benefits of AI. If they could, banks that are otherwise rivals in the marketplace might decide to work together on their private customer data: without sharing that data, they could jointly build a model that detects money laundering.
In healthcare, a local radiologist might produce an initial diagnosis at one hospital and still want a second or third opinion from a highly trained AI system. They need a way to do that without revealing the patient's identity, while still respecting that patient's data.
In retail, untrusted parties may want to come together to do machine learning so that they can monetize their retail data while guaranteeing the privacy of the users who contributed it.
What are the Building Blocks of Privacy-Preserving Machine Learning?
Privacy-preserving machine learning techniques are emerging techniques that help preserve the user's privacy. The building blocks of privacy-preserving machine learning are federated learning, homomorphic encryption, and differential privacy, and they borrow tricks from cryptography and statistics.
For example, a group of separate entities can collectively train a machine learning model by pooling their data without explicitly sharing it, which can be accomplished using federated learning or multiparty computation, as sketched below.
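To make that concrete, here is a minimal federated-averaging sketch in Python, using NumPy and entirely hypothetical data: each party runs a few local gradient steps on its own records, and only the resulting model weights are averaged centrally, so no raw data ever leaves a party.

```python
# Minimal federated-averaging sketch (hypothetical parties, data, and model):
# each party trains locally and shares only model weights, never raw data.
import numpy as np

rng = np.random.default_rng(0)

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One party's local training: a few gradient steps on its own data."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)   # least-squares gradient
        w -= lr * grad
    return w

# Three parties, each with private data they never share.
parties = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(3)]

global_w = np.zeros(3)
for _ in range(10):
    # Each party improves the current global model on its own data...
    local_ws = [local_update(global_w, X, y) for X, y in parties]
    # ...and only the model parameters are averaged centrally.
    global_w = np.mean(local_ws, axis=0)

print("federated model weights:", global_w)
```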
Another technique is to do machine learning on data that is encrypted and stays encrypted throughout; that technique is known as homomorphic encryption.
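As a small illustration, the snippet below uses the open-source python-paillier library (`phe`), which implements an additively homomorphic scheme: an untrusted party can add and scale the encrypted values without ever decrypting them, and only the key holder can read the result. The values here are made up for illustration.

```python
# Additively homomorphic encryption with python-paillier: computation happens
# on ciphertexts, and only the private-key holder can decrypt the result.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# Encrypt two private values (e.g. transaction amounts).
enc_a = public_key.encrypt(120.50)
enc_b = public_key.encrypt(79.25)

# A third party can add and scale the ciphertexts without decrypting them.
enc_total = enc_a + enc_b          # encrypted sum
enc_scaled = enc_total * 2         # encrypted multiplication by a plaintext

print(private_key.decrypt(enc_total))   # 199.75
print(private_key.decrypt(enc_scaled))  # 399.5
```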
One more technique is to compute statistics on a dataset in such a way that the output of the calculation cannot be tied to the presence or absence of any individual in that dataset. That technique is known as differential privacy.
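Here is a minimal sketch of the classic Laplace mechanism, a common way to realise differential privacy for a simple count query: because adding or removing one person changes a count by at most one, Laplace noise with scale 1/ε is enough to mask any individual's presence. The dataset and ε value below are illustrative.

```python
# Laplace mechanism sketch: a count query whose noisy answer barely changes
# whether or not any single person is in the dataset.
import numpy as np

rng = np.random.default_rng(0)

def dp_count(condition_mask, epsilon=0.5):
    """Count records matching a condition with epsilon-differential privacy.
    A count changes by at most 1 when one record is added or removed
    (sensitivity = 1), so Laplace noise with scale 1/epsilon suffices."""
    true_count = int(np.sum(condition_mask))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

ages = rng.integers(18, 90, size=1000)
print(dp_count(ages > 65))   # noisy count; individual presence is masked
```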
These techniques are further amplified by hardware- and software-based protections known as trusted execution environments.
A Case Study on Monetising Private Data and Insights
Suppose there is a bank that needs a model to detect fraud whenever somebody comes in with a transaction.
So the bank goes to an AI company to build the fraud-detection model. The company has data scientists, but it does not have the required data. The bank provides a list of retailers whose pools of data could be used for this problem.
Firstly, the AI company and the retailers use federated learning to jointly build a model out of the retailers' data without the company ever seeing that data. The AI company releases an initial version of the model to all the federation members; each member uses its private data to improve the model and sends its progress back to the company, which aggregates these improvements, produces a new version, and sends it out again for further refinement. However, these model improvements might leak some information about the underlying data, so the aggregation process needs to be done securely, as sketched below.
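One simple way to see why secure aggregation is possible is the pairwise-masking trick below, a toy version that ignores practical issues such as client dropouts: each pair of clients agrees on a random mask, the masks cancel out in the sum, and the server only ever learns the aggregate update.

```python
# Toy secure-aggregation sketch (assumes pre-agreed pairwise masks, no
# dropouts): the server sees only masked updates, yet their sum is exact.
import numpy as np

rng = np.random.default_rng(0)
n_clients, dim = 3, 4

# Each client's private model update.
updates = [rng.normal(size=dim) for _ in range(n_clients)]

# Pairwise masks: client i adds mask[(i, j)] and client j subtracts it,
# so every mask cancels out in the aggregate.
masks = {(i, j): rng.normal(size=dim)
         for i in range(n_clients) for j in range(i + 1, n_clients)}

masked = []
for i in range(n_clients):
    m = updates[i].copy()
    for j in range(n_clients):
        if i < j:
            m += masks[(i, j)]
        elif j < i:
            m -= masks[(j, i)]
    masked.append(m)          # this is all the server receives from client i

server_sum = np.sum(masked, axis=0)
assert np.allclose(server_sum, np.sum(updates, axis=0))
print("aggregated update:", server_sum / n_clients)
```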
One way to do that is to perform the aggregation inside a trusted execution environment at the AI company. Models can also memorize certain aspects of the data they were trained on, which is a problem because somebody who obtains the model could later extract that information. To prevent this, differential privacy techniques are used during the training process: adding a bit of noise keeps the model from memorizing, and overfitting to, individual records.
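The sketch below shows the core idea in the spirit of DP-SGD: per-example gradients are clipped and Gaussian noise is added before each update, so the trained model reveals little about any single record. The clip norm and noise scale are illustrative, not calibrated privacy parameters.

```python
# Noisy training sketch (in the spirit of DP-SGD) for a tiny logistic model:
# clip each example's gradient, then add Gaussian noise before the update.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)

w = np.zeros(3)
clip_norm, noise_std, lr = 1.0, 0.5, 0.1

for step in range(100):
    preds = 1 / (1 + np.exp(-(X @ w)))            # logistic regression
    per_example_grads = (preds - y)[:, None] * X  # one gradient per record
    # Clip each example's gradient so no single record dominates...
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / clip_norm)
    # ...then add noise to the aggregate before updating the model.
    noisy_grad = clipped.mean(axis=0) + rng.normal(
        0, noise_std * clip_norm / len(X), size=3)
    w -= lr * noisy_grad

print("privately trained weights:", w)
```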
Now the bank wants to use the hosted model to check whether an individual transaction is fraudulent. But that transaction may contain highly sensitive data, such as somebody's credit card number or purchase history. To handle this, homomorphic encryption can be used: the bank encrypts the transaction, sends it off to be processed entirely in encrypted form, and the answer that comes back is an encrypted response that only the bank can unlock.
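Continuing the earlier python-paillier example, here is a hedged sketch of how an encrypted transaction could be scored against a simple linear fraud model: the AI company evaluates w·x + b directly on the ciphertexts (Paillier supports adding ciphertexts and multiplying them by plaintext constants), and only the bank can decrypt the score. The features and weights are hypothetical.

```python
# Encrypted inference sketch with python-paillier: the bank encrypts the
# transaction, the AI company scores it blindly, the bank decrypts the result.
from phe import paillier

# Bank side: key generation and encryption of the sensitive transaction.
public_key, private_key = paillier.generate_paillier_keypair()
transaction = [250.0, 3.0, 1.0]               # e.g. amount, hour bucket, flag
enc_features = [public_key.encrypt(x) for x in transaction]

# AI company side: evaluate w.x + b without ever seeing the plaintext.
weights, bias = [0.004, -0.02, 0.5], -0.3     # hypothetical model parameters
enc_score = public_key.encrypt(bias)
for w, enc_x in zip(weights, enc_features):
    enc_score += enc_x * w                    # ciphertext * plaintext scalar

# Bank side: only the key holder can read the fraud score.
print("fraud score:", private_key.decrypt(enc_score))
```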
Another challenge is that the bank might not want the federation members to see the trained model. Here, the AI company can use multiparty computation techniques to keep the model separate from any single party. The company can also use homomorphic encryption: the data scientists send encrypted data to a machine that may or may not be trustworthy, and the answer that comes back is an encrypted version of the solution.
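A toy example of the idea behind many multiparty computation protocols is additive secret sharing, sketched below: each private value is split into random shares so that no single party learns anything, yet the parties can still compute sums on their shares.

```python
# Additive secret sharing sketch: split secrets into random shares,
# compute on the shares, and reconstruct only the final result.
import random

PRIME = 2**61 - 1   # arithmetic is done modulo a large prime

def share(secret, n_parties=3):
    """Split `secret` into n random shares that sum to it mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

# Two private inputs, each split across three compute parties.
shares_a, shares_b = share(42), share(100)

# Each party adds its own shares locally; no party ever sees 42 or 100.
sum_shares = [(a + b) % PRIME for a, b in zip(shares_a, shares_b)]
print(reconstruct(sum_shares))   # 142
```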
- Explore more on Machine Learning Observability and Monitoring
- Learn more about Data Preparation Roadmap