Synthetic Data: The Next Solution for Data Privacy?

ccraig — Thu, 23 Feb 2023 17:00:00 +0000

Gregory Hong is an IPilogue Writer and a 1L JD candidate at Osgoode Hall Law School.

One contentious point from the session was synthetic data’s potential to solve the privacy concerns surrounding the datasets needed to train AI algorithms. In light of its increasing popularity, I will explore the benefits and dangers of this potential solution.

Concept

The data privacy concern that synthetic data aims to address is very similar to the purpose of , namely protecting anonymized data from being re-identified without reducing data utility. Synthetic data is distinct from data augmentation, which adds new data to an existing real-world dataset to provide more training data, for example by rotating images or combining two images to create a new one. Data augmentation is typically not useful in the privacy context.

In a , the Office of the Privacy Commissioner of Canada (“OPC”) describes synthetic data as “fake data produced by an algorithm whose goal is to retain the same statistical properties as some real data, but with no one-to-one mapping between records in the synthetic data and the real data.” Synthetic data is produced by putting real-world source data through a generative statistical model; the output is then evaluated for statistical similarity to the source alongside privacy metrics. Critically, there is no need to remove quasi-identifying data, that is, data vulnerable to de-anonymization. This results in more complete datasets.
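To make the idea concrete, here is a minimal sketch of the pipeline described above: fit a generative statistical model to source data, sample fresh records from it, and compare the statistics. It assumes a simple two-column Gaussian model and invented (age, income) records; the function names `fit_model` and `sample_synthetic` are hypothetical, and real synthetic data tools use far richer models.

```python
import math
import random
import statistics

def fit_model(rows):
    """Fit a two-column Gaussian: means, standard deviations, correlation."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}

def sample_synthetic(model, n, rng):
    """Draw n fresh records from the fitted model (no source row is copied)."""
    tail = math.sqrt(max(0.0, 1.0 - model["rho"] ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        # Mixing z1 into y reproduces the correlation between the two columns.
        y = model["my"] + model["sy"] * (model["rho"] * z1 + tail * z2)
        rows.append((x, y))
    return rows

rng = random.Random(0)
# Hypothetical "real" source data: (age, annual income), invented for illustration.
source = []
for _ in range(2000):
    age = rng.gauss(45, 12)
    source.append((age, 1000 * age + rng.gauss(0, 10000)))

model = fit_model(source)
synthetic = sample_synthetic(model, 2000, rng)
refit = fit_model(synthetic)

print(f"mean age:    source {model['mx']:.1f}, synthetic {refit['mx']:.1f}")
print(f"correlation: source {model['rho']:.2f}, synthetic {refit['rho']:.2f}")
print("rows copied from source:", len(set(source) & set(synthetic)))
```

The synthetic rows should track the source's means and correlation while sharing no individual record with it, which is the "same statistical properties, no one-to-one mapping" property in the OPC's description. Matching summary statistics alone does not establish that the output is private.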

Benefits

Synthetic data uses a highly automated process to provide protection from re-identification. This results in datasets that can be readily shared between AI developers with far fewer privacy concerns. also points out that there are substantial cost savings. The points to how a synthetic data service company founder estimated that “a single image that could cost $6 from a labeling service can be artificially generated for six cents.” Synthetic data can also be manufactured to reduce bias by deliberately including a wide variety of rare but crucial edge cases. Nvidia uses machine vision for autonomous vehicles as its example, but I think this concept should translate to improving representation of marginalized and under-represented groups in large datasets in healthcare or facial recognition. Many of the Bracing for Impact panelists shared this concern.

Dangers

The OPC notes many issues and concerns in their blog, particularly regarding re-identification. The risk is greatest if the synthetic data is not generated with sufficient care and the “generative model learns the statistical properties of the source data too closely or too exactly”. “In other words, if it ‘overfits’ the data, then the synthetic data will simply replicate the source data, making re-identification easy.” Moreover, there is also concern with membership inference, where an attacker can infer that a particular individual’s data was in the source dataset; the mere fact of inclusion can itself be sensitive. A also demonstrated that “synthetic data does not provide a better tradeoff between privacy and utility than traditional anonymization techniques” and “the privacy-utility tradeoff of synthetic data publishing is hard to predict.” This indicates that the characterization of synthetic data as a “silver bullet” is likely overselling its capabilities.
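One way to picture the overfitting concern is a nearest-record check: if many synthetic records sit almost on top of a source record, the generator has likely memorized rather than generalized. The sketch below is only a heuristic under stated assumptions, not a formal privacy guarantee; the records, the `THRESHOLD` value, and the function names are all invented for illustration.

```python
import math
import random

def nearest_distance(record, source):
    """Euclidean distance from one synthetic record to its closest source record."""
    return min(math.dist(record, s) for s in source)

def share_of_near_copies(synthetic, source, threshold):
    """Fraction of synthetic records lying within `threshold` of a source record."""
    close = sum(1 for r in synthetic if nearest_distance(r, source) < threshold)
    return close / len(synthetic)

rng = random.Random(0)
# Hypothetical source records: (age, systolic blood pressure), invented here.
source = [(rng.gauss(50, 10), rng.gauss(120, 10)) for _ in range(200)]

# An "overfit" generator that effectively memorizes the source and adds tiny jitter.
overfit = [(a + rng.gauss(0, 0.01), b + rng.gauss(0, 0.01)) for a, b in source]
# A generator that only learned the overall distribution.
generalizing = [(rng.gauss(50, 10), rng.gauss(120, 10)) for _ in range(200)]

THRESHOLD = 0.1  # assumed cut-off; a real audit would need to justify this value
print("overfit near-copies:     ", share_of_near_copies(overfit, source, THRESHOLD))
print("generalizing near-copies:", share_of_near_copies(generalizing, source, THRESHOLD))
```

A high share of near-copies suggests that the synthetic data replicates the source closely enough for re-identification to be easy. A low share is reassuring but does not rule out other attacks such as membership inference.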

Implementations

Nvidia is using synthetic data in computer vision, but its primary purpose is not privacy, which shows that there are other important functions for the technology. is a leading platform for synthetic data in healthcare and is . It is only beginning: it is predicted that “.”

Conclusion

Synthetic data has the potential to be highly beneficial, as it may be the answer to the many challenges AI developers face in sharing sensitive data. However, like many developments in AI technology, it requires caution and careful implementation to be effective and is potentially dangerous if relied upon haphazardly.

The post Synthetic Data: The Next Solution for Data Privacy? appeared first on IPOsgoode.

Differential Privacy: The Big Tech Solution to Big Data Privacy

ccraig — Fri, 16 Dec 2022 17:00:00 +0000

Gregory Hong is an IPilogue Writer and a 1L JD candidate at Osgoode Hall Law School.

The AI revolution has brought about significant concerns about the privacy of big data. Thankfully, over the past decade, big tech has found a solution to this problem: differential privacy, which actors have . The technology is not limited to big tech anymore either; the . Furthermore, the European Union is – indicating that policymakers are on board with differential privacy as a standard means of protecting large, tabulated datasets.

What problem does differential privacy aim to solve?

Differential privacy was created to combat the , which states that “overly accurate answers to too many questions will destroy privacy in a spectacular way.” For instance, in a striking example, one researcher showed that gender, date of birth, and zip code are sufficient to uniquely identify the vast majority of Americans. By linking these attributes in a supposedly anonymized healthcare database to public voter records, she was able to identify the individual health record of the Governor of Massachusetts.

A similar attack targeted a Netflix dataset, which at the time contained anonymous movie ratings of 500,000 Netflix subscribers. The attacker compared this to the Internet Movie Database (IMDb) and successfully identified the Netflix records of known users, uncovering information such as their apparent political preferences.
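The mechanics of such a linkage attack are simple enough to sketch in a few lines. The example below joins a "de-identified" table to a public one on gender, date of birth, and zip code; every record and name in it is invented, and `link_records` is a hypothetical helper rather than a reconstruction of either attack described above.

```python
def link_records(anonymized, public, keys):
    """Re-identify anonymized rows that match exactly one public record on `keys`."""
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person)
    matches = []
    for row in anonymized:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match pins the row to one person
            matches.append((candidates[0]["name"], row))
    return matches

QUASI_IDENTIFIERS = ("gender", "dob", "zip")

# Invented records: names were removed, but quasi-identifiers were left in place.
health_records = [
    {"gender": "F", "dob": "1961-03-14", "zip": "02139", "diagnosis": "asthma"},
    {"gender": "M", "dob": "1975-11-02", "zip": "02139", "diagnosis": "diabetes"},
]
# Invented public voter roll that includes names.
voter_roll = [
    {"name": "A. Example", "gender": "F", "dob": "1961-03-14", "zip": "02139"},
    {"name": "B. Sample", "gender": "M", "dob": "1975-11-02", "zip": "02139"},
    {"name": "C. Placeholder", "gender": "M", "dob": "1980-01-20", "zip": "02139"},
]

for name, row in link_records(health_records, voter_roll, QUASI_IDENTIFIERS):
    print(name, "->", row["diagnosis"])
```

No single column identifies anyone here; it is the combination of quasi-identifiers that narrows each health record down to one named person.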

How does one defend against such an attack?

De-anonymization attacks follow the principle that overly accurate answers to too many questions will destroy privacy. Defending a database against too many questions is impractical, so there must be a method to make answers less accurate without destroying the data’s utility. Per , this is achieved by introducing “statistical noise”. The noise is significant enough to protect the individual’s privacy, but small enough that it will not materially affect the accuracy of the extracted answers.
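As an illustration of how statistical noise can be added to an answer, the sketch below applies the Laplace mechanism, a standard textbook technique for differential privacy, to a counting query. This is an assumption on my part: the post does not say which mechanism any particular deployment uses. The dataset, the privacy parameter `epsilon=0.5`, and the function names are invented for the example.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_count(records, predicate, epsilon, rng):
    """Answer a counting query with Laplace noise of scale 1/epsilon.

    Adding or removing one person changes a count by at most 1, so the
    query's sensitivity is 1 and the noise scale is 1/epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1 / epsilon, rng)

rng = random.Random(0)
# Invented dataset: 10,000 people, roughly 30% with some condition.
records = [{"has_condition": rng.random() < 0.3} for _ in range(10_000)]

true_count = sum(r["has_condition"] for r in records)
answer = noisy_count(records, lambda r: r["has_condition"], epsilon=0.5, rng=rng)
print(f"true count {true_count}, noisy answer {answer:.1f}")
```

With a count in the thousands, noise of a few units barely changes the aggregate answer, yet it is comparable in size to any single person's contribution of 1. A smaller `epsilon` means more noise and stronger protection, at the cost of accuracy.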

Why is this relevant to law?

Differential privacy protects an individual’s information by giving the impression that their information was not used in the analysis at all, which is more likely to comply with legal requirements for privacy protection. Differential privacy also masks individual contributions so that using an individual’s data will not reveal personally identifiable information, making it very difficult to infer any information specific to an individual.

raised (and voluntarily dismissed) legal arguments against differential privacy by alleging that “the defendants’ decision to produce ‘manipulated’ census data to the states for redistricting would result in the delivery of inaccurate data for geographic regions beyond the state’s total population in violation of the Census Act”. As the plaintiff voluntarily dismissed the case, we will need to wait to see if this argument is successful in the future. However, it is possible that courts could find that the addition of statistical noise violates the data’s integrity, which would be a serious problem for differential privacy.

The post Differential Privacy: The Big Tech Solution to Big Data Privacy appeared first on IPOsgoode.
