Synthetic data is a solution to the tension between the need for real data to train AI, test systems and run analytics, and the need to stay compliant and protect individuals’ data. If you’re worried about using sensitive personal information and GDPR, this could be the answer you’re looking for.
In this article, we’ll talk about what synthetic data actually is, how it’s becoming an important data protection tool, what the benefits and limitations are and more.
What Is Synthetic Data?
Synthetic data is artificially generated data that’s designed to mimic real-world data. It has the same statistical patterns, structures and properties as real data, so it can supplement or even replace real datasets.
Anonymised and pseudonymised data are real records with identifiers removed or replaced. Synthetic data, by contrast, is newly generated from the patterns in the original data, so its records do not map directly to real individuals.
Synthetic data is primarily used in sectors where data is in limited supply, difficult to access or time-consuming to obtain, such as finance and healthcare. Its most notable current use is training AI and machine learning models.
Types of Synthetic Data
- Fully synthetic data is entirely artificial and doesn’t include any authentic information. It estimates relationships, patterns and attributes to emulate the real data as closely as possible.
- Partially synthetic data replaces some of the original data, particularly sensitive information, with artificial values, but the rest remains real. This technique helps protect personal data while preserving the complexities of authentic data.
- Hybrid synthetic data combines real data with fully synthetic information, which allows organisations to scale data sets.
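As a rough illustration of the partially synthetic approach, the sketch below replaces one sensitive numeric field with artificial values and leaves the other fields untouched. The field names (`region`, `age_band`, `salary`) and the choice of a Gaussian fitted to the column are assumptions made for this example only. A real generator would also try to preserve how the sensitive field relates to the other columns, which this toy version does not.

```python
import random
import statistics


def partially_synthesise(records, sensitive_field, seed=0):
    """Return copies of the records with one numeric field replaced by artificial values."""
    rng = random.Random(seed)
    values = [r[sensitive_field] for r in records]
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    synthetic = []
    for r in records:
        row = dict(r)  # copy, so the real record is not modified
        # Draw an artificial value from a Gaussian fitted to the real column.
        row[sensitive_field] = round(rng.gauss(mu, sigma), 2)
        synthetic.append(row)
    return synthetic


# Hypothetical records, invented for this example.
real = [
    {"region": "North", "age_band": "30-39", "salary": 41000.0},
    {"region": "South", "age_band": "40-49", "salary": 52000.0},
    {"region": "North", "age_band": "20-29", "salary": 29500.0},
    {"region": "East", "age_band": "50-59", "salary": 61000.0},
]

partial = partially_synthesise(real, "salary")
for row in partial:
    print(row)
```

Only `salary` changes here; `region` and `age_band` are still real values, which is why partially synthetic data usually remains personal data.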
How Is Synthetic Data Generated?
Synthetic data is typically generated by AI models trained on real-world datasets. These models learn the structure, patterns and statistical properties of the real data and then create similar data points without reproducing the personal information in the original records.
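The idea of learning statistical properties and then sampling new records can be sketched with a deliberately simple model. The example below is a stand-in for the AI models described above, not a description of any particular tool: it fits a two-column Gaussian (means, spreads and correlation) to hypothetical age and income values, then draws entirely new rows from it. The data and the function names are invented for illustration.

```python
import math
import random
import statistics


def fit_bivariate_gaussian(xs, ys):
    """Estimate means, standard deviations and correlation of two columns."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    rho = max(-1.0, min(1.0, cov / (sx * sy)))  # clamp against rounding error
    return mx, my, sx, sy, rho


def sample_synthetic(params, n, seed=0):
    """Draw n new (x, y) rows from the fitted model; no real row is copied."""
    mx, my, sx, sy, rho = params
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mix z1 and z2 so that y has correlation rho with x.
        y = my + sy * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        rows.append((x, y))
    return rows


# Hypothetical "real" data, invented for this example.
ages = [23, 35, 41, 52, 29, 60, 47, 38]
incomes = [24000, 36000, 45000, 58000, 31000, 61000, 50000, 39000]

params = fit_bivariate_gaussian(ages, incomes)
synthetic_rows = sample_synthetic(params, 5000)

# Refit on the synthetic rows to see how closely the properties are preserved.
synthetic_params = fit_bivariate_gaussian(
    [x for x, _ in synthetic_rows], [y for _, y in synthetic_rows]
)
print("real mean age:", params[0], "synthetic mean age:", synthetic_params[0])
print("real correlation:", params[4], "synthetic correlation:", synthetic_params[4])
```

A model this simple only captures averages, spreads and one linear relationship, which hints at why real-world generators need to be much richer to reproduce the nuances of authentic data.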
Why Is Synthetic Data Important for Data Protection?
There are a number of applications for synthetic data, but one of the most exciting is to help minimise the exposure of real personal information and help businesses stay compliant with GDPR.
Elimination of Identifiers
In fully synthetic data, no actual personal data should be present; instead, there are artificial data points that mimic the original personal data. This removes the direct link with real individuals, although, as discussed below, some risk of re-identification can remain.
Enables Safe Data Sharing
In some sectors, like finance and healthcare, using real, sensitive data is often restricted due to privacy concerns. Synthetic data provides an alternative. It allows different departments and external partners to collaborate on data without exposure to sensitive information.
Supports Compliance
When generated properly, fully synthetic data that consists of entirely artificial data points can fall outside the scope of GDPR as non-personal data. Using it can therefore help businesses stay compliant.
However, if you’re the one creating the synthetic data, then the original personal data will still fall under GDPR, as well as any data where there is a residual risk of re-identification.
Benefits of Synthetic Data for Businesses
Synthetic data has a number of benefits, including:
- Reduced breach risk. If synthetic data is leaked, it is likely to cause far less harm than a leak of real datasets.
- Facilitates data minimisation. Synthetic data reduces the need to collect and store real user data, which aligns with the data minimisation principle of GDPR.
- Reduced operational timeframes. Developers, analysts and researchers may no longer have to wait for approval to access sensitive data, because properly generated synthetic data does not expose real individuals.
- Lower compliance costs. Synthetic data reduces the need for manual anonymisation and redaction, avoids costs associated with breaches and streamlines data sharing.
Limitations and Risks of Synthetic Data
While synthetic data has lots of benefits for data sharing, it does come with limitations. These include:
Lack of Realism
Synthetic data is an approximation of real-world data and may lack some of the nuances and complexities of authentic information.
Bias Amplification
If the source data contains bias, synthetic data can replicate or even exaggerate it, which may lead to discrimination or unfair outcomes in downstream applications.
Risk of Re-Identification
Synthetic data is not automatically anonymous. High-fidelity data that closely mirrors the original data, or data derived from unique data sets, can still contain patterns that could enable the re-identification of individuals.
Attackers can exploit this weakness through:
- Linkage attacks. An attacker links two or more records belonging to the same data subject, within one dataset or across several, by exploiting distinctive characteristics such as a rare disease.
- Attribute inference attacks. An attacker queries a trained model and observes its outputs to deduce sensitive or private information about individuals in the training data.
Regulatory Uncertainty
As synthetic data is relatively new, the legislation around it is still evolving. While fully synthetic, truly anonymous data falls outside of GDPR, the risk of re-identification means that some datasets will still fall under the regulation.
Best Practices for Using Synthetic Data
Determine the Data Quality
Many factors affect the quality of synthetic data, including the quality of the source data and the model used to generate it, so check its accuracy before you rely on it.
Compare the synthetic data to real-data baselines to see how well it mimics authentic data. For synthetic images, metrics such as Inception Score and Fréchet Inception Distance (FID) can help; for tabular data, compare the distribution of each column and the relationships between columns.
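As one illustration of a baseline comparison for a single numeric column, the sketch below computes the two-sample Kolmogorov-Smirnov statistic: the largest gap between the cumulative distributions of the real and synthetic values. This is a stand-in chosen for the example rather than one of the metrics named above, and the data is randomly generated purely for demonstration. A value near 0 suggests the distributions are similar; a value near 1 suggests they barely overlap.

```python
import random


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two samples (0 = same, 1 = disjoint)."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        v = min(a[i], b[j])
        # Step both samples past the current value, then compare CDF heights.
        while i < len(a) and a[i] <= v:
            i += 1
        while j < len(b) and b[j] <= v:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


rng = random.Random(0)
# Invented data: a "real" column, a faithful synthetic copy and a poor one.
real_ages = [rng.gauss(50, 10) for _ in range(500)]
good_synthetic = [rng.gauss(50, 10) for _ in range(500)]
poor_synthetic = [rng.gauss(70, 10) for _ in range(500)]

print("faithful generator:", ks_statistic(real_ages, good_synthetic))
print("poor generator:    ", ks_statistic(real_ages, poor_synthetic))
```

A per-column check like this says nothing about relationships between columns, so in practice it would be one of several comparisons rather than a complete quality assessment.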
Assess Re-Identification Risk
Rigorous assessments should be conducted to determine the likelihood of re-identification. This will then guide the governance you need to apply to the data.
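One crude signal that can feed into such an assessment is how many synthetic records exactly reproduce a combination of attributes that belongs to only one person in the real data. The sketch below counts those cases. The field names and records are hypothetical, and an exact-match check like this is only a starting point, not a complete re-identification assessment.

```python
from collections import Counter


def unique_match_rate(real, synthetic, keys):
    """Share of synthetic rows matching a key combination held by exactly one real row."""
    if not synthetic:
        return 0.0, []
    real_counts = Counter(tuple(r[k] for k in keys) for r in real)
    risky = [
        s for s in synthetic
        if real_counts.get(tuple(s[k] for k in keys), 0) == 1
    ]
    return len(risky) / len(synthetic), risky


# Hypothetical records, invented for this example.
real_records = [
    {"age_band": "30-39", "postcode_area": "LS", "condition": "asthma"},
    {"age_band": "30-39", "postcode_area": "LS", "condition": "asthma"},
    {"age_band": "70-79", "postcode_area": "YO", "condition": "rare disease X"},
]
synthetic_records = [
    {"age_band": "30-39", "postcode_area": "LS", "condition": "asthma"},
    {"age_band": "70-79", "postcode_area": "YO", "condition": "rare disease X"},
    {"age_band": "40-49", "postcode_area": "HG", "condition": "diabetes"},
    {"age_band": "20-29", "postcode_area": "LS", "condition": "asthma"},
]

rate, risky_rows = unique_match_rate(
    real_records, synthetic_records, ["age_band", "postcode_area", "condition"]
)
print("share of risky synthetic rows:", rate)
print("rows to review:", risky_rows)
```

The first synthetic row matches two real people and is not flagged, while the second reproduces a combination held by a single real person, which is the kind of record a linkage attack could exploit.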
Implement Privacy-Enhancing Techniques
You can add another layer of protection with further techniques, such as removing or replacing direct identifiers in the source data before generation, or applying privacy-enhancing technologies (PETs) such as differential privacy.
Need Data Protection Support?
Synthetic data is at the forefront of data protection practices. If you’d like to review your protection processes and make sure that you’re fully compliant with GDPR, then get in touch with our team today. We offer data protection audits designed to test your compliance with the law, covering everything from data mapping to DPIA services.