
What is Synthetic Data? Is it Better from a Privacy Professional’s Perspective?

Credit: Markus Spiske | Unsplash

With the emergence of ChatGPT and the RobotLawyer, the public is turning its attention away from cryptocurrency and toward artificial intelligence (AI). However, given the intricate nature of AI, many are left wondering how it works and how to maximize its benefits. In today’s data protection regulatory landscape, businesses may be forbidden from processing their real datasets to train AI. Specifically, the General Data Protection Regulation (GDPR) generally restricts uses of personal information that are incompatible with the purposes the business disclosed when it first collected that information, absent the consumer’s consent.

For this and several other reasons, synthetic data has become increasingly popular in the AI space. This raises the question: what is synthetic data?

What is Synthetic Data?

Synthetic data is information generated on a computer to augment or replace real data to test and train AI models. Synthetic data is usually created and used when real data is not available or has to be kept private because of personally identifiable information (PII). There are several types of synthetic data that serve different purposes. Synthetic data can be:

  • Synthetic text. This is best described as artificially generated text.

  • Synthetic media like video, image, or sound. This is best described as artificially generated videos, images, or sounds that can be used to increase a dataset’s size and diversity when training image recognition systems.

  • Synthetic tabular data. This is best described as artificially generated data that mimics real-world data stored in tables.
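
To make the tabular case concrete, here is a minimal sketch in Python. It assumes a tiny, hypothetical customer table (the column meanings and values are invented for illustration), records the mean and standard deviation of each column, and then samples new rows from those statistics. Real synthetic data tools are far more sophisticated; this sketch treats each column independently, so it ignores relationships between columns, and it offers no formal privacy guarantee on its own.

```python
import random
import statistics

# Hypothetical "real" table: (age, annual_spend_usd) for five customers.
real_rows = [(34, 1200.0), (45, 3100.0), (29, 800.0), (52, 4000.0), (41, 2500.0)]

def fit_columns(rows):
    """Return (mean, standard deviation) for each column of the table."""
    columns = list(zip(*rows))
    return [(statistics.mean(col), statistics.stdev(col)) for col in columns]

def sample_synthetic(params, n, seed=0):
    """Draw n synthetic rows, sampling each column independently
    from a normal distribution with the fitted mean and deviation."""
    rng = random.Random(seed)
    return [tuple(rng.gauss(mu, sigma) for mu, sigma in params) for _ in range(n)]

params = fit_columns(real_rows)
synthetic_rows = sample_synthetic(params, n=1000)
print(synthetic_rows[:3])  # three artificial (age, spend) rows
```

None of the 1,000 generated rows is copied from a real customer, yet the table as a whole has roughly the same averages and spread as the original.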

As of today, synthetic data is widely used in the health, financial, and e-commerce sectors. Healthcare providers in fields such as medical imaging use synthetic data to train AI models while protecting patient privacy. Financial-service providers (e.g., American Express and J.P. Morgan) use synthetic data to train AI models to identify fraudulent transactions. Amazon Robotics uses synthetic data to train robots to identify packages of varying types and sizes. Waymo, Google’s sister company under Alphabet, uses synthetic data to train its self-driving cars.

Why has Synthetic Data become so important?

Our personal data has become increasingly valuable to businesses, but real data is generally more expensive to source than synthetic data. Developers need large, carefully labeled datasets that may contain a few thousand to tens of millions of edge cases. Synthetic data lowers the labor cost of identifying, acquiring, and preparing datasets for use, especially when datasets would otherwise have to be labeled manually, which is why businesses are turning to it as an affordable option.

Another benefit is that synthetic datasets are quicker to produce because the data is not captured from real-world events. With suitable tools and hardware, a dataset can be generated much faster than it could be collected. This helps developers who need a large volume of data in a short period of time.

Additionally, the use of synthetic data may lower bias. After analyzing real-world data and observing a potential bias, developers may strategically use synthetic data to train (and retrain) AI models to remove it. Developers may also use synthetic data to artificially build out a dataset that is not representative enough of specific populations. This is particularly important because even a large dataset may exclude some communities or represent them unequally compared to others.
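
The idea of building out an under-represented group can be sketched in a few lines of Python. The records, group labels, and noise level below are hypothetical, and jittering existing records is only one simple augmentation technique chosen for illustration. The sketch counts how many records each group has and then adds slightly perturbed copies to the smaller groups until every group matches the largest one.

```python
import random
from collections import Counter

# Hypothetical records: (group, feature_value). Group "B" is under-represented.
records = [("A", 5.1), ("A", 4.8), ("A", 5.5), ("A", 5.0), ("A", 4.9), ("A", 5.2),
           ("B", 6.9), ("B", 7.2)]

def balance_with_synthetic(rows, noise=0.1, seed=0):
    """Top up each smaller group with jittered copies of its own records
    until every group is as large as the largest group."""
    rng = random.Random(seed)
    counts = Counter(group for group, _ in rows)
    target = max(counts.values())
    synthetic = []
    for group, count in counts.items():
        pool = [value for g, value in rows if g == group]
        for _ in range(target - count):
            base = rng.choice(pool)  # pick a real record from this group
            synthetic.append((group, base + rng.gauss(0, noise)))  # add small noise
    return rows + synthetic

balanced = balance_with_synthetic(records)
print(Counter(group for group, _ in balanced))
```

Balancing the counts does not by itself remove bias. If the few real records for a group are unrepresentative, the synthetic copies will inherit that problem, so the result still needs to be reviewed.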

How does it impact the privacy law industry?

Synthetic data can address privacy issues and reduce bias by ensuring users have the data diversity to represent the real world. The use of synthetic datasets embodies some of the principles laid out in the Fair Information Practice Principles framework. By incorporating privacy-enhancing technologies (PETs) (e.g., differential privacy and privacy filtering), it is possible to sharply limit the risk that a machine learning model will memorize any individual’s personal information. Through PETs, businesses can bypass time-consuming and complex manual data anonymization tasks, enabling fast and safe sharing of information between teams.
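
For readers curious about what differential privacy looks like in practice, here is a minimal sketch of one classic building block, the Laplace mechanism, applied to a simple count rather than to model training. The ages, the query, and the epsilon value are hypothetical. The idea is to add random noise, scaled to how much one person could change the answer, so that the published result reveals very little about any single individual.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(values, predicate, epsilon, rng):
    """Release a count with epsilon-differential privacy (Laplace mechanism).

    Adding or removing one person changes a count by at most 1,
    so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1 / epsilon, rng)

rng = random.Random(0)
ages = [34, 45, 29, 52, 41, 38, 61, 27]  # hypothetical records
noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng)
print(noisy)  # the true count (4) plus random noise
```

A smaller epsilon means more noise and stronger privacy; a larger epsilon means a more accurate but less private answer. Production systems rely on vetted libraries rather than hand-written noise functions like this one.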

Synthetic data may also motivate businesses to create safer data retention policies. Because they can create their own data, businesses no longer need to store excessive or stale data on their servers. Consequently, they can shorten the retention period for real data, likely reducing operating costs. As more businesses shift to shorter retention periods, the practice may ultimately influence the industry standard. As it stands now, businesses generally keep real data “for as long as necessary.”

In practice, synthetic data helps with edge cases, fills data scarcity gaps, lowers business costs, and alleviates privacy concerns. Yet, is it too good to be true? Maybe. We will likely find out sooner rather than later: according to one study, by 2024, 60% of the data used to develop AI and analytics projects will be artificially generated.

*The views expressed in this article do not represent the views of Santa Clara University.
