Maria Korolov
Contributing writer

What is synthetic data? Generated data to help your AI strategy

Mar 15, 2022 · 11 mins
Artificial Intelligence · Machine Learning

Artificially generated data can be used in place of real historic data to train AI models when actual data sets are lacking in quality, volume, or variety.


Synthetic data defined

Synthetic data is artificially generated information that can be used in place of real historic data to train AI models when actual data sets are lacking in quality, volume, or variety. Synthetic data can also be a vital tool for enterprise AI efforts when available data doesn’t meet business needs or could create privacy issues if used to train machine learning models, test software, or the like.

According to Gartner analyst Svetlana Sicular, by 2024, 60% of the data used for the development of AI and analytics solutions will be synthetically generated, up from 1% in 2021.

Synthetic data use cases

Artificial data has many uses in enterprise AI strategies. As a stand-in for real data, synthetic data can be helpful in the following scenarios:

For training models when real-world data is lacking: AI and ML systems require massive amounts of data. For some use cases, there just isn’t enough data available, either because the use case happens very infrequently, or the use case is new and there isn’t much historical data available yet. Synthetic data can also lower costs when collecting or buying real-world data is prohibitively expensive.

To fill gaps in training data: Some data sets don’t fully reflect a company’s use cases. For example, a system trained to recognize phone numbers may not have enough international numbers to work with.
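
As a rough illustration of filling such a gap, the sketch below generates synthetic international phone numbers. The country-code list is a small placeholder; a real job would draw on the full ITU E.164 assignment list.

```python
import random

# Hypothetical sample of country calling codes (placeholder, not exhaustive).
COUNTRY_CODES = ["44", "49", "81", "91", "33"]

def synthetic_international_number(rng: random.Random) -> str:
    """Generate one synthetic phone number in rough E.164 format."""
    code = rng.choice(COUNTRY_CODES)
    # Nine national digits keeps the total under the E.164 cap of 15 digits.
    national = "".join(rng.choice("0123456789") for _ in range(9))
    return f"+{code}{national}"

rng = random.Random(42)
numbers = [synthetic_international_number(rng) for _ in range(5)]
print(numbers)
```

A batch of such numbers can be mixed into a training set that otherwise skews domestic.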

Another common problem is to balance out a data set. For example, a historic data set might be composed of 99% non-fraudulent transactions and less than 1% fraudulent ones, says John Blankenbaker, principal data scientist at SSA & Co., a global management consulting firm. “Many models will decide that the most successful policy will be to label every transaction as non-fraudulent.”

Synthetic data can help balance the data set, but it has to be done very carefully. “It will only be useful if the synthesis process captures whatever it is about a transaction that indicates fraud,” Blankenbaker says. “Which is unlikely to be obvious because then we’d use that as our fraud detector.”
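
The simplest rebalancing baseline, plain random oversampling, can be sketched in a few lines. As Blankenbaker's caveat implies, this only duplicates rows; real synthesis tools (SMOTE, for example) interpolate new feature values instead, and even those only help if the features capture the fraud signal.

```python
import random

def oversample_minority(rows, label_key, minority_label, rng):
    """Duplicate minority-class rows until the classes are balanced.

    A naive baseline: it copies rows verbatim rather than synthesizing
    genuinely new examples.
    """
    minority = [r for r in rows if r[label_key] == minority_label]
    majority = [r for r in rows if r[label_key] != minority_label]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return rows + extra

rng = random.Random(0)
# Toy transaction set: 99 legitimate rows, 1 fraudulent row.
transactions = [{"amount": rng.uniform(1, 500), "fraud": False} for _ in range(99)]
transactions += [{"amount": 9_999.0, "fraud": True}]
balanced = oversample_minority(transactions, "fraud", True, rng)
print(len(balanced))  # 198 rows: 99 legitimate plus 99 (duplicated) fraudulent
```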

‘Long tail’ use cases: As AI becomes ubiquitous in organizations, companies are running out of use cases where the required training data is plentiful and easily available. Once those projects show success, business leaders will want the same approaches used for their own use cases.

To speed up model development: Collecting real-world training data takes time, as the information is gathered, labeled, processed, and put through compliance and other checks. This can slow down the development of new AI models. With synthetic data, models can be trained and calibrated before real-world data becomes available.

To simulate the future: When fashions change, historic data might become obsolete overnight. For example, when people switched from wired headphones to wireless, all that historic customer data lost its predictive value. Recommendation engines relying on old training data might still be recommending wired options. Replacing or augmenting the historic data with synthetic data that accounts for the fashion change can help keep recommendation engines relevant.

To simulate alternate futures: If a change is coming, and it’s unclear which direction customers will go, simulated data can help companies run scenario simulations and be prepared for either option.

To simulate “black swan” events: Certain situations come up very rarely and might not be present in historic data at all — but if they would have a dramatic impact on an organization, it’s necessary to be prepared. Using synthetic data to simulate those situations can help a company model its responses.

To simulate the metaverse: The metaverse — virtual, 3D simulations of gaming, social, and business environments — will require a massive amount of content. Rooms, buildings, landscapes, and so on will need to be created, and hiring 3D artists to create all this content from scratch will be prohibitively expensive. Synthetic data can fill in some of the gaps to create realistic, appropriate settings and objects for virtual environments, events, and interactions.

To generate marketing imagery: Advertisers are already creating synthetic images to showcase their products. For example, a photograph of a model wearing a sweater in one color can be turned into realistic photos of the same model wearing all the different versions of the same sweater. Image generation tools are also available that can even generate realistic yet unique faces or show off furniture in different arrangements.

For software testing: Using real data to test new software can create privacy and security problems. Synthetic data that looks like real data but isn’t allows software to be tested across the full gamut of use cases without putting real data at risk. “If we want to see how our infrastructure handles a large number of user accounts, it is easy to write a program that connects to our website and signs up synthetic users,” SSA’s Blankenbaker says.
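
A sketch of Blankenbaker’s idea: generate synthetic signup payloads before pointing a load-test client at the site. The field names here are illustrative, not any particular API’s schema, and `example.test` is a reserved, never-routable test domain.

```python
import random
import string

def synthetic_signup(rng: random.Random) -> dict:
    """Build one synthetic user-account payload for load testing.

    Field names are hypothetical; a real test would match the
    signup API's actual schema.
    """
    user = "".join(rng.choices(string.ascii_lowercase, k=8))
    return {
        "username": user,
        "email": f"{user}@example.test",  # reserved test domain, never routable
        "password": "".join(rng.choices(string.ascii_letters + string.digits, k=16)),
    }

rng = random.Random(1)
payloads = [synthetic_signup(rng) for _ in range(1000)]
print(payloads[0]["email"])
```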

To create digital twins: In court cases, attorneys sometimes create a shadow jury to test arguments. Organizations can do something similar by using synthetic data. For example, in 2019, Norway’s Labour and Welfare Administration created a synthetic version of its entire population. The data is regenerated daily, says Gartner’s Sicular, and is used by a number of outside organizations.

In place of medical and financial data: Using real customer or patient data for training AI models, running simulations, or finding useful treatments or correlations can be very risky from a compliance standpoint. Even scrubbed or anonymized data can often be reverse engineered to get the original data back, says Andy Thurai, vice president and principal analyst at Constellation Research. Synthetic data can’t be de-anonymized but can still be used to find valuable insights.

For sales and marketing: When a sales team calls on a customer to demonstrate a product or service that ingests data, it can be useful to use samples that are as close to the customer’s own use case as possible. Using data from another customer would be a privacy violation. Synthetic data can enable the sales team to put the product through its paces in a use case similar to that of the customer, without divulging sensitive information.

“A startup that is trying to build a healthcare application can build their entire framework using synthetic PHI [protected health information] data to create an end-to-end framework for prospective demo to clients instead of having to wonder and wait to make the right connections to use actual PHI data,” says Priya Iragavarapu, vice president in the center of data excellence at AArete, a global management consultancy.

To test AI systems for bias: When AI systems discriminate based on race, religion, or other illegal considerations it can create a compliance liability or a public relations disaster — or both. With “black box” AI systems and new AI technologies like neural networks, it can be hard to figure out why an AI makes the recommendation that it does. Testing the AI systems against synthetic data sets that are designed to mimic real-world demographics can help uncover these hidden biases.
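
As a toy illustration of such an audit, the sketch below scores synthetic applicants whose income distribution is identical across two hypothetical ZIP-code groups. Because the synthetic data is balanced by construction, any approval-rate gap must come from the model — here a deliberately biased stand-in for an opaque scoring system.

```python
import random

def score(applicant):
    """Toy 'black box' model under audit; deliberately biased on zip_prefix
    (a hypothetical stand-in for a real model's opaque scoring logic)."""
    base = applicant["income"] / 1000
    return base - (15 if applicant["zip_prefix"] == "99" else 0)

rng = random.Random(7)
groups = ["10", "99"]
# Synthetic applicants: identical income distribution in every group,
# so any approval-rate gap must come from the model, not the data.
applicants = [
    {"zip_prefix": g, "income": rng.uniform(30_000, 90_000)}
    for g in groups for _ in range(500)
]
approval = {
    g: sum(score(a) > 40 for a in applicants if a["zip_prefix"] == g) / 500
    for g in groups
}
print(approval)  # the gap between groups exposes the hidden bias
```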

Synthetic data generation

Sometimes, generating synthetic data can be very simple. A list of names, for example, can be generated by combining a randomly chosen first name from a list of first names with a last name from a list of last names. ZIP codes can be randomly picked from a list of ZIP codes. That might be enough for some applications. For other purposes, however, the list may need to be balanced so that, say, synthetic spending data correlates with the usual spending patterns in those ZIP codes.
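
That naive approach takes only a few lines of Python. The name and ZIP-code lists below are illustrative placeholders; production use would draw from much larger pools.

```python
import random

# Hypothetical seed lists (placeholders for real reference data).
FIRST_NAMES = ["Ana", "Wei", "Omar", "Lena", "Kofi"]
LAST_NAMES = ["Smith", "Garcia", "Chen", "Okafor", "Novak"]
ZIP_CODES = ["10001", "60601", "94105", "73301", "33101"]

def synthetic_person(rng: random.Random) -> dict:
    """Combine randomly chosen parts into one synthetic record."""
    return {
        "name": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
        "zip": rng.choice(ZIP_CODES),
    }

rng = random.Random(0)
people = [synthetic_person(rng) for _ in range(3)]
print(people)
```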

Most data sets are still produced manually, with SQL for data extraction and anonymization, and are then cleansed using standard programming languages, says Steven Karan, vice president and head of insights and data at Capgemini Canada.

“A commercial off-the-shelf solution has not hit the market yet,” he says. “While there are a small handful of startups that provide synthetic data solutions, none of them have reached any level of critical adoption.”

Instead, most data scientists leverage pre-built packages to generate synthetic data sets, he says.

Generating synthetic data sets that are statistically meaningful and reflect real data in ways relevant to use cases can be a challenge. Most recently, AI and machine learning algorithms have been used to create synthetic data that is more useful and representative. For example, data scientists have just begun using generative adversarial networks (GANs), says AArete’s Iragavarapu.

“It’s a type of neural network that has made a huge leap in making synthetic data generation a reality,” she says.

The way a GAN works is that one system generates data — say, an image of a cat — and a second system tries to guess whether the image is real or fake. By pitting the two systems in a race against each other, the generated images quickly become indistinguishable from reality.
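
The adversarial loop can be illustrated with a deliberately tiny example: a one-parameter “generator” that shifts Gaussian noise, and a logistic “discriminator” trained with hand-derived gradients. Real GANs use deep networks on images, but the push-and-pull dynamic is the same in miniature. All parameter values here are illustrative.

```python
import math
import random

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

rng = random.Random(0)
theta = 0.0      # generator parameter: fake sample = noise + theta
w, b = 0.0, 0.0  # discriminator: D(x) = sigmoid(w*x + b), "probability real"
lr, batch = 0.05, 32

for _ in range(2000):
    reals = [rng.gauss(3.0, 1.0) for _ in range(batch)]          # real data ~ N(3, 1)
    fakes = [rng.gauss(0.0, 1.0) + theta for _ in range(batch)]  # generator output

    # Discriminator step: minimize -log D(real) - log(1 - D(fake)).
    gw = gb = 0.0
    for xr, xf in zip(reals, fakes):
        d_real = sigmoid(w * xr + b)
        d_fake = sigmoid(w * xf + b)
        gw += -(1 - d_real) * xr + d_fake * xf
        gb += -(1 - d_real) + d_fake
    w -= lr * gw / batch
    b -= lr * gb / batch

    # Generator step: minimize -log D(fake), the non-saturating GAN loss.
    gt = sum(-(1 - sigmoid(w * xf + b)) * w for xf in fakes)
    theta -= lr * gt / batch

print(round(theta, 2))  # theta ends near 3: fakes have become hard to tell from reals
```

The generator only ever sees the discriminator’s gradient, yet it ends up matching the real distribution’s mean — the same pressure that makes GAN-generated images converge toward realism.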

Synthetic data tools

A number of tools are currently available to organizations interested in generating their own synthetic data, most of which are open source. Following are some of the more popular tools for creating synthetic data:

  • GPT-J: Open-source alternative to OpenAI’s GPT-3 text generation tool
  • Synthea: Open-source tool popular in the medical field
  • scikit-learn: Used to generate synthetic data sets for use in regression, clustering, and classification with the aim of producing data sets that can enable predictions, according to Capgemini’s Karan
  • SymPy: Used by data scientists who need more customized synthetic data sets for specific needs, as it enables the creation of custom symbolic expressions
  • pydbgen: Used to generate common data sets, such as phone numbers or email addresses
  • synthpop: An R package used to generate synthetic demographic data
  • faker: A Python package that can generate synthetic data such as names, addresses, emails, Social Security numbers, and other data
  • SDV: A Python library (the Synthetic Data Vault) for generating synthetic tabular, relational, and time-series data
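
As an example of the scikit-learn option above, `make_classification` produces a labeled synthetic data set in one call, and its `weights` parameter can deliberately reproduce the kind of class imbalance found in fraud data. The parameter values below are illustrative.

```python
from sklearn.datasets import make_classification

# Sketch of scikit-learn's built-in synthetic classification generator.
X, y = make_classification(
    n_samples=1000,   # rows to generate
    n_features=10,    # total feature columns
    n_informative=4,  # features that actually drive the label
    n_classes=2,
    weights=[0.99],   # mimic a 99/1 class imbalance, as in fraud data
    flip_y=0.0,       # no label noise, so the imbalance is exact
    random_state=42,
)
print(X.shape, y.mean())  # y.mean() is the minority-class share, about 0.01
```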

Synthetic data best practices

Companies just starting to experiment with synthetic data should start with well-structured examples, Gartner’s Sicular suggests. These use cases can be the easiest to deploy and offer the most initial value. For example, a database of names and Social Security numbers can be easily replaced by a synthetic equivalent that offers business benefits without creating compliance liabilities.

Constellation’s Thurai recommends against using synthetic data for both model creation and testing. “That will lead to false positives,” he says. “And don’t go cheap and use all synthetic data. You will need a good amount of real-world data to mix in the blend as well.”

Another mistake would be to use synthetic data to figure out whether things are causally related, says AArete’s Iragavarapu, or to generate synthetic outliers unless there is specific logic by which they are generated.

“And we must always quote explicitly where we use synthetic data versus actual data to remain transparent to our customers,” she adds.

Synthetic data companies

A variety of companies are stepping in to create synthetic data for use in your models, including the following: