Synthetic data
Synthetic data is artificially generated data that mimics the statistical properties of real data without reproducing any specific record. It is used for AI training when access to real data is restricted.
What is synthetic data?
Synthetic data is artificially generated to mimic the statistical properties of a real dataset without containing any actual records from the original. It is generated either by privacy-aware ML models (typically generative models trained with differential privacy) or by rule-based simulation. The goal is to allow AI training, testing, and demonstration on data that resembles the real thing but does not carry the privacy and contractual constraints attached to the real data.
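As a rough illustration of the model-based route, the sketch below builds a differentially private histogram over a single categorical column and samples synthetic values from it. It is a deliberately simple stand-in for the generative models mentioned above, not a description of any particular product. The function name dp_histogram_synthesize, the epsilon parameter, and the example categories are assumptions made for this example. It also assumes that the list of categories is fixed in advance rather than read from the data, and that adding or removing one record changes exactly one count by one.

```python
import random
from collections import Counter


def dp_histogram_synthesize(records, categories, epsilon, n_samples, rng):
    """Sample synthetic values from a Laplace-noised histogram (illustrative only)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    counts = Counter(records)
    scale = 1.0 / epsilon  # Laplace scale for a count query with sensitivity 1
    noisy = []
    for category in categories:
        # The difference of two exponential draws with mean `scale`
        # is Laplace-distributed with that scale.
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        # Clipping at zero is post-processing, so it does not weaken the guarantee.
        noisy.append(max(counts.get(category, 0) + noise, 0.0))
    if sum(noisy) == 0:
        noisy = [1.0] * len(categories)  # fall back to uniform sampling
    return rng.choices(categories, weights=noisy, k=n_samples)


# Hypothetical source column: 'fraud' is the rare event.
source = ["ok"] * 950 + ["review"] * 45 + ["fraud"] * 5
synthetic = dp_histogram_synthesize(
    source, ["ok", "review", "fraud"], epsilon=1.0, n_samples=1000, rng=random.Random(0)
)
```

Smaller values of epsilon add more noise, which strengthens the privacy guarantee and distorts small counts such as the rare class first. A production system would more likely rely on a vetted differential privacy library, since naive floating-point noise sampling like this has known weaknesses.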
Trade-offs
Synthetic data can preserve utility for some downstream tasks while reducing privacy risk. It works well for testing pipelines, training models that need only the joint distribution of the data, and giving demos to prospects. It works less well for tasks that require accurate representation of rare events or precise individual-level patterns, because those are the patterns most likely to be lost in the synthesis process. Mature synthetic data products quantify the utility-privacy trade-off with measurable metrics.
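One simple way to put a number on the utility side of that trade-off is to compare the category frequencies of a real column with those of its synthetic counterpart. The sketch below uses total variation distance, chosen here purely for illustration; the text does not say which metrics any given product reports. The function names and the example data are hypothetical.

```python
from collections import Counter


def frequencies(values):
    """Return the relative frequency of each distinct value."""
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def total_variation_distance(real, synthetic):
    """Half the sum of absolute frequency differences: 0 = identical, 1 = disjoint."""
    p, q = frequencies(real), frequencies(synthetic)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


# Hypothetical data: the synthetic column drops the rare 'fraud' class entirely.
real = ["ok"] * 95 + ["fraud"] * 5
synthetic = ["ok"] * 100

overall_gap = total_variation_distance(real, synthetic)
rare_real = frequencies(real).get("fraud", 0.0)
rare_synthetic = frequencies(synthetic).get("fraud", 0.0)
```

Here the overall distance is small (about 0.05) even though the rare class has vanished from the synthetic column, which is why rare events are worth checking separately from aggregate metrics.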
For procurement
If an AI vendor claims to train only on synthetic data, ask: What is the source dataset? What synthesis method was used? What privacy guarantees does the synthesis provide (differential privacy bounds, membership-inference test results)? Has the synthetic data been audited against re-identification attacks? Synthetic data is not automatically anonymized data: poor synthesis methods leak training records, and good synthesis methods still leak statistical properties of the source data.
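A crude first check on the leakage concern is to count how many synthetic records are verbatim copies of source records. The sketch below is only a screening step, under the assumption that records can be compared as tuples. It is not a membership-inference test, and a score of zero is not a privacy guarantee. The function name and the sample rows are hypothetical.

```python
def exact_match_rate(synthetic_rows, source_rows):
    """Fraction of synthetic rows that also appear verbatim in the source."""
    if not synthetic_rows:
        raise ValueError("synthetic_rows must not be empty")
    source_set = {tuple(row) for row in source_rows}
    copied = sum(1 for row in synthetic_rows if tuple(row) in source_set)
    return copied / len(synthetic_rows)


# Hypothetical rows: (age, postcode prefix, diagnosis code)
source_rows = [(34, "SW1", "E11"), (58, "M4", "I10"), (41, "LS2", "J45")]
synthetic_rows = [
    (36, "SW1", "E11"),
    (58, "M4", "I10"),  # verbatim copy of a source row
    (44, "LS2", "I10"),
    (29, "M4", "J45"),
]

rate = exact_match_rate(synthetic_rows, source_rows)
```

In this example one of the four synthetic rows copies a source row, giving a rate of 0.25. A fuller audit would also look at near matches and compare the result against records the generator never saw.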