We ❤️ Open Source

A community education resource

How synthetic data accelerates AI development without privacy risk

Building models with velocity and confidence using generated data.

NASA faced a problem in 1961 when Kennedy declared the United States would put a man on the moon by decade’s end. They needed to design and test spacecraft for conditions they had almost no data about. In this presentation at All Things Open, Brett Wujek, Head of Product Strategy for Next-Generation AI Technology at SAS, shares what NASA did: They simulated the environment and synthesized the data needed to test systems under different conditions. Unless you’re a conspiracy theorist, it worked.

Fast forward 60 years to today’s AI revolution. The world generates roughly 400 million terabytes of data daily, yet organizations still struggle to get the right data. Privacy regulations like GDPR and HIPAA restrict access. Imbalanced datasets propagate bias downstream. Rare events like credit card fraud (one to three in every thousand transactions) offer too few examples to build robust detection. Sheer volume of data doesn’t solve these challenges.

Synthetic data addresses these gaps by building generators that can quickly sample realistic records for development. The process starts with onboarding the original data, which is often messy and spread across multiple related tables. Profile the data to understand column types, identify entities, and preserve sequential structure. Preprocess it to handle missing values. Then build generator models using algorithms that range from deep learning approaches such as GANs (Generative Adversarial Networks) to simpler techniques like Bayesian networks or SMOTE (Synthetic Minority Over-sampling Technique). Training can incorporate differential privacy, which inserts enough noise that generated records can’t be traced back to real individuals.
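To make one of the simpler techniques named above concrete, here is a minimal Python sketch of the interpolation idea behind SMOTE: pick a minority-class record, pick one of its nearest minority neighbours, and create a new point somewhere on the line between them. The function name `smote_like_samples`, its defaults, and the sample fraud records are hypothetical and invented for illustration. This is not the generator shown in the talk, and real projects would normally use a maintained library implementation.

```python
import math
import random


def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # All other minority points, nearest to `base` first.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class records: (amount, hour_of_day).
fraud = [(950.0, 2.0), (1200.0, 3.0), (870.0, 1.0), (1500.0, 4.0)]
new_points = smote_like_samples(fraud, n_new=6)
for point in new_points:
    print(point)
```

Because every synthetic point sits between two real minority records, this approach fills in the rare class without copying any single record. It also offers no formal privacy guarantee on its own, which is where differential privacy comes in.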

Privacy approaches vary in effectiveness. Mocked data is of questionable realism. Masking removes sensitive information but also loses important detail. Anonymization encodes data differently, yet 87 percent of Americans can be identified from just gender, birth date, and zip code. Synthetic data instead uses algorithms to create new observations that represent the real data without traceback risk, especially when combined with differential privacy.
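The re-identification point can be checked on any table by counting how many rows are unique on a handful of quasi-identifiers. The sketch below is a minimal illustration, assuming records are plain dictionaries; the helper `unique_fraction` and the five sample people are hypothetical and made up for this example, not data from the talk.

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers):
    """Return the share of records whose combination of quasi-identifier
    values appears exactly once in the dataset."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(records)


# Made-up records with names already stripped ("anonymized").
people = [
    {"gender": "F", "birth_date": "1980-04-12", "zip": "27601"},
    {"gender": "F", "birth_date": "1980-04-12", "zip": "27601"},
    {"gender": "M", "birth_date": "1975-09-30", "zip": "27601"},
    {"gender": "F", "birth_date": "1991-01-05", "zip": "27513"},
    {"gender": "M", "birth_date": "1975-09-30", "zip": "27513"},
]

share = unique_fraction(people, ["gender", "birth_date", "zip"])
print(f"{share:.0%} of records are unique on the three fields")
```

In this toy table three of the five rows are unique on the three fields, so anyone who knows those values for a person can pick out that person's row even though no name is stored.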

Validation checks statistical congruence with the original data through distribution matching, correlation preservation, and relationship maintenance across tables. There’s typically a trade-off between accuracy and privacy: the more the data is perturbed for privacy, the less closely it tracks the original. Brett demonstrates how organizations use synthetic data (check the video for the live demo), with examples like Nationwide Building Society achieving a 28 percent improvement in model accuracy.
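As a rough illustration of what such a congruence check might look like, the sketch below compares per-column means and standard deviations (a crude stand-in for distribution matching) and pairwise Pearson correlations between a real and a synthetic table. The function names, the report keys, and both small tables are assumptions made for this example; it is not the validation tooling used in the demo, and it does not cover cross-table relationships.

```python
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def congruence_report(real, synth, columns):
    """Absolute gaps between real and synthetic summary statistics.
    Smaller values mean the synthetic table tracks the real one more closely."""
    report = {}
    for col in columns:
        report[f"mean_gap[{col}]"] = abs(
            statistics.mean(real[col]) - statistics.mean(synth[col])
        )
        report[f"std_gap[{col}]"] = abs(
            statistics.pstdev(real[col]) - statistics.pstdev(synth[col])
        )
    for i, a in enumerate(columns):
        for b in columns[i + 1:]:
            report[f"corr_gap[{a},{b}]"] = abs(
                pearson(real[a], real[b]) - pearson(synth[a], synth[b])
            )
    return report


# Made-up column-oriented tables (income in thousands).
real = {"age": [25, 32, 47, 51, 62], "income": [30, 42, 58, 61, 75]}
synth = {"age": [27, 30, 45, 53, 60], "income": [33, 40, 55, 64, 72]}

report = congruence_report(real, synth, ["age", "income"])
for name, gap in report.items():
    print(f"{name}: {gap:.3f}")
```

A team would typically set acceptable thresholds for each gap and re-run the check whenever the privacy settings change, since stronger privacy tends to widen these gaps.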

Key takeaways

  • Synthetic data addresses privacy restrictions, data scarcity, and bias by generating realistic records that, with differential privacy techniques applied, can’t be traced to individuals.
  • The generation process requires profiling data types, maintaining table relationships, preprocessing messy data, and validating statistical congruence with original datasets.
  • Organizations in financial services and healthcare see measurable model accuracy improvements while enabling experimentation previously blocked by regulatory constraints.
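The noise-adding idea behind differential privacy, mentioned in the first takeaway, can be illustrated with the classic Laplace mechanism applied to a simple count. This is a minimal sketch under stated assumptions: it protects a single counting query rather than the training of a generator model, and the helper names, the `epsilon` values, and the sample transactions are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two
    independent exponential draws with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Return a noisy count. Adding or removing one person changes a count
    by at most 1 (sensitivity 1), so the noise scale is 1 / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical flags: 3 fraud cases in 1,000 transactions.
transactions = [{"fraud": i < 3} for i in range(1000)]
rng = random.Random(42)

# Smaller epsilon means more noise and stronger privacy.
strong_privacy = private_count(
    transactions, lambda t: t["fraud"], epsilon=0.1, rng=rng
)
weak_privacy = private_count(
    transactions, lambda t: t["fraud"], epsilon=5.0, rng=rng
)
print(f"epsilon=0.1: {strong_privacy:.2f}")
print(f"epsilon=5.0: {weak_privacy:.2f}")
```

The `epsilon` parameter is the dial behind the accuracy and privacy trade-off: lowering it hides any one individual's contribution more thoroughly, at the cost of a less accurate answer.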

Synthetic data enables access to data that was previously restricted, reduces bias, improves accuracy, and helps test models downstream without inflicting harm on real people, all while maintaining privacy compliance.

The opinions expressed on this website are those of each author, not of the author's employer or All Things Open/We Love Open Source.
