Today we launch the first industry-grade open-source synthetic data toolkit (SDK), enabling any organization to easily generate high-quality, privacy-safe synthetic datasets from sensitive proprietary data, all within their own compute infrastructure. By eliminating data-sharing hurdles, this open-source release clears the path for the next wave of AI innovation, fueled by previously inaccessible data.

Synthetic Data Leads Into a New Era of AI and Data Democratization

AI is recognized as a foundational technology, akin to electricity. Yet, the shortage of relevant training data is starting to hamper its further development. Governments and analysts alike emphasize synthetic data as the next frontier:

  • The UK Government’s most recent AI Opportunities and Action Plan stresses the need for “access to high-quality data — the lifeblood of modern AI”, and explicitly calls for “exploring the use of synthetic data generation techniques to construct privacy-preserving versions of highly sensitive datasets.”
  • Industry analyst Gartner predicts 75% of businesses will generate synthetic customer data by 2026, up from less than 5% in 2023.
  • The European Commission’s JRC deems synthetic data a key enabler for AI development and data democratization.

Our mission has always been to democratize data. With the open-source release of our industry-proven synthetic data toolkit, we can now empower every business and every agency to finally harness their proprietary data with zero compromises on privacy.

SDK Brings Industry-Grade Synthetic Data Capabilities to Every Organization

We've already proven synthetic data in highly regulated sectors, and our Platform is trusted by the U.S. Department of Homeland Security and leading Fortune 500 companies. We stand as the category leader for privacy-safe synthetic data. With today’s open-source release, we eliminate lingering adoption barriers while boosting transparency and trust.

The new SDK delivers state-of-the-art accuracy, differential privacy, best-in-class compute efficiency, and broad data support. Its fully permissive license fosters tighter integrations with leading AI and cloud platforms, creating a seamless ecosystem for synthetic data at scale.

Importantly, any synthetic data generator built with the SDK is fully compatible with our Enterprise Platform, enabling instant sharing, analysis, and AI-assisted data exploration. This unlocks data insights for everyone in an organization, independent of their background, driving true democratization of knowledge.

Synthetic Data Fuels Next-Gen AI With Proprietary Knowledge

Synthetic data is set to drive the next wave of AI adoption. Many organizations realize that AI trained solely on public data falls short when it comes to context and relevance. Proprietary business data — rich in behavioral insights and domain expertise — holds the key to more effective AI applications. However, privacy constraints often lock this data out of AI training.

MOSTLY AI’s Synthetic Data SDK removes that barrier, empowering organizations to safely harness their proprietary data without risking privacy compliance. With recent advances in GPTs, organizations can empower every employee with AI, not just expert users. But without unlocked training data, that AI is like a high-end device with no power source. By safely fueling AI with proprietary synthetic data, businesses can now turn their AI models from a novelty into an indispensable force for innovation.

Availability

The Synthetic Data SDK is available as a standalone Python package at https://github.com/mostly-ai/mostlyai under the fully permissive Apache v2 license. Join our young community by installing, using, and integrating the SDK — and help shape the future of privacy-safe synthetic data. Share your questions, star the repository, request features, and showcase your use cases.

If you need high-quality synthetic data, please pass this toolkit on to data owners. Together, we can unlock privacy-safe access to otherwise untapped data assets, fueling a more inclusive and responsible AI ecosystem.

Further Reading

Learn more about TabularARGN, the model architecture that powers the SDK, here

Learn more about the Synthetic Data SDK and how to quickly get started here