TabularARGN powers the synthetic data generation capabilities of the MOSTLY AI Platform, delivering reliable quality and efficiency to our users. Today, we are pleased to share the architecture and implementation of this foundational model framework with the community. Our latest paper presents the TabularARGN framework, which is now part of our open-source SDK, making its capabilities accessible to a broader audience.
The Engine Behind High-Quality Privacy-Preserving Synthetic Data
TabularARGN is at the core of synthetic data that enables organizations to extract value from their data while preserving privacy. Whether for flat tables with mixed data types or sequential datasets with irregular structures and varying sequence lengths, TabularARGN stands out for its robustness and high performance. Its ability to handle diverse and complex data makes it applicable across a wide range of real-world scenarios and use cases while maintaining statistical fidelity and strong privacy safeguards, including differential privacy guarantees.
TabularARGN is an auto-regressive neural network architecture. We adapted and extended auto-regressive concepts to tackle the unique challenges of tabular data, resulting in a model framework that excels at balancing quality, speed, and robustness.
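To make the auto-regressive idea concrete, here is a minimal sketch of column-by-column generation: each column is sampled conditioned on the columns drawn before it. This is not the TabularARGN implementation. It swaps the neural network for plain frequency tables, and the function names, column names, and toy rows are all made up for illustration.

```python
import random
from collections import Counter, defaultdict


def fit_conditionals(rows, columns):
    """For each column, count its values conditioned on all preceding columns.

    A count table stands in for the neural network here (an assumption made
    purely for illustration).
    """
    tables = []
    for i, col in enumerate(columns):
        table = defaultdict(Counter)
        for row in rows:
            context = tuple(row[c] for c in columns[:i])
            table[context][row[col]] += 1
        tables.append(table)
    return tables


def sample_row(tables, columns, rng):
    """Draw one row, one column at a time, each conditioned on earlier draws."""
    row = {}
    for i, col in enumerate(columns):
        context = tuple(row[c] for c in columns[:i])
        counts = tables[i][context]
        values = list(counts)
        weights = [counts[v] for v in values]
        row[col] = rng.choices(values, weights=weights, k=1)[0]
    return row


# Hypothetical toy data; the column order defines the factorization order.
columns = ["age_group", "education", "income"]
training_rows = [
    {"age_group": "young", "education": "bachelor", "income": "low"},
    {"age_group": "young", "education": "bachelor", "income": "medium"},
    {"age_group": "young", "education": "master", "income": "medium"},
    {"age_group": "senior", "education": "bachelor", "income": "medium"},
    {"age_group": "senior", "education": "master", "income": "high"},
]

tables = fit_conditionals(training_rows, columns)
rng = random.Random(0)
synthetic_rows = [sample_row(tables, columns, rng) for _ in range(5)]
for synthetic_row in synthetic_rows:
    print(synthetic_row)
```

Because this toy version conditions on exact value combinations, it can only reproduce rows it has already seen. A learned model generalizes across contexts instead, which is what makes the same factorization useful for synthetic data.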
What Sets TabularARGN Apart?
Unlike synthetic data generators that rely on increasingly complex and resource-heavy architectures, TabularARGN adopts a more focused and efficient model design. These design choices result in:
- High Fidelity: TabularARGN achieves synthetic data quality on par with state-of-the-art (SOTA) models.
- Privacy by Design: TabularARGN only considers privacy-preserving value ranges for sampling and has built-in privacy protection features. It can also be trained via DP-SGD to obtain differential privacy guarantees.
- Simplicity: TabularARGN leverages existing building blocks, and thus can be easily implemented within standard deep learning frameworks.
- Compute Efficiency: With training up to 100x faster than baseline models in our benchmarks, TabularARGN scales effectively, even for large and complex datasets.
- Sampling Flexibility: TabularARGN supports advanced sampling capabilities, including:
  - Conditional generation to create targeted datasets.
  - Missing value imputation to handle incomplete data seamlessly.
  - Fairness adjustments to align with ethical data synthesis goals.
  - Temperature adjustments that control sampling probabilities to balance rule adherence with data diversity.
- Data Versatility: TabularARGN accommodates the heterogeneity of real-world tabular datasets, including:
  - Multi-variate, mixed-type data (categorical, numerical, date-time, geo-spatial).
  - Multi-sequence datasets with varying sequence lengths and varying time intervals.
  - Missing values.
- Robustness in Training: TabularARGN delivers high-quality synthetic data with default settings and produces consistent results across repeated training runs.
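Two of the sampling capabilities listed above, temperature adjustment and conditional generation or imputation, can be sketched in a few lines. This is a generic illustration rather than the SDK's code: the logits are invented, and `softmax_with_temperature` and `sample_value` are hypothetical helper names. It assumes the model outputs one logit per category for the column being sampled.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_value(categories, logits, temperature, rng, fixed_value=None):
    """Sample one cell, or keep a value that is already known.

    Passing fixed_value mimics conditional generation and imputation in this
    sketch: known cells are kept as-is and only unknown cells are sampled.
    """
    if fixed_value is not None:
        return fixed_value
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(categories, weights=probs, k=1)[0]


# Hypothetical model output for a single categorical column.
categories = ["low", "medium", "high"]
logits = [2.0, 1.0, 0.1]

sharp = softmax_with_temperature(logits, 0.5)    # favors the most likely value
neutral = softmax_with_temperature(logits, 1.0)  # the unmodified distribution
flat = softmax_with_temperature(logits, 2.0)     # more diverse samples

rng = random.Random(0)
imputed = sample_value(categories, logits, 1.0, rng)
conditioned = sample_value(categories, logits, 1.0, rng, fixed_value="high")
```

Lower temperatures push samples toward the most probable values, which favors rule adherence. Higher temperatures spread probability mass over less likely values, which favors diversity.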
High-Fidelity Synthetic Data, Fast
We built TabularARGN to deliver high-fidelity results in a fraction of the time typically seen with other deep generative approaches. TabularARGN’s performance has been rigorously benchmarked against other open-source models. The table below reports synthetic-data accuracy and training times for one flat and one sequential dataset:
Key Highlights
- Flat Tables: On the Adult dataset, TabularARGN achieves an accuracy of 97.9%, comparable to SOTA benchmark methods, while training 16x faster.
- Sequential Tables: For datasets like Baseball, TabularARGN outperforms all baseline models by 9 percentage points and achieves training times up to 100x faster.
Even when incorporating differential privacy (DP-SGD), TabularARGN maintains competitive accuracy, demonstrating its adaptability without compromising quality.
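For readers unfamiliar with DP-SGD, the sketch below shows the core of a single update step in plain Python: clip each example's gradient to a maximum norm, sum the clipped gradients, add Gaussian noise scaled to that norm, then average and apply the update. It illustrates the general technique only, not how TabularARGN or the SDK implements it. The function names and numbers are illustrative, and the privacy accounting that turns the noise level into an epsilon guarantee is omitted.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale a per-example gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(weights, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One DP-SGD update: clip per example, sum, add noise, average, descend."""
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    summed = [sum(component) for component in zip(*clipped)]
    # Gaussian noise with standard deviation noise_multiplier * max_norm.
    noisy = [s + rng.gauss(0.0, noise_multiplier * max_norm) for s in summed]
    batch_size = len(per_example_grads)
    return [w - lr * n / batch_size for w, n in zip(weights, noisy)]


# Hypothetical two-parameter model with a batch of two per-example gradients.
weights = [0.5, -0.2]
per_example_grads = [[3.0, 4.0], [0.3, 0.4]]

# With the noise switched off, only the clipping effect remains visible.
no_noise = dp_sgd_step(weights, per_example_grads, lr=0.1, max_norm=1.0,
                       noise_multiplier=0.0, rng=random.Random(0))

# With noise on, the update is randomized at a scale tied to the clipping norm.
with_noise = dp_sgd_step(weights, per_example_grads, lr=0.1, max_norm=1.0,
                         noise_multiplier=1.0, rng=random.Random(0))
```

Clipping bounds how much any single record can move the model, and the noise masks what remains of that influence. This is also why some accuracy cost is expected when training with differential privacy.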
Ready for Real-World Demands
TabularARGN isn’t just fast and accurate—it’s also robust. It has undergone extensive testing in production environments and across an array of tabular structures. It automatically adapts to mixed data types, handles missing values, and scales to millions of records without getting bogged down in training bottlenecks. Its reliability means teams can confidently run multiple synthetic data generation jobs on large datasets and trust the results.
Now Open-Source in Our SDK
We’re now pleased to release TabularARGN’s core implementation under the permissive Apache License 2.0. Starting today, you can:
- Explore the code in our public SDK repository.
- Build on top of our reference model architecture.
- Experiment with your own data and workflows.
We see this as a crucial step toward fostering greater collaboration within the synthetic data community. By open-sourcing TabularARGN, we hope to accelerate research in advanced data privacy, fairness, and machine learning applications—all while helping organizations tap into new, safe ways of sharing data.
Learn More
TabularARGN reflects our commitment to bridging cutting-edge research with practical applications. For a deeper dive into the architecture and benchmarks, read our paper, or explore the implementation through our open-source SDK.
Get started today and discover how TabularARGN can meet your synthetic data needs. Have questions or feedback? Reach out to us or join the conversation in our GitHub discussions!