f-105 - Latest release
f-104
Snowflake support
You can now create Snowflake connectors to read original data directly from, and write synthetic data directly to, your Snowflake databases.
Auto-detection of CSV data types
MOSTLY AI now instantly recognizes the correct data types for uploaded CSV files. Previously, this was done as part of the data synthesis.
With this change, the Encoding Type AUTO is now deprecated.
Support for Gzip and Bzip2 files
You can now speed up the provisioning of large files by uploading them as Gzip (.gz) or Bzip2 (.bz2) archive files.
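As an illustration of preparing such an archive before upload, here is a minimal sketch that uses only the Python standard library. The helper name `compress_file` and the file name in the usage note are hypothetical and not part of MOSTLY AI.

```python
import bz2
import gzip
import shutil
from pathlib import Path


def compress_file(source: Path, method: str = "gzip") -> Path:
    """Compress `source` next to itself and return the path of the archive."""
    # Hypothetical helper: maps a method name to an opener and a file suffix.
    openers = {"gzip": (gzip.open, ".gz"), "bzip2": (bz2.open, ".bz2")}
    if method not in openers:
        raise ValueError(f"unsupported method: {method}")
    opener, suffix = openers[method]
    target = source.with_name(source.name + suffix)
    # Stream the bytes so that large files are not loaded into memory at once.
    with source.open("rb") as src, opener(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

For example, `compress_file(Path("customers.csv"), "bzip2")` would write `customers.csv.bz2` alongside the original file.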
Specify single files from cloud buckets
Previously, you could only specify a containing folder on a cloud bucket. With this release, you can also specify the path to an individual file on a bucket.
Support for JSON Lines, Feather, and ORC formats (experimental)
You can now provide your original data in JSON Lines, Feather, or ORC format.
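For readers unfamiliar with JSON Lines, the format stores one JSON object per line. The sketch below shows a round trip with the Python standard library. The function names and sample records are illustrative assumptions and say nothing about how MOSTLY AI parses these files.

```python
import json
from typing import Iterable


def to_json_lines(records: Iterable[dict]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


def from_json_lines(text: str) -> list[dict]:
    """Parse JSON Lines text back into a list of records, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```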
f-102
Home page
We want to welcome you to the new Home page in the top navigation bar. With the Home page, you have easier and direct access to MOSTLY AI features. You can review them below.
-
Upload files
On the Upload files tab, you can upload (drag-and-drop or browse to select) a CSV or Parquet file with data to immediately configure and start a synthetic data job.
-
Connect to a source
On the Connect to a source tab, you can immediately create a connection to a new database or cloud bucket.
-
Start a synthetic data job with an existing sample dataset
Under Or use sample data, you can immediately start a synthetic data job with any of the available datasets. Pick one and start a synthetic data job for it with the Start button.
-
Last six completed jobs
Under Existing synthetic datasets, you can review the last six completed jobs. The card for each job indicates whether the synthetic data passed the Privacy check and what its overall Accuracy is.
v3.0
Kubernetes and OpenShift support
MOSTLY AI 3.0 uses Kubernetes and OpenShift as its deployment method.
Smart imputation
Smart imputation allows the user to create a synthetic dataset where specific columns don’t contain null values.
Rebalancing
Rebalancing allows you to specify the distribution of specific values in a column. Using Rebalancing, you can create a large number of relevant business scenarios out of the few that are present in your data. Use it to simulate what-if scenarios based on your historical data, or make minority classes visible for downstream machine learning algorithms.
Generation mood
Generation mood allows you to control the degree to which the synthetic version of the column will adhere to the detected distributions and correlations in the original data. The following generation mood settings are available:
Conservative | Generates synthetic data strictly within the business rules captured in the data.
Representative | Generates synthetic data that adheres less strictly to the business rules captured in the data.
Creative | Generates synthetic data skewed toward the outliers of the detected distributions.
New QA Report that reflects Programmable synthetic data metrics
With the introduction of Programmable data, we now provide quality assurance metrics for the model and the data separately.
New User Interface
The look and feel of the application has been updated, along with the following improvements:
-
Flows and page elements are now consistent throughout the application, so you can use it more efficiently.
-
The steppers and information boxes help you through your journey.
-
Data, Training, and Output settings are separated into different tabs.
-
Configured numbers are now displayed with a thousands separator, a visual cue that helps you work more efficiently with large numbers.
Rare / Extreme Value Protection updates
Improvements
Resolved issues
MPD-2715 | PK and FK relationships are not correctly set for file-based jobs.
MCD-1469 | Fixed the issue that catalogs with multiple context foreign keys may not complete synthetic data generation.
MCD-1445 | Fixed the issue that batch sizes greater than 4096 crash synthetic data generation.
MCD-1438 | Fixed the issue that in database synthesization jobs, tables whose names start with an _ fail to be read.
MCD-1432 | Fixed the issue of misalignment of data partitions occurring when the subject table is big and the linked table is small.
MPD-2576 | For Ad hoc jobs, the default rare category protection method is now constant instead of sample.
MPD-2532 | Fixed the issue that tables with multiple foreign keys may crash when the relationships have been edited in the data catalog.
MPD-2480 | Fixed the issue that users cannot upload tables that are partitioned over multiple files.
MPD-2478 | Fixed the issue that free version users see Local Server as a data connector option even though it is unavailable to them.
MPD-2470 | Fixed the issue that Mock is selectable as an encoding type.
MPD-2444 | Fixed the issue that the encoding type is not saved when a linked table column is set to ITT.
MPD-2604 | Fixed the issue that in Ad hoc jobs, column settings are not persisted after saving when switching tabs.
MPD-2340 | In Ad hoc jobs and Cloud storage data catalogs, the Edit relationships drawer is automatically shown to the user if the foreign key is not found.
MPD-2443 | Certain database relationships result in two context foreign keys to the same referenced table, resulting in an error during synthesization.
MPD-2395 | When creating a data connector, the schema field is marked as mandatory for databases that don’t require it.
MPD-2381 | For Ad hoc jobs and cloud storage catalogs, the linked table’s first column is automatically selected as the foreign key.
MPD-2378 | When a table has an unexpected character, the error message doesn’t mention the issue as such, nor does it state where it occurs.
MPD-2339 | If there is only one referring table, it doesn’t show up in the
MPD-2281 | The column settings drawer shows the incorrect generation method for Smart Select and context foreign keys.
MPD-2060 | For users of the free version,
MCD-1381 | Missing values in the numerical columns of Parquet files are not correctly read.
MCD-1373 | The Smart Select algorithm throws an error if the referring table is empty.
MCD-1364 | The database data connector throws an error if there are empty tables.
MPD-2371 | Tables are not shown in alphabetical order in the 'Database contents' section of the database table selection step.
MPD-2357 | The job settings' column details of uploaded Parquet files show
MPD-2356 | Parquet files cannot be used as a seed for the
MPD-2351 | When starting an Ad hoc job, users can upload two different files as a subject table.
MPD-2347 | Reference tables' primary keys are not copied but generated.
MCD-1325 | QA report generation fails when analyzing database datetime columns that contain values in an unknown format.
MCD-1327 | Sequence lengths are incorrectly calculated in an edge case scenario.
MPD-2194 | When creating or modifying a data connector, the
MPD-2178 | Whitespace in the header row of CSV files causes issues during synthesization.
MCD-1275 | QA report generation fails when synthesizing Parquet files.
MCD-1273 | Incorrect processing of scientific notation in CSV files.
MCD-1266 | Certain datetime ranges are incorrectly processed as strings.
MCD-1265 | Restrictive rules causing the QA report to fail in certain edge cases.
MCD-1261 | Long warning messages within the app’s architecture cause it to crash.
MCD-1260 | QA report fails when a column is configured as 'mock data'.
MCD-1259 | Incremental timestamps in time-series data may generate inconsistent synthetic data when configured as ITT.
MCD-1258 | QA report fails when a numerical column is completely empty.
MCD-1257 | Synthesization fails if the linked table’s entries are not linked to the subjects in the subject table.
v2.4.4
Improvements
MCD-1217 | When synthesizing databases, the data types of the original schema are now respected, regardless of encoding type.
Resolved issues
MCD-1469 | Fixed the issue that catalogs with multiple context foreign keys may not complete synthetic data generation.
MCD-1445 | Fixed the issue that batch sizes greater than 4096 crash synthetic data generation.
MCD-1438 | Fixed the issue that in database synthesization jobs, tables whose names start with an _ fail to be read.
MCD-1432 | Fixed the issue of misalignment of data partitions occurring when the subject table is big and the linked table is small.
MPD-2576 | For Ad hoc jobs, the default rare category protection method is now constant instead of sample.
MPD-2532 | Fixed the issue that tables with multiple foreign keys may crash when the relationships have been edited in the data catalog.
MPD-2480 | Fixed the issue that users cannot upload tables that are partitioned over multiple files.
MPD-2478 | Fixed the issue that free version users see Local Server as a data connector option even though it is unavailable to them.
MPD-2470 | Fixed the issue that Mock is selectable as an encoding type.
MPD-2444 | Fixed the issue that the encoding type is not saved when a linked table column is set to ITT.
MPD-2604 | Fixed the issue that in Ad hoc jobs, column settings are not persisted after saving when switching tabs.
MPD-2340 | In Ad hoc jobs and Cloud storage data catalogs, the Edit relationships drawer is automatically shown to the user if the foreign key is not found.
v2.4.3
Improvements
MPD-2175 | When running a job, the
MPD-2267 | The QA report for linked tables no longer displays the linked table name along with the context table name.
MPD-2088 | When adding new foreign keys with the relationships drawer, if more than one parent table lacks a primary key, the error message now shows all of these tables instead of only the first one.
Resolved issues
MPD-2443 | Certain database relationships result in two context foreign keys to the same referenced table, resulting in an error during synthesization.
MPD-2395 | When creating a data connector, the schema field is marked as mandatory for databases that don’t require it.
MPD-2381 | For Ad hoc jobs and cloud storage catalogs, the linked table’s first column is automatically selected as the foreign key.
MPD-2378 | When a table has an unexpected character, the error message doesn’t mention the issue as such, nor does it state where it occurs.
MPD-2339 | If there is only one referring table, it doesn’t show up in the
MPD-2281 | The column settings drawer shows the incorrect generation method for Smart Select and context foreign keys.
MPD-2060 | For users of the free version,
MCD-1381 | Missing values in the numerical columns of Parquet files are not correctly read.
MCD-1373 | The Smart Select algorithm throws an error if the referring table is empty.
MCD-1364 | The database data connector throws an error if there are empty tables.
v2.4.2
Improvements
-
Multiple synthesization jobs started at the same time will now be processed one by one instead of all at once.
Resolved issues
MPD-2371 | Tables are not shown in alphabetical order in the 'Database contents' section of the database table selection step.
MPD-2357 | The job settings' column details of uploaded Parquet files show
MPD-2356 | Parquet files cannot be used as a seed for the
MPD-2351 | When starting an Ad hoc job, users can upload two different files as a subject table.
MPD-2347 | Reference tables' primary keys are not copied but generated.
MCD-1325 | QA report generation fails when analyzing database datetime columns that contain values in an unknown format.
MCD-1327 | Sequence lengths are incorrectly calculated in an edge case scenario.
v2.4.1
Improvements
-
Ad hoc jobs can now synthesize Parquet files.
-
CSV files can now have semicolons (;) as well as commas (,) as column separators.
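To illustrate what reading both separator styles can involve, the following sketch picks the separator by counting candidates in the header row and then parses the text with Python's `csv` module. The header-count heuristic and the function name `read_csv_rows` are assumptions made for this example. The release note does not describe how MOSTLY AI detects the separator.

```python
import csv
import io


def read_csv_rows(text: str) -> list[list[str]]:
    """Parse CSV text whose columns are separated by commas or semicolons."""
    header = text.splitlines()[0] if text else ""
    # Assumption: the separator is whichever candidate occurs more often
    # in the header row; ties and empty input fall back to the comma.
    delimiter = ";" if header.count(";") > header.count(",") else ","
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))
```

A heuristic like this can misfire when header names themselves contain commas or semicolons, so treat it as a starting point only.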
Resolved issues
MPD-2194 | When creating or modifying a data connector, the
MPD-2178 | Whitespace in the header row of CSV files causes issues during synthesization.
MCD-1275 | QA report generation fails when synthesizing Parquet files.
MCD-1273 | Incorrect processing of scientific notation in CSV files.
MCD-1266 | Certain datetime ranges are incorrectly processed as strings.
MCD-1265 | Restrictive rules causing the QA report to fail in certain edge cases.
MCD-1261 | Long warning messages within the app’s architecture cause it to crash.
MCD-1260 | QA report fails when a column is configured as 'mock data'.
MCD-1259 | Incremental timestamps in time-series data may generate inconsistent synthetic data when configured as ITT.
MCD-1258 | QA report fails when a numerical column is completely empty.
MCD-1257 | Synthesization fails if the linked table’s entries are not linked to the subjects in the subject table.
v2.4
Synthesize databases even if they don’t have a schema, and impress your colleagues with its QA report.
-
Relationship manager
Use the relationship manager to add and modify relationships so that you can tailor the synthetic version of your database entirely to your use case. It’s specifically designed to help you synthesize databases without schema or with an incomplete schema.
-
A QA report for everyone
You can now download and share the QA report of your synthetic databases with your colleagues. Not only did we make it easy to share, but also easy to read!
We worked on numerous improvements that help you assess synthetic data quality and convey the message that your synthetic data is privacy-secure and an accurate representation of your company’s valuable data assets.
Relationship Manager
Whether your database is small or large, with or without schema, we’ve got you covered. You can now complete the relationships between your database’s tables so that all of its data assets can be properly secured and accurately synthesized, QA report included.
And if you’re dealing with partially defined relationships and don’t know which ones are missing, you can count on us as well. Our handy 'Tables without relations' filter gets you going in no time!
Working with the relationship manager is not complicated either. Watch this 6-minute video tutorial to get you up to speed.
QA report
-
Improved interactive charts help you easily pinpoint and identify potential accuracy issues in the synthetic data.
-
There’s no need to wait for it either! QA report generation now takes seconds per table rather than minutes, so you can immediately assess the quality of your synthetic data.
-
Explainer sections in the report help the reader understand what they’re looking at.
-
The QA report now comes in a handy, self-contained HTML document that retains all interactive charts when sharing it across your business and partnerships.
Resolved issues
MPD-2198 | When the ‘number of generated subjects’ is left blank, the
MPD-2185 | Incorrect number of columns reported in the QA report.
MPD-2180 | ‘Cancel training’ and ‘Cancel generation’ buttons are not working when synthesizing data.
MCD-2057 | UI issues when creating an Oracle database data connector.
MCD-1177 | Incorrect handling of SID and SERVICE_NAME connections to Oracle databases.
MCD-1169 | The QA report of certain datasets has incorrectly placed labels in the correlation matrix.
MCD-1163 | Numerical columns may generate a casting exception during generation, causing a job failure.
v2.3
Whether you’re a student, small business, or enterprise, our Synthetic Data Platform is ready to serve your needs
-
Effortless onboarding with our new video tutorials
Our new video tutorials help users start synthesizing their company’s valuable data assets right away and help them understand what’s going on in each step.
-
Audit logs for compliance and security
MOSTLY AI’s audit log keeps track of who accessed the system, what they looked at, and what actions they took.
-
Improved synthesization of your database’s sequences
The order of your linked tables' lists, sequences, and time-series data embodies valuable information. MOSTLY AI now allows you to sort your linked tables by column so that all sequential information is optimally preserved.
Free edition
The best AI-driven synthetic data generator is available free of charge forever for generating up to 100K rows daily. If you want to generate high-quality, privacy-safe synthetic versions of your datasets for machine learning, testing or data sharing use cases, MOSTLY AI’s synthetic data generator is at your service. And it’s available straight from your browser after a simple registration.
Effortless user onboarding with video tutorials
Our new video tutorials help users start synthesizing their company’s valuable data assets right away and help them understand what’s going on in each step. There are three video tutorials available:
-
Privacy-secure your customer data
Users will learn to synthesize a table with basic customer profile information, such as name, address, and birth date, and get a glimpse into the type of insights they can obtain from it.
-
Privacy-secure behavioral customer data
Users will learn to synthesize a subject table-linked table dataset and understand how to deal with lists, sequences, and time-series data.
-
Create a realistic and secure test database
Test engineers will learn to create a subset of a production database that is privacy secure and referentially intact while maintaining all business rules and relevant business scenarios for testing.
Audit logs for compliance and security
System administrators can now retrieve an audit log from the MOSTLY AI Synthetic Data Platform. It keeps track of who accessed the system, what they looked at, and what actions they took. This temporal information is important for proving compliance and security.
Improved synthesization of your database’s sequences
The order of your linked tables' lists, sequences, and time-series data embodies valuable information. MOSTLY AI now allows you to sort your linked tables by column so that all sequential information is optimally preserved. For time-series data, you can now also select the ITT (Inter-Transaction-Time) encoding type. It models the time interval between two subsequent events, resulting in a very accurate rendering of the time between events.
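The idea behind inter-transaction times can be sketched in a few lines of Python: instead of working with absolute timestamps, each event is described by the gap to the previous event. The function name and the sample data in the usage note are illustrative assumptions. This is not MOSTLY AI's encoding implementation.

```python
from datetime import datetime, timedelta


def inter_transaction_times(timestamps: list[datetime]) -> list[timedelta]:
    """Return the gaps between consecutive events, after sorting them by time."""
    ordered = sorted(timestamps)
    # Pair each event with its successor and take the difference.
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]
```

For a subject with events on 1 January at 09:00, 2 January at 09:00, and 2 January at 21:00, the function returns a gap of one day followed by a gap of twelve hours.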
Resolved issues
MPD-2090 | License renewal issues.
MPD-2033 | QA report generation crashes when a CSV file contains
MCD-1131 | Out of memory issues when synthesizing subject table-linked table datasets.
MCD-1079 | The trained AI model is lost when the training crashes.
MCD-1070 | In rare cases, numerical values are incorrectly detected as boolean values.
v2.2
Transform your business with synthetic data that’s effortlessly privacy-secure, efficient, and fast
-
Take advantage of a synthetic data engine that’s mindful of your time and hardware resources.
-
Benefit from a much-simplified preparation of your synthesization jobs. The web UI now serves your goals, while MOSTLY AI handles complex configurations in the background.
-
Our new user management system lets you create groups, manage group-level access permissions, and lets users share synthetic data assets with these groups.
-
MySQL support enables synthetic data in the cloud, integrating MOSTLY AI with cloud databases like AWS Aurora, Google Cloud SQL, and many more.
A smarter, faster & more efficient MOSTLY AI
For the past few months, we have been working to make synthetic data work for your business. Here are some of the highlights:
-
We achieved a more than two-fold increase in synthesization speed, significantly reducing the resource footprint of synthetic data in your company.
-
Preparing synthetic data has become much simpler. The engine now determines the best AI model and outlier protection settings for your dataset and use case.
-
Benefit from better resilience for missing files, rows, columns, and so on. Defects in your data sources will no longer cause issues.
Increased speed
Benefit from a smaller synthetic data footprint and shorter time-to-data
-
MOSTLY AI not only processes datasets of virtually unlimited size, it now also ingests and encodes them faster than before.
-
Overall AI model training time has been halved, and wide tables now benefit from a faster first training epoch.
-
We achieved a ten-fold increase in synthetic data generation performance. What MOSTLY AI used to generate in minutes can now be done in seconds.
Better synthetic data
Privacy-security is now out-of-the-box and takes zero effort to realize
-
MOSTLY AI protects rare categories by replacing them with non-rare categories. Release 2.2 replaces them in a context-aware manner. For instance, if a female data subject has a rare name, it will be replaced with a female non-rare name.
-
Rare category protection can no longer be adjusted or turned off.
-
Extreme values are now protected in all numerical formats, including datetime and ITT.
-
Lists, sequences, and time-series data now benefit from extreme sequence length protection.
-
Improved accuracy of sequence length distributions in the synthetic data, as minimum sequence lengths are now respected.
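The context-aware replacement of rare categories described in the first bullet can be pictured with a toy example. The sketch below is an assumption-laden illustration, not MOSTLY AI's algorithm: the `min_count` threshold, the choice of the most frequent non-rare value as the replacement, and the function name are all hypothetical.

```python
from collections import Counter


def protect_rare_categories(rows, column, context, min_count=2):
    """Replace rare values of `column` with the most common non-rare value
    found among rows that share the same `context` value."""
    counts = Counter(row[column] for row in rows)
    # Count the non-rare values separately for each context group.
    per_context = {}
    for row in rows:
        if counts[row[column]] >= min_count:
            per_context.setdefault(row[context], Counter())[row[column]] += 1
    protected = []
    for row in rows:
        value = row[column]
        candidates = per_context.get(row[context])
        # Rows whose context group has no non-rare value are left unchanged
        # in this sketch; a real system would need a fallback for them.
        if counts[value] < min_count and candidates:
            value = candidates.most_common(1)[0][0]
        protected.append({**row, column: value})
    return protected
```

With the columns `name` and `gender`, a name that occurs only once among female subjects would be replaced by the most frequent non-rare name among female subjects, which mirrors the example in the release note.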
Simpler preparation
-
Use the batch size AI model training parameter to balance training speed with memory availability. The appropriate learning rate is now calculated in the background.
-
If your synthesization job doesn’t run as desired, you can choose a smaller or bigger AI model size to mitigate the issue.
-
The job summary now shows a progress bar for each epoch, giving you an indication of how long AI model training will take.
-
The "generate more data" function for synthesization jobs created with release 2.2 will now work with all upcoming versions of MOSTLY AI.
Manage users and groups
Create groups and let users share assets across them
-
As an admin, you can now create groups and manage group-level access permissions. This makes it easier to manage permissions for multiple users or reassign individual users if they change jobs in the organization.
-
As a user, you can now share synthetic data assets with your group or with other groups.
MySQL Data connector
Use the MySQL family of databases for synthetic data
The MySQL data connector enables synthetic data in the cloud and integrates MOSTLY AI with cloud databases like AWS Aurora, Google Cloud SQL, and many more.
Resolved issues
MPD-1781 | License issues due to restarted VMs.
MPD-1439 | The data connector details view doesn’t show the database name.
MCD-952 | The AI server crashes when there’s an issue with assigning foreign keys using Smart Select.
MCD-939 | Generation crashes if the precision is specified for columns with floating point numbers.
v2.1
Equip yourself with the most comprehensive synthetic data platform on the market
MOSTLY AI 2.1 continues our mission to deliver an enterprise-grade synthetic data platform and remain the leader in the tabular synthetic data space.
-
We now support the DB2 family of databases, enabling synthetic data for mainframe applications.
-
Our new Text encoding type allows you to synthesize unstructured natural language. MOSTLY AI 2.1 now covers all tabular data types, from categories to geolocation data and beyond. The world is all yours!
-
Benefit from searchable and interactive charts in the QA report, allowing you to intuitively spot opportunities to further improve synthetic data quality.
Synthetic text
Put unstructured natural language texts to use in your AI/ML applications.
Insurance claim reports, medical diagnoses, and other types of unstructured texts are very rich sources of information, capturing details that aren’t present in numbers or other structured forms of data.
Our new Text encoding type allows you to privacy-protect these texts and put them to use in various AI/ML use cases, for example:
-
Named-entity recognition
-
Sentiment analysis
-
Testing—by generating real descriptions
-
E-commerce analytics—by synthesizing customers' search keywords
DB2 Data connector
Use DB2 databases for synthetic data
You can now connect MOSTLY AI to the DB2 family of databases.
Use them as a data source or as a destination, and enable synthetic data for mainframe applications.

Updated QA Report
All privacy and accuracy charts are now within easy reach
Our privacy and accuracy charts are now available in the web UI, so you can intuitively evaluate the quality of your synthetic data.
Spot opportunities to further improve quality and immediately apply them to the job settings.
-
Use the search function to look up specific columns.
-
Interactive charts allow you to learn more about specific data points.
-
Enlarge them to study them in detail.

Resolved issues
MPD-1596 | Job cancellation hangs when using AWS ECR.
MCD-885 | Job won’t start if the
MCD-754 | The Generate more data feature crashes with some of the supported datetime formats.
MCD-806 | Jobs may crash if they process very wide tables.
MCD-760 | AI model training crashes when consistency correction and GPU acceleration are both active.
v2.0
Synthesize your data wherever it is
MOSTLY AI 2.0 is now capable of synthesizing entire databases!
It connects to your data sources, recognizes their columns and relationships, and provides you with a synthetic version of your data wherever you need it.
There are no more limits to what you can synthesize. Connect to your databases, buckets, and files without any hurdles.
Be ready for the synthetic data revolution. It’s already here.
New UI
A new customer-centric UI
With MOSTLY AI 2.0, we introduce a new UI!
The new UI has been redesigned with a customer-centric approach.
The task of creating a new synthesization job has never been easier.
And it looks cool too!
Multi-table data catalog
Synthesize complex data structures
With MOSTLY AI 2.0, it is now possible to define multi-table data catalogs!
The complexity of your data source is now represented in the data catalog:
-
Support of primary keys,
-
Support of foreign keys,
-
and Referential integrity.
The platform understands the relationships between all the tables and creates a synthesization plan based on these relationships.
The result is a synthetic version of your data in its original form!
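Referential integrity, the last item in the list above, means that every foreign key value in a child table points to an existing primary key in its parent table. The sketch below shows one way to check this on synthetic output. The function name and the table and column names in the usage note are hypothetical.

```python
def orphaned_foreign_keys(parent_rows, child_rows, primary_key, foreign_key):
    """Return child foreign-key values that have no matching parent primary key."""
    parent_keys = {row[primary_key] for row in parent_rows}
    # Any value left after removing the known parent keys is an orphan.
    return sorted({row[foreign_key] for row in child_rows} - parent_keys)
```

For instance, calling it with a `customers` table keyed by `id` and an `orders` table that references `customer_id` returns an empty list when the two tables are referentially intact.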
Parallel computing
Better performance thanks to parallelization
Thanks to a major architectural redesign, the MOSTLY AI platform now supports parallel computing.
For multi-table synthetic data generation, the MOSTLY AI platform intelligently distributes the tasks that can run in parallel across the available VMs.
Data connectors
Create your data source once and re-use it!
You can now define data connectors in the MOSTLY AI platform!
A data connector can be used as a data source or as a data destination.
You can fetch data from your production data lake or database and push it wherever you need!
Mock data
A perfect way to test the extremes
Some of the biggest challenges when testing software can be getting the software into some very specific states. You want to test that the new error message works, but this message is only shown when something breaks. You may have no direct control over the data that triggers it, so you need a way to manipulate this data in order to perform your tests.
You can now define Mock Data in the generation process!
Mock data makes it possible to simulate errors and circumstances that would otherwise be very difficult to create in a real world environment.