Release notes

What's new in MOSTLY AI

v200

Feb 29th, 2024

Decoupling of model training and data generation

  • Introduction of the Generator concept
  • With generators, workflows are updated in a way that better reflects how the platform is used
  • Users with access to production data (Model Creators) can train Generative AI on tabular data
  • Users without access to production data (Data Consumers) can use trained Generators to create Synthetic Datasets
  • Data Consumers have great flexibility in how they generate synthetic data for their specific needs
  • Model Creators can describe their curated synthetic data asset before sharing with the world
  • Generators are now a shareable data asset

Improved UI/UX

  • Overall improved look & feel of the platform for more intuitive workflows
  • Faster and simpler configuration of multi-table setups
  • Faster and simpler configuration of advanced features, such as value protection and flexible generation
  • More flexible way of configuring and working with data connectors
  • More flexible way to control temperature of synthetic data generation
  • Convenient way of defining maximum training time for predictable Generator creation duration

Python Client & REST API

  • Provides full programmatic control of the platform
  • Especially helpful for anyone who wants to work with synthetic data directly from their code, such as data scientists who work in Jupyter Notebooks

Highly-scalable low-latency engine

  • Improved performance and speed across the entire synthesis process
  • Optimized performance for very large datasets

Improved data quality

  • Excellent sequence length capabilities, designed specifically for large transaction data (0000s of events per customer), i.e., multi-sequence, multivariate time series

Strengthened privacy protection

  • Improvements to prevent models from memorizing small datasets

Flexible rebalancing

  • For any number and any kind of attributes

Seed generation for single table

  • Conditional generation with seed is now available on single-table generators

Improved & easier deployment

  • More flexible options when it comes to required storage classes
  • Improved memory management for more robust platform operation
  • Simplified Helm Charts
  • Centralized logging for easier maintenance and issue remediation

v122

Feb 21st, 2024

Improvements

  • MSD-309 - Improved database connection management by maintaining up to 4 simultaneous connections and quickly closing any connections that become idle
  • MSD-314 - Improved the logging to show that original data is deleted immediately after AI training completes
  • MSD-279 - Improved AI training memory usage for datasets with high-cardinality categories and long sequences
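The connection behavior described in MSD-309 can be pictured with a small, generic sketch. This is not the platform's implementation: the class name `BoundedPool`, its methods, the `idle_timeout` parameter, and the use of `sqlite3` as a stand-in database are all assumptions made for illustration. Only the limit of 4 simultaneous connections comes from the note above.

```python
import sqlite3
import threading
import time


class BoundedPool:
    """Allow at most `max_size` open connections and close idle ones."""

    def __init__(self, connect, max_size=4, idle_timeout=1.0):
        self._connect = connect            # hypothetical connection factory
        self._slots = threading.BoundedSemaphore(max_size)
        self._idle = []                    # (connection, time it was returned)
        self._idle_timeout = idle_timeout  # seconds; an assumption of this sketch
        self._lock = threading.Lock()

    def acquire(self):
        # Blocks while `max_size` connections are already in use.
        self._slots.acquire()
        with self._lock:
            if self._idle:
                return self._idle.pop()[0]
        return self._connect()

    def release(self, conn):
        # Return the connection to the idle list and free its slot.
        with self._lock:
            self._idle.append((conn, time.monotonic()))
        self._slots.release()

    def close_idle(self):
        """Close connections idle for at least `idle_timeout` seconds."""
        now = time.monotonic()
        with self._lock:
            stale = [c for c, t in self._idle
                     if now - t >= self._idle_timeout]
            self._idle = [(c, t) for c, t in self._idle
                          if now - t < self._idle_timeout]
        for conn in stale:
            conn.close()
        return len(stale)


pool = BoundedPool(lambda: sqlite3.connect(":memory:"),
                   max_size=4, idle_timeout=0.0)
conns = [pool.acquire() for _ in range(4)]
for conn in conns:
    pool.release(conn)
closed = pool.close_idle()
```

A fifth `acquire()` call would block until one of the four connections is released. In a real service, `close_idle()` would typically run on a timer.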

v113.8LTS

Feb 16th, 2024

Improvements

  • MSD-309 - Improved database connection management by maintaining up to 4 simultaneous connections and quickly closing any connections that become idle
  • MSD-314 - Improved the logging to show that original data is deleted immediately after AI training completes

v121

Jan 17th, 2024

Resolved issues

  • MSD-209 - Improved the data quality for long sequences.
  • MSD-223 - Increased an internal timeout to better handle larger CSV files.

v120

Dec 21st, 2023

New sequence and time-series training strategy

Drastically improved training performance for long sequences by allowing users to specify the maximum number of records to consider for each sequence during training of the generator model.
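As a rough illustration of the idea (not the platform's implementation), capping the number of records considered per sequence can be sketched in plain Python. The function name `cap_sequence_length`, the `(subject_id, record)` event layout, and the choice to keep the earliest records are all assumptions made for this example.

```python
from collections import defaultdict


def cap_sequence_length(events, max_len):
    """Keep at most `max_len` records per subject, preserving input order.

    `events` is an iterable of (subject_id, record) pairs. Keeping the
    earliest records is an assumption of this sketch.
    """
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    kept = defaultdict(list)
    for subject_id, record in events:
        if len(kept[subject_id]) < max_len:
            kept[subject_id].append(record)
    return dict(kept)


events = [("u1", "a"), ("u1", "b"), ("u1", "c"), ("u2", "x")]
capped = cap_sequence_length(events, max_len=2)
```

Shorter sequences mean less data per subject in each training step, which is one plausible reason a cap like this reduces training time.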

To use the Generate more data option with sequence data, you will need to create a new synthetic dataset.

v119

Dec 18th, 2023

Migration to PyTorch

MOSTLY AI has now migrated to PyTorch! The implementation of PyTorch now provides 2x to 3x faster AI model training times and up to 2x faster synthetic data generation times, reduced memory footprint, and better compute resource utilization.

To use the Generate more data option, you will need to create a new synthetic dataset.

Updated heuristic of Batch size = Auto

We updated the heuristic behind the selection of Batch size = Auto so that it auto-selects batch sizes for more optimal training times based on your subject and linked table data.

Resolved issues

Security and vulnerability fixes

v113.7LTS

Dec 7th, 2023

Resolved issues

  • MPD-3807 - Fixed an issue that caused the creation of new synthetic datasets with the api/v2/jobs endpoint to fail with the error "Job with catalog can't be started, because the catalog is not completed yet".
  • MCD-2309 - Security fixes by upgrading pyarrow library.
  • MCD-2295 - Resolved an issue where primary keys were unexpectedly enforced even when not explicitly configured through the UI. Primary keys are now enforced only when configured through the UI.

v118

Nov 22nd, 2023

Resolved issues

  • MCD-2309 - Security fixes by upgrading pyarrow library
  • MCD-2295 - Resolved an issue where primary keys were unexpectedly enforced even when not explicitly configured through the UI. Primary keys are now enforced only when configured through the UI.
  • MCD-2300 - Reduced the generation batch size to make it even more conservative in resolving out-of-memory issues.

v113.6LTS

Nov 10th, 2023

Improvements

  • Updated generation batch size logic to take into account the model size

Resolved issues

  • MPD-3707 - Security fixes by upgrading to JDK 17, Spring Boot 3, and related libraries and components.

v117

Nov 9th, 2023

Improvements

  • Added support for taints and tolerations in the MOSTLY AI Helm chart
  • Security updates

Resolved issues

  • MCD-2292 - Fixed a bug where Smart imputation functioned incorrectly for discrete and binned numeric encoding types.
  • MCD-2296 - Reduced generation batch size to resolve out-of-memory issues.

v113.5LTS

Nov 3rd, 2023

Improvements

  • Added support for taints and tolerations in the MOSTLY AI Helm chart
  • Improved data pull performance for star schemas

Resolved issues

  • MPD-3672 - Fixed a bug that incorrectly triggered the error message "Error while connecting API".
  • MCD-2272 - Fixed a bug in Model QA report that underreported accuracy for datasets exceeding 10k subjects with more than 10k data points each.
  • MCD-2292 - Fixed a bug where Smart imputation functioned incorrectly for discrete and binned numeric encoding types.

v116

Oct 26th, 2023

Improvements

  • The relationship diagram now accurately displays table hierarchies from top to bottom and correctly represents 1-n table relationships.
  • Improved data pull performance for star schemas.

Resolved issues

  • MCD-2246 - Fixed training failures for Text models when using data augmentation features for the non-Text columns.
  • MCD-2272 - Fixed a bug in Model QA report that underreported accuracy for datasets exceeding 10k subjects with more than 10k data points each.

v115

Oct 12th, 2023

The term "Original" replaces "Training" in the QA report

To better indicate the metrics of your original data in the QA report, we replaced the term "Training" with "Original".

Logging of peak virtual memory in AI model training logs

You can now find information about the peak virtual memory reached at the end of each training epoch in the training logs of a synthetic dataset.

Improvements

  • MCD-2252 - The MOSTLY AI engine now uses the latest version of TensorFlow
  • MCD-2245 - Improvements in the synthetic dataset logs for better readability
  • MCD-2214 - Improved how the QA report shows Rare category protection: rare categories on the X-axis of Univariate and Bivariate charts are now labeled with rare followed by a truncated alphanumeric hash string, such as rare...8fs2

Resolved issues

  • MPD-3623 - Fixed the issue that made the View training logs button inactive for the AI model training for columns with the Text encoding type
  • MCD-2244 - Made the application more resilient when Kubernetes sends a restart command during the training step of a synthetic dataset. Previously, this caused synthetic datasets to restart the training step endlessly and never finish
  • MCD-2243 - Fixed the issue that could lead a synthetic dataset to fail with an OutOfMemory error as a result of the original data containing very long sequences
  • MCD-2239 - Fixed the issue that caused a failure during training for linked tables where the linked tables contain very few samples with very short sequences

v113.4LTS

Oct 12th, 2023

Logging of peak virtual memory in AI model training logs

You can now find information about the peak virtual memory reached at the end of each training epoch in the training logs of a synthetic dataset.

Resolved issues

  • MPD-3623 - Fixed the issue that made the View training logs button inactive for the AI model training for columns with the Text encoding type
  • MCD-2244 - Made the application more resilient when Kubernetes sends a restart command during the training step of a synthetic dataset. Previously, this caused synthetic datasets to restart the training step endlessly and never finish
  • MCD-2243 - Fixed the issue that could lead a synthetic dataset to fail with an OutOfMemory error as a result of the original data containing very long sequences
  • MCD-2239 - Fixed the issue that caused a failure during training for linked tables where the linked tables contain very few samples with very short sequences
  • MCD-2193 - Fixed an issue that caused problems when reading a linked table if you excluded columns from the linked table through the UI

v114

Sep 28th, 2023

New table relationships viewer

When you are configuring a synthetic dataset, you can now get an overview of all relationships and foreign key types in the new Relationship diagram.

To open, click the new Relationship diagram button in the Tables page.

Database catalog - Relationship diagram

Resolved issues

  • MCD-2217 - Resolved the error [-<index>] not found in axis which appears during the encoding step of the creation of a synthetic dataset
  • MCD-2211 - Resolved an issue where columns with names that contain a dot character (.) were previously dropped from the synthetic dataset
  • MCD-2210 - Resolved an issue which resulted in failed synthetic datasets when the original data contains a datetime column with a constant value for each row
  • MCD-2208 - Resolved an issue where synthetic datasets started with the Generate more data > with seed option failed with the error unknown EncodingType None for <column_name>
  • MCD-2175 - Resolved the issue where Parquet files are not delivered to cloud bucket destinations
  • MPD-3542 - Fixed the format of the creation date of a synthetic dataset that appears on the Summary page
  • MPD-3630 - Fixed the issue that caused the error 'License is missing' related to the fetching of user data from Keycloak

v113.3LTS

Sep 28th, 2023

Resolved issues

  • MCD-2217 - Resolved the error [-<index>] not found in axis which appears during the encoding step of the creation of a synthetic dataset
  • MCD-2211 - Resolved an issue where columns with names that contain a dot character (.) were previously dropped from the synthetic dataset
  • MCD-2210 - Resolved an issue which resulted in failed synthetic datasets when the original data contains a datetime column with a constant value for each row
  • MCD-2208 - Resolved an issue where synthetic datasets started with the Generate more data > with seed option failed with the error unknown EncodingType None for <column_name>
  • MCD-2175 - Resolved the issue where we delivered only CSV files and not Parquet files to cloud bucket destinations

v113.2LTS

Sep 21st, 2023

Resolved issues

  • MPD-3626 - Fixed the empty downloads of synthetic datasets

v113.1LTS

Sep 19th, 2023

Resolved issues

  • MPD-3542 - Fixed the format of the creation date of a synthetic dataset that appears on the Summary page
  • MPD-3630 - Fixed the issue that caused the error 'License is missing' related to the fetching of user data from Keycloak

v113LTS

Sep 14th, 2023

Auto-adding of child tables

When you add database tables to a synthetic dataset, MOSTLY AI now also automatically adds all related child tables. You no longer need to add related tables manually.

Better guidance for _RARE_ values

You can now find more explanation about _RARE_ values when you hover over each one in the preview of synthetic samples on the Summary page.

Improvements in handling nested table relationships

For multi-table setups with a 3-level hierarchy, correlations between the 3rd-level entities and all the 2nd-level entities that link to the same subject are now retained. For example, in a User > Order > Item setup, all Items now retain correlations to all other Orders that belong to the same User.

Resolved issues

  • MCD-2190 - Resolved an issue with the use of the Numeric:Auto encoding type which caused the generation of synthetic datasets to fail for very large or very small datasets.
  • MCD-2192 - Resolved an issue that caused some values in a numeric column with the auto-detected encoding type Numeric:Binned to be empty
  • MCD-2204 - Resolved an issue where categorical columns were not auto-detected in Parquet files

v112

Aug 31st, 2023

Welcome, Synthetic datasets! (Goodbye, Jobs)

Synthetic datasets are why you use MOSTLY AI! High accuracy, high data quality, privacy-protected synthetic datasets.

We want you to focus on generating synthetic data and we are adding the term to the top-level menu in the MOSTLY AI Synthetic Data Platform!

With that, we also want to say goodbye to Jobs. You served our users well and we are thankful for it!

Source and destination connector types

You can now define each connector as either a data source or a destination. That way, you can only select destination connectors for your synthetic dataset destination and prevent the risk of selecting a data source as the destination.

New design for synthetic datasets summary

When you now open a synthetic dataset from the new Synthetic datasets tab, a new summary page provides easier access to the preview of sample data, the QA report, the tracking of the synthetic dataset progress, and the configuration of the synthetic dataset.

You can use the sidebar on the right to quickly access each section.

  • Overview
  • Sample data
  • QA Report
  • Logs
  • Configuration

Numeric (Auto) encoding type

The new encoding type Numeric (Auto) is now auto-assigned to columns that contain numeric data. Numeric (Auto) uses heuristics to automatically assign the relevant one of the available Numeric encoding types: Discrete, Digit, or Binned.

You no longer need to worry about which Numeric encoding type you need to use. Just select Numeric (Auto).

v111

Aug 17th, 2023

New Numeric encoding types

You can now select from three different Numeric encoding types: Digit, Discrete, and Binned.

Preview of synthetic data is now available for shared jobs

The Synthetic data tab in a completed job is now available on shared jobs. When you share a link to a completed job with your team, they can now open the Synthetic data tab and preview the generated synthetic data.

Drop tables in the destination

The new option Drop tables in the destination in the Output settings will drop any tables that match the names of the tables in your synthetic data job. MOSTLY AI drops the tables at the start of the job before it completes AI model training and data generation.

You can enable Drop tables in the destination after you start a new job and select a database connector as the destination. The option is not available for cloud storage connectors.
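The effect of this option can be sketched generically. The example below is only an illustration, with `sqlite3` standing in for a destination database. The helper name `drop_matching_tables` and the table names are hypothetical, and nothing here reflects how MOSTLY AI issues its statements internally.

```python
import sqlite3


def drop_matching_tables(conn, table_names):
    """Drop destination tables whose names match the job's tables."""
    for name in table_names:
        # Quote the identifier and double any embedded quotes.
        quoted = '"' + name.replace('"', '""') + '"'
        conn.execute(f"DROP TABLE IF EXISTS {quoted}")
    conn.commit()


# Destination with one table that matches the job and one that does not.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER)")
conn.execute("CREATE TABLE audit_log (id INTEGER)")

# The job contains the tables "customers" and "orders".
drop_matching_tables(conn, ["customers", "orders"])

remaining = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
```

Only tables with matching names are removed. Unrelated tables in the destination, such as `audit_log` here, are left untouched.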

Search database tables when adding them to a catalog

In v110, MOSTLY AI introduced a drop-down to add tables from a database. You can now enter a search term in the drop-down to filter the list of tables and more easily find the table you want to add.

v110

Aug 3rd, 2023

Support for multiple tables in ad hoc and cloud storage jobs

You can now create and configure multi-table jobs not only with databases, but also with file uploads and cloud storage catalog jobs.

New Tables tab in job configuration

For each job configuration, the new Tables tab gives you a list of all tables in the job. The Tables tab is also the new home of all training settings that were previously available in the Training settings tab. Moving forward, the Training settings tab will no longer be available.

In the Tables tab, you can now also add and remove tables from a job.

When you start a job, the Tables tab opens and contains no tables. You can add new tables with the Add table button. This action is supported in all job types: ad hoc, database catalogs, and cloud storage catalogs.

Easier database catalog creation flow

With the new database catalog flow, you no longer need to identify subject tables and rank them.

After you select a database connector, MOSTLY AI shows the new Tables tab where you can now add tables from your cloud buckets or databases and remove any tables that you no longer need.

Easier configuration of table relationships

You can now use the Foreign key option in Generation method to define relationships between tables. This is now available in the Data settings tab during job configuration.

To mark a table as a linked table, specify which of its columns is set as a Context foreign key to another table.

Table relationships configuration is no longer required to start a job

You can now start a job with two or more subject tables. You no longer need to define a relationship and mark any of the tables as a linked table.

Reference tables are now only available in old catalogs

With v110 of MOSTLY AI, the concept of reference tables is no longer available for any newly created jobs or catalogs. All tables in a job are either a subject table (by default) or a linked table (after you set a foreign key to another table). You can only view reference tables in catalogs that you created before v110.

However, you can no longer change the configuration of reference tables, such as set any primary or foreign keys.

Updates in Generate more data

You can now use Generate more data for all job types including database catalog jobs.

Because ad hoc jobs can now contain multiple uploaded subject tables, for such jobs you now need to specify the number of newly generated subjects or provide a table seed for every subject table in the job.

Resolved issues

  • MCD-2071 - Implemented better precision when handling primary keys

v109

Jul 19th, 2023

Use different types of data sources and destinations for the same job

Regardless of the type of data source you use for your original data, you can now deliver the generated synthetic data into any type of destination that suits your downstream tasks.

You can now select a different type of connector for the delivery of your synthetic data, so you can mix and match. For example, you can use original data from Databricks but deliver the synthetic data into Snowflake, or use original data from a Microsoft SQL Server database and deliver the synthetic data into a PostgreSQL database.

v108

Jul 6th, 2023

Preview generated synthetic data

When a synthetic data job completes, you can now preview up to the first 100 samples from each generated synthetic table.

Share links to generated synthetic data

With MOSTLY AI, you can now share links to completed synthetic data jobs with anyone. Send the links to colleagues or data-minded friends and they can download the generated synthetic data and review all available QA reports.

Improved star schema support with better handling of correlations between linked tables

We improved the support of star schemas and now provide better handling of the correlations between linked tables. In such cases, synthetic linked tables with correlations now have better quality and accuracy.

SSL support in PostgreSQL connectors

You can now configure your PostgreSQL connectors to use secure SSL connections to the database.

Job progress is now updated every second

As you track the progress of a running job from the Jobs tab or in the View tasks drawer, the progress is now updated every second to provide a more responsive experience.

Resolved issues

  • MPD-3220 - In the previous version, when you clicked Stop generation while looking at a job progress in the View tasks drawer, the job would continue generating data and ignore the action. We have now resolved this issue and clicking Stop generation now takes immediate effect.
  • MCD-1952 - When a column is set as both a primary key and a foreign key in the original data, MOSTLY AI prioritizes the foreign key relationship and the issue is handled gracefully.
  • MCD-1951 - Resolved an issue that occurred when MOSTLY AI wrote primary keys in UUID format that were longer than the maximum number of characters allowed by the column data type in the destination database

v107

Jun 21st, 2023

Databricks support

You can now create Databricks connectors and use Databricks catalogs as a data source or destination for your generated synthetic data.

Coherence report for linked tables

The Model QA report and Data QA report now contain a Coherence tab for linked tables (event & time-series data). In the Coherence tab, you can find bivariate plots that show how well the sequence and logic of events is preserved in the synthetic data.

Auto-update of training settings based on selected training goal

When you set the Training goal for a synthetic data job, MOSTLY AI now auto-updates the training settings Maximum training epochs and Training samples to values appropriate for the selected training goal.

Accuracy
  • Maximum training epochs is set to 100
Speed
  • Maximum training epochs is set to 10
  • Training samples is set to 100000
Turbo
  • Maximum training epochs is set to 1
  • Training samples is set to 10000
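
The presets above can be written down as a small lookup table. The values come from the list above. The dictionary layout, the key names, and the helper `apply_training_goal` are hypothetical and do not mirror any MOSTLY AI configuration format or API.

```python
# Values taken from the release note; names and layout are illustrative only.
TRAINING_GOAL_PRESETS = {
    "Accuracy": {"max_training_epochs": 100},
    "Speed": {"max_training_epochs": 10, "training_samples": 100_000},
    "Turbo": {"max_training_epochs": 1, "training_samples": 10_000},
}


def apply_training_goal(settings, goal):
    """Return a copy of `settings` updated with the preset for `goal`."""
    if goal not in TRAINING_GOAL_PRESETS:
        raise ValueError(f"unknown training goal: {goal!r}")
    return {**settings, **TRAINING_GOAL_PRESETS[goal]}


updated = apply_training_goal({"batch_size": "auto"}, "Turbo")
```

Settings that a preset does not mention, such as the placeholder `batch_size` here, are carried over unchanged.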

Actual and maximum theoretical accuracy in QA report

The Accuracy tab in the Model QA report now shows maximum theoretical accuracy in parentheses, next to the actual accuracy for each column.

Improvements

  • MPD-3182 - Improved the indication of mandatory fields and default values in all database and cloud storage connector configuration screens
  • MPD-2985 - The Accuracy tab now orders columns by their univariate accuracy in descending order
  • MPD-3080 - The training setting Limit records per subject is now renamed as Limit sequence length

v106

Jun 7th, 2023

BigQuery support

You can now create BigQuery connectors and use BigQuery as a data source or destination for your generated synthetic data.

Use the new Turbo training goal for quick synthetic data jobs

For testing purposes, you might need to run and complete synthetic data jobs rapidly without the need for accuracy. For such cases, you can now use the new Turbo training goal. When you select Turbo, MOSTLY AI automatically sets the Maximum training epochs setting to 1 and reduces the training time to a minimum so that you can get a quickly generated synthetic dataset.

Improvements

  • MPD-3105 - The Data settings screen now shows the type of mock data you selected for a column.
  • MPD-2476 - You can now set Encoding type: ITT for more than one column in a linked table.

Resolved issues

  • MPD-3084 - The metric Context columns no longer appears in the QA report for subject tables.
  • MCD-1812 - _RARE_ token values in Categorical columns in the input dataset are now considered as actual categories and no longer result in the crashing of synthetic data jobs.
  • MCD-1868 - We made optimizations to reduce the number of jobs that fail with OutOfMemory errors.
  • MCD-1982 - Empty linked tables (that have columns defined but contain no rows) no longer crash synthetic data jobs. MOSTLY AI generates the same empty tables in the synthetic dataset.

v105

May 25th, 2023

Performance improvements

After a number of performance optimizations to our database and queries, the MOSTLY AI synthetic data platform now supports even more simultaneous synthetic data jobs.

Resolved issues

  • MPD-3147 - Due to some incorrect assignments of foreign keys in specific cases, we disabled the auto-assignment of foreign keys when you upload subject and linked table files.

v104

May 16th, 2023

Snowflake support

You can now create Snowflake connectors to read original data directly from, and write synthetic data directly to, your Snowflake databases.

Auto-detection of CSV data types

MOSTLY AI now instantly recognizes the correct data types for uploaded CSV files. Previously, this was done as part of the data synthesis.

With this change, the Encoding Type AUTO is now deprecated.

Support for Gzip and Bzip2 files

You can now speed up the provisioning of large files by uploading them as Gzip (.gz) or Bzip2 (.bz2) compressed files.
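
If a file is not yet compressed, it can be prepared with Python's standard library before upload. This is a minimal sketch: the helper name `gzip_file` and the sample file `customers.csv` are made up for the example, and the upload step itself is not shown.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gzip_file(source, target=None):
    """Compress `source` to `target` (default: source name plus '.gz')."""
    source = Path(source)
    target = Path(target) if target else source.with_name(source.name + ".gz")
    with source.open("rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams the data in chunks
    return target


# Demonstration with a small temporary CSV file.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "customers.csv"
    csv_path.write_text("id,name\n1,Ada\n")
    gz_path = gzip_file(csv_path)
    with gzip.open(gz_path, "rt") as fh:
        roundtrip = fh.read()
```

For Bzip2, the same pattern should work with `bz2.open` from the standard library and a `.bz2` suffix.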

Support for TSV files

You can now upload TSV (tab-separated values) files.

Specify single files from cloud buckets

Previously, you were only able to specify the containing folders as a cloud bucket. With this release, you can now specify the path to individual files on a bucket.

Support for JSON Lines, Feather and ORC format (experimental)

You can now provide your original data in JSON Lines, Feather, or ORC format.

Resolved issues

  • MCD-1862 - MOSTLY AI now discards rows with duplicate primary keys if your dataset contains any.
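
To see what discarding duplicate primary keys can look like, here is a plain-Python sketch. The note does not say which of the duplicate rows is kept, so keeping the first occurrence is an assumption of this example, as are the function name `drop_duplicate_keys` and the row layout.

```python
def drop_duplicate_keys(rows, key):
    """Return rows with unique `key` values, keeping the first occurrence.

    Keeping the first occurrence is an assumption of this sketch.
    """
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


rows = [
    {"id": 1, "value": "a"},
    {"id": 2, "value": "b"},
    {"id": 1, "value": "c"},  # duplicate primary key
]
deduped = drop_duplicate_keys(rows, key="id")
```

Running a check like this on your own data before upload shows how many rows would be affected.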

v103

May 8th, 2023

Granular options for Generation mood

Generation mood now includes additional options for finer control over the type of distribution that you want to achieve in the generated synthetic data.

v102

Apr 24th, 2023

Home page

We want to welcome you to the new Home page in the top navigation bar. With the Home page, you have easier and direct access to MOSTLY AI features. You can review them below.

  • Upload files: In the Upload files tab, you can upload (drag-and-drop or browse to select) a CSV or Parquet file with data to immediately configure and start a synthetic data job.
  • Connect to a source: On the Connect to a source tab, you can immediately create a connection to a new database or cloud bucket.
  • Start a synthetic data job with an existing sample dataset: Under Or use sample data, you can immediately start a synthetic data job with any of the datasets that are available. Pick one and start a synthetic data job for it with the Start button.
  • Last six completed jobs: Under Existing synthetic datasets, you can review the last six completed jobs. The card for each job indicates if the synthetic data passed the Privacy check and what its overall Accuracy is.

Reference tables are no longer copied in the synthetic dataset

To prevent any potential data leaks, MOSTLY AI no longer copies Reference tables in the generated synthetic data.

Resolved issues

  • MPD-3064 - Fixed the issue where the Save button remained inactive after you edited a column with a Smart select relationship.
  • MPD-3039 - Fixed the issue that kept the Delete button inactive in the Catalogs tab.

v101

Apr 3rd, 2023

Improvements

Easy onboarding with Magiclink

You can now log in to MOSTLY AI using Magiclink.

Resolved issues

  • MCD-1691 - Fixed the issue where a job failed because the user provided too few samples.
  • MCD-1740 - Fixed the issue of having Nulls in a Text column.

v3.0

Mar 7th, 2023

Kubernetes and OpenShift support

MOSTLY AI 3.0 uses Kubernetes and OpenShift as the deployment method.

Smart imputation

Smart imputation allows the user to create a synthetic dataset where specific columns don't contain null values.

Rebalancing

Rebalancing allows you to specify the distribution of specific values in a column. Using Rebalancing, you can create a large number of relevant business scenarios out of the few that are present in your data. Use it to simulate what-if scenarios based on your historical data, or make minority classes visible for downstream machine learning algorithms.

Generation mood

Generation mood allows you to control the degree to which the synthetic version of the column will adhere to the detected distributions and correlations in the original data. The following generation mood settings are available:

  • Conservative - Generates synthetic data strictly within the business rules captured in the data.
  • Representative - Generates synthetic data that adheres less strictly to the business rules captured in the data.
  • Creative - Generates synthetic data skewed toward the outliers of the detected distributions.

New QA Report that reflects Programmable synthetic data metrics

With the introduction of Programmable data, we now provide quality assurance metrics for the model and the data separately.

MariaDB support

You can use MariaDB both as a data source and as a data destination.

New User Interface

The look and feel of the application are updated, along with the below improvements:

  • We are now providing consistency throughout the application in terms of flows and page elements, which will allow you to use the application more efficiently.
  • The steppers and information boxes will help you through your journey.
  • Data, Training, and Output settings are separated in different tabs
  • Configured numbers are now displayed with a thousands separator as a visual cue to help you work more efficiently with large numbers.

Rare / Extreme Value Protection updates

Enabling / Disabling the Rare Category Protection

You can enable or disable Rare category protection for categorical type columns.

Extreme Value Protection

You can enable or disable Extreme value protection for numerical, datetime, and ITT-type columns. If enabled, the values of the smallest and largest outliers in these columns will be replaced by the non-outlier values.
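
The replacement of extreme values can be pictured as a simple clipping step. This is only a conceptual sketch: how many values the platform treats as outliers, and which non-outlier value replaces them, is not stated here. The parameter `k` and the function name `clip_extremes` are therefore assumptions.

```python
def clip_extremes(values, k=1):
    """Replace the k smallest and k largest values with the nearest
    remaining (non-extreme) value. `k` is an assumption of this sketch.
    """
    if k < 1 or len(values) <= 2 * k:
        return list(values)  # too few values to define extremes
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]
    # Values below `low` are raised to it; values above `high` are lowered.
    return [min(max(v, low), high) for v in values]


protected = clip_extremes([3, 1000, 5, 4, -50, 6], k=1)
```

Here the extreme values -50 and 1000 are replaced by 3 and 6, the smallest and largest values that remain, so the output no longer exposes the original extremes.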

Improvements

Improved Quality

The context of all the tables in the hierarchy is now being propagated to the offspring tables. Also, the smart select columns are normalized in the context to improve quality further.

Editing settings of multiple columns at once

You can select and edit multiple columns at once.

Downloading synthetic data as CSV/parquet for all types of jobs

You can now download synthetic data for all types of jobs. If you don't have access to a destination database/bucket, you can use the Download as CSV/parquet option to download your synthetic data.

Resolved issues

  • MPD-2715 - PK and FK relationships are not correctly set for file based jobs.
  • MCD-1469 - Fixed the issue that catalogs with multiple context foreign keys may not complete synthetic data generation.
  • MCD-1445 - Fixed the issue that batch sizes greater than 4096 crash synthetic data generation.
  • MCD-1438 - Fixed the issue that in database synthesization jobs, tables whose names start with an _ fail to be read.
  • MCD-1432 - Fixed the issue of misalignment of data partitions occurring when the subject table is big and the linked table is small.
  • MPD-2576 - For Ad hoc jobs, the default rare category protection method is now Constant instead of Sample.
  • MPD-2532 - Fixed the issue that tables with multiple foreign keys may crash when the relationships have been edited in the data catalog.
  • MPD-2480 - Fixed the issue that users cannot upload tables that are partitioned over multiple files.
  • MPD-2478 - Fixed the issue that free version users see Local Server as a data connector option while unavailable to them.
  • MPD-2470 - Fixed the issue that Mock is selectable as an encoding type.
  • MPD-2444 - Fixed the issue that the encoding type is not saved when a linked table column is set to ITT.
  • MPD-2604 - Fixed the issue that in Ad Hoc jobs, column settings are not persisted after saving when switching tabs.
  • MPD-2340 - In Ad hoc jobs and Cloud storage data catalogs, the Edit relationships drawer is automatically shown to the user if the foreign key is not found.
  • MPD-2443 - Certain database relationships result in two context foreign keys to the same referenced table, resulting in an error during synthesization.
  • MPD-2395 - When creating a data connector, the schema field is marked as mandatory for databases that don't require it.
  • MPD-2381 - For Ad hoc jobs and cloud storage catalogs, the linked table's first column is automatically selected as the foreign key.
  • MPD-2378 - When a table has an unexpected character, the error message doesn't mention the issue as such, nor does it state where it occurs.
  • MPD-2339 - If there is only one referring table, it doesn't show up in the Primary key and referring tables section.
  • MPD-2281 - The column settings drawer shows the incorrect generation method for Smart Select and context foreign keys.
  • MPD-2060 - For users of the free version, Local storage is no longer an option when creating data catalogs.
  • MCD-1381 - Missing values in the numerical columns of Parquet files are not correctly read.
  • MCD-1373 - The Smart Select algorithm throws an error if the referring table is empty.
  • MCD-1364 - The database data connector throws an error if there are empty tables.
  • MPD-2371 - Tables are not shown in alphabetical order in the 'Database contents' section of the database table selection step.
  • MPD-2357 - The job settings' column details of uploaded Parquet files show Auto-detect instead of encoding types.
  • MPD-2356 - Parquet files cannot be used as a seed for the Generate more data feature.
  • MPD-2351 - When starting an Ad hoc job, users can upload 2 different files as a subject table.
  • MPD-2347 - Reference tables' primary keys are not copied but generated.
  • MCD-1325 - QA report generation fails when analyzing database datetime columns that contain values in an unknown format.
  • MCD-1327 - Sequence lengths are incorrectly calculated in an edge case scenario.
  • MPD-2194 - When creating or modifying a data connector, the Test connection button doesn't check whether the specified schema can be accessed.
  • MPD-2178 - Whitespaces in the header row of CSV files cause issues during synthesization.
  • MCD-1275 - QA report generation fails when synthesizing Parquet files.
  • MCD-1273 - Incorrect processing of scientific notation in CSV files.
  • MCD-1266 - Certain datetime ranges are incorrectly processed as strings.
  • MCD-1265 - Restrictive rules cause the QA report to fail in certain edge cases.
  • MCD-1261 - Long warning messages within the app's architecture cause it to crash.
  • MCD-1260 - QA report fails when a column is configured as 'mock data'.
  • MCD-1259 - Incremental timestamps in time-series data may generate inconsistent synthetic data when configured as ITT.
  • MCD-1258 - QA report fails when a numerical column is completely empty.
  • MCD-1257 - Synthesization fails if the linked table's entries are not linked to the subjects in the subject table.

v2.4.4

Dec 5th, 2022

Improvements

MCD-1217 - When synthesizing databases, the data types of the original schema are now respected, regardless of encoding type.

Resolved issues

  • MCD-1469 - Fixed the issue that catalogs with multiple context foreign keys may not complete synthetic data generation.
  • MCD-1445 - Fixed the issue that batch sizes greater than 4096 crash synthetic data generation.
  • MCD-1438 - Fixed the issue that in database synthesization jobs, tables whose names start with an _ fail to be read.
  • MCD-1432 - Fixed the issue of misalignment of data partitions occurring when the subject table is big and the linked table is small.
  • MPD-2576 - For Ad hoc jobs, the default rare category protection method is now constant instead of sample.
  • MPD-2532 - Fixed the issue that tables with multiple foreign keys may crash when the relationships have been edited in the data catalog.
  • MPD-2480 - Fixed the issue that users cannot upload tables that are partitioned over multiple files.
  • MPD-2478 - Fixed the issue that free version users see Local Server as a data connector option while unavailable to them.
  • MPD-2470 - Fixed the issue that Mock is selectable as an encoding type.
  • MPD-2444 - Fixed the issue that the encoding type is not saved when a linked table column is set to ITT.
  • MPD-2604 - Fixed the issue that in Ad Hoc jobs, column settings are not persisted after saving when switching tabs.
  • MPD-2340 - In Ad hoc jobs and Cloud storage data catalogs, the Edit relationships drawer is automatically shown to the user if the foreign key is not found.

v2.4.3

Oct 11th, 2022

Improvements

  • MPD-2175 - When running a job, View training logs is now visible from epoch 1, and a spinner indicates that the training is being canceled.
  • MPD-2267 - The QA report for linked tables no longer displays the linked table name along with the context table name.
  • MPD-2088 - When adding new foreign keys with the relationships drawer, if more than one parent table lacks a primary key, the error message lists all of these tables instead of only the first one.

Resolved issues

  • MPD-2443 - Certain database relationships result in two context foreign keys to the same referenced table, resulting in an error during synthesization.
  • MPD-2395 - When creating a data connector, the schema field is marked as mandatory for databases that don't require it.
  • MPD-2381 - For Ad hoc jobs and cloud storage catalogs, the linked table's first column is automatically selected as the foreign key.
  • MPD-2378 - When a table has an unexpected character, the error message doesn't mention the issue as such, nor does it state where it occurs.
  • MPD-2339 - If there is only one referring table, it doesn't show up in the Primary key and referring tables section.
  • MPD-2281 - The column settings drawer shows the incorrect generation method for Smart Select and context foreign keys.
  • MPD-2060 - For users of the free version, Local storage is no longer an option when creating data catalogs.
  • MCD-1381 - Missing values in the numerical columns of Parquet files are not correctly read.
  • MCD-1373 - The Smart Select algorithm throws an error if the referring table is empty.
  • MCD-1364 - The database data connector throws an error if there are empty tables.

v2.4.2

Sep 28th, 2022

Improvements

  • Multiple synthesization jobs started at the same time will now be processed one by one instead of all at once.

Resolved issues

  • MPD-2371 - Tables are not shown in alphabetical order in the 'Database contents' section of the database table selection step.
  • MPD-2357 - The job settings' column details of uploaded Parquet files show Auto-detect instead of encoding types.
  • MPD-2356 - Parquet files cannot be used as a seed for the Generate more data feature.
  • MPD-2351 - When starting an Ad hoc job, users can upload 2 different files as a subject table.
  • MPD-2347 - Reference tables' primary keys are not copied but generated.
  • MCD-1325 - QA report generation fails when analyzing database datetime columns that contain values in an unknown format.
  • MCD-1327 - Sequence lengths are incorrectly calculated in an edge case scenario.

v2.4.1

Sep 12th, 2022

Improvements

  • Ad hoc jobs can now synthesize Parquet files.
  • CSV files can now have semicolons (;) as well as commas (,) as column separators.
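As a rough illustration of what accepting both separators involves, the sketch below uses Python's standard `csv.Sniffer` to work out whether a file uses commas or semicolons before parsing it. It is not how MOSTLY AI detects separators internally; the function name `read_rows` and the sample rows are made up for this example.

```python
import csv
import io

def read_rows(text):
    """Parse CSV text whose column separator is either ',' or ';'."""
    # Restrict the sniffer to the two separators named in the release note.
    dialect = csv.Sniffer().sniff(text, delimiters=",;")
    return list(csv.reader(io.StringIO(text), dialect))

# Hypothetical sample data: the same table written with each separator.
comma_rows = read_rows("id,name,city\n1,Ann,Vienna\n2,Bob,Graz\n")
semicolon_rows = read_rows("id;name;city\n1;Ann;Vienna\n2;Bob;Graz\n")
```

Both calls yield the same list of rows, so downstream code does not need to know which separator the file used.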

Resolved issues

  • MPD-2194 - When creating or modifying a data connector, the Test connection button doesn't check whether the specified schema can be accessed.
  • MPD-2178 - Whitespaces in the header row of CSV files cause issues during synthesization.
  • MCD-1275 - QA report generation fails when synthesizing Parquet files.
  • MCD-1273 - Incorrect processing of scientific notation in CSV files.
  • MCD-1266 - Certain datetime ranges are incorrectly processed as strings.
  • MCD-1265 - Restrictive rules cause the QA report to fail in certain edge cases.
  • MCD-1261 - Long warning messages within the app's architecture cause it to crash.
  • MCD-1260 - QA report fails when a column is configured as 'mock data'.
  • MCD-1259 - Incremental timestamps in time-series data may generate inconsistent synthetic data when configured as ITT.
  • MCD-1258 - QA report fails when a numerical column is completely empty.
  • MCD-1257 - Synthesization fails if the linked table's entries are not linked to the subjects in the subject table.

Security updates

Security updates have been made to the following components:

  • Java and Python libraries
  • RabbitMQ
  • Internal PostgreSQL database
  • Keycloak

v2.4

Aug 29th, 2022

Synthesize databases even if they don't have a schema, and impress your colleagues with its QA report.

  • Relationship manager
    Use the relationship manager to add and modify relationships so that you can tailor the synthetic version of your database entirely to your use case. It's specifically designed to help you synthesize databases without schema or with an incomplete schema.

  • A QA report for everyone
    You can now download and share the QA report of your synthetic databases with your colleagues. Not only did we make it easy to share, but also easy to read! We worked on numerous improvements that help you assess synthetic data quality and convey the message that your synthetic data is privacy-secure and an accurate representation of your company's valuable data assets.

Relationship Manager

Whether your database is small or large, with or without schema, we've got you covered. You can now complete the relationships between your database's tables so that all of its data assets can be properly secured and accurately synthesized, QA report included.

And if you're dealing with partially defined relationships and don't know which ones are missing, you can count on us as well. Our handy 'Tables without relations' filter gets you going in no time!

Working with the relationship manager is not complicated either. Watch this 6-minute video tutorial to get you up to speed.

QA report

  • Improved interactive charts help you easily pinpoint and identify potential accuracy issues in the synthetic data.
  • There's no need to wait for it either! QA report generation now takes seconds per table rather than minutes, so you can immediately assess the quality of your synthetic data.
  • Explainer sections in the report help the reader understand what they’re looking at.
  • The QA report now comes in a handy, self-contained HTML document that retains all interactive charts when sharing it across your business and partnerships.

Resolved issues

  • MPD-2198 - When the ‘number of generated subjects’ is left blank, the number of training subjects is used if defined, instead of the number of subjects in the subject table.
  • MPD-2185 - Incorrect number of columns reported in the QA report.
  • MPD-2180 - ‘Cancel training’ and ‘Cancel generation’ buttons are not working when synthesizing data.
  • MCD-2057 - UI issues when creating an Oracle database data connector.
  • MCD-1177 - Incorrect handling of SID and SERVICE_NAME connections to Oracle databases.
  • MCD-1169 - The QA report of certain datasets has an incorrect placement of labels in the correlation matrix.
  • MCD-1163 - Numerical columns may generate a casting exception during generation causing a job failure.

v2.3

Jul 7th, 2022

Whether you're a student, small business, or enterprise, our Synthetic Data Platform is ready to serve your needs

  • Effortless onboarding with our new video tutorials
    Our new video tutorials help users start synthesizing their company's valuable data assets right away and help them understand what's going on in each step.
  • Audit logs for compliance and security
    MOSTLY AI's audit log keeps track of who accessed the system, what they looked at, and what actions they took.
  • Improved synthesization of your database's sequences
    The order of your linked tables' lists, sequences, and time-series data embodies valuable information. MOSTLY AI now allows you to sort your linked tables by column so that all sequential information is optimally preserved.

Free edition

The best AI-driven synthetic data generator is available free of charge forever for generating up to 100K rows daily. If you want to generate high-quality, privacy-safe synthetic versions of your datasets for machine learning, testing or data sharing use cases, MOSTLY AI's synthetic data generator is at your service. And it's available straight from your browser after a simple registration.

Effortless user onboarding with video tutorials

Our new video tutorials help users start synthesizing their company's valuable data assets right away and help them understand what's going on in each step. There are three video tutorials available:

  • Privacy-secure your customer data
    Users will learn to synthesize a table with basic customer profile information, such as their name, address, birth date, etc., and get a glimpse into the type of insights they can obtain from it.

  • Privacy-secure behavioral customer data
    Users will learn to synthesize a subject table-linked table dataset and understand how to deal with lists, sequences, and time-series data.

  • Create a realistic and secure test database
    Test engineers will learn to create a subset of a production database that is privacy secure and referentially intact while maintaining all business rules and relevant business scenarios for testing.

Audit logs for compliance and security

System administrators can now retrieve an audit log from the MOSTLY AI Synthetic Data Platform. It keeps track of who accessed the system, what they looked at, and what actions they took. This temporal information is important for proving compliance and security.

Improved synthesization of your database's sequences

The order of your linked tables' lists, sequences, and time-series data embodies valuable information. MOSTLY AI now allows you to sort your linked tables by column so that all sequential information is optimally preserved. For time-series data, you can now also select the ITT (Inter-Transaction-Time) encoding type. It models the time interval between two subsequent events, resulting in a very accurate rendering of the time between events.
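To make the idea behind ITT concrete, here is a minimal sketch of how inter-transaction times can be derived from one subject's event timestamps. It only illustrates the quantity being modeled (the gap between consecutive events), not MOSTLY AI's actual encoding; the function name and the sample timestamps are hypothetical.

```python
from datetime import datetime

def inter_transaction_times(timestamps):
    """Return the gaps, in seconds, between consecutive events of one subject."""
    # The sequence order carries the information, so sort by time first.
    ordered = sorted(timestamps)
    return [
        (later - earlier).total_seconds()
        for earlier, later in zip(ordered, ordered[1:])
    ]

# Hypothetical events for a single customer.
events = [
    datetime(2022, 7, 1, 9, 0),
    datetime(2022, 7, 1, 9, 30),
    datetime(2022, 7, 3, 9, 30),
]
gaps = inter_transaction_times(events)
print(gaps)  # [1800.0, 172800.0]
```

A sequence of n events produces n - 1 gaps, which is why the linked table has to be sorted by a suitable column before the intervals mean anything.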

Resolved issues

  • MPD-2090 - License renewal issues.
  • MPD-2033 - QA report generation crashes when a CSV file contains \n symbols.
  • MCD-1131 - Out of memory issues when synthesizing subject table-linked table datasets.
  • MCD-1079 - The trained AI model is lost when the training crashes.
  • MCD-1070 - In rare cases, numerical values are incorrectly detected as boolean values.

v2.2

May 1st, 2022

  • Transform your business with synthetic data that's effortlessly privacy-secure, efficient, and fast
  • Take advantage of a synthetic data engine that's mindful of your time and hardware resources.
  • Benefit from a much-simplified preparation of your synthesization jobs. The web UI now serves your goals, while MOSTLY AI handles complex configurations in the background.
  • Our new user management system lets you create groups, manage group-level access permissions, and lets users share synthetic data assets with these groups.
  • MySQL support enables synthetic data in the cloud, integrating MOSTLY AI with cloud databases like AWS Aurora, Google Cloud SQL, and many more.

A smarter, faster & more efficient MOSTLY AI

For the past few months, we have been working to make synthetic data work for your business. Here are some of the highlights:

  • We achieved a more than two-fold increase in synthesization speed, significantly reducing the resource footprint of synthetic data in your company.

  • Preparing synthetic data has become much simpler. The engine now determines the best AI model and outlier protection settings for your dataset and use case.

  • Benefit from better resilience for missing files, rows, columns, and so on. Defects in your data sources will no longer cause issues.

Increased speed

Benefit from a smaller synthetic data footprint and shorter time-to-data

  • MOSTLY AI can not only process datasets virtually limitless in size, it can now ingest and encode them faster than before.

  • Overall AI model training time has been halved, and wide tables now benefit from a faster training time for the first epoch.

  • We achieved a ten-fold increase in synthetic data generation performance. What MOSTLY AI used to generate in minutes can now be done in seconds.

Better synthetic data

Privacy-security is now out-of-the-box and takes zero effort to realize

  • MOSTLY AI protects rare categories by replacing them with non-rare categories. Release 2.2 replaces them in a context-aware manner. For instance, if a female data subject has a rare name, it will be replaced with a female non-rare name.

  • Rare category protection can no longer be adjusted or turned off.

  • Extreme values are now protected in all numerical formats, including datetime and ITT.

  • Lists, sequences, and time-series data now benefit from extreme sequence length protection.

  • Improved accuracy of sequence length distributions in the synthetic data, as minimum sequence lengths are now respected.
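The context-aware replacement of rare categories described above can be pictured roughly as follows. This is a simplified sketch under stated assumptions, not MOSTLY AI's algorithm: the `min_count` threshold, the `PLACEHOLDER` fallback, and the use of a single gender column as "context" are all choices made for this example.

```python
import random
from collections import Counter

PLACEHOLDER = "_RARE_"  # hypothetical fallback when no frequent value exists

def protect_rare_names(records, min_count=2, seed=0):
    """Replace rare names with a frequent name from the same context (gender)."""
    rng = random.Random(seed)
    counts = Counter(name for _, name in records)

    # Collect the non-rare names observed for each context value.
    frequent = {}
    for gender, name in records:
        if counts[name] >= min_count:
            frequent.setdefault(gender, set()).add(name)

    protected = []
    for gender, name in records:
        if counts[name] < min_count:
            candidates = sorted(frequent.get(gender, ()))
            name = rng.choice(candidates) if candidates else PLACEHOLDER
        protected.append((gender, name))
    return protected

# Hypothetical (gender, name) records; each rare name occurs only once.
records = [
    ("F", "Anna"), ("F", "Anna"), ("F", "Xanthippe"),
    ("M", "Paul"), ("M", "Paul"), ("M", "Zebulon"),
]
protected = protect_rare_names(records)
```

In this toy input, the rare female name is replaced by the frequent female name and the rare male name by the frequent male one, so no unique value survives while the gender context stays consistent.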

Simpler preparation

  • Use the batch size AI model training parameter to balance training speed with memory availability. The appropriate learning rate is now calculated in the background.

  • If your synthesization job doesn't run as desired, you can choose a smaller or bigger AI model size to mitigate the issue.

  • The job summary now shows a progress bar for each epoch, giving you an indication of how long AI model training will take.

  • The "generate more data" function for synthesization jobs created with release 2.2 will now work with all upcoming versions of MOSTLY AI.

Manage users and groups

Create groups and let users share assets across them

  • As an admin, you can now create groups and manage group-level access permissions. This makes it easier to manage permissions for multiple users or reassign individual users if they change jobs in the organization.

  • As a user, you can now share synthetic data assets with your group or with other groups.

MySQL Data connector

Use the MySQL family of databases for synthetic data

The MySQL data connector enables synthetic data in the cloud and integrates MOSTLY AI with cloud databases like AWS Aurora, Google Cloud SQL, and many more.

Resolved issues

  • MPD-1781 - License issues due to restarted VMs.
  • MPD-1439 - The data connector details view doesn't show the database name.
  • MCD-952 - The AI server crashes when there's an issue with assigning foreign keys using Smart Select.
  • MCD-939 - Generation crashes if the precision is specified for columns with floating point numbers.

v2.1

Apr 26th, 2021

Equip yourself with the most comprehensive synthetic data platform on the market

MOSTLY AI 2.1 continues our mission to deliver an enterprise-grade synthetic data platform and remain the leader in the tabular synthetic data space.

  • We now support the DB2 family of databases, enabling synthetic data for mainframe applications.

  • Our new Text encoding type allows you to synthesize unstructured natural language. MOSTLY AI 2.1 now covers all tabular data types, from categories to geolocation data and beyond. The world is all yours!

  • Benefit from searchable and interactive charts in the QA report, allowing you to intuitively spot opportunities to further improve synthetic data quality.

Synthetic text

Put unstructured natural language texts to use in your AI/ML applications.

Insurance claim reports, medical diagnoses, and other types of unstructured texts are very rich sources of information, capturing details that aren't present in numbers or other structured forms of data.

Our new Text encoding type allows you to privacy-protect these texts and put them to use in various AI/ML use cases, for example:

  • Named-entity recognition
  • Sentiment analysis
  • Testing—by generating real descriptions
  • E-commerce analytics—by synthesizing customers' search keywords

DB2 Data connector

Use DB2 databases for synthetic data

You can now connect MOSTLY AI to the DB2 family of databases. Use them as a data source or as a destination, and enable synthetic data for mainframe applications.

Updated QA Report

All privacy and accuracy charts are now within easy reach

Our privacy and accuracy charts are now available in the web UI, so you can intuitively evaluate the quality of your synthetic data.

Spot opportunities to further improve quality and immediately apply them to the job settings.

  • Use the search function to look up specific columns.
  • Interactive charts allow you to learn more about specific data points.
  • Enlarge them to study them in detail.

Resolved issues

  • MPD-1596 - Job cancellation hangs when using AWS ECR.
  • MCD-885 - Job won't start if the String pattern of the Custom string mock data type is not defined.
  • MCD-754 - The Generate more data feature crashes with some of the supported datetime formats.
  • MCD-806 - Jobs may crash if they process very wide tables.
  • MCD-760 - AI model training crashes when consistency correction and GPU acceleration are both active.

v2.0

Nov 2nd, 2021

Synthesize your data wherever it is

MOSTLY AI 2.0 is now capable of synthesizing entire databases!

It connects to your data sources, recognizes their columns and relationships, and provides you with a synthetic version of your data wherever you need it.

There are no more limits to what you can synthesize. Connect to your databases, buckets, and files without any hurdles.

Be ready for the synthetic data revolution. It’s already here.

New UI

A new customer-centric UI

With MOSTLY AI 2.0, we introduce a new UI!

The new UI has been redesigned with a customer-centric approach.

The task of creating a new synthesization job has never been easier.

And it looks cool too!

Multi-table data catalog

Synthesize complex data structures

With MOSTLY AI 2.0, it is now possible to define multi-table data catalogs!

The complexity of your data source is now represented in the data catalog:

  • Support for primary keys
  • Support for foreign keys
  • Referential integrity

The platform understands the relationships between all the tables and creates a synthesization plan based on these relationships.

The result is a synthetic version of your data in its original form!
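One way to picture a plan derived from table relationships is a topological ordering: every table is processed only after the tables its foreign keys reference. The sketch below shows this with Python's standard `graphlib` module (Python 3.9+). The catalog is hypothetical, and this is an illustration of the general technique rather than the platform's actual planner.

```python
from graphlib import TopologicalSorter

# Hypothetical catalog: each table maps to the tables its foreign keys reference.
foreign_keys = {
    "customers": set(),
    "accounts": {"customers"},
    "addresses": {"customers"},
    "transactions": {"accounts"},
}

# static_order() yields referenced (parent) tables before the tables that refer to them.
plan = list(TopologicalSorter(foreign_keys).static_order())
print(plan)
```

Processing tables in this order means a parent table's synthetic primary keys already exist when a child table's foreign keys are generated, which is what keeps the result referentially intact.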

Parallel computing

Better performance thanks to parallelization

Thanks to a major architectural redesign, the MOSTLY AI platform now supports parallel computing.

For multi-table synthetic data generation, the MOSTLY AI platform intelligently distributes the tasks that can be computed in parallel across the available VMs.
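The following sketch shows the general idea of running independent tasks side by side: tables whose dependencies are already finished form a batch that is handed to a pool of workers. The worker pool stands in for the available VMs, `synthesize` is a placeholder for the real per-table work, and the dependency map is hypothetical. None of this reflects the platform's actual scheduler.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

def synthesize(table):
    """Placeholder for the real per-table synthesization work."""
    return f"synthetic_{table}"

def run_in_parallel(dependencies, workers=2):
    """Process tables in batches whose members do not depend on each other."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    batches = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while sorter.is_active():
            # Every table returned here has all of its parents completed.
            ready = sorted(sorter.get_ready())
            list(pool.map(synthesize, ready))  # run the batch concurrently
            sorter.done(*ready)
            batches.append(ready)
    return batches

# Hypothetical dependency map: table -> tables that must be finished first.
dependencies = {
    "customers": set(),
    "accounts": {"customers"},
    "addresses": {"customers"},
    "transactions": {"accounts"},
}
batches = run_in_parallel(dependencies)
```

Here `accounts` and `addresses` land in the same batch because both depend only on `customers`, so they can be processed at the same time.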

Data connectors

Create your data source once and re-use it!

You can now define data connectors in the MOSTLY AI platform!

A data connector can be used as a source of data or as a data destination.

You can fetch data from your production data lake or database and push it wherever you need!

Mock data

A perfect way to test the extremes

One of the biggest challenges in software testing is getting the software into very specific states. You want to test that a new error message works, but the message is only shown when something breaks. You may have no direct control over the underlying data, yet you need to manipulate it to perform your tests.

You can now define Mock Data in the generation process!

Mock data makes it possible to simulate errors and circumstances that would otherwise be very difficult to create in a real world environment.

New QA report

A new and intuitive QA report

The new QA Report is available directly in the UI. You can explore the results of your generation job and see if there are privacy or accuracy warnings!