Welcome to the comprehensive guidelines for configuring and generating synthetic data using the MOSTLY AI Synthetic Data Platform. We realize that the ability to create high-quality synthetic data has become an invaluable asset for a wide range of industries. Whether you are developing machine learning models, testing software applications, or enhancing data privacy, generating accurate synthetic data can provide a realistic and privacy-preserving alternative to real data.
With this document, we intend to assist you in realizing the full potential of MOSTLY AI while adhering to best practices. By following these guidelines, you can not only master the configuration of synthetic data with MOSTLY AI to suit your specific needs but also gain insights into various tips, tricks, and potential pitfalls to avoid during the data generation process. Our goal is to empower you with the knowledge and tools to make the most of our product.
Let’s get started with some basic concepts of the MOSTLY AI Synthetic Data Platform that are important to understand.
When deploying MOSTLY AI on different virtual machine (VM) sizes and working with different data dimensions, it is essential to understand how these factors can impact accuracy, training times, and overall performance. This section provides insights into what to expect from specific VM sizes and from manipulating data dimensions, including wide tables, the number of columns, and the number of rows.
The performance of MOSTLY AI can vary based on the VM size you deploy to. Different VM sizes offer varying computational resources when it comes to CPU and memory.
- Accuracy: Larger VM sizes with more computational power can lead to increased accuracy due to enhanced processing capabilities. Models trained on more powerful VMs might perform better, especially for complex tasks or larger datasets.
- Training times: Larger VM sizes can significantly reduce training times. More resources allow for faster model training, resulting in quicker training iterations. However, balancing the VM size with your budget and the urgency of results is important.
Note: Which VM size to use depends on your data dimensions. For more information, see the sections below.
When working with data, particularly wide tables with a varying number of columns and rows, the following considerations apply:
- Wide Tables
Wide tables, with a high number of features or columns, can pose challenges for training. While they offer more information, they can lead to increased model complexity. Reducing your dataset to include only necessary features based on your use case might enhance training efficiency.
- Number of columns
The number of columns in your dataset directly impacts the time required for processing and training. More columns lead to longer training times and increased memory usage. We recommend you only include essential features for your use case and avoid unnecessary complexity.
- Below 100 Columns: Working within this range is manageable, and you will most likely have a smooth processing and training experience.
- Between 100 and 500 Columns: Within this bracket, you might encounter challenges, depending on the encoding types used. Some encoding types (such as Text) increase processing times and memory usage.
- Above 500 Columns: We typically do not recommend synthesizing more than 500 columns at once. Instead, consider preprocessing steps, such as feature selection, that can help streamline training efficiency and mitigate potential issues arising from data complexity.
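A minimal pandas sketch of this kind of feature selection; all column names and values below are hypothetical and only illustrate reducing a table to the columns a use case actually needs:

```python
import pandas as pd

# Hypothetical wide table; only a few columns matter for the use case.
df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [34, 45, 29],
    "income": [52000, 61000, 48000],
    "debug_flag_01": [0, 1, 0],  # auxiliary column, not needed for synthesis
    "debug_flag_02": [1, 1, 0],
})

# Keep only the features required for the use case before creating a generator.
essential_columns = ["customer_id", "age", "income"]
df_reduced = df[essential_columns]
print(df_reduced.shape)  # (3, 3)
```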
- Number of Rows
The number of rows in your dataset affects the model's generalization ability. Larger datasets often lead to better model performance by providing more diverse examples. However, larger datasets also require more data preprocessing and training time.
- Subject tables
- Subject tables below 1 million rows can be processed smoothly.
- Within the 1-10 million range, encoding types can impact performance. Consider using the sampling feature to only work with a subset of the data.
- Beyond 10 million, sampling becomes a viable strategy to ensure manageable processing and training.
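A minimal sketch of subject sampling with pandas, shown at a small scale with a hypothetical subject table; a fixed `random_state` keeps the draw reproducible:

```python
import pandas as pd

# Hypothetical subject table (in practice this could be millions of rows).
subjects = pd.DataFrame({"customer_id": range(1000), "segment": ["a", "b"] * 500})

# Draw a random subset of subjects to keep processing and training manageable.
sample = subjects.sample(n=100, random_state=42)
print(len(sample))  # 100
```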
- Linked tables
- The linked table's sequence length (the number of linked records per subject) is critical. If it is substantial, sampling subjects or limiting the sequence length becomes essential.
- Sequence lengths below 100 usually result in a smooth experience.
- Memory issues might arise in the range of 100-1,000; solutions include reducing the batch size, model size, or even reducing the number of subjects.
- Beyond 1,000, adjusting sequence length, reducing the number of subjects, or altering batch size becomes relevant.
- Number of tables
- The number of tables within your dataset influences the synthetic data generation. As the number of tables increases, factors like processing time, model complexity, and training duration come into play. Balancing dataset richness and the potential challenges associated with multiple tables is essential.
- Processing Time Impact: More tables extend the time required to generate synthetic data. The complexity of managing inter-table relationships and correlations contributes to this effect.
- Inter-Table Correlations: When tables are related, the model aims to capture and maintain these correlations between the tables. Consequently, model complexity increases, leading to prolonged training times.
- To avoid such a scenario, we suggest evaluating the necessity of each table. Consider whether merging certain tables can streamline data processing and training without sacrificing critical insights.
Remember that every dataset is unique and context matters. Prioritizing essential features ensures your model's efficacy without being hindered by unnecessary complexity. By adhering to such considerations, you can navigate the training process more effectively, optimizing resources and maximizing insights.
Here is how you can achieve best results when working with different VM sizes and data dimensions.
- Experimentation: We recommend experimenting with various VM sizes, especially during the model training phase. Assess how accuracy and training times evolve with different resources, then decide which VM size to employ. Always balance the VM size with your budget and the urgency of results.
- Data Sampling: Consider data sampling techniques for large datasets to create manageable subsets for initial experimentation. This can help to test different configurations without spending excessive time on training or resources.
- Adding more computational resources will not necessarily reduce training times or increase the accuracy or quality of the synthetic dataset.
- Data sampling and reducing the dataset size might also work against privacy: our privacy protection mechanisms work better with more data.
- You may calculate the estimated cost of a machine in the cloud using the cost calculators provided by the major cloud providers. Below are two example configurations:
- VM Size: 64 CPUs, 256 GB RAM, 4 worker nodes
- Number of tables: 2
- Subject table:
- Number of rows: 5,000
- Number of columns: 16
- Linked table:
- Number of rows: 1,037,854
- Number of columns: 15
- Average sequence length: 207.5
- Training time:
- Subject table: 1min
- Linked table: 15 hrs
- Estimated cost using AWS calculator
- EC2 Instance: m5a.16xlarge
- On-Demand Hourly Cost: 2.752 USD
- Total on-demand cost of the job: 41 USD
- VM Size: 12 CPUs, 128 GB RAM, 4 worker nodes
- Number of tables: 2
- Subject table:
- Number of rows: 5,000
- Number of columns: 16
- Linked table:
- Number of rows: 1,037,854
- Number of columns: 15
- Average sequence length: 207.5
- Training time:
- Subject table: 1min
- Linked table: 90 hrs
- Estimated cost using AWS calculator
- EC2 Instance: r6g.4xlarge
- On-Demand Hourly Cost: 0.8064 USD
- Total on-demand cost of the job: 72 USD
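As a sanity check, the two cost totals above can be reproduced by multiplying the on-demand hourly price by the linked-table training time (the subject table's one minute is negligible). This sketch assumes billing for a single instance:

```python
# Estimated on-demand cost: hourly price x training hours, truncated to whole
# dollars. Assumes a single billed instance and that linked-table training
# dominates the job duration.
def job_cost(hourly_usd: float, hours: float) -> int:
    return int(hourly_usd * hours)

print(job_cost(2.752, 15))   # 41 USD (m5a.16xlarge, 15 hrs)
print(job_cost(0.8064, 90))  # 72 USD (r6g.4xlarge, 90 hrs)
```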
Imagine you have a collection of information about different people and want to ensure their personal details stay private. Each person is a subject, which could be a person or any other entity.
For example, let's say you're responsible for handling data about customers for an online store. You want to keep their information private, such as their names, genders, heights, where they live, how much they earn, and so on. Instead of having all this sensitive information scattered around, you create a subject table.
This table is like a special place where you arrange all the details about each customer. A row in this table represents the profile of one customer. It's like having a separate card for each customer, where you jot down their name, gender, height, address, and income. Each piece of information you write on these cards is a field.
So, a subject table is like your secret-keeping tool, ensuring that you have all the important information about your customers or subjects while ensuring their private details are well-protected when turned into a synthetic version. It's an organized way to manage data and respect people's privacy at the same time.
- The number of columns determines the minimum required size of the subject table. We recommend that subject tables have at least 5,000 subjects (rows of data) to achieve good data quality. However, the platform also works with fewer subjects and can even synthesize a subject table of only 100 subjects. Keep in mind that with so few subjects, the synthetic data quality will likely be poor, depending on the specific dataset. If your use case involves a low number of subjects, we recommend giving it a try and seeing for yourself whether the platform can generate synthetic data of sufficient quality.
- Each subject must represent a distinct, real-world individual or entity.
- The subject table must not contain duplicate records representing the same individual or entity. That means a row describes only one subject.
- Each row is independent. The order of rows carries no information, and the contents of a single row do not affect other rows.
- Column names must not contain privacy-sensitive information. This is recommended for all kinds of tables (including linked tables).
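A minimal pandas sketch of the no-duplicates requirement, using hypothetical data: each subject must appear exactly once, so duplicate records per subject ID are dropped.

```python
import pandas as pd

# Hypothetical raw extract where the same customer appears twice.
raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "name": ["Ann", "Ann", "Bob", "Ced"],
})

# Each row of the subject table must describe exactly one distinct subject.
subjects = raw.drop_duplicates(subset="customer_id").reset_index(drop=True)
print(len(subjects))  # 3
```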
Think of a linked table as a map that links events to the people they belong to. For example, imagine you are tracking orders in an online store. Each order is an event, and it is connected to a specific customer. This connection is like a thread that ties the event (order) to the customer (subject). Each customer might have more than one order.
- Each record in the linked table must include the unique ID of the subject record it is linked to (the foreign key).
- Avoid long sequence lengths (the number of linked records related to a single subject record), as they increase the time required to synthesize the dataset.
When splitting a table into subject and linked tables, careful data pre-processing is essential to maintain data integrity and privacy. Follow these steps to effectively separate your data and create organized subject and linked tables:
- Identify Unique Subjects: Identify the unique individuals or entities in your dataset. Each distinct individual or entity should have its own row. Remove any duplicate records that might describe the same individual or entity multiple times.
- Extract Subject Information: Create a subject table by extracting all the attributes that relate to the individuals or entities. These attributes should include privacy-sensitive information such as names, genders, addresses, and any other relevant details that define each subject.
- Create linked table: Identify events or actions associated with each subject. Create a linked table that connects each event to the respective subject. In the example below, if you're tracking employees' absences, each absence will be linked to a specific employee.
- Privacy-Conscious Column Names: Ensure that column names in both the subject and linked tables do not contain privacy-sensitive information. Column names should be non-identifiable and not compromise privacy.
- Minimum Subject Table Size: Aim for a subject table with at least 5,000 subjects to achieve good data quality. However, the platform can work with smaller subject tables, but the synthetic data quality might vary based on the dataset.
- Validation and Quality Assurance: Before proceeding with model training and data synthesis, thoroughly validate your subject and linked tables. Check for any inconsistencies, duplicate records, or missing information.
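The splitting steps above can be sketched in pandas. All column names and values are hypothetical; the example follows the employees-and-absences scenario mentioned earlier:

```python
import pandas as pd

# Hypothetical flat export: one row per absence, with employee attributes repeated.
flat = pd.DataFrame({
    "employee_id": [1, 1, 2],
    "name": ["Ann", "Ann", "Bob"],
    "department": ["HR", "HR", "IT"],
    "absence_date": ["2023-01-02", "2023-02-10", "2023-01-15"],
    "absence_days": [1, 3, 2],
})

# Subject table: one row per employee, attributes only.
subject = flat[["employee_id", "name", "department"]].drop_duplicates("employee_id")

# Linked table: one row per absence, keeping the foreign key to the subject.
linked = flat[["employee_id", "absence_date", "absence_days"]]

print(subject.shape, linked.shape)  # (2, 3) (3, 3)
```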
Tips and tricks
- When generating high-quality synthetic data using MOSTLY AI, the quality of the subject table is not affected by the linked table. You might consider launching a job using only the subject table first to ensure that the synthetic data is private and of high quality. Once you feel comfortable with the synthetic version of your subject table you could then add your linked table and launch a new job. This will help you familiarize yourself with our product and nail the quality of your synthetic dataset.
- When working with linked tables and handling extensive datasets, adopting a gradual approach – starting with a low volume of data and gradually increasing the number of rows – can significantly enhance the efficiency and accuracy of your model training process. Additionally, it's essential to recognize that utilizing the entire dataset might not be necessary when dealing with large volumes of data. As the dataset grows to millions of records, the marginal information gained from each additional data point diminishes. Using the full dataset might lead to inefficient resource utilization without substantial improvements in model performance.
With the Turbo training setting you can generate a synthetic dataset quickly at the cost of accuracy or quality of your synthetic dataset. When you select Turbo the platform auto-updates the following training settings:
- Maximum training epochs is set to 1
- Training samples is set to 10,000
Tips and tricks
By doing so, the training time is reduced substantially, making the Turbo option a good way to quickly create a synthetic dataset and test the platform. We recommend creating your first generator using this option to familiarize yourself with the platform. Once you feel comfortable, you can improve the quality of your synthetic dataset by optimizing for Accuracy.
In preparing data for the generative AI model in MOSTLY AI, the choice of encoding types for each column is important. Encoding defines how the model processes and generates synthetic data based on your original dataset. While Numeric, Categorical, and Datetime data types are automatically identified by the platform, other encoding types need manual selection. This section provides some insights into different encoding types each catering to specific data characteristics. Whether you're working with textual information, numerical values, categorical attributes, or geolocation data, understanding encoding methods is important for training a high-quality synthetic data generator with MOSTLY AI.
To generate geolocation coordinates, use the Latitude, Longitude encoding type.
MOSTLY AI needs geolocation coordinates to be represented in a single field with comma-separated latitude and longitude values. You can find the details on the required format here.
Assume you have a geolocation dataset with the following structure, where Latitude and Longitude are stored in separate columns:
You would need to perform some pre-processing to bring the dataset to MOSTLY AI’s required structure. Below you can find the Python code to do so:
df["LatLong"] = df["Latitude"].astype(str) + ',' + df["Longitude"].astype(str)
Now the structure of the dataset should look like this:
You can remove the ‘Latitude’ and ‘Longitude’ columns from the dataset yourself, or exclude them through the UI on the Data Settings tab.
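Putting the snippet above into a self-contained sketch (the city names and coordinates are made up):

```python
import pandas as pd

# Hypothetical dataset with separate Latitude and Longitude columns.
df = pd.DataFrame({
    "city": ["Vienna", "Zurich"],
    "Latitude": [48.2082, 47.3769],
    "Longitude": [16.3738, 8.5417],
})

# Combine into the single comma-separated field MOSTLY AI expects,
# then drop the original columns.
df["LatLong"] = df["Latitude"].astype(str) + "," + df["Longitude"].astype(str)
df = df.drop(columns=["Latitude", "Longitude"])

print(df["LatLong"].tolist())  # ['48.2082,16.3738', '47.3769,8.5417']
```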
To generate high-quality text data, MOSTLY AI requires input from as many data points as possible. We recommend having at least 5,000 records containing text of up to 1,000 characters. The better the quality of the text in the original data, the higher the quality of the synthetic text MOSTLY AI can generate.
Bear in mind that when configuring a generator with text columns, the training time increases significantly. MOSTLY AI utilizes a separate generative AI model dedicated to text columns, hence the increase in training times.
Short Text Sequences: For very short text, character sequence encoding can be advantageous. Character-level encoding is computationally less intense than word-level or subword-level encoding methods.
High-precision numeric variables will require more training time to complete.
- Large integers:
- High precision float number:
If the full precision is not needed for the specific use case in mind, we first recommend reducing the precision of such variables, e.g.
- Large integers in millions:
- Float number with decreased precision:
If this is not possible, our default numerical type will handle this for you.
Numeric - Auto will heuristically select the most appropriate encoding type for each numeric variable. This selection is based on a simple heuristic that considers the following conditions:
- If a numeric variable has fewer than 100 distinct values and more than 99.9% of values are non-rare, it will be encoded using 'Numeric - Discrete.'
- If a numeric variable has three or fewer digit positions, it will be encoded using 'Numeric - Digit.'
- For all other cases, 'Numeric - Binned' encoding will be applied.
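The heuristic above can be sketched in Python. Note that this is only an illustration of the described conditions, not the platform's actual implementation; in particular, the definition of "non-rare" used here (values occurring at least twice) is an assumption:

```python
import pandas as pd

def auto_numeric_encoding(series: pd.Series, rare_min_count: int = 2) -> str:
    # Condition 1: fewer than 100 distinct values, >99.9% of values non-rare.
    counts = series.value_counts()
    non_rare_share = counts[counts >= rare_min_count].sum() / len(series)
    if series.nunique() < 100 and non_rare_share > 0.999:
        return "Numeric - Discrete"
    # Condition 2: three or fewer digit positions (ignoring sign and decimals).
    max_digits = len(str(int(series.abs().max())))
    if max_digits <= 3:
        return "Numeric - Digit"
    # All other cases.
    return "Numeric - Binned"

print(auto_numeric_encoding(pd.Series([0, 1] * 500)))    # Numeric - Discrete
print(auto_numeric_encoding(pd.Series([5, 17, 123])))    # Numeric - Digit
print(auto_numeric_encoding(pd.Series(range(1, 1001))))  # Numeric - Binned
```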
In more detail, the three encoding types for numerical variables are:
- Numeric - Digit: The numerical variables will be recognized as numbers.
- Numeric - Discrete: In many cases, and depending on the use case, numerical variables might be better considered categorical variables. Choose this encoding type if a numeric variable is actually better represented as categorical. Note: With this encoding type, our Rare Category Protection feature is not in use. Instead, we create all the discrete numerical values.
- Example: a binary column with values 0 and 1. Although stored as digits, it is better represented as categorical.
- Example: a postcode column with values 1234, 4563, 7635, 9836, etc. A postcode does not make sense as a numeric variable; it should be a category, as it reflects a specific area of a city.
- Numeric - Binned: For large integers, MOSTLY AI will bin your numerical column into quantiles and consider these as categories during training. Then from each bin, a number will be sampled to generate the final numerical column in the synthetic dataset.
Consider you're working with a dataset containing a numerical column that consists of large integer values, such as "Number of Products Sold."
- Initial Data: Your original dataset includes the "Number of Products Sold" column, which contains a wide range of integer values, from 1,000 to 1,000,000.
- Binning: During preprocessing, MOSTLY AI employs binning. It divides the range of integer values into quantiles, creating several bins that group similar values together. For instance, the bins might be defined as
- "Low Sales": 1,000 - 10,000
- "Moderate Sales": 10,001 - 100,000, and
- "High Sales": 100,001 - 1,000,000
- Categorization and Sampling: These bins effectively transform the numerical data into categorical information. The model then treats these categories as distinct during training. For synthetic data generation, the model samples a value from each bin. For instance, it might sample 1,500 from "Low Sales", 50,000 from "Moderate Sales", and 900,000 from "High Sales".
- Synthetic Data: The generated synthetic dataset reflects the same pattern of distribution as the original data but ensures privacy and data protection. The "Number of Products Sold" column in the synthetic dataset now comprises sampled values from each bin, maintaining the overall distribution.
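The binning-and-sampling idea can be sketched with pandas' `qcut`. The three bin labels follow the example above, while the sampling rule used here (uniform within the bin's observed range) is a simplification for illustration, not the platform's exact procedure:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical "Number of Products Sold" column with a wide integer range.
sold = pd.Series(rng.integers(1_000, 1_000_001, size=10_000))

# Bin into quantiles (3 bins here for illustration).
bins = pd.qcut(sold, q=3, labels=["Low Sales", "Moderate Sales", "High Sales"])

# To generate a value, sample within the chosen bin's observed range.
low, high = sold[bins == "Low Sales"].min(), sold[bins == "Low Sales"].max()
synthetic_value = int(rng.integers(low, high + 1))
print(low <= synthetic_value <= high)  # True
```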
High cardinality columns contribute to an increase in training time.
Often low-quality data such as typos in the categories can increase the cardinality of a categorical variable. We recommend solving such issues prior to creating a generator as part of your data preparation step. By doing so, you increase the quality of your synthetic dataset as well.
Imagine that we have a column that holds information about ‘Types of Diabetes’. The correct categories in this case would be:
- Type 1
- Type 2
Instead due to typos in the column’s content, we might see additional categories, such as:
- Type A
- Type B
- Diabetes 1
For the above example, instead of two categories, the cardinality of the column is five, and each additional typo variant increases it further.
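Such typo-driven cardinality can be fixed during data preparation with a simple mapping. The mapping below (which typo corresponds to which correct category) is an assumption made for illustration:

```python
import pandas as pd

# Hypothetical column where typos inflate the cardinality from 2 to 5.
s = pd.Series(["Type 1", "Type 2", "Type A", "Type B", "Diabetes 1", "Type 1"])

# Map known typo variants back to the correct categories before
# creating a generator (the mapping itself is assumed here).
corrections = {"Type A": "Type 1", "Type B": "Type 2", "Diabetes 1": "Type 1"}
cleaned = s.replace(corrections)

print(sorted(cleaned.unique()))  # ['Type 1', 'Type 2']
```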
Many times, a subject table contains personal information such as the name, last name, and email address of an individual. In this case, MOSTLY AI offers the option of generating mock data. The benefit of this feature is that generating mock data does not contribute to an increase in training time. The platform does not learn any patterns or correlations with other columns but instead randomly generates mock data. Hence, the training time of a generator decreases substantially when you opt to use this feature for specific columns.
Note: These columns are created independently, so no relationships with other columns are learned.
- Name and last name of individuals
- You can easily generate these with our Person data type, which produces random names from the English language
- With our Email data type, you can generate highly realistic but random emails that contain the domain name of a free email service provider
- We often see synthetic data training with a table where a column contains a constant value. In this case, our Constant mock data type excludes the column from training and speeds up the process of building a generator. The corresponding column in the synthetic dataset is then assigned the constant value that you provide
MOSTLY AI supports more mock data types. For more information, see Mock data.
We refer to sequence length as the number of linked records belonging to each data subject. For instance, for the example of customers and transactions, sequence length is the number of transactions that a customer performs within a given timeframe.
The longer the sequence lengths in the linked table, the longer it will take to create a synthetic dataset.
- Before creating a generator, we recommend performing an exploratory analysis of the sequence lengths of your subjects. Check whether the distribution of the sequence length is skewed because some subjects have many events. If so, treat those subjects as outliers and consider removing them from the dataset.
- Sequence lengths above 1,000 are not recommended, as the training time of our algorithm might increase substantially. This is why we trim sequence lengths at 1,000 events per subject by default. However, this can be configured manually.
Tips and tricks
- Statistical Significance: We suggest taking the 99.5th percentile of the sequence length distribution and setting this as a limit when configuring a generator through MOSTLY AI. By doing so, you work with a value that represents the vast majority of sequences while excluding extreme outliers. This ensures your synthetic data reflects the most common patterns. The setting can be found in our Advanced Table Settings under the parameter ‘Max records per subject’.
- Balancing Accuracy: While long sequences might provide complex insights, they can also complicate processing. The 99.5 percentile serves as a balanced compromise that retains accuracy without introducing unnecessary complexity.
By limiting the number of records per subject, the computational resources required to process the dataset and train the Generative AI model will be reduced.
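This tip can be sketched with pandas and NumPy. The column name and the randomly generated event counts below are hypothetical; the point is how to derive the 99.5th-percentile cap from a linked table:

```python
import numpy as np
import pandas as pd

# Hypothetical linked table: one row per event, with a foreign key to the subject.
rng = np.random.default_rng(1)
linked = pd.DataFrame({"customer_id": rng.integers(0, 1_000, size=50_000)})

# Sequence length = number of linked records per subject.
seq_len = linked.groupby("customer_id").size()

# Use the 99.5th percentile as the 'Max records per subject' setting.
cap = int(np.percentile(seq_len, 99.5))
print(cap)
```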
Sequence length optimization and preprocessing for linked tables
Optimizing sequence length for sequential data and applying preprocessing techniques are crucial steps in preparing your data for efficient synthesis. Here are some considerations and steps to achieve this:
- Sequence Length Limitation: Chopping sequences longer than 1,500 events is a reasonable approach, especially if extremely long sequences are outliers that may not provide significant insights. Remember to also set the “maximum sequence length” parameter during training to your chosen value; it is 1,000 by default.
- Thinning Sequences: To further reduce sequence length, thinning the data is a smart strategy. For example, if your sequential events are collected every 5 seconds but you only need data every 20 seconds, you can select every fourth data point while retaining temporal correlation. Pandas offers a resampling function which can, for example, take the mean value of all data points in the chosen interval.
- Plausibility Testing: Applying plausibility tests to remove events with implausible values can enhance data quality. Removing data points that indicate unrealistic scenarios or other anomalies can lead to more accurate insights.
- Granularity vs. Temporal Correlation: There is a trade-off between event granularity and capturing temporal correlation. This trade-off is significant and can vary depending on your use case. It's wise to consider both fine-grained events for short time spans and meaningful events across longer periods to strike a balance.
- Subject-Specific Considerations: Keep in mind that the optimal sequence length can vary for different subjects. Some subjects might have more meaningful events within a shorter timeframe, while others might have longer, less frequent events. Tailor the sequence length to each subject as needed.
- Iterative Approach: Consider an iterative approach to find the right balance. Experiment with different preprocessing techniques and sequence lengths to identify what works best for your specific use case.
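The thinning step above can be sketched with pandas' resampling. The 5-second collection interval, the index, and the value column are hypothetical:

```python
import pandas as pd

# Hypothetical sensor-style events collected every 5 seconds.
idx = pd.date_range("2023-01-01", periods=12, freq="5s")
events = pd.DataFrame({"value": range(12)}, index=idx)

# Thin to one point every 20 seconds by taking the mean of each interval,
# reducing the sequence length by a factor of 4 while keeping temporal correlation.
thinned = events.resample("20s").mean()

print(len(thinned))  # 3
```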
Impact of Training Parameters on Model Training
When training a synthetic data generator using MOSTLY AI, several key parameters, including maximum training epochs, model size, and batch size, play a crucial role in determining accuracy, training times, and overall performance.
- To ensure an optimal training experience, MOSTLY AI recommends keeping the default settings for advanced parameters. Our product is designed to select the best combination of settings for your dataset.
- In case you would like to change the defaults it is important to experiment with different values for these parameters. Finding the right combination that balances accuracy and training times is often achieved through trial and error.
Below is a breakdown of how these parameters affect different aspects of the training process:
- Maximum Training Epochs
- Effect on Accuracy: Increasing the maximum training epochs can lead to better accuracy initially, as the model has more opportunities to fine-tune its parameters.
- Effect on Training Times: Higher maximum epochs result in longer training times. Models need to process the entire dataset more times, which can be time-consuming. Striking the right balance between accuracy and training times is crucial.
- Model Size
- Effect on Accuracy: A larger model with more parameters can capture more complex patterns in the data, leading to improved accuracy.
- Effect on Training Times: Larger models generally require more training time due to the increased number of parameters that need adjustment. Training a larger model can lead to longer iterations and a slower convergence rate.
- Batch Size:
- Effect on Accuracy: Smaller batch sizes can lead to better generalization as the model experiences more diverse examples during training. Larger batch sizes might result in faster convergence initially, but they can lead to poorer generalization.
- Effect on Training Times: Larger batch sizes can accelerate training times by allowing more data to be processed in parallel. However, this comes at the cost of memory consumption. Smaller batch sizes can extend training times due to the frequency of parameter updates.
Identifying the source of data quality or privacy issues can be very difficult. Below is a list of common issues.
- Bad univariate fit
- High number of N/As
- High amount of rare category labels
- Incorrect encoding type
- Incorrect sequence length
- Too high batch size on linked table
- High number of business rules violations
- Training goal is set to Speed or Turbo instead of Accuracy
- A high amount of NaN values can make the Identical match share fail.
- Privacy tests can contain false positives because of sampling and stochastic tests.
- If accuracy is good, it makes sense to repeat the synthesization and test privacy again.
Numerical encoding of categorical values
Values such as ZIP codes can be indistinguishable from continuous values. This results in the generation of invalid ZIP codes and business rules that are difficult to learn.
The solution is to change the encoding type to Categorical.
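Alongside changing the encoding type in the platform, you can make the intent explicit in your data during preparation. A minimal sketch with a hypothetical ZIP column:

```python
import pandas as pd

# Hypothetical column of ZIP codes read in as integers.
df = pd.DataFrame({"zip": [1234, 4563, 7635, 9836]})

# Cast to string so the column is treated as categorical, not continuous.
df["zip"] = df["zip"].astype(str)
print(df["zip"].tolist())  # ['1234', '4563', '7635', '9836']
```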
Incorrect datetime format
Incorrect formatting of a date column results in it being encoded as a categorical column. Below is an example of an incorrectly formatted, and thus incorrectly encoded, date column.
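Such a column can be repaired during data preparation by parsing it with an explicit format and writing it back in ISO format. The column name and input format below are hypothetical:

```python
import pandas as pd

# Hypothetical date column stored in a non-standard day/month/year text format.
df = pd.DataFrame({"signup_date": ["02/01/2023", "15/03/2023", "28/11/2023"]})

# Parse with an explicit format, then write back as ISO dates (YYYY-MM-DD)
# so the column can be detected as a Datetime column.
df["signup_date"] = (
    pd.to_datetime(df["signup_date"], format="%d/%m/%Y").dt.strftime("%Y-%m-%d")
)
print(df["signup_date"].tolist())  # ['2023-01-02', '2023-03-15', '2023-11-28']
```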
Bivariate relationships are lost
If the bivariate relationship between two columns is lost (left example) or weakened (right example) in the synthetic data, MOSTLY AI recommends that you increase the number of training samples and ensure that the training goal is set to Accuracy.