With the Generation mood setting in MOSTLY AI, you can configure how strictly the generated synthetic data samples follow the distribution of the original data.
You define the generation mood on a column of data. You can do so for a single, multiple, or all columns in a table. For each column, you can select a different generation mood.
Generation mood provides a granular selection of options that you can use to fine-tune how conservatively or creatively MOSTLY AI generates synthetic data samples.
By default, MOSTLY AI applies a Representative generation mood to all columns. With the Representative option, the generated data samples mimic the distribution of samples in the original data and they accurately represent the common as well as any of the less common data samples from the original dataset.
A conservative generation mood results in the increased generation of data samples that are similar or close to the more common ones in the original data. With a conservative generation mood, you can prioritize improved rule adherence over the accuracy of the generated data. In particular, a mild conservative mood can be an effective setting to eliminate impossible combinations without significantly distorting the representativeness of the data.
A creative generation mood boosts the generation of samples that are less common or even uncommon in the original data. In statistical terms, the creative generation mood boosts the generation of outliers at the expense of common data samples. You can use a creative generation mood to intentionally dilute business-sensitive information that is part of your original data. You can then use the generated synthetic data to stress-test systems with novel, unusual, but at the same time plausible data samples.
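As a rough mental model of the three moods, the sketch below sharpens or flattens a small categorical distribution with a temperature-style exponent. This text does not describe how MOSTLY AI implements generation mood, so the `reweight` function, the temperature values, and the age-band probabilities are all illustrative assumptions, not the product's method.

```python
def reweight(probs, temperature):
    """Rescale a categorical distribution.

    temperature < 1 sharpens it (common values gain weight),
    temperature > 1 flattens it (rare values gain weight),
    temperature == 1 leaves it unchanged.
    This is an analogy only, not MOSTLY AI's actual mechanism.
    """
    scaled = {k: p ** (1.0 / temperature) for k, p in probs.items()}
    total = sum(scaled.values())
    return {k: v / total for k, v in scaled.items()}


# Hypothetical shares of age bands in some original dataset.
original = {"20-60": 0.80, "under 20": 0.12, "over 60": 0.08}

conservative = reweight(original, temperature=0.5)    # favors common values
representative = reweight(original, temperature=1.0)  # mirrors the original
creative = reweight(original, temperature=2.0)        # favors uncommon values

for name, dist in [("conservative", conservative),
                   ("representative", representative),
                   ("creative", creative)]:
    print(name, {k: round(v, 3) for k, v in dist.items()})
```

In this toy model, the conservative setting concentrates probability on the 20-60 band, while the creative setting shifts probability toward the two rarer bands.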
You can think of a common data sample as a subject (or a row of data) that has a combination of values which appear more frequently among the rest of the subjects.
For a concrete example, you can examine the UCI Adult dataset.
If you consider only one column, for example, the age of subjects, the UCI Adult dataset contains subjects between the ages of 17 and 86. Examining further, you can establish that most of the subjects are aged between 20 and 60. With this in mind, you can think of a common data sample in the UCI Adult dataset as a subject whose age is between 20 and 60.
If you consider more columns (such as gender, race, marital status, education, and working hours), a common data sample can be one of the following:
- 52 year old, Male, White, Married, Bachelor's degree, working 40h week
- 40 year old, Female, White, Married, Bachelor's degree, working 40h week
In terms of the UCI Adult dataset, uncommon data samples are subjects that have a combination of values that appear less frequently, for example:
- 85 year old, Female, White, Married, Master's degree, working 40h week
- 17 year old, Male, White, Married, Grade 11, working 40h week
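One simple way to make "common" and "uncommon" concrete is to count how often each combination of values occurs. The sketch below does this on a handful of made-up rows. The rows, the chosen columns, and the `combination_frequency` helper are hypothetical and are not taken from the UCI Adult dataset or from MOSTLY AI.

```python
from collections import Counter

# Hypothetical toy rows: (age band, gender, marital status, education).
rows = [
    ("20-60", "Male", "Married", "Bachelors"),
    ("20-60", "Male", "Married", "Bachelors"),
    ("20-60", "Female", "Married", "Bachelors"),
    ("20-60", "Male", "Married", "Bachelors"),
    ("20-60", "Female", "Married", "Bachelors"),
    ("over 60", "Female", "Married", "Masters"),
    ("under 20", "Male", "Married", "11th"),
]


def combination_frequency(rows):
    """Return the relative frequency of each distinct value combination."""
    counts = Counter(rows)
    n = len(rows)
    return {combo: count / n for combo, count in counts.items()}


freq = combination_frequency(rows)

# Sort from most to least frequent combination.
ranked = sorted(freq.items(), key=lambda item: item[1], reverse=True)
most_common, least_common = ranked[0], ranked[-1]

print("most common: ", most_common)
print("least common:", least_common)
```

Combinations at the top of the ranking correspond to common data samples, and combinations that occur only once correspond to uncommon ones.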
The following diagram gives examples of how generation mood relates to data samples in the UCI Adult dataset.
You can set the generation mood from the Data settings page.
- On the Home page, click Start for the UCI Adult dataset.
- Select the Data settings tab.
- Set the generation mood on all columns. Start by clicking Edit multiple columns.
- Select all columns by selecting the check box in the upper left of the table.
- In the Generation mood column, select one of the options. For example, Extremely boring.
- Click OK in the confirmation dialog box.
- Click Return to column list in the upper right.
- Click Create a synthetic dataset.
When the generation job finishes, the synthetic data contains the distribution of data samples based on the generation mood you selected.
To check how the generation mood influenced the distribution of data samples in the synthetic dataset, use the Data QA report. For example, you can examine the Age chart under Univariate distributions.
- From the Synthetic datasets tab, open the completed synthetic dataset.
- Select QA report from the sidebar on the right.
- Scroll down and select Data QA report.
- Select Univariate distributions.
- Examine the Age chart.
When you examine the Age distribution chart, you can see the impact of the Extremely boring generation mood setting on the Age column of the UCI Adult dataset.
The green line shows the distribution of the synthetic data, while the black line shows the distribution of the original data.
With an Extremely boring generation mood, the number of data samples that are considered inliers is boosted. This means that MOSTLY AI generated a higher proportion of subjects with ages in the range of 20 to 60 than appears in the original data.
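If you export both datasets, you can also quantify this shift outside the QA report. The following is a minimal sketch that compares the share of subjects aged 20 to 60 in two lists of ages. The age values and the `share_in_range` helper are hypothetical placeholders, not real output from MOSTLY AI.

```python
def share_in_range(ages, low, high):
    """Return the fraction of ages that fall within [low, high], inclusive."""
    if not ages:
        raise ValueError("ages must not be empty")
    return sum(low <= age <= high for age in ages) / len(ages)


# Hypothetical age samples; replace them with the Age column of your own data.
original_ages = [17, 19, 23, 31, 38, 45, 52, 58, 67, 85]
synthetic_ages = [22, 25, 29, 33, 36, 41, 44, 47, 53, 64]

original_share = share_in_range(original_ages, 20, 60)
synthetic_share = share_in_range(synthetic_ages, 20, 60)

print(f"original share aged 20-60:  {original_share:.0%}")
print(f"synthetic share aged 20-60: {synthetic_share:.0%}")
```

With a conservative mood such as Extremely boring, you would expect the synthetic share to be higher than the original share, as it is for these placeholder values.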