Rebalance your data

Use rebalancing to change the distribution of a categorical variable in a dataset. Rebalancing affects more than the single variable you choose to rebalance: all other variables in the dataset shift as well, according to their correlations with that variable. For example, you can use rebalancing to generate a large number of relevant business scenarios from the few that are present in your original data, to simulate what-if scenarios based on your existing historical data, or to upsample minority classes so that downstream machine learning algorithms can pick up their patterns.
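The effect on correlated variables can be illustrated with a small calculation. The sketch below is not how MOSTLY AI generates data; it only assumes that the relationship between the rebalanced column and a second column stays fixed while the share of each category changes. The `region` column, its values, and all numbers are invented for illustration.

```python
# Toy illustration: shifting the share of one categorical variable
# also shifts the marginal distribution of a correlated variable.
# All names and numbers here are hypothetical.

# Share of each region within each age group (each row sums to 1).
REGION_GIVEN_AGE = {
    "25-30": {"urban": 0.8, "rural": 0.2},
    "30-40": {"urban": 0.4, "rural": 0.6},
}


def marginal_shares(age_shares, conditional):
    """Combine age group shares with per-group region shares."""
    result = {}
    for age, age_share in age_shares.items():
        for value, prob in conditional[age].items():
            result[value] = result.get(value, 0.0) + age_share * prob
    return result


original = marginal_shares({"25-30": 0.25, "30-40": 0.75}, REGION_GIVEN_AGE)
rebalanced = marginal_shares({"25-30": 0.75, "30-40": 0.25}, REGION_GIVEN_AGE)

for value in sorted(original):
    print(value, round(original[value], 2), "->", round(rebalanced[value], 2))
```

In this toy setup, raising the share of the younger group raises the share of `urban` rows, even though only the age distribution was changed directly.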

💡

Rebalancing is only available for subject tables and can only be applied to a single column with the categorical encoding type.

💡

Rebalancing cannot guarantee an increase in downstream model performance. Upsampling or downsampling a class in the original dataset helps the downstream model only in some cases.

In the steps below we demonstrate the rebalancing feature with an Insurance policy dataset, where we change the distribution of the age group variable.

Steps

  1. Click Create synthetic data.
  2. Download the Insurance policy dataset, upload it, and click Proceed.
  3. Click the Data settings tab, scroll down to the age_bins column, and click the cog icon to open the column settings drawer.
  4. Configure the column settings as shown in the screenshot below and click Save. Make sure that the category names you enter exactly match the names in the original data. If you enter a category that does not exist in the original data, that category and its percentage are ignored. The percentages you enter do not need to add up to 100%. MOSTLY AI keeps the share of any category you did not define as is.
  5. Click Create a synthetic dataset.
  6. On the Output settings tab, select Download as CSV/Parquet and click Create a synthetic dataset again.
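The rules in step 4 can be sketched as a small function. This is an illustration, not MOSTLY AI's implementation. It assumes that "keeping the share as is" means the undefined categories keep their proportions relative to each other and are rescaled to fill whatever percentage is left. The function name, the age groups other than 25-30 and 30-40, and all shares are hypothetical.

```python
def resolve_target_shares(original_shares, requested):
    """Sketch of how requested percentages might map to a full distribution.

    original_shares: category -> share in the original data (sums to 1).
    requested: category -> target share entered by the user (0..1).
    """
    # Categories that do not exist in the original data are ignored.
    valid = {c: p for c, p in requested.items() if c in original_shares}

    fixed_total = sum(valid.values())
    if fixed_total > 1.0 + 1e-9:
        raise ValueError("requested shares exceed 100%")

    # Assumption: the remaining categories keep their relative
    # proportions and are rescaled to fill the remaining share.
    rest = {c: s for c, s in original_shares.items() if c not in valid}
    rest_total = sum(rest.values())
    remaining = 1.0 - fixed_total

    result = dict(valid)
    for category, share in rest.items():
        result[category] = remaining * share / rest_total if rest_total else 0.0
    return result


# Hypothetical original distribution of age_bins.
original = {"18-25": 0.1, "25-30": 0.2, "30-40": 0.4, "40+": 0.3}

# "unknown" does not exist in the original data, so it is dropped.
target = resolve_target_shares(original, {"25-30": 0.5, "unknown": 0.2})
print(target)
```

Here only 25-30 is defined, so the other three groups share the remaining 50% in the same ratio they had in the original data.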

Result

The charts below show the resulting rebalanced data. The share of the age group 25-30 in the synthetic dataset has increased, whereas the share of the age group 30-40 has decreased. To review the charts and the distribution of each variable, open the Data QA report and check the Univariate distributions.

You will notice there that the distributions of some other variables in the dataset, for example incident_city or insured_education_level, have also changed. These changes reflect what this simulated group of, on average, younger insurance policy holders would look like.
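If you downloaded the synthetic data as CSV, the same comparison can be made outside the QA report. The following is a minimal sketch that uses only the Python standard library. The helper names are hypothetical, and the file paths you pass to `shares_from_csv` depend on where you saved the original and synthetic files.

```python
import csv
from collections import Counter


def category_shares(rows, column):
    """Return the share of each distinct value in one column."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def share_shift(original, synthetic):
    """Difference in share (synthetic minus original) per value."""
    values = sorted(set(original) | set(synthetic))
    return {v: synthetic.get(v, 0.0) - original.get(v, 0.0) for v in values}


def shares_from_csv(path, column):
    """Read a CSV file with a header row and compute shares for one column."""
    with open(path, newline="") as handle:
        return category_shares(csv.DictReader(handle), column)
```

Calling `share_shift` on the shares of age_bins shows the change you configured. Calling it on a column such as incident_city shows how a correlated variable moved along with it.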
