Synthetic data holds the promise of addressing the underrepresentation of minority classes in tabular data sets by adding new, diverse, and highly realistic synthetic samples. In this post, we'll benchmark AI-generated synthetic data for upsampling highly unbalanced tabular data sets. Specifically, we compare the performance of predictive models trained on data sets upsampled with synthetic records to that of well-known upsampling methods, such as naive oversampling or SMOTE-NC.
Our experiments are conducted on multiple data sets and different predictive models. We demonstrate that synthetic data can improve predictive accuracy for minority groups as it creates diverse data points that fill gaps in sparse regions in feature space.
Our results highlight the potential of synthetic data upsampling as a viable method for improving predictive accuracy on highly unbalanced data sets. We show that upsampled synthetic training data consistently results in top-performing predictive models, in particular for mixed-type data sets containing a very low number of minority samples, where it outperforms all other upsampling techniques.
AI-generated synthetic data, which we refer to as synthetic data throughout, is created by training a generative model on the original data set. In the inference phase, the generative model creates statistically representative, synthetic records from scratch.
The use of synthetic data has gained increasing importance in various industries, particularly due to its primary use case of enhancing data privacy. Beyond privacy, synthetic data offers the possibility to modify and tailor data sets to our specific needs. In this blog post, we investigate the potential of synthetic data to improve the performance of machine learning algorithms on data sets with unbalanced class distributions, specifically through the synthetic upsampling of minority classes.
Class imbalance is a common problem in many real-world tabular data sets where the number of samples in one or more classes is significantly lower than the others. Such imbalances can lead to poor prediction performance for the minority classes, often of greatest interest in many applications, such as detecting fraud or extreme insurance claims.
Traditional upsampling methods, such as naive oversampling or SMOTE, have shown some success in mitigating this issue. However, the effectiveness of these methods is often limited, and they may introduce biases in the data, leading to poor model performance. In recent years, synthetic data has emerged as a promising alternative to traditional upsampling methods. By creating highly realistic samples for minority classes, synthetic data can significantly improve the accuracy of predictive models.
While upsampling methods like naive oversampling and SMOTE are effective in addressing unbalanced data sets, they also have their limitations. Naive oversampling mitigates class imbalance effects by simply duplicating minority class examples. Because it only repeats existing records, it bears the risk of overfitting the model to the training data, resulting in poor generalization in the inference phase.
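To make the duplication strategy concrete, here is a minimal sketch of naive oversampling in plain Python. The function name, the toy records, and the column names are hypothetical and are not taken from the benchmark code; the point is only that balancing by duplication adds rows without adding any new feature values.

```python
import random
from collections import Counter

def naive_oversample(records, label_key, minority_label, seed=0):
    """Duplicate randomly chosen minority records until both classes are equal in size."""
    rng = random.Random(seed)
    minority = [r for r in records if r[label_key] == minority_label]
    majority = [r for r in records if r[label_key] != minority_label]
    # Sampling with replacement: every added record is an exact copy of an existing one.
    extra = [dict(rng.choice(minority)) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

# Hypothetical toy data: 97 majority records and 3 minority records.
toy = [{"age": 20 + i % 40, "income": "low"} for i in range(97)]
toy += [{"age": a, "income": "high"} for a in (35, 47, 52)]

balanced = naive_oversample(toy, "income", "high")
counts = Counter(r["income"] for r in balanced)
distinct_minority_ages = {r["age"] for r in balanced if r["income"] == "high"}
print(counts, distinct_minority_ages)
```

After balancing, the minority class has as many rows as the majority class, but it still contains only the three original ages.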
SMOTE, on the other hand, generates new records by interpolating between existing minority-class samples, leading to higher diversity. However, SMOTE’s ability to increase diversity is limited when the absolute number of minority records is very low. This is especially true when generating samples for mixed-type data sets containing categorical columns. For mixed-type data sets, SMOTE-NC is commonly used as an extension for handling categorical columns.
SMOTE-NC may not work well with non-linear decision boundaries, as it only linearly interpolates between minority records. This can lead to SMOTE-NC examples being generated in an “unfavorable” region of feature space, far from where additional samples would help the predictive model place a decision boundary.
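The interpolation idea can be sketched in a few lines. This is a simplified illustration for numeric features only, not the SMOTE or SMOTE-NC implementation used in the benchmark: real SMOTE picks the interpolation partner among the k nearest minority neighbours, whereas this sketch picks a random partner, and all names and toy points are hypothetical.

```python
import random

def interpolate(a, b, rng):
    """Return a random point on the straight line segment between records a and b."""
    gap = rng.random()  # uniform in [0, 1)
    return [x + gap * (y - x) for x, y in zip(a, b)]

def smote_like(minority, n_new, seed=0):
    """Create n_new points, each interpolated between two randomly chosen minority records."""
    rng = random.Random(seed)
    new_points = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        new_points.append(interpolate(a, b, rng))
    return new_points

# Hypothetical minority class with only three records in a 2D feature space.
minority = [[1.0, 1.0], [3.0, 1.0], [2.0, 4.0]]
synthetic = smote_like(minority, n_new=100)

# Every generated point stays inside the bounding box of the original records.
lo = [min(p[d] for p in minority) for d in range(2)]
hi = [max(p[d] for p in minority) for d in range(2)]
inside = all(lo[d] <= p[d] <= hi[d] for p in synthetic for d in range(2))
print(inside)
```

However many points are generated, they all lie on line segments between the few existing minority records, which is why diversity stays limited when the minority class is tiny.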
All these limitations highlight the need for exploring alternative upsampling methods, such as synthetic data upsampling, that can overcome these challenges and improve the accuracy of minority group predictions.
The strength of upsampling minority classes with AI-generated synthetic data is that the generative model is not limited to duplicating or interpolating between existing minority samples. Most AI-based generators can create realistic synthetic data examples in any region of feature space and, thus, considerably increase diversity. Because they are not tied to existing minority samples, AI-based generators can also leverage and learn from the properties of (parts of) the majority samples that are transferable to minority examples.
An additional strength of using AI-based upsampling is that it can be easily extended to more complex data structures, such as sequential data, where not only one but many rows in a data set belong to a single data subject. This aspect of synthetic data upsampling is, however, out of the scope of this study.
In this post, we present a comprehensive benchmark study comparing the performance of predictive models trained on unbalanced data upsampled with AI-generated synthetic data, naive upsampling, and SMOTE-NC upsampling. Our experiments are carried out on various data sets and using different predictive models.
For every data set we use in our experiments, we run through the following steps (see Fig. 1):
We run four publicly available data sets (Figure 1) of varying sizes through steps 1–5: Adult, Credit Card, Insurance, and Census (Kohavi and Becker). All data sets tested are of mixed type (categorical and numerical features) with a binary, that is, a categorical target column.
In step 2 (Fig. 1), we downsample minority classes to induce strong imbalances. For the smaller data sets with ~30k records, downsampling to minority-class fractions of 0.1% results in extremely low numbers of minority records.
The downsampled Adult and Credit Card unbalanced training data sets contain as few as 19 and 18 minority records, respectively. This scenario mimics situations where data is limited and extreme cases occur rarely. Such setups create significant challenges for predictive models, as they may encounter difficulty making accurate predictions and generalizing well on unseen data.
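The downsampling step can be sketched as follows. This assumes minority records are dropped uniformly at random while all majority records are kept, which the post does not spell out; the function name and the toy data set are hypothetical.

```python
import random

def downsample_minority(records, is_minority, target_fraction, seed=0):
    """Drop minority records at random until they make up target_fraction of the data."""
    rng = random.Random(seed)
    minority = [r for r in records if is_minority(r)]
    majority = [r for r in records if not is_minority(r)]
    # Solve n_min / (n_min + n_maj) = target_fraction for n_min.
    n_keep = round(target_fraction * len(majority) / (1 - target_fraction))
    n_keep = max(1, min(n_keep, len(minority)))
    return majority + rng.sample(minority, n_keep)

# Hypothetical data set: 20,000 majority and 5,000 minority records.
data = [{"id": i, "target": int(i >= 20_000)} for i in range(25_000)]
unbalanced = downsample_minority(data, lambda r: r["target"] == 1, target_fraction=0.001)
n_minority = sum(r["target"] for r in unbalanced)
print(n_minority, n_minority / len(unbalanced))
```

With roughly 20,000 majority records, a 0.1% minority fraction leaves only about 20 minority records, which is the order of magnitude reported above for the smaller data sets.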
Please note that the holdout sets on which the trained predictive models are scored are not subject to extreme imbalances, as they are sampled from the original data before downsampling is applied. The imbalance ratios of the holdout sets are moderate and vary from 6% to 24%.
In the evaluation, we report both the AUC-ROC and the AUC-PR due to the moderate but inhomogeneous distribution of minority fractions in the holdout set. The AUC-ROC is a very popular and expressive metric, but it is known to be overly optimistic on unbalanced optimization problems. While the AUC-ROC considers both classes, making it susceptible to neglecting the minority class, the AUC-PR focuses on the minority class as it is built up by precision and recall.
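The different behaviour of the two metrics can be illustrated with a small, dependency-free sketch. It computes the AUC-ROC as the probability that a random positive is ranked above a random negative, and approximates the AUC-PR by average precision, which is one common estimator; the post does not state which estimator was used, so treat that choice as an assumption. The toy labels and scores are hypothetical.

```python
def auc_roc(y_true, y_score):
    """Probability that a random positive scores higher than a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, y_score):
    """Step-wise area under the precision-recall curve, one step per distinct threshold."""
    n_pos = sum(y_true)
    ap, tp, seen, prev_recall = 0.0, 0, 0, 0.0
    for threshold in sorted(set(y_score), reverse=True):
        hits = [y for y, s in zip(y_true, y_score) if s == threshold]
        tp += sum(hits)
        seen += len(hits)
        recall = tp / n_pos
        ap += (recall - prev_recall) * (tp / seen)
        prev_recall = recall
    return ap

# Hypothetical holdout with a 20% minority fraction.
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
uninformative = [0.5] * 10                                          # same score for everyone
perfect = [0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.9, 0.8]

print(auc_roc(y, uninformative), average_precision(y, uninformative))
print(auc_roc(y, perfect), average_precision(y, perfect))
```

For the uninformative model, the AUC-ROC sits at 0.5 regardless of class balance, while the average precision equals the minority fraction of the holdout set. This is why AUC-PR values are not directly comparable across holdout sets with different minority fractions.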
The largest differences between upsampling techniques are observed in the AUC-ROC when balancing training sets with a substantial class imbalance of 0.05% to 0.5%. This scenario involves a very limited number of minority samples, down to 19 for the Adult unbalanced training data set.
For the RF and the LGBM classifiers trained on the balanced hybrid data set, the AUC-ROC is larger than the ones obtained with other upsampling techniques. Differences can go up to 0.2 (RF classifier, minority fraction of 0.05%) between the AI-based synthetic upsampling and the second-best method.
The AUC-PR shows similar yet less pronounced differences. LGBM and XGB classifiers trained on the balanced hybrid data set perform best throughout almost all minority fractions. Interestingly, results for the RF classifier are mixed. Upsampling with synthetic data does not always lead to better performance, but it is always among the best-performing methods.
While synthetic data upsampling improves results through most of the minority fractions for the XGB classifier, too, the differences in performance are less pronounced. In particular, the XGB classifier trained on the highly unbalanced training data performs surprisingly well. This suggests that the XGB classifier is better suited for handling unbalanced data.
The performance differences in the AUC-ROC and AUC-PR are due to the low diversity and, consequently, overfitting when using naive or SMOTE-NC upsampling. These effects are visible in, e.g., the ROC and PR curves of the LGBM classifier for a minority fraction of 0.1% (Fig. 3).
Every point on these curves corresponds to a specific prediction threshold for the classifier. The set of threshold values is defined by the variance of probabilities predicted by the models when scored on the holdout set. For both the highly unbalanced training data and the naively upsampled one, we observe very low diversity, with more than 80% of the holdout samples predicted to have an identical, very low probability of belonging to the minority class.
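One simple way to quantify this effect is to count the distinct predicted probabilities, since each distinct value contributes one threshold and therefore one point on the curve. The sketch below uses made-up predictions, not the benchmark's model outputs, and the helper name is hypothetical.

```python
from collections import Counter

def threshold_diversity(probabilities, decimals=6):
    """Return (number of distinct predicted values, share of the most frequent value)."""
    rounded = [round(p, decimals) for p in probabilities]
    counts = Counter(rounded)
    top_share = counts.most_common(1)[0][1] / len(rounded)
    return len(counts), top_share

# Hypothetical holdout predictions: 85 samples share one low score, 15 are distinct.
collapsed = [0.01] * 85 + [0.10 + 0.05 * i for i in range(15)]
# Hypothetical predictions where every sample gets its own score.
diverse = [(i + 1) / 101 for i in range(100)]

print(threshold_diversity(collapsed))
print(threshold_diversity(diverse))
```

A model whose predictions collapse onto one value yields only a handful of curve points, while a model with diverse predictions yields roughly one point per holdout sample.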
In the plot of the PR curve, this leads to an accumulation of points in the area with high precision and low recall, which means that the model is very conservative in making positive predictions and only makes a positive prediction when it is very confident that the data point belongs to the positive, that is, the minority class. This demonstrates the effect of overfitting on a few samples in the minority group.
SMOTE-NC has a much higher but still limited diversity, resulting in a smoother PR curve which, however, still contains discontinuities and has a large segment where precision and recall change rapidly with small changes in the prediction threshold.
The hybrid data set offers high diversity during model training, resulting in almost every holdout sample being assigned a unique probability of belonging to the minority class. Both ROC and PR curves are smooth and have a threshold of ~0.5 at the center, the point that is closest to the perfect classifier.
The limited power in creating diverse samples in situations where the minority class is severely underrepresented stems from naive upsampling and SMOTE-NC being limited to duplicating and interpolating between existing minority samples. Both methods are bound to a limited region in feature space.
Upsampling with AI-based synthetic minority samples, on the other hand, can, in principle, populate any region in feature space and can leverage and learn from properties of the majority samples which are transferable to minority examples, resulting in more diverse and realistic synthetic minority samples.
We analyze the difference in diversity by further “drilling down” the minority class (feature “income” equals “high”) and comparing the distribution of the feature “education” for the female subgroup (feature “sex” equals “female”) in the upsampled data sets (fig. 4).
For a minority fraction of 0.1%, this results in only three female minority records. Naive upsampling and SMOTE-NC have a very hard time generating diversity in such settings. Both just duplicate the existing categories “Bachelors”, “HS-grad”, and “Assoc-acdm”, resulting in a strong distortion of the distribution of the “education” feature as compared to the distribution in the holdout data set.
The distribution of the hybrid data has some imperfections, too, but it recovers the holdout distribution to a much better degree. Many more “education” categories are populated, and, with a few exceptions, the frequencies of the holdout data set are recovered to a satisfactory level. This ultimately leads to a larger diversity in the hybrid data set than in the naively balanced or SMOTE-NC balanced one.
We quantitatively assess diversity with the Shannon entropy, which measures the variability within a data set, particularly for categorical data. It provides a measure of how uniformly the different categories of a specific feature are distributed within the data set.
The Shannon Entropy (SE) of a specific feature is defined as

$$SE = -\sum_{i=1}^{N} p(i)\,\log_2 p(i)$$
where p(i) represents the probability of occurrence, i.e. the relative frequency of category i. SE ranges from 0 to log2(N), where N is the total number of categories. A value of 0 indicates maximum certainty with only one category, while higher entropy implies greater diversity and uncertainty, indicating comparable probabilities p(i) across categories.
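As a quick illustration, the Shannon entropy of a categorical column can be computed in a few lines. This sketch assumes the standard base-2 definition, SE = -Σ p(i) log2 p(i), which matches the stated range of 0 to log2(N); the example category lists are hypothetical and are not the benchmark data.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Base-2 Shannon entropy of the relative category frequencies in values."""
    counts = Counter(values)
    total = len(values)
    # p * log2(1 / p) is the same as -p * log2(p), summed over all observed categories.
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Hypothetical "education" columns for a small subgroup.
one_category = ["Bachelors"] * 6
duplicated = ["Bachelors", "HS-grad", "Assoc-acdm"] * 4           # 3 categories, uniform
uniform_four = ["Bachelors", "HS-grad", "Masters", "Doctorate"]   # 4 categories, uniform

print(shannon_entropy(one_category))
print(shannon_entropy(duplicated))
print(shannon_entropy(uniform_four))
```

Duplicating the same three categories many times leaves the entropy capped at log2(3), no matter how many rows are added. Only new categories can raise it further.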
In Figure 5, we report the Shannon entropy for different features and subgroups of the high-income population. In all cases, data diversity is the largest for the holdout data set. The downsampled training data set (unbalanced) has a strongly reduced SE, especially when focusing on the small group of high-income women. Naive and SMOTE-NC upsampling cannot recover any of the diversity in the holdout as both are limited to the categories present in the minority class. In line with the results presented in the paragraph above, synthetic data recovers the SE, i.e., the diversity of the holdout data set, to a large degree.
The Credit Card data set has similar properties to the Adult data set. The number of records, features, and the original, moderate imbalance are comparable. This again results in a very small number of minority records (18) after downsampling to a 0.1% minority fraction.
The main difference between them is that Credit Card consists of more numeric features. The performance of different upsampling techniques on the unbalanced Credit Card training data set shows similar results to the Adult data set, too. AUC-ROC and AUC-PR for both LGBM and RF classifiers improve over naive upsampling and SMOTE-NC when using the hybrid data set.
Again, the performance of the XGB model is more comparable between the different balanced data sets and we find very good performance for the highly-unbalanced training data set. Here, too, the hybrid data set is always among the best-performing upsampling techniques.
Interestingly, SMOTE-NC performs worst across almost all metrics. This is surprising because we expect this data set, consisting mainly of numerical features, to be favorable for the SMOTE-NC upsampling technique.
The Insurance data set is larger than Adult and Credit Card, resulting in a larger number of minority records (268) when downsampling to the 0.1% minority fraction. This leads to a much more balanced performance between different upsampling techniques.
A notable difference in performance only appears for very small minority fractions. For minority fractions below 0.5%, both the AUC-ROC and AUC-PR of LGBM and XGB classifiers trained on the hybrid data set are consistently larger than for classifiers trained on other balanced data sets. The maximum performance gains, however, are smaller than those observed for “Adult” and “Credit Card”.
The Census data set has the largest number of features of all the data sets tested in this study. In particular, the 28 categorical features pose a challenge for SMOTE-NC, leading to poor performance in terms of AUC-PR.
As with the Insurance data set, the performance of the LGBM classifier severely deteriorates when trained on highly unbalanced data sets. On the other hand, the XGB model excels and performs very well even on unbalanced training sets.
The Census data set highlights the importance of carefully selecting the appropriate model and upsampling technique when working with data sets that have high dimensionality and unbalanced class distributions, as performances can vary a lot.
Upsampling with synthetic data mitigates this variance, as all models trained on the hybrid data set are among the best performers across all classifiers and ranges of minority fractions.
AI-based synthetic data generation can provide an effective solution to the problem of highly unbalanced data sets in machine learning. By creating diverse and realistic samples, upsampling with synthetic data generation can improve the performance of predictive models. This is especially true for cases where not only the minority fraction is low but also the absolute number of minority records is at a bare minimum. In such extreme settings, training on data upsampled with AI-generated synthetic records leads to better performance of prediction models than upsampling with SMOTE-NC or naive upsampling. Across all parameter settings explored in this study, synthetic upsampling resulted in predictive models which rank among the top-performing ones.
Data anonymization tools can be your best friends or your data quality’s worst enemies. Sometimes both. Anonymizing data is never easy, and it gets trickier when:
You try to do your best and use data anonymization tools on a daily basis. You have removed all sensitive information, masked the rest, and randomized for good measure. So, your data is safe now. Right?
As the Austrians—Arnold Schwarzenegger included—say: Schmäh! Which roughly translates as bullshit. Why do so many data anonymization efforts end up being Schmäh?
Data anonymization tools conveniently automate the process of data anonymization with the goal of making sure that no individual included in the data can be re-identified. The most ancient of data anonymization tools, namely aggregation and the now obsolete rounding, were born in the 1950s. The concept of adding noise to data as a way to protect anonymity entered the picture in the 1970s. We have come a long way since then. Privacy-enhancing technologies were born in the 90s and have been evolving since, offering better, safer, and more data-friendly data anonymization tools.
Data anonymization tools must constantly evolve since attacks are also getting more and more sophisticated. Today, new types of privacy attacks using the power of AI can reidentify individuals in datasets that are thought of as anonymous. Data privacy is a constantly shifting field with lots of moving targets and constant pressure to innovate.
Although a myriad of data anonymization tools exist, we can differentiate between two groups of data anonymization tools based on how they approach privacy in principle. Legacy data anonymization tools work by removing or disguising personally identifiable information, or so-called PII. Traditionally, this means unique identifiers, such as social security numbers, credit card numbers, and other kinds of ID numbers.
The trouble with these types of data anonymization tools is that no matter how much of the data is removed or modified, a 1:1 relationship between the data subject and the data points remains. With the advances of AI-based reidentification attacks, it’s getting increasingly easy to find this 1:1 relationship, even in the absence of obvious PII pointers. Our behavior, essentially a series of events, is almost like a fingerprint. An attacker doesn’t need to know my name or social security number if there are other behavior-based identifiers that are unique to me, such as my purchase history or location history. As a result, state-of-the-art data anonymization tools are needed to anonymize behavioral data.
Legacy data anonymization tools are often associated with manual work, whereas modern data privacy solutions incorporate machine learning and AI to achieve more dynamic and effective results. But let's have a look at the most common forms of traditional anonymization first.
Data masking is one of the most frequently used data anonymization approaches across industries. It works by replacing parts of the original data with asterisks or another placeholder. Data masking can reduce the value or utility of the data, especially if it's too aggressive. The data might not retain the same distribution or characteristics as the original, making it less useful for analysis.
The process of data masking can be complex, especially in environments with large and diverse datasets. The masking should be consistent across all records to ensure that the data remains meaningful. The masked data should adhere to the same validation rules, constraints, and formats as the original dataset. Over time, as systems evolve and new data is added or structures change, ensuring consistent and accurate data masking can become challenging.
The biggest challenge with data masking is deciding what to actually mask. Simply masking PII from data using Python, for example, still has its place, but the resulting data should not be considered anonymized by any stretch of the imagination. The problem is quasi-identifiers: combinations of attributes that, if left unprocessed, still allow re-identification in a masked dataset quite easily.
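A small Python sketch shows why masking alone falls short. The records, column names, and the `mask_email` helper are hypothetical. The direct identifier is masked, yet some rows remain unique on the combination of the remaining attributes.

```python
from collections import Counter

def mask_email(email):
    """Keep the first character of the local part and replace the rest with asterisks."""
    local, _, domain = email.partition("@")
    return local[:1] + "*" * (len(local) - 1) + "@" + domain

# Hypothetical customer records.
records = [
    {"email": "anna@example.com", "zip": "1010", "birth_year": 1985, "sex": "f"},
    {"email": "ben@example.com", "zip": "1010", "birth_year": 1985, "sex": "m"},
    {"email": "cara@example.com", "zip": "1020", "birth_year": 1990, "sex": "f"},
    {"email": "dan@example.com", "zip": "1020", "birth_year": 1990, "sex": "f"},
]
masked = [{**r, "email": mask_email(r["email"])} for r in records]

# Count how often each combination of quasi-identifiers occurs in the masked data.
def quasi_key(r):
    return (r["zip"], r["birth_year"], r["sex"])

combos = Counter(quasi_key(r) for r in masked)
unique_rows = [r for r in masked if combos[quasi_key(r)] == 1]
print(masked[0]["email"], len(unique_rows))
```

Anyone who knows a person's zip code, birth year, and sex can still single out the rows in `unique_rows`, even though every email address is masked.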
Pseudonymization is, strictly speaking, not an anonymization approach, as pseudonymized data is not anonymous data. However, it's very common, so we will explain it here. Pseudonymization replaces private identifiers with fake identifiers, or pseudonyms, or removes private identifiers altogether. While the data can still be matched with its source when one has the right key, it can't be matched without it. The 1:1 relationship remains and can be recovered not only by accessing the key but also by linking different datasets. The risk of reversibility is always high, and as a result, pseudonymization should only be used when it’s absolutely necessary to reidentify data subjects at a certain point in time.
The pseudonyms typically need a key for the transformation process. Managing, storing, and protecting this key is critical. If it's compromised, the pseudonymization can be reversed.
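One common way to implement keyed pseudonyms is a keyed hash. The sketch below uses HMAC-SHA256 from the Python standard library; this is an illustrative choice and only one of several options (a stored mapping table is another). The key, identifiers, and datasets are hypothetical.

```python
import hashlib
import hmac

def pseudonymize(identifier, key):
    """Map an identifier to a stable pseudonym using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability

key = b"hypothetical-secret-key"

# Two hypothetical datasets that contain the same person.
purchases = [{"ssn": "123-45-6789", "item": "book"}]
visits = [{"ssn": "123-45-6789", "clinic": "cardiology"}]

p1 = pseudonymize(purchases[0]["ssn"], key)
p2 = pseudonymize(visits[0]["ssn"], key)
other = pseudonymize("123-45-6789", b"another-key")
print(p1 == p2, p1 == other)
```

The same person receives the same pseudonym in both datasets, so the 1:1 relationship survives and records stay linkable. Whoever holds the key can also recompute the pseudonym for any known identifier, which is why protecting the key is critical.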
What’s more, under GDPR, pseudonymized data is still considered personal data, meaning that data protection obligations continue to apply.
Overall, while pseudonymization might be a common practice today, it should only be used as a stand-alone tool when absolutely necessary. Pseudonymization is not anonymization and pseudonymized data should never be considered anonymized.
Generalization reduces the granularity of the data. For instance, instead of displaying an exact age of 27, the data might be generalized to an age range, like 20-30. Generalization causes a significant loss of data utility by decreasing data granularity. Over-generalizing can render data almost useless, while under-generalizing might not provide sufficient privacy.
You also have to consider the risk of residual disclosure. Generalized data sets might contain enough information to infer about individuals, especially when combined with other data sources.
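The age example above can be sketched in a couple of lines. The function name, the bin widths, and the sample ages are hypothetical; the sketch only illustrates how the bin width controls the trade-off between detail and privacy.

```python
def generalize_age(age, width=10):
    """Replace an exact age with the range [lower, lower + width) it falls into."""
    lower = (age // width) * width
    return f"{lower}-{lower + width}"

ages = [23, 27, 31, 38, 45, 52]  # hypothetical exact ages

fine = [generalize_age(a, width=10) for a in ages]
coarse = [generalize_age(a, width=30) for a in ages]
print(generalize_age(27), sorted(set(fine)), sorted(set(coarse)))
```

Wider bins hide individuals in larger groups but leave fewer distinct values for analysis: six exact ages become four groups at a width of 10 and only two groups at a width of 30.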
Data swapping or perturbation describes the approach of replacing original data values with values from other records. The privacy-utility trade-off strikes again: perturbing data leads to a loss of information, which can affect the accuracy and reliability of analyses performed on the perturbed data. At the same time, however, the achieved privacy protection is not very high. Protecting against re-identification while maintaining data utility is challenging. Finding the appropriate perturbation methods that suit the specific data and use case is not always straightforward.
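A minimal sketch of data swapping is to shuffle one column across records. The function and the toy records are hypothetical, and real swapping schemes are usually more targeted (for example, swapping only within similar groups).

```python
import random

def swap_column(records, column, seed=0):
    """Shuffle the values of one column across all records."""
    rng = random.Random(seed)
    values = [r[column] for r in records]
    rng.shuffle(values)
    return [{**r, column: v} for r, v in zip(records, values)]

# Hypothetical records.
people = [
    {"id": 1, "age": 25, "salary": 31_000},
    {"id": 2, "age": 41, "salary": 58_000},
    {"id": 3, "age": 37, "salary": 47_000},
    {"id": 4, "age": 59, "salary": 72_000},
]
swapped = swap_column(people, "salary")

same_marginal = sorted(r["salary"] for r in swapped) == sorted(r["salary"] for r in people)
print(same_marginal, [(r["age"], r["salary"]) for r in swapped])
```

The overall salary distribution is untouched, but the relationship between salary and the other columns is scrambled, which is exactly the information loss described above.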
Randomization is a legacy data anonymization approach that changes the data to make it less connected to a person. This is done through adding random noise to the data.
Some data types, such as geospatial or temporal data, can be challenging to randomize effectively while maintaining data utility. Preserving spatial or temporal relationships in the data can be complex.
Selecting the right approach (i.e. what variables to add noise to and how much) to do the job is also challenging since each data type and use case could call for a different approach. Choosing the wrong approach can have serious consequences downstream, resulting in inadequate privacy protection or excessive data distortion.
Data consumers could be unaware of the effect randomization had on the data and might end up with false conclusions. On the bright side, randomization techniques are relatively straightforward to implement, making them accessible to a wide range of organizations and data professionals.
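Adding random noise is indeed easy to implement, as this sketch shows. The noise distribution (Gaussian), the scale, and the sample values are hypothetical choices; picking them well for a given data set is the hard part discussed above.

```python
import random

def add_noise(values, scale, seed=0):
    """Add zero-mean Gaussian noise with standard deviation scale to every value."""
    rng = random.Random(seed)  # a fixed seed makes the perturbation reproducible
    return [v + rng.gauss(0.0, scale) for v in values]

incomes = [31_000.0, 58_000.0, 47_000.0, 72_000.0]  # hypothetical values

noisy_a = add_noise(incomes, scale=2_000.0, seed=42)
noisy_b = add_noise(incomes, scale=2_000.0, seed=42)
untouched = add_noise(incomes, scale=0.0)
print(noisy_a == noisy_b, untouched == incomes)
```

A larger `scale` distorts the data more and protects individual values better. With a scale of zero, the data is returned unchanged and nothing is protected.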
Data redaction is similar to data masking, but in the case of this data anonymization approach, entire data values or sections are removed or obscured. Deleting PII is easy to do. However, it’s a sure-fire way to encounter a privacy disaster down the line. It’s also devastating for data utility since critical elements or crucial contextual information could be removed from the data.
Redacted data may introduce inconsistencies or gaps in the dataset, potentially affecting data integrity. Redacting sensitive information can result in a smaller dataset. This could impact statistical analyses and models that rely on a certain volume of data for accuracy.
The next-generation data anonymization tools, or so-called privacy-enhancing technologies, take an entirely different, more use-case-centered approach to data anonymization and privacy protection.
The first group of modern data anonymization tools, homomorphic encryption, works by encrypting data in a way that allows for computational operations on encrypted data. The downside of this approach is that the data, well, stays encrypted, which makes it very hard to work with if it was previously unknown to the user. You can't, for example, perform exploratory analyses on encrypted data. In addition, it is computationally very intensive and, as such, not widely available and cumbersome to use. As the price of computing power decreases and capacity increases, this technology is set to become more popular and easier to access.
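To give a feel for what "computing on encrypted data" means, here is a toy version of the Paillier cryptosystem, a well-known additively homomorphic scheme. It is chosen purely for illustration: the post does not name a specific scheme, the tiny primes below offer no security whatsoever, and production systems rely on vetted libraries and much larger keys.

```python
import math
import random

def keygen(p=293, q=433):
    """Toy key generation from two small primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)                               # modular inverse of lam mod n
    return n, (lam, mu)

def encrypt(n, m, rng):
    """Encrypt the integer m (0 <= m < n) under the public key n, using generator n + 1."""
    n2 = n * n
    r = rng.randrange(1, n)
    while math.gcd(r, n) != 1:  # the random blinding factor must be invertible mod n
        r = rng.randrange(1, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, private_key, c):
    """Recover the plaintext from ciphertext c with the private key (lam, mu)."""
    lam, mu = private_key
    return ((pow(c, lam, n * n) - 1) // n * mu) % n

rng = random.Random(0)
n, private_key = keygen()
c1 = encrypt(n, 17, rng)
c2 = encrypt(n, 25, rng)

# Multiplying ciphertexts adds the underlying plaintexts, without ever decrypting them.
c_sum = (c1 * c2) % (n * n)
print(decrypt(n, private_key, c_sum))
```

A party holding only `c1` and `c2` can compute the encrypted sum, but it cannot look at the values themselves. That is also why exploratory analysis on such data is not possible.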
Federated learning is a fairly complicated approach, enabling machine learning models to be trained on distributed datasets. Federated learning is commonly used in applications that involve mobile devices, such as smartphones and IoT devices.
For example, predictive text suggestions on smartphones can be improved without sending individual typing data to a central server. In the energy sector, federated learning helps optimize energy consumption and distribution without revealing specific consumption patterns of individual users or entities. However, these federated systems require the participation of all players, which is near-impossible to achieve if the different parts of the system belong to different operators. Simply put, Google can pull it off, while your average corporation would find it difficult.
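The core aggregation step of federated learning can be sketched as a weighted average of model parameters. This follows the widely used federated averaging idea and is an assumption on our part, since no particular algorithm is named here; the client updates and sample counts are hypothetical.

```python
def federated_average(client_updates):
    """Average model weights across clients, weighted by each client's number of samples.

    client_updates is a list of (weights, n_samples) pairs; only these pairs,
    never the raw training data, are sent to the coordinating server.
    """
    total = sum(n for _, n in client_updates)
    dim = len(client_updates[0][0])
    return [sum(w[i] * n for w, n in client_updates) / total for i in range(dim)]

# Hypothetical locally trained weights from two devices.
updates = [
    ([1.0, 2.0], 100),  # device A trained on 100 local samples
    ([3.0, 4.0], 300),  # device B trained on 300 local samples
]
global_weights = federated_average(updates)
print(global_weights)
```

The server needs an update from each participating device, which illustrates why coordination across independent operators is the practical bottleneck.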
A more readily available approach is an AI-powered data anonymization tool: synthetic data generation. Synthetic data generation extracts the distributions, statistical properties, and correlations of datasets and generates entirely new, synthetic versions of said datasets, where all individual data points are synthetic. The synthetic data points look realistic and, on a group level, behave like the original. As a data anonymization tool, reliable synthetic data generators produce synthetic data that is representative, scalable, and suitable for advanced use cases, such as AI and machine learning development, analytics, and research collaborations.
Secure Multiparty Computation (SMPC), in simple terms, is a cryptographic technique that allows multiple parties to jointly compute a function over their private inputs while keeping those inputs confidential. It enables these parties to collaborate and obtain results without revealing sensitive information to each other.
While it's a powerful tool for privacy-preserving computations, it comes with its set of implementation challenges, particularly in terms of complexity, efficiency, and security considerations. It requires expertise and careful planning to ensure that it is applied effectively and securely in practical applications.
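A classic building block that illustrates the idea is additive secret sharing, sketched below for a joint sum. The modulus, the three private inputs, and the helper name are hypothetical, and the sketch ignores everything that makes real protocols hard, such as secure channels and dishonest participants.

```python
import secrets

MODULUS = 2**61 - 1  # all arithmetic is done modulo this arbitrarily chosen number

def share(secret, n_parties):
    """Split secret into n_parties random-looking shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares

# Hypothetical private inputs of three parties.
inputs = [52_000, 61_500, 48_250]

# Each party splits its input and sends one share to every party (including itself).
distributed = [share(x, 3) for x in inputs]

# Party j only ever sees column j and publishes the sum of the shares it received.
partial_sums = [sum(row[j] for row in distributed) % MODULUS for j in range(3)]

total = sum(partial_sums) % MODULUS
print(total)
```

Each individual share is a uniformly random number that reveals nothing on its own. Only the combination of all published partial sums yields the joint result.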
Data anonymization encompasses a diverse set of approaches, each with its own strengths and limitations. In this comprehensive guide, we explore ten key data anonymization strategies, ranging from legacy methods like data masking and pseudonymization to cutting-edge approaches such as federated learning and synthetic data generation. Whether you're a data scientist or privacy officer, you will find this bullshit-free table listing their advantages, disadvantages, and common use cases very helpful.
# | Data Anonymization Approach | Description | Advantages | Disadvantages | Use Cases |
---|---|---|---|---|---|
1 | Data Masking | Masks or disguises sensitive data by replacing characters with symbols or placeholders. | - Simplicity of implementation. - Preservation of data structure. | - Limited protection against inference attacks. - Potential negative impact on data analysis. | - Anonymizing email addresses in communication logs. - Concealing rare names in datasets. - Masking sensitive words in text documents. |
2 | Pseudonymization | Replaces sensitive data with pseudonyms or aliases or removes it altogether. | - Preservation of data structure. - Data utility is generally preserved. - Fine-grained control over pseudonymization rules. | - Pseudonymized data is not anonymous data. - Risk of re-identification is very high. - Requires secure management of pseudonym mappings. | - Protecting patient identities in medical research. - Securing employee IDs in HR records. |
3 | Generalization/Aggregation | Aggregates or generalizes data to reduce granularity. | - Simple implementation. | - Loss of fine-grained detail in the data. - Risk of data distortion that affects analysis outcomes. - Challenging to determine appropriate levels of generalization. | - Anonymizing age groups in demographic data. - Concealing income brackets in economic research. |
4 | Data Swapping/Perturbation | Swaps or perturbs data values between records to break the link between individuals and their data. | - Flexibility in choosing perturbation methods. - Potential for fine-grained control. | - Privacy-utility trade-off is challenging to balance. - Risk of introducing bias in analyses. - Selection of appropriate perturbation methods is crucial. | - E-commerce. - Online user behavior analysis. |
5 | Randomization | Introduces randomness (noise) into the data to protect data subjects. | - Flexibility in applying to various data types. - Reproducibility of results when using defined algorithms and seeds. | - Privacy-utility trade-off is challenging to balance. - Risk of introducing bias in analyses. - Selection of appropriate randomization methods is hard. | - Anonymizing survey responses in social science research. - Online user behavior analysis. |
6 | Data Redaction | Removes or obscures specific parts of the dataset containing sensitive information. | - Simplicity of implementation. | - Loss of data utility, potentially significant. - Risk of removing contextual information. - Data integrity challenges. | - Concealing personal information in legal documents. - Removing private data in text documents. |
7 | Homomorphic Encryption | Encrypts data in such a way that computations can be performed on the encrypted data without decrypting it, preserving privacy. | - Strong privacy protection for computations on encrypted data. - Supports secure data processing in untrusted environments. - Cryptographically provable privacy guarantees. | - Encrypted data cannot be easily worked with if previously unknown to the user. - Complexity of encryption and decryption operations. - Performance overhead for cryptographic operations. - May require specialized libraries and expertise. | - Basic data analytics in cloud computing environments. - Privacy-preserving machine learning on sensitive data. |
8 | Federated Learning | Trains machine learning models across decentralized edge devices or servers holding local data samples, avoiding centralized data sharing. | - Preserves data locality and privacy, reducing data transfer. - Supports collaborative model training on distributed data. - Suitable for privacy-sensitive applications. | - Complexity of coordination among edge devices or servers. - Potential communication overhead. - Ensuring model convergence can be challenging. - Shared models can still leak privacy. | - Healthcare institutions collaboratively training disease prediction models. - Federated learning for mobile applications preserving user data privacy. - Privacy-preserving AI in smart cities. |
9 | Synthetic Data Generation | Creates artificial data that mimics the statistical properties of the original data while protecting privacy. | - Strong privacy protection with high data utility. - Preserves data structure and relationships. - Scalable for generating large datasets. | - Accuracy and representativeness of synthetic data may vary depending on the generator. - May require specialized algorithms and expertise. | - Sharing synthetic healthcare data for research purposes. - Synthetic data for machine learning model training. - Privacy-preserving data sharing in financial analysis. |
10 | Secure Multiparty Computation (SMPC) | Enables multiple parties to jointly compute functions on their private inputs without revealing those inputs to each other, preserving privacy. | - Strong privacy protection for collaborative computations. - Suitable for multi-party data analysis while maintaining privacy. - Offers security against collusion. | - Complexity of protocol design and setup. - Performance overhead, especially for large-scale computations. - Requires trust in the security of the computation protocol. | - Privacy-preserving data aggregation across organizations. - Collaborative analytics involving sensitive data from multiple sources. - Secure voting systems. |
When it comes to choosing the right data anonymization approach, we are faced with a complex problem requiring a nuanced view and careful consideration. When we put all the banter aside, choosing the right data anonymization strategy comes down to balancing the so-called privacy-utility trade-off.
The privacy-utility trade-off refers to the balancing act between data anonymization’s two key objectives: providing privacy to data subjects and utility to data consumers. Depending on the specific use case, the quality of implementation, and the level of privacy required, different data anonymization approaches are more or less suitable to achieve the ideal balance of privacy and utility. However, some data anonymization approaches are inherently better than others when it comes to the privacy-utility trade-off. High utility with robust, unbreakable privacy is the unicorn all privacy officers are hunting for, and since the field is constantly evolving with new types of privacy attacks, data anonymization must evolve too.
As it stands today, the best data anonymization approaches for preserving a high level of utility while effectively protecting privacy are the following:
Synthetic data generation techniques create artificial datasets that mimic the statistical properties of the original data. These datasets can be shared without privacy concerns. When properly designed, synthetic data can preserve data utility for a wide range of statistical analyses while providing strong privacy protection. It is particularly useful for sharing data for research and analysis without exposing sensitive information.
Privacy: high
Utility: high for analytical, data sharing, and ML/AI training use cases
Homomorphic encryption allows computations to be performed on encrypted data without the need to decrypt it. This technology is valuable for secure data processing in untrusted environments, such as cloud computing. While it can be computationally intensive, it offers a high level of privacy and maintains data utility for specific tasks, particularly when privacy-preserving machine learning or data analytics is involved. Depending on the specific encryption scheme and parameters chosen, there may be a trade-off between the level of security and the efficiency of computations. Also, increasing security often leads to slower performance.
Privacy: high
Utility: can be high, depending on the use case
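To make the idea of computing on encrypted data more tangible, here is a minimal toy sketch. It uses textbook RSA with tiny primes, which is not secure and is not one of the schemes used in practice; it only illustrates the principle that an operation on ciphertexts (here: multiplication) can carry over to the underlying plaintexts. The primes, exponent, and function names are assumptions chosen for illustration.

```python
# Toy illustration only: textbook RSA with tiny primes is NOT secure.
# It shows the homomorphic idea: multiply two ciphertexts, decrypt once,
# and obtain the product of the two plaintexts.

p, q = 61, 53                 # small demo primes (illustrative assumption)
n = p * q                     # public modulus
phi = (p - 1) * (q - 1)
e = 17                        # public exponent, coprime with phi
d = pow(e, -1, phi)           # private exponent (modular inverse, Python 3.8+)

def encrypt(m):
    return pow(m, e, n)

def decrypt(c):
    return pow(c, d, n)

a, b = 7, 12

# The "untrusted" party only ever sees and multiplies ciphertexts
c_product = (encrypt(a) * encrypt(b)) % n

# The data owner decrypts the result
result = decrypt(c_product)
print(result)  # 84, i.e. a * b
```

Note that the result is only correct as long as the product stays below the modulus, which hints at the kind of parameter choices and performance trade-offs mentioned above.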
SMPC allows multiple parties to jointly compute a function over their private inputs without revealing those inputs to each other. It offers strong privacy guarantees and can be used for various collaborative data analysis tasks while preserving data utility. SMPC has applications in areas like secure data aggregation and privacy-preserving collaborative analytics.
Privacy: high
Utility: can be high, depending on the use case
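The core idea behind SMPC can be sketched with additive secret sharing, one common building block. In the toy example below, three hypothetical organizations compute the sum of their private counts without any of them revealing its own input. The modulus, the input values, and the function name are assumptions made for illustration; a real protocol additionally needs secure channels and protection against misbehaving parties.

```python
import secrets

MODULUS = 2**61 - 1   # arbitrary large modulus chosen for this sketch

def share(value, n_parties):
    """Split `value` into n random-looking shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]

# Three hypothetical organizations, each holding a private count
private_inputs = [120, 45, 310]
n = len(private_inputs)

# Each party splits its input and hands one share to every party
all_shares = [share(v, n) for v in private_inputs]

# Party j only sees the j-th share of each input and publishes their sum
partial_sums = [sum(all_shares[i][j] for i in range(n)) % MODULUS for j in range(n)]

# Combining the published partial sums reveals only the total
total = sum(partial_sums) % MODULUS
print(total)  # 475
```

Each individual share is a uniformly random number, so a single party learns nothing about another party's input from the shares it receives.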
In the ever-evolving landscape of data anonymization strategies, the journey to strike a balance between preserving privacy and maintaining data utility is an ongoing challenge. As data grows more extensive and complex and adversaries devise new tactics, the stakes of protecting sensitive information have never been higher.
Legacy data anonymization approaches have their limitations and are increasingly likely to fail in protecting privacy. While they may offer simplicity in implementation, they often fall short in preserving the intricate relationships and structures within data.
Modern data anonymization tools, however, present a promising shift towards more robust privacy protection. Privacy-enhancing technologies have emerged as powerful solutions. These tools harness encryption, machine learning, and advanced statistical techniques to safeguard data while enabling meaningful analysis.
Furthermore, the rise of synthetic data generation signifies a transformative approach to data anonymization. By creating artificial data that mirrors the statistical properties of the original while safeguarding privacy, synthetic data generation provides an innovative solution for diverse use cases, from healthcare research to machine learning model training.
As the data privacy landscape continues to evolve, organizations must stay ahead of the curve. What is clear is that the pursuit of privacy-preserving data practices is not only a necessity but also a vital component of responsible data management in our increasingly vulnerable world.
In this tutorial, you will learn how to use synthetic rebalancing to improve the performance of machine-learning (ML) models on imbalanced classification problems. Rebalancing can be useful when you want to learn more about an otherwise small or underrepresented population segment by generating more examples of it. Specifically, we will look at classification ML applications in which the minority class accounts for as little as 0.1% of the data.
We will start with a heavily imbalanced dataset. We will use synthetic rebalancing to create more high-quality, statistically representative instances of the minority class. We will compare this method against 2 other types of rebalancing to explore their advantages and pitfalls. We will then train a downstream machine learning model on each of the rebalanced datasets and evaluate their relative predictive performance. The Python code for this tutorial is publicly available and runnable in this Google Colab notebook.
Fig 1 - Synthetic rebalancing creates more statistically representative instances of the minority class
In heavily imbalanced classification projects, a machine learning model has very little data to effectively learn patterns about the minority class. This will affect its ability to correctly classify instances of this minority class in the real (non-training) data when the model is put into production. A common real-world example is credit card fraud detection: the overwhelming majority of credit card transactions are perfectly legitimate, but it is precisely the rare occurrences of illegitimate use that we would be interested in capturing.
Let’s say we have a training dataset with 100,000 credit card transactions, which contains 99,900 legitimate transactions and 100 fraudulent ones. A machine-learning model trained on this dataset would have ample opportunity to learn about all the different kinds of legitimate transactions, but only a small sample of 100 records from which to learn everything it can about fraudulent behavior. Once this model is put into production, the probability is high that fraudulent transactions will occur that do not follow any of the patterns seen in the small training sample of 100 fraudulent records. The machine learning model is unlikely to classify these fraudulent transactions correctly.
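A quick back-of-the-envelope calculation shows why such an imbalance is treacherous. The sketch below assumes a trivial "model" that labels every one of the 100,000 transactions as legitimate; the variable names are illustrative.

```python
n_total = 100_000
n_fraud = 100
n_legit = n_total - n_fraud

# A trivial "model" that labels every transaction as legitimate
true_negatives = n_legit    # all legitimate transactions are labeled correctly
detected_fraud = 0          # no fraudulent transaction is ever flagged

minority_share = n_fraud / n_total
accuracy = true_negatives / n_total
fraud_recall = detected_fraud / n_fraud

print(f"minority share: {minority_share:.2%}")   # 0.10%
print(f"accuracy:       {accuracy:.2%}")         # 99.90%
print(f"fraud recall:   {fraud_recall:.0%}")     # 0%
```

An accuracy of 99.9% looks excellent, yet the model catches none of the cases we actually care about. This is why we rely on metrics such as AUC and F1 score later in this tutorial.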
So how can we address this problem? We need to give our machine learning model more examples of fraudulent transactions in order to ensure optimal predictive performance in production. This can be achieved through rebalancing.
We will explore three types of rebalancing: random (naive) oversampling, SMOTE upsampling, and synthetic rebalancing.
The tutorial will give you hands-on experience with each type of rebalancing and provide you with an in-depth understanding of the differences between them so you can choose the right method for your use case. We’ll start by generating an imbalanced dataset and showing you how to perform synthetic rebalancing using MOSTLY AI's synthetic data generator. We will then compare performance metrics of each rebalancing method on a downstream ML task.
But first things first: we need some data.
For this tutorial, we will be using the UCI Adult Income dataset with the same training and validation split that was used in the Train-Synthetic-Test-Real tutorial. However, we will work with an artificially imbalanced version of the dataset, in which the minority class has been downsampled so that high-income (>50K) records make up only 0.1% of the training data. The downsampling has already been done for you, but if you want to reproduce it yourself you can use the code block below:
import pandas as pd

def create_imbalance(df, target, ratio):
    # identify the minority and majority class values by frequency
    val_min, val_maj = df[target].value_counts().sort_values().index
    df_maj = df.loc[df[target]==val_maj]
    # number of minority records needed to make up `ratio` of the result
    n_min = int(df_maj.shape[0]/(1-ratio)*ratio)
    df_min = df.loc[df[target]==val_min].sample(n=n_min, random_state=1)
    # combine both classes and shuffle the rows
    df_imb = pd.concat([df_min, df_maj]).sample(frac=1, random_state=1)
    return df_imb

df_trn = pd.read_csv(f'{repo}/census-training.csv')
df_trn_imb = create_imbalance(df_trn, 'income', 1/1000)
df_trn_imb.to_csv('census-training-imbalanced.csv', index=False)
Let’s take a quick look at this imbalanced dataset by randomly sampling 10 rows. For legibility let’s select only a few columns, including the income column as our imbalanced feature of interest:
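trn = pd.read_csv(f'{repo}/census-training-imbalanced.csv')
# select a few columns for legibility and sample 10 random records
trn_sub = trn[['age','education','marital_status','sex','income']]
trn_sub.sample(n=10)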
You can try executing the line above multiple times to see different samples. Still, due to the strong class imbalance, the chance of finding a record with high income in a random sample of 10 is minimal. This would be problematic if you were interested in creating a machine learning model that could accurately classify high-income records (which is precisely what we’ll be doing in just a few minutes).
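We can put a rough number on "minimal". The sketch below assumes a minority share of 0.1% and treats the 10 draws as independent, which is a close approximation for a dataset of this size.

```python
minority_share = 0.001   # 0.1% high-income records in the imbalanced training data
sample_size = 10

# Probability that none of the sampled rows belongs to the minority class
# (approximation that treats each draw as independent)
p_none = (1 - minority_share) ** sample_size
p_at_least_one = 1 - p_none

print(f"P(no high-income record in the sample): {p_none:.2%}")
print(f"P(at least one high-income record):     {p_at_least_one:.2%}")
```

In other words, only about one in a hundred random samples of 10 rows is expected to contain a high-income record at all.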
The problem becomes even more clear when we try to sample a specific sub-group in the population. Let’s sample all the female doctorates with a high income in the dataset. Remember, the dataset contains almost 30 thousand records.
trn[
    (trn['income']=='>50K')
    & (trn.sex=='Female')
    & (trn.education=='Doctorate')
]
It turns out there are actually no records of this type in the training data. Of course, we know that these kinds of individuals exist in the real world and so our machine learning model is likely to encounter them when put in production. But having had no instances of this record type in the training data, it is likely that the ML model will fail to classify this kind of record correctly. We need to provide the ML model with a higher quantity and more varied range of training samples of the minority class to remedy this problem.
MOSTLY AI offers a synthetic rebalancing feature that can be used with any categorical column. Let’s walk through how this works:
Upload census-training-imbalanced.csv to MOSTLY AI and click “Proceed”.
Fig 2 - Upload the original dataset to MOSTLY AI’s synthetic data generator.
Fig 3 - Navigate to the Data Settings of the Income column.
Fig 4 - Set the relevant settings to rebalance the income column.
Fig 5 - Launch the synthetic data generation
Once the synthesization is complete, you can download the synthetic dataset to disk. Then return to wherever you are running your code and use the following code block to create a DataFrame containing the synthetic data.
# upload synthetic dataset
import pandas as pd

try:
    # check whether we are in Google colab
    from google.colab import files
    print("running in COLAB mode")
    repo = 'https://github.com/mostly-ai/mostly-tutorials/raw/dev/rebalancing'
    import io
    uploaded = files.upload()
    syn = pd.read_csv(io.BytesIO(list(uploaded.values())[0]))
    print(f"uploaded synthetic data with {syn.shape[0]:,} records and {syn.shape[1]:,} attributes")
except ImportError:
    # google.colab is not installed, so we are running locally
    print("running in LOCAL mode")
    repo = '.'
    print("adapt `syn_file_path` to point to your generated synthetic data file")
    syn_file_path = './census-synthetic-balanced.csv'
    syn = pd.read_csv(syn_file_path)
    print(f"read synthetic data with {syn.shape[0]:,} records and {syn.shape[1]:,} attributes")
Let's now repeat the data exploration steps we performed above with the original, imbalanced dataset. First, let’s display 10 randomly sampled synthetic records. We'll subset again for legibility. You can run this line multiple times to get different samples.
# sample 10 random records
syn_sub = syn[['age','education','marital_status','sex','income']]
syn_sub.sample(n=10)
This time, you should see that the records are evenly distributed across the two income classes.
Let's now investigate all female doctorates with a high income in the synthetic, rebalanced dataset:
syn_sub[
    (syn_sub['income']=='>50K')
    & (syn_sub.sex=='Female')
    & (syn_sub.education=='Doctorate')
].sample(n=10)
The synthetic data contains a list of realistic, statistically sound female doctorates with a high income. This is great news for our machine learning use case because it means that our ML model will have plenty of data to learn about this particular important subsegment.
Let’s now compare the quality of different rebalancing methods by training a machine learning model on the rebalanced data and evaluating the predictive performance of the resulting models.
We will investigate and compare 3 types of rebalancing: random (naive) oversampling, SMOTE upsampling, and synthetic rebalancing.
The code block below defines the functions that will preprocess your data, train a LightGBM model and evaluate its performance using a holdout dataset. For more detailed descriptions of this code, take a look at the Train-Synthetic-Test-Real tutorial.
# import necessary libraries
import lightgbm as lgb
from lightgbm import early_stopping
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score
import seaborn as sns
import matplotlib.pyplot as plt

# define target column and value
target_col = 'income'
target_val = '>50K'

# define preprocessing function
def prepare_xy(df: pd.DataFrame):
    # work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    y = (df[target_col]==target_val).astype(int)
    str_cols = [
        col for col in df.select_dtypes(['object', 'string']).columns if col != target_col
    ]
    for col in str_cols:
        df[col] = pd.Categorical(df[col])
    cat_cols = [
        col for col in df.select_dtypes('category').columns if col != target_col
    ]
    num_cols = [
        col for col in df.select_dtypes('number').columns if col != target_col
    ]
    for col in num_cols:
        df[col] = df[col].astype('float')
    X = df[cat_cols + num_cols]
    return X, y

# define training function
def train_model(X, y):
    cat_cols = list(X.select_dtypes('category').columns)
    X_trn, X_val, y_trn, y_val = train_test_split(X, y, test_size=0.2, random_state=1)
    ds_trn = lgb.Dataset(
        X_trn,
        label=y_trn,
        categorical_feature=cat_cols,
        free_raw_data=False
    )
    ds_val = lgb.Dataset(
        X_val,
        label=y_val,
        categorical_feature=cat_cols,
        free_raw_data=False
    )
    model = lgb.train(
        params={
            'verbose': -1,
            'metric': 'auc',
            'objective': 'binary'
        },
        train_set=ds_trn,
        valid_sets=[ds_val],
        callbacks=[early_stopping(5)],
    )
    return model

# define evaluation function
def evaluate_model(model, hol):
    X_hol, y_hol = prepare_xy(hol)
    probs = model.predict(X_hol)
    # turn predicted probabilities into class labels at a 0.5 threshold
    preds = (probs >= 0.5).astype(int)
    auc = roc_auc_score(y_hol, probs)
    f1 = f1_score(y_hol, preds, average='macro')
    probs_df = pd.concat([
        pd.Series(probs, name='probability').reset_index(drop=True),
        pd.Series(y_hol, name=target_col).reset_index(drop=True)
    ], axis=1)
    sns.displot(
        data=probs_df,
        x='probability',
        hue=target_col,
        bins=20,
        multiple="stack"
    )
    plt.title(f"AUC: {auc:.1%}, F1 Score: {f1:.2f}", fontsize = 20)
    plt.show()
    return auc

# create holdout dataset
df_hol = pd.read_csv(f'{repo}/census-holdout.csv')
df_hol_min = df_hol.loc[df_hol['income']=='>50K']
print(f"Holdout data consists of {df_hol.shape[0]:,} records",
      f"with {df_hol_min.shape[0]:,} samples from the minority class")
Let’s now train a LightGBM model on the original, heavily imbalanced dataset and evaluate its predictive performance. This will give us a baseline against which we can compare the performance of the different rebalanced datasets.
X_trn, y_trn = prepare_xy(trn)
model_trn = train_model(X_trn, y_trn)
auc_trn = evaluate_model(model_trn, df_hol)
With an AUC of about 50%, the model trained on the imbalanced dataset is just as good as a flip of a coin, or, in other words, not worth very much at all. The downstream LightGBM model is not able to learn any signal due to the low number of minority-class samples.
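To see why an AUC of 50% corresponds to a coin flip, it helps to recall that the AUC can be read as the probability that a randomly chosen positive record receives a higher score than a randomly chosen negative one, with ties counting half. The following sketch computes this directly on toy data; the helper name and the scores are made up for illustration.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [0, 0, 0, 0, 0, 0, 1, 1]

# A model that learned nothing assigns the same score to every record
constant_scores = [0.001] * len(labels)
# A model that ranks every positive above every negative
perfect_scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.8, 0.9]

auc_constant = pairwise_auc(labels, constant_scores)
auc_perfect = pairwise_auc(labels, perfect_scores)
print(auc_constant, auc_perfect)  # 0.5 1.0
```

A model that cannot tell the classes apart ends up at 0.5, regardless of how imbalanced the data is.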
Let’s see if we can improve this using rebalancing.
First, let’s rebalance the dataset using the random oversampling method, also known as “naive rebalancing”. This method simply takes the minority class records and copies them to increase their quantity. This increases the number of records of the minority class but does not increase the statistical diversity. We will use the imblearn library to perform this step. Feel free to check out its documentation for more context.
The code block below performs the naive rebalancing, trains a LightGBM model using the rebalanced dataset and evaluates its predictive performance:
from imblearn.over_sampling import RandomOverSampler
X_trn, y_trn = prepare_xy(trn)
sm = RandomOverSampler(random_state=1)
X_trn_up, y_trn_up = sm.fit_resample(X_trn, y_trn)
model_trn_up = train_model(X_trn_up, y_trn_up)
auc_trn_up = evaluate_model(model_trn_up, df_hol)
We see a clear improvement in predictive performance, with an AUC score of around 70%. This is better than the baseline model trained on the imbalanced dataset, but still not great. We see that a significant portion of the “0” class (low-income) is being incorrectly classified as “1” (high-income).
This is not surprising because, as stated above, this rebalancing method just copies the existing minority class records. This increases their quantity but does not add any new statistical information into the model and therefore does not offer the model much data that it can use to learn about minority-class instances that are not present in the training data.
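The effect is easy to reproduce on a toy example. The sketch below mimics random oversampling with Python's standard library (it is not the imblearn implementation) on three made-up minority records.

```python
import random

random.seed(1)

# Toy minority class with three distinct records (age, education)
minority = [(52, "Doctorate"), (41, "Masters"), (38, "Bachelors")]
target_size = 1000

# Random oversampling: draw existing records with replacement
oversampled = random.choices(minority, k=target_size)

n_records = len(oversampled)
n_distinct = len(set(oversampled))
print(n_records, n_distinct)  # 1000 records, but never more than 3 distinct ones
```

No matter how many records we draw, the set of distinct minority records never grows beyond what was in the original data.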
Let’s see if we can improve on this using another rebalancing method.
SMOTE upsampling is a well-known upsampling method which, unlike the random oversampling seen above, does create novel, statistically representative samples. It does so by interpolating between neighboring samples. Because our dataset mixes numerical and categorical columns, we use the SMOTE-NC variant, which supports both. It’s important to note, however, that SMOTE upsampling is non-privacy-preserving.
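The interpolation idea can be sketched in a few lines of NumPy. This is a simplified illustration for numerical features only, not the imblearn implementation; the toy data, the function name, and the choice of k are assumptions. Categorical features need separate handling, which is what the SMOTE-NC variant provides.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy minority class with two numeric features (for example age and hours per week)
minority = np.array([
    [38.0, 40.0],
    [41.0, 45.0],
    [52.0, 60.0],
    [47.0, 50.0],
])

def interpolate_sample(X, k=2):
    """Create one synthetic point between a random sample and one of its k nearest neighbors."""
    i = rng.integers(len(X))
    distances = np.linalg.norm(X - X[i], axis=1)
    neighbors = np.argsort(distances)[1:k + 1]   # position 0 is the point itself
    j = rng.choice(neighbors)
    lam = rng.random()                            # interpolation weight in [0, 1)
    return X[i] + lam * (X[j] - X[i]), i, j

new_point, i, j = interpolate_sample(minority)
print(f"new point {new_point} lies between {minority[i]} and {minority[j]}")
```

Every new point lies on the line segment between two existing minority records, which is also why the method reveals information about the original records.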
The following code block performs the rebalancing using SMOTE upsampling, trains a LightGBM model on the rebalanced dataset, and evaluates its performance:
from imblearn.over_sampling import SMOTENC
X_trn, y_trn = prepare_xy(trn)
sm = SMOTENC(
    categorical_features=X_trn.dtypes=='category',
    random_state=1
)
X_trn_smote, y_trn_smote = sm.fit_resample(X_trn, y_trn)
model_trn_smote = train_model(X_trn_smote, y_trn_smote)
auc_trn_smote = evaluate_model(model_trn_smote, df_hol)
We see another clear jump in performance: the SMOTE upsampling boosts the AUC of the downstream model to close to 80%. This is clearly an improvement over the random oversampling we saw above, and for this reason, SMOTE is quite commonly used.
Let’s see if we can do even better.
In this final step, let’s take the synthetically rebalanced dataset that we generated earlier using MOSTLY AI to train a LightGBM model. We’ll then evaluate the performance of this downstream ML model and compare it against those we saw above.
The code block below prepares the synthetically rebalanced data, trains the LightGBM model, and evaluates it:
X_syn, y_syn = prepare_xy(syn)
model_syn = train_model(X_syn, y_syn)
auc_syn = evaluate_model(model_syn, df_hol)
Both performance measures, the AUC as well as the macro-averaged F1 score, are significantly better for the model trained on the synthetically rebalanced data than for the models trained on data from any of the other methods. We can also see that the portion of “0”s incorrectly classified as “1”s has dropped significantly.
The synthetically rebalanced dataset has enabled the model to make fine-grained distinctions between the high-income and low-income records. This is strong evidence of the value of synthetic rebalancing for learning more about a small sub-group within the population.
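For readers less familiar with the macro-averaged F1 score used in the plots above, here is a small sketch of how it is computed: the F1 score is calculated per class and then averaged with equal weight, so the minority class counts as much as the majority class. The helper functions and the toy labels are illustrative and not part of the tutorial code.

```python
def f1_for_class(y_true, y_pred, cls):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0   # no correct prediction for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Toy holdout: 8 low-income (0) and 2 high-income (1) records
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_majority = [0] * 10
better_model = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

f1_majority = macro_f1(y_true, always_majority)
f1_better = macro_f1(y_true, better_model)
print(round(f1_majority, 3), round(f1_better, 3))  # 0.444 0.867
```

A model that ignores the minority class is penalized heavily by this metric, even though its plain accuracy would be 80% on this toy data.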
In this tutorial, you have seen firsthand the value of synthetic rebalancing for downstream ML classification problems. You have gained an understanding of the necessity of rebalancing when working with imbalanced datasets in order to provide the machine learning model with more samples of the minority class. You have learned how to perform synthetic rebalancing with MOSTLY AI and observed the superior performance of this rebalancing method when compared against other methods on the same dataset. Of course, the actual lift in performance may vary depending on the dataset, the predictive task, and the chosen ML model.
In addition to walking through the above instructions, we suggest experimenting with the following in order to get an even better grasp of synthetic rebalancing:
This article explains what data drift is, how it affects machine learning models in production, what the difference between data drift and concept drift is, and what you can do to tackle data drift using synthetic data.
“Data drift” is a term in machine learning that refers to the phenomenon in which the data a model sees in production gradually diverges from the data it was trained on, which typically causes the model’s performance to decrease over time. This happens because machine learning models are trained on historical data (i.e. “the past”) but are applied to current data (i.e. “the present”) when they are used in production. In reality, the historical data and the current data may have different statistical characteristics and this is what we call “data drift”: the data used for predictions starts to drift from the data used for training. This means the machine learning model is no longer fully optimized for the data it is seeing.
Drift can be a big problem when using machine learning models in the real world, causing a decrease in predictive power. For example, let’s say we have trained a machine learning model to accurately predict the quarterly sales of a particular fashion brand. We then put this model into production.
At first it operates well: the actual data it is receiving (from the present) resembles the data that was used to train the model (from the past). But then something unexpected happens. A popular influencer spontaneously posts about the fashion brand and the post goes viral. Sales sky-rocket in a way that the machine learning model could never have foreseen because nothing like the unexpected viral post event was present in the training data.
This causes a significant change in the statistical distribution of the input data (i.e. “data drift”) and the machine learning model no longer performs optimally. The model loses accuracy and may even produce unreliable predictions if the data distributions vary significantly.
There are different kinds of drift that can be observed in machine learning projects. Data drift refers specifically to the phenomenon in which the distribution of the real-world data used when the model is in production drifts from the data that was used for training.
Concept drift refers to the situation in which the relationship between the input features and the target the model is trying to predict changes over time. In this case, the pattern (or “concept”) that the machine learning model is trying to learn is evolving. In short, data drift deals with changes in the data that the model uses to make predictions, whereas concept drift refers to changes in the relationship between that data and the outcome being predicted.
Data drift is a complex phenomenon that generally requires a multidimensional approach to solve. Some of the most effective things you can do to deal with data drift include:
In practice, retraining a machine learning model with fresh data is one of the most common methods used to deal with data drift. However, this approach comes with some drawbacks. Acquiring new data that is ready for training a machine learning model is often:
Synthetic data generation can help you tackle data drift by providing a high-quality, low-friction source of data on which you can retrain your machine learning models. Synthetic data generators enable you to produce virtually limitless data and often give you fine-grained control over the distributions of this new data. By accurately modeling new synthetic datasets, you can then update your machine learning model to incorporate the drifted data distribution.
We’ve broken it down into 5 steps for clarity:
Detecting data drift should be a fundamental part of any machine learning life cycle. There are many ways to perform data drift detection and many resources to learn about it. This article focuses on solutions that will help you fix data drift once it has been detected.
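As a point of reference, here is one simple way drift in a single numerical feature could be flagged: comparing the empirical distributions of the training data and the production data with the two-sample Kolmogorov-Smirnov statistic. This is a minimal sketch on simulated data, not a full monitoring setup; the sales figures are made up, and libraries such as SciPy offer a ready-made version of this test (`scipy.stats.ks_2samp`).

```python
import numpy as np

def ks_statistic(reference, current):
    """Largest gap between the two empirical cumulative distribution functions."""
    reference = np.sort(np.asarray(reference, dtype=float))
    current = np.sort(np.asarray(current, dtype=float))
    grid = np.concatenate([reference, current])
    cdf_ref = np.searchsorted(reference, grid, side="right") / len(reference)
    cdf_cur = np.searchsorted(current, grid, side="right") / len(current)
    return float(np.max(np.abs(cdf_ref - cdf_cur)))

rng = np.random.default_rng(0)
training_sales = rng.normal(loc=100, scale=10, size=2000)   # simulated historical data
stable_sales = rng.normal(loc=100, scale=10, size=2000)     # production data without drift
drifted_sales = rng.normal(loc=130, scale=15, size=2000)    # production data after a shift

ks_stable = ks_statistic(training_sales, stable_sales)
ks_drifted = ks_statistic(training_sales, drifted_sales)
print(f"no drift: {ks_stable:.3f}, drift: {ks_drifted:.3f}")
```

A value close to 0 means the two distributions look alike, while a value approaching 1 signals a substantial shift. Which threshold should trigger an alert depends on the sample size and the use case.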
Before tackling data drift, it’s important that you have a good understanding of its nature and potential causes. Analyze your model and the incoming data to identify points where the data is drifting and analyze its statistical characteristics. This will help you understand how to incorporate the data drift into your updated model.
For example, in the case of the quarterly fashion sales predictions mentioned above, the fact that we can reliably trace the data drift to the viral influencer post helps us know how to deal with the data drift. It’s reasonable to expect the influencer post to have lasting effects on the fashion brand’s perception and future sales: we should therefore adjust our data projections to include some of the ripple effects of this unexpected sales boost.
By contrast, if we had instead seen a massive but temporary drop in sales due to a failure in the webshop’s main server, we may want to choose not to incorporate this data at all in the projections for next quarter, the assumption here being that the webshop will not experience another failure.
Once you have a good understanding of the statistical nature and potential sources of your data drift, you can then proceed to use synthetic data generation to supplement your training dataset with cases that might occur due to data drift.
We’ll walk through how to generate the right kind of synthetic data to tackle your data drift with MOSTLY AI's synthetic data platform, using a technique called conditional generation.
4. Define the relationship between the two tables using the Data Settings tab and navigating to the settings for the table containing the predictor columns. Click on the gear icon to the right of the ID column and set the following settings:
Generation Method: Foreign Key
Foreign Key Type: Context
Parent Table: <your-table-with-target-column>
Parent Primary column: <id-column-of-target-table>
Save the settings. Under the “Tables” tab you should now see that the predictor table has changed into a Linked Table (lime green color coding).
5. Once the job has been completed, select the “Generate more data” action on the right-hand side of the newly-generated dataset row and select “Generate with seed” to perform conditional generation.
6. Now upload a subject table with a different kind of distribution.
This subject table can be generated manually or programmatically and should contain the drifted distribution. The simulated subject table (containing the drifted target feature distribution) will be used to generate a synthetic dataset (i.e. the predictor columns) that would produce the new, drifted distribution.
In our viral fashion post example, we would create a simulation of the target feature (sales) that follows the “new training distribution” depicted in Figure 4 above and use this to generate a synthetic dataset.
Open-source Python packages like NumPy or SciPy enable you to perform fine-grained data simulation. You can use MOSTLY AI’s rebalancing feature to programmatically simulate drifted target feature distributions for categorical columns.
7. Repeat for all the different scenarios you want to model.
To properly accommodate all of the possible future scenarios, you may want to create multiple simulated datasets, each with a different assumption and associated distribution. In the case of our viral fashion post, we may want to create three simulations: one in which sales continue to skyrocket at the same rate as we saw this quarter, one in which sales just go back to ‘normal’ (i.e. the influencer post has no lasting effect), and a third scenario that takes the average of these two extremes. With these 3 synthetic datasets we can then train different models to predict 3 kinds of possible future scenarios.
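The three scenarios described above could be simulated with NumPy along the following lines. This is a minimal sketch: the sales levels, the spread, and the scenario names are made-up assumptions, and in practice you would derive them from your own drift analysis before using the simulated target column as a seed for conditional generation.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 1_000

# Hypothetical quarterly sales levels (illustrative numbers, not real data)
baseline_mean = 100.0   # sales go back to normal
viral_mean = 180.0      # sales keep growing as seen this quarter

scenario_means = {
    "back_to_normal": baseline_mean,
    "continued_growth": viral_mean,
    "in_between": (baseline_mean + viral_mean) / 2,
}

# One simulated target column per scenario, clipped so that sales are never negative
scenarios = {
    name: rng.normal(loc=mean, scale=15.0, size=n_rows).clip(min=0)
    for name, mean in scenario_means.items()
}

for name, sales in scenarios.items():
    print(f"{name}: mean={sales.mean():.1f}, std={sales.std():.1f}")
```

Each simulated column can then be saved as its own subject table, so that every scenario yields a separate synthetic dataset.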
With your freshly generated synthetic data ready, you can now proceed to re-train your machine learning model. You can use just the synthetic data or a mix of real and synthetic data, depending on the privacy requirements of your model.
Finally, make sure to put precise monitoring tools in place to continue to detect data drift. For example, you could use open-source Python libraries like Evidently or NannyML to keep track of your model performance throughout the machine learning lifecycle. When your model metrics indicate a recurrence of data drift, update your synthetic data to reflect the new distributions and re-train your model.
Synthetic data generation can help you tackle data drift by making it easy to simulate potential future scenarios based on new statistical distributions of the data. By providing a high-quality, low-friction source of data on which you can retrain your machine learning models, synthetic data generators enable you to produce virtually limitless data to model changes in the underlying data. MOSTLY AI gives you fine-grained control over the distributions of this new data so you can accurately model new synthetic datasets that take into consideration the drifted data distributions.
There is something different about Merkur Versicherung AG. It’s the oldest insurance company in Austria, but it doesn’t feel like it.
For starters, there’s the Campus HQ in Graz. An illuminated race track lines the floor of the open plan space. The vibrant lobby is filled with eclectic artwork and unconventional furniture. And there’s a beautiful coffee dock beside the fully functioning gym in the lobby.
Then, there are the people. One team in particular stands out amongst the crowd: the Merkur Innovation Lab. A group of self-professed “geeks, data wizards, future makers” with some “insurance guys” thrown in for good measure. Insurance innovation is born right here. Daniela Pak-Graf, the managing director of Merkur Innovation Lab, the innovation arm of Merkur Insurance, told us in the Data Democratization Podcast:
“Merkur Innovation Lab is the small daughter, the small startup of a very old company. Our CEO had the idea, we have so much data, and we're using the data only for calculating insurance products, calculating our costs, and in the era of big data of Google, of Amazon, Netflix, there have to be more possibilities for health insurance data too. He said, "Yes, a new project, a new business, what can we do with our data?" Since 2020, we are doing a lot.”
Oh, and then there’s synthetic health data.
The Merkur Innovation Lab has fast become a blueprint for other organizations looking to develop insurance innovations by adopting synthetic data. In the following, we’ll introduce three insurance innovations powered by synthetic data adoption.
Like many other data-driven teams, the Merkur Innovation Lab team faced the challenge of ensuring data privacy while still benefiting from valuable insights. The team experimented with data anonymization and aggregation but realized that it fell short of providing complete protection. The search for a more comprehensive solution led them to the world of synthetic data.
According to Daniela Pak-Graf, the solution to the problem is synthetic data:
"We found our way around it, and we are innovating with the most sensitive data there is, health data. Thanks to MOSTLY."
Merkur didn’t waste time in leveraging the power of synthetic data to quickly unlock the insights contained within their sensitive customer data. The team has created a beautifully integrated and automated data pipeline that enables systematic synthetic data generation on a daily basis, fueling insurance innovations across the organization. Here’s how they crafted their synthetic data pipeline:
The end-to-end automated workflow has cut Merkur’s time-to-data from one month to one day. The resulting granular synthetic health data is read into a dynamic dashboard that showcases a tailored ‘Monetary Analysis’ of Merkur’s customer population. And the data is available for consumption by anyone at any time: true data democratization and insurance innovation on tap.
Traditional data sharing approaches, particularly in sensitive industries like health and finance, are often complicated by regulatory constraints and privacy concerns. Synthetic data offers a quick and secure way to facilitate data collaboration, without which scaling insurance innovations would be impossible.
According to Daniela:
“...one of the biggest opportunities is working with third parties. When speaking to other companies, not only insurance companies, but companies working with health data or customer data, there's always the problem, "How can we work together?" There are quite complex algorithms. I don't know, homomorphic encryption. No one understands homomorphic encryption, and it's not something which can be done quickly. Using synthetic data, it's a quick fix if you have a dedicated team who can work with synthetic data.”
One exciting collaboration enabled by synthetic data is Merkur Innovation Lab’s work with Stryker Labs. Stryker Labs is a startup focused on providing training management tools for professional athletes. The collaboration aims to extend the benefits of proactive healthcare and injury prevention to all enthusiasts and hobby athletes by merging diverse datasets from the adjacent worlds of sport and health. Daniela explained the concept:
“The idea is to use their expertise and our knowledge about injuries, the results, the medication, how long with which injury you have to stay in hospital, what's the prescribed rehabilitation, and so on. The idea is to use their business idea, our business idea, and develop a new one where the prevention of injuries is not only for professional sports, but also for you, me, the occasional runner, the occasional tennis player, the occasional, I don't know.”
This exciting venture has the potential to improve the well-being of a broader and more diverse population, beyond the privileged few who make it into the professional sporting ranks.
Another promising aspect of synthetic data lies in its potential to address gender bias and promote fairness in healthcare. By making more diverse datasets available, synthetic data can pave the way for personalized, fairer health services for women. In the future, Merkur Innovation Lab plans to leverage synthetic data to develop predictive models and medication tailored for women, a step towards better healthcare equality. According to Daniela:
“...it could be a solution to doing machine learning, developing machine learning algorithms with less bias. I don't know, minorities, gender equality. We are now trying to do a few POCs. How to use synthetic data for more ethical algorithms and less biased algorithms.”
Insurance companies have always been amongst the most data-savvy innovators. Looking ahead, we predict that the insurance sector will continue to lead the way in adopting sophisticated AI and analytics. The list of AI use cases in insurance continues to grow and, with it, the need for fast and privacy-safe data access. Synthetic data in insurance unlocks the vast amount of intelligence locked up in customer data in a privacy-compliant way. Synthetic healthcare data platforms are becoming a focal point for companies looking to accelerate insurance innovations.
The Merkur Innovation Lab team of “geeks, data wizards, future makers” is only getting started on its synthetic data journey. However, they can already add “synthetic data trailblazers” to that list. They join a short (but growing) list of innovators in the insurance space, like our friends at Humana, who are creating winning data-centric products with their synthetic data sandbox.
Machine learning and AI applications are becoming more and more common across industries and organizations. This makes it essential for more and more developers to understand not only how machine learning models work, but how they are developed, deployed, and maintained. In other words, it becomes crucial to understand the machine learning process in its entirety. This process is often referred to as “the machine learning life cycle”. Maintaining and improving the quality of a machine learning life cycle enables you to develop models that consistently perform well, operate efficiently and mitigate risks.
This article will walk you through the main challenges involved in ensuring your machine learning life cycle is performing at its best. The most important factor is the data that is used for training. Machine learning models are only as good as the data that goes into them; a classic example of “garbage in, garbage out”.
Synthetic data can play a crucial role here. Injecting synthetic data into your machine learning life cycle at key stages will improve the performance, reliability, and security of your models.
A machine learning life cycle is the process of developing, implementing, and maintaining a machine learning project. It includes both the collection of data and the making of predictions based on that data. A machine learning life cycle typically consists of the following steps:
In reality, the process of a machine learning life cycle is almost never linear. The order of steps may shift and some steps may be repeated as changes to the data, the context, or the business goal occur.
There are plenty of resources out there that describe the traditional machine learning life cycle. Each resource may have a slightly different way of defining the process but the basic building blocks of a machine learning life cycle are commonly agreed upon. There’s not much new to add there.
This article will focus on how you can improve your machine learning life cycle using synthetic data. The article will discuss common challenges that any machine learning life cycle faces and show you how synthetic data can help you overcome these common problems. By the end of this article, you will have a clear understanding of how you can leverage synthetic data to boost the performance of your machine learning models.
The short version: synthetic data can boost your machine learning life cycle because it is:
Read on to learn more 🧐
Every machine learning life cycle encounters some, if not all, of the problems listed here:
Let’s take a look at each one of these problems and see how synthetic data can support the quality of your machine learning life cycle in each case.
The problem starts right at the first step of any machine learning project: data collection. Any machine learning model is only as good as the data that goes into it, and collecting high-quality, usable data is becoming more and more difficult. While the overall volume of data available to analysts may well be exploding, only a small fraction of it can actually be used for machine learning applications. Privacy regulations and concerns prevent many organizations from using available data as part of their machine learning project. It is estimated that only 15-20% of customers consent to their data being used for analytics, which includes training machine learning models.
Synthetic data is infinitely available. Use synthetic data to generate enough data to train your models without running into privacy concerns.
Once your generator has been trained on the original dataset it is able to generate as many rows of high-quality, synthetic data as you need for your machine learning application. This is a game-changer as you no longer have to scrape together enough high-quality rows to make your machine learning project work. Make sure to use a synthetic data generator that performs well on privacy-preservation benchmarks.
Real-world data is messy. Whether it’s due to human error, sensor failure or another kind of anomaly, real-world datasets almost always contain incorrect or missing values. These values need to be identified and either corrected or removed from the dataset. The first option (correction) is time-intensive and painstaking work. The second option (removal) is less demanding, but can lead to a decrease in the performance of the downstream machine learning model as it means removing valuable training data.
Even if the data sourcing is somehow perfect – what a world that would be! – your machine learning lifecycle may still be negatively impacted by an imbalanced dataset. This is especially relevant for classification problems with a majority and a minority class in which the minority class needs to be identified.
Fraud detection in credit card transactions is a good example of this: the vast majority of credit card transactions are perfectly acceptable and only a very small portion of transactions are fraudulent. It is crucial to credit card companies that this very small portion of transactions is properly identified and dealt with. The problem is that a machine learning model trained on the original dataset will not have enough examples of fraudulent behavior to properly learn how to identify them because the dataset is imbalanced.
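To make the problem concrete, here is a minimal sketch of why plain accuracy is misleading on such a dataset. The class counts are made up for illustration and do not come from any real credit card data. The "model" simply predicts the majority class every time.

```python
# Toy illustration: the class counts below are invented, not real data.
n_legit = 9_990  # majority class: legitimate transactions
n_fraud = 10     # minority class: fraudulent transactions

true_labels = [0] * n_legit + [1] * n_fraud

# A trivial "model" that always predicts "not fraud".
predictions = [0] * len(true_labels)

correct = sum(p == t for p, t in zip(predictions, true_labels))
accuracy = correct / len(true_labels)

# Recall on the fraud class: share of actual frauds that were flagged.
frauds_caught = sum(p == 1 and t == 1 for p, t in zip(predictions, true_labels))
fraud_recall = frauds_caught / n_fraud

print(f"accuracy:     {accuracy:.3f}")      # 0.999
print(f"fraud recall: {fraud_recall:.3f}")  # 0.000
```

The model looks excellent on accuracy while catching no fraud at all, which is why the minority class needs more (and more diverse) examples during training.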
Synthetic data can be better than real. Use synthetic data to improve the quality of your original dataset through smart imputation and synthetic rebalancing.
Many machine learning models suffer from biases embedded in the training data, which negatively impact the model’s fairness. This can harm society as well as companies’ reputation and profit. In one infamous case investigated by ProPublica, a machine learning model used by the U.S. judicial system was shown to make biased decisions according to defendants’ ethnicity. This led to incorrect predictions of defendants’ likelihood to re-offend, which in turn affected their access to early probation or treatment programs.
While there is no single cause for biased training data, one of the major problems is a lack of sufficient training data, leading to certain demographic groups being underrepresented. As we have seen, synthetic data can overcome this problem both because it is infinitely available and because imbalances in the data can be fixed using synthetic upsampling.
But biases in AI and machine learning models are not always due to insufficient training data. Many of the biases are simply present in the data because we as humans are all biased to some degree and these human biases find their way into training data.
This is precisely where Fair Synthetic Data comes in. Fair Synthetic Data is data whose biases have been corrected so that it satisfies statistical fairness criteria such as demographic parity. By adding fairness constraints to their models, synthetic data generators are able to enforce these statistical measures of fairness.
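Demographic parity asks that the rate of positive outcomes be the same across groups. The sketch below shows how one might measure the gap on a dataset. The function names and the toy data are our own illustrative assumptions. It only measures parity and does not show how a fair synthetic data generator enforces it.

```python
from collections import defaultdict

def positive_rates(groups, outcomes):
    """Share of positive outcomes (1) within each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, outcome in zip(groups, outcomes):
        totals[group] += 1
        positives[group] += outcome
    return {group: positives[group] / totals[group] for group in totals}

def demographic_parity_gap(groups, outcomes):
    """Largest difference in positive rate between any two groups (0 = parity)."""
    rates = positive_rates(groups, outcomes)
    return max(rates.values()) - min(rates.values())

# Invented toy data: group membership and a binary outcome (e.g. "approved").
groups = ["A"] * 5 + ["B"] * 5
outcomes = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

rates = positive_rates(groups, outcomes)
gap = demographic_parity_gap(groups, outcomes)
print(rates)                     # group A: 0.8, group B: 0.2
print(f"parity gap: {gap:.2f}")  # 0.60
```

A gap close to zero indicates that the outcome is distributed evenly across the groups being compared.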
Synthetic data can increase data fairness. Use synthetic data to deal with embedded biases in your dataset by increasing the size and diversity of the training data and ensuring demographic parity.
Once a machine learning model has been trained, it needs to be tuned in order to boost its performance. This is generally done through hyperparameter optimization, which searches for the hyperparameter values that yield the best results. While this is a useful tool for enhancing model performance, the improvements tend to be marginal. They are ultimately limited by the quality and quantity of your training data.
If you are working with a flexible capacity machine learning model (like XGBoost, LightGBM, or Random Forest), you may be able to use synthetic data to boost your machine learning model’s performance. While traditional machine learning models like logistic regression and decision trees have a low and fixed model capacity (meaning they can’t get any smarter by feeding them more training data), modern ensemble methods saturate at a much later point and can benefit from more training data samples.
In some cases, machine learning model accuracy can improve by up to 15% when the original training data is supplemented with additional synthetic samples.
Once your model performance has been fine-tuned, it will need to be maintained. Data drift is a common issue affecting machine learning models. As time passes, the distributions in the dataset change and the model is no longer operating at maximum performance. This generally requires a re-training of the model on updated data so that it can learn to recognize the new patterns.
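One common way to spot this kind of drift is to compare the distribution of a feature at training time with its distribution in fresh data. The sketch below uses the Population Stability Index (PSI) for that purpose. This is our choice for illustration, not a method prescribed in this article, and the `psi` helper and the simulated numbers are assumptions.

```python
import math
import random

def psi(reference, current, n_bins=10):
    """Population Stability Index between two numeric samples.

    Bins are equal-width over the range of the reference sample.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp values outside the reference range
            counts[idx] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = bin_shares(reference), bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

rng = random.Random(0)  # fixed seed for reproducibility
training = [rng.gauss(50, 10) for _ in range(5000)]  # data the model was trained on
stable = [rng.gauss(50, 10) for _ in range(5000)]    # new data, same distribution
drifted = [rng.gauss(58, 10) for _ in range(5000)]   # new data, shifted mean

print(f"PSI, no drift:   {psi(training, stable):.3f}")
print(f"PSI, with drift: {psi(training, drifted):.3f}")
```

A PSI near zero suggests the feature is stable, while a clearly larger value is a signal that re-training on updated data may be due.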
Synthetic data can increase model accuracy. Use synthetic data to boost flexible-capacity model performance by providing additional training samples and to combat data drift by generating fresh data samples on demand.
The final step of any machine learning life cycle is the explanation and sharing of the model and its results.
Firstly, the project stakeholders are naturally interested in seeing and understanding the results of the machine learning project. This means presenting the results as well as explaining how the machine learning model arrived at these results. While this may seem straightforward at first, this may become complicated due to privacy concerns.
Secondly, many countries have AI governance regulations in place that require access to both the training data and the model itself. This may pose a problem if the training data is sensitive and cannot be shared further. In this case, high-quality, representative synthetic data can serve as a drop-in replacement. This synthetic data can then be used to perform model documentation, model validation and model certification. These are key components of establishing trust in AI.
Synthetic data safeguards privacy. Use synthetic data to support Explainable AI efforts by providing highly representative versions of sensitive training datasets.
Synthetic data can address key pain points in any machine learning life cycle. This is because synthetic data can overcome limitations of the original, raw data collected from ‘the real world’. Specifically, synthetic data can be highly available, balanced, and unbiased.
You can improve the quality of your machine learning life cycle by using synthetic data to:
If you’re looking for a synthetic data generator that is able to consistently deliver optimal privacy and utility performance, give synthetic data generation a try today and let us know what you think – the first 100K rows of synthetic data are on us!
Acquiring real-world data can be challenging. Limited availability, privacy concerns, and cost constraints are the usual suspects making life difficult for the average data consumer. Generative AI synthetic data has emerged as a powerful solution to overcome these limitations. However, it’s not enough to add yet another tool to the tech stack. In order to serve the data consumer better, the data architecture also needs to change.
While traditional approaches involve synthesizing data from centralized storage or data warehouses, a more effective and efficient strategy is to bring generative AI synthetic data closer to the data consumer. In this blog post, we explore the importance of this approach and how it can unlock new possibilities in data-driven applications.
Traditional data synthesis approaches usually rely on centralized storage, creating bottlenecks and delays in data access. The centralized governance model hinders the agility and autonomy of data consumers, limiting their ability to respond quickly to evolving needs. Moreover, traditional synthesis methods struggle to scale and accommodate diverse data requirements, making it challenging to meet the specific needs of individual data consumers. The one-size-fits-all approach doesn’t work.
In many organizations, data owners focus on replacing legacy data anonymization processes with generative AI synthetic data to populate lower environments, mistaking data availability for data usability. Generating full versions of their production databases resolves the data accessibility problem but leaves much of the power of generative AI synthetic data locked away. It's crucial to move beyond the mindset of merely replacing original data with synthetic data and instead focus on bringing generative AI synthetic data closer to the data consumer, not only in terms of proximity but also in terms of usability.
By bringing synthetic data generation closer to the data consumer, organizations empower their teams with increased autonomy and agility. Data consumers gain greater control and flexibility in generating synthetic data tailored to their requirements, enabling faster decision-making, experimentation, and innovation.
Generative AI models can upsample minority classes for better representation, downsample high quantities of data for smaller but still representative datasets, and augment the data by filling the gaps in the original data. This level of customization and control allows data consumers to improve overall data quality and diversity and address the following data challenges:
Proximity to the data consumer minimizes delays in accessing and synthesizing data. Rather than relying on centralized storage, generative AI models can be deployed closer to the data consumer, ensuring faster generation and synthesis of synthetic data. This reduced latency results in more efficient workflows and quicker insights for data consumers.
The proximity between generative AI synthetic data generators and data consumers fosters data collaboration and innovation. Data consumers can work closely with the generative AI model creators, providing feedback and insights to improve the quality and relevance of the synthetic data. This collaborative approach facilitates faster innovation, experimentation, and prototyping, unlocking new possibilities in various domains.
In healthcare, generating synthetic data from patient data closer to the data consumer can revolutionize diagnostic and treatment research. Researchers and data scientists can utilize generative AI models to create synthetic patient data that captures a wide range of medical conditions, demographics, and treatment histories embedded in the real data. This synthetic data can be used to train and validate predictive models, enabling more accurate diagnosis, personalized treatment plans, and drug development without compromising patient privacy or waiting for access to original patient data. A healthcare data platform populated with synthetic health data can empower data consumers even outside the organization, like in the case of Humana's synthetic data exchange, accelerating innovation, research and development.
In the financial industry, synthetic data generated from privacy-sensitive financial transaction data and brought closer to the data consumer can significantly improve fraud detection capabilities. Financial institutions can train machine learning models to identify and prevent fraudulent transactions by generating synthetic data representing various fraudulent activities, including upsampling fraud patterns. Using this upsampled synthetic data, organizations can stay ahead of evolving fraud techniques without compromising the privacy and security of original customer data.
The traditional approach of synthesizing data from centralized storage or data warehouses is no longer enough to meet the evolving needs of organizations. Bringing generative AI synthetic data closer to the data consumer offers a paradigm shift in data synthesis. More autonomy, less latency, improved privacy, and higher levels of customization are all among the benefits. Organizations that embrace this approach promote collaboration, experimentation, and innovation, and can unlock new possibilities and leverage the full potential of synthetic data.
By bringing generative AI synthetic data closer to the data consumer, we can embark on a transformative journey that empowers data consumers and accelerates the development of intelligent applications in various industries.
The process of creating synthetic data that resembles real-world (or potential real-world) data is referred to as data simulation. It is widely used in domains such as statistics, machine learning, and computer science, for purposes that include testing algorithms, assessing models, and performing experiments.
Data simulation is the process of producing a dataset with specified traits and qualities that imitate the patterns, distributions, and correlations seen in real data, or those one would expect to see in real data (e.g. in the future). This generated data may be used to conduct studies, evaluate the efficacy of statistical methods or machine learning algorithms, and investigate various scenarios without the limitations associated with real data collection.
Synthetic data and data simulation are closely related concepts, as synthetic data is often generated through data simulation techniques. In the past few years, ML-generated synthetic data has become more and more popular. Artificial intelligence and machine learning models are leveraged to create synthetic data of very high quality.
Data simulation is an invaluable tool for businesses of all sizes. It has several advantages that help with decision-making, risk assessment, performance evaluation, and model creation.
One of the key benefits of data simulation is its capacity to assist in making informed decisions. Organizations can explore numerous alternatives and analyze potential results by simulating different situations that closely reflect real-world settings. This enables individuals to make data-driven decisions, reducing uncertainty and increasing the possibility of obtaining desired outcomes.
Risk assessment and management are also greatly enhanced through data simulation. Organizations may simulate various risk scenarios, assess their likelihood and impact, and design risk-mitigation strategies. They may implement proper risk management strategies and defend themselves against possible threats by proactively identifying vulnerabilities and analyzing the potential repercussions of various risk variables.
When it comes to model development and testing, synthetic data generated through simulation is highly valuable. Organizations can train and test statistical or machine learning models in controlled environments by developing synthetic datasets that closely imitate the properties of real data. This allows them to uncover flaws, enhance model accuracy, and reduce error risk before deploying the models in real-world settings.
Data simulation comprises a wide range of methodologies and technologies that businesses may use to produce simulated data. These techniques and tools cater to different data characteristics and requirements, providing flexibility and versatility in data simulation. Let's take a closer look at some regularly used strategies and tools!
Random sampling is a key tool for data simulation. This method entails picking data points at random from an existing dataset or creating new data points based on random distributions. When the data has a known distribution or a representative sample is required, random sampling is valuable.
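As a minimal sketch, the three flavors of random sampling mentioned here can be written with Python's standard `random` module. The claim amounts and the distribution parameters are invented purely for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Invented "existing dataset" of claim amounts.
existing_claims = [120.0, 340.5, 89.9, 1500.0, 410.0, 275.3, 60.0, 980.0]

# 1) Pick data points at random from an existing dataset (without replacement).
subset = rng.sample(existing_claims, k=4)

# 2) Pick with replacement, which allows more draws than there are records.
bootstrap = rng.choices(existing_claims, k=20)

# 3) Create new data points from an assumed distribution
#    (here: customer ages, normal with mean 40 and standard deviation 12).
simulated_ages = [rng.gauss(40, 12) for _ in range(1000)]

print(subset)
print(len(bootstrap), len(simulated_ages))
```

The third variant only makes sense when the distribution is known or can be reasonably assumed, which is the condition the paragraph above points out.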
Another extensively used approach in data simulation is Monte Carlo simulation. It makes use of random sampling to describe and simulate complex systems that include inherent uncertainty. Monte Carlo simulation models a variety of possible outcomes by producing many random samples based on probability distributions. This approach is used in a variety of industries, including finance, physics, and engineering.
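A small Monte Carlo sketch, under assumptions we made up for illustration: a portfolio of 200 policies, each with a 5% chance of a claim per year, and exponentially distributed claim sizes with a mean of 2,000. Running many simulated years gives an estimate of the expected total and of the chance that claims exceed a given reserve.

```python
import random

def simulate_annual_claims(rng, n_policies=200, claim_prob=0.05, mean_claim=2_000.0):
    """Simulate the total claims amount of one year for a hypothetical portfolio."""
    total = 0.0
    for _ in range(n_policies):
        if rng.random() < claim_prob:
            # expovariate takes the rate, i.e. 1 / mean.
            total += rng.expovariate(1.0 / mean_claim)
    return total

rng = random.Random(7)  # fixed seed for reproducibility
n_runs = 5_000
totals = [simulate_annual_claims(rng) for _ in range(n_runs)]

expected_total = sum(totals) / n_runs
reserve = 35_000.0  # hypothetical reserve level
prob_exceed = sum(t > reserve for t in totals) / n_runs

print(f"estimated expected total claims: {expected_total:,.0f}")
print(f"estimated P(total > reserve):    {prob_exceed:.3f}")
```

The analytical expectation here is 200 × 0.05 × 2,000 = 20,000, so the simulated average should land close to that value. The tail probability is the kind of quantity that is hard to derive by hand and easy to estimate by simulation.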
For data simulation, statistical modeling techniques such as regression analysis, time series analysis, and Bayesian modeling can be employed. These approaches fit statistical models to existing data and then use the fitted models to produce simulated data that closely mimics the original dataset.
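The fit-then-simulate idea can be sketched with a simple linear regression. The ages and premiums below are invented, and the noise model (normal noise with the residuals' sample standard deviation) is a simplifying assumption of ours.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed for reproducibility

# Invented observations: customer age and yearly premium.
ages = [23, 31, 38, 45, 52, 60, 67]
premiums = [410.0, 480.0, 560.0, 610.0, 700.0, 760.0, 850.0]

# Step 1: fit a simple linear model premium = intercept + slope * age (least squares).
mean_x = statistics.fmean(ages)
mean_y = statistics.fmean(premiums)
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, premiums))
    / sum((x - mean_x) ** 2 for x in ages)
)
intercept = mean_y - slope * mean_x

# Step 2: estimate the noise level from the residuals (rough estimate).
residuals = [y - (intercept + slope * x) for x, y in zip(ages, premiums)]
noise_sd = statistics.stdev(residuals)

# Step 3: use the fitted model to generate simulated records.
def simulate_premium(age):
    return intercept + slope * age + rng.gauss(0, noise_sd)

simulated = [(age, simulate_premium(age)) for age in (rng.randint(20, 70) for _ in range(500))]

print(f"fitted model: premium = {intercept:.1f} + {slope:.2f} * age")
print(simulated[:3])
```

The simulated records follow the same age-to-premium relationship as the observed ones, which is exactly what limits this technique: anything the model does not capture is absent from the simulated data.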
To facilitate data simulation, various software packages and tools are available. AnyLogic is a sophisticated simulation program that supports agent-based, discrete-event, and system dynamics modeling. Simul8 is a well-known program for discrete-event simulation and process modeling. Arena is a popular modeling and simulation tool for complex systems, processes, and supply chains. The R and Python programming languages, along with Python packages like NumPy and SciPy, provide substantial capabilities for data simulation and modeling.
For enterprises seeking reliable insights and informed decision-making, ensuring the accuracy of simulated data in contrast to real-world data is critical. Several factors and practices can aid in achieving this precision, allowing for more relevant analysis.
Obtaining a thorough grasp of the data generation process is a critical first step towards realistic data modeling. Collaboration with subject matter experts and domain specialists gives important insights into essential aspects, connections, and distributions that must be included in the simulation. Organizations may build the framework for correct representation by understanding the complexities of the data generation process.
Validation and calibration play a vital role in ensuring the fidelity of simulated data. Comparing statistical properties, such as means, variances, and distributions, between the real and simulated datasets allows for an assessment of accuracy. Calibration involves adjusting simulation parameters and models to achieve a closer match between the simulated and real data, enhancing the quality of the simulation.
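As a rough sketch of such a validation step, the code below compares mean, variance, and median between a real and a simulated sample, and adds a two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical distribution functions) as one possible way to compare whole distributions. The helper names and the generated samples are our own assumptions.

```python
import random
import statistics

def compare_summary(real, simulated):
    """Compare basic summary statistics of a real and a simulated sample."""
    report = {}
    for name, fn in (
        ("mean", statistics.fmean),
        ("variance", statistics.variance),
        ("median", statistics.median),
    ):
        r, s = fn(real), fn(simulated)
        # Relative difference; assumes the real statistic is non-zero.
        report[name] = {"real": r, "simulated": s, "rel_diff": abs(s - r) / abs(r)}
    return report

def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap

rng = random.Random(11)  # fixed seed for reproducibility
real = [rng.gauss(100, 15) for _ in range(2000)]      # stand-in for real data
good_sim = [rng.gauss(100, 15) for _ in range(2000)]  # well-calibrated simulation
bad_sim = [rng.gauss(100, 30) for _ in range(2000)]   # spread is too wide

for label, sim in (("good", good_sim), ("bad", bad_sim)):
    var_diff = compare_summary(real, sim)["variance"]["rel_diff"]
    print(f"{label}: variance rel. diff = {var_diff:.2f}, KS = {ks_statistic(real, sim):.3f}")
```

Note that the poorly calibrated simulation matches the real mean but not the variance, so checking a single statistic is not enough. Calibration would then mean adjusting the simulation's spread parameter until these gaps shrink.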
A feedback loop involving stakeholders and subject experts is essential. Gathering input and thoughts from folks who are familiar with the real data on a regular basis improves the simulation's accuracy. By incorporating their experience into the simulation process, tweaks and enhancements may be made, better matching the simulated data with the real-world environment. Validation against real data on a regular basis maintains the simulation's continuous fidelity.
While simulated data can closely resemble real-world data and offer numerous benefits, it is essential to acknowledge the inherent limitations and assumptions involved in the simulation process. Organizations should recognize the uncertainties and limitations associated with simulated data, using it as a complementary tool alongside real data for analysis and decision-making.
The assumptions and simplifications necessary to mimic real-world settings are one of the fundamental limits of data simulation. Simulated data may not fully reflect the complexities and nuances of the actual data generation process, resulting in possible disparities between simulated and real data. Organizations should be cautious of the assumptions they make as well as the amount of authenticity attained in the simulation.
The accuracy of simulated data is strongly dependent on the quality of the underlying simulation models. Models that are inaccurate or inadequate may fail to convey the complexities and interdependencies seen in real data, resulting in erroneous simulated data. It is vital to ensure the validity and accuracy of simulation models in order to provide relevant insights and dependable forecasts.
Simulated data is intrinsically dependent on the quality and representativeness of the training data used to develop the simulation models. If the training data is biased or does not represent the target population well, the simulated data may inherit those biases. To reduce the possibility of biased simulations, organizations must carefully curate and choose representative training data.
Another danger in data simulation is overfitting, which occurs when models become too closely fitted to the training data, resulting in poor generalization to unseen data. Organizations should take caution and not depend too much on simulated data that has not been thoroughly validated against real-world data. Real-world data should continue to be the gold standard for evaluating the performance and dependability of simulation models.
Data simulation is used in various use cases by banks and financial institutions. Here are the most important examples:
Training machine learning models on synthetic data rather than actual data can potentially increase their performance, because synthetic data can give these models more, and more balanced, examples of the patterns they need to learn. Data simulation can enhance the representation of data in two essential ways: by supplying a greater number of samples than are available in the original dataset and, more notably, by providing additional examples of minority classes that would otherwise be under-represented. These two aspects play a crucial role in addressing the challenges of imbalanced datasets and in expanding the diversity of data for more robust analysis.
Firstly, the MOSTLY AI synthetic data generator allows organizations to generate a larger volume of synthetic data points beyond the existing dataset. This serves as a valuable advantage, particularly in situations where the original data is limited in size or scope. By artificially expanding the dataset through simulation, organizations gain access to a richer and more comprehensive pool of samples, enabling more accurate and reliable analysis. The additional samples offer increased coverage of the data space, capturing a wider range of patterns, trends, and potential outcomes.
Secondly, and perhaps more significantly, data simulation offers the opportunity to address the issue of under-representation of minority classes. In many real-world datasets, certain classes or categories may be significantly under-represented, leading to imbalanced distributions. This can pose challenges in accurately modeling and analyzing the data, as the minority classes may not receive adequate attention or consideration. MOSTLY AI provides a solution by generating synthetic examples specifically targeted towards these under-represented classes. By creating additional instances of the minority classes, the simulated data helps to balance the distribution and ensure a more equitable representation of all classes. This is particularly important in various domains, such as fraud detection, where the minority class (e.g., fraudulent cases) is often of particular interest.
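For contrast, here is a sketch of the simplest baseline for balancing classes, naive random oversampling, which just duplicates existing minority records. It is not how MOSTLY AI generates data, and the function name and toy records are our own assumptions. It does show the limitation that synthetic upsampling aims to overcome: the class counts become balanced, but no new minority examples are created.

```python
import random
from collections import Counter

def naive_oversample(rows, label_key, rng):
    """Duplicate randomly chosen minority rows until all classes are equally large."""
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(members) for members in by_class.values())

    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        # Draw the missing rows with replacement from the existing ones.
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced

rng = random.Random(5)  # fixed seed for reproducibility

# Invented toy transactions: 990 legitimate, 10 fraudulent.
rows = [{"amount": rng.uniform(5, 500), "is_fraud": 0} for _ in range(990)]
rows += [{"amount": rng.uniform(300, 3000), "is_fraud": 1} for _ in range(10)]

balanced = naive_oversample(rows, "is_fraud", rng)

before = Counter(r["is_fraud"] for r in rows)
after = Counter(r["is_fraud"] for r in balanced)
distinct_fraud_amounts = len({r["amount"] for r in balanced if r["is_fraud"] == 1})

print("before:", dict(before))
print("after: ", dict(after))
print("distinct fraud records after oversampling:", distinct_fraud_amounts)
```

After oversampling there are 990 fraud rows, yet they are still copies of the same 10 originals. A generative model, by comparison, is meant to produce new and varied minority records.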
As discussed above, it is important to note that data simulation is not without its own challenges. The process of generating synthetic data requires careful consideration and validation to ensure that the simulated samples accurately capture the characteristics and patterns of the real-world data. The quality of the simulation techniques and the fidelity of the generated data are critical factors that need to be addressed to maintain the integrity of the simulation process.
Recently, MOSTLY AI introduced data augmentation features. Rebalancing is one of them, and it can be used as a data simulation tool. Many businesses, particularly financial institutions, suffer because the data they have accumulated over time is significantly skewed and biased towards a specific behavior. There are many examples of data sets that are skewed or biased with respect to gender, age, ethnicity, or even occupation. As a result, decision-makers are sometimes unable to make the optimal decision that will help their firm flourish.
The goal of using rebalancing for simulation is to give decision-makers an effective way to understand and exploit newly acquired information that may influence the decisions they have to make. Rebalancing can be a useful technique for testing multiple hypotheses and 'what-if' scenarios that may lead to a change in the organization's overall strategy.
Take the insurance business as an example. Total yearly premiums and total claims amount are two of the most important KPIs for all insurers worldwide. One might use rebalancing as a simulation tool to answer questions like:
Using MOSTLY AI's rebalancing feature, we shifted the insurance customer mix toward younger audiences. MOSTLY AI's synthetic data generator then created the remaining attributes of the dataset conditional on this new distribution. The two KPIs mentioned above changed accordingly, and a decision-maker may notice that income has increased while costs have fallen.
Stakeholders can use this kind of analysis to guide their judgments and, where appropriate, adjust their organizational strategy.
In the dynamic and data-intensive landscape, data simulation has emerged as a powerful tool for organizations seeking to enhance decision-making, manage risks, and optimize operations. We've seen how data simulation helps organizations get useful insights and create informed strategies through a variety of effective use cases.
Data simulation has become an indispensable tool for organizations, providing them with the means to make evidence-based decisions, optimize strategies, and navigate complex landscapes. As organizations embrace the power of simulated data, they can unlock new insights, enhance their competitive advantage, and deliver superior services in an ever-changing world.
Data catalog tools enable centralized metadata management, providing a comprehensive inventory of all the data assets within an organization. A data catalog is a searchable, curated, and organized inventory of all the data sources, datasets, and data flows, along with information about their lineage, quality, and other attributes.
Data catalogs are the single source of truth for all data assets. Data catalogs make it easier for data users and stakeholders to discover, understand, and trust the data that is available to them. They provide detailed information about the structure, content, and context of data assets, including their data definitions, data types, and relationships to other data assets.
By providing a centralized view of data assets, data catalogs can help organizations to better manage and govern their data. Data catalog tools also facilitate compliance by providing visibility into the data lineage and usage, as well as access controls and permissions.
Data catalog tools are software applications that allow you to create and manage the collection of data assets described above. These tools typically store and share metadata about data assets, including data definitions, data types, data lineage, and data quality.
Developing and maintaining these extensive data assets without a dedicated tool is near-impossible once tables number in the thousands, which is often the case in larger organizations.
Data catalogs have existed in some form for almost as long as computing, but with the evolution of large-scale data and distributed data architectures, they have become mission-critical. Today's data catalog tools often incorporate machine learning and artificial intelligence capabilities, which in some cases enable them to classify, tag, and analyze data assets automatically.
While data catalogs can make it easier to manage data assets, there are also several challenges associated with using data catalog tools.
Data catalogs rely on accurate and up-to-date metadata to be effective, and poor data quality can undermine the usefulness of the catalog. If the metadata is incomplete, inconsistent, or inaccurate, it can lead to confusion and misinterpretation of the data.
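As a rough illustration of what checking for incomplete metadata could look like, the sketch below scans catalog entries for empty descriptions, missing owners, and undocumented column types. The `TableMetadata` structure, its fields, and the sample entries are assumptions made for this example; they do not correspond to the data model of any particular data catalog tool.

```python
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    """Hypothetical, simplified metadata record for one table."""
    name: str
    description: str = ""
    owner: str = ""
    column_types: dict = field(default_factory=dict)  # column name -> data type


def find_metadata_gaps(table):
    """Return a list of human-readable problems for one catalog entry."""
    gaps = []
    if not table.description.strip():
        gaps.append("missing description")
    if not table.owner.strip():
        gaps.append("missing owner")
    if not table.column_types:
        gaps.append("no columns documented")
    for column, dtype in table.column_types.items():
        if not dtype:
            gaps.append(f"column '{column}' has no data type")
    return gaps


# Made-up catalog entries: one well documented, one incomplete.
catalog = [
    TableMetadata("claims", "Insurance claims per policy", "data-steward",
                  {"claim_id": "int", "amount": "decimal"}),
    TableMetadata("customers", "", "", {"customer_id": "int", "age": ""}),
]

report = {table.name: find_metadata_gaps(table) for table in catalog}
for name, gaps in report.items():
    print(name, "->", gaps or "ok")
```

A check like this only catches incomplete metadata. Inconsistent or inaccurate metadata usually needs human review by someone who knows the data.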
Data catalogs can be an important tool for enforcing data governance policies and managing data assets, but this requires careful planning and implementation to ensure that the right policies and procedures are in place. You can’t govern what you can’t see. An effective data catalog tool allows governance teams to track governance initiatives and cooperate with other stakeholders across the organization. Confining governance to IT departments is a mistake that should be avoided. However, sharing data assets downstream comes with its own privacy issues. AI-generated, realistic, yet privacy-protective synthetic data can serve as a drop-in replacement for production data samples.
Data catalogs need to integrate with other data management tools, such as data warehouses, data lakes, and ETL tools, in order to provide a comprehensive view of an organization's data assets. This can be challenging, particularly when dealing with legacy systems or complex data architectures.
Data catalogs require ongoing maintenance and updates to ensure that the metadata is accurate and up-to-date. This can be time-consuming and resource-intensive, particularly for larger organizations or those with complex data architectures.
Data catalogs can provide significant benefits, but they require careful planning, implementation, and ongoing curation to be effective. In our experience, it pays to have a dedicated team of data stewards who truly care about data democratization.
A data consumer uses a data catalog to find data. They may use full text search across the entire data catalog content, or navigate in a more structured manner and use filters to search for very specific tables, for example. In most cases, the user ends up on the catalog page of a table. A table is the most relevant entity for a data consumer. On the table page, they can inspect the title, description, and any other custom fields at the table level, and go into the details of each column, as well.
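The two ways of finding a table described above, full-text search and structured filtering, can be sketched in a few lines. This is a toy in-memory model, not the search API of any real data catalog tool; the `TableEntry` fields, the function names, and the sample entries are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TableEntry:
    """Hypothetical catalog page for a single table."""
    title: str
    description: str
    schema: str
    columns: list = field(default_factory=list)


def full_text_search(entries, query):
    """Case-insensitive substring match over title, description, and column names."""
    needle = query.lower()
    return [
        entry for entry in entries
        if needle in " ".join([entry.title, entry.description, *entry.columns]).lower()
    ]


def filter_by_schema(entries, schema):
    """Structured navigation: keep only the tables in the given schema."""
    return [entry for entry in entries if entry.schema == schema]


# Made-up catalog content for illustration.
entries = [
    TableEntry("policies", "Active insurance policies", "sales", ["policy_id", "premium"]),
    TableEntry("claims", "Claims filed by customers", "finance", ["claim_id", "amount"]),
    TableEntry("customers", "Customer master data", "sales", ["customer_id", "age"]),
]

hits = full_text_search(entries, "customer")
sales_tables = filter_by_schema(entries, "sales")
print([entry.title for entry in hits])
print([entry.title for entry in sales_tables])
```

Either route ends with a single `TableEntry`, which corresponds to the table page where the consumer inspects the title, description, and column details.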
Chief Data Officers can effectively improve analytical capabilities, scale governance, and increase data literacy using a reliable data catalog tool. Enabling people across the organization to self-serve data on the fly should be the ultimate goal, while keeping data privacy and governance policies top of mind. An essential step in the journey towards full data democratization is to develop, curate, and catalog synthetic data products. These readily available, statistically near-identical datasets can accelerate all data-intensive processes, from third-party POCs to the development of accurate machine learning models.
Since data catalog tools typically display sample data on the table page, visible to every catalog user, there is a danger of accidentally revealing sensitive information, such as names, ages, salaries, or health status, and thereby violating privacy. The usual answer to the problem is to mask the sensitive columns. However, data masking renders the sample less useful by destroying readability, while still failing to protect privacy in meaningful ways.
Synthetic data alternatives are needed to provide high readability and privacy protection to sample data displayed within data catalog tools. Furthermore, AI-powered synthetic data generation can also improve data quality by filling in gaps in the existing dataset or providing additional examples of rare or hard-to-find data points.
Some data catalog tools also include built-in SQL editors. If a user has a username and password or other credentials for the database in question, they can start querying the database from within the data catalog tool. They can reuse queries other users have published, and publish their own queries. Here, as well, it may be useful to direct the user (by default) to synthetic data rather than production data.
Synthetic data generation itself can be managed through data catalog tools. Datasets in need of full synthesis or data augmentation can be marked by data consumers directly in the data catalog tool, allowing seamless access to high-quality, curated, or even augmented synthetic datasets. In short, combining data catalogs with synthetic data can be an excellent way of accelerating time-to-value for any data project.
In this tutorial, we'll show you how to add synthetic data to Alation, a data catalog tool.