Sampling is the random selection of values or complete records based on a defined probability distribution. The generation of synthetic data is based on sampling: the underlying distribution is learned during the training of the generative model, and records are then sampled from it to create a final synthetic dataset that contains all the properties […]
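A minimal sketch of this idea, assuming a toy categorical attribute whose probabilities stand in for a learned distribution (in a real generator, these probabilities would come from model training):

```python
import numpy as np

# Toy "learned" distribution over one categorical attribute.
# In a real synthetic data generator, these probabilities are
# learned during training rather than hard-coded.
values = np.array(["A", "B", "C"])
probs = np.array([0.5, 0.3, 0.2])

rng = np.random.default_rng(seed=42)

# Sample 10 synthetic values according to the learned probabilities.
synthetic = rng.choice(values, size=10, p=probs)
print(synthetic)
```

Sampling with these weights means category "A" appears roughly half the time in the long run, so the synthetic output mirrors the distribution it was drawn from.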
Sequential data is data arranged in sequences where order matters. Data points are dependent on other data points in this sequence. Examples of sequential data include customer grocery purchases, patient medical records, or simply a sequence of numbers with a pattern. A special type of sequential data is time series data.
SHAP (SHapley Additive exPlanations) values are used in Explainable AI to better understand the output of machine learning models. They help interpret prediction models by showing the contribution and importance of each attribute to the predictions. Synthetic data can be used to transparently share this information. To calculate SHAP values, it is necessary to […]
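A brute-force sketch of the underlying Shapley calculation, assuming a tiny model and replacing "absent" features with background (mean) values; real SHAP libraries use far more efficient approximations:

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, background):
    """Exact Shapley values by enumerating all feature coalitions.
    Features outside a coalition are replaced by background values.
    Only feasible for a handful of features (2^n coalitions)."""
    n = len(x)

    def v(coalition):
        # Model output with only the coalition's features set to x.
        z = list(background)
        for i in coalition:
            z[i] = x[i]
        return f(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                # Shapley weight: |S|! * (n - |S| - 1)! / n!
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += w * (v(S + (i,)) - v(S))
    return phi

# Hypothetical linear model: contribution of each feature is w_i * x_i.
f = lambda z: 2 * z[0] + 3 * z[1]
print(shapley_values(f, [1.0, 2.0], [0.0, 0.0]))
```

By construction the values are additive: they sum to the difference between the model's prediction for `x` and its prediction for the background, which is what makes them useful as per-attribute contributions.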
SMOTE (Synthetic Minority Oversampling Technique) is an oversampling method based on nearest-neighbor information. It was first developed for numeric columns: the minority class is upsampled by taking each minority-class sample and its nearest neighbors and forming linear combinations of them. SMOTE-NC also takes categorical columns into account and selects the […]
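The numeric case can be sketched in a few lines of NumPy; this is a simplified illustration of the interpolation step, not the full algorithm from the original paper:

```python
import numpy as np

def smote_sample(X_minority, n_new, k=5, seed=0):
    """Minimal SMOTE sketch: create synthetic minority samples by
    interpolating between a minority sample and one of its k
    nearest neighbors within the minority class."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    n = len(X)
    k = min(k, n - 1)
    new_points = []
    for _ in range(n_new):
        i = rng.integers(n)
        # Euclidean distances from sample i to all minority samples.
        d = np.linalg.norm(X - X[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]  # skip the point itself
        j = rng.choice(neighbors)
        gap = rng.random()  # random interpolation factor in [0, 1)
        # Linear combination of the sample and its neighbor.
        new_points.append(X[i] + gap * (X[j] - X[i]))
    return np.array(new_points)

X_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(smote_sample(X_minority, n_new=3))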
Statistical parity is one possible definition of fairness in ML, under which the data or model is adjusted so that decisions are made fairly, without discrimination. The goal is to ensure the same probability of inclusion in the positive predicted class for each sensitive group. An example is that women and men are equally likely to be promoted at […]
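This criterion can be checked directly by comparing positive-prediction rates across groups; a small sketch with hypothetical promotion decisions:

```python
import numpy as np

def statistical_parity_difference(y_pred, group):
    """Difference in positive-prediction rates between two groups.
    A value of 0.0 means perfect statistical parity."""
    y_pred = np.asarray(y_pred)
    group = np.asarray(group)
    rate_a = y_pred[group == 0].mean()
    rate_b = y_pred[group == 1].mean()
    return rate_a - rate_b

# Hypothetical promotion decisions (1 = promoted),
# group 0 = men, group 1 = women.
y_pred = [1, 0, 1, 1, 0, 1, 0, 0]
group  = [0, 0, 0, 0, 1, 1, 1, 1]
print(statistical_parity_difference(y_pred, group))  # 0.75 - 0.25 = 0.5
```

Here men are promoted at a 75% rate and women at 25%, so the difference of 0.5 signals a violation of statistical parity; a fair model under this definition would bring the difference close to zero.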
Statistically significant is a term used in hypothesis testing. When you test some null hypothesis, such as whether sample S1 and sample S2 have the same median, you must consider not only the observed medians but also the variance present in the samples and construct a confidence interval that helps decide whether you can reject […]
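One common way to build such a confidence interval is the bootstrap; a sketch for the median-difference example above (the resampling scheme and significance level are illustrative choices, not the only option):

```python
import numpy as np

def bootstrap_median_diff_ci(s1, s2, n_boot=5000, alpha=0.05, seed=0):
    """Bootstrap confidence interval for median(S1) - median(S2).
    If the interval excludes 0, the observed difference is
    statistically significant at the given alpha level."""
    rng = np.random.default_rng(seed)
    s1, s2 = np.asarray(s1), np.asarray(s2)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        # Resample each sample with replacement and record
        # the difference of medians for this replicate.
        r1 = rng.choice(s1, size=len(s1), replace=True)
        r2 = rng.choice(s2, size=len(s2), replace=True)
        diffs[b] = np.median(r1) - np.median(r2)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

Because the interval is built from the resampled variability of both samples, it captures exactly the point made above: the observed medians alone are not enough to reject the null hypothesis.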
Stochastic (random or probabilistic) is the property of having a random probability distribution or pattern that can be analyzed statistically. In the case of synthetic data generation, the original data follows a probability distribution, so all statistical (i.e., non-deterministic) relationships between any number of attributes can be learned using stochastic modeling. Such modeling cannot ensure […]
Structured data is data organized in a predefined format that is easily accessible to a human or a computer. An example of structured data is tabular data stored in the form of rows and columns. CSV files and Parquet files are typical formats for structured data. Structured data is typically stored in relational databases.
Synthetic data is generated by deep learning algorithms trained on real data samples. Synthetic data is used as a proxy for real data in a wide variety of use cases, ranging from data anonymization and AI and machine learning development to data sharing and data democratization. There are different synthetic data generators with different capabilities and synthetic data […]