Sampling
Sampling is the random selection of values or complete records based on a defined probability distribution. Synthetic data generation is based on sampling: the underlying distribution is learned while the synthetic model is trained, and records are then sampled from it to create a final synthetic dataset that contains all the properties […]
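As a minimal sketch of this idea, the snippet below "learns" an empirical distribution from a single categorical column and then samples new values from it. The function names `fit_distribution` and `sample` and the toy data are assumptions made for illustration only; a real synthetic data model learns a far richer joint distribution over all columns.

```python
import random
from collections import Counter

def fit_distribution(values):
    """Estimate category probabilities from observed values (hypothetical helper)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

def sample(distribution, n, seed=None):
    """Draw n values at random according to the estimated probabilities."""
    rng = random.Random(seed)
    categories = list(distribution)
    weights = [distribution[c] for c in categories]
    return rng.choices(categories, weights=weights, k=n)

# Toy "original" column, invented for this example.
original = ["apple", "apple", "banana", "apple", "cherry", "banana"]
dist = fit_distribution(original)
synthetic = sample(dist, n=1000, seed=0)
```

With enough samples, the category frequencies in `synthetic` approach those in `original`, which is the sense in which sampled data preserves the properties of the learned distribution.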
Sequential data
Sequential data is data arranged in sequences where order matters. Data points are dependent on other data points in this sequence. Examples of sequential data include customer grocery purchases, patient medical records, or simply a sequence of numbers with a pattern. A special type of sequential data is time series data.
SHAP value
SHAP (SHapley Additive exPlanations) values are used in Explainable AI to better understand the output of machine learning models. They help interpret prediction models by showing the contribution and importance of each attribute to the predictions. Synthetic data can be used to transparently share this information. To calculate SHAP values, it is necessary to […]
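To make "contribution of each attribute" concrete, here is a brute-force sketch of exact Shapley values for a tiny model. It assumes that an absent attribute is represented by a fixed baseline value, which is only one of several conventions, and it enumerates every coalition, so it is feasible only for a handful of attributes. The names `shapley_values` and `toy_model` are hypothetical; practical tools rely on approximations instead.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, instance, baseline):
    """Exact Shapley values by enumerating all coalitions (small inputs only)."""
    n = len(instance)
    features = range(n)

    def value(coalition):
        # Attributes outside the coalition are replaced by their baseline value.
        x = [instance[i] if i in coalition else baseline[i] for i in features]
        return predict(x)

    phi = []
    for i in features:
        others = [j for j in features if j != i]
        total = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                # Shapley weight for a coalition of this size.
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                members = set(subset)
                total += weight * (value(members | {i}) - value(members))
        phi.append(total)
    return phi

def toy_model(x):
    # Invented model with one interaction between attributes 0 and 2.
    return 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[0] * x[2]

instance = [1.0, 3.0, 2.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
```

The values in `phi` add up to the difference between the prediction for `instance` and the prediction for `baseline`, and the interaction term is split evenly between the two attributes involved.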
SMOTE
SMOTE (Synthetic Minority Oversampling Technique) is an oversampling technique based on nearest-neighbor information. It was first developed for numeric columns: the minority class is upsampled by taking each sample of the minority class and its nearest neighbors and forming a linear combination of them. SMOTE-NC also takes categorical columns into account and selects the […]
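The numeric case can be sketched in a few lines. The snippet below is a simplified illustration rather than a reference implementation: it picks a random minority sample, finds its `k` nearest minority neighbors by Euclidean distance, and interpolates linearly toward one of them. The function name `smote_numeric`, the parameter defaults, and the toy points are assumptions.

```python
import math
import random

def smote_numeric(minority, n_new, k=2, seed=None):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest other minority samples.
        neighbor_ids = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        neighbor = minority[rng.choice(neighbor_ids)]
        gap = rng.random()  # position on the line segment, in [0, 1)
        synthetic.append(tuple(b + gap * (q - b) for b, q in zip(base, neighbor)))
    return synthetic

# Toy minority-class samples with two numeric columns (invented).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_numeric(minority, n_new=5, k=2, seed=42)
```

Every new point lies on a line segment between two existing minority samples, so the upsampled class stays within the region the original minority samples occupy.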
Statistical parity
Statistical parity is one possible definition of fairness in ML, which requires that decisions are made fairly without discrimination. The goal is to ensure that each sensitive group has the same probability of being included in the positive predicted class. For example, women and men are equally likely to be promoted at […]
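One common way to check this definition is to compare the positive prediction rate per sensitive group. The sketch below uses invented predictions and group labels; the helper name `positive_rate` is hypothetical.

```python
def positive_rate(predictions, groups, group):
    """Share of positive predictions (1) among members of one group."""
    selected = [p for p, g in zip(predictions, groups) if g == group]
    return sum(selected) / len(selected)

# Invented example: 1 = predicted "promoted", 0 = predicted "not promoted".
predictions = [1, 0, 1, 1, 0, 1, 0, 0]
groups = ["f", "f", "f", "f", "m", "m", "m", "m"]

rate_f = positive_rate(predictions, groups, "f")
rate_m = positive_rate(predictions, groups, "m")

# Statistical parity holds when this difference is (close to) zero.
parity_difference = rate_f - rate_m
```

In this toy data the two rates differ, so statistical parity is not satisfied.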
Statistically significant
Statistically significant is a term used in hypothesis testing. When you test some null hypothesis, such as whether sample S1 and sample S2 have the same median, you must consider not only the observed medians but also the variance present in the samples and construct a confidence interval that helps decide whether you can reject […]
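One possible way to build such a confidence interval is a percentile bootstrap on the difference of medians, sketched below. This is an illustrative choice, not necessarily the procedure the full entry describes; the function name, the number of resamples, and the toy samples are assumptions.

```python
import random
from statistics import median

def bootstrap_median_diff_ci(s1, s2, n_resamples=2000, alpha=0.05, seed=None):
    """Percentile bootstrap confidence interval for median(s1) - median(s2)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each sample with replacement to reflect its variance.
        r1 = rng.choices(s1, k=len(s1))
        r2 = rng.choices(s2, k=len(s2))
        diffs.append(median(r1) - median(r2))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Invented samples S1 and S2.
s1 = [12, 14, 15, 15, 16, 17, 18, 19]
s2 = [3, 4, 5, 5, 6, 7, 8, 9]

lower, upper = bootstrap_median_diff_ci(s1, s2, seed=1)

# If 0 lies outside the interval, the difference in medians is treated as
# statistically significant at the chosen level.
significant = not (lower <= 0 <= upper)
```

Because the two toy samples do not overlap at all, the interval lies entirely above zero and the null hypothesis of equal medians is rejected.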
Stochastic
Stochastic (random or probabilistic) is the property of having a random probability distribution or pattern that can be analyzed statistically. In synthetic data generation, the original data has a probability distribution, so all statistical (i.e., non-deterministic) relationships between any number of attributes can be learned using stochastic modeling. Such modeling cannot ensure […]
Structured data
Structured data is data that follows a predefined format and is easily accessible to a human or computer. An example of structured data is tabular data stored in rows and columns. CSV files and Parquet files are typical formats for structured data. Structured data is typically stored in relational databases.
Synthetic data
Synthetic data is generated by deep learning algorithms trained on real data samples. Synthetic data is used as a proxy for real data in a wide variety of use cases, from data anonymization and AI and machine learning development to data sharing and data democratization. There are different synthetic data generators with different capabilities and synthetic data […]