Real datasets come with inconvenient constraints. Privacy regulations may prevent sharing customer records across teams or with vendors. Rare events — fraud transactions, equipment failures, disease diagnoses — appear too infrequently to train robust models. Class imbalance distorts classifiers toward the majority class. And in early development, the data we need may not exist yet.
Synthetic data addresses all of these. Depending on the method, we can generate records that preserve the statistical relationships of a real dataset without exposing individual records, oversample minority classes while respecting feature correlations, or create datasets with arbitrary properties to test a pipeline before production data arrives.
We cover simple statistical generation, controlled synthetic datasets, class imbalance oversampling with SMOTE, preserving joint distributions with the Synthetic Data Vault, generative model approaches, basic privacy concepts, and how to evaluate whether synthetic data is actually useful.
Code
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.preprocessing import StandardScaler

sns.set_style("whitegrid")
np.random.seed(42)
print("Libraries loaded.")
15.1 Generating from Distributions
The simplest form of synthetic data is sampling from parametric distributions. We define the marginal distribution of each feature, generate independently, and optionally impose correlation structure using a multivariate normal or a copula.
This approach works well when we know the rough shape of each variable — which we can read from the real data summary statistics — and when the correlations between variables are modest. It breaks down when variables have complex, non-linear dependencies or when the joint distribution has multimodal structure that a normal copula cannot capture.
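To make the copula idea concrete, the sketch below pushes correlated standard normals through the normal CDF to obtain dependent uniforms, then maps each uniform onto a non-normal marginal via its inverse CDF. It uses only numpy and scipy, which are imported above; the marginal choices and parameters are illustrative.

Code

# A minimal sketch of a Gaussian (normal) copula with non-normal marginals.
# Step 1: correlated standard normals. Step 2: normal CDF -> dependent uniforms.
# Step 3: each uniform through a target marginal's inverse CDF (PPF).
corr_pair = np.array([[1.0, 0.6],
                      [0.6, 1.0]])
Zc = np.random.randn(5000, 2) @ np.linalg.cholesky(corr_pair).T
U = stats.norm.cdf(Zc)                                    # dependent uniforms
income = stats.lognorm.ppf(U[:, 0], s=0.5, scale=50000)   # skewed continuous marginal
visits = stats.poisson.ppf(U[:, 1], mu=4)                 # discrete marginal
rho, _ = stats.spearmanr(income, visits)
print(f"Spearman rank correlation of the copula sample: {rho:.2f}")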
Code
# Simulate a customer transaction dataset from scratch
n = 2000

# Correlated features via Cholesky decomposition
corr = np.array([
    [1.0,  0.55, -0.3],   # age
    [0.55, 1.0,  -0.2],   # income
    [-0.3, -0.2,  1.0],   # num_complaints
])
L = np.linalg.cholesky(corr)
Z = np.random.randn(n, 3) @ L.T

df = pd.DataFrame({
    "age": np.clip(30 + 12 * Z[:, 0], 18, 75).astype(int),
    "annual_income": np.clip(60000 + 25000 * Z[:, 1], 20000, 200000).astype(int),
    "num_complaints": np.clip(np.round(0.5 + 1.2 * Z[:, 2]), 0, 8).astype(int),
    "tenure_years": np.random.exponential(3.5, n).clip(0, 20).round(1),
    "channel": np.random.choice(["web", "mobile", "branch"], n, p=[0.5, 0.35, 0.15]),
})

# Churn depends (weakly) on complaints and tenure through a logistic model
churn_logit = -3 + 0.03 * df["num_complaints"] - 0.01 * df["tenure_years"]
df["churned"] = (np.random.rand(n) < 1 / (1 + np.exp(-churn_logit))).astype(int)

print(df.head())
print()
print(df.describe().T.round(2))
print()
print(f"Churn rate: {df.churned.mean():.1%}")
print(f"Age-Income correlation: {df['age'].corr(df['annual_income']):.3f} (target: 0.55)")
15.2 Controlled Synthetic Datasets with scikit-learn
When the goal is to test a modeling pipeline rather than to mimic a specific real dataset, scikit-learn’s make_classification and make_regression offer precise control over the properties of the data: number of informative features, class overlap, cluster structure, noise level, and class weight.
These functions are particularly useful for understanding how an algorithm behaves under different conditions — how performance degrades with more noise, how many features are needed for good accuracy, what happens with extreme class imbalance — before committing to a real dataset where those factors are confounded with each other.
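make_regression is imported in the setup cell but not otherwise exercised, so here is a brief sketch of the same idea on the regression side: hold the data-generating process fixed and sweep only the noise level. Parameter values are illustrative.

Code

# Sketch: how held-out R^2 degrades as make_regression's noise parameter grows
from sklearn.linear_model import LinearRegression

for noise in [0.0, 10.0, 50.0]:
    Xr, yr = make_regression(n_samples=500, n_features=5, n_informative=3,
                             noise=noise, random_state=0)
    Xtr, Xte, ytr, yte = train_test_split(Xr, yr, test_size=0.25, random_state=0)
    r2 = LinearRegression().fit(Xtr, ytr).score(Xte, yte)
    print(f"noise={noise:>5.1f}  held-out R^2: {r2:.3f}")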
Code
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Vary class imbalance and measure the effect on AUC
weights = [(0.5, 0.5), (0.8, 0.2), (0.9, 0.1), (0.95, 0.05), (0.99, 0.01)]
results = []
for w in weights:
    X, y = make_classification(
        n_samples=2000, n_features=10, n_informative=5, n_redundant=2,
        weights=list(w), flip_y=0.01, random_state=42,
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=42  # fixed split for reproducibility
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    minority_pct = f"{w[1]:.0%}"
    results.append({
        "minority_%": minority_pct,
        "AUC": round(auc, 3),
        "minority_n_train": int(y_train.sum()),
    })

print(pd.DataFrame(results).to_string(index=False))
15.3 Handling Class Imbalance with SMOTE
When the minority class is rare, a classifier can achieve high accuracy simply by always predicting the majority class. SMOTE (Synthetic Minority Over-sampling Technique) addresses this by generating synthetic minority-class examples along line segments connecting existing minority examples in feature space.
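The interpolation step itself takes only a few lines, sketched below as an illustration of the mechanism rather than imbalanced-learn's implementation; the toy minority cluster and the choice of 5 neighbors are assumptions.

Code

# Sketch of SMOTE's core step: place a new point at a random position on
# the segment between a minority sample and one of its k nearest neighbors.
from sklearn.neighbors import NearestNeighbors

X_min = np.random.randn(50, 2) + 3                # toy minority-class points
nn = NearestNeighbors(n_neighbors=6).fit(X_min)   # 6 = the point itself + 5 neighbors
_, idx = nn.kneighbors(X_min)
i = 0                                             # pick one minority point
j = np.random.choice(idx[i][1:])                  # one of its 5 nearest neighbors
lam = np.random.rand()                            # random position along the segment
x_new = X_min[i] + lam * (X_min[j] - X_min[i])
print("new synthetic minority sample:", x_new.round(3))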
Unlike simple random oversampling (which just duplicates existing records), SMOTE creates new interpolated points, which tends to produce smoother decision boundaries and better generalization. Several variants exist:
SMOTE: baseline — interpolate between a minority point and one of its k-nearest minority neighbors
BorderlineSMOTE: focuses on minority examples near the class boundary, where the model is most uncertain
ADASYN: weights generation by difficulty — more synthetic samples near regions where the model struggles
Install: pip install imbalanced-learn
An important caution: SMOTE should be applied only to the training set, after the train/test split. Applying it before splitting causes data leakage and inflates test performance.
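The variants share imbalanced-learn's fit_resample interface, so swapping them is a one-line change. A minimal sketch (class names as in imbalanced-learn; the dataset is a throwaway illustration, and per the caution above a real pipeline would resample only the training split):

Code

from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN

# Throwaway imbalanced dataset, used here only to show the shared interface
Xi, yi = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
for sampler in (SMOTE(random_state=0), BorderlineSMOTE(random_state=0), ADASYN(random_state=0)):
    Xr_, yr_ = sampler.fit_resample(Xi, yi)
    print(f"{type(sampler).__name__:<16} minority count: {int(yi.sum())} -> {int(yr_.sum())}")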
Code
# Baseline imbalanced vs SMOTE-resampled comparison
try:
    from imblearn.over_sampling import SMOTE
    smote_available = True
except ImportError:
    smote_available = False
    print("pip install imbalanced-learn to run this cell")

if smote_available:
    X, y = make_classification(
        n_samples=3000, n_features=10, n_informative=5,
        weights=[0.92, 0.08], flip_y=0.01, random_state=42,
    )
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0
    )

    # Baseline: train on imbalanced data
    clf_base = RandomForestClassifier(n_estimators=100, random_state=42)
    clf_base.fit(X_tr, y_tr)
    auc_base = roc_auc_score(y_te, clf_base.predict_proba(X_te)[:, 1])

    # SMOTE: resample training set only
    sm = SMOTE(random_state=42)
    X_res, y_res = sm.fit_resample(X_tr, y_tr)
    clf_sm = RandomForestClassifier(n_estimators=100, random_state=42)
    clf_sm.fit(X_res, y_res)
    auc_sm = roc_auc_score(y_te, clf_sm.predict_proba(X_te)[:, 1])

    print(f"Original minority class: {y_tr.sum()} / {len(y_tr)} ({y_tr.mean():.1%})")
    print(f"After SMOTE:             {y_res.sum()} / {len(y_res)} ({y_res.mean():.1%})")
    print()
    print(f"AUC — baseline:   {auc_base:.4f}")
    print(f"AUC — with SMOTE: {auc_sm:.4f}")
    print()
    print("Classification report (SMOTE model):")
    print(classification_report(y_te, clf_sm.predict(X_te)))
15.4 Preserving Statistical Structure with SDV
Simple distribution sampling and SMOTE both ignore the joint distribution of the full dataset — the correlations, conditional distributions, and categorical-continuous relationships that make data realistic. The Synthetic Data Vault (SDV) library models the full joint distribution and generates records that preserve these relationships.
The GaussianCopulaSynthesizer fits a copula to model dependencies between variables, transforming each marginal to a standard normal and modeling the correlation structure of the resulting normal vectors. It handles mixed datatypes (continuous, categorical, datetime) and can enforce constraints (e.g., age > 0, start_date < end_date).
For more complex distributions, the CTGANSynthesizer uses a conditional GAN architecture specifically designed for tabular data, and TVAESynthesizer uses a variational autoencoder. Both handle multimodal distributions and complex interactions that the Gaussian copula misses.
Install: pip install sdv
Code
try:
    from sdv.single_table import GaussianCopulaSynthesizer
    from sdv.metadata import SingleTableMetadata
    sdv_available = True
except ImportError:
    sdv_available = False
    print("pip install sdv to run this cell")

if sdv_available:
    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(df)
    synth = GaussianCopulaSynthesizer(metadata)
    synth.fit(df)
    synthetic_df = synth.sample(num_rows=2000)

    print("Real data stats:")
    print(df[["age", "annual_income", "num_complaints"]].describe().T[["mean", "std"]].round(1))
    print()
    print("Synthetic data stats:")
    print(synthetic_df[["age", "annual_income", "num_complaints"]].describe().T[["mean", "std"]].round(1))
    print()
    print("Real correlation (age, income):      ", df["age"].corr(df["annual_income"]).round(3))
    print("Synthetic correlation (age, income): ", synthetic_df["age"].corr(synthetic_df["annual_income"]).round(3))
else:
    # Show what the output looks like
    print("SDV preserves marginal distributions and pairwise correlations.")
    print("After fitting, synth.sample(num_rows=N) returns a DataFrame")
    print("with the same schema and statistical structure as the original.")
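The constraints and GAN-based synthesizers mentioned above follow the same fit/sample pattern. A hedged sketch, reusing the metadata and df from the cell above and assuming the predefined-constraint API of recent SDV 1.x releases (verify names against the installed version's documentation):

Code

from sdv.single_table import CTGANSynthesizer

# Constraints are declared by name and attached before fitting (SDV 1.x style)
constrained = GaussianCopulaSynthesizer(metadata)
constrained.add_constraints(constraints=[
    {"constraint_class": "Positive",
     "constraint_parameters": {"column_name": "tenure_years", "strict_boundaries": False}},
])
constrained.fit(df)

# CTGAN: slower to fit, but captures multimodal and non-linear structure
ctgan = CTGANSynthesizer(metadata, epochs=100)  # epochs value is illustrative
ctgan.fit(df)
print(ctgan.sample(num_rows=5))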
15.5 Privacy Considerations
Synthetic data is not automatically private. A model trained on sensitive data can memorize individual records, allowing an adversary to reconstruct them from synthetic samples. This is especially true of generative models trained on small datasets or to very high fidelity.
Differential privacy (DP) provides a formal guarantee: the mechanism's output distribution changes only negligibly when any single real record is added, removed, or changed, so an adversary observing the synthetic data cannot confidently infer whether any individual was in the training set. The privacy parameter \(\varepsilon\) controls the tradeoff: smaller \(\varepsilon\) means stronger privacy but lower data quality.
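Formally, a randomized mechanism \(M\) satisfies \(\varepsilon\)-differential privacy if, for all pairs of datasets \(D\) and \(D'\) differing in a single record and for every set of outputs \(S\),

\[ \Pr[M(D) \in S] \le e^{\varepsilon} \, \Pr[M(D') \in S]. \]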
In practice, one common route to DP synthetic data is adding calibrated noise to the sufficient statistics of the generative model before sampling. IBM's diffprivlib library provides DP-aware ML algorithms, and the smartnoise-sdk package provides DP synthetic data generation.
Even without formal DP, several practical measures reduce re-identification risk: capping extreme values, adding small amounts of noise, suppressing rare combinations of quasi-identifiers, and not generating synthetic records for groups smaller than k individuals (k-anonymity).
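As a concrete instance of the last measure, the sketch below suppresses rare quasi-identifier combinations in the df from section 15.1; the choice of quasi-identifier columns and of k is illustrative.

Code

# k-anonymity-style suppression: drop rows whose quasi-identifier
# combination occurs fewer than k times in the dataset.
k = 10
quasi = ["channel", "num_complaints"]  # assumed quasi-identifiers
group_sizes = df.groupby(quasi)[quasi[0]].transform("size")
df_released = df[group_sizes >= k].copy()
print(f"Rows kept after suppression: {len(df_released)} / {len(df)}")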
Code
# Demonstrate calibrated noise on a released statistic
# (A toy illustration: we draw Gaussian noise with sigma = sensitivity / epsilon.
#  The formal Gaussian mechanism calibrates sigma using an additional delta
#  parameter; real DP libraries also handle sensitivity and composition.)
true_mean_income = df["annual_income"].mean()
n_records = len(df)

# Sensitivity of the mean: the max change one record can cause.
# For bounded data [lo, hi], sensitivity = (hi - lo) / n
lo, hi = 20000, 200000
sensitivity = (hi - lo) / n_records

epsilon_values = [0.01, 0.1, 1.0, 10.0]
print(f"True mean income: ${true_mean_income:,.0f}")
print(f"Sensitivity: ${sensitivity:.2f}")
print()
print("{:>8}{:>16}{:>18}{:>12}".format("epsilon", "DP noise sigma", "DP mean estimate", "abs error"))
for eps in epsilon_values:
    sigma = sensitivity / eps
    dp_mean = true_mean_income + np.random.normal(0, sigma)
    err = abs(dp_mean - true_mean_income)
    print(f"{eps:>8.2f}{sigma:>16,.1f} ${dp_mean:>16,.0f} ${err:>10,.0f}")
print()
print("Smaller epsilon = more privacy = more noise = less accurate estimate.")
15.6 Evaluating Synthetic Data
Two properties matter: fidelity (does the synthetic data look like the real data statistically?) and utility (is a model trained on synthetic data useful for predicting on real data?).
Fidelity is measured by comparing marginal and joint distributions — KS tests per feature, chi-square tests for categoricals, correlation matrix comparison. SDV’s evaluate_quality function automates this.
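A hedged sketch of that report, reusing df, synthetic_df, and metadata from section 15.4 and assuming the sdv.evaluation module layout of SDV 1.x (check the installed version's documentation):

Code

from sdv.evaluation.single_table import evaluate_quality

report = evaluate_quality(real_data=df, synthetic_data=synthetic_df, metadata=metadata)
print(f"Overall quality score: {report.get_score():.3f}")  # 0 to 1, higher is better
print(report.get_details(property_name="Column Shapes").head())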
Utility is measured by the Train on Synthetic, Test on Real (TSTR) protocol: train a downstream model on synthetic data, evaluate it on a held-out real test set, and compare its performance to a model trained on real data. A smaller gap between the two indicates better synthetic data utility.
Code
# Train on Synthetic, Test on Real (TSTR) evaluation
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

feature_cols = ["age", "annual_income", "num_complaints", "tenure_years"]
target_col = "churned"
X_real = df[feature_cols].values
y_real = df[target_col].values
X_tr_r, X_te_r, y_tr_r, y_te_r = train_test_split(
    X_real, y_real, test_size=0.25, random_state=0
)

# Model trained on real data (upper bound)
scaler_real = StandardScaler()
lr_real = LogisticRegression(max_iter=1000)
lr_real.fit(scaler_real.fit_transform(X_tr_r), y_tr_r)
auc_real = roc_auc_score(y_te_r, lr_real.predict_proba(scaler_real.transform(X_te_r))[:, 1])

# Model trained on our distribution-sampled synthetic data.
# Note: the labels below are drawn at random, independent of the features,
# so this synthetic training set carries no predictive signal; it illustrates
# the protocol and the low end of the utility scale.
X_syn = df[feature_cols].sample(1000, replace=True, random_state=7).values
X_syn = X_syn + np.random.normal(0, X_syn.std(axis=0) * 0.05, X_syn.shape)  # add small noise
y_syn = (np.random.rand(1000) < 0.08).astype(int)  # synthetic labels with approx churn rate

scaler_syn = StandardScaler()  # fit scaling on synthetic data, as a real TSTR pipeline would
lr_syn = LogisticRegression(max_iter=1000)
lr_syn.fit(scaler_syn.fit_transform(X_syn), y_syn)
auc_syn = roc_auc_score(y_te_r, lr_syn.predict_proba(scaler_syn.transform(X_te_r))[:, 1])

print("TSTR Evaluation:")
print(f"  AUC (trained on real data):      {auc_real:.4f}  <-- upper bound")
print(f"  AUC (trained on synthetic data): {auc_syn:.4f}")
print()
print("The closer the synthetic AUC to the real-data AUC, the more useful")
print("the synthetic data is as a substitute for the real thing.")

# Feature KS tests for fidelity
print("Kolmogorov-Smirnov fidelity tests (p > 0.05: no detectable difference):")
for col in ["age", "annual_income", "num_complaints", "tenure_years"]:
    real_vals = df[col].values
    # Use a noisy bootstrap resample as our proxy for synthetic data
    syn_vals = df[col].sample(1000, replace=True).values + np.random.normal(0, df[col].std() * 0.05, 1000)
    ks, p = stats.ks_2samp(real_vals, syn_vals)
    print(f"  {col:<20} KS={ks:.3f}  p={p:.3f}")
15.7 Key Takeaways
Synthetic data serves four purposes: privacy preservation, handling rare events, correcting class imbalance, and early-stage development
Sampling from distributions is quick and controllable but ignores joint structure; use it when marginal distributions are all that matter
Apply SMOTE only to the training set, after the train/test split; never before
SDV’s GaussianCopulaSynthesizer preserves correlations and mixed datatypes; CTGAN handles multimodal and complex joint distributions
Formal differential privacy provides mathematical guarantees but at a fidelity cost; practical measures (noise, suppression) reduce risk without formal guarantees
Evaluate synthetic data on both fidelity (KS/chi-square) and utility (TSTR AUC gap)