
Generating and validating synthetic clinical data

Motivation

Sharing individual-level clinical data across institutions is often restricted by privacy regulations and institutional review boards. Synthetic data aims to preserve the statistical properties of real data while reducing re-identification risk, enabling multi-site collaboration without transferring patient-level records.

Example: synthesizing patient records

library(syntheticdata)
set.seed(42)
real <- data.frame(
  age     = rnorm(500, mean = 65, sd = 12),
  sbp     = rnorm(500, mean = 135, sd = 22),
  sex     = sample(c("Male", "Female"), 500, replace = TRUE),
  smoking = sample(c("Never", "Former", "Current"), 500,
                   replace = TRUE, prob = c(0.4, 0.35, 0.25)),
  outcome = rbinom(500, 1, 0.28)
)
head(real)
#>        age      sbp    sex smoking outcome
#> 1 81.45150 157.6411   Male   Never       0
#> 2 58.22362 155.1250 Female Current       1
#> 3 69.35754 134.9460   Male   Never       0
#> 4 72.59435 137.9922   Male  Former       0
#> 5 69.85122 119.1566 Female Current       0
#> 6 63.72651 130.6413 Female  Former       0

Parametric synthesis (Gaussian copula)

The default method estimates marginal distributions empirically and captures the joint dependence structure via a Gaussian copula on normal scores. This preserves both marginal shapes and pairwise correlations.

syn <- synthesize(real, method = "parametric", n = 500, seed = 1)
syn
#> 
#> ── Synthetic data (parametric)
#> 500 real -> 500 synthetic records
#> Variables: age, sbp, sex, smoking, outcome
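The copula step can be sketched in a few lines of base R for the numeric columns. This is an illustrative sketch of the normal-scores idea only, not the package's actual internals; the `copula_synth` helper and the two-column toy data are assumptions for the example:

```r
# Illustrative Gaussian-copula synthesis for numeric columns only.
# Not the internals of synthesize(); a sketch of the normal-scores idea.
copula_synth <- function(df, n) {
  # 1. Map each column to normal scores via its empirical CDF
  z <- sapply(df, function(x) qnorm(rank(x) / (length(x) + 1)))
  # 2. Estimate the dependence structure on the normal scale
  L <- chol(cor(z))
  # 3. Draw n correlated standard-normal vectors
  z_new <- matrix(rnorm(n * ncol(df)), n) %*% L
  # 4. Map back through each column's empirical quantile function
  out <- as.data.frame(mapply(
    function(zi, x) quantile(x, pnorm(zi), type = 8, names = FALSE),
    as.data.frame(z_new), df))
  names(out) <- names(df)
  out
}

set.seed(1)
num     <- data.frame(age = rnorm(500, 65, 12), sbp = rnorm(500, 135, 22))
syn_num <- copula_synth(num, 500)
# Marginal means and the pairwise correlation stay close to the originals
c(mean(num$age), mean(syn_num$age))
```

Because the marginals are mapped back through empirical quantile functions, marginal shapes are preserved by construction; the copula correlation carries the dependence.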

Validation

validate_synthetic() computes four metrics covering marginal fidelity, correlation fidelity, real-vs-synthetic distinguishability, and privacy:

val <- validate_synthetic(syn)
val
#> 
#> ── Synthetic data validation
#> ks_statistic_mean: 0.0247 (Good fidelity)
#> correlation_diff: 0.0141 (Excellent)
#> discriminative_auc: 0.5117 (Indistinguishable)
#> nn_distance_ratio: 0.8595 (Moderate risk)
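For intuition, the first metric can be approximated directly in base R as the mean two-sample Kolmogorov-Smirnov statistic over numeric columns. This is a sketch of the idea, not the package's exact computation, and `mean_ks` is a name invented here:

```r
# Mean two-sample KS statistic over the numeric columns
# (0 = identical empirical distributions, 1 = fully separated).
mean_ks <- function(real, synth) {
  num_cols <- names(real)[sapply(real, is.numeric)]
  mean(sapply(num_cols, function(v)
    unname(ks.test(real[[v]], synth[[v]])$statistic)))
}

set.seed(1)
a <- data.frame(x = rnorm(300), y = rnorm(300))
mean_ks(a, a + 0.1)   # a small shift gives a small mean KS statistic
```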

Comparing methods

compare_methods() runs all three synthesis methods (parametric, bootstrap, and noise) on the same data and returns a single comparison table:

comp <- compare_methods(real, seed = 1)
comp
#> 
#> ── Synthesis method comparison
#> # A tibble: 12 × 4
#>    method     metric              value interpretation   
#>  * <chr>      <chr>               <dbl> <chr>            
#>  1 parametric ks_statistic_mean  0.0247 Good fidelity    
#>  2 parametric correlation_diff   0.0141 Excellent        
#>  3 parametric discriminative_auc 0.512  Indistinguishable
#>  4 parametric nn_distance_ratio  0.988  Moderate risk    
#>  5 bootstrap  ks_statistic_mean  0.142  Acceptable       
#>  6 bootstrap  correlation_diff   0.0181 Excellent        
#>  7 bootstrap  discriminative_auc 0.505  Indistinguishable
#>  8 bootstrap  nn_distance_ratio  1.09   Good privacy     
#>  9 noise      ks_statistic_mean  0.135  Acceptable       
#> 10 noise      correlation_diff   0.0162 Excellent        
#> 11 noise      discriminative_auc 0.501  Indistinguishable
#> 12 noise      nn_distance_ratio  1.36   Good privacy

Privacy risk assessment

privacy_risk() provides a deeper privacy audit with three metrics: nearest-neighbor distance ratio, membership inference accuracy, and (optionally) attribute disclosure risk for sensitive columns.

pr <- privacy_risk(syn, sensitive_cols = "age")
pr
#> 
#> ── Privacy risk assessment
#> [!] nn_distance_ratio: 1.0682 (Medium)
#> [OK] membership_inference_acc: 0.497 (Low)
#> [OK] attribute_disclosure_age: 2e-04 (Low)
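The nearest-neighbor distance ratio can be sketched directly. One common construction (assumed here; the definition inside privacy_risk() may differ) divides the mean distance from each synthetic record to its closest real record by the mean real-to-real nearest-neighbor distance, on standardized numeric columns. Ratios well below 1 flag synthetic points that sit suspiciously close to real ones:

```r
# Nearest-neighbor distance ratio on standardized numeric columns.
# Assumed construction for illustration; not necessarily privacy_risk()'s.
nn_ratio <- function(real, synth) {
  num <- sapply(real, is.numeric)
  R <- scale(real[num])
  S <- scale(synth[num], center = attr(R, "scaled:center"),
                         scale  = attr(R, "scaled:scale"))
  # distance from each synthetic row to its nearest real row
  d_sr <- apply(S, 1, function(s) min(sqrt(colSums((t(R) - s)^2))))
  # real-to-real nearest-neighbor distance (self excluded)
  D <- as.matrix(dist(R)); diag(D) <- Inf
  mean(d_sr) / mean(apply(D, 1, min))
}

set.seed(1)
r_df <- data.frame(a = rnorm(200), b = rnorm(200))
nn_ratio(r_df, r_df)  # exact copies give ratio 0 (worst case for privacy)
```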

Downstream model fidelity

model_fidelity() trains a predictive model on synthetic data and evaluates it on real data. The real-data baseline uses in-sample evaluation as an upper bound.

mf <- model_fidelity(syn, outcome = "outcome")
mf
#> # A tibble: 2 × 3
#>   train_data metric value
#>   <chr>      <chr>  <dbl>
#> 1 real       auc    0.523
#> 2 synthetic  auc    0.502

A synthetic-trained model with AUC close to the real-trained baseline indicates that the synthetic data preserves the predictive signal.
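The train-on-synthetic, evaluate-on-real loop is easy to reproduce with a logistic model. Everything below is a self-contained sketch: `make_df`, the stand-in real/synthetic frames, and the rank-based `auc` helper are assumptions for illustration; in practice `real_df` is your data and `syn_df` holds the synthetic records extracted from the synthesize() result (how to extract them depends on the package API):

```r
# Rank-based (Mann-Whitney) AUC: the probability that a random positive
# is scored above a random negative.
auc <- function(y, p) {
  r  <- rank(p)
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Stand-in generator so the example runs on its own.
make_df <- function(n) {
  age <- rnorm(n, 65, 12); sbp <- rnorm(n, 135, 22)
  data.frame(age, sbp,
             outcome = rbinom(n, 1, plogis(-3 + 0.02 * age + 0.01 * sbp)))
}
set.seed(1)
real_df <- make_df(500)
syn_df  <- make_df(500)

# Train on synthetic, evaluate on real
fit <- glm(outcome ~ age + sbp, data = syn_df, family = binomial)
auc(real_df$outcome, predict(fit, newdata = real_df, type = "response"))
```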

Privacy-utility trade-off

Increasing noise_level improves privacy (a larger nearest-neighbor distance ratio) at the cost of marginal fidelity (a larger mean KS statistic):

results <- lapply(c(0.05, 0.1, 0.2, 0.5), function(nl) {
  s <- synthesize(real, method = "noise", noise_level = nl, seed = 1)
  v <- validate_synthetic(s)
  data.frame(noise_level = nl,
             ks      = v$value[v$metric == "ks_statistic_mean"],
             privacy = v$value[v$metric == "nn_distance_ratio"])
})
do.call(rbind, results)
#>   noise_level        ks   privacy
#> 1        0.05 0.1373333 0.7011986
#> 2        0.10 0.1346667 1.3123723
#> 3        0.20 0.1393333 1.9435324
#> 4        0.50 0.1673333 4.6841489
