DataFusionGDM: Getting Started

DataFusionGDM Team

Overview

DataFusionGDM provides tools to simulate genetic distance matrices (GDMs), to compare and align distance spaces via multidimensional scaling (MDS) and Procrustes analysis, and to evaluate imputation under structured missingness (BESMI).

Installation

# Install from GitHub
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("jiashuaiz/DataFusion-GDM")

Simulation and visualization

library(DataFusionGDM)
res <- run_genetic_scenario("default", n_pops = 30, seed = 2025)
#> 
#>  Raw distances before transformation :
#> Dimensions: 30 30 
#> Is symmetric: TRUE 
#> Range of distances: 0 30.34014 
#> Mean distance: 15.88792 
#> Diagonal values - min: 0 max: 0 
#> Quantiles: 0%: 0 25%: 7.938344 50%: 16.54102 75%: 23.1751 100%: 30.34014 
#> 
#>  Final genetic distances after transformation :
#> Dimensions: 30 30 
#> Is symmetric: TRUE 
#> Range of distances: 0 1 
#> Mean distance: 0.7895569 
#> Diagonal values - min: 0 max: 0 
#> Quantiles: 0%: 0 25%: 0.7268217 50%: 0.8404666 75%: 0.9433305 100%: 1
# Display MDS plot (heatmap requires ComplexHeatmap, not shown here)
res$plots$mds()
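
For readers who want to see what an MDS embedding of a distance matrix involves, the following is a minimal base-R sketch using stats::cmdscale. It is not necessarily how res$plots$mds() is implemented. The toy matrix D, the points pts, and the check max_err are hypothetical names introduced only for this illustration.

```r
# Toy stand-in for a genetic distance matrix: Euclidean distances
# between 10 random points in the plane (hypothetical data).
set.seed(1)
pts <- matrix(rnorm(20), ncol = 2)
D <- as.matrix(dist(pts))

# Classical (metric) MDS down to 2 dimensions.
coords <- cmdscale(as.dist(D), k = 2)

# Because the toy data really are two-dimensional, the distances between
# the embedded points should reproduce D up to numerical error.
D_hat <- as.matrix(dist(coords))
max_err <- max(abs(D - D_hat))
```

With a real GDM the embedding is only an approximation, so max_err would generally be larger than rounding error; plot(coords) gives a quick scatter of the two MDS axes.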

MDS + Procrustes

# create two related matrices from the simulated GDM
G <- res$results$distance_matrix
A <- G + matrix(rnorm(length(G), 0, 0.02), nrow = nrow(G)); diag(A) <- 0
B <- G + matrix(rnorm(length(G), 0.03, 0.02), nrow = nrow(G)); diag(B) <- 0

mds <- perform_mds(A, B)
Yt <- apply_procrustes(mds$X, mds$Y, mds$Y)
B_cal <- coords_to_distances(Yt)
mean((A - B)^2)
#> [1] 0.001681216
mean((A - B_cal)^2)
#> [1] 0.01596116
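
Two points about the chunk above may be worth illustrating. First, adding an unsymmetrized noise matrix to G generally yields matrices that are no longer symmetric; averaging the noise with its transpose is one way to keep symmetry. Second, a Procrustes fit can be written in a few lines with svd. The sketch below is a base-R illustration under those assumptions, not the package's implementation; add_sym_noise and procrustes_align are hypothetical helper names, and their behaviour may differ from perform_mds and apply_procrustes.

```r
# Add symmetric Gaussian noise to a distance matrix (hypothetical helper).
# Averaging E with t(E) keeps symmetry; note that it also shrinks the
# off-diagonal noise standard deviation by a factor of sqrt(2).
add_sym_noise <- function(G, mean = 0, sd = 0.02) {
  E <- matrix(rnorm(length(G), mean, sd), nrow = nrow(G))
  A <- G + (E + t(E)) / 2
  diag(A) <- 0
  A
}

# Align coordinates Y to reference coordinates X by translation,
# uniform scaling, and an orthogonal transform (hypothetical helper).
procrustes_align <- function(X, Y) {
  mx <- colMeans(X)
  my <- colMeans(Y)
  Xc <- sweep(X, 2, mx)              # centre both configurations
  Yc <- sweep(Y, 2, my)
  sv <- svd(t(Yc) %*% Xc)            # Yc' Xc = U S V'
  R <- sv$u %*% t(sv$v)              # orthogonal map (may include a reflection)
  s <- sum(sv$d) / sum(Yc^2)         # least-squares scale factor
  sweep(s * Yc %*% R, 2, mx, "+")    # move onto the reference centroid
}

# Check on toy data: Y is a rotated, scaled, shifted copy of X.
set.seed(42)
X <- matrix(rnorm(12), ncol = 2)
theta <- pi / 5
Rot <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2)
Y <- sweep(1.7 * X %*% Rot, 2, c(3, -1), "+")
Y_aligned <- procrustes_align(X, Y)
align_err <- max(abs(Y_aligned - X))

# Symmetric noisy copy of a toy distance matrix.
G_toy <- as.matrix(dist(X))
A_toy <- add_sym_noise(G_toy)
```

In this noise-free toy case the aligned coordinates match X up to rounding. With noisy inputs the fit is only a least-squares compromise, so it is worth comparing the error before and after calibration, as the chunk above does.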

BESMI (single dataset)

# Prepare a masked dataset in-memory
mask <- matrix(FALSE, nrow = nrow(G), ncol = ncol(G))
sel <- seq_len(min(5, nrow(G)))
mask[sel, sel] <- TRUE
M_input <- G; M_input[mask] <- NA

# Impute
impt <- besmi_iterative_imputation(M_input, M_mask = mask, M_real = G, max_iterations = 3)
#> Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
#> fold
#> Warning: Number of logged events: 10
#> Warning: Number of logged events: 15
#> Warning: Number of logged events: 20
# (The cv.glmnet warning shown above repeats many times in the full output; duplicates are omitted.)
str(impt$metrics)
#> 'data.frame':    3 obs. of  9 variables:
#>  $ k              : logi  NA NA NA
#>  $ bs             : logi  NA NA NA
#>  $ iteration      : num  1 2 3
#>  $ imputation_dis : num  0.316 0.333 0.317
#>  $ propagation_dis: num  0.0668 0.0619 0.0598
#>  $ runtime        : num  NA NA NA
#>  $ improvement_pct: num  NA NA NA
#>  $ converged      : logi  FALSE FALSE FALSE
#>  $ averaged       : logi  FALSE TRUE TRUE
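
To make the idea of scoring an imputation on the masked block concrete, here is a small self-contained sketch. It uses a naive mean fill as a stand-in for an imputer and an RMSE restricted to the masked off-diagonal cells. This is an assumption-laden illustration: the source does not define how imputation_dis or propagation_dis are computed, and G_toy, M_naive, and rmse_masked are hypothetical names.

```r
# Toy "true" distance matrix for 8 hypothetical populations.
set.seed(7)
pts <- matrix(rnorm(16), ncol = 2)
G_toy <- as.matrix(dist(pts))

# Structured missingness: hide a 3 x 3 block, as in the example above.
mask <- matrix(FALSE, nrow(G_toy), ncol(G_toy))
sel <- 1:3
mask[sel, sel] <- TRUE
M_input <- G_toy
M_input[mask] <- NA

# Naive baseline imputer: fill every masked cell with the mean of the
# observed cells, then restore the zero diagonal.
M_naive <- M_input
M_naive[mask] <- mean(M_input, na.rm = TRUE)
diag(M_naive) <- 0

# Score only the masked off-diagonal cells; the diagonal is known to be 0.
off_diag_masked <- mask & !diag(nrow(G_toy))
rmse_masked <- sqrt(mean((M_naive[off_diag_masked] - G_toy[off_diag_masked])^2))
```

A baseline like this gives a reference point: an iterative imputer would be expected to achieve a lower error on the same masked cells than the mean fill does.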

Reproducible pipelines

See the inst/examples directory of the package source for more complete pipelines that write results to disk in a project context.
