The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

soilVAE soilVAE badge

CRAN status R-CMD-check DOI Python TensorFlow

Supervised Variational Autoencoder (VAE) regression for high‑dimensional predictors (e.g., VIS–NIR–SWIR soil spectroscopy), implemented in Python TensorFlow/Keras and exposed in R via reticulate.

The README is also the main reproducible case study, mirroring the vignette (vignettes/soilVAE-workflow.Rmd) so a reader can understand the why, how, and performance without opening additional files.


What soilVAE does

Given spectra \(x\in\mathbb{R}^p\) and a continuous soil property \(y\in\mathbb{R}\), soilVAE learns:

Objective (supervised \(\beta\)-VAE)

We minimize a weighted sum:

\[ \mathcal{L}(x,y) = \underbrace{\|x-\hat x\|_2^2}_{\text{reconstruction}} \;+\; \beta\;\underbrace{D_{KL}\!\left(q_\phi(z\mid x)\,\|\,\mathcal{N}(0,I)\right)}_{\text{regularization}} \;+\; \alpha\;\underbrace{\|y-\hat y\|_2^2}_{\text{regression}}. \]

In the package API, these correspond to beta_kl = β and alpha_y = α.


Installation

CRAN (once accepted)

install.packages("soilVAE")

Development version (GitHub)

# install.packages("remotes")
remotes::install_github("HugoMachadoRodrigues/soilVAE")

Python / TensorFlow setup that does not surprise the user

Because deep learning depends on external Python libraries, this README uses a defensive pattern:

  1. detect whether Python + TF/Keras are available
  2. if not, show exactly how to create a conda env using conda-forge only
  3. run the VAE only when the environment is ready

Important: reticulate “locks” the Python used per R session. If you change env variables or use_*() calls, restart R.

library(reticulate)

# Make sure reticulate isn't forced to a missing python
Sys.unsetenv("RETICULATE_PYTHON")

# Create env (if needed)
if (!"soilvae-tf" %in% conda_list()$name) {
  conda_create("soilvae-tf", python_version = "3.11")
}

# Install core deps from conda-forge
conda_install("soilvae-tf", packages = c("pip", "numpy"), channel = "conda-forge")

# Install TF/Keras via pip inside the env
py_install(c("tensorflow>=2.13", "keras>=3"), pip = TRUE, envname = "soilvae-tf")

Now, in the same R session:

library(soilVAE)
soilVAE::vae_configure(conda = "soilvae-tf")
reticulate::py_config()

Option B: point to an existing Python executable

library(soilVAE)
soilVAE::vae_configure(python = "C:/path/to/python.exe")

Reproducible case study (spectra -> pre-processing -> PLS baseline -> soilVAE)

This follows the workflow style commonly used in soil spectral inference tutorials (e.g., Wadoux et al., 2021) (Wadoux et al. 2021), with a direct comparison between a PLS baseline and soilVAE.

Packages

set.seed(19101991)

pkgs <- c("prospectr", "pls", "reticulate")
for (p in pkgs) if (!requireNamespace(p, quietly = TRUE)) install.packages(p)

library(prospectr)
library(pls)
library(reticulate)

if (!requireNamespace("soilVAE", quietly = TRUE)) {
  stop("soilVAE is not installed. Install it with remotes::install_github('HugoMachadoRodrigues/soilVAE').")
}
library(soilVAE)

# Defensive: detect Python + TF/Keras early, so the README can render everywhere.
has_py <- reticulate::py_available(initialize = FALSE)
has_tf <- FALSE
if (has_py) {
  try(reticulate::py_config(), silent = TRUE)
  has_tf <- reticulate::py_module_available("tensorflow") &&
    reticulate::py_module_available("keras")
}

Data

This example assumes you ship datsoilspc inside the package (data/datsoilspc.rda).

This dataset is provided and described at Geeves et al. (1995)

Geeves, G. W. (Guy W.) & New South Wales. Department of Conservation and Land Management & CSIRO. Division of Soils. (1995). The physical, chemical and morphological properties of soils in the wheat-belt of southern N.S.W. and northern Victoria / G.W. Geeves … [et al.]. Glen Osmond, S. Aust. : CSIRO Division of Soils

data("datsoilspc", package = "soilVAE")
str(datsoilspc)
#> 'data.frame':    391 obs. of  5 variables:
#>  $ clay       : num  49 7 56 14 53 24 9 18 33 27 ...
#>  $ silt       : num  10 24 17 19 7 21 9 20 13 19 ...
#>  $ sand       : num  42 69 27 67 40 55 83 61 54 55 ...
#>  $ TotalCarbon: num  0.15 0.12 0.17 1.06 0.69 2.76 0.66 1.36 0.19 0.16 ...
#>  $ spc        : num [1:391, 1:2151] 0.0898 0.1677 0.0778 0.0958 0.0359 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr [1:2151] "350" "351" "352" "353" ...
#>  - attr(*, "na.action")= 'omit' Named int 392
#>   ..- attr(*, "names")= chr "63"

Expected structure:

Utility: evaluation metrics (base R)

We replicate typical “quantitative” metrics used in soil spectroscopy:
RMSE, MAE, R², bias (ME), RPIQ, and RPD.

eval_quant <- function(y, yhat) {
  y <- as.numeric(y)
  yhat <- as.numeric(yhat)

  ok <- is.finite(y) & is.finite(yhat)
  y <- y[ok]
  yhat <- yhat[ok]

  if (length(y) < 3) {
    return(list(
      n = length(y),
      ME = NA_real_, MAE = NA_real_, RMSE = NA_real_,
      R2 = NA_real_, RPIQ = NA_real_, RPD = NA_real_
    ))
  }

  err <- yhat - y
  me <- mean(err)
  mae <- mean(abs(err))
  rmse <- sqrt(mean(err^2))

  ss_res <- sum((y - yhat)^2)
  ss_tot <- sum((y - mean(y))^2)
  r2 <- if (ss_tot == 0) NA_real_ else 1 - ss_res / ss_tot

  rpiq <- stats::IQR(y) / rmse
  rpd  <- stats::sd(y) / rmse

  list(
    n = length(y),
    ME = me,
    MAE = mae,
    RMSE = rmse,
    R2 = r2,
    RPIQ = rpiq,
    RPD = rpd
  )
}

as_df_metrics <- function(x) {
  data.frame(
    n = x$n,
    ME = round(x$ME, 2),
    MAE = round(x$MAE, 2),
    RMSE = round(x$RMSE, 2),
    R2 = round(x$R2, 2),
    RPIQ = round(x$RPIQ, 2),
    RPD = round(x$RPD, 2),
    stringsAsFactors = FALSE
  )
}

Plot reflectance spectra

matplot(
  x = as.numeric(colnames(datsoilspc$spc)),
  y = t(as.matrix(datsoilspc$spc)),
  xlab = "Wavelength / nm",
  ylab = "Reflectance",
  ylim = c(0, 1),
  type = "l",
  lty = 1,
  col = rgb(0.5, 0.5, 0.5, alpha = 0.3)
)
Raw reflectance spectra (datsoilspc$spc)

Raw reflectance spectra (datsoilspc$spc)

Convert reflectance to absorbance

datsoilspc$spcA <- log(1 / as.matrix(datsoilspc$spc))

matplot(
  x = as.numeric(colnames(datsoilspc$spcA)),
  y = t(datsoilspc$spcA),
  xlab = "Wavelength / nm",
  ylab = "Absorbance",
  ylim = c(0, 4),
  type = "l",
  lty = 1,
  col = rgb(0.5, 0.5, 0.5, alpha = 0.3)
)
Absorbance spectra (datsoilspc$spcA)

Absorbance spectra (datsoilspc$spcA)

Preprocessing: resample (5 nm) + SNV + moving average

oldWavs <- as.numeric(colnames(datsoilspc$spcA))
newWavs <- seq(min(oldWavs), max(oldWavs), by = 5)

datsoilspc$spcARs <- prospectr::resample(
  X = datsoilspc$spcA,
  wav = oldWavs,
  new.wav = newWavs,
  interpol = "linear"
)

datsoilspc$spcASnv <- prospectr::standardNormalVariate(datsoilspc$spcARs)
datsoilspc$spcAMovav <- prospectr::movav(datsoilspc$spcASnv, w = 11)

wavs <- as.numeric(colnames(datsoilspc$spcAMovav))

matplot(
  x = wavs,
  y = t(datsoilspc$spcAMovav),
  xlab = "Wavelength / nm",
  ylab = "Absorbance (SNV + movav)",
  type = "l",
  lty = 1,
  col = rgb(0.5, 0.5, 0.5, alpha = 0.3)
)
Preprocessed spectra (datsoilspc$spcAMovav)

Preprocessed spectra (datsoilspc$spcAMovav)

Split calibration vs validation

set.seed(19101991)

calId <- sample(seq_len(nrow(datsoilspc)), size = round(0.75 * nrow(datsoilspc)))
datC <- datsoilspc[calId, ]
datV <- datsoilspc[-calId, ]  # <-- TEST

par(mfrow = c(1, 2))
hist(datC$TotalCarbon, main = "Calibration (datC)", xlab = "Total carbon")
hist(datV$TotalCarbon, main = "TEST (datV)", xlab = "Total carbon")
Calibration vs TEST splits (datC vs datV)

Calibration vs TEST splits (datC vs datV)

par(mfrow = c(1, 1))

Baseline model: PLS

We fit PLS on calibration and evaluate on validation.

maxc <- 30

soilCPlsModel <- pls::plsr(
  TotalCarbon ~ spcAMovav,
  data = datC,
  method = "oscorespls",
  ncomp = maxc,
  validation = "CV"
)

plot(soilCPlsModel, "val", main = "PLS CV performance (datC)", xlab = "Number of components")
PLS CV performance (datC)

PLS CV performance (datC)

Choose number of components (example uses nc = 14).

nc <- 14

# Refit on full datC with chosen nc (PLS itself uses all comps up to maxc; prediction uses nc)
soilCPlsPred_C <- as.numeric(predict(soilCPlsModel, ncomp = nc, newdata = datC$spcAMovav))
soilCPlsPred_T <- as.numeric(predict(soilCPlsModel, ncomp = nc, newdata = datV$spcAMovav))

par(mfrow = c(1, 2))
plot(datC$TotalCarbon, soilCPlsPred_C, xlab="Observed", ylab="Predicted", main="PLS (datC)", pch=16); abline(0,1)
plot(datV$TotalCarbon, soilCPlsPred_T, xlab="Observed", ylab="Predicted", main="PLS (TEST datV)", pch=16); abline(0,1)
PLS predictions (datC + datV)

PLS predictions (datC + datV)

par(mfrow = c(1, 1))

Metrics (PLS)

We use the same evaluation function used in many soilspec workflows.

pls_cal <- eval_quant(datC$TotalCarbon, soilCPlsPred_C)
pls_tst <- eval_quant(datV$TotalCarbon, soilCPlsPred_T)

as_df_metrics(pls_cal)
#>     n ME  MAE RMSE   R2 RPIQ  RPD
#> 1 293  0 0.37 0.56 0.86 2.04 2.63
as_df_metrics(pls_tst)
#>    n   ME  MAE RMSE   R2 RPIQ  RPD
#> 1 98 0.02 0.36 0.52 0.69 2.34 1.81

Supervised VAE regression: soilVAE

Availability check (TensorFlow/Keras)

This chunk detects if Python + TensorFlow + Keras can be loaded.
If not available, the VAE section is skipped (vignette still builds).

has_py <- reticulate::py_available(initialize = FALSE)

has_tf <- FALSE
if (has_py) {
  try(reticulate::py_config(), silent = TRUE)
  has_tf <- reticulate::py_module_available("tensorflow") &&
            reticulate::py_module_available("keras")
}

has_py
#> [1] TRUE
has_tf
#> [1] TRUE

Prepare matrices (same predictors as PLS preprocessing)

Prepare Train/Val internal split within datC (no y transform; scale X only)

set.seed(19101991)

nC <- nrow(datC)
id_tr <- sample(seq_len(nC), size = round(0.80 * nC))
datC_tr <- datC[id_tr, ]
datC_va <- datC[-id_tr, ]

# y stays in original units (no transformation)
y_tr <- as.numeric(datC_tr$TotalCarbon)
y_va <- as.numeric(datC_va$TotalCarbon)

# X: scale predictors using TRAIN center/scale only
X_tr_raw <- as.matrix(datC_tr$spcAMovav)
X_va_raw <- as.matrix(datC_va$spcAMovav)
X_te_raw <- as.matrix(datV$spcAMovav)   # TEST

X_tr <- scale(X_tr_raw)
X_center <- attr(X_tr, "scaled:center")
X_scale  <- attr(X_tr, "scaled:scale")

# safe scaling: avoid division by zero
X_scale[X_scale == 0] <- 1

X_va <- scale(X_va_raw, center = X_center, scale = X_scale)
X_te <- scale(X_te_raw, center = X_center, scale = X_scale)

# sanity checks (dims)
dim(X_tr)
#> [1] 234 421
length(y_tr)
#> [1] 234
dim(X_va)
#> [1]  59 421
length(y_va)
#> [1] 59
dim(X_te)
#> [1]  98 421

Prepare Train/Val internal split within datC (no y transform; scale X only)

We model TotalCarbon using the preprocessed spectra matrix spcAMovav.

Fit + evaluate soilVAE (skipped if TF/Keras unavailable)

reticulate::py_run_string("
import os
import random
import numpy as np
import tensorflow as tf

os.environ['PYTHONHASHSEED'] = '0'
random.seed(19101991)
np.random.seed(19101991)
tf.random.set_seed(19101991)
")

Sys.setenv(TF_DETERMINISTIC_OPS = "1")

Sys.setenv(TF_CPP_MIN_LOG_LEVEL = "2")         # reduce logs INFO/WARN
# Sys.setenv(TF_ENABLE_ONEDNN_OPTS = "0")

if (!has_tf) {
  message("TensorFlow/Keras not available; skipping soilVAE section.")
} else {
  
  Sys.setenv(TF_ENABLE_ONEDNN_OPTS="0")
  
  # Optional: force a specific python/venv/conda, if needed.
  # soilVAE::vae_configure(conda = "soilvae-tf")

  grid_vae <- data.frame(
    latent_dim = c(8L, 16L, 32L, 64L),
    dropout    = c(0.20, 0.30),
    lr         = c(5e-4),
    beta_kl    = c(0.01),
    alpha_y    = c(5),
    epochs     = c(500L),
    batch_size = c(64L, 128L),
    patience   = c(50L),
    stringsAsFactors = FALSE
  )

  grid_vae$hidden_enc <- list(c(512L, 256L, 128L))
  grid_vae$hidden_dec <- list(c(128L, 256L, 512L))

  tuned <- soilVAE::tune_vae_train_val(
    X_tr = X_tr, y_tr = y_tr,
    X_va = X_va, y_va = y_va,
    seed = 19101991,
    grid_vae = grid_vae
  )

  best <- soilVAE::select_best_from_grid(tuned$tuning_df, selection_metric = "euclid")

  cfg <- best$best

  # Refit on full datC (train+val) using early stopping monitored on the internal val (datC_va)
  m_vae <- soilVAE::vae_build(
    input_dim  = ncol(X_tr),
    hidden_enc = as.integer(strsplit(cfg$hidden_enc_str, "-")[[1]]),
    hidden_dec = as.integer(strsplit(cfg$hidden_dec_str, "-")[[1]]),
    latent_dim = as.integer(cfg$latent_dim),
    dropout    = as.numeric(cfg$dropout),
    lr         = as.numeric(cfg$lr),
    beta_kl    = as.numeric(cfg$beta_kl),
    alpha_y    = as.numeric(cfg$alpha_y)
  )

  soilVAE::vae_fit(
    model = m_vae,
    X = X_tr, y = y_tr,
    X_val = X_va, y_val = y_va,
    epochs = as.integer(cfg$epochs),
    batch_size = as.integer(cfg$batch_size),
    patience = as.integer(cfg$patience),
    verbose = 0L
  )

  yhat_tr <- as.numeric(soilVAE::vae_predict(m_vae, X_tr))
  yhat_va <- as.numeric(soilVAE::vae_predict(m_vae, X_va))
  yhat_te <- as.numeric(soilVAE::vae_predict(m_vae, X_te))

  # Metrics: internal train/val + FINAL TEST
  vae_trn <- eval_quant(y_tr, yhat_tr)
  vae_val <- eval_quant(y_va, yhat_va)
  vae_tst <- eval_quant(as.numeric(datV$TotalCarbon), yhat_te)

  # Plots
  par(mfrow = c(1, 3))
  plot(y_tr, yhat_tr, main="soilVAE (Train)", xlab="Observed", ylab="Predicted", pch=16); abline(0,1)
  plot(y_va, yhat_va, main="soilVAE (Val)",   xlab="Observed", ylab="Predicted", pch=16); abline(0,1)
  plot(as.numeric(datV$TotalCarbon), yhat_te, main="soilVAE (TEST datV)", xlab="Observed", ylab="Predicted", pch=16); abline(0,1)
  par(mfrow = c(1, 1))
}
Train/Val splits for soilVAE

Train/Val splits for soilVAE

Compare PLS vs soilVAE (TEST = datV)

We present a compact comparison table.

if (!has_tf) {

  tab <- rbind(
    cbind(Model = "PLS", Split = "Calibration (datC)", as_df_metrics(pls_cal)),
    cbind(Model = "PLS", Split = "TEST (datV)",        as_df_metrics(pls_tst))
  )

} else {

  tab <- rbind(
    cbind(Model = "PLS",    Split = "Calibration (datC)", as_df_metrics(pls_cal)),
    cbind(Model = "PLS",    Split = "TEST (datV)",        as_df_metrics(pls_tst)),
    cbind(Model = "soilVAE",Split = "Train (internal)",   as_df_metrics(vae_trn)),
    cbind(Model = "soilVAE",Split = "Val (internal)",     as_df_metrics(vae_val)),
    cbind(Model = "soilVAE",Split = "TEST (datV)",        as_df_metrics(vae_tst))
  )

}

row.names(tab) <- NULL
tab
#>     Model              Split   n    ME  MAE RMSE   R2 RPIQ  RPD
#> 1     PLS Calibration (datC) 293  0.00 0.37 0.56 0.86 2.04 2.63
#> 2     PLS        TEST (datV)  98  0.02 0.36 0.52 0.69 2.34 1.81
#> 3 soilVAE   Train (internal) 234 -0.07 0.31 0.44 0.92 2.54 3.60
#> 4 soilVAE     Val (internal)  59 -0.10 0.33 0.51 0.76 2.36 2.04
#> 5 soilVAE        TEST (datV)  98 -0.04 0.33 0.47 0.74 2.56 1.97

If TensorFlow/Keras was not available, you can still use the PLS section and install a compatible Python stack later.

Notes for reproducibility

-   Python (\>= 3.9)

-   TensorFlow (\>= 2.13)

-   Keras (\>= 3)

Notes for life

Education without ethics is only rhetoric.

Power without reflection is violence

Wadoux, Alexandre M. J.-C., Brendan Malone, Budiman Minasny, Mario Fajardo, and Alex B. McBratney. 2021. Soil Spectral Inference with R: Analyzing Digital Soil Spectra Using the R Programming Environment. Progress in Soil Science. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-64896-1.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.