
Big-memory workflows with bigPLScox

Frédéric Bertrand

Cedric, Cnam, Paris
frederic.bertrand@lecnam.net

2025-11-06

Motivation

A central feature of bigPLScox is the ability to operate on file-backed bigmemory::big.matrix objects. This vignette demonstrates how to prepare such datasets, fit models with big_pls_cox() and big_pls_cox_gd(), and integrate them with cross-validation helpers. The examples complement the introductory vignette “Getting started with bigPLScox”.

Preparing a big.matrix

We simulate a moderately large design matrix and persist it to disk via bigmemory::filebacked.big.matrix(). Using file-backed storage allows models to train on datasets that exceed the available RAM.

library(bigPLScox)
library(bigmemory)

set.seed(2024)
n_obs <- 5000
n_pred <- 100

X_dense <- matrix(rnorm(n_obs * n_pred), nrow = n_obs)
time <- rexp(n_obs, rate = 0.2)
status <- rbinom(n_obs, 1, 0.7)

big_dir <- tempfile("bigPLScox-")
dir.create(big_dir)
X_big <- filebacked.big.matrix(
  nrow = n_obs,
  ncol = n_pred,
  backingpath = big_dir,
  backingfile = "X.bin",
  descriptorfile = "X.desc",
  init = X_dense
)

The resulting big.matrix can be reopened in future sessions via its descriptor file. All big-memory modelling functions accept either an in-memory matrix or a big.matrix reference.
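As a minimal sketch of re-opening the matrix in a later session, the descriptor file written above can be passed to `bigmemory::attach.big.matrix()` (this assumes `big_dir` still contains `X.desc` and `X.bin`):

```r
library(bigmemory)

# Re-attach the file-backed matrix from its descriptor file.
# No data is copied into RAM; the object is a reference to X.bin on disk.
X_big_reattached <- attach.big.matrix(file.path(big_dir, "X.desc"))
dim(X_big_reattached)
```

Because only a reference is created, re-attaching is essentially instantaneous regardless of the size of the backing file.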

Fitting big-memory models

big_pls_cox() runs the classical PLS-Cox algorithm while streaming data from disk.

fit_big <- big_pls_cox(
  X = X_big,
  time = time,
  status = status,
  ncomp = 5
)

head(fit_big$scores)
#>               [,1]          [,2]          [,3]          [,4]          [,5]
#> [1,] -4.450742e-06 -1.669808e-19 -1.671178e-32 -1.707642e-46 -1.148375e-59
#> [2,] -4.450742e-06 -1.669808e-19 -1.671178e-32 -1.707642e-46 -1.148375e-59
#> [3,] -4.450742e-06 -1.669808e-19 -1.671178e-32 -1.707642e-46 -1.148375e-59
#> [4,] -4.450742e-06 -1.669808e-19 -1.671178e-32 -1.707642e-46 -1.148375e-59
#> [5,] -4.450742e-06 -1.669808e-19 -1.671178e-32 -1.707642e-46 -1.148375e-59
#> [6,] -4.450742e-06 -1.669808e-19 -1.671178e-32 -1.707642e-46 -1.148375e-59
str(fit_big)
#> List of 9
#>  $ scores  : num [1:5000, 1:5] -4.45e-06 -4.45e-06 -4.45e-06 -4.45e-06 -4.45e-06 ...
#>  $ loadings: num [1:100, 1:5] -0.1 -0.1 -0.1 -0.1 -0.1 ...
#>  $ weights : num [1:100, 1:5] -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 -0.1 ...
#>  $ center  : num [1:100] 0.982 0.982 0.982 0.982 0.982 ...
#>  $ scale   : num [1:100] 1.65e-07 1.65e-07 1.65e-07 1.65e-07 1.65e-07 ...
#>  $ cox_fit :List of 10
#>   ..$ coefficients     : num [1:5] NA NA 1.82e+41 NA NA
#>   ..$ var              : num [1:5, 1:5] 0 0 0 0 0 0 0 0 0 0 ...
#>   ..$ loglik           : num [1:2] -26230 -26230
#>   ..$ score            : num 8.72e-20
#>   ..$ iter             : int 1
#>   ..$ linear.predictors: num [1:5000] 0.000392 0.000392 0.000392 0.000392 0.000392 ...
#>   ..$ residuals        : Named num [1:5000] 0.8605 0.7913 -0.0129 0.1737 0.9958 ...
#>   .. ..- attr(*, "names")= chr [1:5000] "1" "2" "3" "4" ...
#>   ..$ means            : num [1:5] -4.45e-06 -1.67e-19 -1.67e-32 -1.71e-46 -1.15e-59
#>   ..$ method           : chr "efron"
#>   ..$ class            : chr "coxph"
#>  $ keepX   : int [1:5] 0 0 0 0 0
#>  $ time    : num [1:5000] 1.024 1.5182 0.1047 5.9911 0.0424 ...
#>  $ status  : num [1:5000] 1 1 0 1 1 1 1 1 0 1 ...
#>  - attr(*, "class")= chr "big_pls_cox"

The gradient-descent variant big_pls_cox_gd() uses stochastic optimisation and is well-suited for very large datasets.

fit_big_gd <- big_pls_cox_gd(
  X = X_big,
  time = time,
  status = status,
  ncomp = 5,
  max_iter = 100,
  tol = 1e-4
)

head(fit_big_gd$scores)
str(fit_big_gd)

Both functions return objects that expose the latent scores and loading vectors, allowing downstream visualisations and diagnostics identical to their in-memory counterparts.
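As one illustration of such a diagnostic, the first two latent components can be plotted directly from the fitted object (a sketch assuming the `fit_big` object and the `status` vector from above):

```r
# Scatter plot of the first two PLS components, coloured by event status.
plot(
  fit_big$scores[, 1], fit_big$scores[, 2],
  col = ifelse(status == 1, "firebrick", "steelblue"),
  pch = 16, cex = 0.4,
  xlab = "Component 1", ylab = "Component 2",
  main = "Latent scores from big_pls_cox()"
)
legend(
  "topright", legend = c("event", "censored"),
  col = c("firebrick", "steelblue"), pch = 16, bty = "n"
)
```

The same code works unchanged for `fit_big_gd`, since both fits expose a `scores` matrix with one row per observation.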

Cross-validation on big matrices

Cross-validation for big-memory models is supported through the list interface. This enables streaming each fold directly from disk.

set.seed(2024)
data_big <- list(x = X_big, time = time, status = status)
cv_big <- cv.coxgpls(
  data_big,
  nt = 5,
  ncores = 1,
  ind.block.x = c(10, 40)
)
cv_big$opt_nt

For large experiments, consider combining foreach::foreach() with doParallel::registerDoParallel() to parallelise the folds.
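A minimal set-up sketch for such a parallel run is shown below; it assumes that `cv.coxgpls()` dispatches folds through a registered foreach backend when `ncores > 1`, and reuses the `data_big` list from above:

```r
library(doParallel)

# Register a small cluster for foreach-based helpers.
cl <- parallel::makeCluster(2)
registerDoParallel(cl)

cv_big_par <- cv.coxgpls(
  data_big,
  nt = 5,
  ncores = 2,
  ind.block.x = c(10, 40)
)

# Always release the workers when done.
parallel::stopCluster(cl)
```

On Unix-alike systems a fork-based backend (e.g. `doParallel` with `makeForkCluster()`) avoids serialising the descriptor, since child processes can attach the same backing file directly.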

Timing snapshot

The native C++ solvers substantially reduce wall-clock time compared to fitting through the R interface alone. The bench package provides convenient instrumentation; the chunk below only runs when it is available.

if (requireNamespace("bench", quietly = TRUE)) {
  bench::mark(
    streaming = big_pls_cox(X_big, time, status, ncomp = 5, keepX = 0),
    gd = big_pls_cox_gd(X_big, time, status, ncomp = 5, max_iter = 150),
    iterations = 5,
    check = FALSE
  )
}
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 streaming    30.4ms   30.5ms      32.7    8.81MB     8.19
#> 2 gd             13ms   13.2ms      75.2    6.79MB    18.8

Deviance residuals with big matrices

Once a model has been fitted, we can evaluate deviance residuals using the new C++ backend. Supplying the linear predictor avoids recomputing it in R and works with any matrix backend.

eta_big <- predict(fit_big, type = "link")
dr_cpp <- computeDR(time, status, engine = "cpp", eta = eta_big)
max(abs(dr_cpp - computeDR(time, status)))
#> [1] 3.877472

Cleaning up

Temporary backing files can be removed after the analysis. In production pipelines you will typically keep the descriptor file alongside the binary data.

rm(X_big)
file.remove(file.path(big_dir, "X.bin"))
#> [1] TRUE
file.remove(file.path(big_dir, "X.desc"))
#> [1] TRUE
unlink(big_dir, recursive = TRUE)

Additional resources
