The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Type: Package
Title: Methods and Reproducible Workflows for Partial Least Squares with Missing Data
Version: 0.2.0
Date: 2026-04-07
Depends: R (≥ 4.1.0)
Imports: mice, plsRglm, stats, utils, VIM
Suggests: bcv, knitr, mlbench, plsdof, rmarkdown, testthat (≥ 3.0.0)
Author: Titin Agustin Nengsih [aut], Frederic Bertrand [aut, cre], Myriam Maumy-Bertrand [aut]
Maintainer: Frederic Bertrand <frederic.bertrand@lecnam.net>
Description: Methods-first tooling for reproducing and extending the partial least squares regression studies on incomplete data described in Nengsih et al. (2019) <doi:10.1515/sagmb-2018-0059>. The package provides simulation helpers, missingness generators, imputation wrappers, component-selection utilities, real-data diagnostics, and reproducible study orchestration for Nonlinear Iterative Partial Least Squares (NIPALS)-Partial Least Squares (PLS) workflows.
License: GPL-3
Encoding: UTF-8
LazyData: true
VignetteBuilder: knitr
Config/testthat/edition: 3
RoxygenNote: 7.3.3
URL: https://fbertran.github.io/missPLS/, https://github.com/fbertran/missPLS
BugReports: https://github.com/fbertran/missPLS/issues
NeedsCompilation: no
Packaged: 2026-04-08 09:17:07 UTC; bertran7
Repository: CRAN
Date/Publication: 2026-04-13 15:00:02 UTC

missPLS: Methods and Reproducible Workflows for Partial Least Squares with Missing Data

Description

Methods-first tooling for reproducing and extending the partial least squares regression studies on incomplete data described in Nengsih et al. (2019) doi:10.1515/sagmb-2018-0059. The package provides simulation helpers, missingness generators, imputation wrappers, component-selection utilities, real-data diagnostics, and reproducible study orchestration for Nonlinear Iterative Partial Least Squares (NIPALS)-Partial Least Squares (PLS) workflows.

Author(s)

Maintainer: Frederic Bertrand frederic.bertrand@lecnam.net

Authors:

See Also

Useful links:


Add missing values to a predictor matrix

Description

Create MCAR or MAR missingness on the predictor matrix x. Missingness is generated column-wise so that each predictor receives approximately the same missing-data proportion, matching the simulation strategy used in the original work.

Usage

add_missingness(
  x,
  y,
  mechanism = c("MCAR", "MAR"),
  missing_prop,
  seed = NULL,
  mar_y_bias = 0.8
)

Arguments

x

Predictor matrix or data frame.

y

Numeric response vector.

mechanism

Missingness mechanism: "MCAR" or "MAR".

missing_prop

Missing-data proportion as a fraction (0.05) or a percentage (5).

seed

Optional random seed. If supplied, it is used only for this call.

mar_y_bias

Proportion of missing values assigned to the upper half of the observed y values under the MAR mechanism.

Value

A list with components x_incomplete, missing_mask, missing_prop, mechanism, and seed.

Examples

sim <- simulate_pls_data(n = 20, p = 10, true_ncomp = 2, seed = 1)
miss <- add_missingness(sim$x, sim$y, mechanism = "MCAR", missing_prop = 10, seed = 2)
mean(is.na(miss$x_incomplete))

Bromhexine dataset

Description

Bromhexine in pharmaceutical syrup used in the article and thesis.

Usage

bromhexine

Format

A misspls_dataset list with components:

name

Dataset name.

x

A numeric ⁠23 x 64⁠ predictor matrix.

y

A numeric response vector of length 23.

data

A data frame with response y and predictors x1 to x64.

source

A short source reference.

preprocessing

Dataset preprocessing notes.

notes

Additional study notes.

Source

Goicoechea and Olivieri (1999a), calibration and test files bundled in extra_docs/pls_data.


Diagnose a real dataset

Description

Compute correlation summaries and VIF-style diagnostics for a packaged real dataset.

Usage

diagnose_real_data(dataset, cor_threshold = 0.7)

Arguments

dataset

A packaged dataset name or misspls_dataset object.

cor_threshold

Absolute-correlation threshold used when reporting predictor pairs and predictor-response associations.

Value

A list with correlation and VIF summaries.

Examples

diag_bromhexine <- diagnose_real_data("bromhexine")
names(diag_bromhexine)

Impute a predictor matrix

Description

Apply one of the imputation strategies used in the article and thesis.

Usage

impute_pls_data(
  x,
  method = c("mice", "knn", "svd"),
  seed = NULL,
  m,
  k = 15L,
  svd_rank = 10L,
  svd_maxiter = 1000L
)

Arguments

x

Incomplete predictor matrix or data frame.

method

Imputation method: "mice", "knn", or "svd".

seed

Optional random seed forwarded to stochastic imputers when supported.

m

Number of imputations for method = "mice". By default this is set to the missing-data percentage rounded to the nearest integer.

k

Number of neighbours for method = "knn".

svd_rank

Target rank for method = "svd".

svd_maxiter

Maximum number of iterations for the fallback SVD routine.

Value

A misspls_imputation object.

Examples

sim <- simulate_pls_data(n = 20, p = 10, true_ncomp = 2, seed = 1)
miss <- add_missingness(sim$x, sim$y, mechanism = "MCAR", missing_prop = 10, seed = 2)
imp <- impute_pls_data(miss$x_incomplete, method = "knn", seed = 3)
length(imp$datasets)

Octane dataset

Description

Octane in gasoline from NIR data used in the article and thesis.

Usage

octane

Format

A misspls_dataset list with components:

name

Dataset name.

x

A numeric ⁠68 x 493⁠ predictor matrix.

y

A numeric response vector of length 68.

data

A data frame with response y and predictors x1 to x493.

source

A short source reference.

preprocessing

Dataset preprocessing notes.

notes

Additional study notes.

Source

Goicoechea and Olivieri (2003), calibration and test files bundled in extra_docs/pls_data.


Complete-case ozone dataset

Description

Los Angeles ozone pollution complete-case dataset used in the article and thesis.

Usage

ozone_complete

Format

A misspls_dataset list with components:

name

Dataset name.

x

A numeric ⁠203 x 12⁠ predictor matrix.

y

A numeric response vector of length 203.

data

A data frame with response y and predictors x1 to x12.

source

A short source reference.

preprocessing

Dataset preprocessing notes.

notes

Additional study notes.

Source

mlbench::Ozone, restricted to the 203 complete observations used in the published analysis.


Run a real-data study

Description

Run a real-data study

Usage

run_real_data_study(
  dataset,
  seed = NULL,
  missing_props = seq(5, 50, 5),
  mechanisms = c("MCAR", "MAR"),
  reps = 1L,
  baseline_reps = 100L,
  max_ncomp = 12L,
  criteria = c("Q2-LOO", "Q2-10fold", "AIC", "AIC-DoF", "BIC", "BIC-DoF"),
  incomplete_methods = c("nipals_standard", "nipals_adaptative"),
  imputation_methods = c("mice", "knn", "svd"),
  folds = 10L,
  mar_y_bias = 0.8
)

Arguments

dataset

A packaged dataset name or misspls_dataset object.

seed

Optional base random seed.

missing_props

Missing-data proportions as fractions or percentages.

mechanisms

Missing-data mechanisms.

reps

Number of replicate missingness draws for each mechanism and proportion.

baseline_reps

Number of repeated complete-data ⁠Q2-10fold⁠ selections used to determine ⁠t**⁠.

max_ncomp

Maximum number of extracted components.

criteria

Criteria evaluated on incomplete and imputed data.

incomplete_methods

Incomplete-data NIPALS workflows.

imputation_methods

Imputation methods.

folds

Number of folds used by "Q2-10fold".

mar_y_bias

MAR bias parameter passed to add_missingness().

Value

A data frame with one row per study run.


Run a simulation study

Description

Run the simulation workflows used in the article and thesis.

Usage

run_simulation_study(
  dimensions = list(c(500L, 100L), c(500L, 20L), c(100L, 20L), c(80L, 25L), c(60L, 33L),
    c(40L, 50L), c(20L, 100L)),
  true_ncomp = c(2L, 4L, 6L),
  missing_props = seq(5, 50, 5),
  mechanisms = c("MCAR", "MAR"),
  reps = 1L,
  seed = NULL,
  max_ncomp = 8L,
  criteria = c("Q2-LOO", "Q2-10fold", "AIC", "AIC-DoF", "BIC", "BIC-DoF"),
  incomplete_methods = c("nipals_standard", "nipals_adaptative"),
  imputation_methods = c("mice", "knn", "svd"),
  folds = 10L,
  mar_y_bias = 0.8
)

Arguments

dimensions

List of ⁠(n, p)⁠ integer pairs.

true_ncomp

Vector of true component counts.

missing_props

Missing-data proportions as fractions or percentages.

mechanisms

Missing-data mechanisms.

reps

Number of replicates.

seed

Optional base random seed.

max_ncomp

Maximum number of extracted components.

criteria

Criteria evaluated on complete and imputed data.

incomplete_methods

Incomplete-data NIPALS workflows.

imputation_methods

Imputation methods.

folds

Number of folds used by "Q2-10fold".

mar_y_bias

MAR bias parameter passed to add_missingness().

Value

A data frame with one row per study run.


Select the number of PLS components

Description

Select the number of components for complete, imputed, or incomplete-data PLS workflows.

Usage

select_ncomp(
  x,
  y,
  method = c("complete", "nipals_standard", "nipals_adaptative"),
  criterion = c("Q2-LOO", "Q2-10fold", "AIC", "AIC-DoF", "BIC", "BIC-DoF"),
  max_ncomp,
  seed = NULL,
  folds = 10L,
  threshold = 0.0975
)

Arguments

x

Predictor matrix, dataset object, or misspls_imputation object.

y

Numeric response vector. This may be omitted when x already contains a response.

method

Selection workflow: "complete", "nipals_standard", or "nipals_adaptative".

criterion

Selection criterion: "Q2-LOO", "Q2-10fold", "AIC", "AIC-DoF", "BIC", or "BIC-DoF".

max_ncomp

Maximum number of components to consider.

seed

Optional random seed used by the cross-validation and imputation aggregation steps.

folds

Number of cross-validation folds used by "Q2-10fold".

threshold

Threshold applied to Q2 criteria.

Value

A one-row data frame describing the selected component count.

Examples

sim <- simulate_pls_data(n = 25, p = 10, true_ncomp = 2, seed = 1)
select_ncomp(sim$x, sim$y, method = "complete", criterion = "AIC", max_ncomp = 4, seed = 2)

Simulate PLS data

Description

Simulate a univariate-response PLS dataset using the Li et al.-style generator available in plsRglm.

Usage

simulate_pls_data(n, p, true_ncomp, seed = NULL, model = "li2002")

Arguments

n

Number of observations.

p

Number of predictors.

true_ncomp

True number of latent components.

seed

Optional random seed. If supplied, it is used only for this call.

model

Simulation model. Only "li2002" is currently supported.

Value

A list with components x, y, data, true_ncomp, seed, and model.

Examples

sim <- simulate_pls_data(n = 20, p = 10, true_ncomp = 2, seed = 42)
str(sim)

Summarize simulation or real-data study results

Description

Summarize simulation or real-data study results

Usage

summarize_simulation_study(results)

Arguments

results

A results data frame returned by run_simulation_study() or run_real_data_study().

Value

A grouped summary data frame.

Examples

sim_results <- run_simulation_study(
  dimensions = list(c(30, 12)),
  true_ncomp = 2,
  missing_props = numeric(0),
  mechanisms = character(0),
  reps = 2,
  seed = 1
)
summarize_simulation_study(sim_results)

Tetracycline dataset

Description

Tetracycline in serum used in the article and thesis.

Usage

tetracycline

Format

A misspls_dataset list with components:

name

Dataset name.

x

A numeric ⁠107 x 101⁠ predictor matrix.

y

A numeric response vector of length 107.

data

A data frame with response y and predictors x1 to x101.

source

A short source reference.

preprocessing

Dataset preprocessing notes.

notes

Additional study notes.

Source

Goicoechea and Olivieri (1999b), calibration and test files bundled in extra_docs/pls_data.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.