| Type: | Package |
| Title: | Proper Scoring Rules for Missing Value Imputation |
| Version: | 1.2.0 |
| Description: | Provides tools for evaluating and ranking missing value imputation methods using proper scoring rules. Implements the Energy-I-Score and the DR-I-Score for the assessment of deterministic, stochastic and multiple imputation methods for numerical and mixed datasets, following Näf et al. (2022) <doi:10.48550/arXiv.2106.03742> and Näf et al. (2025) <doi:10.48550/arXiv.2507.11297>. |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.3 |
| Imports: | energy, kernlab, pbapply, pbmcapply, ranger, scoringRules, stats |
| Suggests: | knitr, mice, rmarkdown, spelling, testthat (≥ 3.0.0) |
| Config/testthat/edition: | 3 |
| VignetteBuilder: | knitr |
| URL: | https://krystynagrzesiak.github.io/Iscores/ |
| License: | GPL-3 |
| Language: | en-US |
| NeedsCompilation: | no |
| Packaged: | 2026-06-08 17:05:50 UTC; Krysia |
| Author: | Krystyna Grzesiak |
| Maintainer: | Krystyna Grzesiak <krygrz11@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2026-06-08 18:20:25 UTC |
Compute the imputation KL-based scoring rules
Description
Compute the imputation KL-based scoring rules
Usage
DR_IScore(
X,
imputation_func = NULL,
X_imp = NULL,
m = 5,
n_proj = 100,
n_trees_per_proj = 5,
min_node_size = 10,
n_cores = 1,
projection_function = NULL,
...
)
Arguments
X |
data containing missing values denoted with NA's. |
imputation_func |
an imputing function. If |
X_imp |
a list of imputed datasets. If |
m |
the number of multiple imputations to consider; defaults to 5. |
n_proj |
an integer specifying the number of projections to consider for the score. |
n_trees_per_proj |
an integer, the number of trees per projection. |
min_node_size |
the minimum number of observations in a terminal node (leaf) of each tree. |
n_cores |
an integer, the number of cores to use. |
projection_function |
a function providing the user-specific projections. |
... |
used for compatibility |
Value
a numeric value: the score obtained for the provided imputation method.
References
This method is described in detail in:
Näf, Jeffrey, Meta-Lina Spohn, Loris Michel, and Nicolai Meinshausen. 2022. “Imputation Scores.” https://arxiv.org/abs/2106.03742.
Examples
set.seed(111)
X <- random_mcar_data(100, 3, 0.2)
imputation_func <- exp_imputation
DR_IScore(X, imputation_func, m = 2, n_proj = 10, n_trees_per_proj = 2 )
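The X_imp argument is described as a list of imputed datasets, and m as the number of multiple imputations. As a minimal sketch of how such a list could be built in base R, the block below draws m completed copies of one dataset. The imputer norm_impute_sketch is a hypothetical stand-in written for this illustration, not the package's norm_imputation, and the exact list format DR_IScore expects is an assumption.

```r
# Hypothetical stochastic imputer: fills every NA with an independent N(0, 1) draw.
norm_impute_sketch <- function(X_miss) {
  X <- as.matrix(X_miss)
  na <- is.na(X)
  X[na] <- rnorm(sum(na))
  X
}

set.seed(1)
X <- matrix(rnorm(30), nrow = 10)
X[sample.int(30, 6)] <- NA   # remove 6 of the 30 entries at random

m <- 5
# simplify = FALSE keeps the result as a list of m completed matrices
X_imp_list <- replicate(m, norm_impute_sketch(X), simplify = FALSE)
length(X_imp_list)
```

Because the imputer is stochastic, the m completed datasets agree on the observed entries and differ on the imputed ones.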
Balancing of Classes
Description
Balancing of Classes
Usage
class.balancing(X_proj_complete, Y.proj, drawA, X_imp, ids.with.missing, vars)
Arguments
X_proj_complete |
matrix with complete projected observations. |
Y.proj |
matrix with projected imputed observations. |
drawA |
vector of indices corresponding to current missingness pattern. |
X_imp |
matrix of full imputed observations. |
ids.with.missing |
vector of indices of observations with missing values. |
vars |
vector of the variables in the projection. |
Value
a list of new X_proj_complete and Y.proj.
Combine two projection forests
Description
Combine two projection forests
Usage
combine2Forests(mod1, mod2)
Arguments
mod1 |
A fitted forest object. |
mod2 |
A fitted forest object. |
Value
A forest object containing trees from both input forests.
Combine a list of forests
Description
Combine a list of forests
Usage
combineForests(list.rf)
Arguments
list.rf |
A list of fitted forest objects. |
Value
A single forest object obtained by combining all forests in list.rf.
Calculates IScores for multiple imputation functions
Description
Calculates IScores for multiple imputation functions
Usage
compare_Iscores(X, methods_list, score = c("energy_IScore", "DR_IScore"), ...)
Arguments
X |
data containing missing values denoted with NA's. |
methods_list |
a named list of imputing functions. |
score |
a vector of names of scores to calculate. The values listed in Usage are "energy_IScore" and "DR_IScore". |
... |
other arguments to be passed to energy_IScore or DR_IScore |
Value
a vector of IScores for provided methods
Examples
set.seed(111)
X <- random_mcar_data(100, 3, 0.2)
methods_list <- list(exp = exp_imputation,
norm = norm_imputation)
compare_Iscores(X, methods_list = methods_list, m = 2,
n_proj = 10, n_trees_per_proj = 2 )
Compute the density ratio score
Description
Compute the density ratio score
Usage
compute_drScore(object, Z = Z, n_trees_per_proj, n_proj)
Arguments
object |
a crf object. |
Z |
a matrix of candidate points. |
n_trees_per_proj |
an integer, the number of trees per projection. |
n_proj |
an integer specifying the number of projections. |
Value
a numeric value, the DR I-Score.
Computation of the density ratio score
Description
Computes the density ratio score using a random forest model based on random projections.
Usage
densityRatioScore(
X,
X_imp,
pattern = NULL,
n_proj = 10,
n_trees_per_proj = 1,
projection_function = NULL,
min_node_size = 1,
normal_proj = TRUE
)
Arguments
X |
A numeric matrix of observed data that may contain missing values
denoted by NA. |
X_imp |
A numeric matrix of imputed values with the same dimensions as X. |
pattern |
A vector or pattern indicating the missingness structure. |
n_proj |
An integer specifying the number of random projections. |
n_trees_per_proj |
An integer specifying the number of trees grown per projection. |
projection_function |
A function that generates user-defined projections. |
min_node_size |
An integer specifying the minimum number of observations in a terminal node (leaf) of each tree. |
normal_proj |
Logical. If |
Details
The method builds multiple random forests on projected versions of the data to estimate the density ratio between observed and imputed distributions.
Value
An object representing a fitted random forest model based on random projections.
Convert a factor vector to one-hot encoding
Description
Converts a factor vector into a one-hot encoded matrix with one column per factor level.
Usage
do_one_hot(vec)
Arguments
vec |
A factor vector to be encoded. |
Details
Missing values in 'vec' are preserved as rows containing 'NA' values.
Value
A numeric matrix with one row per element of 'vec' and one column per factor level. Column names are prefixed with '"level_"'.
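The encoding described above can be sketched in base R. This is an illustration of the documented behaviour (one column per level, a "level_" prefix, NA rows preserved), not the package's implementation; the function name one_hot_sketch is hypothetical.

```r
# Hypothetical re-implementation of the documented behaviour, for illustration only.
one_hot_sketch <- function(vec) {
  stopifnot(is.factor(vec))
  lv <- levels(vec)
  # vec == l is NA wherever vec is NA, so missing entries stay NA in every column
  out <- vapply(lv, function(l) as.numeric(vec == l), numeric(length(vec)))
  out <- matrix(out, nrow = length(vec), ncol = length(lv))
  colnames(out) <- paste0("level_", lv)
  out
}

f <- factor(c("a", "b", NA, "a"))
one_hot_sketch(f)
```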
Energy distance
Description
Calculates the energy distance or the energy statistic.
Usage
edistance(X, X_imp, rescale = FALSE)
Arguments
X |
a complete original dataset (without missing values). |
X_imp |
an imputed dataset |
rescale |
a logical, indicating whether the returned value should be
rescaled. Defaults to FALSE. |
Details
This function uses the eqdist.e function from the energy package. By default, it returns the energy statistic, which is given by
E(X, Y) = \frac{nm}{n + m} \hat{\varepsilon}(X, Y),
where \hat{\varepsilon}(X, Y) is the raw energy distance and n and m are the two sample sizes. To
obtain the raw energy distance, use rescale = TRUE.
Value
A numeric value giving the energy distance between the original dataset and the imputed dataset.
Examples
X <- matrix(rnorm(100), nrow = 25)
X_imp <- matrix(rnorm(100), nrow = 25)
edistance(X, X_imp)
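To make the relation between the energy statistic and the raw energy distance in Details concrete, here is a base R sketch that computes both from pairwise Euclidean distances. It is an illustration under the usual V-statistic definition of the sample energy distance, not the code path of edistance or eqdist.e; the name energy_sketch is hypothetical.

```r
# Hypothetical helper: raw energy distance and the rescaled energy statistic.
energy_sketch <- function(X, Y) {
  X <- as.matrix(X)
  Y <- as.matrix(Y)
  n <- nrow(X)
  m <- nrow(Y)
  D <- as.matrix(dist(rbind(X, Y)))   # all pairwise Euclidean distances
  ix <- seq_len(n)
  iy <- n + seq_len(m)
  # 2 * mean between-sample distance minus the two mean within-sample distances
  raw <- 2 * mean(D[ix, iy]) - mean(D[ix, ix]) - mean(D[iy, iy])
  c(raw = raw, statistic = n * m / (n + m) * raw)
}

set.seed(1)
A <- matrix(rnorm(40), nrow = 20)
B <- matrix(rnorm(60, mean = 1), nrow = 30)
energy_sketch(A, B)
```

The second element is the first multiplied by nm / (n + m), which is the factor appearing in the formula above.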
Calculates Imputation Score for imputation function
Description
Calculates Imputation Score for imputation function
Usage
energy_IScore(
X,
imputation_func,
X_imp = NULL,
multiple = TRUE,
N = 50,
max_length = NULL,
skip_if_needed = TRUE,
scale = FALSE,
n_cores = 1,
silent = TRUE
)
Arguments
X |
data containing missing values denoted with NA's. |
imputation_func |
a function that imputes data. |
X_imp |
imputed dataset of the same size as X. |
multiple |
a logical indicating whether the provided imputation method is a multiple imputation approach (i.e. it generates different imputed values on each call). Defaults to TRUE. Note that if multiple is FALSE, N is automatically set to 1. |
N |
a numeric value: the number of samples drawn from the imputation distribution H. Defaults to 50. |
max_length |
Maximum number of variables |
skip_if_needed |
logical, indicating whether some observations should be skipped to obtain complete columns for scoring. If FALSE, NA is returned for any column with no observed variable for training. |
scale |
a logical value. If TRUE, each variable is scaled in the score. |
n_cores |
the number of cores to use for parallelization. |
silent |
logical indicating whether warnings and messages should be printed. |
Details
This function relies on functions energy_Iscore_num and energy_Iscore_cat. Depending on the presence of factor-type data, these functions compute a score either for purely numerical data or for mixed data types.
If you want to compute the score for numerical data, make sure that the dataset does not contain any factor-type variables.
If you want to compute the score for categorical data, ensure that all categorical variables are preserved as factors.
If your imputation method does not support categorical variables represented as factors, implement a wrapper function that handles the appropriate data type conversions before and after imputation.
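The wrapper idea from the last paragraph can be sketched as follows. Both functions are hypothetical and written only for this illustration: numeric_only_imputer stands in for any method that cannot handle factors, and factor_safe_wrapper converts factors to integer codes before imputation and back to factors afterwards. The sketch assumes every column has at least one observed value.

```r
# Hypothetical numeric-only imputer: fills each column's NAs with random
# draws from that column's observed values.
numeric_only_imputer <- function(X_num) {
  for (j in seq_len(ncol(X_num))) {
    miss <- is.na(X_num[[j]])
    if (!any(miss)) next
    obs <- X_num[[j]][!miss]
    X_num[[j]][miss] <- obs[sample.int(length(obs), sum(miss), replace = TRUE)]
  }
  X_num
}

# Hypothetical wrapper: factors -> integer codes -> impute -> factors again.
factor_safe_wrapper <- function(X_miss) {
  X_miss <- as.data.frame(X_miss)
  is_fac <- vapply(X_miss, is.factor, logical(1))
  fac_cols <- names(X_miss)[is_fac]
  lv <- lapply(X_miss[fac_cols], levels)      # remember the original levels
  X_num <- X_miss
  for (nm in fac_cols) X_num[[nm]] <- as.integer(X_miss[[nm]])
  X_imp <- numeric_only_imputer(X_num)
  for (nm in fac_cols) {
    # map the integer codes back to labels, keeping the full level set
    X_imp[[nm]] <- factor(lv[[nm]][X_imp[[nm]]], levels = lv[[nm]])
  }
  X_imp
}

set.seed(42)
df <- data.frame(x = c(1.5, NA, 3.2, 4.1, NA),
                 g = factor(c("a", NA, "b", "b", "a"), levels = c("a", "b", "c")))
str(factor_safe_wrapper(df))
```

A function of this shape takes a dataset with NA values and returns a completed one, which is the interface the imputation_func argument describes.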
Value
a numerical value denoting the weighted Imputation Score obtained for the provided imputation function, and a table with the scores and weights calculated for individual columns.
References
Näf, J., Grzesiak, K., and Scornet, E. (2025). How to rank imputation methods? arXiv preprint. doi:10.48550/arXiv.2507.11297.
Examples
set.seed(111)
X <- random_mcar_data(100, 4)
imputation_func <- exp_imputation
energy_IScore(X, imputation_func)
X <- random_mcar_mixed_data(100, 4, 2)
imputation_func <- median_mode_imputation
energy_IScore(X, imputation_func)
energy-I-Score for imputation of mixed data (categorical and numerical)
Description
energy-I-Score for imputation of mixed data (categorical and numerical)
Usage
energy_Iscore_cat(
X,
imputation_func,
X_imp = imputation_func(X),
multiple = TRUE,
N = 50,
max_length = NULL,
skip_if_needed = TRUE,
scale = FALSE,
n_cores = 1,
silent = TRUE
)
Arguments
X |
data containing missing values denoted with NA's. |
imputation_func |
a function that imputes data. |
X_imp |
imputed dataset of the same size as X. |
multiple |
a logical indicating whether the provided imputation method is a multiple imputation approach (i.e. it generates different imputed values on each call). Defaults to TRUE. Note that if multiple is FALSE, N is automatically set to 1. |
N |
a numeric value: the number of samples drawn from the imputation distribution H. Defaults to 50. |
max_length |
Maximum number of variables |
skip_if_needed |
logical, indicating whether some observations should be skipped to obtain complete columns for scoring. If FALSE, NA is returned for any column with no observed variable for training. |
scale |
a logical value. If TRUE, each variable is scaled in the score. |
n_cores |
the number of cores to use for parallelization. |
silent |
logical indicating whether warnings and messages should be printed. |
Details
The categorical variables should be stored as factors. If the imputation needs
an additional conversion of the data (for example, one-hot encoding),
implement it inside the function passed as imputation_func. You can use the
miceDRF:::onehot_to_factor and miceDRF:::factor_to_onehot
functions.
Value
a numerical value denoting the weighted Imputation Score obtained for the provided imputation function, and a table with the scores and weights calculated for individual columns.
References
This method is described in detail in:
Näf, J., Grzesiak, K., and Scornet, E. (2025). How to rank imputation methods? arXiv preprint. doi:10.48550/arXiv.2507.11297.
Calculates score for one imputation function
Description
Calculates score for one imputation function
Usage
energy_Iscore_num(
X,
imputation_func,
X_imp = imputation_func(X),
multiple = TRUE,
N = 50,
max_length = NULL,
skip_if_needed = TRUE,
scale = FALSE,
n_cores = 1,
silent = TRUE
)
Arguments
X |
data containing missing values denoted with NA's. |
imputation_func |
a function that imputes data. |
X_imp |
imputed dataset of the same size as X. |
multiple |
a logical indicating whether the provided imputation method is a multiple imputation approach (i.e. it generates different imputed values on each call). Defaults to TRUE. Note that if multiple is FALSE, N is automatically set to 1. |
N |
a numeric value: the number of samples drawn from the imputation distribution H. Defaults to 50. |
max_length |
Maximum number of variables |
skip_if_needed |
logical, indicating whether some observations should be skipped to obtain complete columns for scoring. If FALSE, NA is returned for any column with no observed variable for training. |
scale |
a logical value. If TRUE, each variable is scaled in the score. |
n_cores |
the number of cores to use for parallelization. |
silent |
logical indicating whether warnings and messages should be printed. |
Value
a numerical value denoting the weighted Imputation Score obtained for the provided imputation function, and a table with the scores and weights calculated for individual columns.
References
This method is described in detail in:
Näf, J., Grzesiak, K., and Scornet, E. (2025). How to rank imputation methods? arXiv preprint. doi:10.48550/arXiv.2507.11297.
Standard exponential imputation
Description
Imputes all missing values by independent draws from an exponential distribution with rate 1.
Usage
exp_imputation(X_miss)
Arguments
X_miss |
A data set containing missing values. |
Value
A completed data set with all missing values replaced by draws
from an Exp(1) distribution.
Examples
X <- random_mcar_data(100, 3)
X_imp <- exp_imputation(X)
Internal function for changing factors to numerical
Description
A supplementary function for data management
Usage
factor_to_numeric(factor_col)
Arguments
factor_col |
a factor column |
Details
This function converts factor variables to numeric variables.
One hot encoding
Description
A supplementary function for one-hot encoding
Usage
factor_to_onehot(dat)
Arguments
dat |
a data set containing factor columns alongside numeric columns. |
Details
This function converts factor variables into one-hot encoding
Extract and group missing-data patterns
Description
Identifies unique missingness patterns in a data matrix and groups observations according to these patterns. If more than one pattern occurs only once, such singleton patterns are merged into a single group.
Usage
get_pattern_data(X)
Arguments
X |
A matrix or data frame that may contain missing values. |
Details
Missingness patterns are represented by a logical matrix obtained from
is.na(X). Only rows containing at least one missing value are used
to define the unique patterns.
If more than one pattern is represented by a single observation, these
singleton patterns are merged using merge_singleton_patterns().
Value
A list with three elements:
- patterns
A matrix of unique missingness patterns.
- groups
A list of integer vectors giving row indices for each pattern.
- average_diff
A logical indicating whether singleton patterns were merged.
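The grouping step described in Details can be illustrated in base R: build the logical matrix is.na(X), keep the rows with at least one missing value, and group row indices by identical patterns. This is a sketch only; the name pattern_groups_sketch is hypothetical, the merging of singleton patterns is deliberately left out, and the input is assumed to contain at least one incomplete row.

```r
# Hypothetical helper: unique missingness patterns and the rows that share them.
pattern_groups_sketch <- function(X) {
  M <- is.na(X)                                  # logical missingness matrix
  rows <- unname(which(rowSums(M) > 0))          # rows with at least one NA
  # encode each pattern as a string such as "100" so equal patterns compare equal
  keys <- apply(M[rows, , drop = FALSE], 1,
                function(r) paste(as.integer(r), collapse = ""))
  groups <- unname(split(rows, factor(keys, levels = unique(keys))))
  patterns <- M[rows[!duplicated(keys)], , drop = FALSE]
  list(patterns = patterns, groups = groups)
}

X <- rbind(c(NA, 1, 2),
           c(1, NA, 2),
           c(NA, 3, 4),
           c(1, 2, 3))
pattern_groups_sketch(X)
```

In this toy input, rows 1 and 3 share one pattern and row 2 forms a pattern of its own; the complete row 4 is not assigned to any group.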
Median/mode imputation
Description
Imputes numerical variables using their median and categorical variables using their most frequent observed category.
Usage
median_mode_imputation(X_miss)
Arguments
X_miss |
A data set containing missing values. |
Value
A completed data set with all missing values imputed.
Examples
X <- random_mcar_mixed_data(100, 3, n_fac = 1)
X_imp <- median_mode_imputation(X)
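For readers who want to see the rule spelled out, the following base R sketch applies the same idea: the median for numeric columns and the most frequent observed category otherwise. The function median_mode_sketch is a hypothetical illustration, not the package's median_mode_imputation, and its tie-breaking (first most frequent level) is an assumption.

```r
# Hypothetical illustration of median/mode imputation.
median_mode_sketch <- function(X_miss) {
  X <- as.data.frame(X_miss)
  for (j in seq_along(X)) {
    miss <- is.na(X[[j]])
    if (!any(miss)) next
    if (is.numeric(X[[j]])) {
      X[[j]][miss] <- median(X[[j]], na.rm = TRUE)
    } else {
      tab <- table(X[[j]])                        # counts of observed categories
      X[[j]][miss] <- names(tab)[which.max(tab)]  # first most frequent category
    }
  }
  X
}

df <- data.frame(x = c(1, NA, 3, 10),
                 f = factor(c("a", "b", NA, "b")))
median_mode_sketch(df)
```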
Merge singleton missingness patterns
Description
Merges missingness patterns that occur only once (singleton patterns) into a single pattern. If the merged pattern already exists among the current patterns, the corresponding groups of observations are combined. Otherwise, a new pattern is created and appended.
Usage
merge_singleton_patterns(patterns, groups, ind_singletons)
Arguments
patterns |
A numeric matrix where each row represents a unique missingness pattern. |
groups |
A list of integer vectors. Each element contains the indices of
observations corresponding to a given pattern in patterns. |
ind_singletons |
An integer vector giving the indices of the singleton patterns in patterns. |
Value
A list with two elements:
- patterns
Updated matrix of unique missingness patterns.
- groups
Updated list of observation indices grouped by pattern.
Standard normal imputation
Description
Imputes all missing values by independent draws from a standard normal distribution.
Usage
norm_imputation(X_miss)
Arguments
X_miss |
A data set containing missing values. |
Value
A completed data set with all missing values replaced by draws
from a N(0,1) distribution.
Examples
X <- random_mcar_data(100, 3)
X_imp <- norm_imputation(X)
Generate random data with MCAR missing values
Description
Generates a numerical dataset consisting of independent standard normal variables and introduces missing values according to a Missing Completely at Random (MCAR) mechanism.
Usage
random_mcar_data(n, p, ratio = 0.2)
Arguments
n |
Number of observations. |
p |
Number of numerical variables. |
ratio |
Proportion of entries to replace with missing values. |
Value
A data frame with n rows and p numerical variables
containing missing values.
Examples
X <- random_mcar_data(100, 3, ratio = 0.2)
head(X)
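One simple way to produce data of this kind in base R is sketched below: draw independent standard normal values and blank out a fixed share of entries chosen uniformly at random. The function mcar_sketch is hypothetical, and the package may select the missing entries differently (for example, independently per entry).

```r
# Hypothetical MCAR generator: exactly round(ratio * n * p) entries become NA.
mcar_sketch <- function(n, p, ratio = 0.2) {
  X <- matrix(rnorm(n * p), nrow = n, ncol = p)
  # positions are chosen uniformly at random, independently of the values
  X[sample.int(n * p, size = round(ratio * n * p))] <- NA
  as.data.frame(X)
}

set.seed(1)
X_sim <- mcar_sketch(100, 3, ratio = 0.2)
mean(is.na(X_sim))   # share of missing entries
```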
Generate random mixed data with MCAR missing values
Description
Generates a mixed dataset containing independent standard normal variables and categorical variables, then introduces missing values according to a Missing Completely at Random (MCAR) mechanism.
Usage
random_mcar_mixed_data(n, p, n_fac = 1, ratio = 0.2)
Arguments
n |
Number of observations. |
p |
Number of numerical variables. |
n_fac |
Number of categorical variables. |
ratio |
Proportion of entries to replace with missing values. |
Value
A data frame containing p numerical variables and
n_fac factor variables with missing values.
Examples
X <- random_mcar_mixed_data(100, 3, n_fac = 2, ratio = 0.2)
str(X)
Sampling of Projections
Description
Sampling of Projections
Usage
sample_vars_proj(ids_x_na, X, projection_function = NULL, normal_proj = TRUE)
Arguments
ids_x_na |
a vector of indices corresponding to NA in the given missingness pattern. |
X |
a matrix of the observed data containing missing values. |
projection_function |
a function providing the user-specific projections. |
normal_proj |
a boolean, if TRUE, sample from the NA of the pattern and additionally from the non-NA. If FALSE, sample only from the NA of the pattern. |
Value
a vector of variables corresponding to the projection.
Truncation of probability
Description
Truncation of probability
Usage
truncProb(p)
Arguments
p |
a numeric value between 0 and 1 to be truncated |
Value
a numeric value, the truncated probability.
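Why truncation matters can be seen in a small sketch. When a classifier probability p is turned into a log density ratio log(p / (1 - p)), values of exactly 0 or 1 give infinite results, so p is first clipped away from the boundaries. The function names and the threshold eps below are hypothetical; the truncation level actually used by truncProb is not stated in this manual.

```r
# Hypothetical clipping of a probability into [eps, 1 - eps].
trunc_prob_sketch <- function(p, eps = 1e-9) {
  pmin(pmax(p, eps), 1 - eps)
}

# Log density ratio implied by a (truncated) class probability.
log_ratio_sketch <- function(p) {
  p <- trunc_prob_sketch(p)
  log(p / (1 - p))
}

log_ratio_sketch(c(0, 0.5, 1))   # finite at the boundaries thanks to clipping
```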