This vignette covers advanced topics for power users, researchers, and method developers:
Target audience: Users comfortable with R programming and statistical methods
Estimated time: 15-20 minutes
corrselect offers two algorithmic approaches for corrPrune():
Exact mode
Algorithm: Eppstein–Löffler–Strash (ELS) or Bron–Kerbosch
Complexity: O(2^p) - exponential in the number of predictors
Guarantee: Finds the largest maximal independent set
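To make the independent-set framing concrete, here is a small base-R sketch (an illustration, not corrselect's internal code) of the "conflict graph" that exact mode searches: variables are vertices, and an edge joins any pair whose absolute correlation exceeds the threshold.

```r
# Sketch (assumed, not corrselect internals): adjacency matrix of the
# conflict graph whose largest independent set exact mode looks for.
conflict_graph <- function(data, threshold) {
  cmat <- abs(cor(data))
  adj <- cmat > threshold   # TRUE where a pair of variables conflicts
  diag(adj) <- FALSE        # a variable never conflicts with itself
  adj
}

adj <- conflict_graph(mtcars, 0.7)
sum(adj) / 2  # number of conflicting variable pairs at threshold 0.7
```

A valid subset is exactly a set of vertices with no edges among them; exact mode returns the largest such set.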
data(mtcars)
# Exact mode: guaranteed optimal
exact_result <- corrPrune(mtcars, threshold = 0.7, mode = "exact")
cat("Exact mode kept:", ncol(exact_result), "variables\n")
#> Exact mode kept: 5 variables
Use exact mode when:
- Guaranteed optimality matters (e.g., results for publication)
- p is small to moderate (roughly p ≤ 25)
Greedy mode
Algorithm: Deterministic iterative removal
Complexity: O(p² × k), where k = number of iterations
Guarantee: Near-optimal, deterministic
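The greedy idea can be sketched in a few lines of base R (a simplified illustration with an assumed removal rule, not corrselect's actual implementation): repeatedly drop the variable with the highest mean absolute correlation until no pair exceeds the threshold.

```r
# Simplified greedy sketch: drop the "worst" variable until all pairs pass.
greedy_prune_sketch <- function(data, threshold) {
  cmat <- abs(cor(data))
  keep <- colnames(cmat)
  repeat {
    sub <- cmat[keep, keep, drop = FALSE]
    diag(sub) <- 0
    if (max(sub) <= threshold) break   # every remaining pair satisfies the threshold
    keep <- setdiff(keep, names(which.max(rowMeans(sub))))  # drop worst offender
  }
  keep
}

greedy_prune_sketch(mtcars, 0.7)
```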
# Greedy mode: fast approximation
greedy_result <- corrPrune(mtcars, threshold = 0.7, mode = "greedy")
cat("Greedy mode kept:", ncol(greedy_result), "variables\n")
#> Greedy mode kept: 5 variables
Use greedy mode when:
- p is large (hundreds of predictors or more)
- Speed matters more than guaranteed optimality
Let’s measure runtime scaling:
# Generate datasets with increasing p
library(microbenchmark)
benchmark_corrPrune <- function(p_values) {
results <- data.frame(
p = integer(),
exact_time_ms = numeric(),
greedy_time_ms = numeric()
)
for (p in p_values) {
# Generate correlated data
set.seed(123)
cor_mat <- 0.5^abs(outer(1:p, 1:p, "-"))
data <- as.data.frame(MASS::mvrnorm(n = 100, mu = rep(0, p), Sigma = cor_mat))
# Exact mode (skip if p too large)
exact_time_ms <- if (p <= 500) {
median(microbenchmark(
corrPrune(data, threshold = 0.7, mode = "exact"),
times = 3, # Few iterations for speed
unit = "ms"
)$time) / 1e6 # Convert nanoseconds to milliseconds
} else {
NA
}
# Greedy mode
greedy_time_ms <- median(microbenchmark(
corrPrune(data, threshold = 0.7, mode = "greedy"),
times = 3, # Few iterations for speed
unit = "ms"
)$time) / 1e6 # Convert nanoseconds to milliseconds
results <- rbind(results, data.frame(
p = p,
exact_time_ms = round(exact_time_ms, 1),
greedy_time_ms = round(greedy_time_ms, 1)
))
}
results
}
# Benchmark (extended range to show comprehensive scaling behavior)
p_values <- c(10, 20, 50, 100, 200, 300, 500, 1000)
benchmark <- benchmark_corrPrune(p_values)
print(benchmark)
#> p exact_time_ms greedy_time_ms
#> 1 10 0.7 0.2
#> 2 20 0.8 0.4
#> 3 50 1.0 0.7
#> 4 100 1.9 1.4
#> 5 200 5.1 3.1
#> 6 300 12.1 5.7
#> 7 500 32.0 12.5
#> 8 1000            NA           43.2
# Visualize scaling
# Separate data for exact mode (only where available) and greedy mode (all points)
exact_valid <- !is.na(benchmark$exact_time_ms) & benchmark$exact_time_ms > 0
greedy_valid <- !is.na(benchmark$greedy_time_ms) & benchmark$greedy_time_ms > 0
# Replace any zeros with small positive value for log scale
exact_times <- benchmark$exact_time_ms
greedy_times <- benchmark$greedy_time_ms
exact_times[exact_times <= 0 & !is.na(exact_times)] <- 0.01
greedy_times[greedy_times <= 0 & !is.na(greedy_times)] <- 0.01
# Determine y-axis limits from valid data
all_valid_times <- c(exact_times[exact_valid], greedy_times[greedy_valid])
ylim <- c(min(all_valid_times) * 0.5, max(all_valid_times) * 2)
# Plot greedy mode (all points)
plot(benchmark$p[greedy_valid], greedy_times[greedy_valid],
type = "b", col = rgb(0.2, 0.5, 0.8, 1), pch = 19, lwd = 2,
xlab = "Number of Predictors (p)",
ylab = "Time (milliseconds, log scale)",
main = "Exact vs Greedy Scaling",
ylim = ylim,
xlim = range(benchmark$p),
log = "y")
# Add exact mode (only where available)
lines(benchmark$p[exact_valid], exact_times[exact_valid],
type = "b", col = rgb(0.8, 0.2, 0.2, 1), pch = 19, lwd = 2)
# Mark exact mode limit
abline(v = 500, lty = 2, col = "gray50")
text(500, ylim[2] * 0.5, "Exact mode limit", pos = 4, col = "gray30")
legend("topleft",
legend = c("Exact", "Greedy"),
col = c(rgb(0.8, 0.2, 0.2, 1), rgb(0.2, 0.5, 0.8, 1)),
pch = 19,
lwd = 2,
bty = "o",
bg = "white")
Key insight: The scaling behavior depends heavily on correlation structure. For this moderately correlated dataset (ρ = 0.5^|i-j|), exact mode remains practical up to p ≈ 500, while greedy mode scales efficiently beyond p = 1000. Exact mode provides guaranteed optimality at the cost of exponential worst-case complexity, while greedy mode offers substantial speed advantages for large p with near-optimal results in most practical scenarios.
When multiple variables have identical correlation profiles, corrselect uses deterministic tie-breaking:
# Create data with identical correlations
set.seed(123)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1) # Almost identical to x1
x3 <- x1 + rnorm(100, sd = 0.1) # Also almost identical to x1
x4 <- rnorm(100) # Independent
data_ties <- data.frame(x1, x2, x3, x4)
# Run multiple times - always same result
result1 <- corrPrune(data_ties, threshold = 0.95)
result2 <- corrPrune(data_ties, threshold = 0.95)
cat("Run 1 selected:", names(result1), "\n")
#> Run 1 selected: x2 x4
cat("Run 2 selected:", names(result2), "\n")
#> Run 2 selected: x2 x4
cat("Identical:", identical(names(result1), names(result2)), "\n")
#> Identical: TRUE
Tie-breaking rules:
1. Prefer variables with lower mean absolute correlation with others
2. If still tied, prefer the lexicographically first (alphabetical)
This ensures reproducibility across runs, machines, and R versions.
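The two rules can be illustrated with a small hypothetical helper (rank_candidates is not part of corrselect) that orders candidates by rule 1 and breaks remaining ties alphabetically:

```r
# Hypothetical helper illustrating the documented tie-breaking order:
# lower mean absolute correlation first, then alphabetical.
rank_candidates <- function(data, candidates) {
  cmat <- abs(cor(data))
  diag(cmat) <- NA
  mac <- colMeans(cmat[, candidates, drop = FALSE], na.rm = TRUE)
  candidates[order(mac, candidates)]   # rule 1, then rule 2
}

set.seed(1)
d <- data.frame(a = rnorm(50), b = rnorm(50))
d$c <- d$a + rnorm(50, sd = 0.05)  # near-duplicate of a
rank_candidates(d, c("a", "b", "c"))  # "b" first: least correlated with the rest
```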
A custom engine allows you to integrate any modeling framework with modelPrune(), not just base R's lm() or lme4. This enables diagnostic-based pruning for, for example, Bayesian models fit with INLA, GAMs fit with mgcv, or AIC-style model comparison.
The modelPrune() algorithm follows this iterative process:
1. Fit the model on the current predictors.
2. Compute a diagnostic score for each fixed effect.
3. If the worst score exceeds the limit, remove that variable and refit; stop once every score is at or below the threshold.
Your custom engine defines steps 1 and 2; modelPrune() handles the iteration logic.
A custom engine is a named list with two required functions:
my_engine <- list(
# Required: How to fit the model
fit = function(formula, data, ...) {
# Your model fitting code
# Must return a fitted model object
},
# Required: How to compute diagnostics
diagnostics = function(model, fixed_effects) {
# Compute diagnostic scores for each fixed effect
# Higher values = worse (more likely to be removed)
# Must return a named numeric vector
},
# Optional: Name for error messages
name = "my_custom_engine" # Defaults to "custom"
)
Key principle: The diagnostics function must return higher values for worse predictors. This inverted metric ensures the algorithm removes the most problematic variables first.
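As a complete, runnable instance of the template (base R only, with coefficient p-values as an assumed stand-in diagnostic), here is a minimal engine whose two functions can be exercised directly, without calling modelPrune():

```r
# Minimal engine on base R's lm(): higher p-value = weaker evidence = worse.
pval_engine <- list(
  name = "lm_pvalue",
  fit = function(formula, data, ...) lm(formula, data = data, ...),
  diagnostics = function(model, fixed_effects) {
    pv <- summary(model)$coefficients[, "Pr(>|t|)"]
    pv[fixed_effects]  # named numeric vector; intercept excluded
  }
)

m <- pval_engine$fit(mpg ~ wt + qsec + gear, data = mtcars)
pval_engine$diagnostics(m, c("wt", "qsec", "gear"))
```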
INLA (Integrated Nested Laplace Approximations) is a popular package for fast Bayesian inference, especially for spatial and temporal models. Unlike traditional VIF, we can use posterior uncertainty as a pruning criterion: variables with high posterior standard deviation contribute less information to the model.
# Custom engine for INLA
inla_engine <- list(
name = "inla",
fit = function(formula, data, ...) {
# Fit INLA model
INLA::inla(
formula = formula,
data = data,
family = list(...)$family %||% "gaussian",
control.compute = list(config = TRUE),
...
)
},
diagnostics = function(model, fixed_effects) {
# Use posterior SD as "badness" metric
# Higher SD = more uncertain = candidate for removal
summary_fixed <- model$summary.fixed
scores <- summary_fixed[, "sd"]
names(scores) <- rownames(summary_fixed)
# Return scores for fixed effects only
scores[fixed_effects]
}
)
# Usage example
pruned <- modelPrune(
y ~ x1 + x2 + x3 + x4,
data = my_data,
engine = inla_engine,
limit = 0.5 # Remove if posterior SD > 0.5
)
Here fit calls INLA::inla() to compute the posterior distributions, and diagnostics reads each coefficient's posterior SD from model$summary.fixed.
Interpretation: Variables with high posterior SD have coefficients that are uncertain given the data. Removing them reduces model complexity without losing much information.
When to use: Spatial models, hierarchical models, disease mapping, ecological modeling with INLA.
Generalized Additive Models (GAMs) allow non-linear relationships via smooth terms. When pruning GAMs, we want to remove parametric (linear) terms with weak evidence while preserving smooth terms that model non-linear patterns.
# Custom engine for mgcv GAMs
mgcv_engine <- list(
name = "mgcv_gam",
fit = function(formula, data, ...) {
mgcv::gam(formula, data = data, ...)
},
diagnostics = function(model, fixed_effects) {
# Use p-values as badness metric
# Higher p-value = less significant = candidate for removal
summary_obj <- summary(model)
# Extract parametric term p-values
pvals <- summary_obj$p.pv
# Return p-values for fixed effects
pvals[fixed_effects]
}
)
# Usage example
pruned <- modelPrune(
y ~ x1 + x2 + s(x3), # s() for smooth terms
data = my_data,
engine = mgcv_engine,
limit = 0.05 # Remove if p > 0.05
)
Here fit calls mgcv::gam() to fit the additive model, and diagnostics reads the parametric-term p-values.
Important: modelPrune() only removes
parametric (linear) terms. Smooth terms specified with s(),
te(), etc. are never removed
automatically, as they’re part of the model structure.
When to use: Non-linear regression, ecological response curves, time series with trends.
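The two pieces of this engine can be exercised directly on synthetic data (mgcv ships with R; y, x1–x3 below are made up for illustration):

```r
library(mgcv)

set.seed(1)
d <- data.frame(x1 = rnorm(80), x2 = rnorm(80), x3 = rnorm(80))
d$y <- 2 * d$x1 + sin(d$x3) + rnorm(80, sd = 0.3)

m <- mgcv::gam(y ~ x1 + x2 + s(x3), data = d)  # what the engine's fit() does
pv <- summary(m)$p.pv                          # parametric-term p-values
pv[c("x1", "x2")]  # x2 is pure noise, so it should score worse (larger p)
```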
Sometimes you want to prune based on model comparison metrics rather than coefficient-level diagnostics. This example shows how to remove variables that don’t improve model fit according to AIC.
# AIC-based pruning engine
aic_engine <- list(
name = "aic_pruner",
fit = function(formula, data, ...) {
lm(formula, data = data)
},
diagnostics = function(model, fixed_effects) {
# For each predictor, compute ΔAIC if removed
full_aic <- AIC(model)
scores <- numeric(length(fixed_effects))
names(scores) <- fixed_effects
for (var in fixed_effects) {
# Refit without this variable
reduced_formula <- update(formula(model), paste("~ . -", var))
reduced_model <- lm(reduced_formula, data = model$model)
# ΔAIC: negative means removing improves model
# We negate so "higher = worse" convention holds
scores[var] <- -(AIC(reduced_model) - full_aic)
}
scores
}
)
# Usage: with limit = -2, a predictor is removed while its score exceeds -2,
# i.e. only predictors whose removal would worsen AIC by more than 2 survive
pruned <- modelPrune(
y ~ x1 + x2 + x3,
data = my_data,
engine = aic_engine,
limit = -2
)
Key insight: The score is negated (multiplied by -1) to maintain the "higher is worse" convention. A variable whose score stays below the limit of -2 (removing it would worsen AIC by more than 2) is kept; variables whose score exceeds the limit are removed.
When to use: Model selection focused on parsimony, comparing nested models, when VIF isn’t the right metric.
Alternative metrics: You can adapt this pattern for BIC, likelihood ratio tests, or any other model comparison criterion.
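The ΔAIC scoring inside diagnostics can be checked in isolation (the mtcars model below is chosen arbitrarily for illustration):

```r
# Compute the engine's negated Delta-AIC scores for one fitted model.
m <- lm(mpg ~ wt + qsec + gear, data = mtcars)
full_aic <- AIC(m)

scores <- sapply(c("wt", "qsec", "gear"), function(v) {
  reduced_formula <- update(formula(m), paste("~ . -", v))
  reduced <- lm(reduced_formula, data = mtcars)
  -(AIC(reduced) - full_aic)  # higher = dropping the variable helps more
})
scores  # wt is essential here, so its score is strongly negative
```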
modelPrune() automatically validates custom engines to
catch common errors early:
# Invalid engine: missing 'diagnostics'
bad_engine <- list(
fit = function(formula, data, ...) lm(formula, data = data)
# Missing 'diagnostics'
)
tryCatch({
modelPrune(mpg ~ ., data = mtcars, engine = bad_engine, limit = 5)
}, error = function(e) {
cat("Error:", e$message, "\n")
})
#> Error: Custom engine missing required fields: diagnostics
#> Required: 'fit' and 'diagnostics'
Your custom engine must satisfy these requirements:
- It is a named list containing fit and diagnostics functions
- fit returns a model object that your diagnostics function can consume
- diagnostics returns a named numeric vector, with higher values marking worse predictors
If your custom engine fails, check:
# Test your fit function in isolation
test_model <- my_engine$fit(y ~ x1 + x2, data = my_data)
summary(test_model) # Does it work?
# Test your diagnostics function
test_diag <- my_engine$diagnostics(test_model, c("x1", "x2"))
print(test_diag) # Numeric? Named? Correct length?
# Check that "higher = worse" convention is satisfied
# The variable with the highest score should be the one you'd remove first
corrPrune() returns a single subset. Sometimes you want all maximal subsets:
# corrPrune: Single subset
single <- corrPrune(mtcars, threshold = 0.7)
cat("corrPrune returned:", ncol(single), "variables\n")
#> corrPrune returned: 5 variables
# corrSelect: ALL maximal subsets (use higher threshold to ensure subsets exist)
all_subsets <- corrSelect(mtcars, threshold = 0.9)
show(all_subsets)
#> CorrCombo object
#> -----------------
#> Method: bron-kerbosch
#> Correlation: pearson
#> Threshold: 0.900
#> Subsets: 2 maximal subsets
#> Data Rows: 32 used in correlation
#> Pivot: TRUE
#>
#> Top combinations:
#> No. Variables Avg Max Size
#> ------------------------------------------------------------
#> [ 1] mpg, disp, hp, drat, wt, qsec... 0.527 0.888 10
#> [ 2] mpg, cyl, hp, drat, wt, qsec,... 0.531 0.868 10
When multiple maximal subsets exist, you can explore them:
if (length(all_subsets@subset_list) > 0) {
# Display first few subsets
cat("First few maximal subsets:\n")
for (i in seq_len(min(3, length(all_subsets@subset_list)))) {
cat(sprintf("\nSubset %d (avg corr = %.3f):\n", i, all_subsets@avg_corr[i]))
cat(" ", paste(all_subsets@subset_list[[i]], collapse = ", "), "\n")
}
# Analyze subset characteristics
subset_sizes <- lengths(all_subsets@subset_list)
cat("\nSubset sizes:\n")
print(table(subset_sizes))
cat("\nAverage correlations:\n")
print(summary(all_subsets@avg_corr))
} else {
cat("No subsets found at threshold 0.9\n")
}
#> First few maximal subsets:
#>
#> Subset 1 (avg corr = 0.527):
#> mpg, disp, hp, drat, wt, qsec, vs, am, gear, carb
#>
#> Subset 2 (avg corr = 0.531):
#> mpg, cyl, hp, drat, wt, qsec, vs, am, gear, carb
#>
#> Subset sizes:
#> subset_sizes
#> 10
#> 2
#>
#> Average correlations:
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.5269 0.5280 0.5290 0.5290 0.5301 0.5311
if (length(all_subsets@subset_list) > 0) {
# Extract subset with lowest average correlation
best_idx <- which.min(all_subsets@avg_corr)
best_subset <- corrSubset(all_subsets, mtcars, which = best_idx)
cat("Best subset (lowest avg correlation):\n")
print(names(best_subset))
# Extract subset with most predictors
subset_sizes <- lengths(all_subsets@subset_list)
largest_idx <- which.max(subset_sizes)
largest_subset <- corrSubset(all_subsets, mtcars, which = largest_idx)
cat("\nLargest subset:\n")
print(names(largest_subset))
} else {
cat("No subsets to extract at threshold 0.9\n")
}
#> Best subset (lowest avg correlation):
#> [1] "mpg" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
#>
#> Largest subset:
#> [1] "mpg" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
You can define custom criteria for choosing among multiple subsets:
if (length(all_subsets@subset_list) > 0) {
# Custom criterion: Prefer subsets with specific variables
preferred_vars <- c("mpg", "hp", "wt")
# Compute score: number of preferred variables in each subset
scores <- sapply(all_subsets@subset_list, function(vars) {
sum(preferred_vars %in% vars)
})
# Select subset with most preferred variables
best_idx <- which.max(scores)
cat("Subset with most preferred variables (score:", scores[best_idx], "):\n")
cat(paste(all_subsets@subset_list[[best_idx]], collapse = ", "), "\n")
# Extract as data frame
preferred_subset <- corrSubset(all_subsets, mtcars, which = best_idx)
print(head(preferred_subset))
} else {
cat("No subsets available for custom selection\n")
}
#> Subset with most preferred variables (score: 3 ):
#> mpg, disp, hp, drat, wt, qsec, vs, am, gear, carb
#> mpg disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 160 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 160 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 108 93 3.85 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 21.4 258 110 3.08 3.215 19.44 1 0 3 1
#> Hornet Sportabout 18.7 360 175 3.15 3.440 17.02 0 0 3 2
#> Valiant 18.1 225 105 2.76 3.460 20.22 1 0 3 1
For repeated pruning with different thresholds, precompute the correlation matrix:
# Create larger dataset for meaningful timing comparison
set.seed(123)
large_data <- as.data.frame(matrix(rnorm(100 * 50), ncol = 50))
# Benchmark: Recompute correlation every time
time1 <- median(microbenchmark(
{
result1 <- corrPrune(large_data, threshold = 0.7)
result2 <- corrPrune(large_data, threshold = 0.8)
result3 <- corrPrune(large_data, threshold = 0.9)
},
times = 3,
unit = "ms"
)$time) / 1e6 # Convert nanoseconds to milliseconds
# Benchmark: Compute correlation once, reuse
cor_matrix <- cor(large_data)
time2 <- median(microbenchmark(
{
result1 <- MatSelect(cor_matrix, threshold = 0.7)
result2 <- MatSelect(cor_matrix, threshold = 0.8)
result3 <- MatSelect(cor_matrix, threshold = 0.9)
},
times = 3,
unit = "ms"
)$time) / 1e6 # Convert nanoseconds to milliseconds
cat(sprintf("Recomputing each time: %.1f ms\n", time1))
#> Recomputing each time: 2.8 ms
cat(sprintf("Precomputed matrix: %.1f ms\n", time2))
#> Precomputed matrix: 0.7 ms
cat(sprintf("Speedup: %.1fx faster\n", time1 / time2))
#> Speedup: 3.8x faster
Use precomputed matrices when:
- You compare several thresholds on the same data
- The correlation matrix is expensive to recompute (large n or p)
For large datasets (n > 10,000, p > 500):
# Standard (memory-intensive for large n)
cor_matrix <- cor(large_data)
# Memory-efficient alternative (for very large n)
# Process in chunks if needed
compute_cor_chunked <- function(data, chunk_size = 1000) {
  # Accumulate column sums and cross-products one chunk of rows at a time,
  # then assemble the covariance and convert it to a correlation matrix.
  # (One-pass formula; can lose precision for poorly scaled data.)
  n <- nrow(data)
  p <- ncol(data)
  s <- numeric(p)
  cp <- matrix(0, p, p)
  for (start in seq(1, n, by = chunk_size)) {
    chunk <- as.matrix(data[start:min(start + chunk_size - 1, n), , drop = FALSE])
    s <- s + colSums(chunk)
    cp <- cp + crossprod(chunk)
  }
  cov_mat <- (cp - tcrossprod(s) / n) / (n - 1)
  cov2cor(cov_mat)
}
If most correlations are low, consider sparse storage:
# Convert to sparse format (requires Matrix package)
library(Matrix)
# Compute correlation
cor_mat <- cor(data)
# Threshold and convert to sparse
cor_sparse <- cor_mat
cor_sparse[abs(cor_sparse) < 0.3] <- 0 # Set low correlations to 0
cor_sparse <- Matrix(cor_sparse, sparse = TRUE)
# Memory savings
object.size(cor_mat)
object.size(cor_sparse)
For multiple independent pruning operations:
library(parallel)
# Create cluster
cl <- makeCluster(detectCores() - 1)
# Export packages and data to the workers
clusterEvalQ(cl, library(corrselect))
clusterExport(cl, "my_data")
# Parallel pruning with different thresholds
thresholds <- seq(0.5, 0.9, by = 0.1)
results <- parLapply(cl, thresholds, function(thresh) {
corrPrune(my_data, threshold = thresh)
})
# Clean up
stopCluster(cl)
Note: corrselect itself doesn’t parallelize internally (for reproducibility), but you can parallelize across multiple analyses.
Decision tree for mode selection:
p ≤ 15: Use "exact" (fast enough, guaranteed optimal)
15 < p ≤ 25: Use "exact" if time permits, "greedy" if speed critical
p > 25: Use "greedy" or "auto"
p > 100: Always use "greedy"
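The decision tree above can be encoded as a small helper (choose_mode is hypothetical and simply restates the rules; it returns "greedy" where the tree allows "greedy" or "auto"):

```r
choose_mode <- function(p, speed_critical = FALSE) {
  if (p <= 15) {
    "exact"  # fast enough, guaranteed optimal
  } else if (p <= 25 && !speed_critical) {
    "exact"  # still worthwhile if time permits
  } else {
    "greedy" # covers p > 25, and always for p > 100
  }
}

choose_mode(10)                         # "exact"
choose_mode(20, speed_critical = TRUE)  # "greedy"
choose_mode(200)                        # "greedy"
```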
Cause: Threshold too strict - all variables exceed it
# Example: All correlations > 0.9
set.seed(123)
x <- rnorm(100)
high_cor_data <- data.frame(
x1 = x,
x2 = x + rnorm(100, sd = 0.01),
x3 = x + rnorm(100, sd = 0.01)
)
tryCatch({
corrPrune(high_cor_data, threshold = 0.5)
}, error = function(e) {
cat("Error:", e$message, "\n")
})
#> Error: No valid subsets found that satisfy the threshold constraint
Solutions:
1. Increase the threshold
2. Use force_in to keep at least one variable
3. Check the data for near-duplicates
# Solution 1: Increase threshold
result <- corrPrune(high_cor_data, threshold = 0.95)
#> Error in corrPrune(high_cor_data, threshold = 0.95): No valid subsets found that satisfy the threshold constraint
print(names(result))
#> Error: object 'result' not found
# Solution 2: Force keep one variable
result <- corrPrune(high_cor_data, threshold = 0.5, force_in = "x1")
#> Error in corrPrune(high_cor_data, threshold = 0.5, force_in = "x1"): No valid subsets found that satisfy the threshold constraint
print(names(result))
#> Error: object 'result' not found
Cause: Variables in force_in have |correlation| > threshold
# x1 and x2 are highly correlated
tryCatch({
corrPrune(high_cor_data, threshold = 0.5, force_in = c("x1", "x2"))
}, error = function(e) {
cat("Error:", e$message, "\n")
})
#> Error: Variables in 'force_in' violate the threshold constraint. Example: 'x1' and 'x2' have association 1.000 > 0.500
Solution: Either increase the threshold or reduce the force_in set
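A quick pre-check for this error can be sketched as a hypothetical helper (check_force_in is not part of corrselect) that lists the offending pairs before you call corrPrune():

```r
# List force_in pairs whose |correlation| already violates the threshold.
check_force_in <- function(data, force_in, threshold) {
  cmat <- abs(cor(data[force_in]))
  bad <- which(cmat > threshold & upper.tri(cmat), arr.ind = TRUE)
  data.frame(var1 = force_in[bad[, 1]],
             var2 = force_in[bad[, 2]],
             cor  = cmat[bad])
}

set.seed(123)
x <- rnorm(100)
dd <- data.frame(x1 = x, x2 = x + rnorm(100, sd = 0.01), x4 = rnorm(100))
check_force_in(dd, c("x1", "x2", "x4"), threshold = 0.5)  # flags x1 vs x2
```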
Cause: Perfect multicollinearity (R² = 1)
# Create perfect multicollinearity
perfect_data <- data.frame(
y = rnorm(100),
x1 = rnorm(100),
x2 = rnorm(100)
)
perfect_data$x3 <- perfect_data$x1 + perfect_data$x2 # Perfect collinearity
tryCatch({
modelPrune(y ~ ., data = perfect_data, limit = 5)
}, error = function(e) {
cat("Error:", e$message, "\n")
})
#> Warning in summary.lm(fit): essentially perfect fit: summary may be unreliable
#> Warning in summary.lm(fit): essentially perfect fit: summary may be unreliable
#> Warning in summary.lm(fit): essentially perfect fit: summary may be unreliable
#> y x1 x2
#> 1 -0.71524219 -0.07355602 -0.60189285
#> 2 -0.75268897 -1.16865142 -0.99369859
#> 3 -0.93853870 -0.63474826 1.02678506
#> 4 -1.05251328 -0.02884155 0.75106130
#> 5 -0.43715953 0.67069597 -1.50916654
#> ... (95 more rows omitted)
Solution: Use corrPrune() first to remove perfect collinearity:
# Two-step approach
step1 <- corrPrune(perfect_data[, -1], threshold = 0.99)
step2_data <- data.frame(y = perfect_data$y, step1)
result <- modelPrune(y ~ ., data = step2_data, limit = 5)
#> Warning in summary.lm(fit): essentially perfect fit: summary may be unreliable
#> Warning in summary.lm(fit): essentially perfect fit: summary may be unreliable
#> Warning in summary.lm(fit): essentially perfect fit: summary may be unreliable
print(attr(result, "selected_vars"))
#> [1] "x1" "x2"
Correlation thresholds (corrPrune):
- Conservative (strict): 0.5
- Moderate (recommended): 0.7
- Lenient: 0.9
VIF limits (modelPrune):
- Strict: 2.5
- Moderate (recommended): 5
- Lenient: 10
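A quick way to see what each tier implies for your own data is to count the variable pairs each threshold would flag (shown here for mtcars; the 0.5/0.7/0.9 values match the tiers above):

```r
# Count variable pairs whose |correlation| exceeds each recommended threshold.
cmat <- abs(cor(mtcars))
upper <- cmat[upper.tri(cmat)]
sapply(c(strict = 0.5, moderate = 0.7, lenient = 0.9),
       function(t) sum(upper > t))
```

A stricter threshold always flags at least as many pairs, so the counts decrease from left to right.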
data(mtcars)
# Visualize correlation distribution
cor_mat <- cor(mtcars)
cor_vec <- cor_mat[upper.tri(cor_mat)]
par(mfrow = c(1, 2))
# Histogram of correlations
hist(abs(cor_vec), breaks = 30,
main = "Distribution of |Correlations|",
xlab = "|Correlation|",
col = "lightblue")
abline(v = c(0.5, 0.7, 0.9), col = c("red", "blue", "green"), lwd = 2, lty = 2)
legend("topright",
legend = c("0.5 (strict)", "0.7 (moderate)", "0.9 (lenient)"),
col = c("red", "blue", "green"), lwd = 2, lty = 2,
bty = "o", bg = "white")
# Subset size vs threshold
thresholds <- seq(0.3, 0.95, by = 0.05)
sizes <- sapply(thresholds, function(t) {
tryCatch({
ncol(corrPrune(mtcars, threshold = t))
}, error = function(e) NA)
})
plot(thresholds, sizes, type = "b",
xlab = "Threshold",
ylab = "Number of Variables Retained",
main = "Threshold Sensitivity",
col = "blue", lwd = 2)
abline(h = ncol(mtcars), lty = 2, col = "gray")
text(0.3, ncol(mtcars), "Original", pos = 3)
Strategy: Choose the threshold where the curve begins to plateau.
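One way to operationalize the plateau rule is a hypothetical helper (pick_plateau, not part of corrselect) that returns the first threshold after which a step adds fewer than gain new variables:

```r
# Hypothetical helper: smallest threshold after which the next step retains
# fewer than `gain` extra variables (sizes ordered to match thresholds).
pick_plateau <- function(thresholds, sizes, gain = 1) {
  flat <- which(diff(sizes) < gain)
  if (length(flat) == 0) {
    thresholds[length(thresholds)]  # no plateau found: take the last threshold
  } else {
    thresholds[flat[1] + 1]         # threshold at the start of the flat stretch
  }
}

pick_plateau(seq(0.5, 0.8, by = 0.05), c(3, 5, 7, 8, 8, 8, 9))
```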
# Very strict threshold may leave only 1 variable
strict_result <- corrPrune(mtcars, threshold = 0.3)
cat("Variables remaining:", ncol(strict_result), "\n")
#> Variables remaining: 2
# Check if result is usable
if (ncol(strict_result) < 2) {
cat("Warning: Only 1 variable remaining. Consider:\n")
cat(" 1. Increasing threshold\n")
cat(" 2. Using force_in to keep important variables\n")
}
# Create mixed data
mixed_data <- mtcars
mixed_data$cyl <- factor(mixed_data$cyl)
mixed_data$am <- factor(mixed_data$am)
# Use assocSelect for mixed types
result <- assocSelect(mixed_data, threshold = 0.6)
show(result)
#> CorrCombo object
#> -----------------
#> Method: bron-kerbosch
#> Correlation: mixed
#> AssocMethod: numeric_factor = eta, numeric_numeric = pearson, factor_numeric
#> = eta, factor_factor = cramersv
#> Threshold: 0.600
#> Subsets: 20 maximal subsets
#> Data Rows: 32 used in correlation
#> Pivot: TRUE
#>
#> Top combinations:
#> No. Variables Avg Max Size
#> ------------------------------------------------------------
#> [ 1] wt, vs, gear, carb 0.436 0.583 4
#> [ 2] vs, am, carb 0.265 0.570 3
#> [ 3] wt, qsec, gear 0.324 0.583 3
#> [ 4] disp, carb, am 0.348 0.591 3
#> [ 5] drat, vs, carb 0.367 0.570 3
#> ... (15 more combinations)
# 1. Visualize correlations
corrplot::corrplot(cor(data), method = "circle")
# 2. Try multiple thresholds
results <- lapply(c(0.5, 0.7, 0.9), function(t) {
corrPrune(data, threshold = t)
})
# 3. Compare subset sizes
sapply(results, ncol)
# 4. Choose based on your needs
final_data <- results[[2]] # threshold = 0.7
# 1. Use exact mode for reproducibility and optimality
data_pruned <- corrPrune(data, threshold = 0.7, mode = "exact")
# 2. Document in methods section
cat(sprintf(
  paste0("Variables were pruned using corrselect::corrPrune() with threshold = 0.7, ",
         "exact mode, retaining %d of %d original predictors."),
  ncol(data_pruned), ncol(data)
))
# 3. Report which variables were removed
removed <- attr(data_pruned, "removed_vars")
cat("Removed variables:", paste(removed, collapse = ", "))
# Comprehensive variable selection pipeline
pipeline <- function(data, response) {
# Step 1: Remove correlations
step1 <- corrPrune(data, threshold = 0.7, mode = "auto")
# Step 2: VIF cleanup
step2_data <- data.frame(response = response, step1)
step2 <- modelPrune(response ~ ., data = step2_data, limit = 5)
# Step 3: Feature importance (optional)
if (requireNamespace("Boruta", quietly = TRUE)) {
boruta_result <- Boruta::Boruta(response ~ ., data = step2)
important <- Boruta::getSelectedAttributes(boruta_result)
final_data <- step2[, c("response", important)]
} else {
final_data <- step2
}
final_data
}
Key points: a custom engine is a named list providing fit and diagnostics; use force_in to protect important variables.
vignette("quickstart") - 5-minute introduction
vignette("workflows") - Real-world examples
vignette("comparison") - vs caret, Boruta, glmnet
vignette("corrselect_vignette") - Original exact methods vignette
?corrPrune - Association-based pruning
?modelPrune - Model-based pruning
?corrSelect - Exact subset enumeration
sessionInfo()
#> R version 4.5.1 (2025-06-13 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26200)
#>
#> Matrix products: default
#> LAPACK version 3.12.1
#>
#> locale:
#> [1] LC_COLLATE=C
#> [2] LC_CTYPE=English_United States.utf8
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.utf8
#>
#> time zone: Europe/Luxembourg
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] microbenchmark_1.5.0 corrselect_3.0.2
#>
#> loaded via a namespace (and not attached):
#> [1] digest_0.6.37 R6_2.6.1 fastmap_1.2.0 xfun_0.53
#> [5] cachem_1.1.0 knitr_1.50 htmltools_0.5.8.1 rmarkdown_2.30
#> [9] lifecycle_1.0.4 cli_3.6.5 svglite_2.2.2 sass_0.4.10
#> [13] textshaping_1.0.3 jquerylib_0.1.4 systemfonts_1.3.1 compiler_4.5.1
#> [17] rstudioapi_0.17.1 tools_4.5.1 evaluate_1.0.5 bslib_0.9.0
#> [21] Rcpp_1.1.0 yaml_2.3.10 rlang_1.1.6 jsonlite_2.0.0
#> [25] MASS_7.3-65