
Introduction

In this vignette, we demonstrate the FORD algorithm from "A New Measure Of Dependence: Integrated R2". FORD is a forward stepwise variable selection algorithm based on the integrated $R^2$ dependence measure. It is designed for variable ranking in both linear and nonlinear multivariate regression settings.

FORD closely follows the structure of FOCI, introduced in "A Simple Measure Of Conditional Dependence", but replaces the core dependence measure with irdc (written $\nu_n$ below).

Algorithm

Let $Y$ be the response variable and $\mathbf{X} = (X_1, \dots, X_p)$ the predictor variables. Given $n$ i.i.d. samples of $(Y, \mathbf{X})$, FORD proceeds as follows:

1. Select $j_1 = \arg\max_j \nu_n(Y, X_j)$.
2. If $\nu_n(Y, X_{j_1}) \leq 0$, return $\hat{V} = \emptyset$.
3. Iteratively add the feature that gives the maximum increase in irdc: $$ j_{k+1} = \arg\max_{j \notin \{j_1, \ldots, j_k\}} \nu_n(Y, (X_{j_1}, \ldots, X_{j_k}, X_j)) $$
4. Stop at the first $k$ for which the irdc no longer increases: $$ \nu_n(Y, (X_{j_1}, \ldots, X_{j_k}, X_{j_{k+1}})) \leq \nu_n(Y, (X_{j_1}, \ldots, X_{j_k})) $$

If no such $k$ exists, select all variables.
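
The steps above can be sketched as a generic forward-selection loop. This is only an illustration of the control flow, not the package's implementation: `forward_select`, `adj_r2`, `X_toy` and `Y_toy` are hypothetical names introduced here, and the adjusted $R^2$ of a linear fit is used as a stand-in score in place of irdc so that the sketch runs with base R alone.

```r
# Generic forward stepwise selection (illustrative sketch).
# `measure(Y, X)` is any function returning a scalar dependence score
# for a response vector Y and a predictor matrix X.
forward_select <- function(Y, X, measure) {
  p <- ncol(X)
  selected <- integer(0)
  scores <- numeric(0)
  current <- 0  # an empty set scores 0, so step 1 must beat 0
  while (length(selected) < p) {
    candidates <- setdiff(seq_len(p), selected)
    # Score each remaining candidate when appended to the current set
    cand_scores <- vapply(
      candidates,
      function(j) measure(Y, X[, c(selected, j), drop = FALSE]),
      numeric(1)
    )
    best <- which.max(cand_scores)
    # Stop as soon as the best candidate does not increase the score
    if (cand_scores[best] <= current) break
    selected <- c(selected, candidates[best])
    current <- cand_scores[best]
    scores <- c(scores, current)
  }
  # If the loop never breaks, all variables end up selected
  list(selected = selected, scores = scores)
}

# Stand-in score (NOT irdc): adjusted R^2 of a linear fit
adj_r2 <- function(Y, X) summary(lm(Y ~ X))$adj.r.squared

# Toy data: the response depends on columns 1 and 3 only
set.seed(1)
X_toy <- matrix(rnorm(200 * 5), ncol = 5)
Y_toy <- 2 * X_toy[, 1] - X_toy[, 3] + rnorm(200, sd = 0.1)

out <- forward_select(Y_toy, X_toy, adj_r2)
out$selected
out$scores
```

Passing the score as a function argument keeps the loop independent of the dependence measure, which mirrors how FORD and FOCI share a structure and differ only in the measure they maximize.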

Example 1 — Complex nonlinear function of first 4 features

Here, $Y$ depends only on the first 4 features of $X$ in a nonlinear way.

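library(FOCI)  # foci() comes from the FOCI package
library(FORD)  # ford() comes from the FORD package

set.seed(42)   # fix the seed for reproducibility
n <- 2000      # number of samples
p <- 100       # number of candidate predictors
X <- matrix(rnorm(n * p), ncol = p)
colnames(X) <- paste0("X", seq_len(p))
# Only X1, ..., X4 enter the response; the other columns are noise
Y <- X[, 1] * X[, 2] + sin(X[, 1] * X[, 3]) + X[, 4]^2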

FOCI Result

result_foci_1 <- foci(Y, X, numCores = 1)
result_foci_1
#> $selectedVar
#>    index  names
#>    <num> <char>
#> 1:     4     X4
#> 2:     1     X1
#> 3:     2     X2
#> 4:     3     X3
#> 
#> $stepT
#> [1] 0.3356423 0.4027284 0.6226254 0.7619649
#> 
#> attr(,"class")
#> [1] "foci"

FORD Result

result_ford_1 <- ford(Y, X, numCores = 1)
result_ford_1
#> $selectedVar
#>    index  names
#>    <num> <char>
#> 1:     4     X4
#> 2:     1     X1
#> 3:     2     X2
#> 4:     3     X3
#> 
#> $step_nu
#> [1] 0.3198165 0.4026348 0.6324854 0.7668089
#> 
#> attr(,"class")
#> [1] "ford"

Example 2 — Selecting a fixed number of variables

We can force both FOCI and FORD to select a specific number of variables, instead of using the automatic stopping rule, by passing num_features together with stop = FALSE.

FOCI with 5 selected features

result_foci_2 <- foci(Y, X, num_features = 5, stop = FALSE, numCores = 1)
result_foci_2
#> $selectedVar
#>    index  names
#>    <num> <char>
#> 1:     4     X4
#> 2:     1     X1
#> 3:     2     X2
#> 4:     3     X3
#> 5:    66    X66
#> 
#> $stepT
#> [1] 0.3356423 0.4027284 0.6226254 0.7619649 0.6900384
#> 
#> attr(,"class")
#> [1] "foci"

FORD with 5 selected features

result_ford_2 <- ford(Y, X, num_features = 5, stop = FALSE, numCores = 1)
result_ford_2
#> $selectedVar
#>    index  names
#>    <num> <char>
#> 1:     4     X4
#> 2:     1     X1
#> 3:     2     X2
#> 4:     3     X3
#> 5:    31    X31
#> 
#> $step_nu
#> [1] 0.3198165 0.4026348 0.6324854 0.7668089 0.6988827
#> 
#> attr(,"class")
#> [1] "ford"
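
To see why the automatic stopping rule ended at four variables in Example 1, it can help to look at the step-to-step change in the reported scores. The sketch below copies the printed `$stepT` and `$step_nu` values into plain vectors (the names `foci_stepT`, `ford_step_nu`, `gain` and `first_drop` are introduced here for illustration) and assumes each entry is the value of the measure after the corresponding selection step.

```r
# Values copied from the printed output of the two fixed-size runs
foci_stepT   <- c(0.3356423, 0.4027284, 0.6226254, 0.7619649, 0.6900384)
ford_step_nu <- c(0.3198165, 0.4026348, 0.6324854, 0.7668089, 0.6988827)

# Change in the score contributed by steps 2 to 5
gain <- data.frame(
  step = 2:5,
  foci = diff(foci_stepT),
  ford = diff(ford_step_nu)
)
gain

# First step whose addition does not increase the FORD score
first_drop <- which(diff(ford_step_nu) <= 0)[1] + 1
first_drop
```

Under that assumption, the fifth variable is the first one whose addition lowers the score for both methods, which is consistent with the four-variable selections returned when the stopping rule is active.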

Conclusion

FORD provides an interpretable, irdc-based alternative to FOCI for variable selection in regression tasks. It offers a principled forward selection framework that can detect complex nonlinear relationships and be adapted for fixed-size feature subsets.

For further theoretical details, see our paper:
Azadkia and Roudaki (2025), A New Measure Of Dependence: Integrated R2

