This vignette illustrates the basic and advanced usage of MFKnockoffs.filter. For simplicity, we will use synthetic data constructed such that the response only depends on a small fraction of the variables.
set.seed(1234)
# Problem parameters
n = 1000 # number of observations
p = 1000 # number of variables
k = 60 # number of variables with nonzero coefficients
amplitude = 4.5 # signal amplitude (for noise level = 1)
# Generate the variables from a multivariate normal distribution
mu = rep(0,p); Sigma = diag(p)
X = matrix(rnorm(n*p),n)
# Generate the response from a linear model
nonzero = sample(p, k)
beta = amplitude * (1:p %in% nonzero) / sqrt(n)
y.sample <- function(X) X %*% beta + rnorm(n)
y = y.sample(X)
To begin, we call MFKnockoffs.filter with all the default settings.
library(MFKnockoffs)
result = MFKnockoffs.filter(X, y)
We can display the results with
print(result)
## Call:
## MFKnockoffs.filter(X = X, y = y)
##
## Selected variables:
## [1] 3 9 40 46 61 78 85 108 146 148 153 172 173 177 210 223 238
## [18] 248 281 295 301 319 326 334 343 360 364 378 384 389 421 426 428 451
## [35] 494 506 528 557 559 595 668 708 770 787 844 893 906 913 931 937 953
## [52] 959
The default value for the target false discovery rate is 0.1. In this experiment the false discovery proportion is
fdp <- function(selected) sum(beta[selected] == 0) / max(1, length(selected))
fdp(result$selected)
## [1] 0.03846154
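The target FDR can be changed through the q argument of MFKnockoffs.filter, as the random forest example later in this vignette also shows. Keep in mind that the filter controls the false discovery rate on average, so the realized false discovery proportion in any single run may land above or below the nominal level. A minimal sketch (result_liberal is just an illustrative variable name):
# Request a more liberal nominal FDR of 20%
result_liberal = MFKnockoffs.filter(X, y, q = 0.2)
fdp(result_liberal$selected)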
By default, the knockoff filter creates second-order approximate Gaussian knockoffs. This construction estimates from the data the mean \(\mu\) and the covariance \(\Sigma\) of the rows of \(X\), instead of using the true parameters (\(\mu, \Sigma\)) from which the variables were sampled.
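The knockoff construction can also be invoked directly, which is useful if you want to inspect the knockoff copies themselves. The sketch below assumes that MFKnockoffs.create.approximate_gaussian, used later in this vignette, accepts the data matrix alone and returns a knockoff matrix with the same dimensions as X:
# Second-order approximate Gaussian knockoffs (assumed to return an n-by-p matrix)
X_k = MFKnockoffs.create.approximate_gaussian(X)
dim(X_k) # expected to match dim(X)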
The model-free knockoff package includes other knockoff construction methods, all of which have names prefixed with MFKnockoffs.create. In the next snippet, we generate knockoffs using the true model parameters.
gaussian_knockoffs = function(X) MFKnockoffs.create.gaussian(X, mu, Sigma)
result = MFKnockoffs.filter(X, y, knockoffs = gaussian_knockoffs)
print(result)
## Call:
## MFKnockoffs.filter(X = X, y = y, knockoffs = gaussian_knockoffs)
##
## Selected variables:
## [1] 3 9 40 44 46 61 67 78 85 108 146 148 153 172 173 177 210
## [18] 223 238 248 281 295 301 319 326 334 343 360 364 378 384 389 421 426
## [35] 428 451 494 506 510 528 557 559 595 617 668 676 702 708 718 770 775
## [52] 787 844 875 893 906 913 931 937 953 959
Now the false discovery proportion is
fdp(result$selected)
## [1] 0.1311475
By default, the knockoff filter uses a test statistic based on the lasso. Specifically, it uses the statistic MFKnockoffs.stat.glmnet_coef_difference, which computes \[
W_j = |Z_j| - |\tilde{Z}_j|
\] where \(Z_j\) and \(\tilde{Z}_j\) are the lasso coefficient estimates for the jth variable and its knockoff, respectively. The value of the regularization parameter \(\lambda\) is selected by cross-validation and computed with glmnet.
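The built-in statistics can also be evaluated outside of MFKnockoffs.filter. The following sketch assumes that MFKnockoffs.stat.glmnet_coef_difference can be called directly with the same (X, X_k, y) arguments used by the custom statistics shown later in this vignette; a large positive W_j suggests that variable j is more informative than its knockoff:
# Compute the lasso-based statistics by hand
X_k = gaussian_knockoffs(X) # knockoff copy, using the construction defined above
W = MFKnockoffs.stat.glmnet_coef_difference(X, X_k, y) # assumed to accept (X, X_k, y)
summary(W)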
Several other built-in statistics are available, all of which have names prefixed with MFKnockoffs.stat. In the next snippet, we use a statistic based on random forests. We also set a higher target FDR of 0.2.
result = MFKnockoffs.filter(X, y, knockoffs = gaussian_knockoffs, statistic = MFKnockoffs.stat.random_forest, q=0.2)
print(result)
## Call:
## MFKnockoffs.filter(X = X, y = y, knockoffs = gaussian_knockoffs,
## statistic = MFKnockoffs.stat.random_forest, q = 0.2)
##
## Selected variables:
## [1] 9 40 61 108 146 148 173 180 223 238 248 254 301 326 347 378 384
## [18] 421 426 428 557 668 708 770 774 785 795 844 913 931 937 953
fdp(result$selected)
## [1] 0.25
In addition to the predefined test statistics, it is also possible to define your own. A valid statistic must be antisymmetric: swapping a variable with its knockoff should flip the sign of the corresponding \(W_j\). To illustrate this functionality, we implement one of the simplest test statistics from the original knockoff filter paper, namely \[ W_j = \left|X_j^\top \cdot y\right| - \left|\tilde{X}_j^\top \cdot y\right|. \]
my_knockoff_stat <- function(X, X_k, y) {
  abs(t(X) %*% y) - abs(t(X_k) %*% y)
}
result = MFKnockoffs.filter(X, y, knockoffs = gaussian_knockoffs, statistic = my_knockoff_stat)
print(result)
## Call:
## MFKnockoffs.filter(X = X, y = y, knockoffs = gaussian_knockoffs,
## statistic = my_knockoff_stat)
##
## Selected variables:
## [1] 3 9 108 146 148 172 173 223 238 248 301 326 360 364 378 421 426
## [18] 428 494 559 668 708 770 844 875 906 931 937 953 959
fdp(result$selected)
## [1] 0.1
As another example, we show how to customize the grid of \(\lambda\)’s used to compute the lasso path in the default test statistic.
my_lasso_stat <- function(...) MFKnockoffs.stat.glmnet_coef_difference(..., nlambda=100)
result = MFKnockoffs.filter(X, y, knockoffs = gaussian_knockoffs, statistic = my_lasso_stat)
print(result)
## Call:
## MFKnockoffs.filter(X = X, y = y, knockoffs = gaussian_knockoffs,
## statistic = my_lasso_stat)
##
## Selected variables:
## [1] 3 9 40 46 61 78 85 108 148 153 172 173 177 210 223 238 248
## [18] 281 295 301 319 326 334 343 360 364 378 384 389 421 426 428 451 494
## [35] 506 528 559 595 617 668 702 708 770 775 787 844 893 906 913 931 937
## [52] 953 959
fdp(result$selected)
## [1] 0.05660377
The nlambda parameter is passed by MFKnockoffs.stat.glmnet_coef_difference to glmnet, which is used to compute the lasso path. For more information about this and other parameters, see the documentation for MFKnockoffs.stat.glmnet_coef_difference or glmnet::glmnet.
In addition to using the predefined procedures for constructing knockoff variables, it is also possible to create your own knockoffs. To illustrate this functionality, we implement a simple wrapper for the construction of second-order approximate Gaussian knockoffs.
create_knockoffs <- function(X) {
  MFKnockoffs.create.approximate_gaussian(X, method=c('equi'), shrink=T)
}
result = MFKnockoffs.filter(X, y, knockoffs=create_knockoffs)
print(result)
## Call:
## MFKnockoffs.filter(X = X, y = y, knockoffs = create_knockoffs)
##
## Selected variables:
## [1] 3 9 40 46 61 78 85 108 146 148 153 172 173 177 210 223 238
## [18] 248 281 295 301 319 326 334 343 360 364 378 384 389 421 426 428 451
## [35] 494 506 510 528 557 559 596 668 682 708 718 770 775 787 844 893 906
## [52] 913 931 937 953 959
fdp(result$selected)
## [1] 0.08928571
In high-dimensional settings, the semidefinite program used to construct SDP knockoffs becomes computationally intractable, while equicorrelated knockoffs may yield very low power. Approximate SDP knockoffs offer a middle ground: they solve a simpler, relaxed problem based on a block-diagonal approximation of the covariance matrix.
In this example we generate second-order Gaussian knockoffs using the estimated model parameters and the approximate SDP construction. Then we run the knockoff filter.
gaussian_knockoffs = function(X) MFKnockoffs.create.approximate_gaussian(X, method=c('asdp'), shrink=T)
result = MFKnockoffs.filter(X, y, knockoffs = gaussian_knockoffs)
print(result)
## Call:
## MFKnockoffs.filter(X = X, y = y, knockoffs = gaussian_knockoffs)
##
## Selected variables:
## [1] 3 9 40 46 61 67 78 85 108 146 148 153 172 173 177 210 223
## [18] 238 248 281 295 301 319 326 334 343 360 364 378 384 389 421 426 428
## [35] 451 494 506 510 528 557 559 595 668 702 708 718 770 775 787 844 893
## [52] 906 913 931 937 953 959
fdp(result$selected)
## [1] 0.0877193
If you want to look inside the knockoff filter, see the advanced vignette. If you want to see how to use the original knockoff filter, see the fixed-design vignette.