The package \textsf{absorber} provides a tool for selecting variables in a nonlinear multivariate model. More precisely, it performs variable selection from \(n\) observations satisfying the following nonparametric regression model: \begin{equation} \label{eq:model} Y_i = f(x_i) + \varepsilon_i, \quad x_i = \left(x_i^{(1)}, \ldots, x_i^{(p)}\right), \quad 1\leq i \leq n, \end{equation} where \(f\) is an unknown real-valued function and the \(\varepsilon_i\)’s are i.i.d. centered random variables with variance \(\sigma^2\). The \(x_i\)’s are observation points belonging to a compact set \(S\) of \(\mathbb{R}^p\). We also assume that \(f\) actually depends on only \(d\) of the \(p\) variables, with \(d<p\): there exists a real-valued function \(\widetilde{f}\) such that \(f(x)=\widetilde{f}(\widetilde{x})\), where \(x\in\mathbb{R}^p\) and \(\widetilde{x}\in\mathbb{R}^d\) gathers the \(d\) components of \(x\) on which \(f\) depends. Variable selection consists in identifying the components of \(\widetilde{x}\). This approach is described in [1], to which we refer the reader for further details and references.
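To make the model concrete, the following sketch simulates data of this form in base R. The function `f_toy` is a hypothetical example chosen for illustration and is not the function \(f_1\) of [1]; the sample size, the seed, and the choice \(S=[0,1]^p\) are likewise assumptions.

```r
# Hypothetical illustration of the regression model above:
# f depends on only d = 2 of the p = 5 coordinates (here the 3rd and the 5th).
set.seed(1)          # arbitrary seed, for reproducibility
n = 200              # assumed sample size for this sketch
p = 5
sigma = 0.25

# Observation points drawn uniformly in the compact set S = [0, 1]^p
x_sim = matrix(runif(n * p), nrow = n, ncol = p)

# Toy regression function (an assumption, not f_1 from [1])
f_toy = function(x) sin(2 * pi * x[, 3]) + (x[, 5] - 0.5)^2

# Noisy responses: Y_i = f(x_i) + eps_i, with centered Gaussian noise
y_sim = f_toy(x_sim) + rnorm(n, mean = 0, sd = sigma)

dim(x_sim)      # n rows, p columns
length(y_sim)   # n responses
```

Changing columns 1, 2 or 4 of `x_sim` leaves `f_toy(x_sim)` unchanged, which is exactly the sparsity assumption \(f(x)=\widetilde{f}(\widetilde{x})\).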
You can install the released version of \textsf{absorber} from CRAN with:
install.packages("absorber")
We first apply our method to \(n=700\) observations satisfying Model \eqref{eq:model} with \(p=5\) and \(f=f_1\), where \(f_1\) is defined in [1]. These observations are generated with Gaussian noise of standard deviation \(\sigma = 0.25\). In the following, the \(d=2\) relevant variables to select are \(\{3,5\}\) and the irrelevant ones to discard are \(\{1,2,4\}\):
true.dimensions = c(3, 5)
false.dimensions = c(1, 2, 4)
The observation set is loaded from files which are provided within the package, as follows:
library(absorber)
## --- Loading the values of the observation sets --- ##
data('x_obs')
head(x_obs)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.3687684 0.16895845 0.7114856 0.1493075 0.2300115
## [2,] 0.7162858 0.47407370 0.2271114 0.8187909 0.3845692
## [3,] 0.5543277 0.63473174 0.9341467 0.4209710 0.1551578
## [4,] 0.2551628 0.55242762 0.8940447 0.8587429 0.6602330
## [5,] 0.1468073 0.21261063 0.8249912 0.7159358 0.6177809
## [6,] 0.3917696 0.01350068 0.6862343 0.8377919 0.6143807
## --- Loading the corresponding noisy values of the response variable --- ##
data('y_obs')
head(y_obs)
## [1] -0.09049367 -1.56817050 0.02365417 0.32580069 1.07158399 1.21354888
The \(\texttt{absorber}\) function of the \(\texttt{absorber}\) package is applied to the observation points \(\texttt{x\_obs}\) and the responses \(\texttt{y\_obs}\), with the argument \(\texttt{M}\) set to 3:
res = absorber(x = x_obs, y = y_obs, M = 3)
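Additional arguments can also be provided to this function; they are listed in its help page (\(\texttt{?absorber}\)).
The output \(\texttt{res}\) contains the sequence of penalization parameters (\(\texttt{lambdas}\)), the variables selected for each of them (\(\texttt{selec.var}\)), and the variables selected with AIC (\(\texttt{aic.var}\)).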
First, we can print the sequence of penalization parameters \(\lambda\) used in our method:
head(res$lambdas)
## [1] 0.01563831 0.01492752 0.01424904 0.01360140 0.01298320 0.01239309
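As a quick check on the values printed above, the sketch below computes the ratio between consecutive penalization parameters. The vector `lam_head` (a hypothetical name) holds the six printed values copied by hand. The near-constant ratio is an observation about these six values only, not a documented property of the package.

```r
# First six penalization parameters, copied from the output above
lam_head = c(0.01563831, 0.01492752, 0.01424904,
             0.01360140, 0.01298320, 0.01239309)

# Ratio of each value to its predecessor
ratios = lam_head[-1] / lam_head[-length(lam_head)]
round(ratios, 4)

# A (near-)constant ratio means the values are evenly spaced on the log scale
diff(range(ratios)) < 1e-4
```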
We can then print the corresponding sequences of selected variables for each penalization parameter:
head(res$selec.var)
## [[1]]
## NULL
##
## [[2]]
## [1] 3
##
## [[3]]
## [1] 3
##
## [[4]]
## [1] 3
##
## [[5]]
## [1] 3
##
## [[6]]
## [1] 3
and finally the variables selected with AIC:
res$aic.var
## [1] 3 5
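Since the relevant variables are known in this simulated example, the selection can be compared with the ground truth. The sketch below does so with base R set operations. The name `aic_selected` is hypothetical and holds the values printed above so that the snippet runs without the package; in a session it could be replaced by `res$aic.var`.

```r
# Ground truth for this example (same values as defined earlier)
true.dimensions = c(3, 5)
false.dimensions = c(1, 2, 4)

# Variables selected with AIC, copied from the output above
# (hypothetical stand-in for res$aic.var)
aic_selected = c(3, 5)

# Relevant variables that were recovered, irrelevant ones wrongly kept,
# and relevant ones that were missed
true_positives = intersect(aic_selected, true.dimensions)
false_positives = intersect(aic_selected, false.dimensions)
missed = setdiff(true.dimensions, aic_selected)

# TRUE when the selection matches the relevant set exactly
setequal(aic_selected, true.dimensions)
```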
The \(\texttt{plot\_selection}\) function of the \(\texttt{absorber}\) package produces a histogram of the variable selection percentage for each variable on which \(f\) depends. It also displays in red the results obtained with the AIC.
plot_selection(res)
# Percentage of penalization parameters for which each variable is selected
nlam = length(res$lambdas)
occurrence = data.frame(table(unlist(res$selec.var)))
colnames(occurrence) = c("Covariable", "Percentage")
occurrence$Percentage = occurrence$Percentage * 100 / nlam
# Sort by decreasing percentage and keep that order in the factor levels
occurrence = occurrence[order(-occurrence$Percentage), , drop = FALSE]
occurrence$Covariable = factor(occurrence$Covariable,
                               levels = unique(occurrence$Covariable))
# Flag the variables known to be relevant in this simulated example
occurrence$Category = as.factor(ifelse(occurrence$Covariable %in% true.dimensions,
                                       'real features', 'fake features'))
str(occurrence)
## 'data.frame': 5 obs. of 3 variables:
## $ Covariable: Factor w/ 5 levels "3","5","4","2",..: 1 2 3 4 5
## $ Percentage: num 99 65 45 37 36
## $ Category : Factor w/ 2 levels "fake features",..: 2 2 1 1 1
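The counting step relies on `table(unlist(...))` applied to a list with one vector of selected indices per penalization parameter. The sketch below shows this on a small list, `toy_selec`, which is a hypothetical stand-in for `res$selec.var` and not actual package output.

```r
# Toy stand-in for res$selec.var: one vector of selected indices per lambda
toy_selec = list(NULL, 3, 3, c(3, 5))
n_toy = length(toy_selec)

# unlist() drops the empty selection and concatenates the rest;
# table() then counts how often each variable appears
counts = table(unlist(toy_selec))

# Convert counts to percentages over the number of lambdas
percent = as.numeric(counts) * 100 / n_toy
names(percent) = names(counts)
percent
```

Note that a variable which is never selected does not appear in the table at all, so it would also be absent from the resulting data frame.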
We can then plot the results as a histogram of variable selection percentage:
library(ggplot2)

# One fill colour per category actually present in the data
color.order = c('firebrick', 'forestgreen')[which(c('fake features', 'real features')
                                                  %in% levels(occurrence$Category))]
plt_occ = ggplot(data = occurrence, aes(x = Covariable, y = Percentage, fill = Category)) +
  geom_bar(stat = 'identity') +
  scale_fill_manual(values = color.order) +
  ylab('Percentage of selection') +
  theme_bw() +
  theme(legend.title = element_blank(),
        axis.text.x = element_text(size = 16, face = 'bold'),
        axis.text.y = element_text(size = 14),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 15),
        legend.text = element_text(size = 14),
        legend.position = 'bottom',
        legend.key.size = unit(1, "cm"),
        # 'linewidth' replaces the deprecated 'size' argument (ggplot2 >= 3.4.0)
        panel.grid.major = element_line(linewidth = 0.6, linetype = 'solid',
                                        colour = "darkgrey"),
        panel.grid.minor = element_line(linewidth = 0.2, linetype = 'solid',
                                        colour = "darkgrey"))
print(plt_occ)
References
[1] Savino, M. E. and Lévy-Leduc, C. (2024) A novel variable selection method in nonlinear multivariate models using B-splines with an application to geoscience. ⟨hal-04434820⟩.