Package website: release | dev
mlr3fselect is the feature selection package of the mlr3 ecosystem. It selects the optimal feature set for any mlr3 learner. The package supports several optimization algorithms, e.g. Random Search, Recursive Feature Elimination, and Genetic Search. Moreover, it can automatically optimize learners and estimate the performance of optimized feature sets with nested resampling. The package is built on the optimization framework bbotk.
There are several sections about feature selection in the mlr3book.
The gallery features a collection of case studies and demos about optimization.
The cheatsheet summarizes the most important functions of mlr3fselect.
Install the latest release from CRAN:
install.packages("mlr3fselect")
Install the development version from GitHub:
remotes::install_github("mlr-org/mlr3fselect")
We run a feature selection for a support vector machine on the Spam data set.
library("mlr3verse")
tsk("spam")
## <TaskClassif:spam> (4601 x 58): HP Spam Detection
## * Target: type
## * Properties: twoclass
## * Features (57):
## - dbl (57): address, addresses, all, business, capitalAve, capitalLong, capitalTotal,
## charDollar, charExclamation, charHash, charRoundbracket, charSemicolon,
## charSquarebracket, conference, credit, cs, data, direct, edu, email, font, free,
## george, hp, hpl, internet, lab, labs, mail, make, meeting, money, num000, num1999,
## num3d, num415, num650, num85, num857, order, original, our, over, parts, people, pm,
## project, re, receive, remove, report, table, technology, telnet, will, you, your
We construct an instance with the fsi() function. The instance describes the optimization problem.
instance = fsi(
  task = tsk("spam"),
  learner = lrn("classif.svm", type = "C-classification"),
  resampling = rsmp("cv", folds = 3),
  measures = msr("classif.ce"),
  terminator = trm("evals", n_evals = 20)
)
instance
## <FSelectInstanceBatchSingleCrit>
## * State: Not optimized
## * Objective: <ObjectiveFSelect:classif.svm_on_spam>
## * Terminator: <TerminatorEvals>
We select a simple random search as the optimization algorithm.
fselector = fs("random_search", batch_size = 5)
fselector
## <FSelectorBatchRandomSearch>: Random Search
## * Parameters: batch_size=5
## * Properties: single-crit, multi-crit
## * Packages: mlr3fselect
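To make the idea behind a random search over feature sets concrete, here is a conceptual sketch in base R. It is not the implementation used by mlr3fselect. The data set (mtcars), the linear model, the holdout split, the 0.5 inclusion probability, and the helper evaluate_subset() are all assumptions chosen for illustration.

```r
# Conceptual sketch only: random search over feature subsets in base R.
# This is NOT how fs("random_search") is implemented in mlr3fselect.
set.seed(1)

data = mtcars
target = "mpg"
features = setdiff(names(data), target)

# Fixed holdout split (assumption: 22 training rows, 10 test rows)
train_idx = sample(nrow(data), 22)

# Hypothetical helper: holdout mean squared error of a linear model
# restricted to the given feature subset
evaluate_subset = function(subset) {
  fit = lm(reformulate(subset, response = target), data = data[train_idx, ])
  pred = predict(fit, newdata = data[-train_idx, ])
  mean((data[-train_idx, target] - pred)^2)
}

n_evals = 20
archive = vector("list", n_evals)
for (i in seq_len(n_evals)) {
  # Draw each feature with probability 0.5; redraw if the subset is empty
  repeat {
    mask = runif(length(features)) < 0.5
    if (any(mask)) break
  }
  subset = features[mask]
  archive[[i]] = list(features = subset, mse = evaluate_subset(subset))
}

# The best feature set is the one with the lowest holdout error
scores = vapply(archive, function(x) x$mse, numeric(1))
best = archive[[which.min(scores)]]
best$features
best$mse
```

The list archive plays the role of the instance's archive here: every evaluated subset is stored together with its score, and the best entry is looked up afterwards.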
To start the feature selection, we simply pass the instance to the fselector.
fselector$optimize(instance)
The fselector writes the best feature set to the instance.
instance$result_feature_set
## [1] "address" "addresses" "all" "business"
## [5] "capitalAve" "capitalLong" "capitalTotal" "charDollar"
## [9] "charExclamation" "charHash" "charRoundbracket" "charSemicolon"
## [13] "charSquarebracket" "conference" "credit" "cs"
## [17] "data" "direct" "edu" "email"
## [21] "font" "free" "george" "hp"
## [25] "internet" "lab" "labs" "mail"
## [29] "make" "meeting" "money" "num000"
## [33] "num1999" "num3d" "num415" "num650"
## [37] "num85" "num857" "order" "our"
## [41] "parts" "people" "pm" "project"
## [45] "re" "receive" "remove" "report"
## [49] "table" "technology" "telnet" "will"
## [53] "you" "your"
The instance also stores the corresponding measured performance.
instance$result_y
## classif.ce
## 0.07042005
The archive contains all evaluated feature sets.
as.data.table(instance$archive)
## address addresses all business capitalAve capitalLong capitalTotal charDollar charExclamation
## 1: TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 2: TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE
## 3: TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
## 4: TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 5: FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## ---
## 16: FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 17: FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE
## 18: FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE
## 19: TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
## 20: TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE
## 56 variables not shown: [charHash, charRoundbracket, charSemicolon, charSquarebracket, conference, credit, cs, data, direct, edu, ...]
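Because the archive converts to a data.table, it can be sorted and summarized like any other table. The following sketch ranks the evaluated feature sets by classification error. To keep it quick to run, it assumes a decision tree (classif.rpart) and 10 evaluations instead of the SVM setup above. The column name n_selected is introduced here for illustration.

```r
library("mlr3verse")
library("data.table")

# Assumption: a small, fast setup stands in for the SVM instance above
task = tsk("spam")
instance = fsi(
  task = task,
  learner = lrn("classif.rpart"),
  resampling = rsmp("cv", folds = 3),
  measures = msr("classif.ce"),
  terminator = trm("evals", n_evals = 10)
)
fs("random_search", batch_size = 5)$optimize(instance)

archive = as.data.table(instance$archive)

# Count the selected features per evaluated set from the logical feature columns
archive[, n_selected := rowSums(.SD), .SDcols = task$feature_names]

# Five best feature sets by classification error
head(archive[order(classif.ce), .(batch_nr, n_selected, classif.ce)], 5)
```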
We fit a final model with the optimized feature set to make predictions on new data.
task = tsk("spam")
learner = lrn("classif.svm", type = "C-classification")

task$select(instance$result_feature_set)
learner$train(task)
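The performance reported by the instance is measured on the same resampling splits that were used to pick the feature set, so it tends to be optimistic. Nested resampling, mentioned in the introduction, gives a less biased estimate. A minimal sketch with auto_fselector() and resample() could look like this. The decision tree learner, the fold counts, and the budget of 10 evaluations are assumptions chosen to keep the runtime short.

```r
library("mlr3verse")

# AutoFSelector: a learner that runs the feature selection inside $train()
# (first argument passed positionally: the fselector)
afs = auto_fselector(
  fs("random_search", batch_size = 5),
  learner = lrn("classif.rpart"),
  resampling = rsmp("cv", folds = 3),  # inner resampling
  measure = msr("classif.ce"),
  term_evals = 10
)

# Outer resampling: each fold runs its own feature selection on the training part
rr = resample(tsk("spam"), afs, rsmp("cv", folds = 3))

# Aggregated performance over the outer folds
score = rr$aggregate(msr("classif.ce"))
score
```

The outer score estimates how well the whole procedure (feature selection plus model fit) generalizes. It is not the performance of one particular feature set.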