library(dplyr); library(tidyr); library(purrr) # Data wrangling
library(ggplot2); library(stringr) # Plotting and string formatting
library(tidyfit) # Auto-ML modeling
Multinomial classification is possible in `tidyfit` using the methods powered by `glmnet`, `e1071` and `randomForest` (LASSO, Ridge, ElasticNet, AdaLASSO, SVM and Random Forest). Currently, none of the other methods support multinomial classification.^[Feature selection methods such as `relief` or `chisq` can be used with multinomial response variables. I may also add support for multinomial classification with `mboost` in the future.] When the response variable contains more than two classes, `classify` automatically uses a multinomial response for the above-mentioned methods.
Here’s an example using the built-in `iris` dataset:
data("iris")
# For reproducibility
set.seed(42)
ix_tst <- sample(1:nrow(iris), round(nrow(iris)*0.2))
data_trn <- iris[-ix_tst,]
data_tst <- iris[ix_tst,]
as_tibble(iris)
#> # A tibble: 150 × 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>
#>  1          5.1         3.5          1.4         0.2 setosa
#>  2          4.9         3            1.4         0.2 setosa
#>  3          4.7         3.2          1.3         0.2 setosa
#>  4          4.6         3.1          1.5         0.2 setosa
#>  5          5           3.6          1.4         0.2 setosa
#>  6          5.4         3.9          1.7         0.4 setosa
#>  7          4.6         3.4          1.4         0.3 setosa
#>  8          5           3.4          1.5         0.2 setosa
#>  9          4.4         2.9          1.4         0.2 setosa
#> 10          4.9         3.1          1.5         0.1 setosa
#> # ℹ 140 more rows
The response variable is `Species`. The code chunk below fits the above-mentioned algorithms on the training split, using 10-fold cross-validation to select optimal penalties. We then obtain out-of-sample predictions using `predict`. Unlike binomial classification, the `fit` and `pred` objects contain a `class` column with separate coefficients and predictions for each class. The predictions sum to one across classes:
fit <- data_trn %>%
  classify(Species ~ .,
           LASSO = m("lasso"),
           Ridge = m("ridge"),
           ElasticNet = m("enet"),
           AdaLASSO = m("adalasso"),
           SVM = m("svm"),
           `Random Forest` = m("rf"),
           `Least Squares` = m("ridge", lambda = 1e-5),
           .cv = "vfold_cv")
pred <- fit %>%
  predict(data_tst)
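As a quick sanity check, we can verify that the predicted class probabilities sum to one for each test observation and model. The snippet below is a sketch that mirrors the reshaping used for the metrics later on:

# Sketch: reshape predictions to wide format (one column per class) and check
# that the three class probabilities sum to one for every observation
pred %>%
  group_by(model, class) %>%
  mutate(row_n = row_number()) %>%   # observation index within each class
  spread(class, prediction) %>%
  group_by(model) %>%
  summarise(max_dev_from_one = max(abs(setosa + versicolor + virginica - 1)))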
Note that we can add unregularized least squares estimates by setting `lambda = 0` (or very close to zero).
Next, we can use `yardstick` to calculate the log loss metric and compare the performance of the different models:
metrics <- pred %>%
  group_by(model, class) %>%
  mutate(row_n = row_number()) %>%
  spread(class, prediction) %>%
  group_by(model) %>%
  yardstick::mn_log_loss(truth, setosa:virginica)
metrics %>%
  mutate(model = str_wrap(model, 11)) %>%
  ggplot(aes(model, .estimate)) +
  geom_col(fill = "darkblue") +
  theme_bw() +
  theme(axis.title.x = element_blank())
(Figure: column chart of the log loss estimate for each model.)
The least squares estimate performs poorest, while the random forest (nonlinear) and the support vector machine (SVM) achieve the best results. The SVM is estimated with a linear kernel by default (pass `kernel = <chosen_kernel>` to choose a different kernel).
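For instance, a radial basis function kernel could be requested as follows. This is a sketch: the `kernel` argument is assumed to be passed through `m()` to `e1071::svm`, and `SVM (RBF)` is just an illustrative model name:

# Sketch: refit only the SVM, passing a different kernel through m() to e1071::svm
fit_rbf <- data_trn %>%
  classify(Species ~ .,
           `SVM (RBF)` = m("svm", kernel = "radial"),
           .cv = "vfold_cv")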