In this vignette, we take a look at how we can simplify many machine learning tasks using manymodelr. We will look at the core functions first. Once the package has been successfully installed, we can proceed by loading it and exploring some of the key functions.
agg_by_group
As one can guess from the name, this function provides an easy way to aggregate grouped data. We can, for instance, find the number of observations per group in the iris data set. The formula takes the form x ~ y, where y is the grouping variable (in this case Species). One can supply a formula as shown next.
agg_by_group(iris, . ~ Species, length)
#> Grouped By[1]: Species
#>
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 setosa 50 50 50 50
#> 2 versicolor 50 50 50 50
#> 3 virginica 50 50 50 50
head(agg_by_group(mtcars, cyl ~ hp + vs, sum))
#> Grouped By[2]: hp vs
#>
#> hp vs cyl
#> 1 91 0 4
#> 2 110 0 12
#> 3 150 0 16
#> 4 175 0 22
#> 5 180 0 24
#> 6 205 0 8
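Following the same x ~ y pattern as the mtcars call above, the left-hand side can name a single column and any summary function can be supplied. The call below is a short sketch under that assumption, with the base R aggregate equivalent for comparison:
# Mean sepal length per species (assumes the same x ~ y interface as above)
agg_by_group(iris, Sepal.Length ~ Species, mean)
# Base R equivalent
aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)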
multi_model_1
This is one of the core functions of the package. Since the function uses caret as a backend, we need to load that package before we can use it. To avoid several start-up messages showing up, we wrap the call in suppressMessages. This section assumes that one is familiar with machine learning basics. We specify our model types and use the argument valid = TRUE to indicate that we are predicting on a validation set. Had we wanted to predict on unseen test data, this argument would be set to FALSE.
suppressMessages(library(caret))
# Split iris into an 80% training set and a 20% validation set
train_set <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
valid_set <- iris[-train_set, ]
train_set <- iris[train_set, ]
# Five-fold cross-validation
ctrl <- trainControl(method = "cv", number = 5)
set.seed(233)
m <- multi_model_1(train_set, "Species", ".", c("knn", "rpart"),
                   "Accuracy", ctrl, newdata = valid_set, valid = TRUE)
#> [1] "Returning Metrics"
The above message tells us that the model has returned our metrics for each of the model types we specified. These return values, which also include predictions and a summary of the model, can be extracted as shown below.
head(m$Predictions)
#> # A tibble: 6 x 2
#> knn rpart
#> <fct> <fct>
#> 1 setosa setosa
#> 2 setosa setosa
#> 3 setosa setosa
#> 4 setosa setosa
#> 5 setosa setosa
#> 6 setosa setosa
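Since element names can differ between package versions, the full set of return values can be inspected with base R before extracting a component:
names(m)               # list the available components, e.g. Predictions
str(m, max.level = 1)  # one-level overview of the returned object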
modeleR
Yet another core function, modeleR currently allows us to conveniently perform linear regression and analysis of variance. Here we simultaneously build a model and use it to predict, which is particularly useful if you already know how well a given model fits your data. We can extract the results just as we did above.
iris1 <- iris[1:60, ]
iris2 <- iris[60:nrow(iris), ]
m1 <- modeleR(iris1, Sepal.Length, Petal.Length,
              lm, na.rm = TRUE, iris2)
head(m1$Predictions)
#> Predicted
#> 60 5.985141
#> 61 5.821972
#> 62 6.107518
#> 63 6.025933
#> 64 6.311478
#> 65 5.862764
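Since modeleR also supports analysis of variance, the same pattern should apply with aov in place of lm. The call below is a sketch under that assumption, following the argument order of the lm example above:
# A sketch, assuming aov can be passed in the same way as lm above
m2 <- modeleR(iris1, Sepal.Length, Species, aov, na.rm = TRUE, iris2)
head(m2$Predictions)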
get_var_corr
As the name suggests, this function is useful when carrying out correlation tests, as shown below. Setting get_all to TRUE implies that the comparison variable is correlated with all the other variables (as one might do during exploratory data analysis) and you just want to see what the correlations are. The variant function get_var_corr_ (note the underscore at the end) provides a convenient way to get correlations for combinations (pairs) of variables.
get_var_corr(mtcars, "mpg",get_all = TRUE)
#> Comparison_Var Other_Var p_value Correlation lower_ci
#> 1 mpg cyl 6.112687e-10 -0.8521620 -0.92576936
#> 2 mpg disp 9.380327e-10 -0.8475514 -0.92335937
#> 3 mpg hp 1.787835e-07 -0.7761684 -0.88526861
#> 4 mpg drat 1.776240e-05 0.6811719 0.43604838
#> 5 mpg wt 1.293959e-10 -0.8676594 -0.93382641
#> 6 mpg qsec 1.708199e-02 0.4186840 0.08195487
#> 7 mpg vs 3.415937e-05 0.6640389 0.41036301
#> 8 mpg am 2.850207e-04 0.5998324 0.31755830
#> 9 mpg gear 5.400948e-03 0.4802848 0.15806177
#> 10 mpg carb 1.084446e-03 -0.5509251 -0.75464796
#> upper_ci
#> 1 -0.7163171
#> 2 -0.7081376
#> 3 -0.5860994
#> 4 0.8322010
#> 5 -0.7440872
#> 6 0.6696186
#> 7 0.8223262
#> 8 0.7844520
#> 9 0.7100628
#> 10 -0.2503183
To get correlations for only selected variables, one can proceed as follows:
get_var_corr(mtcars, comparison_var = "cyl",
             other_vars = c("disp", "mpg"), get_all = FALSE)
#> Comparison_Var Other_Var p.value Correlation lower_ci upper_ci
#> 1 cyl disp 1.802838e-12 0.9020329 0.8072442 0.9514607
#> 2 cyl mpg 6.112687e-10 -0.8521620 -0.9257694 -0.7163171
Similarly, get_var_corr_
(note the underscore at the end) provides a convenient way to get combination-wise correlations.
head(get_var_corr_(mtcars),6)
#> Comparison_Var Other_Var p.value Correlation lower_ci upper_ci
#> 1 mpg cyl 6.112687e-10 -0.8521620 -0.92576936 -0.7163171
#> 2 mpg disp 9.380327e-10 -0.8475514 -0.92335937 -0.7081376
#> 3 mpg hp 1.787835e-07 -0.7761684 -0.88526861 -0.5860994
#> 4 mpg drat 1.776240e-05 0.6811719 0.43604838 0.8322010
#> 5 mpg wt 1.293959e-10 -0.8676594 -0.93382641 -0.7440872
#> 6 mpg qsec 1.708199e-02 0.4186840 0.08195487 0.6696186
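The result is a regular data frame, so strong relationships can be filtered with base R; the column names follow the output above:
res <- get_var_corr_(mtcars)
# Keep only the strongly correlated pairs
subset(res, abs(Correlation) > 0.8)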
rowdiff
This is useful when trying to find differences between rows. The direction argument specifies how the subtractions are made, while the exclude argument is currently used to remove non-numeric data. Using direction = "reverse" performs a subtraction akin to row x minus row x - 1, where x is the row number, as the base R check after the output below illustrates.
head(rowdiff(iris, exclude = "non_numeric", direction = "reverse"))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 NA NA NA NA
#> 2 -0.2 -0.5 0.0 0.0
#> 3 -0.2 0.2 -0.1 0.0
#> 4 -0.1 -0.1 0.2 0.0
#> 5 0.4 0.5 -0.1 0.0
#> 6 0.4 0.3 0.3 0.2
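For intuition, the first column of the result matches base R's diff on the same column, with an NA in the first row since there is no preceding row to subtract:
# Equivalent base R computation for the Sepal.Length column
head(c(NA, diff(iris$Sepal.Length)))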
This vignette is short and therefore non-exhaustive. The best way to explore this, or any, package is to practice. For more examples, please use ?function_name and see the documentation and sample implementations of a given function.
If you would like to contribute, report issues, or improve any of these functions, please open an issue or pull request at the manymodelr repository.
“Programs must be written for people to read, and only incidentally for machines to execute.” - Harold Abelson
Thank You