Introduction to SuperML

Manish Saraswat

2022-05-10

The SuperML R package is designed to unify the model training process in R, much like in Python. People often spend a lot of time searching for packages and figuring out the syntax for training machine learning models in R; this is especially apparent in users who frequently switch between R and Python. This package provides a Python scikit-learn-like interface (fit, predict) to train models faster.
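Every model in superml follows the same three-step pattern: initialise a trainer, fit it on a data frame naming the target column, then predict. As a minimal sketch, using LMTrainer (covered later in this vignette) on the built-in mtcars data:

```r
library(superml)

# 1. initialise a trainer
lf <- LMTrainer$new(family = "gaussian")

# 2. fit on a data.frame, naming the target column
df <- data.frame(mtcars)
lf$fit(X = df, y = "mpg")

# 3. predict on new data (here, the training data itself)
preds <- lf$predict(df = df)
```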

In addition to building machine learning models, the package provides handy functions for feature engineering.

This package is my ongoing effort to help the R community build ML models more easily and quickly.

Install

You can install the latest CRAN version (recommended) using:

install.packages("superml")

You can install the development version directly from GitHub using:

devtools::install_github("saraswatmks/superml")

Caveats on superml installation

For machine learning, superml builds on existing R packages. Hence, installing superml does not install all of its dependencies. Instead, while training a model, superml will automatically install the required package if it is not found. Still, if you want to install all dependencies at once, you can simply do:

install.packages("superml", dependencies=TRUE)

Examples - Machine Learning Models

This package uses existing R packages to build machine learning models. In this tutorial, we'll use the data.table package for all tasks related to data manipulation.

Regression Data

We'll quickly prepare the data set so it is ready to be served for model training.

load("../data/reg_train.rda")
# if the above doesn't work, you can try: load("reg_train.rda")

library(data.table)
library(caret)
#> Loading required package: ggplot2
#> Loading required package: lattice
library(superml)

library(Metrics)
#> 
#> Attaching package: 'Metrics'
#> The following objects are masked from 'package:caret':
#> 
#>     precision, recall

head(reg_train)
#>    Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
#> 1:  1         60       RL          65    8450   Pave  <NA>      Reg         Lvl
#> 2:  2         20       RL          80    9600   Pave  <NA>      Reg         Lvl
#> 3:  3         60       RL          68   11250   Pave  <NA>      IR1         Lvl
#> 4:  4         70       RL          60    9550   Pave  <NA>      IR1         Lvl
#> 5:  5         60       RL          84   14260   Pave  <NA>      IR1         Lvl
#> 6:  6         50       RL          85   14115   Pave  <NA>      IR1         Lvl
#>    Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
#> 1:    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
#> 2:    AllPub       FR2       Gtl      Veenker      Feedr       Norm     1Fam
#> 3:    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
#> 4:    AllPub    Corner       Gtl      Crawfor       Norm       Norm     1Fam
#> 5:    AllPub       FR2       Gtl      NoRidge       Norm       Norm     1Fam
#> 6:    AllPub    Inside       Gtl      Mitchel       Norm       Norm     1Fam
#>    HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
#> 1:     2Story           7           5      2003         2003     Gable  CompShg
#> 2:     1Story           6           8      1976         1976     Gable  CompShg
#> 3:     2Story           7           5      2001         2002     Gable  CompShg
#> 4:     2Story           7           5      1915         1970     Gable  CompShg
#> 5:     2Story           8           5      2000         2000     Gable  CompShg
#> 6:     1.5Fin           5           5      1993         1995     Gable  CompShg
#>    Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
#> 1:     VinylSd     VinylSd    BrkFace        196        Gd        TA      PConc
#> 2:     MetalSd     MetalSd       None          0        TA        TA     CBlock
#> 3:     VinylSd     VinylSd    BrkFace        162        Gd        TA      PConc
#> 4:     Wd Sdng     Wd Shng       None          0        TA        TA     BrkTil
#> 5:     VinylSd     VinylSd    BrkFace        350        Gd        TA      PConc
#> 6:     VinylSd     VinylSd       None          0        TA        TA       Wood
#>    BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
#> 1:       Gd       TA           No          GLQ        706          Unf
#> 2:       Gd       TA           Gd          ALQ        978          Unf
#> 3:       Gd       TA           Mn          GLQ        486          Unf
#> 4:       TA       Gd           No          ALQ        216          Unf
#> 5:       Gd       TA           Av          GLQ        655          Unf
#> 6:       Gd       TA           No          GLQ        732          Unf
#>    BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
#> 1:          0       150         856    GasA        Ex          Y      SBrkr
#> 2:          0       284        1262    GasA        Ex          Y      SBrkr
#> 3:          0       434         920    GasA        Ex          Y      SBrkr
#> 4:          0       540         756    GasA        Gd          Y      SBrkr
#> 5:          0       490        1145    GasA        Ex          Y      SBrkr
#> 6:          0        64         796    GasA        Ex          Y      SBrkr
#>    1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
#> 1:      856      854            0      1710            1            0        2
#> 2:     1262        0            0      1262            0            1        2
#> 3:      920      866            0      1786            1            0        2
#> 4:      961      756            0      1717            1            0        1
#> 5:     1145     1053            0      2198            1            0        2
#> 6:      796      566            0      1362            1            0        1
#>    HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
#> 1:        1            3            1          Gd            8        Typ
#> 2:        0            3            1          TA            6        Typ
#> 3:        1            3            1          Gd            6        Typ
#> 4:        0            3            1          Gd            7        Typ
#> 5:        1            4            1          Gd            9        Typ
#> 6:        1            1            1          TA            5        Typ
#>    Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
#> 1:          0        <NA>     Attchd        2003          RFn          2
#> 2:          1          TA     Attchd        1976          RFn          2
#> 3:          1          TA     Attchd        2001          RFn          2
#> 4:          1          Gd     Detchd        1998          Unf          3
#> 5:          1          TA     Attchd        2000          RFn          3
#> 6:          0        <NA>     Attchd        1993          Unf          2
#>    GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
#> 1:        548         TA         TA          Y          0          61
#> 2:        460         TA         TA          Y        298           0
#> 3:        608         TA         TA          Y          0          42
#> 4:        642         TA         TA          Y          0          35
#> 5:        836         TA         TA          Y        192          84
#> 6:        480         TA         TA          Y         40          30
#>    EnclosedPorch 3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
#> 1:             0         0           0        0   <NA>  <NA>        <NA>
#> 2:             0         0           0        0   <NA>  <NA>        <NA>
#> 3:             0         0           0        0   <NA>  <NA>        <NA>
#> 4:           272         0           0        0   <NA>  <NA>        <NA>
#> 5:             0         0           0        0   <NA>  <NA>        <NA>
#> 6:             0       320           0        0   <NA> MnPrv        Shed
#>    MiscVal MoSold YrSold SaleType SaleCondition SalePrice
#> 1:       0      2   2008       WD        Normal    208500
#> 2:       0      5   2007       WD        Normal    181500
#> 3:       0      9   2008       WD        Normal    223500
#> 4:       0      2   2006       WD       Abnorml    140000
#> 5:       0     12   2008       WD        Normal    250000
#> 6:     700     10   2009       WD        Normal    143000

split <- createDataPartition(y = reg_train$SalePrice, p = 0.7)
xtrain <- reg_train[split$Resample1]
xtest <- reg_train[!split$Resample1]
# remove features with more than 90% missing values;
# we will also remove the Id column because it doesn't
# contain any useful information
na_cols <- colSums(is.na(xtrain)) / nrow(xtrain)
na_cols <- names(na_cols[which(na_cols > 0.9)])

xtrain[, c(na_cols, "Id") := NULL]
xtest[, c(na_cols, "Id") := NULL]

# encode categorical variables
cat_cols <- names(xtrain)[sapply(xtrain, is.character)]

for(c in cat_cols){
    lbl <- LabelEncoder$new()
    lbl$fit(c(xtrain[[c]], xtest[[c]]))
    xtrain[[c]] <- lbl$transform(xtrain[[c]])
    xtest[[c]] <- lbl$transform(xtest[[c]])
}
#> The data contains NA values. Imputing NA with 'NA'
#> (the message above is repeated once per categorical column; output truncated)

# remove noisy columns
noise <- c('GrLivArea','TotalBsmtSF')

xtrain[, c(noise) := NULL]
xtest[, c(noise) := NULL]

# fill missing values with -1
xtrain[is.na(xtrain)] <- -1
xtest[is.na(xtest)] <- -1

KNN Regression

SVM Regression

Simple Regression

lf <- LMTrainer$new(family="gaussian")
lf$fit(X = xtrain, y = "SalePrice")
summary(lf$model)
#> 
#> Call:
#> stats::glm(formula = f, family = self$family, data = X, weights = self$weights)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -338327   -14806    -1357    13406   264149  
#> 
#> Coefficients:
#>                 Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   -8.403e+05  1.649e+06  -0.510 0.610463    
#> MSSubClass    -7.887e+01  5.723e+01  -1.378 0.168466    
#> MSZoning      -2.326e+03  1.835e+03  -1.267 0.205386    
#> LotFrontage   -2.105e+01  3.450e+01  -0.610 0.542000    
#> LotArea        3.100e-01  1.683e-01   1.843 0.065696 .  
#> Street        -2.802e+04  1.772e+04  -1.581 0.114197    
#> LotShape      -3.803e+02  2.185e+03  -0.174 0.861871    
#> LandContour   -1.749e+03  1.815e+03  -0.964 0.335394    
#> Utilities     -6.236e+04  3.511e+04  -1.776 0.076056 .  
#> LotConfig      2.268e+03  1.135e+03   1.998 0.046028 *  
#> LandSlope      1.197e+04  4.634e+03   2.582 0.009960 ** 
#> Neighborhood  -5.439e+02  2.210e+02  -2.461 0.014023 *  
#> Condition1    -3.171e+03  1.012e+03  -3.134 0.001778 ** 
#> Condition2    -1.546e+04  3.517e+03  -4.395 1.23e-05 ***
#> BldgType      -2.900e+03  2.196e+03  -1.320 0.186987    
#> HouseStyle    -6.204e+02  1.010e+03  -0.615 0.539028    
#> OverallQual    1.531e+04  1.439e+03  10.643  < 2e-16 ***
#> OverallCond    6.264e+03  1.281e+03   4.889 1.19e-06 ***
#> YearBuilt      3.986e+02  8.696e+01   4.584 5.16e-06 ***
#> YearRemodAdd   9.653e+01  8.274e+01   1.167 0.243654    
#> RoofStyle      5.483e+03  2.175e+03   2.521 0.011880 *  
#> RoofMatl      -1.585e+04  2.299e+03  -6.893 9.95e-12 ***
#> Exterior1st   -1.023e+03  7.264e+02  -1.408 0.159511    
#> Exterior2nd    8.290e+02  6.596e+02   1.257 0.209115    
#> MasVnrType     2.483e+03  1.725e+03   1.439 0.150353    
#> MasVnrArea     2.922e+01  7.096e+00   4.118 4.15e-05 ***
#> ExterQual     -5.428e+02  2.547e+03  -0.213 0.831317    
#> ExterCond      1.057e+03  2.615e+03   0.404 0.686263    
#> Foundation    -3.088e+03  2.012e+03  -1.534 0.125270    
#> BsmtQual       6.577e+03  1.581e+03   4.159 3.48e-05 ***
#> BsmtCond      -2.711e+03  2.021e+03  -1.341 0.180111    
#> BsmtExposure   1.313e+03  1.064e+03   1.234 0.217481    
#> BsmtFinType1  -1.137e+03  8.413e+02  -1.352 0.176684    
#> BsmtFinSF1     5.771e+00  6.016e+00   0.959 0.337663    
#> BsmtFinType2  -1.152e+03  1.091e+03  -1.056 0.291114    
#> BsmtFinSF2     1.731e+01  1.105e+01   1.567 0.117433    
#> BsmtUnfSF      1.516e+00  5.735e+00   0.264 0.791534    
#> Heating       -1.409e+03  3.400e+03  -0.414 0.678660    
#> HeatingQC     -1.390e+03  1.433e+03  -0.970 0.332129    
#> CentralAir     2.728e+03  5.508e+03   0.495 0.620461    
#> Electrical     3.957e+03  2.241e+03   1.766 0.077773 .  
#> `1stFlrSF`     5.947e+01  7.274e+00   8.176 9.36e-16 ***
#> `2ndFlrSF`     5.111e+01  6.069e+00   8.422  < 2e-16 ***
#> LowQualFinSF  -1.606e+00  2.944e+01  -0.055 0.956526    
#> BsmtFullBath   1.121e+04  3.063e+03   3.659 0.000267 ***
#> BsmtHalfBath   4.686e+03  4.621e+03   1.014 0.310817    
#> FullBath       6.199e+03  3.298e+03   1.879 0.060511 .  
#> HalfBath      -1.834e+03  3.089e+03  -0.594 0.552889    
#> BedroomAbvGr  -7.087e+03  1.972e+03  -3.594 0.000343 ***
#> KitchenAbvGr  -1.891e+04  5.937e+03  -3.185 0.001494 ** 
#> KitchenQual    9.122e+03  1.595e+03   5.719 1.44e-08 ***
#> TotRmsAbvGrd   2.779e+03  1.452e+03   1.914 0.055896 .  
#> Functional    -4.440e+03  1.601e+03  -2.774 0.005647 ** 
#> Fireplaces    -1.103e+03  2.680e+03  -0.412 0.680682    
#> FireplaceQu    3.605e+03  1.440e+03   2.504 0.012460 *  
#> GarageType    -3.729e+01  1.167e+03  -0.032 0.974509    
#> GarageYrBlt   -1.006e+01  6.902e+00  -1.458 0.145093    
#> GarageFinish   6.266e+02  1.560e+03   0.402 0.688096    
#> GarageCars     1.598e+04  3.539e+03   4.516 7.10e-06 ***
#> GarageArea    -5.789e+00  1.199e+01  -0.483 0.629460    
#> GarageQual     1.132e+03  4.160e+03   0.272 0.785537    
#> GarageCond    -1.720e+03  2.300e+03  -0.748 0.454730    
#> PavedDrive    -7.645e+02  3.137e+03  -0.244 0.807503    
#> WoodDeckSF     2.180e+01  9.438e+00   2.310 0.021095 *  
#> OpenPorchSF   -9.759e+00  1.843e+01  -0.529 0.596665    
#> EnclosedPorch  1.735e+01  1.979e+01   0.877 0.380947    
#> `3SsnPorch`    2.172e+00  3.562e+01   0.061 0.951392    
#> ScreenPorch    5.565e+01  2.067e+01   2.692 0.007234 ** 
#> PoolArea      -3.991e+01  3.487e+01  -1.144 0.252723    
#> Fence         -2.484e+03  1.490e+03  -1.667 0.095834 .  
#> MiscVal        6.502e+00  4.441e+00   1.464 0.143447    
#> MoSold        -3.681e+01  3.896e+02  -0.094 0.924751    
#> YrSold        -9.048e+01  8.216e+02  -0.110 0.912340    
#> SaleType       3.043e+03  1.341e+03   2.269 0.023488 *  
#> SaleCondition -1.317e+03  1.400e+03  -0.941 0.347007    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 1064227128)
#> 
#>     Null deviance: 6.6216e+12  on 1023  degrees of freedom
#> Residual deviance: 1.0100e+12  on  949  degrees of freedom
#> AIC: 24264
#> 
#> Number of Fisher Scoring iterations: 2
predictions <- lf$predict(df = xtest)
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 32211.9

Lasso Regression

Ridge Regression

Logistic Regression with CV

Random Forest

rf <- RFTrainer$new(n_estimators = 500, classification = 0)
rf$fit(X = xtrain, y = "SalePrice")
pred <- rf$predict(df = xtest)
rf$get_importance()
#>               tmp.order.tmp..decreasing...TRUE..
#> OverallQual                         843528867996
#> GarageCars                          541332222299
#> 1stFlrSF                            483318749419
#> GarageArea                          473506726377
#> YearBuilt                           383509412736
#> GarageYrBlt                         292494440939
#> 2ndFlrSF                            245273696373
#> BsmtFinSF1                          236385559010
#> FullBath                            235657120093
#> TotRmsAbvGrd                        204160296183
#> YearRemodAdd                        199854187618
#> KitchenQual                         192826779801
#> LotArea                             167638537104
#> ExterQual                           165651128526
#> MasVnrArea                          164778221394
#> Fireplaces                          157096715586
#> FireplaceQu                         137297641709
#> BsmtQual                            121920438920
#> OpenPorchSF                         120841460753
#> LotFrontage                         118101178039
#> Foundation                          103990733529
#> WoodDeckSF                           69848765508
#> Neighborhood                         67261100945
#> BsmtUnfSF                            62919521322
#> BedroomAbvGr                         52860306010
#> BsmtFinType1                         51982234371
#> HeatingQC                            50189063151
#> GarageType                           48044355273
#> Exterior2nd                          42596379941
#> MoSold                               40062903329
#> MSSubClass                           39564307015
#> OverallCond                          32754782066
#> Exterior1st                          31737713942
#> RoofStyle                            29179553505
#> HalfBath                             25624426339
#> PoolArea                             24774423758
#> HouseStyle                           23762558236
#> YrSold                               22444732721
#> BsmtFullBath                         22260611232
#> GarageFinish                         22137108912
#> LotShape                             20767075998
#> BsmtExposure                         17364314649
#> SaleCondition                        16866276843
#> RoofMatl                             15874497235
#> MasVnrType                           15852982148
#> Fence                                15453138099
#> CentralAir                           14327658513
#> LotConfig                            13549635384
#> SaleType                             13508654676
#> MSZoning                             13498853944
#> ScreenPorch                          13397400798
#> LandContour                          13319110213
#> BldgType                             12867140545
#> EnclosedPorch                        11296479669
#> BsmtHalfBath                          9688039829
#> KitchenAbvGr                          9584674634
#> Condition1                            8747947235
#> ExterCond                             8169124003
#> BsmtFinSF2                            7683805820
#> BsmtCond                              7619078071
#> GarageQual                            7562150192
#> BsmtFinType2                          6735314538
#> GarageCond                            6672057801
#> LandSlope                             6584672331
#> Electrical                            5399227197
#> PavedDrive                            4501353384
#> Functional                            4190006815
#> Condition2                            2195681528
#> MiscVal                               1928894547
#> Heating                               1569086808
#> 3SsnPorch                             1480581688
#> LowQualFinSF                          1056659610
#> Street                                 438834513
#> Utilities                               88330013
rmse(actual = xtest$SalePrice, predicted = pred)
#> [1] 30455

Xgboost

Grid Search

Note that accuracy and auc are classification metrics; on this regression task they come out as 0 and NaN respectively in the search results.

xgb <- XGBTrainer$new(objective = "reg:linear")

gst <- GridSearchCV$new(trainer = xgb,
                        parameters = list(n_estimators = c(10, 50),
                                          max_depth = c(5, 2)),
                        n_folds = 3,
                        scoring = c('accuracy', 'auc'))
gst$fit(xtrain, "SalePrice")
#> [1] "entering grid search"
#> [1] "In total, 4 models will be trained"
#> converting the data into xgboost format..
#> starting with training...
#> [12:05:56] WARNING: amalgamation/../src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
#> [12:05:56] WARNING: amalgamation/../src/learner.cc:627: 
#> Parameters: { "nrounds" } might not be used.
#> 
#>   This could be a false alarm, with some parameters getting used by language bindings but
#>   then being mistakenly passed down to XGBoost core, or some parameter actually being used
#>   but getting flagged wrongly here. Please open an issue if you find any such cases.
#> 
#> 
#> [1]  train-rmse:143834.662493 
#> Will train until train_rmse hasn't improved in 50 rounds.
#> 
#> [10] train-rmse:16183.222568
#> converting the data into xgboost format..
#> starting with training...
#> (the deprecation and "nrounds" warnings above are repeated for every model; omitted below)
#> [1]  train-rmse:140929.342988
#> Will train until train_rmse hasn't improved in 50 rounds.
#>
#> [10] train-rmse:15550.145000
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:143251.233056
#> Will train until train_rmse hasn't improved in 50 rounds.
#>
#> [10] train-rmse:16330.818909
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:143834.662493
#> Will train until train_rmse hasn't improved in 50 rounds.
#>
#> [50] train-rmse:3819.400511
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:140929.342988
#> Will train until train_rmse hasn't improved in 50 rounds.
#>
#> [50] train-rmse:3905.734710
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:143251.233056
#> Will train until train_rmse hasn't improved in 50 rounds.
#>
#> [50] train-rmse:3718.327761
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:144732.651251
#> Will train until train_rmse hasn't improved in 50 rounds.
#>
#> [10] train-rmse:32563.362240
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:141943.965863
#> Will train until train_rmse hasn't improved in 50 rounds.
#>
#> [10] train-rmse:31118.978041
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:144076.337327
#> Will train until train_rmse hasn't improved in 50 rounds.
#>
#> [10] train-rmse:27683.098538
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:144732.651251
#> Will train until train_rmse hasn't improved in 50 rounds.
#>
#> [50] train-rmse:17622.435150
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:141943.965863
#> Will train until train_rmse hasn't improved in 50 rounds.
#>
#> [50] train-rmse:16389.178107
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:144076.337327
#> Will train until train_rmse hasn't improved in 50 rounds.
#>
#> [50] train-rmse:15743.306882
gst$best_iteration()
#> $n_estimators
#> [1] 10
#> 
#> $max_depth
#> [1] 5
#> 
#> $accuracy_avg
#> [1] 0
#> 
#> $accuracy_sd
#> [1] 0
#> 
#> $auc_avg
#> [1] NaN
#> 
#> $auc_sd
#> [1] NA

Random Search
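Random search mirrors the grid search above but samples a fixed number of parameter combinations instead of trying them all. A sketch, assuming RandomSearchCV follows the same interface as GridSearchCV with an additional n_iter argument for the number of sampled combinations:

```r
rf <- RFTrainer$new(classification = 0)
rst <- RandomSearchCV$new(trainer = rf,
                          parameters = list(n_estimators = c(10, 50, 100),
                                            max_depth = c(2, 5, 8)),
                          n_folds = 3,
                          scoring = c('accuracy', 'auc'),
                          n_iter = 4)
rst$fit(xtrain, "SalePrice")
rst$best_iteration()
```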

Binary Classification Data

Here, we will solve a simple binary classification problem: predicting which passengers survived the Titanic disaster. The idea is to demonstrate how to use this package to solve classification problems.

Data Preparation

# load the classification data
load('../data/cla_train.rda')
# if the above doesn't work, you can try: load("cla_train.rda")

head(cla_train)
#>    PassengerId Survived Pclass
#> 1:           1        0      3
#> 2:           2        1      1
#> 3:           3        1      3
#> 4:           4        1      1
#> 5:           5        0      3
#> 6:           6        0      3
#>                                                   Name    Sex Age SibSp Parch
#> 1:                             Braund, Mr. Owen Harris   male  22     1     0
#> 2: Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
#> 3:                              Heikkinen, Miss. Laina female  26     0     0
#> 4:        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
#> 5:                            Allen, Mr. William Henry   male  35     0     0
#> 6:                                    Moran, Mr. James   male  NA     0     0
#>              Ticket    Fare Cabin Embarked
#> 1:        A/5 21171  7.2500              S
#> 2:         PC 17599 71.2833   C85        C
#> 3: STON/O2. 3101282  7.9250              S
#> 4:           113803 53.1000  C123        S
#> 5:           373450  8.0500              S
#> 6:           330877  8.4583              Q

# split the data
split <- createDataPartition(y = cla_train$Survived, p = 0.7)
xtrain <- cla_train[split$Resample1]
xtest <- cla_train[!split$Resample1]

# encode categorical variables - shorter way
for(c in c('Embarked','Sex','Cabin')) {
    lbl <- LabelEncoder$new()
    lbl$fit(c(xtrain[[c]], xtest[[c]]))
    xtrain[[c]] <- lbl$transform(xtrain[[c]])
    xtest[[c]] <- lbl$transform(xtest[[c]])
}
#> The data contains blank values. Imputing them with 'NA' 
#> The data contains blank values. Imputing them with 'NA' 
#> The data contains blank values. Imputing them with 'NA' 
#> The data contains blank values. Imputing them with 'NA' 
#> The data contains blank values. Imputing them with 'NA'

# impute missing values
xtrain[, Age := replace(Age, is.na(Age), median(Age, na.rm = T))]
xtest[, Age := replace(Age, is.na(Age), median(Age, na.rm = T))]

# drop these features
to_drop <- c('PassengerId','Ticket','Name')

xtrain <- xtrain[,-c(to_drop), with=F]
xtest <- xtest[,-c(to_drop), with=F]

Now our data is ready for model training. Let’s do it.

KNN Classification
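As a minimal sketch (class and argument names follow superml's documented KNNTrainer interface; defaults may differ across versions), a k-nearest-neighbours classifier can be trained on the encoded data like this:

```r
# KNN is a lazy learner, so both train and test are passed at fit time;
# k is a tuning parameter worth trying several values for
knn <- KNNTrainer$new(k = 2, prob = TRUE, type = "class")
knn$fit(train = xtrain, test = xtest, y = "Survived")
probs <- knn$predict(type = "prob")
auc(actual = xtest$Survived, predicted = probs)
```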

Naive Bayes Classification
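A naive bayes model follows the same fit/predict pattern (sketch based on superml's NBTrainer class; check ?NBTrainer for the exact arguments in your installed version):

```r
# naive bayes classifier
nb <- NBTrainer$new()
nb$fit(xtrain, "Survived")
preds <- nb$predict(xtest)
auc(actual = xtest$Survived, predicted = preds)
```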

SVM Classification

Logistic Regression
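Logistic regression uses the same LMTrainer class as linear regression, with the binomial family (sketch; argument names follow superml's documented interface):

```r
# plain logistic regression (binomial family)
lf <- LMTrainer$new(family = "binomial")
lf$fit(X = xtrain, y = "Survived")
preds <- lf$predict(df = xtest)
auc(actual = xtest$Survived, predicted = preds)
```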

Lasso Logistic Regression
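For an L1-penalised model, LMTrainer accepts a regularisation mixing parameter (sketch; in glmnet's convention, which superml follows, alpha = 1 selects the lasso penalty):

```r
# lasso logistic regression: alpha = 1 gives the L1 penalty
lasso <- LMTrainer$new(family = "binomial", alpha = 1)
lasso$fit(X = xtrain, y = "Survived")
preds <- lasso$predict(df = xtest)
auc(actual = xtest$Survived, predicted = preds)
```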

Ridge Logistic Regression
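Ridge regression is the same sketch with the L2 penalty selected (alpha = 0 in glmnet's convention, which superml follows):

```r
# ridge logistic regression: alpha = 0 gives the L2 penalty
ridge <- LMTrainer$new(family = "binomial", alpha = 0)
ridge$fit(X = xtrain, y = "Survived")
preds <- ridge$predict(df = xtest)
auc(actual = xtest$Survived, predicted = preds)
```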

Random Forest
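A random forest can be trained via superml's RFTrainer (sketch; n_estimators controls the number of trees, and classification = 1 switches the underlying forest to classification mode):

```r
# random forest classifier; more trees generally help up to a point
rf <- RFTrainer$new(n_estimators = 100, classification = 1)
rf$fit(xtrain, "Survived")
preds <- rf$predict(xtest)
auc(actual = xtest$Survived, predicted = preds)
```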

Xgboost
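Before tuning it with grid search below, a single xgboost model can be trained directly (sketch using XGBTrainer, the same class the grid search wraps; the valid argument is used for evaluation during training):

```r
# gradient boosted trees on the encoded data
xgb <- XGBTrainer$new(objective = "binary:logistic",
                      n_estimators = 100,
                      eval_metric = "auc",
                      maximize = TRUE,
                      max_depth = 5,
                      learning_rate = 0.1)
xgb$fit(xtrain, "Survived", valid = xtest)
preds <- xgb$predict(xtest)
auc(actual = xtest$Survived, predicted = preds)
```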

Grid Search

xgb <- XGBTrainer$new(objective = "binary:logistic")
gst <- GridSearchCV$new(trainer = xgb,
                        parameters = list(n_estimators = c(10, 50),
                                          max_depth = c(5, 2)),
                        n_folds = 3,
                        scoring = c('accuracy', 'auc'))
gst$fit(xtrain, "Survived")
#> [1] "entering grid search"
#> [1] "In total, 4 models will be trained"
#> converting the data into xgboost format..
#> starting with training...
#> [12:05:58] WARNING: amalgamation/../src/learner.cc:627: 
#> Parameters: { "nrounds" } might not be used.
#> 
#>   This could be a false alarm, with some parameters getting used by language bindings but
#>   then being mistakenly passed down to XGBoost core, or some parameter actually being used
#>   but getting flagged wrongly here. Please open an issue if you find any such cases.
#> 
#> 
#> [1]  train-logloss:0.559993 
#> Will train until train_logloss hasn't improved in 50 rounds.
#> 
#> [10] train-logloss:0.304225
#> (the same conversion messages and "nrounds might not be used" warning are printed for each of the remaining 11 model/fold runs; repeats omitted)
#> [10] train-logloss:0.302156
#> [10] train-logloss:0.284697
#> [50] train-logloss:0.172763
#> [50] train-logloss:0.160542
#> [50] train-logloss:0.134347
#> [10] train-logloss:0.416683
#> [10] train-logloss:0.416606
#> [10] train-logloss:0.393512
#> [50] train-logloss:0.325189
#> [50] train-logloss:0.320031
#> [50] train-logloss:0.308201
gst$best_iteration()
#> $n_estimators
#> [1] 10
#> 
#> $max_depth
#> [1] 5
#> 
#> $accuracy_avg
#> [1] 0
#> 
#> $accuracy_sd
#> [1] 0
#> 
#> $auc_avg
#> [1] 0.8630918
#> 
#> $auc_sd
#> [1] 0.02270548

Random Search

Let’s create a new feature from the target variable using target encoding and test a model.