The superml R package is designed to unify the model training process in R, much like scikit-learn does in Python. People often spend a lot of time searching for packages and figuring out the syntax for training machine learning models in R; this is especially apparent in users who frequently switch between R and Python. This package provides a scikit-learn-like interface (fit, predict) to train models faster.
In addition to building machine learning models, it provides handy functionality for feature engineering.
This ambitious package is my ongoing effort to help the R community build ML models easily and quickly in R.
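Every trainer follows the same scikit-learn-style pattern: create the trainer, call fit, then predict. Here is a minimal sketch of that workflow using the built-in mtcars data set (this example is not part of the original tutorial):
library(superml)
# create a trainer, fit it on training data, then predict on new data
lf <- LMTrainer$new(family = "gaussian")
lf$fit(X = mtcars, y = "mpg")
preds <- lf$predict(df = mtcars)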
You can install the latest CRAN version (recommended) using:
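install.packages("superml")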
You can install the development version directly from GitHub using:
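# assuming you have the devtools (or remotes) package installed; the
# repository below is the package author's GitHub repo
devtools::install_github("saraswatmudit/superml")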
For machine learning, superml builds on existing R packages. Hence, installing superml does not install all of its dependencies up front; instead, when you train a model, superml automatically installs the required package if it is not found. Still, if you want to install all dependencies at once, you can simply do:
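install.packages("superml", dependencies = TRUE)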
This package uses existing R packages to build machine learning models. In this tutorial, we'll use the data.table package for all data-manipulation tasks.
We'll quickly prepare the data set so it is ready to be served for model training.
load("../data/reg_train.rda")
# if the above doesn't work, you can try: load("reg_train.rda")
library(data.table)
library(caret)
#> Loading required package: ggplot2
#> Loading required package: lattice
library(superml)
library(Metrics)
#>
#> Attaching package: 'Metrics'
#> The following objects are masked from 'package:caret':
#>
#> precision, recall
head(reg_train)
#> Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
#> 1: 1 60 RL 65 8450 Pave <NA> Reg Lvl
#> 2: 2 20 RL 80 9600 Pave <NA> Reg Lvl
#> 3: 3 60 RL 68 11250 Pave <NA> IR1 Lvl
#> 4: 4 70 RL 60 9550 Pave <NA> IR1 Lvl
#> 5: 5 60 RL 84 14260 Pave <NA> IR1 Lvl
#> 6: 6 50 RL 85 14115 Pave <NA> IR1 Lvl
#> Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
#> 1: AllPub Inside Gtl CollgCr Norm Norm 1Fam
#> 2: AllPub FR2 Gtl Veenker Feedr Norm 1Fam
#> 3: AllPub Inside Gtl CollgCr Norm Norm 1Fam
#> 4: AllPub Corner Gtl Crawfor Norm Norm 1Fam
#> 5: AllPub FR2 Gtl NoRidge Norm Norm 1Fam
#> 6: AllPub Inside Gtl Mitchel Norm Norm 1Fam
#> HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
#> 1: 2Story 7 5 2003 2003 Gable CompShg
#> 2: 1Story 6 8 1976 1976 Gable CompShg
#> 3: 2Story 7 5 2001 2002 Gable CompShg
#> 4: 2Story 7 5 1915 1970 Gable CompShg
#> 5: 2Story 8 5 2000 2000 Gable CompShg
#> 6: 1.5Fin 5 5 1993 1995 Gable CompShg
#> Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
#> 1: VinylSd VinylSd BrkFace 196 Gd TA PConc
#> 2: MetalSd MetalSd None 0 TA TA CBlock
#> 3: VinylSd VinylSd BrkFace 162 Gd TA PConc
#> 4: Wd Sdng Wd Shng None 0 TA TA BrkTil
#> 5: VinylSd VinylSd BrkFace 350 Gd TA PConc
#> 6: VinylSd VinylSd None 0 TA TA Wood
#> BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
#> 1: Gd TA No GLQ 706 Unf
#> 2: Gd TA Gd ALQ 978 Unf
#> 3: Gd TA Mn GLQ 486 Unf
#> 4: TA Gd No ALQ 216 Unf
#> 5: Gd TA Av GLQ 655 Unf
#> 6: Gd TA No GLQ 732 Unf
#> BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
#> 1: 0 150 856 GasA Ex Y SBrkr
#> 2: 0 284 1262 GasA Ex Y SBrkr
#> 3: 0 434 920 GasA Ex Y SBrkr
#> 4: 0 540 756 GasA Gd Y SBrkr
#> 5: 0 490 1145 GasA Ex Y SBrkr
#> 6: 0 64 796 GasA Ex Y SBrkr
#> 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
#> 1: 856 854 0 1710 1 0 2
#> 2: 1262 0 0 1262 0 1 2
#> 3: 920 866 0 1786 1 0 2
#> 4: 961 756 0 1717 1 0 1
#> 5: 1145 1053 0 2198 1 0 2
#> 6: 796 566 0 1362 1 0 1
#> HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
#> 1: 1 3 1 Gd 8 Typ
#> 2: 0 3 1 TA 6 Typ
#> 3: 1 3 1 Gd 6 Typ
#> 4: 0 3 1 Gd 7 Typ
#> 5: 1 4 1 Gd 9 Typ
#> 6: 1 1 1 TA 5 Typ
#> Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
#> 1: 0 <NA> Attchd 2003 RFn 2
#> 2: 1 TA Attchd 1976 RFn 2
#> 3: 1 TA Attchd 2001 RFn 2
#> 4: 1 Gd Detchd 1998 Unf 3
#> 5: 1 TA Attchd 2000 RFn 3
#> 6: 0 <NA> Attchd 1993 Unf 2
#> GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
#> 1: 548 TA TA Y 0 61
#> 2: 460 TA TA Y 298 0
#> 3: 608 TA TA Y 0 42
#> 4: 642 TA TA Y 0 35
#> 5: 836 TA TA Y 192 84
#> 6: 480 TA TA Y 40 30
#> EnclosedPorch 3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
#> 1: 0 0 0 0 <NA> <NA> <NA>
#> 2: 0 0 0 0 <NA> <NA> <NA>
#> 3: 0 0 0 0 <NA> <NA> <NA>
#> 4: 272 0 0 0 <NA> <NA> <NA>
#> 5: 0 0 0 0 <NA> <NA> <NA>
#> 6: 0 320 0 0 <NA> MnPrv Shed
#> MiscVal MoSold YrSold SaleType SaleCondition SalePrice
#> 1: 0 2 2008 WD Normal 208500
#> 2: 0 5 2007 WD Normal 181500
#> 3: 0 9 2008 WD Normal 223500
#> 4: 0 2 2006 WD Abnorml 140000
#> 5: 0 12 2008 WD Normal 250000
#> 6: 700 10 2009 WD Normal 143000
# create a 70/30 train-test split
split <- createDataPartition(y = reg_train$SalePrice, p = 0.7)
xtrain <- reg_train[split$Resample1]
xtest <- reg_train[!split$Resample1]
# remove features with 90% or more missing values
# we will also remove the Id column because it doesn't contain
# any useful information
na_cols <- colSums(is.na(xtrain)) / nrow(xtrain)
na_cols <- names(na_cols[which(na_cols > 0.9)])
xtrain[, c(na_cols, "Id") := NULL]
xtest[, c(na_cols, "Id") := NULL]
# encode categorical variables
cat_cols <- names(xtrain)[sapply(xtrain, is.character)]
for(c in cat_cols){
lbl <- LabelEncoder$new()
lbl$fit(c(xtrain[[c]], xtest[[c]]))
xtrain[[c]] <- lbl$transform(xtrain[[c]])
xtest[[c]] <- lbl$transform(xtest[[c]])
}
#> The data contains NA values. Imputing NA with 'NA'
#> (the message above repeats for the remaining columns; duplicates omitted)
# remove noisy features
noise <- c('GrLivArea','TotalBsmtSF')
xtrain[, c(noise) := NULL]
xtest[, c(noise) := NULL]
# fill missing values with -1
xtrain[is.na(xtrain)] <- -1
xtest[is.na(xtest)] <- -1
KNN Regression
knn <- KNNTrainer$new(k = 2, prob = TRUE, type = 'reg')
knn$fit(train = xtrain, test = xtest, y = 'SalePrice')
probs <- knn$predict(type = 'prob')
labels <- knn$predict(type = 'raw')
rmse(actual = xtest$SalePrice, predicted = labels)
#> [1] 48662.52
SVM Regression
svm <- SVMTrainer$new()
svm$fit(xtrain, 'SalePrice')
pred <- svm$predict(xtest)
rmse(actual = xtest$SalePrice, predicted = pred)
Simple Regression
lf <- LMTrainer$new(family="gaussian")
lf$fit(X = xtrain, y = "SalePrice")
summary(lf$model)
#>
#> Call:
#> stats::glm(formula = f, family = self$family, data = X, weights = self$weights)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -338327 -14806 -1357 13406 264149
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -8.403e+05 1.649e+06 -0.510 0.610463
#> MSSubClass -7.887e+01 5.723e+01 -1.378 0.168466
#> MSZoning -2.326e+03 1.835e+03 -1.267 0.205386
#> LotFrontage -2.105e+01 3.450e+01 -0.610 0.542000
#> LotArea 3.100e-01 1.683e-01 1.843 0.065696 .
#> Street -2.802e+04 1.772e+04 -1.581 0.114197
#> LotShape -3.803e+02 2.185e+03 -0.174 0.861871
#> LandContour -1.749e+03 1.815e+03 -0.964 0.335394
#> Utilities -6.236e+04 3.511e+04 -1.776 0.076056 .
#> LotConfig 2.268e+03 1.135e+03 1.998 0.046028 *
#> LandSlope 1.197e+04 4.634e+03 2.582 0.009960 **
#> Neighborhood -5.439e+02 2.210e+02 -2.461 0.014023 *
#> Condition1 -3.171e+03 1.012e+03 -3.134 0.001778 **
#> Condition2 -1.546e+04 3.517e+03 -4.395 1.23e-05 ***
#> BldgType -2.900e+03 2.196e+03 -1.320 0.186987
#> HouseStyle -6.204e+02 1.010e+03 -0.615 0.539028
#> OverallQual 1.531e+04 1.439e+03 10.643 < 2e-16 ***
#> OverallCond 6.264e+03 1.281e+03 4.889 1.19e-06 ***
#> YearBuilt 3.986e+02 8.696e+01 4.584 5.16e-06 ***
#> YearRemodAdd 9.653e+01 8.274e+01 1.167 0.243654
#> RoofStyle 5.483e+03 2.175e+03 2.521 0.011880 *
#> RoofMatl -1.585e+04 2.299e+03 -6.893 9.95e-12 ***
#> Exterior1st -1.023e+03 7.264e+02 -1.408 0.159511
#> Exterior2nd 8.290e+02 6.596e+02 1.257 0.209115
#> MasVnrType 2.483e+03 1.725e+03 1.439 0.150353
#> MasVnrArea 2.922e+01 7.096e+00 4.118 4.15e-05 ***
#> ExterQual -5.428e+02 2.547e+03 -0.213 0.831317
#> ExterCond 1.057e+03 2.615e+03 0.404 0.686263
#> Foundation -3.088e+03 2.012e+03 -1.534 0.125270
#> BsmtQual 6.577e+03 1.581e+03 4.159 3.48e-05 ***
#> BsmtCond -2.711e+03 2.021e+03 -1.341 0.180111
#> BsmtExposure 1.313e+03 1.064e+03 1.234 0.217481
#> BsmtFinType1 -1.137e+03 8.413e+02 -1.352 0.176684
#> BsmtFinSF1 5.771e+00 6.016e+00 0.959 0.337663
#> BsmtFinType2 -1.152e+03 1.091e+03 -1.056 0.291114
#> BsmtFinSF2 1.731e+01 1.105e+01 1.567 0.117433
#> BsmtUnfSF 1.516e+00 5.735e+00 0.264 0.791534
#> Heating -1.409e+03 3.400e+03 -0.414 0.678660
#> HeatingQC -1.390e+03 1.433e+03 -0.970 0.332129
#> CentralAir 2.728e+03 5.508e+03 0.495 0.620461
#> Electrical 3.957e+03 2.241e+03 1.766 0.077773 .
#> `1stFlrSF` 5.947e+01 7.274e+00 8.176 9.36e-16 ***
#> `2ndFlrSF` 5.111e+01 6.069e+00 8.422 < 2e-16 ***
#> LowQualFinSF -1.606e+00 2.944e+01 -0.055 0.956526
#> BsmtFullBath 1.121e+04 3.063e+03 3.659 0.000267 ***
#> BsmtHalfBath 4.686e+03 4.621e+03 1.014 0.310817
#> FullBath 6.199e+03 3.298e+03 1.879 0.060511 .
#> HalfBath -1.834e+03 3.089e+03 -0.594 0.552889
#> BedroomAbvGr -7.087e+03 1.972e+03 -3.594 0.000343 ***
#> KitchenAbvGr -1.891e+04 5.937e+03 -3.185 0.001494 **
#> KitchenQual 9.122e+03 1.595e+03 5.719 1.44e-08 ***
#> TotRmsAbvGrd 2.779e+03 1.452e+03 1.914 0.055896 .
#> Functional -4.440e+03 1.601e+03 -2.774 0.005647 **
#> Fireplaces -1.103e+03 2.680e+03 -0.412 0.680682
#> FireplaceQu 3.605e+03 1.440e+03 2.504 0.012460 *
#> GarageType -3.729e+01 1.167e+03 -0.032 0.974509
#> GarageYrBlt -1.006e+01 6.902e+00 -1.458 0.145093
#> GarageFinish 6.266e+02 1.560e+03 0.402 0.688096
#> GarageCars 1.598e+04 3.539e+03 4.516 7.10e-06 ***
#> GarageArea -5.789e+00 1.199e+01 -0.483 0.629460
#> GarageQual 1.132e+03 4.160e+03 0.272 0.785537
#> GarageCond -1.720e+03 2.300e+03 -0.748 0.454730
#> PavedDrive -7.645e+02 3.137e+03 -0.244 0.807503
#> WoodDeckSF 2.180e+01 9.438e+00 2.310 0.021095 *
#> OpenPorchSF -9.759e+00 1.843e+01 -0.529 0.596665
#> EnclosedPorch 1.735e+01 1.979e+01 0.877 0.380947
#> `3SsnPorch` 2.172e+00 3.562e+01 0.061 0.951392
#> ScreenPorch 5.565e+01 2.067e+01 2.692 0.007234 **
#> PoolArea -3.991e+01 3.487e+01 -1.144 0.252723
#> Fence -2.484e+03 1.490e+03 -1.667 0.095834 .
#> MiscVal 6.502e+00 4.441e+00 1.464 0.143447
#> MoSold -3.681e+01 3.896e+02 -0.094 0.924751
#> YrSold -9.048e+01 8.216e+02 -0.110 0.912340
#> SaleType 3.043e+03 1.341e+03 2.269 0.023488 *
#> SaleCondition -1.317e+03 1.400e+03 -0.941 0.347007
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 1064227128)
#>
#> Null deviance: 6.6216e+12 on 1023 degrees of freedom
#> Residual deviance: 1.0100e+12 on 949 degrees of freedom
#> AIC: 24264
#>
#> Number of Fisher Scoring iterations: 2
predictions <- lf$predict(df = xtest)
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 32211.9
Lasso Regression
lf <- LMTrainer$new(family = "gaussian", alpha = 1, lambda = 1000)
lf$fit(X = xtrain, y = "SalePrice")
predictions <- lf$predict(df = xtest)
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 37490.75
Ridge Regression
lf <- LMTrainer$new(family = "gaussian", alpha=0)
lf$fit(X = xtrain, y = "SalePrice")
predictions <- lf$predict(df = xtest)
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 37519.78
Linear Regression with CV
lf <- LMTrainer$new(family = "gaussian")
lf$cv_model(X = xtrain, y = 'SalePrice', nfolds = 5, parallel = FALSE)
predictions <- lf$cv_predict(df = xtest)
coefs <- lf$get_importance()
rmse(actual = xtest$SalePrice, predicted = predictions)
Random Forest
rf <- RFTrainer$new(n_estimators = 500, classification = 0)
rf$fit(X = xtrain, y = "SalePrice")
pred <- rf$predict(df = xtest)
rf$get_importance()
#> tmp.order.tmp..decreasing...TRUE..
#> OverallQual 843528867996
#> GarageCars 541332222299
#> 1stFlrSF 483318749419
#> GarageArea 473506726377
#> YearBuilt 383509412736
#> GarageYrBlt 292494440939
#> 2ndFlrSF 245273696373
#> BsmtFinSF1 236385559010
#> FullBath 235657120093
#> TotRmsAbvGrd 204160296183
#> YearRemodAdd 199854187618
#> KitchenQual 192826779801
#> LotArea 167638537104
#> ExterQual 165651128526
#> MasVnrArea 164778221394
#> Fireplaces 157096715586
#> FireplaceQu 137297641709
#> BsmtQual 121920438920
#> OpenPorchSF 120841460753
#> LotFrontage 118101178039
#> Foundation 103990733529
#> WoodDeckSF 69848765508
#> Neighborhood 67261100945
#> BsmtUnfSF 62919521322
#> BedroomAbvGr 52860306010
#> BsmtFinType1 51982234371
#> HeatingQC 50189063151
#> GarageType 48044355273
#> Exterior2nd 42596379941
#> MoSold 40062903329
#> MSSubClass 39564307015
#> OverallCond 32754782066
#> Exterior1st 31737713942
#> RoofStyle 29179553505
#> HalfBath 25624426339
#> PoolArea 24774423758
#> HouseStyle 23762558236
#> YrSold 22444732721
#> BsmtFullBath 22260611232
#> GarageFinish 22137108912
#> LotShape 20767075998
#> BsmtExposure 17364314649
#> SaleCondition 16866276843
#> RoofMatl 15874497235
#> MasVnrType 15852982148
#> Fence 15453138099
#> CentralAir 14327658513
#> LotConfig 13549635384
#> SaleType 13508654676
#> MSZoning 13498853944
#> ScreenPorch 13397400798
#> LandContour 13319110213
#> BldgType 12867140545
#> EnclosedPorch 11296479669
#> BsmtHalfBath 9688039829
#> KitchenAbvGr 9584674634
#> Condition1 8747947235
#> ExterCond 8169124003
#> BsmtFinSF2 7683805820
#> BsmtCond 7619078071
#> GarageQual 7562150192
#> BsmtFinType2 6735314538
#> GarageCond 6672057801
#> LandSlope 6584672331
#> Electrical 5399227197
#> PavedDrive 4501353384
#> Functional 4190006815
#> Condition2 2195681528
#> MiscVal 1928894547
#> Heating 1569086808
#> 3SsnPorch 1480581688
#> LowQualFinSF 1056659610
#> Street 438834513
#> Utilities 88330013
rmse(actual = xtest$SalePrice, predicted = pred)
#> [1] 30455
XGBoost
xgb <- XGBTrainer$new(objective = "reg:linear"
, n_estimators = 500
, eval_metric = "rmse"
, maximize = F
, learning_rate = 0.1
,max_depth = 6)
xgb$fit(X = xtrain, y = "SalePrice", valid = xtest)
#> converting the data into xgboost format..
#> starting with training...
#> [12:05:55] WARNING: amalgamation/../src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
#> [12:05:55] WARNING: amalgamation/../src/learner.cc:627:
#> Parameters: { "nrounds" } might not be used.
#>
#> This could be a false alarm, with some parameters getting used by language bindings but
#> then being mistakenly passed down to XGBoost core, or some parameter actually being used
#> but getting flagged wrongly here. Please open an issue if you find any such cases.
#>
#>
#> [1] train-rmse:179540.272230 val-rmse:177816.924029
#> Multiple eval metrics are present. Will use val_rmse for early stopping.
#> Will train until val_rmse hasn't improved in 50 rounds.
#>
#> [51] train-rmse:8043.020532 val-rmse:29194.076697
#> [101] train-rmse:4690.059726 val-rmse:28389.852039
#> [151] train-rmse:3046.429187 val-rmse:28202.642311
#> [201] train-rmse:2076.870140 val-rmse:28066.442911
#> [251] train-rmse:1405.456004 val-rmse:27995.566795
#> Stopping. Best iteration:
#> [237] train-rmse:1561.653431 val-rmse:27994.376286
pred <- xgb$predict(xtest)
rmse(actual = xtest$SalePrice, predicted = pred)
#> [1] 27994.38
Grid Search
xgb <- XGBTrainer$new(objective = "reg:linear")
gst <- GridSearchCV$new(trainer = xgb,
parameters = list(n_estimators = c(10,50), max_depth = c(5,2)),
n_folds = 3,
scoring = c('accuracy','auc'))
gst$fit(xtrain, "SalePrice")
#> [1] "entering grid search"
#> [1] "In total, 4 models will be trained"
#> converting the data into xgboost format..
#> starting with training...
#> [12:05:56] WARNING: amalgamation/../src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
#> [12:05:56] WARNING: amalgamation/../src/learner.cc:627:
#> Parameters: { "nrounds" } might not be used.
#>
#> This could be a false alarm, with some parameters getting used by language bindings but
#> then being mistakenly passed down to XGBoost core, or some parameter actually being used
#> but getting flagged wrongly here. Please open an issue if you find any such cases.
#>
#>
#> [1] train-rmse:143834.662493
#> Will train until train_rmse hasn't improved in 50 rounds.
#>
#> [10] train-rmse:16183.222568
#> (xgboost format conversion and the repeated deprecation / "nrounds" warnings omitted for the remaining fits)
#> [1] train-rmse:140929.342988 ... [10] train-rmse:15550.145000
#> [1] train-rmse:143251.233056 ... [10] train-rmse:16330.818909
#> [1] train-rmse:143834.662493 ... [50] train-rmse:3819.400511
#> [1] train-rmse:140929.342988 ... [50] train-rmse:3905.734710
#> [1] train-rmse:143251.233056 ... [50] train-rmse:3718.327761
#> [1] train-rmse:144732.651251 ... [10] train-rmse:32563.362240
#> [1] train-rmse:141943.965863 ... [10] train-rmse:31118.978041
#> [1] train-rmse:144076.337327 ... [10] train-rmse:27683.098538
#> [1] train-rmse:144732.651251 ... [50] train-rmse:17622.435150
#> [1] train-rmse:141943.965863 ... [50] train-rmse:16389.178107
#> [1] train-rmse:144076.337327 ... [50] train-rmse:15743.306882
gst$best_iteration()
#> $n_estimators
#> [1] 10
#>
#> $max_depth
#> [1] 5
#>
#> $accuracy_avg
#> [1] 0
#>
#> $accuracy_sd
#> [1] 0
#>
#> $auc_avg
#> [1] NaN
#>
#> $auc_sd
#> [1] NA
Random Search
rf <- RFTrainer$new()
rst <- RandomSearchCV$new(trainer = rf,
parameters = list(n_estimators = c(5,10),
max_depth = c(5,2)),
n_folds = 3,
scoring = c('accuracy','auc'),
n_iter = 3)
rst$fit(xtrain, "SalePrice")
#> [1] "In total, 3 models will be trained"
rst$best_iteration()
#> $n_estimators
#> [1] 5
#>
#> $max_depth
#> [1] 2
#>
#> $accuracy_avg
#> [1] 0.01074421
#>
#> $accuracy_sd
#> [1] 0.003393777
#>
#> $auc_avg
#> [1] NaN
#>
#> $auc_sd
#> [1] NA
Here, we will solve a simple binary classification problem: predicting which passengers survived the sinking of the Titanic. The idea is to demonstrate how to use this package to solve classification problems.
Data Preparation
# load the classification data
load('../data/cla_train.rda')
# if the above doesn't work, you can try: load("cla_train.rda")
head(cla_train)
#> PassengerId Survived Pclass
#> 1: 1 0 3
#> 2: 2 1 1
#> 3: 3 1 3
#> 4: 4 1 1
#> 5: 5 0 3
#> 6: 6 0 3
#> Name Sex Age SibSp Parch
#> 1: Braund, Mr. Owen Harris male 22 1 0
#> 2: Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
#> 3: Heikkinen, Miss. Laina female 26 0 0
#> 4: Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
#> 5: Allen, Mr. William Henry male 35 0 0
#> 6: Moran, Mr. James male NA 0 0
#> Ticket Fare Cabin Embarked
#> 1: A/5 21171 7.2500 S
#> 2: PC 17599 71.2833 C85 C
#> 3: STON/O2. 3101282 7.9250 S
#> 4: 113803 53.1000 C123 S
#> 5: 373450 8.0500 S
#> 6: 330877 8.4583 Q
# split the data
split <- createDataPartition(y = cla_train$Survived, p = 0.7)
xtrain <- cla_train[split$Resample1]
xtest <- cla_train[!split$Resample1]
# encode categorical variables - shorter way
for(c in c('Embarked','Sex','Cabin')) {
lbl <- LabelEncoder$new()
lbl$fit(c(xtrain[[c]], xtest[[c]]))
xtrain[[c]] <- lbl$transform(xtrain[[c]])
xtest[[c]] <- lbl$transform(xtest[[c]])
}
#> The data contains blank values. Imputing them with 'NA'
#> The data contains blank values. Imputing them with 'NA'
#> The data contains blank values. Imputing them with 'NA'
#> The data contains blank values. Imputing them with 'NA'
#> The data contains blank values. Imputing them with 'NA'
# impute missing values
xtrain[, Age := replace(Age, is.na(Age), median(Age, na.rm = TRUE))]
xtest[, Age := replace(Age, is.na(Age), median(Age, na.rm = TRUE))]
# drop these features
to_drop <- c('PassengerId','Ticket','Name')
xtrain <- xtrain[, -c(to_drop), with = FALSE]
xtest <- xtest[, -c(to_drop), with = FALSE]
Now, our data is ready to be served for model training. Let’s do it.
KNN Classification
knn <- KNNTrainer$new(k = 2, prob = TRUE, type = 'class')
knn$fit(train = xtrain, test = xtest, y = 'Survived')
probs <- knn$predict(type = 'prob')
labels <- knn$predict(type = 'raw')
auc(actual = xtest$Survived, predicted = labels)
#> [1] 0.6385027
Naive Bayes Classification
nb <- NBTrainer$new()
nb$fit(xtrain, 'Survived')
pred <- nb$predict(xtest)
#> Warning: predict.naive_bayes(): more features in the newdata are provided as
#> there are probability tables in the object. Calculation is performed based on
#> features to be found in the tables.
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.7771836
SVM Classification
# predicts labels
svm <- SVMTrainer$new()
svm$fit(xtrain, 'Survived')
pred <- svm$predict(xtest)
auc(actual = xtest$Survived, predicted = pred)
Logistic Regression
lf <- LMTrainer$new(family = "binomial")
lf$fit(X = xtrain, y = "Survived")
summary(lf$model)
#>
#> Call:
#> stats::glm(formula = f, family = self$family, data = X, weights = self$weights)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.6102 -0.6018 -0.4367 0.7038 2.4493
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 1.830070 0.616894 2.967 0.00301 **
#> Pclass -0.980785 0.192493 -5.095 3.48e-07 ***
#> Sex 2.508241 0.230374 10.888 < 2e-16 ***
#> Age -0.041034 0.009309 -4.408 1.04e-05 ***
#> SibSp -0.235520 0.117715 -2.001 0.04542 *
#> Parch -0.098742 0.137791 -0.717 0.47361
#> Fare 0.001281 0.002842 0.451 0.65230
#> Cabin 0.008408 0.004786 1.757 0.07899 .
#> Embarked 0.248088 0.166616 1.489 0.13649
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 831.52 on 623 degrees of freedom
#> Residual deviance: 564.76 on 615 degrees of freedom
#> AIC: 582.76
#>
#> Number of Fisher Scoring iterations: 5
predictions <- lf$predict(df = xtest)
auc(actual = xtest$Survived, predicted = predictions)
#> [1] 0.8832145
Lasso Logistic Regression
lf <- LMTrainer$new(family="binomial", alpha=1)
lf$cv_model(X = xtrain, y = "Survived", nfolds = 5, parallel = FALSE)
pred <- lf$cv_predict(df = xtest)
auc(actual = xtest$Survived, predicted = pred)
Ridge Logistic Regression
lf <- LMTrainer$new(family="binomial", alpha=0)
lf$cv_model(X = xtrain, y = "Survived", nfolds = 5, parallel = FALSE)
pred <- lf$cv_predict(df = xtest)
auc(actual = xtest$Survived, predicted = pred)
Random Forest
rf <- RFTrainer$new(n_estimators = 500, classification = 1, max_features = 3)
rf$fit(X = xtrain, y = "Survived")
pred <- rf$predict(df = xtest)
rf$get_importance()
#> tmp.order.tmp..decreasing...TRUE..
#> Sex 67.80128
#> Fare 57.97193
#> Age 48.37045
#> Pclass 24.64915
#> Cabin 21.45972
#> SibSp 13.51637
#> Parch 10.45743
#> Embarked 10.23844
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.7976827
XGBoost
xgb <- XGBTrainer$new(objective = "binary:logistic"
, n_estimators = 500
, eval_metric = "auc"
, maximize = T
, learning_rate = 0.1
,max_depth = 6)
xgb$fit(X = xtrain, y = "Survived", valid = xtest)
#> converting the data into xgboost format..
#> starting with training...
#> [12:05:58] WARNING: amalgamation/../src/learner.cc:627:
#> Parameters: { "nrounds" } might not be used.
#>
#> This could be a false alarm, with some parameters getting used by language bindings but
#> then being mistakenly passed down to XGBoost core, or some parameter actually being used
#> but getting flagged wrongly here. Please open an issue if you find any such cases.
#>
#>
#> [1] train-auc:0.886258 val-auc:0.879085
#> Multiple eval metrics are present. Will use val_auc for early stopping.
#> Will train until val_auc hasn't improved in 50 rounds.
#>
#> [51] train-auc:0.972938 val-auc:0.866370
#> Stopping. Best iteration:
#> [1] train-auc:0.886258 val-auc:0.879085
pred <- xgb$predict(xtest)
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.879085
Grid Search
xgb <- XGBTrainer$new(objective="binary:logistic")
gst <-GridSearchCV$new(trainer = xgb,
parameters = list(n_estimators = c(10,50),
max_depth = c(5,2)),
n_folds = 3,
scoring = c('accuracy','auc'))
gst$fit(xtrain, "Survived")
#> [1] "entering grid search"
#> [1] "In total, 4 models will be trained"
#> converting the data into xgboost format..
#> starting with training...
#> [12:05:58] WARNING: amalgamation/../src/learner.cc:627:
#> Parameters: { "nrounds" } might not be used.
#>
#> This could be a false alarm, with some parameters getting used by language bindings but
#> then being mistakenly passed down to XGBoost core, or some parameter actually being used
#> but getting flagged wrongly here. Please open an issue if you find any such cases.
#>
#>
#> [1] train-logloss:0.559993
#> Will train until train_logloss hasn't improved in 50 rounds.
#>
#> [10] train-logloss:0.304225
#> (xgboost format conversion and the repeated "nrounds" warning omitted for the remaining fits)
#> [1] train-logloss:0.558838 ... [10] train-logloss:0.302156
#> [1] train-logloss:0.547201 ... [10] train-logloss:0.284697
#> [1] train-logloss:0.559993 ... [50] train-logloss:0.172763
#> [1] train-logloss:0.558838 ... [50] train-logloss:0.160542
#> [1] train-logloss:0.547201 ... [50] train-logloss:0.134347
#> [1] train-logloss:0.586389 ... [10] train-logloss:0.416683
#> [1] train-logloss:0.590015 ... [10] train-logloss:0.416606
#> [1] train-logloss:0.582958 ... [10] train-logloss:0.393512
#> [1] train-logloss:0.586389 ... [50] train-logloss:0.325189
#> [1] train-logloss:0.590015 ... [50] train-logloss:0.320031
#> [1] train-logloss:0.582958 ... [50] train-logloss:0.308201
gst$best_iteration()
#> $n_estimators
#> [1] 10
#>
#> $max_depth
#> [1] 5
#>
#> $accuracy_avg
#> [1] 0
#>
#> $accuracy_sd
#> [1] 0
#>
#> $auc_avg
#> [1] 0.8630918
#>
#> $auc_sd
#> [1] 0.02270548
Random Search
rf <- RFTrainer$new()
rst <- RandomSearchCV$new(trainer = rf,
parameters = list(n_estimators = c(10,50), max_depth = c(5,2)),
n_folds = 3,
scoring = c('accuracy','auc'),
n_iter = 3)
rst$fit(xtrain, "Survived")
#> [1] "In total, 3 models will be trained"
rst$best_iteration()
#> $n_estimators
#> [1] 50
#>
#> $max_depth
#> [1] 5
#>
#> $accuracy_avg
#> [1] 0.7964744
#>
#> $accuracy_sd
#> [1] 0.03090914
#>
#> $auc_avg
#> [1] 0.7729436
#>
#> $auc_sd
#> [1] 0.04283084
Let's create a new feature based on the target variable using target encoding, and test a model. Target encoding replaces each level of a categorical variable with a smoothed mean of the target for that level; the smoothing shrinks levels with few observations toward the global mean, which helps avoid overfitting.
# add target encoding features
xtrain[, feat_01 := smoothMean(train_df = xtrain,
test_df = xtest,
colname = "Embarked",
target = "Survived")$train[[2]]]
xtest[, feat_01 := smoothMean(train_df = xtrain,
test_df = xtest,
colname = "Embarked",
target = "Survived")$test[[2]]]
# train a random forest
rf <- RFTrainer$new(n_estimators = 500, classification = 1, max_features = 4)
rf$fit(X = xtrain, y = "Survived")
pred <- rf$predict(df = xtest)
rf$get_importance()
#> tmp.order.tmp..decreasing...TRUE..
#> Sex 69.787235
#> Fare 60.832089
#> Age 52.982604
#> Pclass 24.419818
#> Cabin 21.419274
#> SibSp 13.112177
#> Parch 10.175269
#> feat_01 6.675399
#> Embarked 6.450819
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.8018717