The lares package has multiple families of functions to help analysts and data scientists achieve robust, high-quality analyses without much coding. One of the most complex but valuable functions is h2o_automl, which semi-automatically runs the whole pipeline of a machine learning model given a dataset and some customizable parameters. AutoML enables you to train high-quality models tailored to your needs and accelerates the research and development process.
HELP: Before getting to the code, I recommend checking h2o_automl’s full documentation here or within your R session by running ?lares::h2o_automl. There you’ll find a brief description of every parameter you can set to get exactly what you need and to control how the function behaves.
Mapping h2o_automl
In short, these are some of the things that happen in its backend:
Input a dataframe df and choose which column is the independent variable (y) you’d like to predict. You may set/change the seed argument to guarantee reproducibility of your results.
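For instance, a minimal call could look like the sketch below; dft is the Titanic dataset bundled with lares (loaded later in this post), and the seed value is just an example:
# Minimal sketch: name the target column with `y` and fix a `seed`
# so the split and training are reproducible
r <- h2o_automl(dft, y = Survived, seed = 123)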
The function decides whether it’s a classification (categorical) or regression (continuous) model by looking at the independent variable’s (y) class and number of unique values, which can be controlled with the thresh parameter.
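To illustrate the idea, here is a conceptual sketch of that decision. It is not lares’ actual internal code, and the default thresh value shown is an assumption:
# Conceptual sketch, NOT the package's implementation: a numeric y with more
# unique values than `thresh` is treated as regression, anything else as classification
guess_model_type <- function(y, thresh = 10) {
  if (is.numeric(y) && length(unique(y)) > thresh) "regression" else "classification"
}
guess_model_type(dft$Fare)     # many unique numeric values -> "regression"
guess_model_type(dft$Survived) # two categories -> "classification"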
The dataframe will be split in two: test and train datasets. The proportion of this split can be controlled with the split argument. This step can be replicated with the msplit() function.
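To replicate just the split step on its own, you could do something like the sketch below; I’m assuming msplit()’s size and seed arguments here, so check ?lares::msplit for the exact signature:
# Split df into ~70% train / ~30% test (argument names assumed; see ?lares::msplit)
splits <- msplit(df, size = 0.7, seed = 123)
dim(splits$train) # roughly 70% of the rows
dim(splits$test)  # the remaining rows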
You could also center and scale your numerical values before you continue, use the no_outliers argument to exclude some outliers, and/or impute missing values with MICE. If it’s a classification model, the function can balance (under-sample) your training data; you can control this behavior with the balance argument. Up to this point, you can replicate the whole process with the model_preprocess() function.
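As a standalone sketch, the call could look like this; the argument names mirror the ones just described, but treat them as assumptions and confirm with ?lares::model_preprocess:
# Preprocess on its own: impute missing values and balance the classes
# (argument names assumed from the description above)
prep <- model_preprocess(df, y = "Survived", impute = TRUE, balance = TRUE)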
Runs h2o::h2o.automl(...) to train multiple models and generate a leaderboard with the top (max_models or max_time) models trained, sorted by their performance. You can also customize additional arguments such as nfolds for k-fold cross-validation, exclude_algos and include_algos to exclude or include specific algorithms, and any other argument you wish to pass on to the mother function.
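For example, a hypothetical call restricting the search could look like this (the arguments are the ones described above; the values are arbitrary):
# Train at most 3 models with 10-fold CV, only considering GBM and XGBoost;
# extra arguments are passed straight through to h2o::h2o.automl()
r <- h2o_automl(df, y = Survived,
                max_models = 3, nfolds = 10,
                include_algos = c("GBM", "XGBoost"))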
The best model given the default performance metric (which can be changed with the stopping_metric parameter), evaluated with cross-validation (customize it with nfolds), will be selected to continue. You can also use the h2o_selectmodel() function to select another model and recalculate/plot everything again using this alternate model.
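For instance (I’m assuming the which_model argument name; see ?lares::h2o_selectmodel):
# Swap the selected model for the 2nd one on the leaderboard and rebuild
# all metrics and plots for it (argument name assumed)
r2 <- h2o_selectmodel(r, which_model = 2)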
Performance metrics and plots will be calculated and rendered given the test predictions and the test set’s actual values (which were NOT passed to the models as training inputs). That way, your model’s performance metrics shouldn’t be biased. You can replicate these calculations with the model_metrics() function.
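A minimal sketch of that standalone calculation, assuming model_metrics() takes a tag vector (actual values) and a score vector (predicted probabilities):
# Recompute metrics for any pair of actual values and predicted scores
# (argument names assumed; see ?lares::model_metrics)
set.seed(1)
actuals <- sample(c(TRUE, FALSE), 100, replace = TRUE)
scores  <- runif(100)
model_metrics(tag = actuals, score = scores)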
The output is a list with all the inputs, leaderboard results, the best selected model, performance metrics, and plots. You can either explore the results in the console or export them using the export_results() function.
Now, let’s (install and) load the library, the data, and dig in:
# install.packages("lares")
library(lares)
# The data we'll use is the Titanic dataset
data(dft)
df <- subset(dft, select = -c(Ticket, PassengerId, Cabin))
NOTE: I’ll set some parameters arbitrarily on each example to give visibility into some of the arguments you can pass to your models. Be sure to also check all the printed output, warnings, and messages shown throughout the process, as they may contain relevant information regarding your inputs and the backend operations.
Let’s have a look at three specific examples: classification models (binary and multiple categories) and a regression model. Also, let’s see how we can export our models and put them to work in any environment.
Let’s begin with a binary (TRUE/FALSE) model to predict whether each passenger Survived:
r <- h2o_automl(df, y = Survived, max_models = 1, impute = FALSE, target = "TRUE")
#> 2022-01-31 15:45:16 | Started process...
#> - INDEPENDENT VARIABLE: Survived
#> - MODEL TYPE: Classification
#> # A tibble: 2 × 5
#> tag n p order pcum
#> <lgl> <int> <dbl> <int> <dbl>
#> 1 FALSE 549 61.6 1 61.6
#> 2 TRUE 342 38.4 2 100
#> - MISSINGS: The following variables contain missing observations: Age (19.87%). Consider using the impute parameter.
#> - CATEGORICALS: There are 3 non-numerical features. Consider using ohse() or equivalent prior to encode categorical variables.
#> >>> Splitting data: train = 0.7 & test = 0.3
#> train_size test_size
#> 623 268
#> - REPEATED: There were 64 repeated rows which are being suppressed from the train dataset
#> - ALGORITHMS: excluded 'StackedEnsemble', 'DeepLearning'
#> - CACHE: Previous models are not being erased. You may use 'start_clean' [clear] or 'project_name' [join]
#> - UI: You may check results using H2O Flow's interactive platform: http://localhost:54321/flow/index.html
#> >>> Iterating until 1 models or 600 seconds...
#>
#>
#> 15:45:18.289: Project: AutoML_1_20220131_154518
#> 15:45:18.294: Setting stopping tolerance adaptively based on the training frame: 0.0400641540107502
#> 15:45:18.294: Build control seed: 0
#> 15:45:18.295: training frame: Frame key: AutoML_1_20220131_154518_training_train_sid_9a33_1 cols: 8 rows: 623 chunks: 1 size: 9400 checksum: -5460382674975192472
#> 15:45:18.295: validation frame: NULL
#> 15:45:18.295: leaderboard frame: NULL
#> 15:45:18.295: blending frame: NULL
#> 15:45:18.295: response column: tag
#> 15:45:18.296: fold column: null
#> 15:45:18.296: weights column: null
#> 15:45:18.306: Loading execution steps: [{XGBoost : [def_2 (1g, 10w), def_1 (2g, 10w), def_3 (3g, 10w), grid_1 (4g, 90w), lr_search (6g, 30w)]}, {GLM : [def_1 (1g, 10w)]}, {DRF : [def_1 (2g, 10w), XRT (3g, 10w)]}, {GBM : [def_5 (1g, 10w), def_2 (2g, 10w), def_3 (2g, 10w), def_4 (2g, 10w), def_1 (3g, 10w), grid_1 (4g, 60w), lr_annealing (6g, 10w)]}, {DeepLearning : [def_1 (3g, 10w), grid_1 (4g, 30w), grid_2 (5g, 30w), grid_3 (5g, 30w)]}, {completion : [resume_best_grids (10g, 60w)]}, {StackedEnsemble : [best_of_family_1 (1g, 5w), best_of_family_2 (2g, 5w), best_of_family_3 (3g, 5w), best_of_family_4 (4g, 5w), best_of_family_5 (5g, 5w), all_2 (2g, 10w), all_3 (3g, 10w), all_4 (4g, 10w), all_5 (5g, 10w), monotonic (6g, 10w), best_of_family_xgboost (6g, 10w), best_of_family_gbm (6g, 10w), all_xgboost (7g, 10w), all_gbm (7g, 10w), best_of_family_xglm (8g, 10w), all_xglm (8g, 10w), best_of_family (10g, 10w), best_N (10g, 10w)]}]
#> 15:45:18.332: Disabling Algo: StackedEnsemble as requested by the user.
#> 15:45:18.332: Disabling Algo: DeepLearning as requested by the user.
#> 15:45:18.332: Defined work allocations: [Work{def_2, XGBoost, ModelBuild, group=1, weight=10}, Work{def_1, GLM, ModelBuild, group=1, weight=10}, Work{def_5, GBM, ModelBuild, group=1, weight=10}, Work{def_1, XGBoost, ModelBuild, group=2, weight=10}, Work{def_1, DRF, ModelBuild, group=2, weight=10}, Work{def_2, GBM, ModelBuild, group=2, weight=10}, Work{def_3, GBM, ModelBuild, group=2, weight=10}, Work{def_4, GBM, ModelBuild, group=2, weight=10}, Work{def_3, XGBoost, ModelBuild, group=3, weight=10}, Work{XRT, DRF, ModelBuild, group=3, weight=10}, Work{def_1, GBM, ModelBuild, group=3, weight=10}, Work{grid_1, XGBoost, HyperparamSearch, group=4, weight=90}, Work{grid_1, GBM, HyperparamSearch, group=4, weight=60}, Work{lr_search, XGBoost, Selection, group=6, weight=30}, Work{lr_annealing, GBM, Selection, group=6, weight=10}, Work{resume_best_grids, virtual, Dynamic, group=10, weight=60}]
#> 15:45:18.332: Actual work allocations: [Work{def_2, XGBoost, ModelBuild, group=1, weight=10}, Work{def_1, GLM, ModelBuild, group=1, weight=10}, Work{def_5, GBM, ModelBuild, group=1, weight=10}, Work{def_1, XGBoost, ModelBuild, group=2, weight=10}, Work{def_1, DRF, ModelBuild, group=2, weight=10}, Work{def_2, GBM, ModelBuild, group=2, weight=10}, Work{def_3, GBM, ModelBuild, group=2, weight=10}, Work{def_4, GBM, ModelBuild, group=2, weight=10}, Work{def_3, XGBoost, ModelBuild, group=3, weight=10}, Work{XRT, DRF, ModelBuild, group=3, weight=10}, Work{def_1, GBM, ModelBuild, group=3, weight=10}, Work{grid_1, XGBoost, HyperparamSearch, group=4, weight=90}, Work{grid_1, GBM, HyperparamSearch, group=4, weight=60}, Work{lr_search, XGBoost, Selection, group=6, weight=30}, Work{lr_annealing, GBM, Selection, group=6, weight=10}, Work{resume_best_grids, virtual, Dynamic, group=10, weight=60}]
#> 15:45:18.333: AutoML job created: 2022.01.31 15:45:18.267
#> 15:45:18.333: AutoML build started: 2022.01.31 15:45:18.333
#> 15:45:18.336: Time assigned for XGBoost_1_AutoML_1_20220131_154518: 199.999s
#> 15:45:18.340: AutoML: starting XGBoost_1_AutoML_1_20220131_154518 model training
#> 15:45:18.346: XGBoost_1_AutoML_1_20220131_154518 [XGBoost def_2] started
#> 15:45:20.353: XGBoost_1_AutoML_1_20220131_154518 [XGBoost def_2] complete
#> 15:45:20.353: Adding model XGBoost_1_AutoML_1_20220131_154518 to leaderboard Leaderboard_AutoML_1_20220131_154518@@tag. Training time: model=0s, total=1s
#> 15:45:20.371: New leader: XGBoost_1_AutoML_1_20220131_154518, auc: 0.8427989276139409
#> 15:45:20.371: AutoML: hit the max_models limit; skipping GLM def_1
#> 15:45:20.371: AutoML: hit the max_models limit; skipping GBM def_5
#> 15:45:20.371: Skipping StackedEnsemble 'best_of_family_1' due to the exclude_algos option or it is already trained.
#> 15:45:20.371: AutoML: hit the max_models limit; skipping XGBoost def_1
#> 15:45:20.371: AutoML: hit the max_models limit; skipping DRF def_1
#> 15:45:20.371: AutoML: hit the max_models limit; skipping GBM def_2
#> 15:45:20.371: AutoML: hit the max_models limit; skipping GBM def_3
#> 15:45:20.371: AutoML: hit the max_models limit; skipping GBM def_4
#> 15:45:20.372: Skipping StackedEnsemble 'best_of_family_2' due to the exclude_algos option or it is already trained.
#> 15:45:20.372: Skipping StackedEnsemble 'all_2' due to the exclude_algos option or it is already trained.
#> 15:45:20.372: AutoML: hit the max_models limit; skipping XGBoost def_3
#> 15:45:20.372: AutoML: hit the max_models limit; skipping DRF XRT (Extremely Randomized Trees)
#> 15:45:20.372: AutoML: hit the max_models limit; skipping GBM def_1
#> 15:45:20.372: AutoML: hit the max_models limit; skipping DeepLearning def_1
#> 15:45:20.372: Skipping StackedEnsemble 'best_of_family_3' due to the exclude_algos option or it is already trained.
#> 15:45:20.373: Skipping StackedEnsemble 'all_3' due to the exclude_algos option or it is already trained.
#> 15:45:20.373: AutoML: hit the max_models limit; skipping XGBoost grid_1
#> 15:45:20.373: AutoML: hit the max_models limit; skipping GBM grid_1
#> 15:45:20.373: AutoML: hit the max_models limit; skipping DeepLearning grid_1
#> 15:45:20.373: Skipping StackedEnsemble 'best_of_family_4' due to the exclude_algos option or it is already trained.
#> 15:45:20.373: Skipping StackedEnsemble 'all_4' due to the exclude_algos option or it is already trained.
#> 15:45:20.373: AutoML: hit the max_models limit; skipping DeepLearning grid_2
#> 15:45:20.373: AutoML: hit the max_models limit; skipping DeepLearning grid_3
#> 15:45:20.373: Skipping StackedEnsemble 'best_of_family_5' due to the exclude_algos option or it is already trained.
#> 15:45:20.374: Skipping StackedEnsemble 'all_5' due to the exclude_algos option or it is already trained.
#> 15:45:20.374: AutoML: hit the max_models limit; skipping XGBoost lr_search
#> 15:45:20.374: AutoML: hit the max_models limit; skipping GBM lr_annealing
#> 15:45:20.374: Skipping StackedEnsemble 'monotonic' due to the exclude_algos option or it is already trained.
#> 15:45:20.374: Skipping StackedEnsemble 'best_of_family_xgboost' due to the exclude_algos option or it is already trained.
#> 15:45:20.375: Skipping StackedEnsemble 'best_of_family_gbm' due to the exclude_algos option or it is already trained.
#> 15:45:20.375: Skipping StackedEnsemble 'all_xgboost' due to the exclude_algos option or it is already trained.
#> 15:45:20.375: Skipping StackedEnsemble 'all_gbm' due to the exclude_algos option or it is already trained.
#> 15:45:20.375: Skipping StackedEnsemble 'best_of_family_xglm' due to the exclude_algos option or it is already trained.
#> 15:45:20.375: Skipping StackedEnsemble 'all_xglm' due to the exclude_algos option or it is already trained.
#> 15:45:20.375: AutoML: hit the max_models limit; skipping completion resume_best_grids
#> 15:45:20.376: Skipping StackedEnsemble 'best_of_family' due to the exclude_algos option or it is already trained.
#> 15:45:20.377: Skipping StackedEnsemble 'best_N' due to the exclude_algos option or it is already trained.
#> 15:45:20.377: Actual modeling steps: [{XGBoost : [def_2 (1g, 10w)]}]
#> 15:45:20.377: AutoML build stopped: 2022.01.31 15:45:20.377
#> 15:45:20.378: AutoML build done: built 1 models
#> 15:45:20.378: AutoML duration: 2.044 sec
#> 15:45:20.386: Verifying training frame immutability. . .
#> 15:45:20.386: Training frame was not mutated (as expected).
#> - EUREKA: Succesfully generated 1 models
#> model_id auc logloss aucpr
#> 1 XGBoost_1_AutoML_1_20220131_154518 0.8427989 0.4618864 0.8300427
#> mean_per_class_error rmse mse
#> 1 0.2205576 0.3850217 0.1482417
#> SELECTED MODEL: XGBoost_1_AutoML_1_20220131_154518
#> - NOTE: The following variables were the least important: SibSp, Embarked.C, Parch
#> >>> Running predictions for Survived...
#>
#> Target value: TRUE
#> >>> Generating plots...
#> Model (1/1): XGBoost_1_AutoML_1_20220131_154518
#> Independent Variable: Survived
#> Type: Classification (2 classes)
#> Algorithm: XGBOOST
#> Split: 70% training data (of 891 observations)
#> Seed: 0
#>
#> Test metrics:
#> AUC = 0.86169
#> ACC = 0.20522
#> PRC = 0.1462
#> TPR = 0.27174
#> TNR = 0.17045
#>
#> Most important variables:
#> Sex.female (39.5%)
#> Fare (14.7%)
#> Age (13.4%)
#> Pclass.3 (12.9%)
#> Sex.male (8.7%)
#> Process duration: 14.4s
Let’s take a look at the plots, all compiled into a single dashboard:
plot(r)
We also have several calculations of our model’s performance that may come in handy, such as a confusion matrix, gain and lift by percentile, area under the curve (AUC), accuracy (ACC), recall or true positive rate (TPR), cross-validation metrics, exact thresholds to maximize each metric, and others:
r$metrics
#> $dictionary
#> [1] "AUC: Area Under the Curve"
#> [2] "ACC: Accuracy"
#> [3] "PRC: Precision = Positive Predictive Value"
#> [4] "TPR: Sensitivity = Recall = Hit rate = True Positive Rate"
#> [5] "TNR: Specificity = Selectivity = True Negative Rate"
#> [6] "Logloss (Error): Logarithmic loss [Neutral classification: 0.69315]"
#> [7] "Gain: When best n deciles selected, what % of the real target observations are picked?"
#> [8] "Lift: When best n deciles selected, how much better than random is?"
#>
#> $confusion_matrix
#> Pred
#> Real FALSE TRUE
#> FALSE 30 146
#> TRUE 67 25
#>
#> $gain_lift
#> # A tibble: 10 × 10
#> percentile value random target total gain optimal lift response score
#> <fct> <chr> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 TRUE 10.1 25 27 27.2 29.3 170. 27.2 89.4
#> 2 2 TRUE 20.1 23 27 52.2 58.7 159. 25 70.5
#> 3 3 TRUE 31.0 16 29 69.6 90.2 125. 17.4 56.0
#> 4 4 TRUE 39.9 7 24 77.2 100 93.3 7.61 43.9
#> 5 5 TRUE 50 8 27 85.9 100 71.7 8.70 25.7
#> 6 6 TRUE 60.1 4 27 90.2 100 50.2 4.35 18.8
#> 7 7 TRUE 70.1 5 27 95.7 100 36.4 5.43 13.7
#> 8 8 TRUE 79.9 0 26 95.7 100 19.8 0 10.3
#> 9 9 TRUE 89.9 2 27 97.8 100 8.79 2.17 7.48
#> 10 10 TRUE 100 2 27 100 100 0 2.17 4.41
#>
#> $metrics
#> AUC ACC PRC TPR TNR
#> 1 0.86169 0.20522 0.1462 0.27174 0.17045
#>
#> $cv_metrics
#> # A tibble: 20 × 8
#> metric mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 accuracy 0.803 0.0301 0.792 0.776 0.776 0.831 0.839
#> 2 auc 0.846 0.0179 0.836 0.845 0.824 0.871 0.854
#> 3 err 0.197 0.0301 0.208 0.224 0.224 0.169 0.161
#> 4 err_cou… 24.6 3.85 26 28 28 21 20
#> 5 f0point5 0.756 0.0336 0.717 0.737 0.743 0.782 0.799
#> 6 f1 0.752 0.0396 0.735 0.767 0.759 0.804 0.697
#> 7 f2 0.754 0.0809 0.753 0.799 0.775 0.827 0.618
#> 8 lift_to… 2.53 0.364 2.66 2.23 2.23 2.43 3.1
#> 9 logloss 0.462 0.0337 0.464 0.477 0.509 0.433 0.426
#> 10 max_per… 0.266 0.0938 0.234 0.261 0.232 0.178 0.425
#> 11 mcc 0.590 0.0463 0.565 0.558 0.551 0.658 0.619
#> 12 mean_pe… 0.789 0.0250 0.787 0.780 0.777 0.833 0.770
#> 13 mean_pe… 0.211 0.0250 0.213 0.220 0.223 0.167 0.230
#> 14 mse 0.148 0.0113 0.150 0.153 0.163 0.137 0.137
#> 15 pr_auc 0.835 0.0250 0.820 0.861 0.829 0.860 0.804
#> 16 precisi… 0.762 0.0723 0.706 0.719 0.733 0.768 0.885
#> 17 r2 0.377 0.0348 0.359 0.380 0.339 0.433 0.375
#> 18 recall 0.758 0.107 0.766 0.821 0.786 0.843 0.575
#> 19 rmse 0.385 0.0147 0.388 0.392 0.404 0.371 0.369
#> 20 specifi… 0.820 0.0869 0.808 0.739 0.768 0.822 0.964
#>
#> $max_metrics
#> metric threshold value idx
#> 1 max f1 0.39277211 0.7384615 202
#> 2 max f2 0.16585530 0.7922912 294
#> 3 max f0point5 0.75647959 0.7857143 91
#> 4 max accuracy 0.56256211 0.7961477 140
#> 5 max precision 0.97438139 1.0000000 0
#> 6 max recall 0.04712081 1.0000000 397
#> 7 max specificity 0.97438139 1.0000000 0
#> 8 max absolute_mcc 0.56256211 0.5709932 140
#> 9 max min_per_class_accuracy 0.37454695 0.7721180 209
#> 10 max mean_per_class_accuracy 0.39870447 0.7794638 200
#> 11 max tns 0.97438139 373.0000000 0
#> 12 max fns 0.97438139 249.0000000 0
#> 13 max fps 0.02398755 373.0000000 399
#> 14 max tps 0.04712081 250.0000000 397
#> 15 max tnr 0.97438139 1.0000000 0
#> 16 max fnr 0.97438139 0.9960000 0
#> 17 max fpr 0.02398755 1.0000000 399
#> 18 max tpr 0.04712081 1.0000000 397
The same goes for the plots generated for these metrics. We have the gains and response plots on the test dataset, the confusion matrix, and ROC curves.
r$plots$metrics
#> $gains
#>
#> $response
#>
#> $conf_matrix
#>
#> $ROC
For all models, regardless of their type (classification or regression), you can check the importance of each variable as well:
head(r$importance)
#> variable relative_importance scaled_importance importance
#> 1 Sex.female 199.90350 1.0000000 0.39503346
#> 2 Fare 74.31248 0.3717418 0.14685044
#> 3 Age 67.85074 0.3394175 0.13408125
#> 4 Pclass.3 65.38113 0.3270634 0.12920100
#> 5 Sex.male 44.00624 0.2201374 0.08696164
#> 6 Pclass.1 19.06694 0.0953807 0.03767857
r$plots$importance
Now, let’s run a multi-categorical (more than two labels) model to predict the Pclass of each passenger:
r <- h2o_automl(df, Pclass, ignore = c("Fare", "Cabin"), max_time = 30, plots = FALSE)
#> 2022-01-31 15:45:35 | Started process...
#> - INDEPENDENT VARIABLE: Pclass
#> - MODEL TYPE: Classification
#> # A tibble: 3 × 5
#> tag n p order pcum
#> <fct> <int> <dbl> <int> <dbl>
#> 1 n_3 491 55.1 1 55.1
#> 2 n_1 216 24.2 2 79.4
#> 3 n_2 184 20.6 3 100
#> - MISSINGS: The following variables contain missing observations: Age (19.87%). Consider using the impute parameter.
#> - CATEGORICALS: There are 3 non-numerical features. Consider using ohse() or equivalent prior to encode categorical variables.
#> >>> Splitting data: train = 0.7 & test = 0.3
#> train_size test_size
#> 623 268
#> - REPEATED: There were 65 repeated rows which are being suppressed from the train dataset
#> - ALGORITHMS: excluded 'StackedEnsemble', 'DeepLearning'
#> - CACHE: Previous models are not being erased. You may use 'start_clean' [clear] or 'project_name' [join]
#> - UI: You may check results using H2O Flow's interactive platform: http://localhost:54321/flow/index.html
#> >>> Iterating until 3 models or 30 seconds...
#>
#> 15:45:35.925: Project: AutoML_2_20220131_154535
#> 15:45:35.925: Setting stopping tolerance adaptively based on the training frame: 0.0400641540107502
#> 15:45:35.925: Build control seed: 0
#> 15:45:35.925: training frame: Frame key: AutoML_2_20220131_154535_training_train_sid_bbe6_82 cols: 8 rows: 623 chunks: 1 size: 9412 checksum: -6266075352297987636
#> 15:45:35.925: validation frame: NULL
#> 15:45:35.925: leaderboard frame: NULL
#> 15:45:35.925: blending frame: NULL
#> 15:45:35.925: response column: tag
#> 15:45:35.925: fold column: null
#> 15:45:35.926: weights column: null
#> 15:45:35.926: Loading execution steps: [{XGBoost : [def_2 (1g, 10w), def_1 (2g, 10w), def_3 (3g, 10w), grid_1 (4g, 90w), lr_search (6g, 30w)]}, {GLM : [def_1 (1g, 10w)]}, {DRF : [def_1 (2g, 10w), XRT (3g, 10w)]}, {GBM : [def_5 (1g, 10w), def_2 (2g, 10w), def_3 (2g, 10w), def_4 (2g, 10w), def_1 (3g, 10w), grid_1 (4g, 60w), lr_annealing (6g, 10w)]}, {DeepLearning : [def_1 (3g, 10w), grid_1 (4g, 30w), grid_2 (5g, 30w), grid_3 (5g, 30w)]}, {completion : [resume_best_grids (10g, 60w)]}, {StackedEnsemble : [best_of_family_1 (1g, 5w), best_of_family_2 (2g, 5w), best_of_family_3 (3g, 5w), best_of_family_4 (4g, 5w), best_of_family_5 (5g, 5w), all_2 (2g, 10w), all_3 (3g, 10w), all_4 (4g, 10w), all_5 (5g, 10w), monotonic (6g, 10w), best_of_family_xgboost (6g, 10w), best_of_family_gbm (6g, 10w), all_xgboost (7g, 10w), all_gbm (7g, 10w), best_of_family_xglm (8g, 10w), all_xglm (8g, 10w), best_of_family (10g, 10w), best_N (10g, 10w)]}]
#> 15:45:35.927: Disabling Algo: StackedEnsemble as requested by the user.
#> 15:45:35.927: Disabling Algo: DeepLearning as requested by the user.
#> 15:45:35.927: Defined work allocations: [Work{def_2, XGBoost, ModelBuild, group=1, weight=10}, Work{def_1, GLM, ModelBuild, group=1, weight=10}, Work{def_5, GBM, ModelBuild, group=1, weight=10}, Work{def_1, XGBoost, ModelBuild, group=2, weight=10}, Work{def_1, DRF, ModelBuild, group=2, weight=10}, Work{def_2, GBM, ModelBuild, group=2, weight=10}, Work{def_3, GBM, ModelBuild, group=2, weight=10}, Work{def_4, GBM, ModelBuild, group=2, weight=10}, Work{def_3, XGBoost, ModelBuild, group=3, weight=10}, Work{XRT, DRF, ModelBuild, group=3, weight=10}, Work{def_1, GBM, ModelBuild, group=3, weight=10}, Work{grid_1, XGBoost, HyperparamSearch, group=4, weight=90}, Work{grid_1, GBM, HyperparamSearch, group=4, weight=60}, Work{lr_search, XGBoost, Selection, group=6, weight=30}, Work{lr_annealing, GBM, Selection, group=6, weight=10}, Work{resume_best_grids, virtual, Dynamic, group=10, weight=60}]
#> 15:45:35.927: Actual work allocations: [Work{def_2, XGBoost, ModelBuild, group=1, weight=10}, Work{def_1, GLM, ModelBuild, group=1, weight=10}, Work{def_5, GBM, ModelBuild, group=1, weight=10}, Work{def_1, XGBoost, ModelBuild, group=2, weight=10}, Work{def_1, DRF, ModelBuild, group=2, weight=10}, Work{def_2, GBM, ModelBuild, group=2, weight=10}, Work{def_3, GBM, ModelBuild, group=2, weight=10}, Work{def_4, GBM, ModelBuild, group=2, weight=10}, Work{def_3, XGBoost, ModelBuild, group=3, weight=10}, Work{XRT, DRF, ModelBuild, group=3, weight=10}, Work{def_1, GBM, ModelBuild, group=3, weight=10}, Work{grid_1, XGBoost, HyperparamSearch, group=4, weight=90}, Work{grid_1, GBM, HyperparamSearch, group=4, weight=60}, Work{lr_search, XGBoost, Selection, group=6, weight=30}, Work{lr_annealing, GBM, Selection, group=6, weight=10}, Work{resume_best_grids, virtual, Dynamic, group=10, weight=60}]
#> 15:45:35.927: AutoML job created: 2022.01.31 15:45:35.924
#> 15:45:35.928: AutoML build started: 2022.01.31 15:45:35.927
#> 15:45:35.928: Time assigned for XGBoost_1_AutoML_2_20220131_154535: 9.9996669921875s
#> 15:45:35.928: AutoML: starting XGBoost_1_AutoML_2_20220131_154535 model training
#> 15:45:35.928: XGBoost_1_AutoML_2_20220131_154535 [XGBoost def_2] started
#> 15:45:36.933: XGBoost_1_AutoML_2_20220131_154535 [XGBoost def_2] complete
#> 15:45:36.933: Adding model XGBoost_1_AutoML_2_20220131_154535 to leaderboard Leaderboard_AutoML_2_20220131_154535@@tag. Training time: model=0s, total=0s
#> 15:45:36.934: New leader: XGBoost_1_AutoML_2_20220131_154535, mean_per_class_error: 0.4946035431716546
#> 15:45:36.935: Time assigned for GLM_1_AutoML_2_20220131_154535: 14.496s
#> 15:45:36.936: AutoML: starting GLM_1_AutoML_2_20220131_154535 model training
#> 15:45:36.937: GLM_1_AutoML_2_20220131_154535 [GLM def_1] started
#> 15:45:39.943: GLM_1_AutoML_2_20220131_154535 [GLM def_1] complete
#> 15:45:39.943: Adding model GLM_1_AutoML_2_20220131_154535 to leaderboard Leaderboard_AutoML_2_20220131_154535@@tag. Training time: model=1s, total=2s
#> 15:45:39.944: New leader: GLM_1_AutoML_2_20220131_154535, mean_per_class_error: 0.47423245614035087
#> 15:45:39.944: Time assigned for GBM_1_AutoML_2_20220131_154535: 25.983s
#> 15:45:39.946: AutoML: starting GBM_1_AutoML_2_20220131_154535 model training
#> 15:45:39.947: GBM_1_AutoML_2_20220131_154535 [GBM def_5] started
#> 15:45:40.949: GBM_1_AutoML_2_20220131_154535 [GBM def_5] complete
#> 15:45:40.949: Adding model GBM_1_AutoML_2_20220131_154535 to leaderboard Leaderboard_AutoML_2_20220131_154535@@tag. Training time: model=0s, total=0s
#> 15:45:40.951: Skipping StackedEnsemble 'best_of_family_1' due to the exclude_algos option or it is already trained.
#> 15:45:40.951: AutoML: hit the max_models limit; skipping XGBoost def_1
#> 15:45:40.951: AutoML: hit the max_models limit; skipping DRF def_1
#> 15:45:40.951: AutoML: hit the max_models limit; skipping GBM def_2
#> 15:45:40.951: AutoML: hit the max_models limit; skipping GBM def_3
#> 15:45:40.951: AutoML: hit the max_models limit; skipping GBM def_4
#> 15:45:40.951: Skipping StackedEnsemble 'best_of_family_2' due to the exclude_algos option or it is already trained.
#> 15:45:40.951: Skipping StackedEnsemble 'all_2' due to the exclude_algos option or it is already trained.
#> 15:45:40.951: AutoML: hit the max_models limit; skipping XGBoost def_3
#> 15:45:40.951: AutoML: hit the max_models limit; skipping DRF XRT (Extremely Randomized Trees)
#> 15:45:40.951: AutoML: hit the max_models limit; skipping GBM def_1
#> 15:45:40.951: AutoML: hit the max_models limit; skipping DeepLearning def_1
#> 15:45:40.951: Skipping StackedEnsemble 'best_of_family_3' due to the exclude_algos option or it is already trained.
#> 15:45:40.951: Skipping StackedEnsemble 'all_3' due to the exclude_algos option or it is already trained.
#> 15:45:40.952: AutoML: hit the max_models limit; skipping XGBoost grid_1
#> 15:45:40.952: AutoML: hit the max_models limit; skipping GBM grid_1
#> 15:45:40.952: AutoML: hit the max_models limit; skipping DeepLearning grid_1
#> 15:45:40.952: Skipping StackedEnsemble 'best_of_family_4' due to the exclude_algos option or it is already trained.
#> 15:45:40.952: Skipping StackedEnsemble 'all_4' due to the exclude_algos option or it is already trained.
#> 15:45:40.952: AutoML: hit the max_models limit; skipping DeepLearning grid_2
#> 15:45:40.952: AutoML: hit the max_models limit; skipping DeepLearning grid_3
#> 15:45:40.952: Skipping StackedEnsemble 'best_of_family_5' due to the exclude_algos option or it is already trained.
#> 15:45:40.952: Skipping StackedEnsemble 'all_5' due to the exclude_algos option or it is already trained.
#> 15:45:40.952: AutoML: hit the max_models limit; skipping XGBoost lr_search
#> 15:45:40.952: AutoML: hit the max_models limit; skipping GBM lr_annealing
#> 15:45:40.952: Skipping StackedEnsemble 'monotonic' due to the exclude_algos option or it is already trained.
#> 15:45:40.952: Skipping StackedEnsemble 'best_of_family_xgboost' due to the exclude_algos option or it is already trained.
#> 15:45:40.953: Skipping StackedEnsemble 'best_of_family_gbm' due to the exclude_algos option or it is already trained.
#> 15:45:40.953: Skipping StackedEnsemble 'all_xgboost' due to the exclude_algos option or it is already trained.
#> 15:45:40.953: Skipping StackedEnsemble 'all_gbm' due to the exclude_algos option or it is already trained.
#> 15:45:40.953: Skipping StackedEnsemble 'best_of_family_xglm' due to the exclude_algos option or it is already trained.
#> 15:45:40.954: Skipping StackedEnsemble 'all_xglm' due to the exclude_algos option or it is already trained.
#> 15:45:40.954: AutoML: hit the max_models limit; skipping completion resume_best_grids
#> 15:45:40.954: Skipping StackedEnsemble 'best_of_family' due to the exclude_algos option or it is already trained.
#> 15:45:40.954: Skipping StackedEnsemble 'best_N' due to the exclude_algos option or it is already trained.
#> 15:45:40.954: Actual modeling steps: [{XGBoost : [def_2 (1g, 10w)]}, {GLM : [def_1 (1g, 10w)]}, {GBM : [def_5 (1g, 10w)]}]
#> 15:45:40.954: AutoML build stopped: 2022.01.31 15:45:40.954
#> 15:45:40.954: AutoML build done: built 3 models
#> 15:45:40.954: AutoML duration: 5.027 sec
#> 15:45:40.958: Verifying training frame immutability. . .
#> 15:45:40.958: Training frame was not mutated (as expected).
#> - EUREKA: Succesfully generated 3 models
#> model_id mean_per_class_error logloss rmse
#> 1 GLM_1_AutoML_2_20220131_154535 0.4742325 0.8170807 0.5409388
#> 2 XGBoost_1_AutoML_2_20220131_154535 0.4946035 0.8255072 0.5392879
#> 3 GBM_1_AutoML_2_20220131_154535 0.5037168 0.8620436 0.5616772
#> mse
#> 1 0.2926147
#> 2 0.2908315
#> 3 0.3154812
#> SELECTED MODEL: GLM_1_AutoML_2_20220131_154535
#> - NOTE: The following variables were the least important: Sex.male, Sex.female, Parch
#> >>> Running predictions for Pclass...
#>
#> Model (1/3): GLM_1_AutoML_2_20220131_154535
#> Independent Variable: Pclass
#> Type: Classification (3 classes)
#> Algorithm: GLM
#> Split: 70% training data (of 891 observations)
#> Seed: 0
#>
#> Test metrics:
#> AUC = 0.76337
#> ACC = 0.64179
#>
#> Most important variables:
#> Embarked.Q (25.3%)
#> Embarked.C (13.5%)
#> Embarked.S (13.3%)
#> Age (11.9%)
#> Survived.FALSE (10.6%)
#> Process duration: 17.2s
Let’s take a look at the plots, all compiled into a single dashboard:
plot(r)
Finally, let’s run a regression model with continuous values to predict the Fare paid by each passenger:
<- h2o_automl(df, y = "Fare", ignore = "Pclass", exclude_algos = NULL, quiet = TRUE)
r #>
#> 15:45:54.803: Project: AutoML_3_20220131_154554
#> 15:45:54.804: Setting stopping tolerance adaptively based on the training frame: 0.04052204492365539
#> 15:45:54.804: Build control seed: 0
#> 15:45:54.804: training frame: Frame key: AutoML_3_20220131_154554_training_train_sid_8245_174 cols: 8 rows: 609 chunks: 1 size: 9258 checksum: 6553451394746990112
#> 15:45:54.804: validation frame: NULL
#> 15:45:54.804: leaderboard frame: NULL
#> 15:45:54.804: blending frame: NULL
#> 15:45:54.804: response column: tag
#> 15:45:54.804: fold column: null
#> 15:45:54.804: weights column: null
#> 15:45:54.804: Loading execution steps: [{XGBoost : [def_2 (1g, 10w), def_1 (2g, 10w), def_3 (3g, 10w), grid_1 (4g, 90w), lr_search (6g, 30w)]}, {GLM : [def_1 (1g, 10w)]}, {DRF : [def_1 (2g, 10w), XRT (3g, 10w)]}, {GBM : [def_5 (1g, 10w), def_2 (2g, 10w), def_3 (2g, 10w), def_4 (2g, 10w), def_1 (3g, 10w), grid_1 (4g, 60w), lr_annealing (6g, 10w)]}, {DeepLearning : [def_1 (3g, 10w), grid_1 (4g, 30w), grid_2 (5g, 30w), grid_3 (5g, 30w)]}, {completion : [resume_best_grids (10g, 60w)]}, {StackedEnsemble : [best_of_family_1 (1g, 5w), best_of_family_2 (2g, 5w), best_of_family_3 (3g, 5w), best_of_family_4 (4g, 5w), best_of_family_5 (5g, 5w), all_2 (2g, 10w), all_3 (3g, 10w), all_4 (4g, 10w), all_5 (5g, 10w), monotonic (6g, 10w), best_of_family_xgboost (6g, 10w), best_of_family_gbm (6g, 10w), all_xgboost (7g, 10w), all_gbm (7g, 10w), best_of_family_xglm (8g, 10w), all_xglm (8g, 10w), best_of_family (10g, 10w), best_N (10g, 10w)]}]
#> 15:45:54.805: Defined work allocations: [Work{def_2, XGBoost, ModelBuild, group=1, weight=10}, Work{def_1, GLM, ModelBuild, group=1, weight=10}, Work{def_5, GBM, ModelBuild, group=1, weight=10}, Work{best_of_family_1, StackedEnsemble, ModelBuild, group=1, weight=5}, Work{def_1, XGBoost, ModelBuild, group=2, weight=10}, Work{def_1, DRF, ModelBuild, group=2, weight=10}, Work{def_2, GBM, ModelBuild, group=2, weight=10}, Work{def_3, GBM, ModelBuild, group=2, weight=10}, Work{def_4, GBM, ModelBuild, group=2, weight=10}, Work{best_of_family_2, StackedEnsemble, ModelBuild, group=2, weight=5}, Work{all_2, StackedEnsemble, ModelBuild, group=2, weight=10}, Work{def_3, XGBoost, ModelBuild, group=3, weight=10}, Work{XRT, DRF, ModelBuild, group=3, weight=10}, Work{def_1, GBM, ModelBuild, group=3, weight=10}, Work{def_1, DeepLearning, ModelBuild, group=3, weight=10}, Work{best_of_family_3, StackedEnsemble, ModelBuild, group=3, weight=5}, Work{all_3, StackedEnsemble, ModelBuild, group=3, weight=10}, Work{grid_1, XGBoost, HyperparamSearch, group=4, weight=90}, Work{grid_1, GBM, HyperparamSearch, group=4, weight=60}, Work{grid_1, DeepLearning, HyperparamSearch, group=4, weight=30}, Work{best_of_family_4, StackedEnsemble, ModelBuild, group=4, weight=5}, Work{all_4, StackedEnsemble, ModelBuild, group=4, weight=10}, Work{grid_2, DeepLearning, HyperparamSearch, group=5, weight=30}, Work{grid_3, DeepLearning, HyperparamSearch, group=5, weight=30}, Work{best_of_family_5, StackedEnsemble, ModelBuild, group=5, weight=5}, Work{all_5, StackedEnsemble, ModelBuild, group=5, weight=10}, Work{lr_search, XGBoost, Selection, group=6, weight=30}, Work{lr_annealing, GBM, Selection, group=6, weight=10}, Work{monotonic, StackedEnsemble, ModelBuild, group=6, weight=10}, Work{best_of_family_xgboost, StackedEnsemble, ModelBuild, group=6, weight=10}, Work{best_of_family_gbm, StackedEnsemble, ModelBuild, group=6, weight=10}, Work{all_xgboost, StackedEnsemble, ModelBuild, group=7, weight=10}, Work{all_gbm, StackedEnsemble, ModelBuild, group=7, weight=10}, Work{best_of_family_xglm, StackedEnsemble, ModelBuild, group=8, weight=10}, Work{all_xglm, StackedEnsemble, ModelBuild, group=8, weight=10}, Work{resume_best_grids, virtual, Dynamic, group=10, weight=60}, Work{best_of_family, StackedEnsemble, ModelBuild, group=10, weight=10}, Work{best_N, StackedEnsemble, ModelBuild, group=10, weight=10}]
#> 15:45:54.805: Actual work allocations: [Work{def_2, XGBoost, ModelBuild, group=1, weight=10}, Work{def_1, GLM, ModelBuild, group=1, weight=10}, Work{def_5, GBM, ModelBuild, group=1, weight=10}, Work{best_of_family_1, StackedEnsemble, ModelBuild, group=1, weight=5}, Work{def_1, XGBoost, ModelBuild, group=2, weight=10}, Work{def_1, DRF, ModelBuild, group=2, weight=10}, Work{def_2, GBM, ModelBuild, group=2, weight=10}, Work{def_3, GBM, ModelBuild, group=2, weight=10}, Work{def_4, GBM, ModelBuild, group=2, weight=10}, Work{best_of_family_2, StackedEnsemble, ModelBuild, group=2, weight=5}, Work{all_2, StackedEnsemble, ModelBuild, group=2, weight=10}, Work{def_3, XGBoost, ModelBuild, group=3, weight=10}, Work{XRT, DRF, ModelBuild, group=3, weight=10}, Work{def_1, GBM, ModelBuild, group=3, weight=10}, Work{def_1, DeepLearning, ModelBuild, group=3, weight=10}, Work{best_of_family_3, StackedEnsemble, ModelBuild, group=3, weight=5}, Work{all_3, StackedEnsemble, ModelBuild, group=3, weight=10}, Work{grid_1, XGBoost, HyperparamSearch, group=4, weight=90}, Work{grid_1, GBM, HyperparamSearch, group=4, weight=60}, Work{grid_1, DeepLearning, HyperparamSearch, group=4, weight=30}, Work{best_of_family_4, StackedEnsemble, ModelBuild, group=4, weight=5}, Work{all_4, StackedEnsemble, ModelBuild, group=4, weight=10}, Work{grid_2, DeepLearning, HyperparamSearch, group=5, weight=30}, Work{grid_3, DeepLearning, HyperparamSearch, group=5, weight=30}, Work{best_of_family_5, StackedEnsemble, ModelBuild, group=5, weight=5}, Work{all_5, StackedEnsemble, ModelBuild, group=5, weight=10}, Work{lr_search, XGBoost, Selection, group=6, weight=30}, Work{lr_annealing, GBM, Selection, group=6, weight=10}, Work{monotonic, StackedEnsemble, ModelBuild, group=6, weight=10}, Work{best_of_family_xgboost, StackedEnsemble, ModelBuild, group=6, weight=10}, Work{best_of_family_gbm, StackedEnsemble, ModelBuild, group=6, weight=10}, Work{all_xgboost, StackedEnsemble, ModelBuild, group=7, weight=10}, Work{all_gbm, StackedEnsemble, ModelBuild, group=7, weight=10}, Work{best_of_family_xglm, StackedEnsemble, ModelBuild, group=8, weight=10}, Work{all_xglm, StackedEnsemble, ModelBuild, group=8, weight=10}, Work{resume_best_grids, virtual, Dynamic, group=10, weight=60}, Work{best_of_family, StackedEnsemble, ModelBuild, group=10, weight=10}, Work{best_N, StackedEnsemble, ModelBuild, group=10, weight=10}]
#> 15:45:54.806: AutoML job created: 2022.01.31 15:45:54.803
#> 15:45:54.806: AutoML build started: 2022.01.31 15:45:54.806
#> 15:45:54.806: Time assigned for XGBoost_1_AutoML_3_20220131_154554: 171.428578125s
#> 15:45:54.806: AutoML: starting XGBoost_1_AutoML_3_20220131_154554 model training
#> 15:45:54.807: XGBoost_1_AutoML_3_20220131_154554 [XGBoost def_2] started
#> 15:45:55.811: XGBoost_1_AutoML_3_20220131_154554 [XGBoost def_2] complete
#> 15:45:55.811: Adding model XGBoost_1_AutoML_3_20220131_154554 to leaderboard Leaderboard_AutoML_3_20220131_154554@@tag. Training time: model=0s, total=0s
#> 15:45:55.813: New leader: XGBoost_1_AutoML_3_20220131_154554, mean_residual_deviance: 830.4367206836267
#> 15:45:55.813: Time assigned for GLM_1_AutoML_3_20220131_154554: 239.597203125s
#> 15:45:55.813: AutoML: starting GLM_1_AutoML_3_20220131_154554 model training
#> 15:45:55.814: GLM_1_AutoML_3_20220131_154554 [GLM def_1] started
#> 15:45:56.816: GLM_1_AutoML_3_20220131_154554 [GLM def_1] complete
#> 15:45:56.816: Adding model GLM_1_AutoML_3_20220131_154554 to leaderboard Leaderboard_AutoML_3_20220131_154554@@tag. Training time: model=0s, total=0s
#> 15:45:56.818: New leader: GLM_1_AutoML_3_20220131_154554, mean_residual_deviance: 739.2163401351919
#> 15:45:56.818: Time assigned for GBM_1_AutoML_3_20220131_154554: 398.6586875s
#> 15:45:56.818: AutoML: starting GBM_1_AutoML_3_20220131_154554 model training
#> 15:45:56.819: GBM_1_AutoML_3_20220131_154554 [GBM def_5] started
#> 15:45:57.823: GBM_1_AutoML_3_20220131_154554 [GBM def_5] complete
#> 15:45:57.823: Adding model GBM_1_AutoML_3_20220131_154554 to leaderboard Leaderboard_AutoML_3_20220131_154554@@tag. Training time: model=0s, total=0s
#> 15:45:57.831: Time assigned for StackedEnsemble_BestOfFamily_1_AutoML_3_20220131_154554: 596.975s
#> 15:45:57.832: AutoML: starting StackedEnsemble_BestOfFamily_1_AutoML_3_20220131_154554 model training
#> 15:45:57.833: StackedEnsemble_BestOfFamily_1_AutoML_3_20220131_154554 [StackedEnsemble best_of_family_1 (built with AUTO metalearner, using top model from each algorithm type)] started
#> 15:45:58.837: StackedEnsemble_BestOfFamily_1_AutoML_3_20220131_154554 [StackedEnsemble best_of_family_1 (built with AUTO metalearner, using top model from each algorithm type)] complete
#> 15:45:58.837: Adding model StackedEnsemble_BestOfFamily_1_AutoML_3_20220131_154554 to leaderboard Leaderboard_AutoML_3_20220131_154554@@tag. Training time: model=0s, total=0s
#> 15:45:58.838: New leader: StackedEnsemble_BestOfFamily_1_AutoML_3_20220131_154554, mean_residual_deviance: 730.9754634709968
#> 15:45:58.838: AutoML: hit the max_models limit; skipping XGBoost def_1
#> 15:45:58.838: AutoML: hit the max_models limit; skipping DRF def_1
#> 15:45:58.838: AutoML: hit the max_models limit; skipping GBM def_2
#> 15:45:58.838: AutoML: hit the max_models limit; skipping GBM def_3
#> 15:45:58.838: AutoML: hit the max_models limit; skipping GBM def_4
#> 15:45:58.838: AutoML: hit the max_models limit; skipping XGBoost def_3
#> 15:45:58.838: AutoML: hit the max_models limit; skipping DRF XRT (Extremely Randomized Trees)
#> 15:45:58.838: AutoML: hit the max_models limit; skipping GBM def_1
#> 15:45:58.838: AutoML: hit the max_models limit; skipping DeepLearning def_1
#> 15:45:58.838: AutoML: hit the max_models limit; skipping XGBoost grid_1
#> 15:45:58.838: AutoML: hit the max_models limit; skipping GBM grid_1
#> 15:45:58.838: AutoML: hit the max_models limit; skipping DeepLearning grid_1
#> 15:45:58.839: AutoML: hit the max_models limit; skipping DeepLearning grid_2
#> 15:45:58.839: AutoML: hit the max_models limit; skipping DeepLearning grid_3
#> 15:45:58.839: AutoML: hit the max_models limit; skipping XGBoost lr_search
#> 15:45:58.839: AutoML: hit the max_models limit; skipping GBM lr_annealing
#> 15:45:58.839: No base models, due to timeouts or the exclude_algos option. Skipping StackedEnsemble 'monotonic'.
#> 15:45:58.840: Time assigned for StackedEnsemble_BestOfFamily_2_AutoML_3_20220131_154554: 99.327671875s
#> 15:45:58.840: AutoML: starting StackedEnsemble_BestOfFamily_2_AutoML_3_20220131_154554 model training
#> 15:45:58.840: StackedEnsemble_BestOfFamily_2_AutoML_3_20220131_154554 [StackedEnsemble best_of_family_xgboost (built with xgboost metalearner, using top model from each algorithm type)] started
#> 15:45:59.845: StackedEnsemble_BestOfFamily_2_AutoML_3_20220131_154554 [StackedEnsemble best_of_family_xgboost (built with xgboost metalearner, using top model from each algorithm type)] complete
#> 15:45:59.845: Adding model StackedEnsemble_BestOfFamily_2_AutoML_3_20220131_154554 to leaderboard Leaderboard_AutoML_3_20220131_154554@@tag. Training time: model=0s, total=0s
#> 15:45:59.847: Time assigned for StackedEnsemble_BestOfFamily_3_AutoML_3_20220131_154554: 118.9918046875s
#> 15:45:59.847: AutoML: starting StackedEnsemble_BestOfFamily_3_AutoML_3_20220131_154554 model training
#> 15:45:59.848: StackedEnsemble_BestOfFamily_3_AutoML_3_20220131_154554 [StackedEnsemble best_of_family_gbm (built with gbm metalearner, using top model from each algorithm type)] started
#> 15:46:00.849: StackedEnsemble_BestOfFamily_3_AutoML_3_20220131_154554 [StackedEnsemble best_of_family_gbm (built with gbm metalearner, using top model from each algorithm type)] complete
#> 15:46:00.849: Adding model StackedEnsemble_BestOfFamily_3_AutoML_3_20220131_154554 to leaderboard Leaderboard_AutoML_3_20220131_154554@@tag. Training time: model=0s, total=0s
#> 15:46:00.851: Time assigned for StackedEnsemble_BestOfFamily_4_AutoML_3_20220131_154554: 296.9775s
#> 15:46:00.851: AutoML: starting StackedEnsemble_BestOfFamily_4_AutoML_3_20220131_154554 model training
#> 15:46:00.851: StackedEnsemble_BestOfFamily_4_AutoML_3_20220131_154554 [StackedEnsemble best_of_family_xglm (built with AUTO metalearner, using top model from each algorithm type)] started
#> 15:46:01.855: StackedEnsemble_BestOfFamily_4_AutoML_3_20220131_154554 [StackedEnsemble best_of_family_xglm (built with AUTO metalearner, using top model from each algorithm type)] complete
#> 15:46:01.856: Adding model StackedEnsemble_BestOfFamily_4_AutoML_3_20220131_154554 to leaderboard Leaderboard_AutoML_3_20220131_154554@@tag. Training time: model=0s, total=0s
#> 15:46:01.857: AutoML: hit the max_models limit; skipping completion resume_best_grids
#> 15:46:01.858: Actual modeling steps: [{XGBoost : [def_2 (1g, 10w)]}, {GLM : [def_1 (1g, 10w)]}, {GBM : [def_5 (1g, 10w)]}, {StackedEnsemble : [best_of_family_1 (1g, 5w), best_of_family_xgboost (6g, 10w), best_of_family_gbm (6g, 10w), best_of_family_xglm (8g, 10w)]}]
#> 15:46:01.858: AutoML build stopped: 2022.01.31 15:46:01.858
#> 15:46:01.858: AutoML build done: built 3 models
#> 15:46:01.858: AutoML duration: 7.052 sec
#> 15:46:01.862: Verifying training frame immutability. . .
#> 15:46:01.862: Training frame was not mutated (as expected).
print(r)
#> Model (1/7): StackedEnsemble_BestOfFamily_1_AutoML_3_20220131_154554
#> Independent Variable: Fare
#> Type: Regression
#> Algorithm: STACKEDENSEMBLE
#> Split: 70% training data (of 871 observations)
#> Seed: 0
#>
#> Test metrics:
#> rmse = 20.17
#> mae = 14.079
#> mape = 0.068862
#> mse = 406.82
#> rsq = 0.367
#> rsqa = 0.3645
Let’s take a look at the plots, all compiled into a single dashboard:
plot(r)
Once you have your model trained and picked, you can export the model and its results so you can put it to work in a production environment (which doesn’t have to be R). There is a function that does all that for you: export_results(). Simply pass your h2o_automl list object into this function and that’s it! You can select which formats will be exported using the which argument. Currently we support: txt, csv, rds, binary, mojo [best format for production], and plots. There are also two quick options (dev and production) to export some or all of the files. Lastly, you can set a custom subdir to gather everything into a new sub-directory; I’d recommend using the model’s name or any other convention that helps you know which one’s which.
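For example, a quick sketch of an export call; the which values come from the list above and the subdir name is just an example:
# Export the MOJO files, a CSV summary, and the plots into a custom sub-directory
export_results(r, which = c("mojo", "csv", "plots"), subdir = "titanic_survived_xgboost")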
If you’d like to re-use your exported models to predict new datasets, you have several options:

h2o_predict_MOJO() [recommended]: lets you predict using h2o’s .zip file containing the MOJO files. These files are also the ones used when putting the model into production in any other environment. MOJO also lets you change h2o versions without issues (see the sketch after this list).

h2o_predict_binary(): lets you predict using the h2o binary file. The h2o version/build must match for it to work.

h2o_predict_model(): lets you run predictions from an H2O Model Object, the same way you’d use the base predict function. It will probably only work in your current session, as you must have the actual trained object available.
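As a quick sketch of the recommended MOJO route; the argument names and the idea that the path points to the folder created by export_results() are assumptions on my part, so check ?lares::h2o_predict_MOJO:
# Score a few new rows with the exported MOJO files
# (`model_path` assumed to point to the export_results() output containing the MOJO .zip)
new_passengers <- head(df, 10)
h2o_predict_MOJO(new_passengers, model_path = "titanic_survived_xgboost")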