MachineShop
is a meta-package for statistical and machine learning models, with a common interface for model fitting, prediction, performance assessment, and presentation of results. Support is provided for predictive modeling of numerical, categorical, and censored time-to-event outcomes and for resample (bootstrap, cross-validation, and split training-test set) estimation of model performance. This vignette introduces the package interface with a survival data analysis example, followed by applications to other types of response variables, supported methods of model specification and data preprocessing, and a list of all currently available models.
The Melanoma
dataset from the MASS
package (Andersen et al. 1993) records the time, in days, until (1) death from disease, (2) the end of the study while still alive, or (3) death from other causes for 205 patients in Denmark with malignant melanoma. Also provided are potential predictors of the survival outcomes. We begin by loading the MachineShop
, survival
, and MASS
packages required for the analysis as well as the magrittr
package (Bache and Wickham 2014) for its pipe (%>%
) operator to simplify some of the code syntax. The dataset is split into a training set to which a survival model will be fit and a test set on which to make predictions. A global formula fo
relates the predictors on the right hand side to the overall survival outcome on the left and will be used in all of the survival models in this vignette example.
## Load libraries for the survival analysis
library(MachineShop)
library(survival)
library(MASS)
library(magrittr)
## Malignant melanoma cancer dataset
head(Melanoma)
#> time status sex age year thickness ulcer
#> 1 10 3 1 76 1972 6.76 1
#> 2 30 3 1 56 1968 0.65 0
#> 3 35 2 1 41 1977 1.34 0
#> 4 99 3 0 71 1968 2.90 0
#> 5 185 1 1 52 1965 12.08 1
#> 6 204 1 1 28 1971 4.84 1
## Create training and test sets
n <- round(nrow(Melanoma) * 2 / 3)
train <- head(Melanoma, n)
test <- tail(Melanoma, -n)
## Global formula for the analysis
fo <- Surv(time, status != 2) ~ sex + age + year + thickness + ulcer
Generalized boosted regression models are a tree-based ensemble method that can be applied to survival outcomes. They are available in the MachineShop package
with the function GBMModel
. A call to the function creates an instance of the model containing any user-specified model parameters and internal machinery for model fitting, prediction, and performance assessment. Created models can be supplied to the fit
function to estimate a relationship (fo
) between predictors and an outcome based on a set of data (train
). The importance of variables in a model fit is estimated with the varimp
function and plotted with plot
. Variable importance is a measure of the relative importance of predictors in a model and has a default range of 0 to 100, where 0 denotes the least important variables and 100 the most.
## Fit a generalized boosted model
gbmfit <- fit(fo, data = train, model = GBMModel)
## Predictor variable importance
(vi <- varimp(gbmfit))
#> Overall
#> year 100.00000
#> thickness 63.17344
#> age 33.19405
#> ulcer 13.76512
#> sex 0.00000
plot(vi)
From the model fit, predictions are obtained at 2, 5, and 10 years as survival probabilities (type = "prob"
) and as 0-1 death indicators (default: type = "response"
).
## Predict survival probabilities and outcomes at specified follow-up times
times <- 365 * c(2, 5, 10)
predict(gbmfit, newdata = test, times = times, type = "prob") %>% head
#> [,1] [,2] [,3]
#> [1,] 0.8659876 0.67458405 0.539432251
#> [2,] 0.8285838 0.59782467 0.446353240
#> [3,] 0.9837047 0.95604527 0.931946921
#> [4,] 0.8241337 0.58908117 0.436160014
#> [5,] 0.3119026 0.04127331 0.006752073
#> [6,] 0.8470514 0.63498855 0.490621484
predict(gbmfit, newdata = test, times = times) %>% head
#> [,1] [,2] [,3]
#> [1,] 0 0 0
#> [2,] 0 0 1
#> [3,] 0 0 0
#> [4,] 0 0 1
#> [5,] 1 1 1
#> [6,] 0 0 1
A call to modelmetrics
with observed and predicted outcomes will produce model performance metrics. The metrics produced depend on the type of the observed variable. In the case of a Surv
variable, the metrics are area under the ROC curve (Heagerty, Lumley, and Pepe 2004) and Brier score (Graf et al. 1999) at the specified times and overall time-integrated averages.
## Model performance metrics
obs <- response(fo, test)
pred <- predict(gbmfit, newdata = test, times = times, type = "prob")
modelmetrics(obs, pred, times = times)
#> ROC Brier ROCTime1 ROCTime2 ROCTime3 BrierTime1
#> 0.9322072 NaN 0.7293209 0.9829287 0.9829287 0.2124870
#> BrierTime2 BrierTime3
#> NaN NaN
Performance of a model can be estimated with resampling methods that simulate repeated training and test set fits and predictions. Performance metrics are computed on each resample to produce an empirical distribution for inference. Resampling is controlled in MachineShop
with control functions, such as BootControl for bootstrap resampling and CVControl for repeated K-fold cross-validation.
In our example, performance of models predicting survival at 2, 5, and 10 years will be estimated with five repeats of 10-fold cross-validation. The variable metrics
is defined to reduce the printed and plotted output in this vignette to only the time-integrated ROC and Brier metrics. Such subsetting of output would not be done in practice if all metrics are of interest.
## Control parameters for repeated K-fold cross-validation
control <- CVControl(
folds = 10,
repeats = 5,
surv_times = 365 * c(2, 5, 10)
)
## Metrics of interest
metrics <- c("ROC", "Brier")
Resampling is implemented with the foreach
package (Microsoft and Weston 2017b) and will run in parallel if a compatible backend is loaded, such as that provided by the doParallel
package (Microsoft and Weston 2017a).
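For instance, a doParallel backend might be registered before resampling; this is a minimal sketch, and the number of workers shown is arbitrary:

```r
## Register a parallel backend so that subsequent resample() calls
## distribute computations across workers
library(doParallel)
registerDoParallel(cores = 2)

## ... resampling code here ...

## Release the workers when finished
stopImplicitCluster()
```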
Resampling of a single model is performed with the resample
function applied to a model object (e.g. GBMModel
) and a control object like the one defined previously (control
). Summary statistics and plots can be obtained with the summary
and plot
functions.
## Resample estimation
(perf <- resample(fo, data = Melanoma, model = GBMModel, control = control))
#> An object of class "Resamples"
#>
#> Metrics: ROC, Brier, ROCTime1, ROCTime2, ROCTime3, BrierTime1, BrierTime2, BrierTime3
#>
#> Resamples control object of class "CVMLControl"
#>
#> Method: K-Fold Cross-Validation
#>
#> Folds: 10
#>
#> Repeats: 5
#>
#> Class cutoff probability: 0.5
#>
#> Survival times: 730, 1825, 3650
#>
#> Omit missing responses: TRUE
#>
#> Seed: 1235503296
summary(perf)
#> Mean Median SD Min Max NA
#> ROC 0.68604239 0.70874170 0.13112744 0.30403646 0.9243192 0.00
#> Brier 0.19842301 0.18785439 0.05065880 0.11119681 0.3191915 0.06
#> ROCTime1 0.70223615 0.78703704 0.25757157 0.00000000 1.0000000 0.00
#> ROCTime2 0.71955421 0.71661879 0.12419261 0.40947014 0.9625000 0.00
#> ROCTime3 0.65945780 0.66159861 0.13864165 0.32682292 0.9081753 0.00
#> BrierTime1 0.08801901 0.09111647 0.03456247 0.00964609 0.1742736 0.00
#> BrierTime2 0.18637499 0.18367866 0.04184711 0.10990137 0.3386482 0.00
#> BrierTime3 0.24878603 0.23675365 0.09215865 0.08383085 0.5169392 0.06
plot(perf, metrics = metrics)
Resampled metrics from different models can be combined for comparison with the Resamples
function. Names given on the left hand side of the equal operators in the call to Resamples
will be used as labels in output from the summary
and plot
functions. For these types of model comparisons, the same control structure should be used in all associated calls to resample
to ensure that resulting model metrics are computed on the same resampled training and test sets.
## Resample estimation
gbmperf1 <- resample(fo, data = Melanoma, model = GBMModel(n.trees = 25), control = control)
gbmperf2 <- resample(fo, data = Melanoma, model = GBMModel(n.trees = 50), control = control)
gbmperf3 <- resample(fo, data = Melanoma, model = GBMModel(n.trees = 100), control = control)
## Combine resamples for comparison
(perf <- Resamples(GBM1 = gbmperf1, GBM2 = gbmperf2, GBM3 = gbmperf3))
#> An object of class "Resamples"
#>
#> Models: GBM1, GBM2, GBM3
#>
#> Metrics: ROC, Brier, ROCTime1, ROCTime2, ROCTime3, BrierTime1, BrierTime2, BrierTime3
#>
#> Resamples control object of class "CVMLControl"
#>
#> Method: K-Fold Cross-Validation
#>
#> Folds: 10
#>
#> Repeats: 5
#>
#> Class cutoff probability: 0.5
#>
#> Survival times: 730, 1825, 3650
#>
#> Omit missing responses: TRUE
#>
#> Seed: 1235503296
summary(perf)[, , metrics]
#> , , ROC
#>
#> Mean Median SD Min Max NA
#> GBM1 0.7126959 0.7249419 0.1303829 0.3231771 0.9539048 0
#> GBM2 0.7078389 0.7232105 0.1289528 0.3888021 0.9324742 0
#> GBM3 0.6860424 0.7087417 0.1311274 0.3040365 0.9243192 0
#>
#> , , Brier
#>
#> Mean Median SD Min Max NA
#> GBM1 0.1860707 0.1760822 0.04476337 0.10706487 0.2978184 0.06
#> GBM2 0.1892243 0.1814450 0.04603588 0.09965699 0.2994825 0.06
#> GBM3 0.1984230 0.1878544 0.05065880 0.11119681 0.3191915 0.06
plot(perf, metrics = metrics)
Pairwise model differences for each metric can be calculated with the diff
function applied to results from a call to Resamples
. The differences can be summarized descriptively with the summary
and plot
functions and assessed for statistical significance with the t.test
function.
## Pairwise model comparisons
(perfdiff <- diff(perf))
#> An object of class "ResamplesDiff"
#>
#> Models: GBM1 - GBM2, GBM1 - GBM3, GBM2 - GBM3
#>
#> Metrics: ROC, Brier, ROCTime1, ROCTime2, ROCTime3, BrierTime1, BrierTime2, BrierTime3
#>
#> Resamples control object of class "CVMLControl"
#>
#> Method: K-Fold Cross-Validation
#>
#> Folds: 10
#>
#> Repeats: 5
#>
#> Class cutoff probability: 0.5
#>
#> Survival times: 730, 1825, 3650
#>
#> Omit missing responses: TRUE
#>
#> Seed: 1235503296
summary(perfdiff)[, , metrics]
#> , , ROC
#>
#> Mean Median SD Min Max NA
#> GBM1 - GBM2 0.004857001 0.005014928 0.02777305 -0.06562500 0.0637349 0
#> GBM1 - GBM3 0.026653553 0.025452828 0.04132958 -0.04401444 0.1605615 0
#> GBM2 - GBM3 0.021796551 0.026398966 0.03335346 -0.02937940 0.1339694 0
#>
#> , , Brier
#>
#> Mean Median SD Min Max
#> GBM1 - GBM2 -0.003153596 -0.003936083 0.00906714 -0.02699968 0.01698245
#> GBM1 - GBM3 -0.012352336 -0.010910743 0.01560477 -0.04889790 0.01718238
#> GBM2 - GBM3 -0.009198740 -0.006745434 0.01141701 -0.03430725 0.01210852
#> NA
#> GBM1 - GBM2 0.06
#> GBM1 - GBM3 0.06
#> GBM2 - GBM3 0.06
plot(perfdiff, metrics = metrics)
t.test(perfdiff)[, , metrics]
#> , , ROC
#>
#> GBM1 GBM2 GBM3
#> GBM1 NA 4.857001e-03 0.02665355
#> GBM2 2.221297e-01 NA 0.02179655
#> GBM3 8.398902e-05 8.398902e-05 NA
#>
#> , , Brier
#>
#> GBM1 GBM2 GBM3
#> GBM1 NA -3.153596e-03 -0.01235234
#> GBM2 2.128534e-02 NA -0.00919874
#> GBM3 4.467414e-06 4.467414e-06 NA
Modelling functions may have arguments that define parameters in their model fitting algorithms. For example, GBMModel
has arguments n.trees
, interaction.depth
, and n.minobsinnode
that define the number of decision trees to fit, the maximum depth of variable interactions, and the minimum number of observations in the trees' terminal nodes. The tune
function is available to fit a model over a grid of parameters and return the model whose parameters provide the optimal fit. Note that the function name GBMModel
, and not the function call GBMModel()
, is supplied as the first argument to tune
. Summary statistics and plots of performance across all tuning parameters are available with the summary
and plot
functions.
## Tune over a grid of model parameters
(gbmtune <- tune(fo, data = Melanoma, model = GBMModel,
grid = expand.grid(n.trees = c(25, 50, 100),
interaction.depth = 1:3,
n.minobsinnode = c(5, 10)),
control = control))
#> An object of class "MLModelTune"
#>
#> Name: GBMModel
#>
#> Required packages: gbm
#>
#> Response types: factor, numeric, Surv
#>
#> Parameters:
#> $n.trees
#> [1] 25
#>
#> $interaction.depth
#> [1] 1
#>
#> $n.minobsinnode
#> [1] 10
#>
#> $shrinkage
#> [1] 0.1
#>
#> $bag.fraction
#> [1] 0.5
#>
#> grid:
#> n.trees interaction.depth n.minobsinnode
#> 1 25 1 5
#> 2 50 1 5
#> 3 100 1 5
#> 4 25 2 5
#> 5 50 2 5
#> 6 100 2 5
#> 7 25 3 5
#> 8 50 3 5
#> 9 100 3 5
#> 10 25 1 10
#> 11 50 1 10
#> 12 100 1 10
#> 13 25 2 10
#> 14 50 2 10
#> 15 100 2 10
#> 16 25 3 10
#> 17 50 3 10
#> 18 100 3 10
#>
#> resamples:
#> An object of class "Resamples"
#>
#> Models: GBMModel, GBMModel.1, GBMModel.2, GBMModel.3, GBMModel.4, GBMModel.5, GBMModel.6, GBMModel.7, GBMModel.8, GBMModel.9, GBMModel.10, GBMModel.11, GBMModel.12, GBMModel.13, GBMModel.14, GBMModel.15, GBMModel.16, GBMModel.17
#>
#> Metrics: ROC, Brier, ROCTime1, ROCTime2, ROCTime3, BrierTime1, BrierTime2, BrierTime3
#>
#> Resamples control object of class "CVMLControl"
#>
#> Method: K-Fold Cross-Validation
#>
#> Folds: 10
#>
#> Repeats: 5
#>
#> Class cutoff probability: 0.5
#>
#> Survival times: 730, 1825, 3650
#>
#> Omit missing responses: TRUE
#>
#> Seed: 1235503296
#>
#> Selected: Model10 (ROC)
summary(gbmtune)[, , metrics]
#> , , ROC
#>
#> Mean Median SD Min Max NA
#> GBMModel 0.7084099 0.7334550 0.1317115 0.2912007 0.9378315 0
#> GBMModel.1 0.7044935 0.7216624 0.1280673 0.3786458 0.9209873 0
#> GBMModel.2 0.6914334 0.7205510 0.1229936 0.4113761 0.9116389 0
#> GBMModel.3 0.6920790 0.7160629 0.1221498 0.3727638 0.9132602 0
#> GBMModel.4 0.6848302 0.6918460 0.1250106 0.3704309 0.9528189 0
#> GBMModel.5 0.6691489 0.6816788 0.1391156 0.2993490 0.9158057 0
#> GBMModel.6 0.6879942 0.6914624 0.1282246 0.3758064 0.9396974 0
#> GBMModel.7 0.6769710 0.6873528 0.1358296 0.3204427 0.9218421 0
#> GBMModel.8 0.6527250 0.6528089 0.1456248 0.2634115 0.9004983 0
#> GBMModel.9 0.7126959 0.7249419 0.1303829 0.3231771 0.9539048 0
#> GBMModel.10 0.7078389 0.7232105 0.1289528 0.3888021 0.9324742 0
#> GBMModel.11 0.6860424 0.7087417 0.1311274 0.3040365 0.9243192 0
#> GBMModel.12 0.6994989 0.7266897 0.1234906 0.3975260 0.9485726 0
#> GBMModel.13 0.6921657 0.7104928 0.1336868 0.3523811 0.9336789 0
#> GBMModel.14 0.6814255 0.6918727 0.1373640 0.3334635 0.9053087 0
#> GBMModel.15 0.6956087 0.7147426 0.1366645 0.3521014 0.9636788 0
#> GBMModel.16 0.6929889 0.6900073 0.1289404 0.3296225 0.9262851 0
#> GBMModel.17 0.6744827 0.6672509 0.1368483 0.3111367 0.9409702 0
#>
#> , , Brier
#>
#> Mean Median SD Min Max NA
#> GBMModel 0.1902605 0.1813742 0.04592658 0.11447004 0.3193480 0.06
#> GBMModel.1 0.1961846 0.1886240 0.04975919 0.10223360 0.3154243 0.06
#> GBMModel.2 0.2078215 0.1938710 0.05463432 0.10857945 0.3326337 0.06
#> GBMModel.3 0.2000988 0.1889826 0.05218615 0.11031448 0.3438735 0.06
#> GBMModel.4 0.2083254 0.1956902 0.05685070 0.10189786 0.4286941 0.06
#> GBMModel.5 0.2208186 0.2032064 0.06503518 0.11316272 0.4389034 0.06
#> GBMModel.6 0.2064209 0.1902088 0.06189791 0.11349287 0.4031801 0.06
#> GBMModel.7 0.2158404 0.2015343 0.06175487 0.12184607 0.4338749 0.06
#> GBMModel.8 0.2298594 0.2177718 0.06491117 0.13337052 0.4328804 0.06
#> GBMModel.9 0.1860707 0.1760822 0.04476337 0.10706487 0.2978184 0.06
#> GBMModel.10 0.1892243 0.1814450 0.04603588 0.09965699 0.2994825 0.06
#> GBMModel.11 0.1984230 0.1878544 0.05065880 0.11119681 0.3191915 0.06
#> GBMModel.12 0.1913985 0.1792700 0.04994485 0.10789194 0.3220089 0.06
#> GBMModel.13 0.1934831 0.1834812 0.04891230 0.10869662 0.3102272 0.06
#> GBMModel.14 0.2048976 0.1832971 0.05809369 0.12257865 0.3389919 0.06
#> GBMModel.15 0.1923398 0.1777437 0.05219178 0.11132775 0.3530243 0.06
#> GBMModel.16 0.1967562 0.1889953 0.05172487 0.11615526 0.3188928 0.06
#> GBMModel.17 0.2100014 0.1980636 0.05540928 0.12776140 0.3323675 0.06
plot(gbmtune, type = "line", metrics = metrics)
The value returned by tune
contains an object produced by a call to the modelling function with the optimal tuning parameters. Thus, the value can be passed on to the fit
function for model fitting to a set of data.
## Fit the tuned model
gbmfit <- fit(fo, data = Melanoma, model = gbmtune)
(vi <- varimp(gbmfit))
#> Overall
#> thickness 100.000000
#> age 52.547848
#> ulcer 32.768961
#> sex 8.645537
#> year 0.000000
plot(vi)
Ensemble methods combine multiple base learning algorithms as a strategy to improve predictive performance. Two ensemble methods implemented in MachineShop
are stacked regression (Breiman 1996) and super learners (van der Laan and Hubbard 2007). Stacked regression fits a linear combination of resampled predictions from specified base learners, whereas super learners fit a specified model, such as GBMModel
, to the base learner predictions and optionally also to the original predictor variables. Illustrated below is a performance evaluation of stacked regression and a super learner fit to gradient boosted, conditional forest, and Lasso-based Cox regression base learners. In the latter case, a separate gradient boosted model is used as the super learner by default.
## Stacked regression
stackedperf <- resample(fo, data = Melanoma,
model = StackedModel(GBMModel, CForestModel, GLMNetModel(lambda = 0.1)))
summary(stackedperf)
#> Mean Median SD Min Max NA
#> CIndex 0.7234404 0.7033279 0.1121352 0.5698925 0.9174312 0
## Super learner
superperf <- resample(fo, data = Melanoma,
model = SuperModel(GBMModel, CForestModel, GLMNetModel(lambda = 0.1)))
summary(superperf)
#> Mean Median SD Min Max NA
#> CIndex 0.6949448 0.6904762 0.09489321 0.5634328 0.82 0
Partial dependence plots display the marginal effects of predictors on the response variable. The response scale displayed in the plots will depend on the response type; i.e. probabilities for factor, original scale for numeric, and proportional risk for Surv types.
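A sketch of computing and plotting partial dependence for the melanoma fit follows; the dependence function and its select argument are assumed from the package interface:

```r
## Marginal effects of selected predictors on predicted risk
## (dependence() and its select argument are assumed here)
pd <- dependence(gbmfit, select = c(thickness, age))
plot(pd)
```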
Agreement between model-predicted and observed values can be visualized with calibration curves. In the construction of these curves, cases are partitioned into bins according to their (resampled) predicted responses. Mean observed responses are then calculated within each of the bins and plotted on the vertical axis against the bin midpoints on the horizontal axis. Calibration curves that are close to the 45-degree line indicate close agreement between observed and predicted responses and a model that is said to be well calibrated.
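A sketch, assuming a calibration function that accepts the resampled survival output produced earlier:

```r
## Calibration curves from resampled survival predictions
## (calibration() is assumed to accept a Resamples object)
cal <- calibration(perf)
plot(cal)
```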
Lift curves depict the rate at which observed binary responses are identifiable from (resampled) predicted response probabilities. They are constructed by first sorting predicted responses in descending order. Then, the cumulative percent of positive responses (true positive findings) is plotted against the cumulative number of cases (positive test rates) in the ordering. Accordingly, the curve represents the rate at which positive responses are found as a function of the positive test rate among cases.
## Requires a binary outcome
fo_surv5 <- factor(time > 365 * 5) ~ sex + age + year + thickness + ulcer
df_surv5 <- subset(Melanoma, status != 2)
perf_surv5 <- resample(fo_surv5, data = df_surv5, model = GBMModel)
lf <- lift(perf_surv5)
plot(lf, find = 75)
Categorical responses with two or more levels should be coded as a factor
variable for analysis. The metrics returned will depend on the number of factor levels. Metrics for factors with two levels are as follows.
An Index metric is additionally computed as a function of sensitivity and specificity specified by the argument cutoff_index
in the resampling control functions (default: Sensitivity + Specificity). The argument allows for specification of tradeoffs (Perkins and Schisterman 2006) other than the default of Youden's J statistic (Youden 1950).
Brier, ROCAUC, and PRAUC are computed directly on predicted class probabilities. The others are computed on predicted class membership. Memberships are defined to be in the second factor level if predicted probabilities are greater than a cutoff value defined in the resampling control functions (default: cutoff = 0.5
).
### Pima Indians diabetes statuses (2 levels)
library(MASS)
perf <- resample(factor(type) ~ ., data = Pima.tr, model = GBMModel)
summary(perf)
#> Mean Median SD Min Max NA
#> Accuracy 0.7357769 0.7500000 0.08295637 0.60000000 0.8500000 0
#> Kappa 0.3943684 0.4110310 0.20804864 0.07894737 0.6808511 0
#> Brier 0.1683798 0.1619929 0.05292097 0.09646383 0.2798856 0
#> ROCAUC 0.8355835 0.8461538 0.09793240 0.64835165 0.9487179 0
#> PRAUC 0.6179464 0.6391936 0.10132909 0.40034289 0.7629630 0
#> Sensitivity 0.5785714 0.5714286 0.23221577 0.14285714 0.8571429 0
#> Specificity 0.8192308 0.8461538 0.07832535 0.69230769 0.9230769 0
#> Index 1.3978022 1.3873626 0.22287169 1.06593407 1.7032967 0
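The cutoff can be changed from its default through the resampling control functions; a sketch with the same Pima data, assuming CVControl accepts the cutoff argument described above:

```r
## Resample with a stricter classification cutoff (sketch; the cutoff
## argument of CVControl is assumed from the description above)
library(MASS)
perf_cut <- resample(factor(type) ~ ., data = Pima.tr, model = GBMModel,
                     control = CVControl(cutoff = 0.7))
summary(perf_cut)
```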
Metrics for factors with more than two levels are as described below.
Brier and MLogLoss are computed directly on predicted class probabilities. The others are computed on predicted class membership, defined as the factor level with the highest predicted probability.
### Iris flowers species (3 levels)
perf <- resample(factor(Species) ~ ., data = iris, model = GBMModel)
summary(perf)
#> Mean Median SD Min Max NA
#> Accuracy 0.94666667 0.9333333 0.05258738 8.666667e-01 1.0000000 0
#> Kappa 0.92000000 0.9000000 0.07888106 8.000000e-01 1.0000000 0
#> Brier 0.08841658 0.1105564 0.08458143 1.805483e-06 0.2429727 0
#> MLogLoss 0.23594251 0.1602184 0.26788022 5.807553e-04 0.7098453 0
Numerical responses should be coded as a numeric
variable. Associated performance metrics are as defined below and illustrated with Boston housing price data (Venables and Ripley 2002).
### Boston housing prices
library(MASS)
perf <- resample(medv ~ ., data = Boston, model = GBMModel)
summary(perf)
#> Mean Median SD Min Max NA
#> R2 0.8150521 0.820236 0.0705755 0.7148417 0.9083531 0
#> RMSE 3.8567491 3.339410 0.9910331 3.0042741 5.7033421 0
#> MAE 2.6816239 2.527563 0.3394829 2.2773279 3.2836166 0
Survival responses should be coded as a Surv
variable. In addition to the ROC and Brier survival metrics described earlier in the vignette, the concordance index (Harrell et al. 1982) can be obtained if follow-up times are not specified for the prediction.
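Following the earlier melanoma example, a sketch of obtaining the concordance index by omitting the times argument from the prediction:

```r
## Concordance index from predictions without follow-up times
## (sketch based on the earlier Melanoma fit)
obs <- response(fo, test)
pred <- predict(gbmfit, newdata = test)
modelmetrics(obs, pred)
```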
Model specification here refers to the relationship between the response and predictor variables and the data used to estimate it. Three main types of specification are supported by the fit
, resample
, and tune
functions: formulas, model frames, and recipes.
Models may be specified with the traditional formula and data frame pair, as was done in the previous examples. In this specification, in-line functions, interactions, and .
substitution of variables not already appearing in the formula may be included.
## Formula specification
gbmfit <- fit(medv ~ ., data = Boston, model = GBMModel)
varimp(gbmfit)
#> Overall
#> lstat 100.0000000
#> rm 67.8362562
#> dis 9.6903670
#> nox 8.0869277
#> crim 6.9986739
#> ptratio 4.8112413
#> tax 2.8209371
#> chas 1.0791848
#> black 0.6979187
#> rad 0.3775628
#> zn 0.0000000
#> indus 0.0000000
#> age 0.0000000
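In-line functions and interaction terms can likewise appear in the formula; the variable choices below are illustrative only:

```r
## Formula with an in-line function and an interaction term (illustrative)
gbmfit_fo <- fit(log(medv) ~ lstat * rm + age, data = Boston, model = GBMModel)
```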
The second specification is similar to the first, except the formula and data frame pair are given in a ModelFrame
. The model frame approach has a few subtle advantages. One is that cases with missing values on any of the response or predictor variables are excluded from the model frame by default. This is often desirable for models that cannot handle missing values. Note, however, that some models like GBMModel
do accommodate missing values. For those, missing values can be retained in the model frame by setting its argument na.action = na.pass
.
## Model frame specification
mf <- ModelFrame(medv ~ ., data = Boston)
gbmfit <- fit(mf, model = GBMModel)
varimp(gbmfit)
#> Overall
#> lstat 100.0000000
#> rm 82.6120444
#> dis 10.8108254
#> nox 9.1394081
#> crim 8.9966411
#> ptratio 5.8691895
#> chas 1.9284354
#> tax 1.7704196
#> age 0.9264470
#> black 0.7716104
#> zn 0.0000000
#> indus 0.0000000
#> rad 0.0000000
Another advantage is that case weights can be included in the model frame and will be passed on to the model fitting functions.
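A sketch, assuming ModelFrame accepts a weights argument; the uniform weights shown are for illustration only:

```r
## Model frame with case weights (the weights argument is assumed)
wts <- rep(1, nrow(Boston))
mf <- ModelFrame(medv ~ ., data = Boston, weights = wts)
gbmfit_w <- fit(mf, model = GBMModel)
```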
The recipes
package (Kuhn and Wickham 2018) provides a framework for defining predictor and response variables and preprocessing steps to be applied to them prior to model fitting. Using recipes helps ensure that estimation of predictive performance accounts for all modeling steps. They are also a very convenient way of consistently applying preprocessing to new data. Recipes currently support factor
and numeric
responses, but not generally Surv
.
## Recipe specification
library(recipes)
rec <- recipe(medv ~ ., data = Boston) %>%
step_center(all_predictors()) %>%
step_scale(all_predictors()) %>%
step_pca(all_predictors())
gbmfit <- fit(rec, model = GBMModel)
varimp(gbmfit)
#> Overall
#> PC1 100.00000
#> PC3 62.32739
#> PC4 21.66044
#> PC5 14.50879
#> PC2 0.00000
Currently available model functions are summarized in the table below according to the types of response variables with which each model can be used. The package additionally supplies a generic MLModel
function for users to create their own custom models.
| Model | Constructor | factor | numeric | ordered | Surv |
|---|---|---|---|---|---|
| C5.0 Classification | C50Model | x | | | |
| Conditional Inference Trees | CForestModel | x | x | | x |
| Cox Regression | CoxModel | | | | x |
| Cox Regression (Stepwise) | CoxStepAICModel | | | | x |
| Generalized Linear Models | GLMModel | 2 | x | | |
| Generalized Linear Models (Stepwise) | GLMStepAICModel | 2 | x | | |
| Gradient Boosted Models | GBMModel | x | x | | x |
| Lasso and Elastic-Net | GLMNetModel | x | x | | x |
| K-Nearest Neighbors Model | KNNModel | x | x | x | |
| Feed-Forward Neural Networks | NNetModel | x | x | | |
| Partial Least Squares | PLSModel | x | x | | |
| Ordered Logistic Regression | POLRModel | | | x | |
| Random Forests | RandomForestModel | x | x | | |
| Stacked Regression | StackedModel | x | x | x | x |
| Super Learner | SuperModel | x | x | x | x |
| Survival Regression | SurvRegModel | | | | x |
| Survival Regression (Stepwise) | SurvRegStepAICModel | | | | x |
| Support Vector Machines | SVMModel | x | x | | |
| Extreme Gradient Boosting | XGBModel | x | x | | |
Andersen, PK, O Borgan, RD Gill, and N Keiding. 1993. Statistical Models Based on Counting Processes. New York: Springer.
Bache, Stefan Milton, and Hadley Wickham. 2014. Magrittr: A Forward-Pipe Operator for R. https://CRAN.R-project.org/package=magrittr.
Breiman, L. 1996. “Stacked Regression.” Machine Learning 24: 49–64.
Graf, E, C Schmoor, W Sauerbrei, and M Schumacher. 1999. “Assessment and Comparison of Prognostic Classification Schemes for Survival Data.” Statistics in Medicine 18 (17–18): 2529–45.
Harrell, FE, RM Califf, DB Pryor, KL Lee, and RA Rosati. 1982. “Evaluating the Yield of Medical Tests.” JAMA 247 (18): 2543–6.
Heagerty, PJ, T Lumley, and MS Pepe. 2004. “Time-Dependent ROC Curves for Censored Survival Data and a Diagnostic Marker.” Biometrics 56 (2): 337–44.
Kuhn, Max, and Hadley Wickham. 2018. Recipes: Preprocessing Tools to Create Design Matrices. https://CRAN.R-project.org/package=recipes.
Laan, MJ van der, and AE Hubbard. 2007. “Super Learner.” Statistical Applications in Genetics and Molecular Biology 6 (1).
Microsoft, and Steve Weston. 2017a. DoParallel: Foreach Parallel Adaptor for the ’Parallel’ Package. https://CRAN.R-project.org/package=doParallel.
———. 2017b. Foreach: Provides Foreach Looping Construct for R. https://CRAN.R-project.org/package=foreach.
Perkins, Neil J., and Enrique F. Schisterman. 2006. “The Inconsistency of "Optimal" Cutpoints Obtained Using Two Criteria Based on the Receiver Operating Characteristic Curve.” American Journal of Epidemiology 163 (7): 670–75.
Venables, WN, and BD Ripley. 2002. Modern Applied Statistics with S. Fourth. New York: Springer. http://www.stats.ox.ac.uk/pub/MASS4.
Youden, WJ. 1950. “Index for Rating Diagnostic Tests.” Cancer 3 (1): 32–35.