{splitTools} is a fast, lightweight toolkit for data splitting.
Its two main functions partition() and create_folds() support

- data partitioning (e.g., into training, validation, and test sets),
- creating (repeated) folds for cross-validation (CV),
- stratified splitting (e.g., for stratified CV),
- grouped splitting (e.g., for group-k-fold CV), as well as
- blocked splitting (when the sequential order of the data should be retained).
The function create_timefolds()
does time-series
splitting in the sense that the out-of-sample data follows the in-sample
data.
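Since create_timefolds() is not part of the workflow below, here is a minimal sketch of how it could be called (our assumption, based on its documented interface: with k = 5 it yields four folds, each holding insample and outsample index vectors):

library(splitTools)

# Simulated series: out-of-sample indices always follow the in-sample ones
y_ts <- cumsum(rnorm(100))
t_folds <- create_timefolds(y_ts, k = 5)
str(t_folds[[1]])  # list with components insample and outsample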
We will now illustrate how to use {splitTools} in a typical modeling workflow.
We will go through the following steps:
1. Split the iris data into 60% training, 20% validation, and 20% test data, stratified by the variable Sepal.Length. Since this variable is numeric, stratification uses quantile binning.
2. Model Sepal.Length with a linear regression, once with and once without interaction between Species and Sepal.Width.
3. Pick the model with the lower validation RMSE and assess its performance on the test data.

library(splitTools)
# Split data into partitions
set.seed(3451)
inds <- partition(iris$Sepal.Length, p = c(train = 0.6, valid = 0.2, test = 0.2))
str(inds)
#> List of 3
#> $ train: int [1:81] 2 3 6 7 8 10 11 18 19 20 ...
#> $ valid: int [1:34] 1 12 14 15 27 34 36 38 42 48 ...
#> $ test : int [1:35] 4 5 9 13 16 17 25 39 41 45 ...
train <- iris[inds$train, ]
valid <- iris[inds$valid, ]
test <- iris[inds$test, ]
rmse <- function(y, pred) {
  sqrt(mean((y - pred)^2))
}
# Use simple validation to decide on interaction yes/no...
fit1 <- lm(Sepal.Length ~ ., data = train)
fit2 <- lm(Sepal.Length ~ . + Species:Sepal.Width, data = train)
rmse(valid$Sepal.Length, predict(fit1, valid))
#> [1] 0.3020855
rmse(valid$Sepal.Length, predict(fit2, valid))
#> [1] 0.2954321
# Yes! Choose and test final model
rmse(test$Sepal.Length, predict(fit2, test))
#> [1] 0.3482849
Since the iris
data consists of only 150 rows, investing
20% of observations for validation seems like a waste of resources.
Furthermore, the performance estimates might not be very robust. Let’s
replace simple validation with five-fold CV, again using stratification on
the response variable.
We split iris into 80% training data and 20% test data, stratified by the variable Sepal.Length.

# Split into training and test
inds <- partition(iris$Sepal.Length, p = c(train = 0.8, test = 0.2), seed = 87)
train <- iris[inds$train, ]
test <- iris[inds$test, ]
# Get stratified CV in-sample indices
folds <- create_folds(train$Sepal.Length, k = 5, seed = 2734)
# Vectors with results per model and fold
cv_rmse1 <- cv_rmse2 <- numeric(5)
for (i in seq_along(folds)) {
  insample <- train[folds[[i]], ]
  out <- train[-folds[[i]], ]

  fit1 <- lm(Sepal.Length ~ ., data = insample)
  fit2 <- lm(Sepal.Length ~ . + Species:Sepal.Width, data = insample)

  cv_rmse1[i] <- rmse(out$Sepal.Length, predict(fit1, out))
  cv_rmse2[i] <- rmse(out$Sepal.Length, predict(fit2, out))
}
# CV-RMSE of model 1 -> close winner
mean(cv_rmse1)
#> [1] 0.330189
# CV-RMSE of model 2
mean(cv_rmse2)
#> [1] 0.3306455
# Fit model 1 on full training data and evaluate on test data
final_fit <- lm(Sepal.Length ~ ., data = train)
rmse(test$Sepal.Length, predict(final_fit, test))
#> [1] 0.2892289
If feasible, repeated CV is recommended in order to reduce uncertainty in decisions. Apart from the repeated folds, the process remains the same.
# Train/test split as before
# 15 folds instead of 5
folds <- create_folds(train$Sepal.Length, k = 5, seed = 2734, m_rep = 3)
cv_rmse1 <- cv_rmse2 <- numeric(15)
# Rest as before...
for (i in seq_along(folds)) {
  insample <- train[folds[[i]], ]
  out <- train[-folds[[i]], ]

  fit1 <- lm(Sepal.Length ~ ., data = insample)
  fit2 <- lm(Sepal.Length ~ . + Species:Sepal.Width, data = insample)

  cv_rmse1[i] <- rmse(out$Sepal.Length, predict(fit1, out))
  cv_rmse2[i] <- rmse(out$Sepal.Length, predict(fit2, out))
}
mean(cv_rmse1)
#> [1] 0.3296087
mean(cv_rmse2)
#> [1] 0.331373
# Refit and test as before
The function multi_strata() creates a stratification factor from multiple
columns that can then be passed to create_folds(y, type = "stratified")
or partition(y, type = "stratified"). The resulting partitions
will be (quite) balanced regarding these columns.
Two grouping strategies are offered (the second is sketched below):

- k-means cluster analysis on the (scaled) columns (strategy = "kmeans", the default), and
- forming all combinations of the columns, after binning numeric ones (strategy = "interaction").
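As a quick sketch of the second strategy (assuming strategy = "interaction" is accepted alongside the default "kmeans"):

# Stratification factor from all combinations of the (binned) columns
y_int <- multi_strata(iris[c("Sepal.Length", "Species")], strategy = "interaction")
table(y_int)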
Let’s have a look at a simple example where we want to model “Sepal.Length” as a function of the other variables in the iris data set. We want to do a stratified train/valid/test split, aiming to be balanced not only regarding the response “Sepal.Length”, but also regarding the important predictor “Species”. In this case, we could use the following workflow:
set.seed(3451)
ir <- iris[c("Sepal.Length", "Species")]
y <- multi_strata(ir, k = 5)
inds <- partition(
  y, p = c(train = 0.6, valid = 0.2, test = 0.2), split_into_list = FALSE
)
# Check
by(ir, inds, summary)
#> inds: train
#> Sepal.Length Species
#> Min. :4.300 setosa :30
#> 1st Qu.:5.100 versicolor:30
#> Median :5.800 virginica :30
#> Mean :5.836
#> 3rd Qu.:6.400
#> Max. :7.700
#> ------------------------------------------------------------
#> inds: valid
#> Sepal.Length Species
#> Min. :4.400 setosa :10
#> 1st Qu.:5.425 versicolor:10
#> Median :5.900 virginica :10
#> Mean :5.903
#> 3rd Qu.:6.300
#> Max. :7.900
#> ------------------------------------------------------------
#> inds: test
#> Sepal.Length Species
#> Min. :4.700 setosa :10
#> 1st Qu.:5.100 versicolor:10
#> Median :5.700 virginica :10
#> Mean :5.807
#> 3rd Qu.:6.475
#> Max. :7.100
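The same factor could feed cross-validation as well; a sketch, assuming create_folds() treats y like any other stratification variable:

# Five CV folds, (quite) balanced regarding Sepal.Length and Species
folds_multi <- create_folds(y, k = 5, type = "stratified")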