{splitTools} is a fast, lightweight toolkit for data splitting.
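Its two main functions partition() and create_folds() support data
partitioning (for example into training, validation, and test sets),
creating folds for cross-validation (CV), repeated CV, and stratified
splitting.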
The function create_timefolds() does time-series
splitting in the sense that the out-of-sample data follows the in-sample
data.
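As a brief illustration of this time-series behaviour, the sketch below calls create_timefolds() on a toy series. The vector `y` and the choice `k = 3` are assumptions made for this example only; each fold is accessed through its `insample` and `outsample` elements.

```r
library(splitTools)

# Toy series of 12 ordered observations (illustrative data only)
y <- seq_len(12)

# Three time-series folds
folds <- create_timefolds(y, k = 3)
str(folds)

# In every fold, all out-of-sample indices come after the in-sample indices
ok <- vapply(
  folds,
  function(f) max(f$insample) < min(f$outsample),
  logical(1)
)
stopifnot(ok)
```

Because the order of the observations is preserved, no future observation leaks into the in-sample part of a fold.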
We will now illustrate how to use {splitTools} in a typical modeling workflow.
We will go through the following steps:
1. Split the iris data into 60% training, 20% validation, and 20% test data, stratified by the variable Sepal.Length. Since this variable is numeric, stratification uses quantile binning.
2. Model the response Sepal.Length with a linear regression, once with and once without interaction between Species and Sepal.Width.
3. Select the better of the two models by validation RMSE and evaluate it on the test data.

library(splitTools)
# Split data into partitions
set.seed(3451)
inds <- partition(iris$Sepal.Length, p = c(train = 0.6, valid = 0.2, test = 0.2))
str(inds)
#> List of 3
#> $ train: int [1:81] 2 3 6 7 8 10 11 18 19 20 ...
#> $ valid: int [1:34] 1 12 14 15 27 34 36 38 42 48 ...
#> $ test : int [1:35] 4 5 9 13 16 17 25 39 41 45 ...
train <- iris[inds$train, ]
valid <- iris[inds$valid, ]
test <- iris[inds$test, ]
rmse <- function(y, pred) {
  sqrt(mean((y - pred)^2))
}
# Use simple validation to decide on interaction yes/no...
fit1 <- lm(Sepal.Length ~ ., data = train)
fit2 <- lm(Sepal.Length ~ . + Species:Sepal.Width, data = train)
rmse(valid$Sepal.Length, predict(fit1, valid))
#> [1] 0.3020855
rmse(valid$Sepal.Length, predict(fit2, valid))
#> [1] 0.2954321
# Yes! Choose and test final model
rmse(test$Sepal.Length, predict(fit2, test))
#> [1] 0.3482849

Since the iris data consists of only 150 rows, investing
20% of observations for validation seems like a waste of resources.
Furthermore, the performance estimates might not be very robust. Let’s
replace simple validation by five-fold CV, again using stratification on
the response variable.
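To make "stratification on a numeric response" more concrete, here is a minimal base-R sketch of the idea: bin the response at its quantiles, then hand out fold labels within each bin. It is an illustration under simplifying assumptions, not the implementation used by {splitTools}; the helper name `stratified_fold_ids` and its default of five bins are hypothetical.

```r
# Hypothetical helper: assign each observation to one of k folds so that
# every fold receives observations from every quantile bin of y
stratified_fold_ids <- function(y, k = 5, n_bins = 5) {
  # Bin the numeric response at its empirical quantiles
  breaks <- unique(quantile(y, probs = seq(0, 1, length.out = n_bins + 1)))
  bins <- cut(y, breaks = breaks, include.lowest = TRUE)

  # Within each bin, hand out fold labels 1..k in random order
  fold_id <- integer(length(y))
  for (idx in split(seq_along(y), bins)) {
    fold_id[idx] <- sample(rep_len(seq_len(k), length(idx)))
  }
  fold_id
}

set.seed(1)
fold_id <- stratified_fold_ids(iris$Sepal.Length, k = 5)

# Fold sizes and mean response per fold should be similar across folds
table(fold_id)
tapply(iris$Sepal.Length, fold_id, mean)
```

Since every fold draws from every bin, the distribution of the response is roughly the same in each fold, which is the property stratified CV aims for.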
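The workflow now looks as follows:
1. Split iris into 80% training data and 20% test data, stratified by the variable Sepal.Length.
2. Use stratified five-fold CV on the training data to choose between the two models.
3. Refit the chosen model on the full training data and evaluate it on the test data.

# Split into training and test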
inds <- partition(iris$Sepal.Length, p = c(train = 0.8, test = 0.2), seed = 87)
train <- iris[inds$train, ]
test <- iris[inds$test, ]
# Get stratified CV in-sample indices
folds <- create_folds(train$Sepal.Length, k = 5, seed = 2734)
# Vectors with results per model and fold
cv_rmse1 <- cv_rmse2 <- numeric(5)
for (i in seq_along(folds)) {
  insample <- train[folds[[i]], ]
  out <- train[-folds[[i]], ]
  fit1 <- lm(Sepal.Length ~ ., data = insample)
  fit2 <- lm(Sepal.Length ~ . + Species:Sepal.Width, data = insample)
  cv_rmse1[i] <- rmse(out$Sepal.Length, predict(fit1, out))
  cv_rmse2[i] <- rmse(out$Sepal.Length, predict(fit2, out))
}
# CV-RMSE of model 1 -> close winner
mean(cv_rmse1)
#> [1] 0.330189
# CV-RMSE of model 2
mean(cv_rmse2)
#> [1] 0.3306455
# Fit model 1 on full training data and evaluate on test data
final_fit <- lm(Sepal.Length ~ ., data = train)
rmse(test$Sepal.Length, predict(final_fit, test))
#> [1] 0.2892289

If feasible, repeated CV is recommended in order to reduce uncertainty in decisions. Otherwise, the process remains the same.
# Train/test split as before
# 15 folds instead of 5
folds <- create_folds(train$Sepal.Length, k = 5, seed = 2734, m_rep = 3)
cv_rmse1 <- cv_rmse2 <- numeric(15)
# Rest as before...
for (i in seq_along(folds)) {
  insample <- train[folds[[i]], ]
  out <- train[-folds[[i]], ]
  fit1 <- lm(Sepal.Length ~ ., data = insample)
  fit2 <- lm(Sepal.Length ~ . + Species:Sepal.Width, data = insample)
  cv_rmse1[i] <- rmse(out$Sepal.Length, predict(fit1, out))
  cv_rmse2[i] <- rmse(out$Sepal.Length, predict(fit2, out))
}
mean(cv_rmse1)
#> [1] 0.3296087
mean(cv_rmse2)
#> [1] 0.331373
# Refit and test as before

The function multi_strata() creates a stratification
factor from multiple columns that can then be passed to
create_folds(, type = "stratified") or
partition(, type = "stratified"). The resulting partitions
will be (quite) balanced regarding these columns.
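Two grouping strategies are offered via the strategy argument: "kmeans", which runs a k-means cluster analysis on the scaled columns, and "interaction", which uses all combinations of the columns after binning numeric ones.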
Let’s have a look at a simple example where we want to model "Sepal.Length" as a function of the other variables in the iris data set. We want to do a stratified train/valid/test split that is balanced not only regarding the response "Sepal.Length", but also regarding the important predictor "Species". In this case, we could use the following workflow:
set.seed(3451)
ir <- iris[c("Sepal.Length", "Species")]
y <- multi_strata(ir, k = 5)
inds <- partition(
  y, p = c(train = 0.6, valid = 0.2, test = 0.2), split_into_list = FALSE
)
# Check
by(ir, inds, summary)
#> inds: train
#> Sepal.Length Species
#> Min. :4.300 setosa :30
#> 1st Qu.:5.100 versicolor:30
#> Median :5.800 virginica :30
#> Mean :5.836
#> 3rd Qu.:6.400
#> Max. :7.700
#> ------------------------------------------------------------
#> inds: valid
#> Sepal.Length Species
#> Min. :4.400 setosa :10
#> 1st Qu.:5.425 versicolor:10
#> Median :5.900 virginica :10
#> Mean :5.903
#> 3rd Qu.:6.300
#> Max. :7.900
#> ------------------------------------------------------------
#> inds: test
#> Sepal.Length Species
#> Min. :4.700 setosa :10
#> 1st Qu.:5.100 versicolor:10
#> Median :5.700 virginica :10
#> Mean :5.807
#> 3rd Qu.:6.475
#> Max. :7.100
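
To get a feeling for the stratification factor itself, the following sketch builds it once per strategy and tabulates the strata. The values of `k` and the seed are arbitrary choices for illustration, and the object names `y_km` and `y_int` are hypothetical.

```r
library(splitTools)

ir <- iris[c("Sepal.Length", "Species")]

# k-means strata: k is the desired number of strata
set.seed(1)
y_km <- multi_strata(ir, strategy = "kmeans", k = 5)

# Interaction strata: k is the approximate number of bins per numeric column
y_int <- multi_strata(ir, strategy = "interaction", k = 3)

# Both results hold one stratum label per row of the input
stopifnot(length(y_km) == nrow(ir), length(y_int) == nrow(ir))

# Number of observations per stratum
table(y_km)
table(y_int)
```

Very small strata are a sign that `k` is too large for the data at hand, since a stratum with only a few rows cannot be spread evenly over all partitions.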