
{splitTools}

Overview

{splitTools} is a toolkit for fast data splitting. It does not have any dependencies.

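Its two main functions, partition() and create_folds(), support data partitioning (e.g., into training, validation, and test sets), the creation of in-sample or out-of-sample folds for cross-validation (CV), repeated CV, and stratified, grouped, and blocked splitting.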

The function create_timefolds() does time-series splitting where the out-of-sample data follows the (extending or moving) in-sample data.
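
As a minimal sketch of the difference between the two window types, the following compares the in-sample sizes per fold. It assumes that create_timefolds() accepts a type argument with the values "extending" (the default) and "moving"; check ?create_timefolds for the installed version.

```r
library(splitTools)

# Extending window: the in-sample data grows from fold to fold
ext <- create_timefolds(1:100, k = 5, type = "extending")

# Moving window: the in-sample data is a window that moves forward
# (assumes the `type` argument is available in the installed version)
mov <- create_timefolds(1:100, k = 5, type = "moving")

# Compare in-sample sizes per fold
sapply(ext, function(f) length(f$insample))
sapply(mov, function(f) length(f$insample))

# In both cases, every out-of-sample index follows all in-sample indices
all(sapply(ext, function(f) max(f$insample) < min(f$outsample)))
all(sapply(mov, function(f) max(f$insample) < min(f$outsample)))
```

A moving window may be preferable when older observations are no longer representative of the process being modeled.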

The result of create_folds() can be directly passed to the folds argument in CV functions of XGBoost or LightGBM. Since these functions expect out-of-sample indices, set the option invert = TRUE.
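
The following sketch shows one way this could look with XGBoost. The regression task (predicting Sepal.Length from the other measurements), the number of rounds, and the objective are illustrative assumptions, not recommendations; the xgb.cv() call only runs if the xgboost package is installed.

```r
library(splitTools)

# Out-of-sample indices for 5-fold CV, stratified by Species
folds <- create_folds(iris$Species, k = 5, invert = TRUE, seed = 1)
lengths(folds)  # each element holds the out-of-sample rows of one fold

# Pass the folds to xgb.cv() if xgboost is available
if (requireNamespace("xgboost", quietly = TRUE)) {
  # Illustrative task: predict Sepal.Length from the other measurements
  X <- data.matrix(iris[, c("Sepal.Width", "Petal.Length", "Petal.Width")])
  dtrain <- xgboost::xgb.DMatrix(X, label = iris$Sepal.Length)

  cv <- xgboost::xgb.cv(
    params = list(objective = "reg:squarederror"),
    data = dtrain,
    nrounds = 20,
    folds = folds,    # list of out-of-sample (test) indices
    verbose = FALSE
  )
}
```

Without invert = TRUE, the list would contain the in-sample rows, and the model would be trained on the small part of each split and evaluated on the large part.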

Installation

# From CRAN
install.packages("splitTools")

# Development version
devtools::install_github("mayer79/splitTools")

Usage

library(splitTools)

p <- c(train = 0.5, valid = 0.25, test = 0.25)

# Train/valid/test indices for iris data stratified by Species
str(inds <- partition(iris$Species, p, seed = 1))

# List of 3
#  $ train: int [1:73] 1 3 5 7 8 10 12 13 14 15 ...
#  $ valid: int [1:38] 4 9 19 21 27 28 29 30 32 35 ...
#  $ test : int [1:39] 2 6 11 16 18 22 26 37 38 40 ...

# Same, but different output interface
head(inds <- partition(iris$Species, p, split_into_list = FALSE, seed = 1))

# [1] train test  train valid train test 
# Levels: train valid test

# In-sample indices for 5-fold CV (stratified by Species)
str(inds <- create_folds(iris$Species, k = 5, seed = 1))

# List of 5
#  $ Fold1: int [1:120] 2 4 5 6 7 8 9 10 11 15 ...
#  $ Fold2: int [1:120] 1 2 3 4 5 6 9 10 11 12 ...
#  $ Fold3: int [1:120] 1 2 3 4 6 7 8 9 11 12 ...
#  $ Fold4: int [1:120] 1 3 5 6 7 8 10 11 12 13 ...
#  $ Fold5: int [1:120] 1 2 3 4 5 7 8 9 10 12 ...

# In-sample indices for 3 times repeated 5-fold CV (stratified by Species)
str(inds <- create_folds(iris$Species, k = 5, m_rep = 3, seed = 1))

# List of 15
#  $ Fold1.Rep1: int [1:120] 2 4 5 6 7 8 9 10 11 15 ...
#  $ Fold2.Rep1: int [1:120] 1 2 3 4 5 6 9 10 11 12 ...
#  $ Fold3.Rep1: int [1:120] 1 2 3 4 6 7 8 9 11 12 ...
#  $ Fold4.Rep1: int [1:120] 1 3 5 6 7 8 10 11 12 13 ...
#  $ Fold5.Rep1: int [1:120] 1 2 3 4 5 7 8 9 10 12 ...
#  $ Fold1.Rep2: int [1:120] 1 2 3 4 5 6 8 9 11 12 ...
#  $ Fold2.Rep2: int [1:120] 1 3 6 7 8 9 10 12 13 14 ...
# [...]

# Indices for time-series splitting
str(inds <- create_timefolds(1:100, k = 5))

# List of 5
# $ Fold1:List of 2
#  ..$ insample : int [1:17] 1 2 3 4 5 6 7 8 9 10 ...
#  ..$ outsample: int [1:17] 18 19 20 21 22 23 24 25 26 27 ...
# $ Fold2:List of 2
#  ..$ insample : int [1:34] 1 2 3 4 5 6 7 8 9 10 ...
#  ..$ outsample: int [1:17] 35 36 37 38 39 40 41 42 43 44 ...
# $ Fold3:List of 2
# [...]
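
The index lists returned by partition() are typically used to subset the original data. A minimal sketch, using only base R on top of the partition() call shown in the Usage section (the object name parts is an illustrative choice):

```r
library(splitTools)

p <- c(train = 0.5, valid = 0.25, test = 0.25)
inds <- partition(iris$Species, p, seed = 1)

# Subset the data frame once per partition
parts <- lapply(inds, function(i) iris[i, ])
sapply(parts, nrow)

# Every row ends up in exactly one partition
stopifnot(sum(sapply(parts, nrow)) == nrow(iris))
stopifnot(anyDuplicated(unlist(inds)) == 0)

# Species counts per partition; stratification keeps the proportions similar
sapply(parts, function(d) table(d$Species))
```

Keeping the partitions in a named list makes it easy to access them as parts$train, parts$valid, and parts$test in later modeling code.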

For more details, check out the vignette.
