Description of groupdata2

Ludvig Renbo Olsen

2017-01-28

Abstract

groupdata2 is a set of subsetting methods for easy grouping, windowing, folding and splitting of data. Create balanced folds for cross-validation or divide a time series into windows.
This vignette contains descriptions of functions and methods, along with simple examples of usage. For a gentler introduction to groupdata2, please see Introduction to groupdata2.
 
Contact author at r-pkgs@ludvigolsen.dk    


Installing groupdata2

You can either install the CRAN version or the GitHub development version.

CRAN version

# Uncomment:
# install.packages("groupdata2")  

GitHub development version

# Uncomment:
# install.packages("devtools")  
# devtools::install_github("LudvigOlsen/groupdata2")

Attach packages


# Attaching groupdata2
library(groupdata2)

# Attaching other packages used in this vignette
library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)

# We will also be using plyr a few times, but we don't attach this 
# because of possible conflicts with dplyr. Instead we use its functions
# like so: plyr::count()

General information

groupdata2 is a set of functions and methods for easy grouping, windowing, folding and splitting of data.

 

There are 4 main functions:
 

group_factor()

Returns a factor with group numbers, e.g. 111222333.
This can be used to subset, aggregate, group_by, etc.

group()

Returns the given data as a dataframe with added grouping factor made with group_factor(). The dataframe is grouped by the grouping factor for easy use with dplyr pipelines.

splt()

Splits the given data into the specified groups made with group_factor() and returns them in a list.

fold()

Creates (optionally) balanced folds for use in cross-validation. Balance folds on one categorical variable and/or make sure that all datapoints sharing an ID are in the same fold.

Groups, windows or folds?

When working with time series we would often refer to the kind of groups made by group_factor(), group() and splt() as windows. In this vignette, these will be referred to as groups.

fold() creates balanced groups for cross-validation by using group(). These are referred to as folds.

Use of kable()

In the examples we will be using knitr::kable() to visualize some of the data such as dataframes. You do not need to use kable() in any way when using the functions.

Methods

There are currently 6 methods for grouping the data.

It is possible to create groups based on group size, step size or number of groups. These can be given as a whole number or as a percentage.

Here we will take a look at the different methods.

Method: ‘greedy’

‘greedy’ uses group size for dividing up the data.
Greedy means that each group grabs as many elements as possible (up to size). The last group might therefore get fewer elements, but all groups other than the last are guaranteed to have the specified size.

 

Example

We have a vector with 57 values. We want to have group sizes of 10.

The greedy splitter will return groups with this many values in them:
10, 10, 10, 10, 10, 7

 

By setting force_equal to TRUE, we discard the last group if it contains fewer values than the other groups.

 

Example

We have a vector with 57 values. We want to have group sizes of 10.

The greedy splitter with force_equal set to TRUE will return groups with this many values in them:
10, 10, 10, 10, 10

meaning that 7 values have been discarded.
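
The group sizes in the two examples above follow from integer division. The sketch below is plain base R and is not the package's implementation; greedy_sizes() is a hypothetical helper named here only for illustration.

```r
# Hypothetical helper (not part of groupdata2): sizes of 'greedy' groups
greedy_sizes <- function(n_elements, size, force_equal = FALSE) {
  n_full <- n_elements %/% size      # number of complete groups
  remainder <- n_elements %% size    # elements left over for a last group
  sizes <- rep(size, n_full)
  # Keep the smaller last group unless force_equal asks us to discard it
  if (remainder > 0 && !force_equal) {
    sizes <- c(sizes, remainder)
  }
  sizes
}

greedy_sizes(57, 10)                      # 10 10 10 10 10 7
greedy_sizes(57, 10, force_equal = TRUE)  # 10 10 10 10 10
```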

 

Method: ‘n_dist’ (Default)

‘n_dist’ uses a specified number of groups to divide up the data.
First it creates equal groups as large as possible. Then, if there are any excess data points, it distributes them across the groups.

 

Example

We have a vector with 57 values. We want to get back 5 groups.

‘n_dist’ with default settings would return groups with this many values in them:

11, 11, 12, 11, 12

 

By setting force_equal to TRUE, ‘n_dist’ will create the largest possible, equally sized groups by discarding excess data elements.

 

Example

‘n_dist’ with force_equal set to TRUE would return groups with this many values in them:

11, 11, 11, 11, 11

meaning that 2 values have been discarded.
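
One rule that reproduces the group sizes shown in this vignette is to place element i in group ceiling(i * n_groups / n_elements). This is only an assumption about how the excess elements could be spread out, not a statement about the package's internals, and n_dist_groups() is a hypothetical helper.

```r
# Hypothetical helper (not part of groupdata2): element i goes to group
# ceiling(i * n_groups / n_elements), computed with integer arithmetic
n_dist_groups <- function(n_elements, n_groups) {
  i <- seq_len(n_elements)
  (i * n_groups - 1) %/% n_elements + 1
}

# Group sizes for 57 elements in 5 groups
tabulate(n_dist_groups(57, 5))  # 11 11 12 11 12

# With force_equal, each group keeps only floor(57 / 5) elements
rep(57 %/% 5, 5)                # 11 11 11 11 11
```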

 

Method: ‘n_fill’

‘n_fill’ uses a specified number of groups to divide up the data.
First it creates equal groups as large as possible. Then, if there are any excess data points, it places them in the first groups.
By setting descending to TRUE, it places them in the last groups instead.

 

Example

We have a vector with 57 values. We want to get back 5 groups.

‘n_fill’ with default settings would return groups with this many values in them:

12, 12, 11, 11, 11

 

By setting force_equal to TRUE, ‘n_fill’ will create the largest possible, equally sized groups by discarding excess data elements.

 

Example

‘n_fill’ with force_equal set to TRUE would return groups with this many values in them:

11, 11, 11, 11, 11

meaning that 2 values have been discarded.
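
The sizes above can be checked with a few lines of base R. This is a sketch of the arithmetic only, under the assumption that each excess element adds exactly one element to a group; n_fill_sizes() is a hypothetical helper, not a groupdata2 function.

```r
# Hypothetical helper (not part of groupdata2): sizes of 'n_fill' groups
n_fill_sizes <- function(n_elements, n_groups, descending = FALSE) {
  base <- n_elements %/% n_groups    # size every group gets
  excess <- n_elements %% n_groups   # elements left to hand out
  # One extra element for each of the first `excess` groups
  extra <- rep(c(1, 0), times = c(excess, n_groups - excess))
  # With descending, the extras go to the last groups instead
  if (descending) extra <- rev(extra)
  base + extra
}

n_fill_sizes(57, 5)                     # 12 12 11 11 11
n_fill_sizes(57, 5, descending = TRUE)  # 11 11 11 12 12
```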

 

Method: ‘n_last’

‘n_last’ uses a specified number of groups to divide up the data.

With default settings, it tries to make the groups as equally sized as possible, but notice that the last group might contain fewer or more elements if the length of the data is not divisible by the number of groups. All groups but the last are guaranteed to contain the same number of elements.

 

Example

We have a vector with 57 values. We want to get back 5 groups.

‘n_last’ with default settings would return groups with this many values in them:

11, 11, 11, 11, 13

 

By setting force_equal to TRUE, ‘n_last’ will create the largest possible, equally sized groups by discarding excess data elements.

 

Example

‘n_last’ with force_equal set to TRUE would return groups with this many values in them:

11, 11, 11, 11, 11

meaning that 2 values have been discarded.

 

Notice that ‘n_last’ will always return the given number of groups. It will never return a group with zero elements. For some situations that means that the last group will contain a lot of elements. Asked to divide a vector with 57 elements into 20 groups, the first 19 groups will contain 2 elements, while the last group will itself contain 19 elements. Had we instead asked it to divide the vector into 19 groups, we would have had 3 elements in all groups.
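
The numbers in this section can be verified with the base R sketch below. It only models the case where the last group absorbs the excess elements, which is what the 57-element examples show; it does not cover cases where the last group ends up smaller than the others. n_last_sizes() is a hypothetical helper, not the package's implementation.

```r
# Hypothetical helper (not part of groupdata2): every group gets
# floor(n_elements / n_groups) elements and the last group also
# takes whatever is left over
n_last_sizes <- function(n_elements, n_groups) {
  base <- n_elements %/% n_groups
  sizes <- rep(base, n_groups)
  sizes[n_groups] <- base + n_elements %% n_groups
  sizes
}

n_last_sizes(57, 5)   # 11 11 11 11 13
n_last_sizes(57, 20)  # nineteen groups of 2, then one group of 19
n_last_sizes(57, 19)  # nineteen groups of 3
```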

 

Method: ‘n_rand’

‘n_rand’ uses a specified number of groups to divide up the data.
First it creates equal groups as large as possible. Then, if there are any excess data points, it places them randomly in the groups.
N.B.: It only places one extra element per group.

 

Example

We have a vector with 57 values. We want to get back 5 groups.

‘n_rand’ with default settings could return groups with this many values in them:

12, 11, 11, 11, 12

 

By setting force_equal to TRUE, ‘n_rand’ will create the largest possible, equally sized groups by discarding excess data elements.

 

Example

‘n_rand’ with force_equal set to TRUE would return groups with this many values in them:

11, 11, 11, 11, 11

meaning that 2 values have been discarded.
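
The random placement can be sketched in base R as drawing distinct groups to receive one extra element each. This illustrates the idea only and is not the package's implementation; n_rand_sizes() is a hypothetical helper and the seed is arbitrary.

```r
# Hypothetical helper (not part of groupdata2): sizes of 'n_rand' groups
n_rand_sizes <- function(n_elements, n_groups) {
  base <- n_elements %/% n_groups
  excess <- n_elements %% n_groups
  sizes <- rep(base, n_groups)
  # Pick `excess` distinct groups at random; each gets one extra element
  lucky <- sample.int(n_groups, excess)
  sizes[lucky] <- sizes[lucky] + 1
  sizes
}

set.seed(1)  # arbitrary seed, only to make the run repeatable
sizes <- n_rand_sizes(57, 5)
sizes
sum(sizes)   # 57: no elements are lost
```

Because the groups are drawn without replacement, no group can receive more than one extra element.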

 

Method: ‘staircase’

‘staircase’ uses step_size to divide up the data.
For each group, the group size will be the step size multiplied by the group index.

 

Example

We have a vector with 57 values. We specify a step size of 5.

‘staircase’ with default settings would return groups with this many values in them:

5, 10, 15, 20, 7

 

By setting force_equal to TRUE, ‘staircase’ will discard the last group if it does not contain the expected values (step size multiplied by group index).

 

Example

‘staircase’ with force_equal set to TRUE would return groups with this many values in them:

5, 10, 15, 20

meaning that 7 values have been discarded.
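
The staircase sizes can be reproduced by handing out step size times group index elements until the data runs out. The base R sketch below illustrates that rule; staircase_sizes() is a hypothetical helper and not the package's code.

```r
# Hypothetical helper (not part of groupdata2): sizes of 'staircase' groups
staircase_sizes <- function(n_elements, step_size, force_equal = FALSE) {
  sizes <- c()
  remaining <- n_elements
  index <- 1
  while (remaining > 0) {
    wanted <- step_size * index       # expected size of this group
    taken <- min(wanted, remaining)   # what is actually available
    # With force_equal, an incomplete last group is dropped
    if (taken < wanted && force_equal) break
    sizes <- c(sizes, taken)
    remaining <- remaining - taken
    index <- index + 1
  }
  sizes
}

staircase_sizes(57, 5)                      # 5 10 15 20 7
staircase_sizes(57, 5, force_equal = TRUE)  # 5 10 15 20
```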

 

Find remainder - %staircase%

When using the staircase method, the last group might not have the size of the second-to-last group + step size.
Use %staircase% to find the remainder.

If the last group has the size of the second-to-last group + step size, %staircase% will return 0.

 

Example

%staircase% on a vector with size 57 and step size of 5 would look like this:

57 %staircase% 5

and return:

7

meaning that the last group would contain 7 values.
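
To see where the 7 comes from, the remainder can be computed by subtracting complete steps until the next one no longer fits. The base R function below is a hypothetical stand-in written for illustration; it is not the %staircase% operator itself.

```r
# Hypothetical stand-in for the remainder calculation (not the package's
# %staircase% operator): subtract complete groups while they still fit
staircase_remainder <- function(size, step_size) {
  index <- 1
  while (size >= step_size * index) {
    size <- size - step_size * index
    index <- index + 1
  }
  size
}

staircase_remainder(57, 5)  # 7  (57 - 5 - 10 - 15 - 20)
staircase_remainder(50, 5)  # 0  (the last group is complete)
```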

Arguments

data

Type: dataframe or vector

The data to process.

 

Used in: group_factor(), group(), splt(), fold()

n

Type: integer or numeric

n represents either group size, step size or number of groups, depending on which method is specified.
n can be given as a whole number (n > 1) or as a percentage (0 < n < 1).

 

Used in: group_factor(), group(), splt()

method

Type: character

Choose which method to use when dividing up the data.
Available methods: greedy, n_dist, n_fill, n_last, n_rand, or staircase

 

Used in: group_factor(), group(), splt(), fold()

force_equal

Type: logical (TRUE or FALSE)

If you need groups with the exact same size, set force_equal to TRUE.
The implementation differs between the methods. Read more in their sections above.
Be aware that this setting discards excess datapoints!

 

Used in: group_factor(), group(), splt()

allow_zero

Type: logical (TRUE or FALSE)

If you set n to 0, you get an error.
If you don’t want this behavior, you can set allow_zero to TRUE, and (depending on the function) you will get the following output:

group_factor() will return the factor with NAs instead of numbers. It will be the same length as expected.

group() will return the expected dataframe with NAs instead of a grouping factor.

splt() will return the given data (dataframe or vector) in the same list format as if it had been split.

 

Used in: group_factor(), group(), splt()

descending

Type: logical (TRUE or FALSE)

In methods like ‘n_fill’ where it makes sense to change the direction of the method, you can use this argument.
In ‘n_fill’ it fills up the excess data points starting from the last group instead of the first.
NB. Only some of the methods can use this argument.

 

Used in: group_factor(), group(), splt()

randomize

Type: logical (TRUE or FALSE)

After creating the grouping factor using the chosen method, it is possible to randomly reorganize it before returning it. Notice that this applies to all the functions, as group() and splt() use the grouping factor!

 

Used in: group_factor(), group(), splt()

N.B. fold() always uses some randomization.

col_name

Type: character

Name of added grouping factor column. Allows multiple grouping factors in a dataframe.

 

Used in: group()

k

Type: integer or numeric

k represents either fold size, step size or number of folds, depending on which method is specified.
k can be given as a whole number (k > 1) or as a percentage (0 < k < 1).

 

Used in: fold()

cat_col

Type: categorical vector or factor (passed as column name)

Categorical variable to balance between folds.

E.g. when predicting a binary variable (a or b), it is necessary to have both represented in every fold.

 

N.B. If also passing id_col, cat_col should be a constant within IDs.
E.g. a participant must always have the same diagnosis (a or b) throughout the dataset. Otherwise, the participant might be placed in multiple folds.

 

Used in: fold()

id_col

Type: Factor (passed as column name)

Factor with IDs. This will be used to keep all rows with an ID in the same fold (if possible).

E.g. If we have measured a participant multiple times and want to see the effect of time, we want to have all observations of this participant in the same fold.

 

Used in: fold()

 

Using Functions

We will be using ‘n_dist’ on a dataframe to showcase the functions. Afterwards we will use and compare the methods.
Notice that you can also use vectors with all the functions.

group_factor()

  1. We create a dataframe
df <- data.frame("x"=c(1:12), 
                "species" = rep(c('cat','pig', 'human'), 4), 
                "age" = sample(c(1:100), 12))
  2. Using group_factor()
groups <- group_factor(df, 5, method = 'n_dist')

groups
#>  [1] 1 1 2 2 3 3 3 4 4 5 5 5
#> Levels: 1 2 3 4 5

df$groups <- groups

df %>% kable(align = 'c')
x species age groups
1 cat 94 1
2 pig 29 1
3 human 100 2
4 cat 15 2
5 pig 27 3
6 human 74 3
7 cat 26 3
8 pig 51 4
9 human 61 4
10 cat 28 5
11 pig 30 5
12 human 50 5
  3. We could get the mean age of each group
aggregate(df[, 3], list(df$groups), mean) %>% 
  rename(group = Group.1, mean_age = x) %>%
  kable(align = 'c')
group mean_age
1 61.50000
2 57.50000
3 42.33333
4 56.00000
5 36.00000

force_equal

Getting an equal number of elements per group with group_factor().

Notice that we discard the excess values so all groups contain the same number of elements. Since the grouping factor is shorter than the dataframe, we can’t combine them as they are. One way to do so is to shorten the dataframe to the same length as the grouping factor.

  1. We create a dataframe
df <- data.frame("x"=c(1:12), 
                "species" = rep(c('cat','pig', 'human'), 4), 
                "age" = sample(c(1:100), 12))
  2. Using group_factor() with force_equal
groups <- group_factor(df, 5, method = 'n_dist', force_equal = TRUE)

groups
#>  [1] 1 1 2 2 3 3 4 4 5 5
#> Levels: 1 2 3 4 5

plyr::count(groups) %>% 
  rename(group = x, size = freq) %>%
  kable(align = 'c')
group size
1 2
2 2
3 2
4 2
5 2
  3. Combining dataframe and grouping factor

First we make the dataframe the same size as the grouping factor. Then we add the grouping factor to the dataframe.

df <- head(df, length(groups)) %>%
  mutate(group = groups)

df %>% kable(align = 'c')
x species age group
1 cat 11 1
2 pig 62 1
3 human 66 2
4 cat 46 2
5 pig 15 3
6 human 35 3
7 cat 87 4
8 pig 57 4
9 human 43 5
10 cat 63 5

 

group()

  1. We create a dataframe
df <- data.frame("x"=c(1:12), 
                "species" = rep(c('cat','pig', 'human'), 4), 
                "age" = sample(c(1:100), 12))
  2. Using group()
df_grouped <- group(df, 5, method = 'n_dist')

df_grouped %>% kable(align = 'c')
x species age .groups
1 cat 72 1
2 pig 77 1
3 human 65 2
4 cat 76 2
5 pig 75 3
6 human 42 3
7 cat 40 3
8 pig 3 4
9 human 54 4
10 cat 29 5
11 pig 92 5
12 human 79 5

  3. Using group() with dplyr pipelines to get mean age

df_means <- df %>%
  group(5, method = 'n_dist') %>%
  dplyr::summarise(mean_age = mean(age))

df_means %>% kable(align = 'c')
.groups mean_age
1 74.50000
2 70.50000
3 52.33333
4 28.50000
5 66.66667

force_equal

Getting an equal number of elements per group with group().

Notice that we discard the excess rows/elements so all groups contain the same number of elements.

  1. We create a dataframe
df <- data.frame("x"=c(1:12), 
                "species" = rep(c('cat','pig', 'human'), 4), 
                "age" = sample(c(1:100), 12))
  2. Using group() with force_equal
df_grouped <- df %>%
  group(5, method = 'n_dist', force_equal = TRUE)

df_grouped %>% kable(align = 'c')
x species age .groups
1 cat 59 1
2 pig 80 1
3 human 19 2
4 cat 17 2
5 pig 38 3
6 human 41 3
7 cat 48 4
8 pig 69 4
9 human 30 5
10 cat 68 5

 

splt()

  1. We create a dataframe
df <- data.frame("x"=c(1:12), 
                "species" = rep(c('cat','pig', 'human'), 4), 
                "age" = sample(c(1:100), 12))
  2. Using splt()
df_list <- splt(df, 5, method = 'n_dist')

df_list %>% kable(align = 'c')
     x    species   age
     1    cat       18
     2    pig       11

     x    species   age
3    3    human     35
4    4    cat       53

     x    species   age
5    5    pig       81
6    6    human     4
7    7    cat       36

     x    species   age
8    8    pig       95
9    9    human     19

     x    species   age
10   10   cat       27
11   11   pig       67
12   12   human     40
  3. Let’s see the format of the list created by splt() without using kable() to visualize it.
    splt() uses base::split() to split the data by the grouping factor.

v = c(1:6)

splt(v, 3, method = 'n_dist')
#> $`1`
#> [1] 1 2
#> 
#> $`2`
#> [1] 3 4
#> 
#> $`3`
#> [1] 5 6
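
Since splt() is described as using base::split(), the same list format can be seen by calling split() directly. The grouping factor below is typed by hand for illustration and is not produced by the package.

```r
v <- 1:6

# Hand-written grouping factor: two elements per group
grouping_factor <- factor(c(1, 1, 2, 2, 3, 3))

# split() returns a list with one element per factor level,
# named after the levels ("1", "2", "3")
v_list <- split(v, grouping_factor)
v_list

# Individual groups are accessed by name
v_list[["2"]]  # 3 4
```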

force_equal

Getting an equal number of elements per group with splt().

Notice that we discard the excess rows/elements so all groups contain the same number of elements.

  1. We create a dataframe
df <- data.frame("x"=c(1:12), 
                "species" = rep(c('cat','pig', 'human'), 4), 
                "age" = sample(c(1:100), 12))
  2. Using splt() with force_equal
df_list <- splt(df, 5, method = 'n_dist', force_equal = TRUE)

df_list %>% kable(align = 'c')
     x    species   age
     1    cat       48
     2    pig       87

     x    species   age
3    3    human     51
4    4    cat       61

     x    species   age
5    5    pig       46
6    6    human     49

     x    species   age
7    7    cat       3
8    8    pig       89

     x    species   age
9    9    human     9
10   10   cat       73

 

fold()

  1. We create a dataframe
df <- data.frame("participant" = factor(rep(c('1','2', '3', '4', '5', '6'), 3)),
                "age" = rep(sample(c(1:100), 6), 3),
                "diagnosis" = rep(c('a', 'b', 'a', 'a', 'b', 'b'), 3),
                "score" = sample(c(1:100), 3*6))

df <- df[order(df$participant),] 

# Remove index
rownames(df) <- NULL

# Add session info
df$session <- rep(c('1','2', '3'), 6)

kable(df, align = 'c')
participant age diagnosis score session
1 28 a 84 1
1 28 a 96 2
1 28 a 74 3
2 61 b 37 1
2 61 b 100 2
2 61 b 13 3
3 13 a 75 1
3 13 a 87 2
3 13 a 66 3
4 54 a 32 1
4 54 a 25 2
4 54 a 85 3
5 85 b 43 1
5 85 b 3 2
5 85 b 62 3
6 40 b 2 1
6 40 b 30 2
6 40 b 10 3
  2. Using fold() without balancing
df_folded <- fold(df, 3, method = 'n_dist')

# Order by folds
df_folded <- df_folded[order(df_folded$.folds),]

kable(df_folded, align = 'c')
participant age diagnosis score session .folds
1 28 a 96 2 1
2 61 b 100 2 1
2 61 b 13 3 1
4 54 a 32 1 1
4 54 a 25 2 1
5 85 b 62 3 1
1 28 a 74 3 2
3 13 a 75 1 2
3 13 a 87 2 2
3 13 a 66 3 2
5 85 b 3 2 2
6 40 b 2 1 2
1 28 a 84 1 3
2 61 b 37 1 3
4 54 a 85 3 3
5 85 b 43 1 3
6 40 b 30 2 3
6 40 b 10 3 3
  3. Using fold() with balancing but without id_col
df_folded <- fold(df, 3, cat_col = 'diagnosis', method = 'n_dist')

# Order by folds
df_folded <- df_folded[order(df_folded$.folds),] 

kable(df_folded, align = 'c')
participant age diagnosis score session .folds
1 28 a 74 3 1
3 13 a 75 1 1
3 13 a 87 2 1
2 61 b 37 1 1
2 61 b 13 3 1
5 85 b 43 1 1
1 28 a 84 1 2
3 13 a 66 3 2
4 54 a 32 1 2
2 61 b 100 2 2
5 85 b 3 2 2
6 40 b 2 1 2
1 28 a 96 2 3
4 54 a 25 2 3
4 54 a 85 3 3
5 85 b 62 3 3
6 40 b 30 2 3
6 40 b 10 3 3

Let’s count how many of each diagnosis there are in each group.

df_folded %>% group_by(.folds) %>% count(diagnosis) %>% kable(align='c')
.folds diagnosis n
1 a 3
1 b 3
2 a 3
2 b 3
3 a 3
3 b 3
  4. Using fold() with id_col but without balancing
df_folded <- fold(df, 3, id_col = 'participant', method = 'n_dist')

# Order by folds
df_folded <- df_folded[order(df_folded$.folds),] 

# Remove index (Looks prettier in the table!)
rownames(df_folded) <- NULL

kable(df_folded, align = 'c')
participant age diagnosis score session .folds
3 13 a 75 1 1
3 13 a 87 2 1
3 13 a 66 3 1
4 54 a 32 1 1
4 54 a 25 2 1
4 54 a 85 3 1
2 61 b 37 1 2
2 61 b 100 2 2
2 61 b 13 3 2
5 85 b 43 1 2
5 85 b 3 2 2
5 85 b 62 3 2
1 28 a 84 1 3
1 28 a 96 2 3
1 28 a 74 3 3
6 40 b 2 1 3
6 40 b 30 2 3
6 40 b 10 3 3

Let’s see how participants were distributed in the groups.

df_folded %>% group_by(.folds) %>% count(participant) %>% kable(align='c')
.folds participant n
1 3 3
1 4 3
2 2 3
2 5 3
3 1 3
3 6 3
  5. Using fold() with balancing and with id_col

fold() first divides up the dataframe by cat_col and then creates k folds within each diagnosis. As there are only 3 participants per diagnosis, we can create at most 3 folds in this scenario.

df_folded <- fold(df, 3, cat_col = 'diagnosis', id_col = 'participant', method = 'n_dist')

# Order by folds
df_folded <- df_folded[order(df_folded$.folds),] 

kable(df_folded, align = 'c')
participant age diagnosis score session .folds
1 28 a 84 1 1
1 28 a 96 2 1
1 28 a 74 3 1
5 85 b 43 1 1
5 85 b 3 2 1
5 85 b 62 3 1
3 13 a 75 1 2
3 13 a 87 2 2
3 13 a 66 3 2
6 40 b 2 1 2
6 40 b 30 2 2
6 40 b 10 3 2
4 54 a 32 1 3
4 54 a 25 2 3
4 54 a 85 3 3
2 61 b 37 1 3
2 61 b 100 2 3
2 61 b 13 3 3

Let’s count how many of each diagnosis there are in each group and find which participants are in which groups.

df_folded %>% group_by(.folds) %>% count(diagnosis, participant) %>% kable(align='c')
.folds diagnosis participant n
1 a 1 3
1 b 5 3
2 a 3 3
2 b 6 3
3 a 4 3
3 b 2 3
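
The idea of balancing on a categorical column while keeping IDs together can be sketched in base R: assign fold numbers per participant within each diagnosis, then map them back onto the rows. This is a simplified illustration under the assumption that each diagnosis has exactly as many participants as there are folds; it is not how fold() is implemented, and all object names are made up for the example.

```r
set.seed(1)  # arbitrary seed, only to make the run repeatable

# Same layout as the example data: 6 participants, 3 sessions each
df <- data.frame(
  participant = factor(rep(1:6, each = 3)),
  diagnosis = rep(c("a", "b", "a", "a", "b", "b"), each = 3)
)
k <- 3

# One row per participant with that participant's (constant) diagnosis
ids <- unique(df[, c("participant", "diagnosis")])

# Within each diagnosis, shuffle the fold numbers 1..k across participants
ids$fold <- ave(seq_len(nrow(ids)), ids$diagnosis,
                FUN = function(i) sample(rep_len(seq_len(k), length(i))))

# Map the per-participant fold back onto every row
df$fold <- ids$fold[match(df$participant, ids$participant)]

# Rows per fold and diagnosis
table(df$fold, df$diagnosis)
```

Each participant ends up in exactly one fold, and every fold holds one participant (3 rows) of each diagnosis, which mirrors the counts in the table above.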

 

Extra arguments showcase

randomize

  1. We create a dataframe
df <- data.frame("x"=c(1:12), 
                "species" = rep(c('cat','pig', 'human'), 4), 
                "age" = sample(c(1:100), 12))
  2. We use group_factor() with randomize set to TRUE
groups <- group_factor(df, 5, method = 'n_dist', randomize = TRUE)

groups
#>  [1] 2 2 3 1 5 4 3 4 5 5 3 1
#> Levels: 1 2 3 4 5
  3. We use splt() with randomize set to TRUE
    Notice that the index has been shuffled but the group sizes are the same as before!
df_list <- splt(df, 5, method = 'n_dist', randomize = TRUE)

df_list %>% kable(align = 'c')
     x    species   age
1    1    cat       59
7    7    cat       72

     x    species   age
4    4    cat       3
12   12   human     79

     x    species   age
3    3    human     86
5    5    pig       71
11   11   pig       29

     x    species   age
8    8    pig       34
10   10   cat       16

     x    species   age
2    2    pig       18
6    6    human     43
9    9    human     95
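
Why the group sizes survive the shuffling can be seen with a small base R sketch. It assumes that randomizing amounts to permuting the finished grouping factor, which is one reading of "randomly reorganize"; the factor is typed by hand and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only to make the run repeatable

# Grouping factor with sizes 2, 2, 3, 2, 3 (as in the n_dist example)
groups <- factor(c(1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5))

# Permute the factor: the order changes, the labels do not
shuffled <- sample(groups)

table(groups)    # sizes before shuffling
table(shuffled)  # same sizes after shuffling
```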

Examples of method differences

In this section we will take a look at the outputs we get from the different methods.

n_ methods

Vector with 57 elements divided into 6 groups

Below you’ll see a dataframe with counts of group elements when dividing up the same data with the different n_ methods. The forced_equal column shows the group sizes with force_equal set to TRUE.

forced_equal: Since this is a setting to make sure that all groups are of the same size, it makes sense that all the groups have the same size.

n_dist: Compared to forced_equal, we see the 3 datapoints that forced_equal had discarded. These are distributed across the groups (in this example, groups 2, 4 and 6).

n_fill: The 3 extra datapoints are located at the first 3 groups. Had we set descending to TRUE, it would have been the last 3 groups instead.

n_last: We see that n_last creates equal group sizes in all but the last group. This means that the last group can sometimes be very large or small compared to the other groups. Here it is a third larger than the other groups.

n_rand: The extra datapoints are placed randomly, so we would see them in different groups if we ran the script again, unless we call set.seed() before running the function.

#>   x n_dist n_fill n_last n_rand forced_equal
#> 1 1      9     10      9      9            9
#> 2 2     10     10      9      9            9
#> 3 3      9     10      9     10            9
#> 4 4     10      9      9      9            9
#> 5 5      9      9      9     10            9
#> 6 6     10      9     12     10            9

 

Vector with 117 elements divided into 11 groups

Here is another example.

#>     x n_dist n_fill n_last n_rand forced_equal
#> 1   1     10     11     11     11           10
#> 2   2     11     11     11     10           10
#> 3   3     10     11     11     10           10
#> 4   4     11     11     11     10           10
#> 5   5     11     11     11     10           10
#> 6   6     10     11     11     11           10
#> 7   7     11     11     11     11           10
#> 8   8     11     10     11     11           10
#> 9   9     10     10     11     11           10
#> 10 10     11     10     11     11           10
#> 11 11     11     10      7     11           10

 

Greedy

Vector with 100 elements with sizes of 8, 15, 20

Below you will see group sizes when using the method ‘greedy’ and asking for group sizes of 8, 15, 20. What should become clear is that only the last group can have a different group size than what we asked for. This is important if, say, you want to split a time series into groups of 100 elements, but the length of the time series is not divisible by 100. Then you could use force_equal to remove the excess elements, if you need equal groups.

With a size of 8, we get 13 groups. The last group (13) only contains 4 elements, but all the other groups contain 8 elements as specified.

With a size of 15, we get 7 groups. The last group (7) contains only 10 elements, but all the other groups contain 15 elements as specified.

With a size of 20, we get 5 groups. As the 100 elements of the split vector are divisible by 20, the last group also contains 20 elements, and so we have equal groups.
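
The group counts and last-group sizes quoted above can be checked with a little base R arithmetic. This is a sketch for verification only and does not call the package; greedy_summary is a name made up for the example.

```r
n_elements <- 100
sizes <- c(8, 15, 20)

# Leftover elements once all complete groups have been formed
leftover <- n_elements %% sizes

greedy_summary <- data.frame(
  size = sizes,
  # A partial last group still counts as a group
  n_groups = ceiling(n_elements / sizes),
  # If nothing is left over, the last group is a complete one
  last_group = ifelse(leftover == 0, sizes, leftover)
)

greedy_summary
# n_groups: 13, 7, 5; last_group: 4, 10, 20
```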

 

Staircasing

Vector with 1000 elements with step sizes of 2, 5, 11

Below you’ll see a plot with the group sizes at each group when using step sizes 2, 5, and 11.

At a step size of 2 elements, the group size simply increases by 2 for each group, until the last group (32), where it runs out of elements. Had we set force_equal to TRUE, this last group would have been discarded because of the lack of elements.

At a step size of 5 elements, the group size increases by 5 every time. Because of this, it runs out of elements faster. Again we see that the last group (20) has fewer elements.

At a step size of 11 elements, the group size increases by 11 every time. It seems that the last group is not too small, but it can be hard to see on this scale. Actually, the last group is 1 element short of being complete and so would have been discarded if force_equal was set to TRUE.
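
The group numbers and shortfalls mentioned for the three step sizes can be checked with the base R sketch below. staircase_summary() is a hypothetical helper written for this check and is not part of groupdata2.

```r
# Hypothetical helper (not part of groupdata2): index and size of the
# last staircase group, plus the size it would need to be complete
staircase_summary <- function(n_elements, step_size) {
  index <- 1
  remaining <- n_elements
  # Hand out complete groups while more than a full group remains
  while (remaining > step_size * index) {
    remaining <- remaining - step_size * index
    index <- index + 1
  }
  c(step_size = step_size, n_groups = index,
    last_size = remaining, expected = step_size * index)
}

res <- t(sapply(c(2, 5, 11), staircase_summary, n_elements = 1000))
res
# n_groups: 32, 20, 13; last_size: 8, 50, 142; expected: 64, 100, 143
```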

 

 

Below we will take a quick look at the cumulative sum of group elements to get an idea of what is going on under the hood.
Remember that the split vector had 1000 elements? That is why they all stop at 1000 on the y-axis. There are simply no more elements left!
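
For the step size of 11, the cumulative sum can be worked out by hand in base R. The sketch assumes the sizes described above (12 complete groups followed by one incomplete group) and does not call the package.

```r
step_size <- 11

# 12 complete groups fit before the elements run out: 11, 22, ..., 132
full_groups <- step_size * 1:12

# Whatever is left of the 1000 elements forms the last group
last_group <- 1000 - sum(full_groups)  # 142

# Running total of elements used after each group
cumulative <- cumsum(c(full_groups, last_group))
tail(cumulative, 3)  # 726 858 1000
```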

The End

You have reached the end! Now celebrate by taking the week off, splitting data and laughing!