A basic understanding of probability and statistics is crucial for data understanding and discovery of meaningful patterns. A great way to teach probability and statistics is to start with an experiment, like rolling a dice or flipping a coin.
This package simulates rolling a dice and flipping a coin. Each experiment generates a tibble. Dice rolls and coin flips are simulated using sample(). The properties of the dice, such as the number of sides, can be changed. A coin flip is simulated using a two-sided dice. Experiments can be combined with the pipe operator.
tidydice package on GitHub: https://github.com/rolkra/tidydice
As the tidydice functions fit well into the tidyverse, we load the dplyr package. For quick visualisations we use the explore package. To create more flexible graphics, use ggplot2.
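To make the role of sample() concrete, here is a minimal base R sketch of how such a simulation could look. The helper simulate_rolls() is a hypothetical name used only for illustration; it is not a tidydice function and is not meant to mirror the package's internal code.

```r
# Hypothetical helper for illustration only; not part of tidydice.
simulate_rolls <- function(times = 1, sides = 6, success = 6, prob = NULL) {
  # Draw `times` results from 1..sides, with replacement
  result <- sample.int(sides, size = times, replace = TRUE, prob = prob)
  data.frame(
    nr = seq_len(times),
    result = result,
    success = result %in% success  # TRUE if the result counts as a success
  )
}

set.seed(123)
rolls <- simulate_rolls(times = 4)                          # a six-sided dice
coin <- simulate_rolls(times = 10, sides = 2, success = 2)  # a coin as a two-sided dice
```

The exact numbers drawn depend on the R version and seed, so no output is shown here.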
library(tidydice)
library(dplyr)
library(explore)
The output of roll_dice() is a tibble. Each row represents a dice roll. Without parameters, a dice is rolled once.
set.seed(123)
roll_dice()
#> # A tibble: 1 x 5
#> experiment round nr result success
#> <int> <int> <int> <int> <lgl>
#> 1 1 1 1 2 FALSE
By default, success is defined as result = 6, while result = 1..5 is not a success. In this case the result is 2, so it is not a success.
If we define both result = 2 and result = 6 as success, the same roll is treated as a success.
set.seed(123)
roll_dice(success = c(2,6))
#> # A tibble: 1 x 5
#> experiment round nr result success
#> <int> <int> <int> <int> <lgl>
#> 1 1 1 1 2 TRUE
By default, the dice is fair, so every result (1..6) has the same probability. You can change this with the prob parameter.
set.seed(123)
roll_dice(prob = c(0,0,0,0,0,1))
#> # A tibble: 1 x 5
#> experiment round nr result success
#> <int> <int> <int> <int> <lgl>
#> 1 1 1 1 6 TRUE
In this case we created a dice that always gets result = 6 (with 100% probability).
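The effect of a weight vector can be checked directly with base R's sample.int(), which takes the same kind of per-side weights. This is only a sketch of the idea; whether tidydice passes prob through unchanged is an assumption.

```r
# One weight per side; here all weight sits on side 6
loaded <- c(0, 0, 0, 0, 0, 1)
stopifnot(length(loaded) == 6)

# With this weighting every draw is a 6, regardless of the seed
always_six <- sample.int(6, size = 20, replace = TRUE, prob = loaded)

# A milder bias: side 6 gets half of the total weight
biased <- c(1, 1, 1, 1, 1, 5) / 10
share_of_six <- biased[6] / sum(biased)
```

In base R the weights passed to sample() do not need to sum to one; they are normalised internally.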
By default the dice has 6 sides. You can change this with the sides parameter. Here we use a dice with 12 sides, so result can now take a value between 1 and 12. result = 6 is still the default success.
set.seed(123)
roll_dice(sides = 12)
#> # A tibble: 1 x 5
#> experiment round nr result success
#> <int> <int> <int> <int> <lgl>
#> 1 1 1 1 4 FALSE
With the times parameter the dice is rolled several times. Here we roll it 4 times.
set.seed(123)
roll_dice(times = 4)
#> # A tibble: 4 x 5
#> experiment round nr result success
#> <int> <int> <int> <int> <lgl>
#> 1 1 1 1 2 FALSE
#> 2 1 1 2 5 FALSE
#> 3 1 1 3 3 FALSE
#> 4 1 1 4 6 TRUE
We get 1 success.
set.seed(123)
roll_dice(times = 4, rounds = 2)
#> # A tibble: 8 x 5
#> experiment round nr result success
#> <int> <int> <int> <int> <lgl>
#> 1 1 1 1 2 FALSE
#> 2 1 1 2 5 FALSE
#> 3 1 1 3 3 FALSE
#> 4 1 1 4 6 TRUE
#> 5 1 2 1 6 TRUE
#> 6 1 2 2 1 FALSE
#> 7 1 2 3 4 FALSE
#> 8 1 2 4 6 TRUE
Rolling the dice 4 times is repeated in each round. In the first round we got 1 success, in the second round 2 successes.
A convenient way to aggregate the result is to use the agg parameter. Now we get one row per round.
set.seed(123)
roll_dice(times = 4, rounds = 2, agg = TRUE)
#> # A tibble: 2 x 4
#> experiment round times success
#> <int> <int> <int> <int>
#> 1 1 1 4 1
#> 2 1 2 4 2
You can also aggregate by hand using dplyr.
set.seed(123)
roll_dice(times = 4, rounds = 2) %>%
  group_by(experiment, round) %>%
  summarise(times = n(),
            success = sum(success))
#> # A tibble: 2 x 4
#> # Groups: experiment [1]
#> experiment round times success
#> <int> <int> <int> <int>
#> 1 1 1 4 1
#> 2 1 2 4 2
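The same aggregation can be sketched in base R with aggregate(). The data frame below is typed in by hand from the 8-row output shown earlier for times = 4, rounds = 2, so the example does not depend on the random number generator.

```r
# The 8 rolls from the earlier output, entered by hand
rolls <- data.frame(
  experiment = 1L,
  round = rep(1:2, each = 4),
  nr = rep(1:4, times = 2),
  result = c(2L, 5L, 3L, 6L, 6L, 1L, 4L, 6L),
  success = c(FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)
)

# Sum the logical success column per experiment and round
per_round <- aggregate(success ~ experiment + round, data = rolls, FUN = sum)
per_round
```

This gives 1 success for round 1 and 2 successes for round 2, matching the aggregated tibbles above.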
You can use any package or tool you like to visualise the result. In this example we use the explore package.
set.seed(123)
roll_dice(times = 100) %>%
  explore(result, title = "Rolling a dice 100x")
In 15% of the cases we got a six. This is close to the expected share of 1/6 = 16.67%, which corresponds to about 16.67 sixes in 100 rolls.
If we increase the times parameter to 10000, the shares of the six results become more even.
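How close is "close"? The number of sixes in 100 fair rolls follows a binomial distribution, so its expected value and spread can be computed directly. The variable names below are chosen for this sketch only.

```r
n <- 100    # number of rolls
p <- 1 / 6  # probability of a six with a fair dice

expected_sixes <- n * p              # about 16.67
sd_sixes <- sqrt(n * p * (1 - p))    # about 3.73

# Probability of getting exactly 15 sixes in 100 rolls
p_exactly_15 <- dbinom(15, size = n, prob = p)
```

A result of 15 sixes is less than one standard deviation below the expected value, so it is an ordinary outcome for this experiment.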
set.seed(123)
roll_dice(times = 10000) %>%
  explore(result, title = "Rolling a dice 10000x")
If we repeat the experiment of rolling a dice 100x with rounds = 100, we get a distribution of successes with a peak at about 17 (the expected value is 16.67).
set.seed(123)
roll_dice(times = 100, rounds = 100, agg = TRUE) %>%
  explore(success,
          title = "Rolling a dice 100x",
          auto_scale = FALSE)
If we increase rounds from 100 to 10000 we get a more symmetric shape. We see that fewer than 5 or more than 30 successes are very unlikely.
set.seed(123)
roll_dice(times = 100, rounds = 10000, agg = TRUE) %>%
  explore(success,
          title = "Rolling a dice 100x",
          auto_scale = FALSE)
set.seed(123)
roll_dice(times = 100, rounds = 10000, agg = TRUE) %>%
  mutate(check = ifelse(success < 5 | success > 30, 1, 0)) %>%
  count(check)
#> # A tibble: 2 x 2
#> check n
#> <dbl> <int>
#> 1 0 9996
#> 2 1 4
In only 4 of 10000 cases is success below 5 or above 30, so the probability of getting such a result is very low.
Let’s add an experiment where you have 10 extra dice. The shape of the distribution changes.
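The simulated count can be compared with the exact probability from the binomial distribution. This sketch uses only base R's pbinom(); the variable names are illustrative.

```r
n <- 100
p <- 1 / 6

p_low <- pbinom(4, size = n, prob = p)                        # P(success <= 4)
p_high <- pbinom(30, size = n, prob = p, lower.tail = FALSE)  # P(success > 30)
p_extreme <- p_low + p_high

# Number of such rounds we would expect among 10000
expected_cases <- 10000 * p_extreme
```

Comparing expected_cases with the simulated count is a quick sanity check on the simulation.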
set.seed(123)
roll_dice(times = 100, rounds = 10000, agg = TRUE) %>%
  roll_dice(times = 110, rounds = 10000, agg = TRUE) %>%
  explore(success,
          target = experiment,
          title = "Rolling a dice 100/110x",
          auto_scale = FALSE)
You can add as many experiments as you like (as long as they generate the same data structure).
Adding an experiment with times = 150 will generate a flatter but wider shape.
set.seed(123)
roll_dice(times = 100, rounds = 10000, agg = TRUE) %>%
  roll_dice(times = 110, rounds = 10000, agg = TRUE) %>%
  roll_dice(times = 150, rounds = 10000, agg = TRUE) %>%
  explore(success,
          target = experiment,
          title = "Rolling a dice 100/110/150x",
          auto_scale = FALSE)
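Why the shape changes with times can be read off the binomial distribution: more rolls shift the centre to the right, increase the spread, and lower the height of the peak. The following base R sketch tabulates these three quantities for the experiments above; the column names are chosen for illustration.

```r
times <- c(100, 110, 150)
p <- 1 / 6

shape <- data.frame(
  times = times,
  expected = times * p,             # centre of the distribution
  sd = sqrt(times * p * (1 - p)),   # spread (width)
  # height of the most likely number of successes
  peak = sapply(times, function(n) max(dbinom(0:n, size = n, prob = p)))
)
shape
```

For times = 150 the expected number of successes is 25, with the largest spread and the lowest peak of the three experiments.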
Internally the package handles coins as dice with only two sides. By default, success is defined as result = 2, while result = 1 is not a success.
set.seed(123)
flip_coin(times = 10)
#> # A tibble: 10 x 5
#> experiment round nr result success
#> <int> <int> <int> <int> <lgl>
#> 1 1 1 1 2 TRUE
#> 2 1 1 2 2 TRUE
#> 3 1 1 3 2 TRUE
#> 4 1 1 4 2 TRUE
#> 5 1 1 5 2 TRUE
#> 6 1 1 6 1 FALSE
#> 7 1 1 7 1 FALSE
#> 8 1 1 8 2 TRUE
#> 9 1 1 9 2 TRUE
#> 10 1 1 10 2 TRUE
In the table above, result = 2 appears 8 times and result = 1 appears 2 times. We can use the describe() function of the explore package to get a good overview.
set.seed(123)
flip_coin(times = 10) %>%
  describe(success)
#> variable = success
#> type = logical
#> na = 0 of 10 (0%)
#> unique = 2
#> FALSE = 5 (50%)
#> TRUE = 5 (50%)
Or just use the agg parameter.
set.seed(123)
flip_coin(times = 10, agg = TRUE)
#> # A tibble: 1 x 4
#> experiment round times success
#> <int> <int> <int> <int>
#> 1 1 1 10 4
The parameter rounds can be used like in roll_dice().
set.seed(123)
flip_coin(times = 10, rounds = 4, agg = TRUE)
#> # A tibble: 4 x 4
#> experiment round times success
#> <int> <int> <int> <int>
#> 1 1 1 10 4
#> 2 1 2 10 7
#> 3 1 3 10 4
#> 4 1 4 10 5
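The success counts per round vary, and the binomial distribution tells us by how much to expect. This base R sketch computes the exact probabilities for 10 flips of a fair coin; the variable names are illustrative.

```r
heads <- 0:10
prob_heads <- dbinom(heads, size = 10, prob = 0.5)

# Exactly 5 successes in 10 flips: choose(10, 5) / 2^10 = 252 / 1024
p_five <- prob_heads[heads == 5]

# Between 4 and 7 successes (inclusive): 792 / 1024
p_4_to_7 <- sum(prob_heads[heads >= 4 & heads <= 7])
```

Exactly 5 successes occurs in only about a quarter of all rounds, while roughly three quarters of all rounds land between 4 and 7.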
set.seed(123)
flip_coin(times = 10, agg = TRUE) %>%
  flip_coin(times = 15, agg = TRUE)
#> # A tibble: 2 x 4
#> experiment round times success
#> <int> <int> <int> <int>
#> 1 1 1 10 7
#> 2 2 1 15 6
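The last example chains two flip_coin() calls, and the output holds both experiments, numbered 1 and 2. One way such chaining could work is for each call to accept the previous result as its first argument and append its own rows with the next experiment number. The sketch below illustrates that idea in base R; flip_sketch() is a hypothetical function, and this is an assumption about the mechanism, not the package's actual implementation.

```r
# Hypothetical function for illustration only; not part of tidydice.
flip_sketch <- function(data = NULL, times = 1) {
  # Continue the numbering if an earlier experiment is passed in
  next_id <- if (is.null(data)) 1L else max(data$experiment) + 1L
  flips <- sample.int(2, size = times, replace = TRUE)  # 1 or 2 per flip
  new_rows <- data.frame(
    experiment = next_id,
    round = 1L,
    times = as.integer(times),
    success = sum(flips == 2)  # result = 2 counts as success
  )
  if (is.null(data)) new_rows else rbind(data, new_rows)
}

set.seed(123)
# Nested calls are equivalent to piping the first result into the second
combined <- flip_sketch(flip_sketch(times = 10), times = 15)
```

Because every call returns the same columns, any number of experiments can be stacked this way.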