The function ggstatsplot::ggpiestats
can be used to prepare publication-ready pie charts to summarize the statistical relationship between two categorical variables. We will see examples of how to use this function in this vignette.
To begin with, here are some instances where you would want to use ggpiestats
-
Note before: The following demo uses the pipe operator (%>%
), so in case you are not familiar with this operator, here is a good explanation: http://r4ds.had.co.nz/pipes.html
ggpiestats
To demonstrate how ggpiestats
can be used to we will be using the Titanic
dataset that is included in the datasets
library. Titanic Passenger Survival Data Set provides information “on the fate of passengers on the fatal maiden voyage of the ocean liner Titanic, summarized according to economic status (class), sex, age, and survival.”
Let’s have a look at the structure of this table and also convert it into a tibble while we are at it.
library(datasets)
library(dplyr)
# looking at the table
dplyr::glimpse(x = Titanic)
#> table [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
#> - attr(*, "dimnames")=List of 4
#> ..$ Class : chr [1:4] "1st" "2nd" "3rd" "Crew"
#> ..$ Sex : chr [1:2] "Male" "Female"
#> ..$ Age : chr [1:2] "Child" "Adult"
#> ..$ Survived: chr [1:2] "No" "Yes"
# converting to tibble
tibble::as_data_frame(x = Titanic)
#> # A tibble: 32 x 5
#> Class Sex Age Survived n
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 1st Male Child No 0
#> 2 2nd Male Child No 0
#> 3 3rd Male Child No 35
#> 4 Crew Male Child No 0
#> 5 1st Female Child No 0
#> 6 2nd Female Child No 0
#> 7 3rd Female Child No 17
#> 8 Crew Female Child No 0
#> 9 1st Male Adult No 118
#> 10 2nd Male Adult No 154
#> # ... with 22 more rows
Note that the last column in this dataframe contains count information, which means we will have to modify it to reflect this count structure.
# a custom function to repeat dataframe `rep` number of times, which is going to
# be count data for us
rep_df <- function(df, rep) {
df[rep(1:nrow(df), rep), ]
}
# converting dataframe to full length based on count information
Titanic_full <- tibble::as_data_frame(datasets::Titanic) %>%
tibble::rowid_to_column(df = ., var = "id") %>%
dplyr::mutate_at(
.tbl = .,
.vars = dplyr::vars("id"),
.funs = ~ as.factor(.)
) %>%
base::split(x = ., f = .$id) %>%
purrr::map_dfr(.x = ., .f = ~ rep_df(df = ., rep = .$n)) %>%
dplyr::mutate_at(
.tbl = .,
.vars = dplyr::vars("id"),
.funs = ~ as.numeric(as.character(.))
) %>%
dplyr::mutate_if(
.tbl = .,
.predicate = is.character,
.funs = ~ base::as.factor(.)
) %>%
dplyr::mutate_if(
.tbl = .,
.predicate = is.factor,
.funs = ~ base::droplevels(.)
) %>%
dplyr::arrange(.data = ., id)
# reordering the Class variables
Titanic_full$Class <-
base::factor(x = Titanic_full$Class,
levels = c("1st", "2nd", "3rd", "Crew", ordered = TRUE))
# looking at the final dataset
dplyr::glimpse(Titanic_full)
#> Observations: 2,201
#> Variables: 6
#> $ id <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,...
#> $ Class <fct> 3rd, 3rd, 3rd, 3rd, 3rd, 3rd, 3rd, 3rd, 3rd, 3rd, 3rd...
#> $ Sex <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male,...
#> $ Age <fct> Child, Child, Child, Child, Child, Child, Child, Chil...
#> $ Survived <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N...
#> $ n <dbl> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 3...
Notice that our final dataset contains information about over 2000 passangers, while the original dataset with counts information had only 32 rows. Now we can start asking interesting question from this dataset.
First, let’s see if the proportion of people who survived was different between sexes using ggpiestats
.
A number of arguments can be modified to change the appearance of this plot:
library(ggstatsplot)
ggstatsplot::ggpiestats(
data = Titanic_full, # dataframe
main = Survived, # rows in the contingency table
condition = Sex, # columns in the contingecy table
title = "Passengar survival by gender", # title for the entire plot
caption = "Source: Titanic survival dataset", # caption for the entire plot
legend.title = "Survived?", # legend title
facet.wrap.name = "Gender", # changing the facet wrap title
facet.proptest = TRUE, # proportion test for each facet
stat.title = "survival x gender" # title for statistical test
) + # further modification outside of ggstatsplot
ggplot2::scale_fill_brewer(palette = "Dark2")
As seen from this plot, the Pearson’s chi-square test of independence shows that the distribution of survival was different across males and females. Additionally, among both males and females, the proportion of survival was not equally likely (at 50%, i.e.), as shown by significant results (***
) from one-sample proportion tests for each facet.
In case the condition
argument is not specified, instead of chi-square test of independence, a proportion test will be carried out. For example, let’s see if there were equal proportions of different age groups.
library(ggstatsplot)
ggstatsplot::ggpiestats(
data = Titanic_full,
main = Age
) +
ggplot2::scale_fill_brewer(palette = "Set2")
As this plot shows there were overwhelmingly more number of adults than children on the boat and the proportion test attests to this.
grouped_ggpiestats
What if we want to do the same analysis separately for the four different Class on the Titanic (1st, 2nd, 3rd, Crew), i.e. checking how the survival-by-gender interaction changes by the passenger class in which the people were traveling? In that case, we will have to either write a for
loop or use purrr
, both of which are time consuming and can be a bit of a struggle.
ggstatsplot
provides a special helper function for such instances: grouped_ggpiestats
. This is merely a wrapper function around ggstatsplot::combine_plots
. It applies ggpiestats
across all levels of a specified grouping variable and then combines list of individual plots into a single plot. Note that the grouping variable can be anything: conditions in a given study, groups in a study sample, different studies, etc.
library(ggstatsplot)
ggstatsplot::grouped_ggpiestats(
# arguments relevant for ggstatsplot::gghistostats
data = Titanic_full,
grouping.var = Class,
title.prefix = "Passenger class",
stat.title = "survival x gender",
main = Survived,
condition = Sex,
# arguments relevant for ggstatsplot::combine_plots
title.text = "Survival in Titanic disaster by gender for all passenger classes",
caption.text = "Asterisks denote results from proportion tests; ***: p < 0.001, ns: non-significant",
nrow = 4,
ncol = 1,
labels = c("(a)","(b)","(c)", "(d)")
)
As seen from this quick exploratory analysis, across all passenger classes, the proportion of survived to non-survived individuals differed across genders: Men were more likely to perish than survive, whereas women were more likely to survive than perish. The only exception was the 3rd Class passangers where women were as likely to survive as to perish.
This will work even if the condition argument is not specified:
library(ggstatsplot)
ggstatsplot::grouped_ggpiestats(
data = Titanic_full,
main = Age,
grouping.var = Class
)
ggpiestats
+ purrr
Although this grouping function provides a quick way to explore the data, it leaves much to be desired. For example, the color palette can’t be further modified, the legend color combination differs across different levels of the grouping variable based on the frequencies, etc. For cases like these, it would be better to use a function like purrr::pmap
.
Note before: Unlike the function call so far, while using purrr::pmap
, we will need to quote the arguments.
# let's split the dataframe and create a list by passenger class
class_list <- Titanic_full %>%
base::split(x = ., f = .$Class, drop = TRUE)
# this created a list with 4 elements, one for each class
str(class_list)
#> List of 4
#> $ 1st :Classes 'tbl_df', 'tbl' and 'data.frame': 325 obs. of 6 variables:
#> ..$ id : num [1:325] 9 9 9 9 9 9 9 9 9 9 ...
#> ..$ Class : Factor w/ 5 levels "1st","2nd","3rd",..: 1 1 1 1 1 1 1 1 1 1 ...
#> ..$ Sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
#> ..$ Age : Factor w/ 2 levels "Adult","Child": 1 1 1 1 1 1 1 1 1 1 ...
#> ..$ Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
#> ..$ n : num [1:325] 118 118 118 118 118 118 118 118 118 118 ...
#> $ 2nd :Classes 'tbl_df', 'tbl' and 'data.frame': 285 obs. of 6 variables:
#> ..$ id : num [1:285] 10 10 10 10 10 10 10 10 10 10 ...
#> ..$ Class : Factor w/ 5 levels "1st","2nd","3rd",..: 2 2 2 2 2 2 2 2 2 2 ...
#> ..$ Sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
#> ..$ Age : Factor w/ 2 levels "Adult","Child": 1 1 1 1 1 1 1 1 1 1 ...
#> ..$ Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
#> ..$ n : num [1:285] 154 154 154 154 154 154 154 154 154 154 ...
#> $ 3rd :Classes 'tbl_df', 'tbl' and 'data.frame': 706 obs. of 6 variables:
#> ..$ id : num [1:706] 3 3 3 3 3 3 3 3 3 3 ...
#> ..$ Class : Factor w/ 5 levels "1st","2nd","3rd",..: 3 3 3 3 3 3 3 3 3 3 ...
#> ..$ Sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
#> ..$ Age : Factor w/ 2 levels "Adult","Child": 2 2 2 2 2 2 2 2 2 2 ...
#> ..$ Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
#> ..$ n : num [1:706] 35 35 35 35 35 35 35 35 35 35 ...
#> $ Crew:Classes 'tbl_df', 'tbl' and 'data.frame': 885 obs. of 6 variables:
#> ..$ id : num [1:885] 12 12 12 12 12 12 12 12 12 12 ...
#> ..$ Class : Factor w/ 5 levels "1st","2nd","3rd",..: 4 4 4 4 4 4 4 4 4 4 ...
#> ..$ Sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
#> ..$ Age : Factor w/ 2 levels "Adult","Child": 1 1 1 1 1 1 1 1 1 1 ...
#> ..$ Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
#> ..$ n : num [1:885] 670 670 670 670 670 670 670 670 670 670 ...
# running function on every element of this list note that if you want the same
# value for a given argument across all elements of the list, you need to
# specify it just once
plot_list <- purrr::pmap(
.l = list(
data = class_list,
main = "Survived",
condition = "Sex",
facet.wrap.name = "Gender",
title = list(
"Passenger class: 1st",
"Passenger class: 2nd",
"Passenger class: 3rd",
"Passenger class: Crew"
),
caption = list(
"Total: 319, Died: 120, Survived: 199, % Survived: 62%",
"Total: 272, Died: 155, Survived: 117, % Survived: 43%",
"Total: 709, Died: 537, Survived: 172, % Survived: 25%",
"Not available"
),
messages = FALSE
),
.f = ggstatsplot::ggpiestats
)
# combining all individual plots from the list into a single plot using combine_plots function
ggstatsplot::combine_plots(
plot_list$`1st` + ggplot2::scale_fill_brewer(palette = "Dark2"),
plot_list$`2nd` + ggplot2::scale_fill_brewer(palette = "Dark2"),
plot_list$`3rd` + ggplot2::scale_fill_manual(values = c("#D95F02", "#1B9E77")), # to be consistent with other legends
plot_list$Crew + ggplot2::scale_fill_brewer(palette = "Dark2"),
title.text = "Survival in Titanic disaster by gender for all passenger classes",
caption.text = "Asterisks denote results from proportion tests; ***: p < 0.001, ns: non-significant",
nrow = 4,
ncol = 1,
labels = c("(a)","(b)","(c)", "(d)")
)
As can be appreciated from this example, although grouped_ggpiestats
provides a quick way to explore data, purrr::pmap
lets us utilize the full functionality of this function and ggplot2
.
Variant of this function for within-subjects designs, which will display results from McNemar test, is currently under work. You can still use this function just to prepare the plot for exploratory data analysis, but the statistical details displayed in the subtitle will be incorrect. You can remove them by adding + ggplot2::labs(subtitle = NULL)
.
If you find any bugs or have any suggestions/remarks, please file an issue on GitHub: https://github.com/IndrajeetPatil/ggstatsplot/issues