The sumvar package provides simple, easy-to-use tools for summarising continuous and categorical data, inspired by Stata’s “sum” and “tab” commands. All functions are tidyverse/dplyr pipe-friendly and return tibbles.
When I moved from Stata to R about 5 years ago, the main thing I missed was the simplicity of the “sum” and “tab” commands for exploring data efficiently. Template code for these tasks in introductory R books and tutorials (e.g. https://r4ds.hadley.nz/data-tidy.html) typically takes 3-5 lines to replicate each command. I couldn’t find a package that explored data quite as simply and efficiently.
Sumvar is fast and easy to use, and brings these variable summary functions to R.
Call dist_sum() to explore a continuous variable.
The tibble output shows the number of rows in the data, the number of missing values, the median, the interquartile range (25th and 75th centiles), the mean, the standard deviation, the 95% confidence interval for the mean using the Wald method (normal approximation), and the minimum and maximum values.
dist_sum() also shows a density plot and histogram for a single variable, or a grouped density plot when there is a grouping variable. With a grouping variable, the output also includes p_ttest and p_wilcox columns.
You can save the output from dist_sum() as a tibble and use the estimates for downstream analysis, e.g.
sum_df <- df %>% dist_sum(age, sex)
# Load the package; dplyr supplies the %>% pipe used below
library(sumvar)
library(dplyr)
# Example data
set.seed(123)
df <- tibble::tibble(
  age = rnorm(100, mean = 50, sd = 20),
  sex = sample(c("male", "female"), 100, replace = TRUE)) %>%
  dplyr::mutate(age = dplyr::if_else(sex == "male", age + 10, age))
# Call dist_sum
df %>% dist_sum(age)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> # A tibble: 1 × 11
#>       n n_miss median   p25   p75  mean    sd ci_lower ci_upper   min   max
#>   <int>  <int>  <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl> <dbl> <dbl>
#> 1   100      0   55.6  44.0  68.1  56.9  18.2     53.3     60.5  13.8  101.
df %>% dist_sum(age, sex)
#> # A tibble: 2 × 14
#>   sex        n n_miss median   p25   p75  mean    sd   min   max ci_lower
#>   <chr>  <int>  <int>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
#> 1 female    49      0   52.5  41.1  65.6  54.7  17.6  16.3  93.7     49.8
#> 2 male      51      0   57.2  46.8  71.3  59.0  18.6  13.8  101.     53.9
#> # ℹ 3 more variables: ci_upper <dbl>, p_ttest <dbl>, p_wilcox <dbl>
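For reference, the single-variable columns reported above can be approximated by hand in base R. The following is a minimal sketch, not sumvar's actual implementation: it assumes the Wald interval is mean ± z × sd/√n with z = qnorm(0.975), and it uses the default method of quantile(), which may differ from the one dist_sum() uses. The vector x and the object manual_sum are hypothetical.

```r
# Hand-rolled summary of a continuous variable (a sketch, not sumvar's code).
set.seed(123)
x <- rnorm(100, mean = 50, sd = 20)   # hypothetical example vector

n      <- length(x)
n_miss <- sum(is.na(x))
x_obs  <- x[!is.na(x)]                # drop missing values before summarising

mean_x <- mean(x_obs)
sd_x   <- sd(x_obs)
se_x   <- sd_x / sqrt(length(x_obs))  # standard error of the mean
z      <- qnorm(0.975)                # about 1.96 for a 95% interval

manual_sum <- data.frame(
  n        = n,
  n_miss   = n_miss,
  median   = median(x_obs),
  p25      = unname(quantile(x_obs, 0.25)),
  p75      = unname(quantile(x_obs, 0.75)),
  mean     = mean_x,
  sd       = sd_x,
  ci_lower = mean_x - z * se_x,       # Wald (normal approximation) limits
  ci_upper = mean_x + z * se_x,
  min      = min(x_obs),
  max      = max(x_obs)
)
print(manual_sum)
```

As a check on the assumed formula, plugging the rounded values from the ungrouped output above (mean 56.9, sd 18.2, n 100) into it gives 56.9 ± 1.96 × 1.82, or roughly 53.3 to 60.5, which agrees with the ci_lower and ci_upper shown.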
To explore the distribution of dates, call dist_date(), which works like dist_sum() and can also be grouped by a second variable. With a single date variable, a histogram is shown; when a grouping variable is also supplied, a density plot is shown.
df3 <- tibble::tibble(
  dates = as.Date("2022-01-01") + rnorm(n = 100, sd = 50, mean = 0),
  group = sample(c("A", "B"), 100, TRUE)) %>%
  # dt shifts group A by 10 days; the calls below summarise the original dates column
  dplyr::mutate(dt = dplyr::case_when(group == "A" ~ dates + 10, TRUE ~ dates))
df3 %>% dist_date(dates)
#> # A tibble: 1 × 7
#>       n n_miss min        p25        median     p75        max
#>   <int>  <int> <date>     <date>     <date>     <date>     <date>
#> 1   100      0 2021-10-25 2021-11-26 2021-12-22 2022-01-28 2022-06-12
df3 %>% dist_date(dates, group)
#> # A tibble: 2 × 8
#>   group     n n_miss min        p25        median     p75        max
#>   <chr> <int>  <int> <date>     <date>     <date>     <date>     <date>
#> 1 A        43      0 2021-10-25 2021-11-25 2021-12-17 2022-01-16 2022-06-12
#> 2 B        57      0 2021-10-27 2021-12-01 2022-01-03 2022-02-07 2022-04-20
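The date summary can be sketched in base R by treating dates as day counts. This is an illustration under assumptions rather than the code dist_date() runs: the vector d and the object date_sum are hypothetical, and the quantile method may differ from the package's.

```r
# Quantiles of a Date vector via its underlying day counts (a sketch).
set.seed(123)
d <- as.Date("2022-01-01") + round(rnorm(100, mean = 0, sd = 50))  # hypothetical dates

probs <- c(min = 0, p25 = 0.25, median = 0.5, p75 = 0.75, max = 1)
days  <- quantile(as.numeric(d), probs = probs, names = FALSE, na.rm = TRUE)

# Convert day counts back to dates; fractional days are kept internally
# and dropped when printed.
date_sum <- as.Date(days, origin = "1970-01-01")
names(date_sum) <- names(probs)
print(date_sum)
```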
tab1() produces a tibble showing the distribution of a categorical variable and illustrates it with a horizontal bar chart.
#> # A tibble: 4 × 3
#>   Category Frequency Percent
#>   <chr>        <int> <chr>
#> 1 C               71 35.5
#> 2 A               66 33.0
#> 3 B               63 31.5
#> 4 Total          200 100.0
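The call that produced this table is not shown here. As a rough illustration of what the columns contain, the sketch below builds a similar frequency table in base R. It is not tab1()'s implementation, and the vector grp and the object freq are hypothetical.

```r
# Frequency table with percentages and a Total row (a sketch, not sumvar's code).
set.seed(1)
grp <- sample(c("A", "B", "C"), 200, replace = TRUE)  # hypothetical categories

counts <- sort(table(grp), decreasing = TRUE)         # most frequent first

freq <- data.frame(
  Category  = c(names(counts), "Total"),
  Frequency = c(as.integer(counts), length(grp)),
  stringsAsFactors = FALSE
)
# Percent is stored as text with one decimal place, like the <chr> column above
freq$Percent <- sprintf("%.1f", 100 * freq$Frequency / length(grp))
print(freq)
```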
To explore the proportion of duplicate values and missing values in a variable, pass it to dup().
example_data <- dplyr::tibble(id = 1:200, age = round(rnorm(200, mean = 30, sd = 50), digits = 0))
example_data$age[sample(1:200, size = 15)] <- NA # Replace 15 values with missing.
example_data %>% dup(age)
#> # A tibble: 1 × 7
#>   Variable     n n_unique n_duplicate percent_duplicate n_missing
#>   <chr>    <int>    <int>       <int>             <dbl>     <int>
#> 1 age        200      119          66              35.7        15
#> # ℹ 1 more variable: percent_missing <dbl>
If you pass a whole data frame to dup(), it produces a summary of duplicates and missingness for every variable. dup() illustrates the result with a stacked bar chart.
example_data <- dplyr::tibble(
  age = round(rnorm(200, mean = 30, sd = 50), digits = 0),
  sex = sample(c("Male", "Female"), 200, TRUE),
  favourite_colour = sample(c("Red", "Blue", "Purple"), 200, TRUE))
example_data$age[sample(1:200, size = 15)] <- NA # Replace 15 values with missing.
example_data$sex[sample(1:200, size = 32)] <- NA # Replace 32 values with missing.
dup(example_data)
#> # A tibble: 3 × 7
#>   Variable             n n_unique n_duplicate percent_duplicate n_missing
#>   <chr>            <int>    <int>       <int>             <dbl>     <int>
#> 1 age                200      117          68              36.8        15
#> 2 sex                200        2         166              98.8        32
#> 3 favourite_colour   200        3         197              98.5         0
#> # ℹ 1 more variable: percent_missing <dbl>
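The counts in the two outputs above are consistent with n = n_unique + n_duplicate + n_missing, and with percent_duplicate being n_duplicate divided by the number of non-missing values (for example, 66 / 185 ≈ 35.7% for age in the single-variable output). A minimal base R sketch under that reading follows. The helper dup_by_hand and the data frame toy are hypothetical, and percent_missing as n_missing / n is an assumption, because that column is truncated in the printed output.

```r
# Duplicate and missing counts per column (a sketch, not sumvar's code).
dup_by_hand <- function(x) {
  x_obs <- x[!is.na(x)]                       # non-missing values
  n_dup <- sum(duplicated(x_obs))             # repeats of an earlier value
  data.frame(
    n                 = length(x),
    n_unique          = length(unique(x_obs)),
    n_duplicate       = n_dup,
    percent_duplicate = 100 * n_dup / length(x_obs),
    n_missing         = sum(is.na(x)),
    percent_missing   = 100 * mean(is.na(x))  # assumed definition
  )
}

set.seed(1)
toy <- data.frame(
  age = round(rnorm(200, mean = 30, sd = 50)),
  sex = sample(c("Male", "Female"), 200, replace = TRUE)
)
toy$age[sample(1:200, size = 15)] <- NA       # make 15 ages missing

# Apply the helper to every column and stack the results
res <- do.call(rbind, lapply(toy, dup_by_hand))
res <- cbind(Variable = names(toy), res)
print(res)
```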