Title: Summarise Continuous, Date and Categorical Variables, Check for Duplicates and Missing Data
Version: 0.1
Description: Explore continuous, date and categorical variables. 'sumvar' aims to bring the ease and simplicity of the "sum" and "tab" functions from 'Stata' to R.
Encoding: UTF-8
RoxygenNote: 7.3.2
Imports: dplyr, ggplot2, lubridate, magrittr, patchwork, purrr, rlang, scales, stats, tibble, tidyr, utils
Suggests: knitr, rmarkdown, testthat (≥ 3.0.0)
Config/testthat/edition: 3
URL: https://github.com/alstockdale/sumvar, https://alstockdale.github.io/sumvar/
BugReports: https://github.com/alstockdale/sumvar/issues
License: MIT + file LICENSE
VignetteBuilder: knitr
NeedsCompilation: no
Packaged: 2025-06-11 17:14:46 UTC; al_st
Author: Alexander Stockdale [aut, cre]
Maintainer: Alexander Stockdale <a.stockdale@liverpool.ac.uk>
Repository: CRAN
Date/Publication: 2025-06-13 20:00:02 UTC

sumvar: Summarise Continuous and Categorical Variables in R

Description

The sumvar package explores continuous and categorical variables. sumvar brings the ease and simplicity of the "sum" and "tab" functions from Stata to R.

All functions are tidyverse/dplyr-friendly and accept the %>% pipe, outputting results as a tibble. You can save outputs for further manipulation, e.g. summary <- df %>% dist_sum(var).

Author(s)

Maintainer: Alexander Stockdale a.stockdale@liverpool.ac.uk

See Also

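Useful links:

https://github.com/alstockdale/sumvar
https://alstockdale.github.io/sumvar/
Report bugs at https://github.com/alstockdale/sumvar/issues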


Pipe operator

Description

See magrittr::%>% for details.

Usage

lhs %>% rhs

Arguments

lhs

A value or the magrittr placeholder.

rhs

A function call using the magrittr semantics.

Value

The result of calling rhs(lhs).


Summarize and visualize a date variable

Description

Summarises the minimum, maximum, median, and interquartile range of a date variable, optionally stratified by a grouping variable. Produces a histogram and (optionally) a density plot.

Usage

dist_date(data, var, by = NULL)

Arguments

data

A data frame or tibble.

var

The date variable to summarise.

by

Optional grouping variable.

Value

A tibble with summary statistics for the date variable.
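
As a rough illustration of the statistics listed above, the following base R sketch computes the same kind of summary for a vector of dates. It is not the code used inside dist_date; the column names (n, min, p25, median, p75, max) and the choice of quantile type 1 are assumptions made for this example.

```r
# Base R sketch of a date summary; not sumvar's internal implementation.
set.seed(1)
dt <- as.Date("2020-01-01") + sample(0:1000, 100, TRUE)

# quantile() type 1 returns observed values, so it works directly on Date objects.
date_summary <- data.frame(
  n      = sum(!is.na(dt)),
  min    = min(dt, na.rm = TRUE),
  p25    = quantile(dt, 0.25, type = 1, na.rm = TRUE, names = FALSE),
  median = median(dt, na.rm = TRUE),
  p75    = quantile(dt, 0.75, type = 1, na.rm = TRUE, names = FALSE),
  max    = max(dt, na.rm = TRUE)
)
date_summary
```

The interquartile range is shown here as its two bounds (p25 and p75), which is one common way to report it for dates.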

See Also

dist_sum for continuous variables.

Examples

# Example ungrouped
df <- tibble::tibble(
  dt = as.Date("2020-01-01") + sample(0:1000, 100, TRUE)
)
dist_date(df, dt)

# Example grouped
df2 <- tibble::tibble(
  dt = as.Date("2020-01-01") + sample(0:1000, 100, TRUE),
  grp = sample(1:2, 100, TRUE)
)
dist_date(df2, dt, grp)
# Note: this function accepts a pipe from dplyr, e.g. df %>% dist_date(date_var, group_var)

Explore a continuous variable

Description

Summarises the median, interquartile range, mean, standard deviation and confidence intervals of the mean, and produces a density plot, optionally stratified by a grouping variable.

Provides frequentist hypothesis tests for comparison between groups: t-test and Wilcoxon rank sum test for 2 groups, ANOVA and Kruskal-Wallis test for 3 or more groups.

The function accepts an input from a dplyr pipe "%>%" and outputs the results as a tibble.

Usage

dist_sum(data, var, by = NULL)

Arguments

data

The data frame or tibble

var

The variable you would like to summarise

by

Optional grouping variable

Value

A tibble with a summary of the variable frequency (n), number of missing observations (n_miss), median, interquartile range, mean, SD, 95% confidence intervals of the mean (using the Z distribution), and density plots.
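
To make the "95% confidence intervals of the mean (using the Z distribution)" concrete, here is a minimal base R sketch of that calculation. It assumes missing values are dropped before the mean and SD are computed; the variable names (n, n_miss, se, z, ci) are chosen for this example and are not taken from the package source.

```r
# Illustrative sketch of a Z-based 95% confidence interval for a mean.
set.seed(42)
age <- rnorm(100, mean = 30, sd = 10)
age[c(5, 17)] <- NA                 # introduce two missing values

x      <- age[!is.na(age)]          # complete observations only (assumption)
n      <- length(x)
n_miss <- sum(is.na(age))
se     <- sd(x) / sqrt(n)           # standard error of the mean
z      <- qnorm(0.975)              # normal quantile for a two-sided 95% interval

ci <- c(lower = mean(x) - z * se, upper = mean(x) + z * se)
ci
```

A Z-based interval is slightly narrower than a t-based one for small samples; the two converge as n grows.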

Shows the t-test (p_ttest) and Wilcoxon rank sum (p_wilcox) hypothesis tests when there are two groups, and an ANOVA test (p_anova) and Kruskal-Wallis test (p_kruskal) when there are three or more groups.
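
For readers who want to see where such p-values can come from, the sketch below runs the four named tests with functions from the stats package. This is an assumption about equivalent base R calls, not a copy of sumvar's code; in particular, t.test defaults to the Welch (unequal variance) version, and the documentation above does not state which variant dist_sum uses.

```r
# Sketch: the four tests named above, using base R's stats package.
set.seed(123)
two <- data.frame(
  age = rnorm(100, mean = 30, sd = 10),
  sex = sample(c("male", "female"), 100, replace = TRUE)
)
p_ttest  <- t.test(age ~ sex, data = two)$p.value       # Welch t-test by default
p_wilcox <- wilcox.test(age ~ sex, data = two)$p.value  # Wilcoxon rank sum test

many <- data.frame(
  age   = rnorm(100, mean = 30, sd = 10),
  group = sample(c("a", "b", "c", "d"), 100, replace = TRUE)
)
# One-way ANOVA: extract the p-value for the grouping term from the summary table.
p_anova   <- summary(aov(age ~ group, data = many))[[1]][["Pr(>F)"]][1]
p_kruskal <- kruskal.test(age ~ group, data = many)$p.value

c(p_ttest = p_ttest, p_wilcox = p_wilcox, p_anova = p_anova, p_kruskal = p_kruskal)
```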

Examples

example_data <- dplyr::tibble(id = 1:100, age = rnorm(100, mean = 30, sd = 10),
                              group = sample(c("a", "b", "c", "d"),
                              size = 100, replace = TRUE))
dist_sum(example_data, age, group)
example_data <- dplyr::tibble(id = 1:100, age = rnorm(100, mean = 30, sd = 10),
                             sex = sample(c("male", "female"),
                             size = 100, replace = TRUE))
dist_sum(example_data, age, sex)
summary <- dist_sum(example_data, age, sex) # Save summary statistics as a tibble.

Explore duplicate and missing data

Description

Provides an integer value for the number of duplicates found within a variable. The function accepts an input from a dplyr pipe "%>%" and outputs the results as a tibble.

e.g. example_data %>% dup(variable)

Usage

dup(data, var = NULL)

Arguments

data

The data frame or tibble

var

The variable to assess

Value

A tibble with the number and percentage of duplicate values found, and the number of missing values (NA), together with percentages.
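
The following base R sketch shows one plausible way to obtain counts of this kind. The definition used here is an assumption, since the documentation does not spell it out: a duplicate is any repeat of a non-missing value beyond its first occurrence, and percentages use the total number of rows as the denominator. The column names are hypothetical.

```r
# Sketch: counting duplicates and missing values in one variable with base R.
set.seed(7)
age <- round(rnorm(200, mean = 30, sd = 50))
age[sample(1:200, size = 15)] <- NA          # replace 15 values with missing

non_missing <- age[!is.na(age)]
n_dup  <- sum(duplicated(non_missing))       # repeats beyond the first occurrence
n_miss <- sum(is.na(age))

dup_summary <- data.frame(
  n        = length(age),
  n_dup    = n_dup,
  pct_dup  = round(100 * n_dup / length(age), 1),
  n_miss   = n_miss,
  pct_miss = round(100 * n_miss / length(age), 1)
)
dup_summary
```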

Examples

example_data <- dplyr::tibble(id = 1:200, age = round(rnorm(200, mean = 30, sd = 50), digits=0))
example_data$age[sample(1:200, size = 15)] <- NA  # Replace 15 values with missing.
dup(example_data, age)
# It is also possible to pass a whole data frame to dup, which then explores all variables.
example_data <- dplyr::tibble(age = round(rnorm(200, mean = 30, sd = 50), digits=0),
                              sex = sample(c("Male", "Female"), 200, TRUE),
                              favourite_colour = sample(c("Red", "Blue", "Purple"), 200, TRUE))
example_data$age[sample(1:200, size = 15)] <- NA  # Replace 15 values with missing.
example_data$sex[sample(1:200, size = 32)] <- NA  # Replace 32 values with missing.
dup(example_data)

Create a cross-tabulation of two categorical variables

Description

Creates an "n x m" cross-tabulation of two categorical variables, with row percentages. Includes options for adding frequentist hypothesis testing.

The function accepts an input from a dplyr pipe "%>%" and outputs the results as a tibble.

e.g. example_data %>% tab(variable1, variable2)

Usage

tab(data, variable1, variable2, test = "none")

Arguments

data

The data frame or tibble

variable1

The first categorical variable

variable2

The second categorical variable

test

Optional frequentist hypothesis test: use test = "exact" for Fisher's exact test or test = "chi" for the chi-squared test. The default, "none", runs no test.

Value

A tibble with a cross-tabulation of frequencies and row percentages
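
As an illustration of what "frequencies and row percentages" and the two optional tests refer to, here is a base R sketch using table, prop.table, chisq.test and fisher.test. It is meant as a reference calculation under the assumption that these standard functions match the intent of tab; it does not reproduce the tibble layout that tab returns.

```r
# Sketch: cross-tabulation with row percentages and two association tests.
set.seed(99)
group1 <- sample(c("a", "b", "c", "d"), 100, replace = TRUE)
group2 <- sample(c("male", "female"), 100, replace = TRUE)

counts  <- table(group1, group2)                            # cell frequencies
row_pct <- round(100 * prop.table(counts, margin = 1), 1)   # percentages within each row

p_chi   <- chisq.test(counts)$p.value    # Pearson chi-squared test
p_exact <- fisher.test(counts)$p.value   # Fisher's exact test

counts
row_pct
```

Fisher's exact test is usually preferred when some expected cell counts are small; the chi-squared test is an approximation that works well for larger counts.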

Examples

example_data <- dplyr::tibble(id = 1:100, group1 = sample(c("a", "b", "c", "d"),
                                                  size = 100, replace = TRUE),
                                                  group2= sample(c("male", "female"),
                                                  size = 100, replace = TRUE))
example_data$group1[sample(1:100, size = 10)] <- NA  # Replace 10 with missing
tab(example_data, group1, group2)
summary <- tab(example_data, group1, group2) # Save summary statistics as a tibble.

Summarise a categorical variable

Description

Summarises frequencies and percentages for a categorical variable.

The function accepts an input from a dplyr pipe "%>%" and outputs the results as a tibble, e.g. example_data %>% tab1(variable)

Usage

tab1(data, variable, dp = 1)

Arguments

data

The data frame or tibble

variable

The categorical variable you would like to summarise

dp

The number of decimal places for percentages (default = 1)

Value

A tibble with frequencies and percentages
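
A base R sketch of a one-way frequency table with percentages rounded to dp decimal places is shown below. Treating NA as its own category and the column names (value, n, percent) are assumptions for this example; the layout returned by tab1 may differ.

```r
# Sketch: one-way frequencies and percentages with base R.
set.seed(2024)
group <- sample(c("a", "b", "c", "d"), 100, replace = TRUE)
group[sample(1:100, size = 10)] <- NA        # replace 10 values with missing

dp   <- 1                                    # decimal places for percentages
freq <- table(group, useNA = "ifany")        # keep NA as its own category

tab1_sketch <- data.frame(
  value   = names(freq),
  n       = as.vector(freq),
  percent = round(100 * as.vector(freq) / sum(freq), dp)
)
tab1_sketch
```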

Examples

example_data <- dplyr::tibble(id = 1:100, group = sample(c("a", "b", "c", "d"),
                                                  size = 100, replace = TRUE))
example_data$group[sample(1:100, size = 10)] <- NA  # Replace 10 with missing
tab1(example_data, group)
summary <- tab1(example_data, group) # Save summary statistics as a tibble.
