Descriptive statistics are used to summarize data. They enable us to present the data in a more meaningful way and to discern any patterns in it. They can be useful for two purposes:
This document introduces a basic set of functions that describe data. A second vignette covers the functions that help visualize statistical distributions.
The screener function will screen a data set and return the following:
- Column/Variable Names
- Data Type
- Levels (in case of categorical data)
- Number of missing observations
- % of missing observations
mt <- mtcars
mt[, c(2, 8:11)] <- lapply(mt[, c(2, 8:11)], factor)
mt[sample(1:nrow(mt), 12), sample(1:ncol(mt), 6)] <- NA
screener(mt)
## -----------------------------------------------------------------------
## | Column Name | Data Type | Levels | Missing | Missing (%) |
## -----------------------------------------------------------------------
## | mpg | numeric | NA | 12 | 37.5 |
## | cyl | factor | 4 6 8 | 12 | 37.5 |
## | disp | numeric | NA | 0 | 0 |
## | hp | numeric | NA | 12 | 37.5 |
## | drat | numeric | NA | 12 | 37.5 |
## | wt | numeric | NA | 0 | 0 |
## | qsec | numeric | NA | 0 | 0 |
## | vs | factor | 0 1 | 0 | 0 |
## | am | factor | 0 1 | 12 | 37.5 |
## | gear | factor | 3 4 5 | 12 | 37.5 |
## | carb | factor |1 2 3 4 6 8| 0 | 0 |
## -----------------------------------------------------------------------
##
## Overall Missing Values 72
## Percentage of Missing Values 20.45 %
## Rows with Missing Values 12
## Columns With Missing Values 6
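To make the missing-value figures above easier to verify, here is a minimal base R sketch that computes the same kind of summary by hand. It is not how screener is implemented; the `set.seed` call and the names `col_types`, `missing_n` and `missing_pct` are assumptions added for illustration.

```r
# Illustrative base R sketch; not the implementation of screener().
set.seed(1)  # assumption: seed added so the NA pattern is reproducible
mt <- mtcars
mt[, c(2, 8:11)] <- lapply(mt[, c(2, 8:11)], factor)
mt[sample(1:nrow(mt), 12), sample(1:ncol(mt), 6)] <- NA

col_types   <- vapply(mt, function(x) class(x)[1], character(1))
missing_n   <- colSums(is.na(mt))                   # missing observations per column
missing_pct <- round(100 * missing_n / nrow(mt), 2) # % missing per column

data.frame(type = col_types, missing = missing_n, missing_pct = missing_pct)

sum(is.na(mt))                   # overall missing values
round(100 * mean(is.na(mt)), 2)  # percentage of missing values
sum(!complete.cases(mt))         # rows with missing values
sum(missing_n > 0)               # columns with missing values
```

Because 12 distinct rows and 6 distinct columns are blanked out, the totals do not depend on the seed; only which rows and columns are affected changes.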
The summary_stats function returns a comprehensive set of statistics for continuous data.
summary_stats(mtcars$mpg)
## Univariate Analysis
##
## N 32.00 Variance 36.32
## Missing 0.00 Std Deviation 6.03
## Mean 20.09 Range 23.50
## Median 19.20 Interquartile Range 7.38
## Mode 10.40 Uncorrected SS 14042.31
## Trimmed Mean 19.95 Corrected SS 1126.05
## Skewness 0.67 Coeff Variation 30.00
## Kurtosis -0.02 Std Error Mean 1.07
##
## Quantiles
##
## Quantile Value
##
## Max 33.90
## 99% 33.44
## 95% 31.30
## 90% 30.09
## Q3 22.80
## Median 19.20
## Q1 15.43
## 10% 14.34
## 5% 12.00
## 1% 10.40
## Min 10.40
##
## Extreme Values
##
## Low High
##
## Obs Value Obs Value
## 15 10.4 20 33.9
## 16 10.4 18 32.4
## 24 13.3 19 30.4
## 7 14.3 28 30.4
## 17 14.7 26 27.3
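Some of the less familiar entries in the table above can be checked with a few lines of base R. The sketch below assumes the usual textbook definitions (sample standard deviation, coefficient of variation expressed in percent); the variable names are illustrative and this is not the package's own code.

```r
# Illustrative base R sketch of selected statistics for mtcars$mpg.
x <- mtcars$mpg
n <- length(x)

uncorrected_ss <- sum(x^2)               # raw sum of squares
corrected_ss   <- sum((x - mean(x))^2)   # squared deviations from the mean
coeff_var      <- 100 * sd(x) / mean(x)  # coefficient of variation, in percent
std_error      <- sd(x) / sqrt(n)        # standard error of the mean

round(c(uncorrected_ss = uncorrected_ss,
        corrected_ss   = corrected_ss,
        coeff_var      = coeff_var,
        std_error      = std_error), 2)
```

The corrected sum of squares is simply the variance multiplied by `n - 1`, which is a quick way to cross-check the two columns of the output.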
The cross_table function creates two-way tables of categorical variables. It is not necessary to coerce a variable to type factor.
cross_table(mtcars$cyl, mtcars$gear)
## Cell Contents
## |---------------|
## | Frequency |
## | Percent |
## | Row Pct |
## | Col Pct |
## |---------------|
##
## Total Observations: 32
##
## ----------------------------------------------------------------------------
## | | gear |
## ----------------------------------------------------------------------------
## | cyl | 3 | 4 | 5 | Row Total |
## ----------------------------------------------------------------------------
## | 4 | 1 | 8 | 2 | 11 |
## | | 0.031 | 0.25 | 0.062 | |
## | | 0.09 | 0.73 | 0.18 | 0.34 |
## | | 0.07 | 0.67 | 0.4 | |
## ----------------------------------------------------------------------------
## | 6 | 2 | 4 | 1 | 7 |
## | | 0.062 | 0.125 | 0.031 | |
## | | 0.29 | 0.57 | 0.14 | 0.22 |
## | | 0.13 | 0.33 | 0.2 | |
## ----------------------------------------------------------------------------
## | 8 | 12 | 0 | 2 | 14 |
## | | 0.375 | 0 | 0.062 | |
## | | 0.86 | 0 | 0.14 | 0.44 |
## | | 0.8 | 0 | 0.4 | |
## ----------------------------------------------------------------------------
## | Column Total | 15 | 12 | 5 | 32 |
## | | 0.468 | 0.375 | 0.155 | |
## ----------------------------------------------------------------------------
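The four numbers in each cell (frequency, percent, row percent, column percent) correspond to standard base R operations. The following sketch uses `table`, `prop.table` and `addmargins` to show where each figure comes from; it is an illustration rather than the code behind cross_table.

```r
# Illustrative base R sketch of the four cell statistics.
tab <- table(cyl = mtcars$cyl, gear = mtcars$gear)  # frequencies

cell_pct <- prop.table(tab)              # share of all observations
row_pct  <- prop.table(tab, margin = 1)  # share within each cyl row
col_pct  <- prop.table(tab, margin = 2)  # share within each gear column

addmargins(tab)     # frequencies with row and column totals
round(cell_pct, 3)
round(row_pct, 2)
round(col_pct, 2)
```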
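A plot method has been defined which will generate grouped bar plots (beside = TRUE), stacked bar plots (the default) and proportional stacked bar plots (proportional = TRUE):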
k <- cross_table(mtcars$cyl, mtcars$gear)
plot(k, beside = TRUE)
k <- cross_table(mtcars$cyl, mtcars$gear)
plot(k)
k <- cross_table(mtcars$cyl, mtcars$gear)
plot(k, proportional = TRUE)
Mosaic plots can be created using the mosaicplot method.
k <- cross_table(mtcars$cyl, mtcars$gear)
mosaicplot(k)
The freq_table function creates frequency tables for categorical variables.
mt <- mtcars
mt$cyl <- as.factor(mt$cyl)
freq_table(mt$cyl)
## Variable: cyl
## |--------------------------------------------------------------------------|
## | Cumulative Cumulative |
## | Levels | Frequency | Frequency | Percent | Percent |
## |--------------------------------------------------------------------------|
## | 4 | 11 | 11 | 34.38 | 34.38 |
## |--------------------------------------------------------------------------|
## | 6 | 7 | 18 | 21.88 | 56.25 |
## |--------------------------------------------------------------------------|
## | 8 | 14 | 32 | 43.75 | 100 |
## |--------------------------------------------------------------------------|
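The cumulative columns above are running totals of the frequency and percent columns. A minimal base R sketch of the same table, with illustrative column names that are not part of the package, could look like this:

```r
# Illustrative base R sketch of a frequency table with cumulative columns.
counts  <- table(mtcars$cyl)
n_level <- as.vector(counts)
percent <- 100 * n_level / sum(n_level)

freq <- data.frame(
  level       = names(counts),
  frequency   = n_level,
  cum_freq    = cumsum(n_level),            # running total of counts
  percent     = round(percent, 2),
  cum_percent = round(cumsum(percent), 2)   # running total of percentages
)
freq
```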
A barplot method has been defined.
mt <- mtcars
mt$cyl <- as.factor(mt$cyl)
k <- freq_table(mt$cyl)
barplot(k)
The freq_cont function creates frequency tables for continuous variables. The default number of intervals is 5.
freq_cont(mtcars$mpg, 4)
## Variable: mpg
## |---------------------------------------------------------------------------|
## | Cumulative Cumulative |
## | Bins | Frequency | Frequency | Percent | Percent |
## |---------------------------------------------------------------------------|
## | 10.4 - 16.3 | 10 | 10 | 31.25 | 31.25 |
## |---------------------------------------------------------------------------|
## | 16.3 - 22.1 | 13 | 23 | 40.62 | 71.88 |
## |---------------------------------------------------------------------------|
## | 22.1 - 28 | 5 | 28 | 15.62 | 87.5 |
## |---------------------------------------------------------------------------|
## | 28 - 33.9 | 4 | 32 | 12.5 | 100 |
## |---------------------------------------------------------------------------|
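The bin edges in the output above are consistent with splitting the range of the variable into equal-width intervals. Assuming that is the binning rule (the vignette does not state it explicitly), the counts can be reproduced in base R with `seq` and `cut`:

```r
# Illustrative base R sketch; assumes equal-width bins over the data range.
x    <- mtcars$mpg
bins <- 4

breaks <- seq(min(x), max(x), length.out = bins + 1)       # bin edges
binned <- cut(x, breaks = breaks, include.lowest = TRUE)   # assign each value to a bin

round(breaks, 1)
bin_counts <- as.vector(table(binned))
bin_counts
```

No observation in mtcars$mpg falls exactly on an interior edge, so the choice between left-closed and right-closed intervals does not affect the counts here.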
A hist method has been defined.
k <- freq_cont(mtcars$mpg, 4)
hist(k)
The group_summary function returns descriptive statistics of a continuous variable for the different levels of a categorical variable.
mt <- mtcars
mt$cyl <- as.factor(mt$cyl)
group_summary(mt$cyl, mt$mpg)
## mpg by cyl
## -----------------------------------------------------------------------------------------
## | Statistic/Levels| 4| 6| 8|
## -----------------------------------------------------------------------------------------
## | Obs| 11| 7| 14|
## | Minimum| 21.4| 17.8| 10.4|
## | Maximum| 33.9| 21.4| 19.2|
## | Mean| 26.66| 19.74| 15.1|
## | Median| 26| 19.7| 15.2|
## | Mode| 22.8| 21| 10.4|
## | Std. Deviation| 4.51| 1.45| 2.56|
## | Variance| 20.34| 2.11| 6.55|
## | Skewness| 0.35| -0.26| -0.46|
## | Kurtosis| -1.43| -1.83| 0.33|
## | Uncorrected SS| 8023.83| 2741.14| 3277.34|
## | Corrected SS| 203.39| 12.68| 85.2|
## | Coeff Variation| 16.91| 7.36| 16.95|
## | Std. Error Mean| 1.36| 0.55| 0.68|
## | Range| 12.5| 3.6| 8.8|
## | Interquartile Range| 7.6| 2.35| 1.85|
## -----------------------------------------------------------------------------------------
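For comparison, a subset of these group-wise statistics can be obtained in base R by splitting the continuous variable on the grouping variable and summarizing each piece. The sketch below is illustrative only; the row names are chosen for this example and do not come from the package.

```r
# Illustrative base R sketch of group-wise descriptive statistics.
by_cyl <- split(mtcars$mpg, mtcars$cyl)   # one numeric vector per cyl level

stats <- sapply(by_cyl, function(v) c(
  obs     = length(v),
  minimum = min(v),
  maximum = max(v),
  mean    = mean(v),
  median  = median(v),
  sd      = sd(v),
  iqr     = IQR(v)
))
round(stats, 2)
```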
A boxplot method has been defined.
mt <- mtcars
mt$cyl <- as.factor(mt$cyl)
k <- group_summary(mt$cyl, mt$mpg)
boxplot(k)
The oway_tables function creates multiple one-way tables by creating a frequency table for each categorical variable in a data frame.
mt <- mtcars
mt[, c(2, 8:11)] <- lapply(mt[, c(2, 8:11)], factor)
oway_tables(mt)
## Variable: cyl
## |--------------------------------------------------------------------------|
## | Cumulative Cumulative |
## | Levels | Frequency | Frequency | Percent | Percent |
## |--------------------------------------------------------------------------|
## | 4 | 11 | 11 | 34.38 | 34.38 |
## |--------------------------------------------------------------------------|
## | 6 | 7 | 18 | 21.88 | 56.25 |
## |--------------------------------------------------------------------------|
## | 8 | 14 | 32 | 43.75 | 100 |
## |--------------------------------------------------------------------------|
##
##
## Variable: vs
## |--------------------------------------------------------------------------|
## | Cumulative Cumulative |
## | Levels | Frequency | Frequency | Percent | Percent |
## |--------------------------------------------------------------------------|
## | 0 | 18 | 18 | 56.25 | 56.25 |
## |--------------------------------------------------------------------------|
## | 1 | 14 | 32 | 43.75 | 100 |
## |--------------------------------------------------------------------------|
##
##
## Variable: am
## |--------------------------------------------------------------------------|
## | Cumulative Cumulative |
## | Levels | Frequency | Frequency | Percent | Percent |
## |--------------------------------------------------------------------------|
## | 0 | 19 | 19 | 59.38 | 59.38 |
## |--------------------------------------------------------------------------|
## | 1 | 13 | 32 | 40.62 | 100 |
## |--------------------------------------------------------------------------|
##
##
## Variable: gear
## |--------------------------------------------------------------------------|
## | Cumulative Cumulative |
## | Levels | Frequency | Frequency | Percent | Percent |
## |--------------------------------------------------------------------------|
## | 3 | 15 | 15 | 46.88 | 46.88 |
## |--------------------------------------------------------------------------|
## | 4 | 12 | 27 | 37.5 | 84.38 |
## |--------------------------------------------------------------------------|
## | 5 | 5 | 32 | 15.62 | 100 |
## |--------------------------------------------------------------------------|
##
##
## Variable: carb
## |--------------------------------------------------------------------------|
## | Cumulative Cumulative |
## | Levels | Frequency | Frequency | Percent | Percent |
## |--------------------------------------------------------------------------|
## | 1 | 7 | 7 | 21.88 | 21.88 |
## |--------------------------------------------------------------------------|
## | 2 | 10 | 17 | 31.25 | 53.12 |
## |--------------------------------------------------------------------------|
## | 3 | 3 | 20 | 9.38 | 62.5 |
## |--------------------------------------------------------------------------|
## | 4 | 10 | 30 | 31.25 | 93.75 |
## |--------------------------------------------------------------------------|
## | 6 | 1 | 31 | 3.12 | 96.88 |
## |--------------------------------------------------------------------------|
## | 8 | 1 | 32 | 3.12 | 100 |
## |--------------------------------------------------------------------------|
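The idea of building one table per categorical column can be sketched in base R by selecting the factor columns and applying `table` to each. This is an illustration of the approach, not the package's implementation; `is_cat` and `one_way` are names made up for the example.

```r
# Illustrative base R sketch: one frequency table per factor column.
mt <- mtcars
mt[, c(2, 8:11)] <- lapply(mt[, c(2, 8:11)], factor)

is_cat  <- vapply(mt, is.factor, logical(1))  # TRUE for categorical columns
one_way <- lapply(mt[is_cat], table)          # named list of frequency tables

names(one_way)
one_way$carb
```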
The tway_tables function creates multiple two-way tables by creating a cross table for each unique pair of categorical variables in a data frame.
mt <- mtcars
mt[, c(2, 8:10)] <- lapply(mt[, c(2, 8:10)], factor)
tway_tables(mt)
## Cell Contents
## |---------------|
## | Frequency |
## | Percent |
## | Row Pct |
## | Col Pct |
## |---------------|
##
## Total Observations: 32
##
## cyl vs vs
## -------------------------------------------------------------
## | | vs |
## -------------------------------------------------------------
## | cyl | 0 | 1 | Row Total |
## -------------------------------------------------------------
## | 4 | 1 | 10 | 11 |
## | | 0.031 | 0.312 | |
## | | 0.09 | 0.91 | 0.34 |
## | | 0.06 | 0.71 | |
## -------------------------------------------------------------
## | 6 | 3 | 4 | 7 |
## | | 0.094 | 0.125 | |
## | | 0.43 | 0.57 | 0.22 |
## | | 0.17 | 0.29 | |
## -------------------------------------------------------------
## | 8 | 14 | 0 | 14 |
## | | 0.438 | 0 | |
## | | 1 | 0 | 0.44 |
## | | 0.78 | 0 | |
## -------------------------------------------------------------
## | Column Total | 18 | 14 | 32 |
## | | 0.563 | 0.437 | |
## -------------------------------------------------------------
##
##
## cyl vs am
## -------------------------------------------------------------
## | | am |
## -------------------------------------------------------------
## | cyl | 0 | 1 | Row Total |
## -------------------------------------------------------------
## | 4 | 3 | 8 | 11 |
## | | 0.094 | 0.25 | |
## | | 0.27 | 0.73 | 0.34 |
## | | 0.16 | 0.62 | |
## -------------------------------------------------------------
## | 6 | 4 | 3 | 7 |
## | | 0.125 | 0.094 | |
## | | 0.57 | 0.43 | 0.22 |
## | | 0.21 | 0.23 | |
## -------------------------------------------------------------
## | 8 | 12 | 2 | 14 |
## | | 0.375 | 0.062 | |
## | | 0.86 | 0.14 | 0.44 |
## | | 0.63 | 0.15 | |
## -------------------------------------------------------------
## | Column Total | 19 | 13 | 32 |
## | | 0.594 | 0.406 | |
## -------------------------------------------------------------
##
##
## cyl vs gear
## ----------------------------------------------------------------------------
## | | gear |
## ----------------------------------------------------------------------------
## | cyl | 3 | 4 | 5 | Row Total |
## ----------------------------------------------------------------------------
## | 4 | 1 | 8 | 2 | 11 |
## | | 0.031 | 0.25 | 0.062 | |
## | | 0.09 | 0.73 | 0.18 | 0.34 |
## | | 0.07 | 0.67 | 0.4 | |
## ----------------------------------------------------------------------------
## | 6 | 2 | 4 | 1 | 7 |
## | | 0.062 | 0.125 | 0.031 | |
## | | 0.29 | 0.57 | 0.14 | 0.22 |
## | | 0.13 | 0.33 | 0.2 | |
## ----------------------------------------------------------------------------
## | 8 | 12 | 0 | 2 | 14 |
## | | 0.375 | 0 | 0.062 | |
## | | 0.86 | 0 | 0.14 | 0.44 |
## | | 0.8 | 0 | 0.4 | |
## ----------------------------------------------------------------------------
## | Column Total | 15 | 12 | 5 | 32 |
## | | 0.468 | 0.375 | 0.155 | |
## ----------------------------------------------------------------------------
##
##
## vs vs am
## -------------------------------------------------------------
## | | am |
## -------------------------------------------------------------
## | vs | 0 | 1 | Row Total |
## -------------------------------------------------------------
## | 0 | 12 | 6 | 18 |
## | | 0.375 | 0.188 | |
## | | 0.67 | 0.33 | 0.56 |
## | | 0.63 | 0.46 | |
## -------------------------------------------------------------
## | 1 | 7 | 7 | 14 |
## | | 0.219 | 0.219 | |
## | | 0.5 | 0.5 | 0.44 |
## | | 0.37 | 0.54 | |
## -------------------------------------------------------------
## | Column Total | 19 | 13 | 32 |
## | | 0.594 | 0.407 | |
## -------------------------------------------------------------
##
##
## vs vs gear
## ----------------------------------------------------------------------------
## | | gear |
## ----------------------------------------------------------------------------
## | vs | 3 | 4 | 5 | Row Total |
## ----------------------------------------------------------------------------
## | 0 | 12 | 2 | 4 | 18 |
## | | 0.375 | 0.062 | 0.125 | |
## | | 0.67 | 0.11 | 0.22 | 0.56 |
## | | 0.8 | 0.17 | 0.8 | |
## ----------------------------------------------------------------------------
## | 1 | 3 | 10 | 1 | 14 |
## | | 0.094 | 0.312 | 0.031 | |
## | | 0.21 | 0.71 | 0.07 | 0.44 |
## | | 0.2 | 0.83 | 0.2 | |
## ----------------------------------------------------------------------------
## | Column Total | 15 | 12 | 5 | 32 |
## | | 0.469 | 0.374 | 0.156 | |
## ----------------------------------------------------------------------------
##
##
## am vs gear
## ----------------------------------------------------------------------------
## | | gear |
## ----------------------------------------------------------------------------
## | am | 3 | 4 | 5 | Row Total |
## ----------------------------------------------------------------------------
## | 0 | 15 | 4 | 0 | 19 |
## | | 0.469 | 0.125 | 0 | |
## | | 0.79 | 0.21 | 0 | 0.59 |
## | | 1 | 0.33 | 0 | |
## ----------------------------------------------------------------------------
## | 1 | 0 | 8 | 5 | 13 |
## | | 0 | 0.25 | 0.156 | |
## | | 0 | 0.62 | 0.38 | 0.41 |
## | | 0 | 0.67 | 1 | |
## ----------------------------------------------------------------------------
## | Column Total | 15 | 12 | 5 | 32 |
## | | 0.469 | 0.375 | 0.156 | |
## ----------------------------------------------------------------------------
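Enumerating every unique pair of categorical variables, as in the output above, can be sketched in base R with `combn`. The code below is an illustration under that assumption rather than the package's own code; `cat_vars`, `var_pairs` and `two_way` are hypothetical names.

```r
# Illustrative base R sketch: one two-way table per unique pair of factors.
mt <- mtcars
mt[, c(2, 8:10)] <- lapply(mt[, c(2, 8:10)], factor)

cat_vars  <- names(mt)[vapply(mt, is.factor, logical(1))]
var_pairs <- combn(cat_vars, 2, simplify = FALSE)  # all unordered pairs

two_way <- lapply(var_pairs, function(p) {
  table(mt[[p[1]]], mt[[p[2]]], dnn = p)           # frequencies for this pair
})
names(two_way) <- vapply(var_pairs, paste, character(1), collapse = " vs ")

names(two_way)
two_way[["am vs gear"]]
```

With four factor columns this yields six tables, in the same order as the output above.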