Introduction to descriptr

2017-12-11

Introduction

Descriptive statistics are used to summarize data. They enable us to present the data in a more meaningful way and to discern any patterns existing in the data.

This document introduces a basic set of functions that describe data. A second vignette covers the functions that help visualize statistical distributions.

Data Screening

The ds_screener() function will screen a data set and return the following:

- Column/Variable Names
- Data Type
- Levels (in case of categorical data)
- Number of missing observations
- % of missing observations

mt <- mtcars
mt[, c(2, 8:11)] <- lapply(mt[, c(2, 8:11)], factor)
mt[sample(1:nrow(mt), 12), sample(1:ncol(mt), 6)] <- NA
ds_screener(mt)
## -----------------------------------------------------------------------
## |  Column Name  |  Data Type  |  Levels   |  Missing  |  Missing (%)  |
## -----------------------------------------------------------------------
## |      mpg      |   numeric   |    NA     |    12     |     37.5      |
## |      cyl      |   factor    |   4 6 8   |     0     |       0       |
## |     disp      |   numeric   |    NA     |     0     |       0       |
## |      hp       |   numeric   |    NA     |    12     |     37.5      |
## |     drat      |   numeric   |    NA     |     0     |       0       |
## |      wt       |   numeric   |    NA     |    12     |     37.5      |
## |     qsec      |   numeric   |    NA     |    12     |     37.5      |
## |      vs       |   factor    |    0 1    |    12     |     37.5      |
## |      am       |   factor    |    0 1    |     0     |       0       |
## |     gear      |   factor    |   3 4 5   |     0     |       0       |
## |     carb      |   factor    |1 2 3 4 6 8|    12     |     37.5      |
## -----------------------------------------------------------------------
## 
##  Overall Missing Values           72 
##  Percentage of Missing Values     20.45 %
##  Rows with Missing Values         12 
##  Columns With Missing Values      6
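
The figures in this table can be cross-checked with base R. The sketch below is not how ds_screener() is implemented. It is a minimal reconstruction that assumes the same six columns are set to NA in rows 1 to 12 (a fixed choice made here for reproducibility, whereas the example above samples rows and columns at random) and then recomputes the missing-value summary.

```r
# Base R sketch of the quantities a screener reports (not descriptr code).
mt <- mtcars
mt[, c(2, 8:11)] <- lapply(mt[, c(2, 8:11)], factor)

# Assumption: fixed rows and columns instead of sample(), so the result is reproducible.
na_cols <- c("mpg", "hp", "wt", "qsec", "vs", "carb")
mt[1:12, na_cols] <- NA

screen <- data.frame(
  type        = vapply(mt, function(x) class(x)[1], character(1)),
  missing     = colSums(is.na(mt)),
  missing_pct = round(100 * colMeans(is.na(mt)), 2)
)

overall_missing <- sum(is.na(mt))                   # total number of NA cells
overall_pct     <- round(100 * mean(is.na(mt)), 2)  # NA cells as % of all cells
rows_missing    <- sum(!complete.cases(mt))         # rows with at least one NA
cols_missing    <- sum(colSums(is.na(mt)) > 0)      # columns with at least one NA

print(screen)
```

With 12 rows and 6 columns affected, 72 of the 352 cells are missing, which is where the overall percentage comes from.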

Summary Statistics

The ds_summary_stats function returns a comprehensive set of statistics for continuous data.

ds_summary_stats(mtcars$mpg)
##                         Univariate Analysis                          
## 
##  N                       32.00      Variance                36.32 
##  Missing                  0.00      Std Deviation            6.03 
##  Mean                    20.09      Range                   23.50 
##  Median                  19.20      Interquartile Range      7.38 
##  Mode                    10.40      Uncorrected SS       14042.31 
##  Trimmed Mean            19.95      Corrected SS          1126.05 
##  Skewness                 0.67      Coeff Variation         30.00 
##  Kurtosis                -0.02      Std Error Mean           1.07 
## 
##                               Quantiles                               
## 
##               Quantile                            Value                
## 
##              Max                                  33.90                
##              99%                                  33.44                
##              95%                                  31.30                
##              90%                                  30.09                
##              Q3                                   22.80                
##              Median                               19.20                
##              Q1                                   15.43                
##              10%                                  14.34                
##              5%                                   12.00                
##              1%                                   10.40                
##              Min                                  10.40                
## 
##                             Extreme Values                            
## 
##                 Low                                High                
## 
##   Obs                        Value       Obs                        Value 
##   15                         10.4        20                         33.9  
##   16                         10.4        18                         32.4  
##   24                         13.3        19                         30.4  
##    7                         14.3        28                         30.4  
##   17                         14.7        26                         27.3
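
Several of the statistics in this output have direct base R counterparts. The sketch below is an illustration, not the package's implementation. The 5% trim is an assumption chosen because it reproduces the trimmed mean shown above. Skewness and kurtosis are left out because several definitions exist and the one used by the package is not stated here.

```r
# Base R sketch of selected univariate statistics for mtcars$mpg.
x <- mtcars$mpg
n <- length(x)

stats <- c(
  mean           = mean(x),
  trimmed_mean   = mean(x, trim = 0.05),   # assumption: 5% trimmed from each end
  variance       = var(x),
  std_deviation  = sd(x),
  range          = diff(range(x)),
  iqr            = IQR(x),
  uncorrected_ss = sum(x^2),               # raw sum of squares
  corrected_ss   = sum((x - mean(x))^2),   # squared deviations from the mean
  coeff_var      = 100 * sd(x) / mean(x),  # standard deviation as % of the mean
  std_error_mean = sd(x) / sqrt(n)
)
print(round(stats, 2))

# Percentiles using R's default quantile definition (type 7)
pct <- quantile(x, probs = c(0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99))
print(round(pct, 2))
```

The corrected sum of squares equals the variance multiplied by n - 1, which is a quick way to check the two values against each other.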

Cross Tabulation

The ds_cross_table() function creates two way tables of categorical variables. The variables do not need to be coerced to type factor first.

ds_cross_table(mtcars$cyl, mtcars$gear)
##     Cell Contents
##  |---------------|
##  |     Frequency |
##  |       Percent |
##  |       Row Pct |
##  |       Col Pct |
##  |---------------|
## 
##  Total Observations:  32 
## 
## ----------------------------------------------------------------------------
## |              |                           gear                            |
## ----------------------------------------------------------------------------
## |          cyl |            3 |            4 |            5 |    Row Total |
## ----------------------------------------------------------------------------
## |            4 |            1 |            8 |            2 |           11 |
## |              |        0.031 |         0.25 |        0.062 |              |
## |              |         0.09 |         0.73 |         0.18 |         0.34 |
## |              |         0.07 |         0.67 |          0.4 |              |
## ----------------------------------------------------------------------------
## |            6 |            2 |            4 |            1 |            7 |
## |              |        0.062 |        0.125 |        0.031 |              |
## |              |         0.29 |         0.57 |         0.14 |         0.22 |
## |              |         0.13 |         0.33 |          0.2 |              |
## ----------------------------------------------------------------------------
## |            8 |           12 |            0 |            2 |           14 |
## |              |        0.375 |            0 |        0.062 |              |
## |              |         0.86 |            0 |         0.14 |         0.44 |
## |              |          0.8 |            0 |          0.4 |              |
## ----------------------------------------------------------------------------
## | Column Total |           15 |           12 |            5 |           32 |
## |              |        0.468 |        0.375 |        0.155 |              |
## ----------------------------------------------------------------------------
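
The four numbers in each cell (frequency, percent, row percent, column percent) can be reproduced with table() and prop.table() from base R. This is a sketch for checking the output above, not the package's own code. Like ds_cross_table, table() accepts the numeric columns directly.

```r
# Base R sketch of the cell contents in a two way table.
tab <- table(cyl = mtcars$cyl, gear = mtcars$gear)

freq    <- addmargins(tab)               # counts with row and column totals
pct     <- round(prop.table(tab), 3)     # share of all observations
row_pct <- round(prop.table(tab, 1), 2)  # share within each cyl row
col_pct <- round(prop.table(tab, 2), 2)  # share within each gear column

print(freq)
print(row_pct)
print(col_pct)
```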

A plot method has been defined which will generate:

Grouped Bar Plots

k <- ds_cross_table(mtcars$cyl, mtcars$gear)
plot(k, beside = TRUE)

Stacked Bar Plots

k <- ds_cross_table(mtcars$cyl, mtcars$gear)
plot(k)

Proportional Bar Plots

k <- ds_cross_table(mtcars$cyl, mtcars$gear)
plot(k, proportional = TRUE)

Mosaic Plots

Mosaic plots can be created using the mosaicplot method.

k <- ds_cross_table(mtcars$cyl, mtcars$gear)
mosaicplot(k)
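
For comparison, the same four kinds of plot can be drawn from a plain contingency table with base graphics. This sketch does not use the package's plot methods, and the layout is an assumption: here each gear level forms a bar (or group of bars) and the cyl levels are the segments, which may differ from the orientation the package chooses.

```r
# Base graphics sketch: bar and mosaic plots from a contingency table.
tab <- table(cyl = mtcars$cyl, gear = mtcars$gear)

# Grouped bars: one bar per cyl level within each gear group
mids <- barplot(tab, beside = TRUE, legend.text = TRUE, xlab = "gear")

# Stacked bars: cyl counts stacked within each gear bar
barplot(tab, legend.text = TRUE, xlab = "gear")

# Proportional bars: each gear bar rescaled so its segments sum to 1
col_share <- prop.table(tab, margin = 2)
barplot(col_share, legend.text = TRUE, xlab = "gear")

# Mosaic plot: tile areas proportional to the cell counts
mosaicplot(tab, main = "cyl by gear")
```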

Frequency Table (Categorical Data)

The ds_freq_table() function creates frequency tables for categorical variables.

mt <- mtcars
mt$cyl <- as.factor(mt$cyl)
ds_freq_table(mt$cyl)
##                                Variable: cyl                                 
## |--------------------------------------------------------------------------|
## |                                Cumulative                    Cumulative  |
## |    Levels    |  Frequency   |   Frequency  |   Percent    |    Percent   |
## |--------------------------------------------------------------------------|
## |       4      |      11      |      11      |     34.38    |     34.38    |
## |--------------------------------------------------------------------------|
## |       6      |       7      |      18      |     21.88    |     56.25    |
## |--------------------------------------------------------------------------|
## |       8      |      14      |      32      |     43.75    |      100     |
## |--------------------------------------------------------------------------|
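
The columns of this table follow from the level counts alone. The sketch below rebuilds them in base R as a cross-check. It is an illustration and not the package's implementation, and the column names are chosen here for readability.

```r
# Base R sketch of a frequency table for a categorical variable.
counts <- table(factor(mtcars$cyl))
n <- as.vector(counts)

freq <- data.frame(
  level         = names(counts),
  frequency     = n,
  cum_frequency = cumsum(n),
  percent       = round(100 * n / sum(n), 2),
  cum_percent   = round(100 * cumsum(n) / sum(n), 2)
)
print(freq)
```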

Bar Plot

A barplot method has been defined.

mt <- mtcars
mt$cyl <- as.factor(mt$cyl)
k <- ds_freq_table(mt$cyl)
barplot(k)

Frequency Table (Continuous Data)

The ds_freq_cont() function creates frequency tables for continuous variables. The default number of intervals is 5; the example below requests 4.

ds_freq_cont(mtcars$mpg, 4)
##                                 Variable: mpg                                 
## |---------------------------------------------------------------------------|
## |      Bins       | Frequency | Cum Frequency |   Percent    | Cum Percent  |
## |---------------------------------------------------------------------------|
## |  10.4  -  16.3  |    10     |      10       |    31.25     |    31.25     |
## |---------------------------------------------------------------------------|
## |  16.3  -  22.1  |    13     |      23       |    40.62     |    71.88     |
## |---------------------------------------------------------------------------|
## |  22.1  -   28   |     5     |      28       |    15.62     |     87.5     |
## |---------------------------------------------------------------------------|
## |   28   -  33.9  |     4     |      32       |     12.5     |     100      |
## |---------------------------------------------------------------------------|
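
The binning can be sketched in base R with cut(). The code assumes equal-width intervals spanning the minimum to the maximum, which agrees with the bin edges printed above once they are rounded. How the package treats values that fall exactly on an interior edge is not stated here; no observation in mtcars$mpg does, so the counts are unaffected.

```r
# Base R sketch of a frequency table for a continuous variable.
x <- mtcars$mpg
n_bins <- 4

# Assumption: equal-width bins from min(x) to max(x)
breaks <- seq(min(x), max(x), length.out = n_bins + 1)
bins   <- cut(x, breaks = breaks, include.lowest = TRUE)
n      <- as.vector(table(bins))

freq <- data.frame(
  lower         = head(breaks, -1),
  upper         = tail(breaks, -1),
  frequency     = n,
  cum_frequency = cumsum(n),
  percent       = round(100 * n / sum(n), 2),
  cum_percent   = round(100 * cumsum(n) / sum(n), 2)
)
print(freq)
```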

Histogram

A hist method has been defined.

k <- ds_freq_cont(mtcars$mpg, 4)
hist(k)

Group Summary

The ds_group_summary() function returns descriptive statistics of a continuous variable for the different levels of a categorical variable.

mt <- mtcars
mt$cyl <- as.factor(mt$cyl)
ds_group_summary(mt$cyl, mt$mpg)
##                                        mpg by cyl                                         
## -----------------------------------------------------------------------------------------
## |     Statistic/Levels|                    4|                    6|                    8|
## -----------------------------------------------------------------------------------------
## |                  Obs|                   11|                    7|                   14|
## |              Minimum|                 21.4|                 17.8|                 10.4|
## |              Maximum|                 33.9|                 21.4|                 19.2|
## |                 Mean|                26.66|                19.74|                 15.1|
## |               Median|                   26|                 19.7|                 15.2|
## |                 Mode|                 22.8|                   21|                 10.4|
## |       Std. Deviation|                 4.51|                 1.45|                 2.56|
## |             Variance|                20.34|                 2.11|                 6.55|
## |             Skewness|                 0.35|                -0.26|                -0.46|
## |             Kurtosis|                -1.43|                -1.83|                 0.33|
## |       Uncorrected SS|              8023.83|              2741.14|              3277.34|
## |         Corrected SS|               203.39|                12.68|                 85.2|
## |      Coeff Variation|                16.91|                 7.36|                16.95|
## |      Std. Error Mean|                 1.36|                 0.55|                 0.68|
## |                Range|                 12.5|                  3.6|                  8.8|
## |  Interquartile Range|                  7.6|                 2.35|                 1.85|
## -----------------------------------------------------------------------------------------
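
Most rows of this table can be reproduced by splitting the continuous variable by group and applying base R summary functions. The sketch below is a cross-check under that approach, not the package's code. Mode, skewness and kurtosis are omitted because base R has no built-in function for them.

```r
# Base R sketch of per-group descriptive statistics for mpg by cyl.
by_cyl <- split(mtcars$mpg, mtcars$cyl)

group_stats <- sapply(by_cyl, function(x) {
  c(
    obs      = length(x),
    minimum  = min(x),
    maximum  = max(x),
    mean     = mean(x),
    median   = median(x),
    std_dev  = sd(x),
    variance = var(x),
    std_err  = sd(x) / sqrt(length(x)),
    range    = diff(range(x)),
    iqr      = IQR(x)
  )
})
print(round(group_stats, 2))
```

The result is a matrix with one column per level of cyl, laid out the same way as the table above.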

Box Plot

A boxplot() method has been defined.

mt <- mtcars
mt$cyl <- as.factor(mt$cyl)
k <- ds_group_summary(mt$cyl, mt$mpg)
boxplot(k)
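
A comparable plot can be drawn straight from the data with the formula interface of base R's boxplot(). This is a sketch for comparison and does not go through the package's boxplot method. The returned object also carries the group sizes and five-number summaries behind the plot.

```r
# Base graphics sketch: mpg distribution for each level of cyl.
b <- boxplot(mpg ~ cyl, data = mtcars, xlab = "cyl", ylab = "mpg")

b$n           # observations per group
b$stats[3, ]  # group medians (third row of the five-number summaries)
```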

Multiple Variable Statistics

The ds_multi_stats() function generates summary/descriptive statistics for multiple variables in a data frame/tibble.

ds_multi_stats(mtcars, mpg, disp, hp)
## # A tibble: 3 x 16
##    vars   min   max      mean    t_mean median  mode range   variance
##   <chr> <dbl> <dbl>     <dbl>     <dbl>  <dbl> <dbl> <dbl>      <dbl>
## 1  disp  71.1 472.0 230.72188 228.00000  196.3 275.8 400.9 15360.7998
## 2    hp  52.0 335.0 146.68750 143.56667  123.0 110.0 283.0  4700.8669
## 3   mpg  10.4  33.9  20.09062  19.95333   19.2  10.4  23.5    36.3241
## # ... with 7 more variables: stdev <dbl>, skew <dbl>, kurtosis <dbl>,
## #   coeff_var <dbl>, q1 <dbl>, q3 <dbl>, iqrange <dbl>
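
A similar summary for several columns at once can be sketched in base R by applying one helper to each selected column. The helper name summarise_one and the choice of statistics are made up for this illustration; the result is a plain matrix rather than a tibble, and its rows follow the order of vars.

```r
# Base R sketch of summary statistics for several numeric columns.
vars <- c("mpg", "disp", "hp")

# Hypothetical helper: a handful of statistics for one numeric vector
summarise_one <- function(x) {
  c(
    min      = min(x),
    max      = max(x),
    mean     = mean(x),
    median   = median(x),
    range    = diff(range(x)),
    variance = var(x),
    stdev    = sd(x),
    q1       = unname(quantile(x, 0.25)),
    q3       = unname(quantile(x, 0.75)),
    iqrange  = IQR(x)
  )
}

multi <- t(sapply(mtcars[vars], summarise_one))
print(round(multi, 2))
```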

Multiple One Way Tables

The ds_oway_tables() function creates multiple one way tables by creating a frequency table for each categorical variable in a data frame.

mt <- mtcars
mt[, c(2, 8:11)] <- lapply(mt[, c(2, 8:11)], factor)
ds_oway_tables(mt)
##                                Variable: cyl                                 
## |--------------------------------------------------------------------------|
## |                                Cumulative                    Cumulative  |
## |    Levels    |  Frequency   |   Frequency  |   Percent    |    Percent   |
## |--------------------------------------------------------------------------|
## |       4      |      11      |      11      |     34.38    |     34.38    |
## |--------------------------------------------------------------------------|
## |       6      |       7      |      18      |     21.88    |     56.25    |
## |--------------------------------------------------------------------------|
## |       8      |      14      |      32      |     43.75    |      100     |
## |--------------------------------------------------------------------------|
## 
## 
##                                 Variable: vs                                 
## |--------------------------------------------------------------------------|
## |                                Cumulative                    Cumulative  |
## |    Levels    |  Frequency   |   Frequency  |   Percent    |    Percent   |
## |--------------------------------------------------------------------------|
## |       0      |      18      |      18      |     56.25    |     56.25    |
## |--------------------------------------------------------------------------|
## |       1      |      14      |      32      |     43.75    |      100     |
## |--------------------------------------------------------------------------|
## 
## 
##                                 Variable: am                                 
## |--------------------------------------------------------------------------|
## |                                Cumulative                    Cumulative  |
## |    Levels    |  Frequency   |   Frequency  |   Percent    |    Percent   |
## |--------------------------------------------------------------------------|
## |       0      |      19      |      19      |     59.38    |     59.38    |
## |--------------------------------------------------------------------------|
## |       1      |      13      |      32      |     40.62    |      100     |
## |--------------------------------------------------------------------------|
## 
## 
##                                Variable: gear                                
## |--------------------------------------------------------------------------|
## |                                Cumulative                    Cumulative  |
## |    Levels    |  Frequency   |   Frequency  |   Percent    |    Percent   |
## |--------------------------------------------------------------------------|
## |       3      |      15      |      15      |     46.88    |     46.88    |
## |--------------------------------------------------------------------------|
## |       4      |      12      |      27      |     37.5     |     84.38    |
## |--------------------------------------------------------------------------|
## |       5      |       5      |      32      |     15.62    |      100     |
## |--------------------------------------------------------------------------|
## 
## 
##                                Variable: carb                                
## |--------------------------------------------------------------------------|
## |                                Cumulative                    Cumulative  |
## |    Levels    |  Frequency   |   Frequency  |   Percent    |    Percent   |
## |--------------------------------------------------------------------------|
## |       1      |       7      |       7      |     21.88    |     21.88    |
## |--------------------------------------------------------------------------|
## |       2      |      10      |      17      |     31.25    |     53.12    |
## |--------------------------------------------------------------------------|
## |       3      |       3      |      20      |     9.38     |     62.5     |
## |--------------------------------------------------------------------------|
## |       4      |      10      |      30      |     31.25    |     93.75    |
## |--------------------------------------------------------------------------|
## |       6      |       1      |      31      |     3.12     |     96.88    |
## |--------------------------------------------------------------------------|
## |       8      |       1      |      32      |     3.12     |      100     |
## |--------------------------------------------------------------------------|
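
The idea of one table per categorical column can be sketched in base R by keeping only the factor columns and tabulating each one. This illustrates the selection step only and is not the package's implementation; it returns raw counts without the cumulative and percent columns.

```r
# Base R sketch: one frequency table per factor column.
mt <- mtcars
mt[, c(2, 8:11)] <- lapply(mt[, c(2, 8:11)], factor)

factor_cols <- Filter(is.factor, mt)    # keep only the factor columns
one_way     <- lapply(factor_cols, table)

names(one_way)
one_way$carb
```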

Multiple Two Way Tables

The ds_tway_tables() function creates multiple two way tables by creating a cross table for each unique pair of categorical variables in a data frame.

mt <- mtcars
mt[, c(2, 8:10)] <- lapply(mt[, c(2, 8:10)], factor)
ds_tway_tables(mt)
##     Cell Contents
##  |---------------|
##  |     Frequency |
##  |       Percent |
##  |       Row Pct |
##  |       Col Pct |
##  |---------------|
## 
##  Total Observations:  32 
## 
##                          cyl vs vs                           
## -------------------------------------------------------------
## |              |                     vs                     |
## -------------------------------------------------------------
## |          cyl |            0 |            1 |    Row Total |
## -------------------------------------------------------------
## |            4 |            1 |           10 |           11 |
## |              |        0.031 |        0.312 |              |
## |              |         0.09 |         0.91 |         0.34 |
## |              |         0.06 |         0.71 |              |
## -------------------------------------------------------------
## |            6 |            3 |            4 |            7 |
## |              |        0.094 |        0.125 |              |
## |              |         0.43 |         0.57 |         0.22 |
## |              |         0.17 |         0.29 |              |
## -------------------------------------------------------------
## |            8 |           14 |            0 |           14 |
## |              |        0.438 |            0 |              |
## |              |            1 |            0 |         0.44 |
## |              |         0.78 |            0 |              |
## -------------------------------------------------------------
## | Column Total |           18 |           14 |           32 |
## |              |        0.563 |        0.437 |              |
## -------------------------------------------------------------
## 
## 
##                          cyl vs am                           
## -------------------------------------------------------------
## |              |                     am                     |
## -------------------------------------------------------------
## |          cyl |            0 |            1 |    Row Total |
## -------------------------------------------------------------
## |            4 |            3 |            8 |           11 |
## |              |        0.094 |         0.25 |              |
## |              |         0.27 |         0.73 |         0.34 |
## |              |         0.16 |         0.62 |              |
## -------------------------------------------------------------
## |            6 |            4 |            3 |            7 |
## |              |        0.125 |        0.094 |              |
## |              |         0.57 |         0.43 |         0.22 |
## |              |         0.21 |         0.23 |              |
## -------------------------------------------------------------
## |            8 |           12 |            2 |           14 |
## |              |        0.375 |        0.062 |              |
## |              |         0.86 |         0.14 |         0.44 |
## |              |         0.63 |         0.15 |              |
## -------------------------------------------------------------
## | Column Total |           19 |           13 |           32 |
## |              |        0.594 |        0.406 |              |
## -------------------------------------------------------------
## 
## 
##                                 cyl vs gear                                 
## ----------------------------------------------------------------------------
## |              |                           gear                            |
## ----------------------------------------------------------------------------
## |          cyl |            3 |            4 |            5 |    Row Total |
## ----------------------------------------------------------------------------
## |            4 |            1 |            8 |            2 |           11 |
## |              |        0.031 |         0.25 |        0.062 |              |
## |              |         0.09 |         0.73 |         0.18 |         0.34 |
## |              |         0.07 |         0.67 |          0.4 |              |
## ----------------------------------------------------------------------------
## |            6 |            2 |            4 |            1 |            7 |
## |              |        0.062 |        0.125 |        0.031 |              |
## |              |         0.29 |         0.57 |         0.14 |         0.22 |
## |              |         0.13 |         0.33 |          0.2 |              |
## ----------------------------------------------------------------------------
## |            8 |           12 |            0 |            2 |           14 |
## |              |        0.375 |            0 |        0.062 |              |
## |              |         0.86 |            0 |         0.14 |         0.44 |
## |              |          0.8 |            0 |          0.4 |              |
## ----------------------------------------------------------------------------
## | Column Total |           15 |           12 |            5 |           32 |
## |              |        0.468 |        0.375 |        0.155 |              |
## ----------------------------------------------------------------------------
## 
## 
##                           vs vs am                           
## -------------------------------------------------------------
## |              |                     am                     |
## -------------------------------------------------------------
## |           vs |            0 |            1 |    Row Total |
## -------------------------------------------------------------
## |            0 |           12 |            6 |           18 |
## |              |        0.375 |        0.188 |              |
## |              |         0.67 |         0.33 |         0.56 |
## |              |         0.63 |         0.46 |              |
## -------------------------------------------------------------
## |            1 |            7 |            7 |           14 |
## |              |        0.219 |        0.219 |              |
## |              |          0.5 |          0.5 |         0.44 |
## |              |         0.37 |         0.54 |              |
## -------------------------------------------------------------
## | Column Total |           19 |           13 |           32 |
## |              |        0.594 |        0.407 |              |
## -------------------------------------------------------------
## 
## 
##                                 vs vs gear                                  
## ----------------------------------------------------------------------------
## |              |                           gear                            |
## ----------------------------------------------------------------------------
## |           vs |            3 |            4 |            5 |    Row Total |
## ----------------------------------------------------------------------------
## |            0 |           12 |            2 |            4 |           18 |
## |              |        0.375 |        0.062 |        0.125 |              |
## |              |         0.67 |         0.11 |         0.22 |         0.56 |
## |              |          0.8 |         0.17 |          0.8 |              |
## ----------------------------------------------------------------------------
## |            1 |            3 |           10 |            1 |           14 |
## |              |        0.094 |        0.312 |        0.031 |              |
## |              |         0.21 |         0.71 |         0.07 |         0.44 |
## |              |          0.2 |         0.83 |          0.2 |              |
## ----------------------------------------------------------------------------
## | Column Total |           15 |           12 |            5 |           32 |
## |              |        0.469 |        0.374 |        0.156 |              |
## ----------------------------------------------------------------------------
## 
## 
##                                 am vs gear                                  
## ----------------------------------------------------------------------------
## |              |                           gear                            |
## ----------------------------------------------------------------------------
## |           am |            3 |            4 |            5 |    Row Total |
## ----------------------------------------------------------------------------
## |            0 |           15 |            4 |            0 |           19 |
## |              |        0.469 |        0.125 |            0 |              |
## |              |         0.79 |         0.21 |            0 |         0.59 |
## |              |            1 |         0.33 |            0 |              |
## ----------------------------------------------------------------------------
## |            1 |            0 |            8 |            5 |           13 |
## |              |            0 |         0.25 |        0.156 |              |
## |              |            0 |         0.62 |         0.38 |         0.41 |
## |              |            0 |         0.67 |            1 |              |
## ----------------------------------------------------------------------------
## | Column Total |           15 |           12 |            5 |           32 |
## |              |        0.469 |        0.375 |        0.156 |              |
## ----------------------------------------------------------------------------
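
The set of tables above corresponds to every unordered pair of factor columns: four factors give six pairs. The sketch below enumerates those pairs with combn() and builds a plain count table for each. It is an illustration of the pairing logic, not the package's code, and the list names are constructed here to mirror the table titles.

```r
# Base R sketch: one two way table per unique pair of factor columns.
mt <- mtcars
mt[, c(2, 8:10)] <- lapply(mt[, c(2, 8:10)], factor)

factor_names <- names(Filter(is.factor, mt))
pairs <- combn(factor_names, 2, simplify = FALSE)  # all unordered pairs

two_way <- lapply(pairs, function(p) {
  table(mt[[p[1]]], mt[[p[2]]], dnn = p)           # rows: first variable, columns: second
})
names(two_way) <- vapply(pairs, paste, character(1), collapse = " vs ")

names(two_way)
two_way[["cyl vs vs"]]
```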