Introduction to descriptr

2017-08-31

Introduction

Descriptive statistics are used to summarize data. They enable us to present the data in a more meaningful way and to discern any patterns existing in the data. They can be useful for two purposes:

This document introduces a basic set of functions that describe data. A second vignette covers the functions that help visualize statistical distributions.

Data Screening

The screener function will screen a data set and return the following:

- Column/Variable Names
- Data Type
- Levels (in case of categorical data)
- Number of missing observations
- % of missing observations

mt <- mtcars
mt[, c(2, 8:11)] <- lapply(mt[, c(2, 8:11)], factor)
mt[sample(1:nrow(mt), 12), sample(1:ncol(mt), 6)] <- NA
screener(mt)
## -----------------------------------------------------------------------
## |  Column Name  |  Data Type  |  Levels   |  Missing  |  Missing (%)  |
## -----------------------------------------------------------------------
## |      mpg      |   numeric   |    NA     |    12     |     37.5      |
## |      cyl      |   factor    |   4 6 8   |    12     |     37.5      |
## |     disp      |   numeric   |    NA     |     0     |       0       |
## |      hp       |   numeric   |    NA     |    12     |     37.5      |
## |     drat      |   numeric   |    NA     |    12     |     37.5      |
## |      wt       |   numeric   |    NA     |     0     |       0       |
## |     qsec      |   numeric   |    NA     |     0     |       0       |
## |      vs       |   factor    |    0 1    |     0     |       0       |
## |      am       |   factor    |    0 1    |    12     |     37.5      |
## |     gear      |   factor    |   3 4 5   |    12     |     37.5      |
## |     carb      |   factor    |1 2 3 4 6 8|     0     |       0       |
## -----------------------------------------------------------------------
## 
##  Overall Missing Values           72 
##  Percentage of Missing Values     20.45 %
##  Rows with Missing Values         12 
##  Columns With Missing Values      6
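
The same per-column facts can be gathered with base R alone, which may help when checking a screening report by hand. The sketch below is only an illustration and makes no claim about how screener is implemented; the helper name screen_sketch is hypothetical, and a fixed seed is an added assumption so that the missing-value pattern is repeatable.

```r
# Hypothetical helper: a base R approximation of a screening report.
screen_sketch <- function(data) {
  data.frame(
    column      = names(data),
    type        = vapply(data, function(x) class(x)[1], character(1)),
    missing     = vapply(data, function(x) sum(is.na(x)), integer(1)),
    missing_pct = round(vapply(data, function(x) mean(is.na(x)), numeric(1)) * 100, 2),
    row.names   = NULL,
    stringsAsFactors = FALSE
  )
}

set.seed(1)  # assumption: fixed seed so the same cells are blanked each run
mt <- mtcars
mt[, c(2, 8:11)] <- lapply(mt[, c(2, 8:11)], factor)
mt[sample(nrow(mt), 12), sample(ncol(mt), 6)] <- NA

report <- screen_sketch(mt)
print(report)

# Overall figures comparable to the footer of the report
total_missing   <- sum(is.na(mt))              # all missing cells
rows_missing    <- sum(!complete.cases(mt))    # rows with at least one NA
columns_missing <- sum(colSums(is.na(mt)) > 0) # columns with at least one NA
```

Because 12 rows and 6 columns are sampled, the totals do not depend on the seed; only which columns are affected changes.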

Summary Statistics

The summary_stats function returns a comprehensive set of statistics for continuous data.

summary_stats(mtcars$mpg)
##                         Univariate Analysis                          
## 
##  N                       32.00      Variance                36.32 
##  Missing                  0.00      Std Deviation            6.03 
##  Mean                    20.09      Range                   23.50 
##  Median                  19.20      Interquartile Range      7.38 
##  Mode                    10.40      Uncorrected SS       14042.31 
##  Trimmed Mean            19.95      Corrected SS          1126.05 
##  Skewness                 0.67      Coeff Variation         30.00 
##  Kurtosis                -0.02      Std Error Mean           1.07 
## 
##                               Quantiles                               
## 
##               Quantile                            Value                
## 
##              Max                                  33.90                
##              99%                                  33.44                
##              95%                                  31.30                
##              90%                                  30.09                
##              Q3                                   22.80                
##              Median                               19.20                
##              Q1                                   15.43                
##              10%                                  14.34                
##              5%                                   12.00                
##              1%                                   10.40                
##              Min                                  10.40                
## 
##                             Extreme Values                            
## 
##                 Low                                High                
## 
##   Obs                        Value       Obs                        Value 
##   15                         10.4        20                         33.9  
##   16                         10.4        18                         32.4  
##   24                         13.3        19                         30.4  
##    7                         14.3        28                         30.4  
##   17                         14.7        26                         27.3
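
Several of the less familiar entries in the univariate table can be reproduced with a few lines of base R. This is a sketch for cross-checking the numbers, not the package's own code; the variable names are illustrative, and the 5% trimming fraction is an assumption chosen because it is consistent with the reported trimmed mean.

```r
x <- mtcars$mpg
n <- length(x)

uncorrected_ss <- sum(x^2)               # sum of squared raw values
corrected_ss   <- sum((x - mean(x))^2)   # sum of squared deviations from the mean
coeff_var      <- sd(x) / mean(x) * 100  # standard deviation as a percentage of the mean
std_error      <- sd(x) / sqrt(n)        # standard error of the mean
trimmed_mean   <- mean(x, trim = 0.05)   # assumption: 5% trimmed from each tail

round(c(uncorrected_ss = uncorrected_ss,
        corrected_ss   = corrected_ss,
        coeff_var      = coeff_var,
        std_error      = std_error,
        trimmed_mean   = trimmed_mean), 2)
```

The corrected sum of squares is simply the sample variance multiplied by n - 1, which gives a quick consistency check between the two columns of the table.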

Cross Tabulation

The cross_table function creates two way tables of categorical variables. It is not necessary to coerce a variable to type factor.

cross_table(mtcars$cyl, mtcars$gear)
##     Cell Contents
##  |---------------|
##  |     Frequency |
##  |       Percent |
##  |       Row Pct |
##  |       Col Pct |
##  |---------------|
## 
##  Total Observations:  32 
## 
## ----------------------------------------------------------------------------
## |              |                           gear                            |
## ----------------------------------------------------------------------------
## |          cyl |            3 |            4 |            5 |    Row Total |
## ----------------------------------------------------------------------------
## |            4 |            1 |            8 |            2 |           11 |
## |              |        0.031 |         0.25 |        0.062 |              |
## |              |         0.09 |         0.73 |         0.18 |         0.34 |
## |              |         0.07 |         0.67 |          0.4 |              |
## ----------------------------------------------------------------------------
## |            6 |            2 |            4 |            1 |            7 |
## |              |        0.062 |        0.125 |        0.031 |              |
## |              |         0.29 |         0.57 |         0.14 |         0.22 |
## |              |         0.13 |         0.33 |          0.2 |              |
## ----------------------------------------------------------------------------
## |            8 |           12 |            0 |            2 |           14 |
## |              |        0.375 |            0 |        0.062 |              |
## |              |         0.86 |            0 |         0.14 |         0.44 |
## |              |          0.8 |            0 |          0.4 |              |
## ----------------------------------------------------------------------------
## | Column Total |           15 |           12 |            5 |           32 |
## |              |        0.468 |        0.375 |        0.155 |              |
## ----------------------------------------------------------------------------
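
Each cell of the table holds four numbers: the frequency, the share of all observations, the row proportion, and the column proportion. A minimal base R sketch of those four quantities, using table and prop.table rather than cross_table itself, might look like this:

```r
tab <- table(cyl = mtcars$cyl, gear = mtcars$gear)

addmargins(tab)                   # frequencies with row and column totals

overall <- prop.table(tab)              # each cell as a share of all 32 cars
row_pct <- prop.table(tab, margin = 1)  # each row sums to 1
col_pct <- prop.table(tab, margin = 2)  # each column sums to 1

round(overall, 3)
round(row_pct, 2)
round(col_pct, 2)
```

Reading the cell for 4 cylinders and 4 gears across the three matrices gives the same stack of values shown in the corresponding cell above.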

A plot method has been defined which will generate:

Grouped Bar Plots

k <- cross_table(mtcars$cyl, mtcars$gear)
plot(k, beside = TRUE)

Stacked Bar Plots

k <- cross_table(mtcars$cyl, mtcars$gear)
plot(k)

Proportional Bar Plots

k <- cross_table(mtcars$cyl, mtcars$gear)
plot(k, proportional = TRUE)

Mosaic Plots

Mosaic plots can be created using the mosaicplot method.

k <- cross_table(mtcars$cyl, mtcars$gear)
mosaicplot(k)
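
For comparison, roughly the same four charts can be drawn from a plain contingency table with base graphics. This is a sketch under assumptions: it does not use the package's plot methods, and scaling the proportional chart within each gear column is a guess at the intended layout rather than documented behaviour.

```r
tab <- table(cyl = mtcars$cyl, gear = mtcars$gear)

# Grouped: one bar per cyl level, side by side within each gear
barplot(tab, beside = TRUE, legend.text = TRUE, xlab = "gear")

# Stacked: cyl counts piled on top of each other within each gear
barplot(tab, legend.text = TRUE, xlab = "gear")

# Proportional: rescale each gear column to sum to 1 before stacking
col_share <- prop.table(tab, margin = 2)
barplot(col_share, legend.text = TRUE, xlab = "gear")

# Mosaic: tile areas proportional to cell counts
mosaicplot(tab, main = "cyl by gear")
```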

Frequency Table (Categorical Data)

The freq_table function creates frequency tables for categorical variables.

mt <- mtcars
mt$cyl <- as.factor(mt$cyl)
freq_table(mt$cyl)
##                                Variable: cyl                                 
## |--------------------------------------------------------------------------|
## |                                Cumulative                    Cumulative  |
## |    Levels    |  Frequency   |   Frequency  |   Percent    |    Percent   |
## |--------------------------------------------------------------------------|
## |       4      |      11      |      11      |     34.38    |     34.38    |
## |--------------------------------------------------------------------------|
## |       6      |       7      |      18      |     21.88    |     56.25    |
## |--------------------------------------------------------------------------|
## |       8      |      14      |      32      |     43.75    |      100     |
## |--------------------------------------------------------------------------|
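
The four columns of the frequency table follow directly from the level counts. As a rough base R sketch (the data frame and column names here are illustrative, not part of the package):

```r
counts <- table(factor(mtcars$cyl))
n      <- sum(counts)

freq <- data.frame(
  level       = names(counts),
  frequency   = as.vector(counts),
  cum_freq    = cumsum(as.vector(counts)),           # running total of counts
  percent     = round(as.vector(counts) / n * 100, 2),
  cum_percent = round(cumsum(as.vector(counts)) / n * 100, 2)
)
freq
```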

Bar Plot

A barplot method has been defined.

mt <- mtcars
mt$cyl <- as.factor(mt$cyl)
k <- freq_table(mt$cyl)
barplot(k)

Frequency Table (Continuous Data)

The freq_cont function creates frequency tables for continuous variables. The default number of intervals is 5.

freq_cont(mtcars$mpg, 4)
##                                 Variable: mpg                                 
## |---------------------------------------------------------------------------|
## |                                 Cumulative                    Cumulative  |
## |     Bins      |  Frequency   |   Frequency  |   Percent    |    Percent   |
## |---------------------------------------------------------------------------|
## | 10.4  - 16.3  |      10      |      10      |        31.25 |        31.25 |
## |---------------------------------------------------------------------------|
## | 16.3  - 22.1  |      13      |      23      |        40.62 |        71.88 |
## |---------------------------------------------------------------------------|
## | 22.1  -  28   |      5       |      28      |        15.62 |         87.5 |
## |---------------------------------------------------------------------------|
## |  28   - 33.9  |      4       |      32      |         12.5 |          100 |
## |---------------------------------------------------------------------------|
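
The bins shown above are consistent with splitting the range of the variable into equal-width intervals. A sketch of that idea in base R, assuming equal widths and intervals closed on the right (both are inferences from the output, not a statement about the package internals):

```r
x    <- mtcars$mpg
bins <- 4

# bins + 1 equally spaced break points from the minimum to the maximum
breaks <- seq(min(x), max(x), length.out = bins + 1)

# include.lowest = TRUE keeps the minimum value in the first interval
groups <- cut(x, breaks = breaks, include.lowest = TRUE)
counts <- table(groups)

data.frame(
  bin         = names(counts),
  frequency   = as.vector(counts),
  cum_freq    = cumsum(as.vector(counts)),
  percent     = round(as.vector(counts) / length(x) * 100, 2),
  cum_percent = round(cumsum(as.vector(counts)) / length(x) * 100, 2)
)
```

With four bins the interval width is (33.9 - 10.4) / 4 = 5.875, which is why the printed boundaries fall at roughly 16.3, 22.1 and 28.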

Histogram

A hist method has been defined.

k <- freq_cont(mtcars$mpg, 4)
hist(k)

Group Summary

The group_summary function returns descriptive statistics of a continuous variable for the different levels of a categorical variable.

mt <- mtcars
mt$cyl <- as.factor(mt$cyl)
group_summary(mt$cyl, mt$mpg)
##                                        mpg by cyl                                         
## -----------------------------------------------------------------------------------------
## |     Statistic/Levels|                    4|                    6|                    8|
## -----------------------------------------------------------------------------------------
## |                  Obs|                   11|                    7|                   14|
## |              Minimum|                 21.4|                 17.8|                 10.4|
## |              Maximum|                 33.9|                 21.4|                 19.2|
## |                 Mean|                26.66|                19.74|                 15.1|
## |               Median|                   26|                 19.7|                 15.2|
## |                 Mode|                 22.8|                   21|                 10.4|
## |       Std. Deviation|                 4.51|                 1.45|                 2.56|
## |             Variance|                20.34|                 2.11|                 6.55|
## |             Skewness|                 0.35|                -0.26|                -0.46|
## |             Kurtosis|                -1.43|                -1.83|                 0.33|
## |       Uncorrected SS|              8023.83|              2741.14|              3277.34|
## |         Corrected SS|               203.39|                12.68|                 85.2|
## |      Coeff Variation|                16.91|                 7.36|                16.95|
## |      Std. Error Mean|                 1.36|                 0.55|                 0.68|
## |                Range|                 12.5|                  3.6|                  8.8|
## |  Interquartile Range|                  7.6|                 2.35|                 1.85|
## -----------------------------------------------------------------------------------------
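
A subset of this table can be rebuilt by splitting the continuous variable on the grouping variable and applying the same statistics to each piece. The helper name describe_group is hypothetical, and only statistics available in base R are included in this sketch:

```r
# Hypothetical helper: a handful of statistics for one group of values
describe_group <- function(v) {
  c(obs    = length(v),
    min    = min(v),
    max    = max(v),
    mean   = mean(v),
    median = median(v),
    sd     = sd(v),
    var    = var(v),
    range  = diff(range(v)),
    iqr    = IQR(v))
}

# One column per cyl level, one row per statistic
by_cyl <- sapply(split(mtcars$mpg, mtcars$cyl), describe_group)
round(by_cyl, 2)
```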

Box Plot

A boxplot method has been defined.

mt <- mtcars
mt$cyl <- as.factor(mt$cyl)
k <- group_summary(mt$cyl, mt$mpg)
boxplot(k)

Multiple One Way Tables

The oway_tables function creates multiple one way tables by creating a frequency table for each categorical variable in a data frame.

mt <- mtcars
mt[, c(2, 8:11)] <- lapply(mt[, c(2, 8:11)], factor)
oway_tables(mt)
##                                Variable: cyl                                 
## |--------------------------------------------------------------------------|
## |                                Cumulative                    Cumulative  |
## |    Levels    |  Frequency   |   Frequency  |   Percent    |    Percent   |
## |--------------------------------------------------------------------------|
## |       4      |      11      |      11      |     34.38    |     34.38    |
## |--------------------------------------------------------------------------|
## |       6      |       7      |      18      |     21.88    |     56.25    |
## |--------------------------------------------------------------------------|
## |       8      |      14      |      32      |     43.75    |      100     |
## |--------------------------------------------------------------------------|
## 
## 
##                                 Variable: vs                                 
## |--------------------------------------------------------------------------|
## |                                Cumulative                    Cumulative  |
## |    Levels    |  Frequency   |   Frequency  |   Percent    |    Percent   |
## |--------------------------------------------------------------------------|
## |       0      |      18      |      18      |     56.25    |     56.25    |
## |--------------------------------------------------------------------------|
## |       1      |      14      |      32      |     43.75    |      100     |
## |--------------------------------------------------------------------------|
## 
## 
##                                 Variable: am                                 
## |--------------------------------------------------------------------------|
## |                                Cumulative                    Cumulative  |
## |    Levels    |  Frequency   |   Frequency  |   Percent    |    Percent   |
## |--------------------------------------------------------------------------|
## |       0      |      19      |      19      |     59.38    |     59.38    |
## |--------------------------------------------------------------------------|
## |       1      |      13      |      32      |     40.62    |      100     |
## |--------------------------------------------------------------------------|
## 
## 
##                                Variable: gear                                
## |--------------------------------------------------------------------------|
## |                                Cumulative                    Cumulative  |
## |    Levels    |  Frequency   |   Frequency  |   Percent    |    Percent   |
## |--------------------------------------------------------------------------|
## |       3      |      15      |      15      |     46.88    |     46.88    |
## |--------------------------------------------------------------------------|
## |       4      |      12      |      27      |     37.5     |     84.38    |
## |--------------------------------------------------------------------------|
## |       5      |       5      |      32      |     15.62    |      100     |
## |--------------------------------------------------------------------------|
## 
## 
##                                Variable: carb                                
## |--------------------------------------------------------------------------|
## |                                Cumulative                    Cumulative  |
## |    Levels    |  Frequency   |   Frequency  |   Percent    |    Percent   |
## |--------------------------------------------------------------------------|
## |       1      |       7      |       7      |     21.88    |     21.88    |
## |--------------------------------------------------------------------------|
## |       2      |      10      |      17      |     31.25    |     53.12    |
## |--------------------------------------------------------------------------|
## |       3      |       3      |      20      |     9.38     |     62.5     |
## |--------------------------------------------------------------------------|
## |       4      |      10      |      30      |     31.25    |     93.75    |
## |--------------------------------------------------------------------------|
## |       6      |       1      |      31      |     3.12     |     96.88    |
## |--------------------------------------------------------------------------|
## |       8      |       1      |      32      |     3.12     |      100     |
## |--------------------------------------------------------------------------|
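
The idea of one table per categorical column can be sketched in base R by selecting the factor columns and tabulating each in turn. This mirrors the behaviour described above but is not the package's implementation; the object names are illustrative.

```r
mt <- mtcars
mt[, c(2, 8:11)] <- lapply(mt[, c(2, 8:11)], factor)

# Keep only the factor columns, then count the levels of each one
is_fct  <- vapply(mt, is.factor, logical(1))
one_way <- lapply(mt[is_fct], table)

names(one_way)   # variables that received a table
one_way$carb     # counts for a single variable
```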

Multiple Two Way Tables

The tway_tables function creates multiple two way tables by creating a cross table for each unique pair of categorical variables in a data frame.

mt <- mtcars
mt[, c(2, 8:10)] <- lapply(mt[, c(2, 8:10)], factor)
tway_tables(mt)
##     Cell Contents
##  |---------------|
##  |     Frequency |
##  |       Percent |
##  |       Row Pct |
##  |       Col Pct |
##  |---------------|
## 
##  Total Observations:  32 
## 
##                          cyl vs vs                           
## -------------------------------------------------------------
## |              |                     vs                     |
## -------------------------------------------------------------
## |          cyl |            0 |            1 |    Row Total |
## -------------------------------------------------------------
## |            4 |            1 |           10 |           11 |
## |              |        0.031 |        0.312 |              |
## |              |         0.09 |         0.91 |         0.34 |
## |              |         0.06 |         0.71 |              |
## -------------------------------------------------------------
## |            6 |            3 |            4 |            7 |
## |              |        0.094 |        0.125 |              |
## |              |         0.43 |         0.57 |         0.22 |
## |              |         0.17 |         0.29 |              |
## -------------------------------------------------------------
## |            8 |           14 |            0 |           14 |
## |              |        0.438 |            0 |              |
## |              |            1 |            0 |         0.44 |
## |              |         0.78 |            0 |              |
## -------------------------------------------------------------
## | Column Total |           18 |           14 |           32 |
## |              |        0.563 |        0.437 |              |
## -------------------------------------------------------------
## 
## 
##                          cyl vs am                           
## -------------------------------------------------------------
## |              |                     am                     |
## -------------------------------------------------------------
## |          cyl |            0 |            1 |    Row Total |
## -------------------------------------------------------------
## |            4 |            3 |            8 |           11 |
## |              |        0.094 |         0.25 |              |
## |              |         0.27 |         0.73 |         0.34 |
## |              |         0.16 |         0.62 |              |
## -------------------------------------------------------------
## |            6 |            4 |            3 |            7 |
## |              |        0.125 |        0.094 |              |
## |              |         0.57 |         0.43 |         0.22 |
## |              |         0.21 |         0.23 |              |
## -------------------------------------------------------------
## |            8 |           12 |            2 |           14 |
## |              |        0.375 |        0.062 |              |
## |              |         0.86 |         0.14 |         0.44 |
## |              |         0.63 |         0.15 |              |
## -------------------------------------------------------------
## | Column Total |           19 |           13 |           32 |
## |              |        0.594 |        0.406 |              |
## -------------------------------------------------------------
## 
## 
##                                 cyl vs gear                                 
## ----------------------------------------------------------------------------
## |              |                           gear                            |
## ----------------------------------------------------------------------------
## |          cyl |            3 |            4 |            5 |    Row Total |
## ----------------------------------------------------------------------------
## |            4 |            1 |            8 |            2 |           11 |
## |              |        0.031 |         0.25 |        0.062 |              |
## |              |         0.09 |         0.73 |         0.18 |         0.34 |
## |              |         0.07 |         0.67 |          0.4 |              |
## ----------------------------------------------------------------------------
## |            6 |            2 |            4 |            1 |            7 |
## |              |        0.062 |        0.125 |        0.031 |              |
## |              |         0.29 |         0.57 |         0.14 |         0.22 |
## |              |         0.13 |         0.33 |          0.2 |              |
## ----------------------------------------------------------------------------
## |            8 |           12 |            0 |            2 |           14 |
## |              |        0.375 |            0 |        0.062 |              |
## |              |         0.86 |            0 |         0.14 |         0.44 |
## |              |          0.8 |            0 |          0.4 |              |
## ----------------------------------------------------------------------------
## | Column Total |           15 |           12 |            5 |           32 |
## |              |        0.468 |        0.375 |        0.155 |              |
## ----------------------------------------------------------------------------
## 
## 
##                           vs vs am                           
## -------------------------------------------------------------
## |              |                     am                     |
## -------------------------------------------------------------
## |           vs |            0 |            1 |    Row Total |
## -------------------------------------------------------------
## |            0 |           12 |            6 |           18 |
## |              |        0.375 |        0.188 |              |
## |              |         0.67 |         0.33 |         0.56 |
## |              |         0.63 |         0.46 |              |
## -------------------------------------------------------------
## |            1 |            7 |            7 |           14 |
## |              |        0.219 |        0.219 |              |
## |              |          0.5 |          0.5 |         0.44 |
## |              |         0.37 |         0.54 |              |
## -------------------------------------------------------------
## | Column Total |           19 |           13 |           32 |
## |              |        0.594 |        0.407 |              |
## -------------------------------------------------------------
## 
## 
##                                 vs vs gear                                  
## ----------------------------------------------------------------------------
## |              |                           gear                            |
## ----------------------------------------------------------------------------
## |           vs |            3 |            4 |            5 |    Row Total |
## ----------------------------------------------------------------------------
## |            0 |           12 |            2 |            4 |           18 |
## |              |        0.375 |        0.062 |        0.125 |              |
## |              |         0.67 |         0.11 |         0.22 |         0.56 |
## |              |          0.8 |         0.17 |          0.8 |              |
## ----------------------------------------------------------------------------
## |            1 |            3 |           10 |            1 |           14 |
## |              |        0.094 |        0.312 |        0.031 |              |
## |              |         0.21 |         0.71 |         0.07 |         0.44 |
## |              |          0.2 |         0.83 |          0.2 |              |
## ----------------------------------------------------------------------------
## | Column Total |           15 |           12 |            5 |           32 |
## |              |        0.469 |        0.374 |        0.156 |              |
## ----------------------------------------------------------------------------
## 
## 
##                                 am vs gear                                  
## ----------------------------------------------------------------------------
## |              |                           gear                            |
## ----------------------------------------------------------------------------
## |           am |            3 |            4 |            5 |    Row Total |
## ----------------------------------------------------------------------------
## |            0 |           15 |            4 |            0 |           19 |
## |              |        0.469 |        0.125 |            0 |              |
## |              |         0.79 |         0.21 |            0 |         0.59 |
## |              |            1 |         0.33 |            0 |              |
## ----------------------------------------------------------------------------
## |            1 |            0 |            8 |            5 |           13 |
## |              |            0 |         0.25 |        0.156 |              |
## |              |            0 |         0.62 |         0.38 |         0.41 |
## |              |            0 |         0.67 |            1 |              |
## ----------------------------------------------------------------------------
## | Column Total |           15 |           12 |            5 |           32 |
## |              |        0.469 |        0.375 |        0.156 |              |
## ----------------------------------------------------------------------------
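
With four factor columns there are choose(4, 2) = 6 unique pairs, which is why six tables are printed. A base R sketch of the pairing step, using combn to enumerate the pairs and table to cross-tabulate each one (object names are illustrative, and this is not the package's own code):

```r
mt <- mtcars
mt[, c(2, 8:10)] <- lapply(mt[, c(2, 8:10)], factor)

# Names of the factor columns, then every unordered pair of them
fct_names <- names(mt)[vapply(mt, is.factor, logical(1))]
var_pairs <- combn(fct_names, 2, simplify = FALSE)

# One contingency table per pair, labelled "row vs column"
two_way <- lapply(var_pairs, function(p) table(mt[[p[1]]], mt[[p[2]]], dnn = p))
names(two_way) <- vapply(var_pairs, paste, character(1), collapse = " vs ")

names(two_way)
two_way[["am vs gear"]]
```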