Categorical Data

Aravind Hebbali

2020-12-09

Introduction

In this document, we will introduce you to functions for exploring and visualizing categorical data.

Data

We have modified the mtcars data to create a new data set, mtcarz. The two data sets differ only in variable types: in mtcarz, the variables cyl, vs, am, gear and carb are stored as factors, while the remaining variables stay numeric.

str(mtcarz)
#> 'data.frame':    32 obs. of  11 variables:
#>  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#>  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
#>  $ disp: num  160 160 108 258 360 ...
#>  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
#>  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#>  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
#>  $ qsec: num  16.5 17 18.6 19.4 17 ...
#>  $ vs  : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
#>  $ am  : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
#>  $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
#>  $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
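If mtcarz is not at hand, a data set with the same structure can probably be rebuilt from mtcars in base R. This is a minimal sketch, assuming the only change is the factor conversion of the five columns shown above; the name mtcarz_sketch is hypothetical and not part of the package.

```r
# Assumption: mtcarz differs from mtcars only in these columns being factors.
factor_cols <- c("cyl", "vs", "am", "gear", "carb")

mtcarz_sketch <- mtcars
# Convert each selected column to a factor, leaving the others numeric.
mtcarz_sketch[factor_cols] <- lapply(mtcarz_sketch[factor_cols], factor)

str(mtcarz_sketch)
```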

Cross Tabulation

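The ds_cross_table() function creates a two-way table of two categorical variables. Each cell reports four values: the frequency, the share of all observations (Percent), the share of the row total (Row Pct), and the share of the column total (Col Pct). All three shares are printed as proportions between 0 and 1.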

ds_cross_table(mtcarz, cyl, gear)
#>     Cell Contents
#>  |---------------|
#>  |     Frequency |
#>  |       Percent |
#>  |       Row Pct |
#>  |       Col Pct |
#>  |---------------|
#> 
#>  Total Observations:  32 
#> 
#> ----------------------------------------------------------------------------
#> |              |                           gear                            |
#> ----------------------------------------------------------------------------
#> |          cyl |            3 |            4 |            5 |    Row Total |
#> ----------------------------------------------------------------------------
#> |            4 |            1 |            8 |            2 |           11 |
#> |              |        0.031 |         0.25 |        0.062 |              |
#> |              |         0.09 |         0.73 |         0.18 |         0.34 |
#> |              |         0.07 |         0.67 |          0.4 |              |
#> ----------------------------------------------------------------------------
#> |            6 |            2 |            4 |            1 |            7 |
#> |              |        0.062 |        0.125 |        0.031 |              |
#> |              |         0.29 |         0.57 |         0.14 |         0.22 |
#> |              |         0.13 |         0.33 |          0.2 |              |
#> ----------------------------------------------------------------------------
#> |            8 |           12 |            0 |            2 |           14 |
#> |              |        0.375 |            0 |        0.062 |              |
#> |              |         0.86 |            0 |         0.14 |         0.44 |
#> |              |          0.8 |            0 |          0.4 |              |
#> ----------------------------------------------------------------------------
#> | Column Total |           15 |           12 |            5 |           32 |
#> |              |        0.468 |        0.375 |        0.155 |              |
#> ----------------------------------------------------------------------------
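To see where the four numbers in each cell come from, the same quantities can be computed with base R. This is only an illustrative sketch using table() and prop.table() on mtcars; it is not how ds_cross_table() is implemented.

```r
# Rows are cyl, columns are gear, as in the table above.
counts <- table(cyl = factor(mtcars$cyl), gear = factor(mtcars$gear))

percent <- prop.table(counts)              # cell count / total observations
row_pct <- prop.table(counts, margin = 1)  # cell count / row total
col_pct <- prop.table(counts, margin = 2)  # cell count / column total

counts
round(row_pct, 2)
round(col_pct, 2)
```

For example, the cell for 4 cylinders and 4 gears holds 8 cars, so its three shares are 8/32, 8/11 and 8/12.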

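If you want the above result as a tibble, use ds_twoway_table(). Combinations with no observations are left out, so the output below has 8 rows rather than 9: there is no row for 8 cylinders and 4 gears.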

ds_twoway_table(mtcarz, cyl, gear)
#> Joining, by = c("cyl", "gear", "count")
#> # A tibble: 8 x 6
#>   cyl   gear  count percent row_percent col_percent
#>   <fct> <fct> <int>   <dbl>       <dbl>       <dbl>
#> 1 4     3         1  0.0312      0.0909      0.0667
#> 2 4     4         8  0.25        0.727       0.667 
#> 3 4     5         2  0.0625      0.182       0.4   
#> 4 6     3         2  0.0625      0.286       0.133 
#> 5 6     4         4  0.125       0.571       0.333 
#> 6 6     5         1  0.0312      0.143       0.2   
#> 7 8     3        12  0.375       0.857       0.8   
#> 8 8     5         2  0.0625      0.143       0.4
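The long layout above can be approximated in base R, which may help when checking the percentages by hand. This is a sketch under the assumption that empty combinations are simply dropped; the object name long is hypothetical and the result is a plain data frame, not a tibble.

```r
counts <- table(cyl = factor(mtcars$cyl), gear = factor(mtcars$gear))

# One row per cyl/gear combination, with the count in a column named "count".
long <- as.data.frame(counts, responseName = "count")

# Drop combinations that never occur (here: cyl 8 with gear 4).
long <- long[long$count > 0, ]

long$percent     <- long$count / sum(long$count)
long$row_percent <- long$count / ave(long$count, long$cyl,  FUN = sum)
long$col_percent <- long$count / ave(long$count, long$gear, FUN = sum)

long <- long[order(long$cyl, long$gear), ]
long
```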

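A plot() method is defined for the object returned by ds_cross_table(). Depending on the arguments, it generates grouped, stacked, or proportional bar plots: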

Grouped Bar Plots

k <- ds_cross_table(mtcarz, cyl, gear)
plot(k)

Stacked Bar Plots

k <- ds_cross_table(mtcarz, cyl, gear)
plot(k, stacked = TRUE)

Proportional Bar Plots

k <- ds_cross_table(mtcarz, cyl, gear)
plot(k, proportional = TRUE)
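For readers who want to see what distinguishes the three variants, here is a rough base R equivalent built with barplot(). It is only a sketch: which variable goes on the x axis, and that the proportional plot scales each bar to its column total, are assumptions rather than documented behavior of the plot() method.

```r
counts <- table(cyl = factor(mtcars$cyl), gear = factor(mtcars$gear))

# Assumption: each bar (one per gear level) is scaled to sum to 1.
col_share <- prop.table(counts, margin = 2)

# Grouped: one bar per cyl level, side by side within each gear level.
barplot(counts, beside = TRUE, legend.text = TRUE, xlab = "gear", ylab = "count")

# Stacked: cyl counts stacked on top of each other within each gear level.
barplot(counts, legend.text = TRUE, xlab = "gear", ylab = "count")

# Proportional: stacked bars of equal height showing shares instead of counts.
barplot(col_share, legend.text = TRUE, xlab = "gear", ylab = "proportion")
```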

Frequency Table

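The ds_freq_table() function creates a frequency table for a single categorical variable. For each level it reports the frequency, cumulative frequency, percent, and cumulative percent.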

ds_freq_table(mtcarz, cyl)
#>                              Variable: cyl                              
#> -----------------------------------------------------------------------
#> Levels     Frequency    Cum Frequency       Percent        Cum Percent  
#> -----------------------------------------------------------------------
#>    4          11             11              34.38            34.38    
#> -----------------------------------------------------------------------
#>    6           7             18              21.88            56.25    
#> -----------------------------------------------------------------------
#>    8          14             32              43.75             100     
#> -----------------------------------------------------------------------
#>  Total        32              -             100.00              -      
#> -----------------------------------------------------------------------
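The columns of the frequency table can be reproduced with a few lines of base R. This is an illustrative sketch, not the package's implementation; freq_tbl is a hypothetical name, and the percentages are left unrounded.

```r
counts <- table(factor(mtcars$cyl))
n <- sum(counts)

freq_tbl <- data.frame(
  level         = names(counts),
  frequency     = as.vector(counts),
  cum_frequency = cumsum(as.vector(counts))
)

# Percentages are on a 0-100 scale, as in the table above.
freq_tbl$percent     <- 100 * freq_tbl$frequency / n
freq_tbl$cum_percent <- 100 * freq_tbl$cum_frequency / n

freq_tbl
```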

A plot() method has been defined which will create a bar plot.

k <- ds_freq_table(mtcarz, cyl)
plot(k)

Multiple One Way Tables

The ds_auto_freq_table() function creates multiple one-way tables, one frequency table for each categorical variable in a data set. If you do not want every categorical variable to be used, you can specify a subset of variables.

ds_auto_freq_table(mtcarz)
#>                              Variable: cyl                              
#> -----------------------------------------------------------------------
#> Levels     Frequency    Cum Frequency       Percent        Cum Percent  
#> -----------------------------------------------------------------------
#>    4          11             11              34.38            34.38    
#> -----------------------------------------------------------------------
#>    6           7             18              21.88            56.25    
#> -----------------------------------------------------------------------
#>    8          14             32              43.75             100     
#> -----------------------------------------------------------------------
#>  Total        32              -             100.00              -      
#> -----------------------------------------------------------------------
#> 
#>                              Variable: vs                               
#> -----------------------------------------------------------------------
#> Levels     Frequency    Cum Frequency       Percent        Cum Percent  
#> -----------------------------------------------------------------------
#>    0          18             18              56.25            56.25    
#> -----------------------------------------------------------------------
#>    1          14             32              43.75             100     
#> -----------------------------------------------------------------------
#>  Total        32              -             100.00              -      
#> -----------------------------------------------------------------------
#> 
#>                              Variable: am                               
#> -----------------------------------------------------------------------
#> Levels     Frequency    Cum Frequency       Percent        Cum Percent  
#> -----------------------------------------------------------------------
#>    0          19             19              59.38            59.38    
#> -----------------------------------------------------------------------
#>    1          13             32              40.62             100     
#> -----------------------------------------------------------------------
#>  Total        32              -             100.00              -      
#> -----------------------------------------------------------------------
#> 
#>                             Variable: gear                              
#> -----------------------------------------------------------------------
#> Levels     Frequency    Cum Frequency       Percent        Cum Percent  
#> -----------------------------------------------------------------------
#>    3          15             15              46.88            46.88    
#> -----------------------------------------------------------------------
#>    4          12             27              37.5             84.38    
#> -----------------------------------------------------------------------
#>    5           5             32              15.62             100     
#> -----------------------------------------------------------------------
#>  Total        32              -             100.00              -      
#> -----------------------------------------------------------------------
#> 
#>                             Variable: carb                              
#> -----------------------------------------------------------------------
#> Levels     Frequency    Cum Frequency       Percent        Cum Percent  
#> -----------------------------------------------------------------------
#>    1           7              7              21.88            21.88    
#> -----------------------------------------------------------------------
#>    2          10             17              31.25            53.12    
#> -----------------------------------------------------------------------
#>    3           3             20              9.38             62.5     
#> -----------------------------------------------------------------------
#>    4          10             30              31.25            93.75    
#> -----------------------------------------------------------------------
#>    6           1             31              3.12             96.88    
#> -----------------------------------------------------------------------
#>    8           1             32              3.12              100     
#> -----------------------------------------------------------------------
#>  Total        32              -             100.00              -      
#> -----------------------------------------------------------------------
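The idea of tabulating every categorical variable can be sketched in base R by selecting the factor columns and applying table() to each. This assumes "categorical" means a factor column; the names df and one_way are hypothetical.

```r
# Rebuild a data set with the same factor columns as mtcarz (assumption).
df <- mtcars
factor_names <- c("cyl", "vs", "am", "gear", "carb")
df[factor_names] <- lapply(df[factor_names], factor)

# Keep only the factor columns, then count the levels of each one.
factor_cols <- names(Filter(is.factor, df))
one_way <- lapply(df[factor_cols], table)

factor_cols
one_way$gear
```

Restricting the result to a subset is then a matter of indexing, for example one_way[c("cyl", "gear")].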

Multiple Two Way Tables

The ds_auto_cross_table() function creates multiple two-way tables, one cross table for each unique pair of categorical variables in a data set. If you do not want every categorical variable to be used, you can specify a subset of variables. The example below restricts the tables to cyl, gear and am, which gives three pairs.

ds_auto_cross_table(mtcarz, cyl, gear, am)
#>     Cell Contents
#>  |---------------|
#>  |     Frequency |
#>  |       Percent |
#>  |       Row Pct |
#>  |       Col Pct |
#>  |---------------|
#> 
#>  Total Observations:  32 
#> 
#>                                 cyl vs gear                                 
#> ----------------------------------------------------------------------------
#> |              |                           gear                            |
#> ----------------------------------------------------------------------------
#> |          cyl |            3 |            4 |            5 |    Row Total |
#> ----------------------------------------------------------------------------
#> |            4 |            1 |            8 |            2 |           11 |
#> |              |        0.031 |         0.25 |        0.062 |              |
#> |              |         0.09 |         0.73 |         0.18 |         0.34 |
#> |              |         0.07 |         0.67 |          0.4 |              |
#> ----------------------------------------------------------------------------
#> |            6 |            2 |            4 |            1 |            7 |
#> |              |        0.062 |        0.125 |        0.031 |              |
#> |              |         0.29 |         0.57 |         0.14 |         0.22 |
#> |              |         0.13 |         0.33 |          0.2 |              |
#> ----------------------------------------------------------------------------
#> |            8 |           12 |            0 |            2 |           14 |
#> |              |        0.375 |            0 |        0.062 |              |
#> |              |         0.86 |            0 |         0.14 |         0.44 |
#> |              |          0.8 |            0 |          0.4 |              |
#> ----------------------------------------------------------------------------
#> | Column Total |           15 |           12 |            5 |           32 |
#> |              |        0.468 |        0.375 |        0.155 |              |
#> ----------------------------------------------------------------------------
#> 
#> 
#>                          cyl vs am                           
#> -------------------------------------------------------------
#> |              |                     am                     |
#> -------------------------------------------------------------
#> |          cyl |            0 |            1 |    Row Total |
#> -------------------------------------------------------------
#> |            4 |            3 |            8 |           11 |
#> |              |        0.094 |         0.25 |              |
#> |              |         0.27 |         0.73 |         0.34 |
#> |              |         0.16 |         0.62 |              |
#> -------------------------------------------------------------
#> |            6 |            4 |            3 |            7 |
#> |              |        0.125 |        0.094 |              |
#> |              |         0.57 |         0.43 |         0.22 |
#> |              |         0.21 |         0.23 |              |
#> -------------------------------------------------------------
#> |            8 |           12 |            2 |           14 |
#> |              |        0.375 |        0.062 |              |
#> |              |         0.86 |         0.14 |         0.44 |
#> |              |         0.63 |         0.15 |              |
#> -------------------------------------------------------------
#> | Column Total |           19 |           13 |           32 |
#> |              |        0.594 |        0.406 |              |
#> -------------------------------------------------------------
#> 
#> 
#>                          gear vs am                          
#> -------------------------------------------------------------
#> |              |                     am                     |
#> -------------------------------------------------------------
#> |         gear |            0 |            1 |    Row Total |
#> -------------------------------------------------------------
#> |            3 |           15 |            0 |           15 |
#> |              |        0.469 |            0 |              |
#> |              |            1 |            0 |         0.47 |
#> |              |         0.79 |            0 |              |
#> -------------------------------------------------------------
#> |            4 |            4 |            8 |           12 |
#> |              |        0.125 |         0.25 |              |
#> |              |         0.33 |         0.67 |         0.38 |
#> |              |         0.21 |         0.62 |              |
#> -------------------------------------------------------------
#> |            5 |            0 |            5 |            5 |
#> |              |            0 |        0.156 |              |
#> |              |            0 |            1 |         0.16 |
#> |              |            0 |         0.38 |              |
#> -------------------------------------------------------------
#> | Column Total |           19 |           13 |           32 |
#> |              |        0.594 |        0.406 |              |
#> -------------------------------------------------------------
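How the unique pairs are formed can be illustrated with combn() in base R. This is a sketch of the pairing idea only, producing plain count tables; it does not claim to mirror the internals of ds_auto_cross_table().

```r
vars <- c("cyl", "gear", "am")

# All unordered pairs of the selected variables, in order of appearance.
pairs <- combn(vars, 2, simplify = FALSE)

# One two-way count table per pair; the first variable forms the rows.
tables <- lapply(pairs, function(p) {
  table(mtcars[[p[1]]], mtcars[[p[2]]], dnn = p)
})
names(tables) <- vapply(pairs, paste, character(1), collapse = " vs ")

names(tables)
tables[["gear vs am"]]
```

With k variables this yields k(k-1)/2 tables, so three variables give the three tables shown above.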
