Creating Frequency Tables

Matthijs S. Berends

Introduction

Frequency tables (or frequency distributions) are summaries of the distribution of values in a sample. With the freq function, you can create univariate frequency tables. When multiple variables are given, their values are pasted together into one variable, so the result is always a univariate distribution. We take the septic_patients dataset (included in this AMR package) as an example.

Frequencies of one variable

To quickly review the content of a single variable, you can select that variable in several ways. Let’s say we want to get the frequencies of the sex variable of the septic_patients dataset:

# just using base R
freq(septic_patients$sex)

# using base R to select the variable and pass it on with a pipe from the dplyr package
septic_patients$sex %>% freq()

# do it all with pipes, using the `select` function from the dplyr package
septic_patients %>%
  select(sex) %>%
  freq()

# or the preferred way: pipe the dataset to the freq function and name the variable
septic_patients %>% freq(sex) # this also shows 'sex' in the title

All of these lead to the following table:

freq(septic_patients$sex)
# Frequency table  
# Class:     character
# Length:    2000 (of which NA: 0 = 0.00%)
# Unique:    2
# 
#      Item    Count   Percent   Cum. Count   Cum. Percent
# ---  -----  ------  --------  -----------  -------------
# 1    M        1032     51.6%         1032          51.6%
# 2    V         968     48.4%         2000         100.0%

This immediately shows the class of the variable, its length and availability (i.e. the number of NA values), the number of unique values and (most importantly) that among septic patients men are more prevalent than women.
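The columns of such a table can also be reproduced by hand. The following is a minimal base R sketch, not the implementation of freq: the vector `sex` is rebuilt from the counts printed above, and the column names of `freq_tbl` are illustrative assumptions.

```r
# Rebuild a vector with the same counts as the table above (1032 x "M", 968 x "V")
sex <- rep(c("M", "V"), times = c(1032, 968))

# Count each item and put the most frequent item first
counts <- sort(table(sex), decreasing = TRUE)

# Assemble the columns of a frequency table (column names are illustrative)
freq_tbl <- data.frame(
  item    = names(counts),
  count   = as.vector(counts),
  percent = as.vector(counts) / length(sex),
  stringsAsFactors = FALSE
)
freq_tbl$cum_count   <- cumsum(freq_tbl$count)
freq_tbl$cum_percent <- cumsum(freq_tbl$percent)

print(freq_tbl)
```

The cumulative columns are plain running sums, which is why the last row always ends at the total length and at 100%.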

Frequencies of more than one variable

When several variables are given, their values are pasted together into one variable. This lets you review individual combinations while the frequency table stays univariate.

For illustration, we could add some more variables to the septic_patients dataset to learn about bacterial properties:

my_patients <- septic_patients %>% left_join_microorganisms()

Now all variables of the microorganisms dataset have been joined to the septic_patients dataset. The microorganisms dataset consists of the following variables:

colnames(microorganisms)
#  [1] "bactid"       "bactsys"      "family"       "genus"       
#  [5] "species"      "subspecies"   "fullname"     "type"        
#  [9] "gramstain"    "aerobic"      "type_nl"      "gramstain_nl"

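If we compare the dimensions of the old and the new dataset, we can see that 11 variables were added. That is one fewer than the 12 listed above, presumably because bactid serves as the join key and was already present: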

dim(septic_patients)
# [1] 2000   49
dim(my_patients)
# [1] 2000   60

So now the genus and species variables are available. A frequency table of these combined variables can be created like this:

my_patients %>% freq(genus, species)
# Frequency table of `genus` and `species` 
# Columns:   2
# Length:    2000 (of which NA: 0 = 0.00%)
# Unique:    110
# 
#      Item                                 Count   Percent   Cum. Count   Cum. Percent
# ---  ----------------------------------  ------  --------  -----------  -------------
# 1    Escherichia coli                       460     23.0%          460          23.0%
# 2    Staphylococcus coagulase negative      309     15.4%          769          38.5%
# 3    Staphylococcus aureus                  233     11.7%         1002          50.1%
# 4    Staphylococcus epidermidis             174      8.7%         1176          58.8%
# 5    Streptococcus pneumoniae               102      5.1%         1278          63.9%
# 6    Staphylococcus hominis                  80      4.0%         1358          67.9%
# 7    Klebsiella pneumoniae                   57      2.8%         1415          70.8%
# 8    Enterococcus faecalis                   39      2.0%         1454          72.7%
# 9    Proteus mirabilis                       35      1.8%         1489          74.5%
# 10   Pseudomonas aeruginosa                  30      1.5%         1519          75.9%
# 11   Serratia marcescens                     23      1.1%         1542          77.1%
# 12   Enterobacter cloacae                    22      1.1%         1564          78.2%
# 13   Enterococcus faecium                    21      1.1%         1585          79.2%
# 14   Staphylococcus capitis                  21      1.1%         1606          80.3%
# 15   Bacteroides fragilis                    20      1.0%         1626          81.3%
# [ reached getOption("max.print.freq") -- omitted 95 entries, n = 374 (18.7%) ]
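What "pasting variables into one" means can be sketched in base R. This is an illustration under assumptions, not the code of freq: the toy vectors below are made up, and a single space is assumed as separator because the items above are printed as "genus species".

```r
# Toy data, not taken from my_patients
genus   <- c("Escherichia", "Staphylococcus", "Staphylococcus", "Escherichia", "Escherichia")
species <- c("coli", "aureus", "epidermidis", "coli", "coli")

# Paste both variables into one, so every row is a single combined item
combined <- paste(genus, species)

# A univariate frequency count of the combined variable, most frequent first
counts <- sort(table(combined), decreasing = TRUE)
print(counts)
```

Because the variables are combined row by row, the table counts combinations, not the separate frequencies of each variable.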

Frequencies of numeric values

Frequency tables can be created from any input.

For numeric values (like integers, doubles, etc.), additional descriptive statistics will be calculated and shown in the header:

# get age distribution of unique patients
septic_patients %>% 
  distinct(patient_id, .keep_all = TRUE) %>% 
  freq(age, nmax = 5)
# Frequency table of `age` 
# Class:     numeric
# Length:    989 (of which NA: 0 = 0.00%)
# Unique:    73
#   
# Mean:      71
# Std. dev.: 14 (CV: 0.2, MAD: 13)
# Five-Num:  14 | 63 | 74 | 82 | 97 (IQR: 19, CQV: 0.13)
# Outliers:  15 (unique: 12)
# 
#       Item   Count   Percent   Cum. Count   Cum. Percent
# ---  -----  ------  --------  -----------  -------------
# 1       83      44      4.4%           44           4.4%
# 2       76      43      4.3%           87           8.8%
# 3       75      38      3.8%          125          12.6%
# 4       78      33      3.3%          158          16.0%
# 5       82      33      3.3%          191          19.3%
# [ reached `nmax = 5` -- omitted 68 entries, n = 798 (80.7%) ]

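So the following properties are determined, where NA values are always ignored: the mean; the standard deviation, together with the coefficient of variation (CV) and the median absolute deviation (MAD); Tukey's five-number summary (minimum, Q1, median, Q3, maximum), together with the interquartile range (IQR) and the coefficient of quartile variation (CQV); and the outliers (total count and number of unique values).

For example, the frequency table above immediately shows that the median age of patients is 74.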
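The header statistics for numeric values can be approximated with base R. The sketch below is an assumption about how such values may be computed, not the code of freq: the toy vector `x` is made up, and the choice of `fivenum`, `mad` and `boxplot.stats` is illustrative. The CQV is taken as (Q3 - Q1) / (Q3 + Q1), which agrees with the header above (19 / 145 is about 0.13).

```r
# Toy values, not taken from septic_patients
x <- c(2, 4, 4, 5, 7, 9, 30, NA)
x <- x[!is.na(x)]                                # NA values are ignored

q <- quantile(x, c(0.25, 0.75), names = FALSE)   # first and third quartile

stats <- list(
  mean     = mean(x),
  sd       = sd(x),
  cv       = sd(x) / mean(x),                    # coefficient of variation
  mad      = mad(x),                             # median absolute deviation (scaled by 1.4826 by default)
  five_num = fivenum(x),                         # Tukey: min, lower hinge, median, upper hinge, max
  iqr      = IQR(x),                             # interquartile range
  cqv      = (q[2] - q[1]) / (q[2] + q[1]),      # coefficient of quartile variation
  outliers = boxplot.stats(x)$out                # values more than 1.5 box lengths beyond the hinges
)

str(stats)
```

Note that `fivenum` uses Tukey's hinges while `quantile` uses interpolated quartiles by default, so the two can differ slightly on small samples.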

Frequencies of factors

Frequency tables of factors are sorted by factor level by default, not by item count. In other words, sort.count is TRUE by default for every class except factors. Set sort.count = TRUE to sort a factor by count as well. Frequency tables of factors always show the factor level as an additional last column.

Compare this default behaviour…

septic_patients %>%
  freq(hospital_id)
# Frequency table of `hospital_id` 
# Class:     factor
# Length:    2000 (of which NA: 0 = 0.00%)
# Unique:    4
# 
#      Item    Count   Percent   Cum. Count   Cum. Percent   (Factor Level)
# ---  -----  ------  --------  -----------  -------------  ---------------
# 1    A         319     16.0%          319          16.0%                1
# 2    B         661     33.1%          980          49.0%                2
# 3    C         256     12.8%         1236          61.8%                3
# 4    D         764     38.2%         2000         100.0%                4

… with this, where items are now sorted by count:

septic_patients %>%
  freq(hospital_id, sort.count = TRUE)
# Frequency table of `hospital_id` 
# Class:     factor
# Length:    2000 (of which NA: 0 = 0.00%)
# Unique:    4
# 
#      Item    Count   Percent   Cum. Count   Cum. Percent   (Factor Level)
# ---  -----  ------  --------  -----------  -------------  ---------------
# 1    D         764     38.2%          764          38.2%                4
# 2    B         661     33.1%         1425          71.2%                2
# 3    A         319     16.0%         1744          87.2%                1
# 4    C         256     12.8%         2000         100.0%                3
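The difference between the two orderings can be sketched with base R. This is only an illustration of level order versus count order, not the code of freq; the toy factor `hospital` and the column names are made up.

```r
# Toy factor with levels A-D (counts chosen so that no two levels tie)
hospital <- factor(rep(c("A", "B", "C", "D"), times = c(2, 3, 1, 4)))

# table() keeps the order of the factor levels
by_level <- table(hospital)

# sorting the counts in decreasing order puts the most frequent item first
by_count <- sort(table(hospital), decreasing = TRUE)

# The factor level of each item can still be shown next to the sorted counts
sorted_tbl <- data.frame(
  item         = names(by_count),
  count        = as.vector(by_count),
  factor_level = match(names(by_count), levels(hospital))
)
print(sorted_tbl)
```

Keeping the factor level as a separate column means the original order stays visible after sorting by count.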

All classes will be printed in the header. Variables with the new rsi class of this AMR package are actually ordered factors and have three classes (look at Class in the header):

septic_patients %>%
  select(amox) %>% 
  freq()
# Frequency table  
# Class:     factor > ordered > rsi
# Length:    2000 (of which NA: 1002 = 50.1%)
# Unique:    3
#   
# %IR:       66.33%
# Ratio SIR: 1.0 : 0.009 : 2.0
# 
#      Item    Count   Percent   Cum. Count   Cum. Percent   (Factor Level)
# ---  -----  ------  --------  -----------  -------------  ---------------
# 1    S         336     33.7%          336          33.7%                1
# 2    I           3      0.3%          339          34.0%                2
# 3    R         659     66.0%          998         100.0%                3
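The two extra header lines can be checked against the counts in the table. The sketch below uses base R only and makes assumptions: %IR is taken as the share of I and R among all non-missing results, the ratio is expressed relative to S, and the class name `my_rsi` is hypothetical and merely mimics an ordered factor with one extra class.

```r
# Counts copied from the table above
counts <- c(S = 336, I = 3, R = 659)

# Share of intermediate and resistant results among all non-missing results
pct_ir <- 100 * sum(counts[c("I", "R")]) / sum(counts)

# Counts relative to the number of susceptible results
ratio <- counts / counts[["S"]]

# An ordered factor with one extra class on top (the class name is hypothetical)
x <- factor(rep(names(counts), times = counts),
            levels = c("S", "I", "R"), ordered = TRUE)
class(x) <- c("my_rsi", class(x))

print(round(pct_ir, 2))
print(class(x))
```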

Frequencies of dates

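Frequencies of dates will show the oldest and newest date in the data, the number of days between them, and the median date. Month names are printed according to the system locale, which was Dutch when the example below was generated: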

septic_patients %>%
  select(date) %>% 
  freq(nmax = 5)
# Frequency table  
# Class:     Date
# Length:    2000 (of which NA: 0 = 0.00%)
# Unique:    1151
# 
# Oldest:    2 januari 2002
# Newest:    28 december 2017 (+5839)
# Median:    7 augustus 2009 (~48%)
# 
#      Item          Count   Percent   Cum. Count   Cum. Percent
# ---  -----------  ------  --------  -----------  -------------
# 1    2016-05-21       10      0.5%           10           0.5%
# 2    2004-11-15        8      0.4%           18           0.9%
# 3    2013-07-29        8      0.4%           26           1.3%
# 4    2017-06-12        8      0.4%           34           1.7%
# 5    2015-11-19        7      0.4%           41           2.1%
# [ reached `nmax = 5` -- omitted 1146 entries, n = 1959 (98.0%) ]
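The date header can be reproduced with base R date arithmetic. The sketch below is not the code of freq; it uses only the three dates printed in the header. Reading "(~48%)" as the position of the median within the full date range is an inference from these numbers, not something the text states.

```r
# The oldest, median and newest date from the header above
dates <- as.Date(c("2002-01-02", "2009-08-07", "2017-12-28"))

oldest <- min(dates)
newest <- max(dates)

# Subtracting two Date objects gives the difference in days
span_days <- as.numeric(newest - oldest)

# Median date and its relative position between oldest and newest
med          <- median(dates)
med_position <- as.numeric(med - oldest) / span_days

print(span_days)
print(round(100 * med_position))
```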

Assigning a frequency table to an object

A frequency table is actually a regular data.frame, with the exception that it contains an additional class.

my_df <- septic_patients %>% freq(age)
class(my_df)
# [1] "frequency_tbl" "data.frame"

Because of this additional class, a frequency table prints like the examples above. But the object itself contains the complete table without a row limitation:

dim(my_df)
# [1] 74  5
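The idea of a data.frame with one additional class and its own print method can be sketched with base R S3 classes. Everything below is hypothetical: the class name `my_freq_tbl`, the `nmax` argument and the omission message are illustrative and are not the internals of the AMR package.

```r
# A plain data.frame with an additional class in front (class name is hypothetical)
freq_tbl <- data.frame(item = c("a", "b", "c"), count = c(5L, 3L, 1L),
                       stringsAsFactors = FALSE)
class(freq_tbl) <- c("my_freq_tbl", class(freq_tbl))

# Print method that shows at most `nmax` rows; the stored data is not truncated
print.my_freq_tbl <- function(x, nmax = 2, ...) {
  plain <- x
  class(plain) <- "data.frame"            # drop the extra class to avoid recursion
  print(head(plain, nmax), ...)
  omitted <- nrow(plain) - min(nmax, nrow(plain))
  if (omitted > 0) cat("[ omitted", omitted, "entries ]\n")
  invisible(x)
}

print(freq_tbl)   # shows two rows and a note
dim(freq_tbl)     # still reports all three rows
```

Because the object still inherits from data.frame, functions such as `dim`, `nrow` and subsetting keep working on the complete table.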

Additional parameters

Parameter na.rm

With the na.rm parameter, you can include NA values in the frequency table by setting na.rm = FALSE. It defaults to TRUE, but the number of NA values is always shown in the header.
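The effect of keeping or dropping NA values can be illustrated with base R's `table`, which offers a comparable switch through `useNA`. This is an analogy with made-up data, not the implementation of freq; the commented call at the end shows how the parameter would be passed to freq and requires the AMR package.

```r
# Toy vector with missing values
x <- c("S", "R", NA, "S", NA)

# Default: NA values are left out of the counts
without_na <- table(x)

# With useNA = "ifany", NA gets its own row
with_na <- table(x, useNA = "ifany")

print(without_na)
print(with_na)

# With the AMR package loaded, the equivalent request would be (not run here):
# septic_patients %>% freq(amox, na.rm = FALSE)
```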

Parameter row.names

The default frequency table shows row indices. To remove them, use row.names = FALSE.
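Printing a plain data.frame in base R has a similar switch, which makes the effect easy to see. The sketch below is an analogy with a small illustrative table, not the code of freq; the commented call at the end shows how the parameter would be passed to freq and requires the AMR package.

```r
# Small table with the counts of the `sex` example
tbl <- data.frame(item = c("M", "V"), count = c(1032L, 968L),
                  stringsAsFactors = FALSE)

# Default printing: every row starts with its index
with_idx <- capture.output(print(tbl))

# row.names = FALSE suppresses the index column
without_idx <- capture.output(print(tbl, row.names = FALSE))

writeLines(with_idx)
writeLines(without_idx)

# With the AMR package loaded, the equivalent request would be (not run here):
# septic_patients %>% freq(hospital_id, row.names = FALSE)
```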

Parameter markdown

The markdown parameter can be used in reports created with R Markdown. It will always print all rows.
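To show what a markdown table of counts looks like, here is a small base R helper that writes a data.frame as a pipe table. The function `to_markdown` is hypothetical and only illustrates the output format; it is not how freq produces markdown. The commented call at the end assumes that the parameter is a logical switch and requires the AMR package.

```r
# Hypothetical helper: format a data.frame as a markdown pipe table, one line per row
to_markdown <- function(df) {
  cells  <- lapply(df, as.character)
  header <- paste0("| ", paste(names(df), collapse = " | "), " |")
  rule   <- paste0("|", paste(rep("---", ncol(df)), collapse = "|"), "|")
  rows   <- paste0("| ", do.call(paste, c(cells, sep = " | ")), " |")
  c(header, rule, rows)   # all rows are included, nothing is truncated
}

tbl <- data.frame(item = c("M", "V"), count = c(1032L, 968L),
                  stringsAsFactors = FALSE)
md <- to_markdown(tbl)
writeLines(md)

# With the AMR package loaded, the call would presumably be (not run here):
# septic_patients %>% freq(hospital_id, markdown = TRUE)
```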


AMR, (c) 2018, https://github.com/msberends/AMR

Licensed under the GNU General Public License v2.0.