Creating Frequency Tables

Matthijs S. Berends

Introduction

Frequency tables (or frequency distributions) summarise the distribution of values in a sample. The freq function creates univariate frequency tables. When several variables are supplied, they are pasted into a single variable, so the result is always a univariate distribution. We use the septic_patients dataset (included in this AMR package) as an example.

Frequencies of one variable

To quickly review the content of a single variable, you can select that variable in various ways. Let’s say we want to get the frequencies of the gender variable of the septic_patients dataset:

septic_patients %>% freq(gender)
# Frequency table of `gender` 
# Class:     character
# Length:    2000 (of which NA: 0 = 0.00%)
# Unique:    2
# 
# Shortest:  1
# Longest:   1
# 
#      Item    Count   Percent   Cum. Count   Cum. Percent
# ---  -----  ------  --------  -----------  -------------
# 1    M        1031     51.5%         1031          51.5%
# 2    F         969     48.4%         2000         100.0%

This immediately shows the class of the variable, its length and availability (i.e. the number of NA values), the number of unique values and (most importantly) that men are more prevalent than women among septic patients.
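As a rough illustration of what the columns of such a table contain, the following base R sketch rebuilds the count, percentage and cumulative columns by hand. It does not use freq and is not its implementation; the gender vector is a hypothetical stand-in reconstructed from the counts shown above, and the column names are made up for this example.

```r
# Hypothetical stand-in for the gender column, rebuilt from the counts above.
gender <- c(rep("M", 1031), rep("F", 969))

# Count each item and put the most frequent item first.
counts <- sort(table(gender), decreasing = TRUE)

# Assemble the columns of a simple frequency table (names are illustrative).
freq_tbl <- data.frame(
  item    = names(counts),
  count   = as.integer(counts),
  percent = as.numeric(counts) / length(gender),
  stringsAsFactors = FALSE
)
freq_tbl$cum_count   <- cumsum(freq_tbl$count)
freq_tbl$cum_percent <- cumsum(freq_tbl$percent)

print(freq_tbl)
```

The cumulative columns are plain running sums, which is why the last row always ends at the total length and at 100%.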

Frequencies of more than one variable

Multiple variables are pasted into one variable, so that individual combinations can be reviewed while the frequency table stays univariate.

For illustration, we could add some more variables to the septic_patients dataset to learn about bacterial properties:

my_patients <- septic_patients %>% left_join_microorganisms()

Now all variables of the microorganisms dataset have been joined to the septic_patients dataset. The microorganisms dataset consists of the following variables:

colnames(microorganisms)
#  [1] "mo"         "tsn"        "genus"      "species"    "subspecies"
#  [6] "fullname"   "family"     "order"      "class"      "phylum"    
# [11] "subkingdom" "gramstain"  "type"       "prevalence" "ref"

If we compare the dimensions of the old and new dataset, we can see that 14 variables were added. The remaining variable, mo, is the key on which both datasets are joined:

dim(septic_patients)
# [1] 2000   49
dim(my_patients)
# [1] 2000   63

So now the genus and species variables are available. A frequency table of these combined variables can be created like this:

my_patients %>% freq(genus, species)
# Frequency table of `genus` and `species` 
# Columns:   2
# Length:    2000 (of which NA: 0 = 0.00%)
# Unique:    96
# 
# Shortest:  12
# Longest:   34
# 
#      Item                                 Count   Percent   Cum. Count   Cum. Percent
# ---  ----------------------------------  ------  --------  -----------  -------------
# 1    Escherichia coli                       467     23.4%          467          23.4%
# 2    Staphylococcus coagulase negative      313     15.7%          780          39.0%
# 3    Staphylococcus aureus                  235     11.8%         1015          50.7%
# 4    Staphylococcus epidermidis             174      8.7%         1189          59.5%
# 5    Streptococcus pneumoniae               117      5.9%         1306          65.3%
# 6    Staphylococcus hominis                  81      4.0%         1387          69.3%
# 7    Klebsiella pneumoniae                   58      2.9%         1445          72.2%
# 8    Enterococcus faecalis                   39      2.0%         1484          74.2%
# 9    Proteus mirabilis                       36      1.8%         1520          76.0%
# 10   Pseudomonas aeruginosa                  30      1.5%         1550          77.5%
# 11   Serratia marcescens                     25      1.2%         1575          78.8%
# 12   Enterobacter cloacae                    23      1.1%         1598          79.9%
# 13   Enterococcus faecium                    21      1.1%         1619          81.0%
# 14   Staphylococcus capitis                  21      1.1%         1640          82.0%
# 15   Bacteroides fragilis                    20      1.0%         1660          83.0%
# [ reached getOption("max.print.freq") -- omitted 81 entries, n = 340 (17.0%) ]
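The idea of pasting several variables into one before counting can be sketched in base R as well. This is only an illustration under simple assumptions (values joined with a single space), not the code freq runs internally; the toy data frame below is hypothetical and merely reuses the column names from above.

```r
# Hypothetical toy data; only the column names mirror the dataset above.
df <- data.frame(
  genus   = c("Escherichia", "Staphylococcus", "Escherichia", "Staphylococcus"),
  species = c("coli", "aureus", "coli", "epidermidis"),
  stringsAsFactors = FALSE
)

# Paste both variables into one, then count the combined values.
combined <- paste(df$genus, df$species)
counts <- sort(table(combined), decreasing = TRUE)

print(counts)
```

Because the two columns are collapsed into one character vector first, the result is still a univariate table in which each row is one genus and species combination.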

Frequencies of numeric values

Frequency tables can be created for any input.

For numeric values (such as integers and doubles), additional descriptive statistics are calculated and shown in the header:

# get age distribution of unique patients
septic_patients %>% 
  distinct(patient_id, .keep_all = TRUE) %>% 
  freq(age, nmax = 5)
# Frequency table of `age` 
# Class:     numeric
# Length:    981 (of which NA: 0 = 0.00%)
# Unique:    73
#   
# Mean:      71
# Std. dev.: 14 (CV: 0.2, MAD: 13)
# Five-Num:  14 | 63 | 74 | 82 | 97 (IQR: 19, CQV: 0.13)
# Outliers:  15 (unique: 12)
# 
#       Item   Count   Percent   Cum. Count   Cum. Percent
# ---  -----  ------  --------  -----------  -------------
# 1       83      44      4.5%           44           4.5%
# 2       76      43      4.4%           87           8.9%
# 3       75      37      3.8%          124          12.6%
# 4       82      33      3.4%          157          16.0%
# 5       78      32      3.3%          189          19.3%
# [ reached `nmax = 5` -- omitted 68 entries, n = 792 (80.7%) ]

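The header therefore reports the following properties, where NA values are always ignored: the mean, the standard deviation, the coefficient of variation (CV), the median absolute deviation (MAD), Tukey’s five-number summary (minimum, Q1, median, Q3, maximum), the interquartile range (IQR), the coefficient of quartile variation (CQV), and the number of outliers (total and unique).

For example, the frequency table above quickly shows that the median age of patients is 74.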
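The derived statistics in the header can be checked by hand. The sketch below first recomputes CV, IQR and CQV from the (rounded) header values, using the usual definitions: standard deviation divided by mean, Q3 minus Q1, and (Q3 - Q1) / (Q3 + Q1). It then shows base R functions that yield the same kinds of statistics from raw data. The ages vector is a hypothetical toy example, and using boxplot.stats for outliers is an assumption about one common definition, not a statement about how freq determines them.

```r
# Header values copied from the output above (already rounded).
mean_age <- 71
sd_age   <- 14
q1 <- 63
q3 <- 82

cv  <- sd_age / mean_age        # coefficient of variation
iqr <- q3 - q1                  # interquartile range
cqv <- (q3 - q1) / (q3 + q1)    # coefficient of quartile variation
print(round(c(CV = cv, IQR = iqr, CQV = cqv), 2))

# The same kinds of statistics from raw data (hypothetical toy vector).
ages <- c(14, 55, 63, 70, 74, 78, 82, 90, 97, NA)
five <- fivenum(ages, na.rm = TRUE)   # minimum, Q1, median, Q3, maximum
print(five)
print(mad(ages, na.rm = TRUE))        # median absolute deviation
print(boxplot.stats(ages)$out)        # values flagged as outliers, if any
```

Since the header values are rounded, the recomputed CV only matches the printed 0.2 after rounding.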

Frequencies of factors

By default, frequencies of factors are sorted by factor level instead of item count. This can be changed with the sort.count parameter, which is TRUE by default except for factors. Frequency tables of factors always show the factor level as an additional last column.

Compare this default behaviour…

septic_patients %>%
  freq(hospital_id)
# Frequency table of `hospital_id` 
# Class:     factor (numeric)
# Length:    2000 (of which NA: 0 = 0.00%)
# Unique:    4
# 
#      Item    Count   Percent   Cum. Count   Cum. Percent   (Factor Level)
# ---  -----  ------  --------  -----------  -------------  ---------------
# 1    A         321     16.1%          321          16.1%                1
# 2    B         663     33.1%          984          49.2%                2
# 3    C         254     12.7%         1238          61.9%                3
# 4    D         762     38.1%         2000         100.0%                4

… with this, where items are now sorted on count:

septic_patients %>%
  freq(hospital_id, sort.count = TRUE)
# Frequency table of `hospital_id` 
# Class:     factor (numeric)
# Length:    2000 (of which NA: 0 = 0.00%)
# Unique:    4
# 
#      Item    Count   Percent   Cum. Count   Cum. Percent   (Factor Level)
# ---  -----  ------  --------  -----------  -------------  ---------------
# 1    D         762     38.1%          762          38.1%                4
# 2    B         663     33.1%         1425          71.2%                2
# 3    A         321     16.1%         1746          87.3%                1
# 4    C         254     12.7%         2000         100.0%                3
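The difference between the two orderings can be reproduced with base R alone. In this sketch the hospital_id factor is a hypothetical stand-in rebuilt from the counts above; table keeps the factor level order, while sort reorders by count.

```r
# Hypothetical stand-in for hospital_id, rebuilt from the counts above.
hospital_id <- factor(rep(c("A", "B", "C", "D"), times = c(321, 663, 254, 762)))

by_level <- table(hospital_id)                 # ordered by factor level
by_count <- sort(by_level, decreasing = TRUE)  # ordered by item count

print(names(by_level))
print(names(by_count))

# Cumulative counts depend on the chosen order.
print(cumsum(by_count))
```

Note that only the cumulative columns change between the two orderings; the counts and percentages per item stay the same.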

All classes are printed in the header. Variables with the new rsi class of this AMR package are actually ordered factors and have three classes (look at Class in the header):

septic_patients %>%
  select(amox) %>% 
  freq()
# Frequency table  
# Class:     factor > ordered > rsi (numeric)
# Length:    2000 (of which NA: 1000 = 50.00%)
# Unique:    3
#   
# %IR:       66.5%
# Ratio SIR: 1.0 : 0.009 : 2.0
# 
#      Item    Count   Percent   Cum. Count   Cum. Percent   (Factor Level)
# ---  -----  ------  --------  -----------  -------------  ---------------
# 1    S         335     33.5%          335          33.5%                1
# 2    I           3      0.3%          338          33.8%                2
# 3    R         662     66.2%         1000         100.0%                3
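The two extra header lines for rsi variables appear to follow directly from the counts in the table. As a quick check (an interpretation of the header, not the package code), %IR can be read as the share of I and R among all non-missing results, and the ratio as each count divided by the count of S:

```r
# Counts copied from the table above.
counts <- c(S = 335, I = 3, R = 662)

# Share of intermediate plus resistant results among non-missing values.
pct_ir <- sum(counts[c("I", "R")]) / sum(counts)

# Each count relative to the number of susceptible results.
ratio <- counts / counts[["S"]]

print(pct_ir)
print(signif(ratio, 2))
```

Both results are consistent with the printed header (66.5% and roughly 1.0 : 0.009 : 2.0).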

Frequencies of dates

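Frequencies of dates will show the oldest and newest date in the data, and the number of days between them. The month names in the header below are Dutch, presumably because the output was generated under a Dutch locale: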

septic_patients %>%
  select(date) %>% 
  freq(nmax = 5)
# Frequency table  
# Class:     Date (numeric)
# Length:    2000 (of which NA: 0 = 0.00%)
# Unique:    1140
# 
# Oldest:    2 januari 2002
# Newest:    28 december 2017 (+5839)
# Median:    31 juli 2009 (~47%)
# 
#      Item          Count   Percent   Cum. Count   Cum. Percent
# ---  -----------  ------  --------  -----------  -------------
# 1    2016-05-21       10      0.5%           10           0.5%
# 2    2004-11-15        8      0.4%           18           0.9%
# 3    2013-07-29        8      0.4%           26           1.3%
# 4    2017-06-12        8      0.4%           34           1.7%
# 5    2015-11-19        7      0.4%           41           2.1%
# [ reached `nmax = 5` -- omitted 1135 entries, n = 1959 (98.0%) ]
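The number of days in the header can be verified with plain date arithmetic. The sketch below uses the three dates from the header; reading the percentage after the median as its relative position between the oldest and newest date is an assumption that happens to match the printed value.

```r
# Dates copied from the header above.
oldest      <- as.Date("2002-01-02")
newest      <- as.Date("2017-12-28")
median_date <- as.Date("2009-07-31")

# Days between the oldest and newest date.
span_days <- as.integer(newest - oldest)
print(span_days)

# Assumed meaning of the percentage: position of the median within the range.
median_position <- as.integer(median_date - oldest) / span_days
print(round(median_position, 2))

# Month names produced by format() depend on the current locale.
print(format(oldest, "%d %B %Y"))
```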

Assigning a frequency table to an object

A frequency table is actually a regular data.frame, with the exception that it contains an additional class.

my_df <- septic_patients %>% freq(age)
class(my_df)
# [1] "frequency_tbl" "data.frame"

Because of this additional class, a frequency table prints like the examples above. But the object itself contains the complete table without a row limitation:

dim(my_df)
# [1] 74  5
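How an additional class can change printing while leaving the data untouched can be sketched with a small S3 example. The class name my_freq_tbl and its print method are hypothetical and written only for this illustration; they are not the frequency_tbl implementation of the package.

```r
# A plain data.frame with an extra, hypothetical class in front.
tbl <- data.frame(item = c("a", "b", "c"), count = c(5L, 3L, 1L))
class(tbl) <- c("my_freq_tbl", class(tbl))

# Custom print method: show a short header and only the first nmax rows.
print.my_freq_tbl <- function(x, nmax = 2, ...) {
  cat("Frequency table (showing", min(nmax, nrow(x)), "of", nrow(x), "rows)\n")
  # as.data.frame() drops the extra class, so the default printing is used.
  print(head(as.data.frame(x), nmax), ...)
  invisible(x)
}

print(tbl)   # prints a limited view
dim(tbl)     # the object itself still holds all rows
```

The print method only affects what is displayed; subsetting, dim and other data.frame operations keep working on the full table.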

Additional parameters

Parameter na.rm

With the na.rm parameter, you can include NA values in the frequency table by setting na.rm = FALSE. It defaults to TRUE, but the number of NA values is always shown in the header.
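The effect of counting or dropping NA values can be illustrated with base R. This sketch uses table with its useNA argument as an analogue; it is not the freq implementation, and the amox vector is a hypothetical stand-in rebuilt from the counts shown in the rsi example above.

```r
# Hypothetical stand-in for the amox column: 1000 results and 1000 NA values.
amox <- c(rep("S", 335), rep("I", 3), rep("R", 662), rep(NA, 1000))

without_na <- table(amox)                   # NA values are dropped
with_na    <- table(amox, useNA = "ifany")  # NA values get their own row

print(without_na)
print(with_na)
```

Including NA values changes the denominator, so the percentages of the other items shrink accordingly.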

Parameter row.names

By default, frequency tables show row indices. To remove them, use row.names = FALSE.
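Base R offers a comparable switch when printing an ordinary data.frame, which may help to picture the effect. The data frame below is a hypothetical example built from the gender counts above; the row.names argument used here belongs to the print method for data frames and is shown only as an analogue.

```r
# Hypothetical two-row table built from the gender counts above.
df <- data.frame(item = c("M", "F"), count = c(1031L, 969L))

print(df)                      # with row indices
print(df, row.names = FALSE)   # without row indices
```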

Parameter markdown

The markdown parameter can be used in reports created with R Markdown. This will always print all rows.


AMR, (c) 2018, https://github.com/msberends/AMR

Licensed under the GNU General Public License v2.0.