Frequency tables (or frequency distributions) are summaries of the distribution of values in a sample. With the freq function, you can create univariate frequency tables. Multiple variables will be pasted into one variable, so it forces a univariate distribution. We take the septic_patients dataset (included in this AMR package) as an example.
To quickly review the content of a single variable, you can select that variable in several ways. Let's say we want to get the frequencies of the sex variable of the septic_patients dataset:
# just using base R
freq(septic_patients$sex)
# using base R to select the variable and pass it on with a pipe from the dplyr package
septic_patients$sex %>% freq()
# do it all with pipes, using the `select` function from the dplyr package
septic_patients %>%
  select(sex) %>%
  freq()
# or the preferred way: using a pipe to pass the variable on to the freq function
septic_patients %>% freq(sex) # this also shows 'sex' in the title
All of these lead to the following table:
freq(septic_patients$sex)
# Frequency table
# Class: character
# Length: 2000 (of which NA: 0 = 0.00%)
# Unique: 2
#
#      Item    Count   Percent   Cum. Count   Cum. Percent
# ---  -----  ------  --------  -----------  -------------
# 1    M        1032     51.6%         1032          51.6%
# 2    V         968     48.4%         2000         100.0%
This immediately shows the class of the variable, its length and availability (i.e. the number of NA values), the number of unique values and (most importantly) that among septic patients men are more prevalent than women.
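To make the columns of such a table concrete, here is a minimal base R sketch that rebuilds them by hand. It is not how freq is implemented; the toy vector x and the column names are made up for illustration.

```r
# Toy data standing in for a character variable (made up, not the package data)
x <- c("M", "F", "M", "M", "F", NA, "M")

# table() drops NA by default; sort so the most frequent item comes first
counts <- sort(table(x), decreasing = TRUE)

freq_tbl <- data.frame(
  item    = names(counts),
  count   = as.integer(counts),
  percent = as.integer(counts) / sum(counts),
  stringsAsFactors = FALSE
)
freq_tbl$cum_count   <- cumsum(freq_tbl$count)
freq_tbl$cum_percent <- cumsum(freq_tbl$percent)

print(freq_tbl)

# Header-style information: total length, number of NA, number of unique values
cat("Length:", length(x), "(of which NA:", sum(is.na(x)), ")\n")
cat("Unique:", length(unique(x[!is.na(x)])), "\n")
```

The cumulative columns are plain running sums, so the last cumulative percentage is always 100% of the non-missing values.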
Multiple variables will be pasted into one variable to review individual cases, keeping a univariate frequency table.
For illustration, we could add some more variables to the septic_patients dataset to learn about bacterial properties.
After that step, all variables of the microorganisms dataset have been joined to the septic_patients dataset; the joined result is referred to as my_patients in the rest of this section. The microorganisms dataset consists of the following variables:
colnames(microorganisms)
# [1] "bactid" "bactsys" "family" "genus"
# [5] "species" "subspecies" "fullname" "type"
# [9] "gramstain" "aerobic" "type_nl" "gramstain_nl"
If we compare the dimensions of the old and new dataset, we can see that 11 variables were added: the 12 variables listed above, minus the one that is only used to match the rows of both datasets.
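The join and the dimension check can be sketched with base R on toy data. This assumes a left join on a shared key column; the data frames, the codes "X1" and "X2" and the use of merge are illustrative choices, not the package's own join helper.

```r
# Toy stand-ins for a patient table and an organism lookup table (made-up values)
patients <- data.frame(
  patient_id = c(1, 2, 3),
  bactid     = c("X1", "X2", "X1"),
  stringsAsFactors = FALSE
)
organisms <- data.frame(
  bactid  = c("X1", "X2"),
  genus   = c("Escherichia", "Staphylococcus"),
  species = c("coli", "aureus"),
  stringsAsFactors = FALSE
)

# Left join: keep every patient row, add the organism columns
joined <- merge(patients, organisms, by = "bactid", all.x = TRUE)

dim(patients)  # rows and columns before the join
dim(joined)    # same number of rows, more columns

# The key column is shared, so it is not counted as an added variable
added <- ncol(joined) - ncol(patients)
added == ncol(organisms) - 1
```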
So now the genus and species variables are available. A frequency table of these combined variables can be created like this:
my_patients %>% freq(genus, species)
# Frequency table of `genus` and `species`
# Columns: 2
# Length: 2000 (of which NA: 0 = 0.00%)
# Unique: 110
#
#      Item                                Count   Percent   Cum. Count   Cum. Percent
# ---  ---------------------------------  ------  --------  -----------  -------------
# 1    Escherichia coli                      460     23.0%          460          23.0%
# 2    Staphylococcus coagulase negative     309     15.4%          769          38.5%
# 3    Staphylococcus aureus                 233     11.7%         1002          50.1%
# 4    Staphylococcus epidermidis            174      8.7%         1176          58.8%
# 5    Streptococcus pneumoniae              102      5.1%         1278          63.9%
# 6    Staphylococcus hominis                 80      4.0%         1358          67.9%
# 7    Klebsiella pneumoniae                  57      2.8%         1415          70.8%
# 8    Enterococcus faecalis                  39      2.0%         1454          72.7%
# 9    Proteus mirabilis                      35      1.8%         1489          74.5%
# 10   Pseudomonas aeruginosa                 30      1.5%         1519          75.9%
# 11   Serratia marcescens                    23      1.1%         1542          77.1%
# 12   Enterobacter cloacae                   22      1.1%         1564          78.2%
# 13   Enterococcus faecium                   21      1.1%         1585          79.2%
# 14   Staphylococcus capitis                 21      1.1%         1606          80.3%
# 15   Bacteroides fragilis                   20      1.0%         1626          81.3%
# [ reached getOption("max.print.freq") -- omitted 95 entries, n = 374 (18.7%) ]
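The idea of pasting several variables into one before counting can be sketched in base R. This is only an illustration of the principle on a made-up data frame; the separator and the exact pasting rules used by freq are an assumption here.

```r
# Made-up example rows with two descriptive columns
toy <- data.frame(
  genus   = c("Escherichia", "Staphylococcus", "Escherichia", "Staphylococcus"),
  species = c("coli", "aureus", "coli", "epidermidis"),
  stringsAsFactors = FALSE
)

# Paste both columns into a single value per row ...
combined <- paste(toy$genus, toy$species)

# ... and count that single variable, most frequent first
counts <- sort(table(combined), decreasing = TRUE)
print(counts)
```

Because each row is reduced to one pasted value, the result is still a univariate frequency table.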
Frequency tables can be created from any input.
In case of numeric values (like integers, doubles, etc.) additional descriptive statistics will be calculated and shown in the header:
# get age distribution of unique patients
septic_patients %>%
  distinct(patient_id, .keep_all = TRUE) %>%
  freq(age, nmax = 5)
# Frequency table of `age`
# Class: numeric
# Length: 989 (of which NA: 0 = 0.00%)
# Unique: 73
#
# Mean: 71
# Std. dev.: 14 (CV: 0.2, MAD: 13)
# Five-Num: 14 | 63 | 74 | 82 | 97 (IQR: 19, CQV: 0.13)
# Outliers: 15 (unique: 12)
#
#      Item    Count   Percent   Cum. Count   Cum. Percent
# ---  -----  ------  --------  -----------  -------------
# 1    83         44      4.4%           44           4.4%
# 2    76         43      4.3%           87           8.8%
# 3    75         38      3.8%          125          12.6%
# 4    78         33      3.3%          158          16.0%
# 5    82         33      3.3%          191          19.3%
# [ reached `nmax = 5` -- omitted 68 entries, n = 798 (80.7%) ]
So the following properties are determined, where NA values are always ignored:
Mean
Standard deviation
Coefficient of variation (CV), the standard deviation divided by the mean
Tukey's five-number summary (min, Q1, median, Q3, max)
Coefficient of quartile variation (CQV, sometimes called coefficient of dispersion), calculated as (Q3 - Q1) / (Q3 + Q1) using quantile with type = 6 as quantile algorithm to comply with SPSS standards
Outliers (total count and unique count)
So for example, the above frequency table quickly shows the median age of patients being 74.
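These header statistics can be reproduced with base R functions. The sketch below uses a made-up vector and follows the formulas listed above; using boxplot.stats for the outliers is an assumption for illustration, not something this text confirms about freq.

```r
# Made-up numeric sample with one missing value
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 30, NA)
x <- x[!is.na(x)]                      # NA values are ignored

m   <- mean(x)
s   <- sd(x)
cv  <- s / m                           # coefficient of variation
five <- fivenum(x)                     # Tukey: min, Q1 (hinge), median, Q3 (hinge), max

# Quartiles with quantile algorithm type 6, as described for the CQV
q   <- quantile(x, probs = c(0.25, 0.75), type = 6, names = FALSE)
cqv <- (q[2] - q[1]) / (q[2] + q[1])   # coefficient of quartile variation

# Outliers (assumption: the boxplot rule of 1.5 times the hinge spread)
out <- boxplot.stats(x)$out

cat("Mean:", m, " Std. dev.:", s, " CV:", cv, "\n")
cat("Five-Num:", paste(five, collapse = " | "), " CQV:", cqv, "\n")
cat("Outliers:", length(out), "(unique:", length(unique(out)), ")\n")
```

Note that fivenum and quantile(type = 6) can give slightly different quartiles for the same data, which is why the CQV is computed from its own quantile call.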
Frequencies of factors will be sorted on factor level instead of item count by default. This can be changed with the sort.count parameter. Frequency tables of factors always show the factor level as an additional last column.
sort.count is TRUE by default, except for factors. Compare this default behaviour…
septic_patients %>%
  freq(hospital_id)
# Frequency table of `hospital_id`
# Class: factor
# Length: 2000 (of which NA: 0 = 0.00%)
# Unique: 4
#
#      Item    Count   Percent   Cum. Count   Cum. Percent   (Factor Level)
# ---  -----  ------  --------  -----------  -------------  ---------------
# 1    A         319     16.0%          319          16.0%                1
# 2    B         661     33.1%          980          49.0%                2
# 3    C         256     12.8%         1236          61.8%                3
# 4    D         764     38.2%         2000         100.0%                4
… with this, where items are now sorted on count:
septic_patients %>%
  freq(hospital_id, sort.count = TRUE)
# Frequency table of `hospital_id`
# Class: factor
# Length: 2000 (of which NA: 0 = 0.00%)
# Unique: 4
#
#      Item    Count   Percent   Cum. Count   Cum. Percent   (Factor Level)
# ---  -----  ------  --------  -----------  -------------  ---------------
# 1    D         764     38.2%          764          38.2%                4
# 2    B         661     33.1%         1425          71.2%                2
# 3    A         319     16.0%         1744          87.2%                1
# 4    C         256     12.8%         2000         100.0%                3
All classes will be printed in the header. Variables with the new rsi class of this AMR package are actually ordered factors and have three classes (look at Class in the header):
septic_patients %>%
  select(amox) %>%
  freq()
# Frequency table
# Class: factor > ordered > rsi
# Length: 2000 (of which NA: 1002 = 50.1%)
# Unique: 3
#
# %IR: 66.33%
# Ratio SIR: 1.0 : 0.009 : 2.0
#
#      Item    Count   Percent   Cum. Count   Cum. Percent   (Factor Level)
# ---  -----  ------  --------  -----------  -------------  ---------------
# 1    S         336     33.7%          336          33.7%                1
# 2    I           3      0.3%          339          34.0%                2
# 3    R         659     66.0%          998         100.0%                3
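The two extra header lines can be checked by hand from the counts in this table. The sketch assumes that %IR is the share of I and R among the non-missing results and that the ratio is expressed relative to S; both readings are inferred from the numbers shown, not from the package source.

```r
# Counts taken from the table above
counts <- c(S = 336, I = 3, R = 659)

# Share of intermediate and resistant results among all non-missing results
pct_ir <- sum(counts[c("I", "R")]) / sum(counts)
round(100 * pct_ir, 2)

# Ratio of S : I : R, scaled so that S equals 1
ratio <- counts / counts[["S"]]
signif(ratio, 2)
```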
Frequencies of dates will show the oldest and newest date in the data, and the number of days between them:
septic_patients %>%
  select(date) %>%
  freq(nmax = 5)
# Frequency table
# Class: Date
# Length: 2000 (of which NA: 0 = 0.00%)
# Unique: 1151
#
# Oldest: 2 januari 2002
# Newest: 28 december 2017 (+5839)
# Median: 7 augustus 2009 (~48%)
#
#      Item          Count   Percent   Cum. Count   Cum. Percent
# ---  -----------  ------  --------  -----------  -------------
# 1    2016-05-21       10      0.5%           10           0.5%
# 2    2004-11-15        8      0.4%           18           0.9%
# 3    2013-07-29        8      0.4%           26           1.3%
# 4    2017-06-12        8      0.4%           34           1.7%
# 5    2015-11-19        7      0.4%           41           2.1%
# [ reached `nmax = 5` -- omitted 1146 entries, n = 1959 (98.0%) ]
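The date range in the header is simple date arithmetic. A minimal base R sketch, using the oldest and newest dates from the output above plus a made-up missing value:

```r
# Oldest and newest dates from the header, one date in between, and one NA
dates <- as.Date(c("2002-01-02", "2009-08-07", "2017-12-28", NA))

oldest <- min(dates, na.rm = TRUE)
newest <- max(dates, na.rm = TRUE)

# Subtracting two Date objects gives a difference in days
span_days <- as.integer(newest - oldest)

cat("Oldest:", format(oldest), "\n")
cat("Newest:", format(newest), paste0("(+", span_days, ")"), "\n")
```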
A frequency table is actually a regular data.frame, with the exception that it contains an additional class.
Because of this additional class, a frequency table prints like the examples above. But the object itself contains the complete table without a row limitation.
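How an extra class can change printing while leaving the data untouched can be sketched with a small S3 example. The class name toy_freq, the nmax argument of the print method and the data are all hypothetical; this illustrates the mechanism only and is not the package's own class or print method.

```r
# Build a small complete frequency table from made-up data
counts <- sort(table(c(rep("a", 5), rep("b", 3), rep("c", 2), "d")), decreasing = TRUE)
ft <- data.frame(item = names(counts), count = as.integer(counts),
                 stringsAsFactors = FALSE)

# Add a hypothetical extra class in front of "data.frame"
class(ft) <- c("toy_freq", class(ft))

# Print method that shows only the first nmax rows
print.toy_freq <- function(x, nmax = 2, ...) {
  plain <- x
  class(plain) <- "data.frame"        # drop the extra class to reuse the default printer
  print(utils::head(plain, nmax), ...)
  omitted <- nrow(plain) - min(nmax, nrow(plain))
  if (omitted > 0) cat("[ omitted", omitted, "entries ]\n")
  invisible(x)
}

print(ft)   # prints two rows and a note about the omitted entries
dim(ft)     # the object itself still holds all rows
```

Since the object still inherits from data.frame, regular data frame operations such as dim, subsetting or filtering keep working on the full table.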
na.rm
With the na.rm parameter (defaults to TRUE; the number of NA values is always shown in the header regardless), you can include NA values in the frequency table:
septic_patients %>%
  freq(amox, na.rm = FALSE)
# Frequency table of `amox`
# Class: factor > ordered > rsi
# Length: 3002 (of which NA: 1002 = 33.38%)
# Unique: 4
#
# %IR: NA
# Ratio SIR: 1.0 : NA : NA
#
#      Item    Count   Percent   Cum. Count   Cum. Percent   (Factor Level)
# ---  -----  ------  --------  -----------  -------------  ---------------
# 1    S         336     16.8%          336          16.8%                1
# 2    I           3      0.2%          339          17.0%                2
# 3    R         659     33.0%          998          49.9%                3
# 4    <NA>     1002     50.1%         2000         100.0%             <NA>
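The effect of keeping or dropping missing values on the counts and percentages can be sketched with base R's table function. The vector is made up, and useNA is base R's switch for this, used here only as an analogy to na.rm.

```r
# Made-up results with two missing values
x <- c("S", "R", NA, "S", NA, "I")

# Default: missing values are left out, percentages are based on 4 values
without_na <- table(x)
round(100 * prop.table(without_na), 1)

# Including missing values: NA becomes its own item, percentages are based on 6 values
with_na <- table(x, useNA = "always")
round(100 * prop.table(with_na), 1)
```

Including NA lowers the percentage of every other item, which matches the drop from 33.7% to 16.8% for S in the two amox tables shown in this document.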
row.names
By default, frequency tables show row indices. To remove them, use row.names = FALSE:
septic_patients %>%
  freq(hospital_id, row.names = FALSE)
# Frequency table of `hospital_id`
# Class: factor
# Length: 2000 (of which NA: 0 = 0.00%)
# Unique: 4
#
# Item    Count   Percent   Cum. Count   Cum. Percent   (Factor Level)
# -----  ------  --------  -----------  -------------  ---------------
# A         319     16.0%          319          16.0%                1
# B         661     33.1%          980          49.0%                2
# C         256     12.8%         1236          61.8%                3
# D         764     38.2%         2000         100.0%                4
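Base R's print method for data frames has an argument with the same name and a comparable effect, which gives a quick way to see what dropping the row indices does. The data frame below reuses two rows from the table above; that freq passes this argument on to the data frame printer is not claimed here.

```r
# Two rows taken from the hospital_id table above
df <- data.frame(item = c("A", "B"), count = c(319L, 661L),
                 stringsAsFactors = FALSE)

print(df)                     # with row indices 1 and 2 in front
print(df, row.names = FALSE)  # same content, without the index column
```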
markdown
The markdown parameter can be used in reports created with R Markdown. This will always print all rows:
septic_patients %>%
  freq(hospital_id, markdown = TRUE)
# Frequency table of `hospital_id`
#
# Class: factor
#
# Length: 2000 (of which NA: 0 = 0.00%)
#
# Unique: 4
#
# | |Item | Count| Percent| Cum. Count| Cum. Percent| (Factor Level)|
# |:--|:----|-----:|-------:|----------:|------------:|--------------:|
# |1 |A | 319| 16.0%| 319| 16.0%| 1|
# |2 |B | 661| 33.1%| 980| 49.0%| 2|
# |3 |C | 256| 12.8%| 1236| 61.8%| 3|
# |4 |D | 764| 38.2%| 2000| 100.0%| 4|
AMR, (c) 2018, https://github.com/msberends/AMR
Licensed under the GNU General Public License v2.0.