Frequency tables (or frequency distributions) are summaries of the distribution of values in a sample. With the freq function, you can create univariate frequency tables. When multiple variables are given, their values are pasted together into one variable, so the result is always a univariate distribution. We take the septic_patients dataset (included in this AMR package) as an example.
To quickly review the content of a single variable, you can select this variable in various ways. Let’s say we want to get the frequencies of the gender variable of the septic_patients dataset:
septic_patients %>% freq(gender)
# Frequency table of `gender`
# Class: character
# Length: 2000 (of which NA: 0 = 0.00%)
# Unique: 2
#
# Shortest: 1
# Longest: 1
#
# Item Count Percent Cum. Count Cum. Percent
# --- ----- ------ -------- ----------- -------------
# 1 M 1031 51.5% 1031 51.5%
# 2 F 969 48.4% 2000 100.0%
This immediately shows the class of the variable, its length and availability (i.e. the number of NA values), the number of unique values and (most importantly) that among septic patients men are more prevalent than women.
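The columns of such a table can be reproduced in a few lines of base R. The following is only a sketch of the idea, not the package's implementation; the helper name simple_freq and the example data are made up for illustration.

```r
# Hypothetical helper (not part of the AMR package): builds the same
# columns as the table above for a single vector.
simple_freq <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)  # table() drops NA by default
  n <- as.integer(counts)
  data.frame(
    item        = names(counts),
    count       = n,
    percent     = n / sum(n),
    cum_count   = cumsum(n),
    cum_percent = cumsum(n) / sum(n),
    stringsAsFactors = FALSE
  )
}

# Made-up example data
gender <- c("M", "F", "M", "M", "F")
res <- simple_freq(gender)
res
```

The cumulative columns always end at the total number of non-missing values and at 100%.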
Multiple variables will be pasted into one variable to review individual cases, keeping a univariate frequency table.
For illustration, we could add some more variables to the septic_patients dataset to learn about bacterial properties:
my_patients <- septic_patients %>% left_join_microorganisms()
Now all variables of the microorganisms dataset have been joined to the septic_patients dataset. The microorganisms dataset consists of the following variables:
colnames(microorganisms)
# [1] "mo" "tsn" "genus" "species" "subspecies"
# [6] "fullname" "family" "order" "class" "phylum"
# [11] "subkingdom" "gramstain" "type" "prevalence" "ref"
If we compare the dimensions of the old and new dataset, we can see that these 14 variables were added:
dim(septic_patients)
dim(my_patients)
So now the genus and species variables are available. A frequency table of these combined variables can be created like this:
my_patients %>% freq(genus, species)
# Frequency table of `genus` and `species`
# Columns: 2
# Length: 2000 (of which NA: 0 = 0.00%)
# Unique: 96
#
# Shortest: 12
# Longest: 34
#
# Item Count Percent Cum. Count Cum. Percent
# --- ---------------------------------- ------ -------- ----------- -------------
# 1 Escherichia coli 467 23.4% 467 23.4%
# 2 Staphylococcus coagulase negative 313 15.7% 780 39.0%
# 3 Staphylococcus aureus 235 11.8% 1015 50.7%
# 4 Staphylococcus epidermidis 174 8.7% 1189 59.5%
# 5 Streptococcus pneumoniae 117 5.9% 1306 65.3%
# 6 Staphylococcus hominis 81 4.0% 1387 69.3%
# 7 Klebsiella pneumoniae 58 2.9% 1445 72.2%
# 8 Enterococcus faecalis 39 2.0% 1484 74.2%
# 9 Proteus mirabilis 36 1.8% 1520 76.0%
# 10 Pseudomonas aeruginosa 30 1.5% 1550 77.5%
# 11 Serratia marcescens 25 1.2% 1575 78.8%
# 12 Enterobacter cloacae 23 1.1% 1598 79.9%
# 13 Enterococcus faecium 21 1.1% 1619 81.0%
# 14 Staphylococcus capitis 21 1.1% 1640 82.0%
# 15 Bacteroides fragilis 20 1.0% 1660 83.0%
# [ reached getOption("max.print.freq") -- omitted 81 entries, n = 340 (17.0%) ]
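The pasting of several variables into one can be pictured with base R as follows. This is a minimal sketch with made-up data; using a single space as separator is an assumption based on the items shown above.

```r
# Made-up data; the column names mirror the example above
df <- data.frame(
  genus   = c("Escherichia", "Staphylococcus", "Escherichia", "Staphylococcus"),
  species = c("coli", "aureus", "coli", "epidermidis"),
  stringsAsFactors = FALSE
)

# Paste both columns into one variable, then count that single variable
combined <- paste(df$genus, df$species)
counts <- sort(table(combined), decreasing = TRUE)
counts
```

Because the combined value is one string per row, the result stays a univariate table with one row per unique combination.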
Frequency tables can be created of any input.
In case of numeric values (like integers, doubles, etc.) additional descriptive statistics will be calculated and shown in the header:
# get age distribution of unique patients
septic_patients %>%
  distinct(patient_id, .keep_all = TRUE) %>%
  freq(age, nmax = 5)
# Frequency table of `age`
# Class: numeric
# Length: 981 (of which NA: 0 = 0.00%)
# Unique: 73
#
# Mean: 71
# Std. dev.: 14 (CV: 0.2, MAD: 13)
# Five-Num: 14 | 63 | 74 | 82 | 97 (IQR: 19, CQV: 0.13)
# Outliers: 15 (unique: 12)
#
# Item Count Percent Cum. Count Cum. Percent
# --- ----- ------ -------- ----------- -------------
# 1 83 44 4.5% 44 4.5%
# 2 76 43 4.4% 87 8.9%
# 3 75 37 3.8% 124 12.6%
# 4 82 33 3.4% 157 16.0%
# 5 78 32 3.3% 189 19.3%
# [ reached `nmax = 5` -- omitted 68 entries, n = 792 (80.7%) ]
So the following properties are determined, where NA values are always ignored:
Mean
Standard deviation
Coefficient of variation (CV), the standard deviation divided by the mean
Tukey's five-number summary (min, Q1, median, Q3, max)
Coefficient of quartile variation (CQV, sometimes called coefficient of dispersion), calculated as (Q3 - Q1) / (Q3 + Q1), using quantile with type = 6 as quantile algorithm to comply with SPSS standards
Outliers (total count and unique count)
So for example, the above frequency table quickly shows the median age of patients being 74.
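The properties listed above can be computed with base R functions. The sketch below uses made-up ages. Using boxplot.stats for the outliers is an assumption, since the text does not say which outlier rule the package applies.

```r
x <- c(14, 63, 70, 74, 78, 82, 97, NA)  # made-up ages, including one NA
x <- x[!is.na(x)]                       # NA values are ignored

avg      <- mean(x)
std_dev  <- sd(x)
cv       <- std_dev / avg               # coefficient of variation
five_num <- fivenum(x)                  # min, Q1, median, Q3, max

# Quartiles with quantile algorithm type 6, as described above
q   <- quantile(x, probs = c(0.25, 0.75), type = 6, names = FALSE)
cqv <- (q[2] - q[1]) / (q[2] + q[1])    # coefficient of quartile variation

# Assumption: outliers defined by the boxplot rule (1.5 times the hinge spread)
out <- boxplot.stats(x)$out
```

With type = 6 the quartiles of these seven values fall exactly on the second and sixth ordered values, so the CQV is (82 - 63) / (82 + 63).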
Frequencies of factors will be sorted on factor level instead of item count by default. This can be changed with the sort.count parameter. Frequency tables of factors always show the factor level as an additional last column.
sort.count is TRUE by default, except for factors. Compare this default behaviour…
septic_patients %>%
  freq(hospital_id)
# Frequency table of `hospital_id`
# Class: factor (numeric)
# Length: 2000 (of which NA: 0 = 0.00%)
# Unique: 4
#
# Item Count Percent Cum. Count Cum. Percent (Factor Level)
# --- ----- ------ -------- ----------- ------------- ---------------
# 1 A 321 16.1% 321 16.1% 1
# 2 B 663 33.1% 984 49.2% 2
# 3 C 254 12.7% 1238 61.9% 3
# 4 D 762 38.1% 2000 100.0% 4
… with this, where items are now sorted on count:
septic_patients %>%
  freq(hospital_id, sort.count = TRUE)
# Frequency table of `hospital_id`
# Class: factor (numeric)
# Length: 2000 (of which NA: 0 = 0.00%)
# Unique: 4
#
# Item Count Percent Cum. Count Cum. Percent (Factor Level)
# --- ----- ------ -------- ----------- ------------- ---------------
# 1 D 762 38.1% 762 38.1% 4
# 2 B 663 33.1% 1425 71.2% 2
# 3 A 321 16.1% 1746 87.3% 1
# 4 C 254 12.7% 2000 100.0% 3
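The difference between the two orderings can be seen with a plain factor in base R. This is a sketch with made-up data, not the package's code.

```r
# Made-up factor with levels A, B, C, D (2, 3, 1 and 4 observations)
hospital_id <- factor(rep(c("A", "B", "C", "D"), times = c(2, 3, 1, 4)))

by_level <- table(hospital_id)                # ordered by factor level
by_count <- sort(by_level, decreasing = TRUE) # ordered by item count

names(by_level)
names(by_count)
```

table() keeps the level order of a factor, which is why an explicit sort is needed to get the most frequent item on top.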
All classes will be printed in the header. Variables with the new rsi class of this AMR package are actually ordered factors and have three classes (look at Class in the header):
septic_patients %>%
  select(amox) %>%
  freq()
# Frequency table
# Class: factor > ordered > rsi (numeric)
# Length: 2000 (of which NA: 1000 = 50.00%)
# Unique: 3
#
# %IR: 66.5%
# Ratio SIR: 1.0 : 0.009 : 2.0
#
# Item Count Percent Cum. Count Cum. Percent (Factor Level)
# --- ----- ------ -------- ----------- ------------- ---------------
# 1 S 335 33.5% 335 33.5% 1
# 2 I 3 0.3% 338 33.8% 2
# 3 R 662 66.2% 1000 100.0% 3
Frequencies of dates will show the oldest and newest date in the data, and the number of days between them:
septic_patients %>%
  select(date) %>%
  freq(nmax = 5)
# Frequency table
# Class: Date (numeric)
# Length: 2000 (of which NA: 0 = 0.00%)
# Unique: 1140
#
# Oldest: 2 January 2002
# Newest: 28 December 2017 (+5839)
# Median: 31 July 2009 (~47%)
#
# Item Count Percent Cum. Count Cum. Percent
# --- ----------- ------ -------- ----------- -------------
# 1 2016-05-21 10 0.5% 10 0.5%
# 2 2004-11-15 8 0.4% 18 0.9%
# 3 2013-07-29 8 0.4% 26 1.3%
# 4 2017-06-12 8 0.4% 34 1.7%
# 5 2015-11-19 7 0.4% 41 2.1%
# [ reached `nmax = 5` -- omitted 1135 entries, n = 1959 (98.0%) ]
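The date properties in the header can be derived with base R. The sketch below uses only three dates, taken from the header above; it is not the package's code.

```r
dates <- as.Date(c("2002-01-02", "2009-07-31", "2017-12-28"))

oldest <- min(dates)
newest <- max(dates)

# Number of days between the oldest and the newest date
span_days <- as.integer(difftime(newest, oldest, units = "days"))

# Month names printed by format() follow the locale of the R session
format(oldest, "%d %B %Y")
```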
A frequency table is actually a regular data.frame, with the exception that it contains an additional class.
Because of this additional class, a frequency table prints like the examples above. But the object itself contains the complete table without a row limitation.
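How an additional class can change printing without changing the data can be sketched with an S3 print method. The class name limited_tbl and both functions are hypothetical and only serve this illustration; the class name used by the package is not stated here.

```r
# Hypothetical class, chosen for this sketch only
as_limited_tbl <- function(df) {
  class(df) <- c("limited_tbl", class(df))
  df
}

# Print method: shows at most nmax rows, the object itself is untouched
print.limited_tbl <- function(x, nmax = 3, ...) {
  plain <- x
  class(plain) <- "data.frame"      # drop the extra class for printing
  print(head(plain, nmax), ...)
  if (nrow(plain) > nmax) {
    cat("[ omitted", nrow(plain) - nmax, "rows ]\n")
  }
  invisible(x)
}

tbl <- as_limited_tbl(data.frame(item = letters[1:5], count = 5:1))
print(tbl)   # prints the first 3 rows and a note
nrow(tbl)    # the object still holds all 5 rows
```

Because the object still inherits from data.frame, it can be filtered, joined or plotted like any other data.frame.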
na.rm
With the na.rm parameter, you can include NA values in the frequency table. It defaults to TRUE, which leaves them out of the table, although the number of NA values is always shown in the header:
septic_patients %>%
  freq(amox, na.rm = FALSE)
# Frequency table of `amox`
# Class: factor > ordered > rsi (numeric)
# Length: 2000 (of which NA: 1000 = 50.00%)
# Unique: 4
#
# %IR: NA
# Ratio SIR: 1.0 : NA : NA
#
# Item Count Percent Cum. Count Cum. Percent (Factor Level)
# --- ----- ------ -------- ----------- ------------- ---------------
# 1 S 335 16.8% 335 16.8% 1
# 2 I 3 0.2% 338 16.9% 2
# 3 R 662 33.1% 1000 50.0% 3
# 4 <NA> 1000 50.0% 2000 100.0% <NA>
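The effect of keeping NA values on the counts and percentages can be reproduced with table() in base R. This is a sketch with made-up values, not the package's code.

```r
amox <- c("S", "R", NA, "R", "I", NA)   # made-up values

without_na <- table(amox)                  # NA values dropped (default)
with_na    <- table(amox, useNA = "ifany") # NA values counted as an item

prop.table(without_na)   # percentages of the 4 available values
prop.table(with_na)      # percentages of all 6 values
```

Including NA values enlarges the denominator, so the percentages of all other items drop accordingly.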
row.names
By default, frequency tables show row indices. To remove them, use row.names = FALSE:
septic_patients %>%
  freq(hospital_id, row.names = FALSE)
# Frequency table of `hospital_id`
# Class: factor (numeric)
# Length: 2000 (of which NA: 0 = 0.00%)
# Unique: 4
#
# Item Count Percent Cum. Count Cum. Percent (Factor Level)
# ----- ------ -------- ----------- ------------- ---------------
# A 321 16.1% 321 16.1% 1
# B 663 33.1% 984 49.2% 2
# C 254 12.7% 1238 61.9% 3
# D 762 38.1% 2000 100.0% 4
markdown
The markdown parameter can be used in reports created with R Markdown. This will always print all rows:
septic_patients %>%
  freq(hospital_id, markdown = TRUE)
# Frequency table of `hospital_id`
#
# Class: factor (numeric)
#
# Length: 2000 (of which NA: 0 = 0.00%)
#
# Unique: 4
#
# | |Item | Count| Percent| Cum. Count| Cum. Percent| (Factor Level)|
# |:--|:----|-----:|-------:|----------:|------------:|--------------:|
# |1 |A | 321| 16.1%| 321| 16.1%| 1|
# |2 |B | 663| 33.1%| 984| 49.2%| 2|
# |3 |C | 254| 12.7%| 1238| 61.9%| 3|
# |4 |D | 762| 38.1%| 2000| 100.0%| 4|
AMR, (c) 2018, https://github.com/msberends/AMR
Licensed under the GNU General Public License v2.0.