Beta diversity is a measure of how different two samples are. The
different metrics described in this vignette quantify that difference,
referred to as the “distance” or “dissimilarity” between a pair of
samples. The distance is typically 0 for identical samples
and 1 for completely different samples.
We will use the ex_counts feature table included with
ecodive. It contains the number of observations of each bacterial genus
in each sample. In the text below, you can substitute the word ‘genera’
with the feature of interest in your own data.
library(ecodive)
counts <- rarefy(ex_counts)
t(counts)
#> Saliva Gums Nose Stool
#> Streptococcus 162 309 6 1
#> Bacteroides 2 2 0 341
#> Corynebacterium 0 0 171 1
#> Haemophilus 180 34 0 1
#> Propionibacterium 1 0 82 0
#> Staphylococcus 0 0 86 1

Looking at the matrix above, you can see that saliva and gums are similar, while saliva and stool are quite different.
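That impression can be checked by hand. The Python sketch below (not using ecodive) recomputes the Bray-Curtis dissimilarity, defined later in this vignette as the sum of absolute count differences over the total counts, on the rarefied sample vectors transcribed from the table above:

```python
# Per-genus rarefied counts, transcribed from t(counts) above:
# Streptococcus, Bacteroides, Corynebacterium, Haemophilus,
# Propionibacterium, Staphylococcus.
saliva = [162, 2, 0, 180, 1, 0]
gums   = [309, 2, 0, 34, 0, 0]
stool  = [1, 341, 1, 1, 0, 1]

def bray_curtis(x, y):
    """Sum of absolute differences divided by the sum of all counts."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den

print(round(bray_curtis(saliva, gums), 7))   # similar samples -> small distance
print(round(bray_curtis(saliva, stool), 7))  # dissimilar samples -> near 1
```

The two values match the `bray(counts)` output shown later: about 0.43 for saliva vs. gums, and about 0.99 for saliva vs. stool.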
Before selecting a metric, it is important to distinguish between weighted metrics, which use abundance values, and unweighted metrics, which use only presence/absence.
You can consult list_metrics() to see which category a
specific metric falls into or to list all available options
programmatically.
list_metrics('beta', 'id', weighted = FALSE)
#> [1] "sorensen" "hamming" "jaccard" "ochiai" "unweighted_unifrac"
list_metrics('beta', 'id', weighted = TRUE)
#> [1] "aitchison" "bhattacharyya" "bray"
#> [4] "canberra" "chebyshev" "chord"
#> [7] "clark" "divergence" "euclidean"
#> [10] "generalized_unifrac" "gower" "hellinger"
#> [13] "horn" "jensen" "jsd"
#> [16] "lorentzian" "manhattan" "matusita"
#> [19] "minkowski" "morisita" "motyka"
#> [22] "normalized_unifrac" "psym_chisq" "soergel"
#> [25] "squared_chisq" "squared_chord" "squared_euclidean"
#> [28] "topsoe" "variance_adjusted_unifrac" "wave_hedges"
#> [31] "weighted_unifrac"

The following tables detail the mathematical definitions for the
metrics available in ecodive.
Given \(n\) genera, with counts \(X_i\) and \(Y_i\) (totals \(X_T\) and \(Y_T\)) or proportional abundances \(P_i\) and \(Q_i\) for a pair of samples, where \(X_L\) and \(Y_L\) are the means of \(\ln{X}\) and \(\ln{Y}\), and \(R_i\) is the range of genus \(i\) across all samples:
| Metric | Formula |
|---|---|
| Aitchison distance `aitchison()` | \(\sqrt{\sum_{i=1}^{n} [(\ln{X_i} - X_L) - (\ln{Y_i} - Y_L)]^2}\) |
| Bhattacharyya distance `bhattacharyya()` | \(-\ln{\sum_{i=1}^{n}\sqrt{P_{i}Q_{i}}}\) |
| Bray-Curtis dissimilarity `bray()` | \(\frac{\sum_{i=1}^{n} \lvert X_i - Y_i \rvert}{\sum_{i=1}^{n} (X_i + Y_i)}\) |
| Canberra distance `canberra()` | \(\sum_{i=1}^{n} \frac{\lvert X_i - Y_i \rvert}{X_i + Y_i}\) |
| Chebyshev distance `chebyshev()` | \(\max_{i}(\lvert X_i - Y_i \rvert)\) |
| Chord distance `chord()` | \(\sqrt{\sum_{i=1}^{n} \left(\frac{X_i}{\sqrt{\sum_{j=1}^{n} X_j^2}} - \frac{Y_i}{\sqrt{\sum_{j=1}^{n} Y_j^2}}\right)^2}\) |
| Clark’s divergence distance `clark()` | \(\sqrt{\sum_{i=1}^{n}\left(\frac{X_i - Y_i}{X_i + Y_i}\right)^{2}}\) |
| Divergence `divergence()` | \(2\sum_{i=1}^{n} \frac{(P_i - Q_i)^2}{(P_i + Q_i)^2}\) |
| Euclidean distance `euclidean()` | \(\sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2}\) |
| Gower distance `gower()` | \(\frac{1}{n}\sum_{i=1}^{n}\frac{\lvert X_i - Y_i \rvert}{R_i}\) |
| Hellinger distance `hellinger()` | \(\sqrt{2\sum_{i=1}^{n}(\sqrt{P_i} - \sqrt{Q_i})^{2}}\) |
| Horn-Morisita dissimilarity `horn()` | \(1 - \frac{2\sum_{i=1}^{n}P_{i}Q_{i}}{\sum_{i=1}^{n}P_i^2 + \sum_{i=1}^{n}Q_i^2}\) |
| Jensen-Shannon distance `jensen()` | \(\sqrt{\frac{1}{2}\left[\sum_{i=1}^{n}P_i\ln\left(\frac{2P_i}{P_i + Q_i}\right) + \sum_{i=1}^{n}Q_i\ln\left(\frac{2Q_i}{P_i + Q_i}\right)\right]}\) |
| Jensen-Shannon divergence (JSD) `jsd()` | \(\frac{1}{2}\left[\sum_{i=1}^{n}P_i\ln\left(\frac{2P_i}{P_i + Q_i}\right) + \sum_{i=1}^{n}Q_i\ln\left(\frac{2Q_i}{P_i + Q_i}\right)\right]\) |
| Lorentzian distance `lorentzian()` | \(\sum_{i=1}^{n}\ln{(1 + \lvert X_i - Y_i \rvert)}\) |
| Manhattan distance `manhattan()` | \(\sum_{i=1}^{n} \lvert X_i - Y_i \rvert\) |
| Matusita distance `matusita()` | \(\sqrt{\sum_{i=1}^{n}\left(\sqrt{P_i} - \sqrt{Q_i}\right)^2}\) |
| Minkowski distance `minkowski()` | \(\sqrt[p]{\sum_{i=1}^{n} \lvert X_i - Y_i \rvert^p}\), where \(p\) is the order of the norm |
| Morisita dissimilarity (integers only) `morisita()` | \(1 - \frac{2\sum_{i=1}^{n}X_{i}Y_{i}}{\left(\frac{\sum_{i=1}^{n}X_i(X_i - 1)}{X_T(X_T - 1)} + \frac{\sum_{i=1}^{n}Y_i(Y_i - 1)}{Y_T(Y_T - 1)}\right)X_{T}Y_{T}}\) |
| Motyka dissimilarity `motyka()` | \(\frac{\sum_{i=1}^{n} \max(X_i, Y_i)}{\sum_{i=1}^{n} (X_i + Y_i)}\) |
| Probabilistic Symmetric \(\chi^2\) distance `psym_chisq()` | \(2\sum_{i=1}^{n}\frac{(P_i - Q_i)^2}{P_i + Q_i}\) |
| Soergel distance `soergel()` | \(\frac{\sum_{i=1}^{n} \lvert X_i - Y_i \rvert}{\sum_{i=1}^{n} \max(X_i, Y_i)}\) |
| Squared \(\chi^2\) distance `squared_chisq()` | \(\sum_{i=1}^{n}\frac{(P_i - Q_i)^2}{P_i + Q_i}\) |
| Squared Chord distance `squared_chord()` | \(\sum_{i=1}^{n}\left(\sqrt{P_i} - \sqrt{Q_i}\right)^2\) |
| Squared Euclidean distance `squared_euclidean()` | \(\sum_{i=1}^{n} (X_i - Y_i)^2\) |
| Topsoe distance `topsoe()` | \(\sum_{i=1}^{n}P_i\ln\left(\frac{2P_i}{P_i + Q_i}\right) + \sum_{i=1}^{n}Q_i\ln\left(\frac{2Q_i}{P_i + Q_i}\right)\) |
| Wave Hedges distance `wave_hedges()` | \(\sum_{i=1}^{n}\frac{\lvert X_i - Y_i \rvert}{\max(X_i, Y_i)}\) |
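To make the notation concrete, here is a small Python sketch (not using ecodive) that evaluates a few of the count-based formulas above on toy vectors. Note that Bray-Curtis reuses the Manhattan sum, normalized by the total counts:

```python
import math

# Toy count vectors for two samples across n = 3 genera (made-up values).
x = [3.0, 0.0, 1.0]
y = [1.0, 2.0, 1.0]

manhattan = sum(abs(a - b) for a, b in zip(x, y))               # sum |X_i - Y_i|
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))  # L2 distance
chebyshev = max(abs(a - b) for a, b in zip(x, y))               # largest per-genus gap
bray      = manhattan / sum(a + b for a, b in zip(x, y))        # Manhattan / total counts

print(manhattan, euclidean, chebyshev, bray)
```

Squared Euclidean is, as the name says, just `euclidean ** 2`, which is why it appears as a separate (cheaper, non-metric) entry in the table.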
Given \(A\), the number of genera present in the first sample; \(B\), the number present in the second sample; and \(J\), the number present in both:
| Metric | Formula |
|---|---|
| Dice-Sorensen dissimilarity `sorensen()` | \(1 - \frac{2J}{A + B}\) |
| Hamming distance `hamming()` | \((A + B) - 2J\) |
| Jaccard distance `jaccard()` | \(1 - \frac{J}{A + B - J}\) |
| Otsuka-Ochiai dissimilarity `ochiai()` | \(1 - \frac{J}{\sqrt{AB}}\) |
Note: `sorensen()` is equivalent to `bray(norm = 'binary')`, and `jaccard()` is equivalent to `soergel(norm = 'binary')`.
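As an illustration, the Python sketch below (not using ecodive) derives \(A\), \(B\), and \(J\) from the rarefied saliva and gums vectors shown earlier, then evaluates the binary metrics. It also confirms the equivalence noted above: on binarized data, Bray-Curtis reduces to Dice-Sorensen:

```python
# Rarefied counts from earlier in the vignette.
saliva = [162, 2, 0, 180, 1, 0]
gums   = [309, 2, 0, 34, 0, 0]

A = sum(v > 0 for v in saliva)                            # genera present in saliva
B = sum(v > 0 for v in gums)                              # genera present in gums
J = sum(a > 0 and b > 0 for a, b in zip(saliva, gums))    # genera present in both

sorensen = 1 - 2 * J / (A + B)
jaccard  = 1 - J / (A + B - J)
hamming  = (A + B) - 2 * J

# Bray-Curtis on the binarized (0/1) vectors equals Dice-Sorensen.
bx = [int(v > 0) for v in saliva]
by = [int(v > 0) for v in gums]
bray_binary = sum(abs(a - b) for a, b in zip(bx, by)) / sum(a + b for a, b in zip(bx, by))

print(A, B, J, sorensen, jaccard, hamming, bray_binary)
```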
Given \(n\) branches with lengths \(L\) and a pair of samples’ binary (\(A\) and \(B\)) or proportional abundances (\(P\) and \(Q\)) on each of those branches.
| Metric | Formula |
|---|---|
| Unweighted UniFrac `unweighted_unifrac()` | \(\frac{1}{n}\sum_{i=1}^{n} L_i\lvert A_i - B_i \rvert\) |
| Weighted UniFrac `weighted_unifrac()` | \(\sum_{i=1}^{n} L_i\lvert P_i - Q_i \rvert\) |
| Normalized Weighted UniFrac `normalized_unifrac()` | \(\frac{\sum_{i=1}^{n} L_i\lvert P_i - Q_i \rvert}{\sum_{i=1}^{n} L_i(P_i + Q_i)}\) |
| Generalized UniFrac (GUniFrac) `generalized_unifrac()` | \(\frac{\sum_{i=1}^{n} L_i(P_i + Q_i)^{\alpha}\left\lvert\frac{P_i - Q_i}{P_i + Q_i}\right\rvert}{\sum_{i=1}^{n} L_i(P_i + Q_i)^{\alpha}}\), where \(\alpha\) is a scalable weighting factor |
| Variance-Adjusted Weighted UniFrac `variance_adjusted_unifrac()` | \(\frac{\sum_{i=1}^{n} L_i\frac{\lvert P_i - Q_i \rvert}{\sqrt{(P_i + Q_i)(2 - P_i - Q_i)}}}{\sum_{i=1}^{n} L_i\frac{P_i + Q_i}{\sqrt{(P_i + Q_i)(2 - P_i - Q_i)}}}\) |
See https://cmmr.github.io/ecodive/articles/unifrac.html for detailed example UniFrac calculations.
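The sketch below evaluates the first three UniFrac formulas above in Python on hypothetical per-branch inputs (the branch lengths and abundances are made up for illustration; the real ecodive functions derive \(L\), \(A\), \(B\), \(P\), and \(Q\) by traversing a phylogenetic tree):

```python
# Hypothetical per-branch quantities for n = 4 branches.
L = [1.0, 2.0, 0.5, 1.5]   # branch lengths
A = [1, 1, 0, 1]           # branch has descendants in sample 1 (binary)
B = [1, 0, 1, 1]           # branch has descendants in sample 2 (binary)
P = [0.6, 0.4, 0.0, 0.4]   # proportional abundance under each branch, sample 1
Q = [0.5, 0.0, 0.5, 0.5]   # proportional abundance under each branch, sample 2

n = len(L)
unweighted = sum(l * abs(a - b) for l, a, b in zip(L, A, B)) / n
weighted   = sum(l * abs(p - q) for l, p, q in zip(L, P, Q))
normalized = weighted / sum(l * (p + q) for l, p, q in zip(L, P, Q))

print(unweighted, weighted, normalized)
```

Branches 2 and 3, present in only one of the two samples, contribute to all three distances; branch 1, shared with similar abundance, contributes little.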
By default (pairs = NULL), ecodive’s beta diversity functions compute
every pairwise distance, returning a completely filled-in all-vs-all
distance matrix.
bray(counts)
#> Saliva Gums Nose
#> Gums 0.4260870
#> Nose 0.9797101 0.9826087
#> Stool 0.9884058 0.9884058 0.9913043

If you are doing a reference-vs-all comparison, you can use the
pairs parameter to skip unwanted calculations and save some
CPU time. The larger the dataset, the more noticeable the improvement
will be.
bray(counts, pairs = 1:3)
#> Saliva Gums Nose
#> Gums 0.4260870
#> Nose 0.9797101 NA
#> Stool 0.9884058 NA NA

The pairs argument can be:

- an integer vector of the pair indices to compute,
- a logical vector with one element per pair, or
- a function(i,j) that returns whether rows i and j should be compared.

Therefore, all of the following are equivalent:
bray(counts, pairs = 1:3)
bray(counts, pairs = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE))
bray(counts, pairs = function (i, j) i == 1)

The ordering of pairs follows the pairings produced by
combn().
# Sample (row) index pairings
combn(nrow(counts), 2)
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 1 1 1 2 2 3
#> [2,] 2 3 4 3 4 4
# Sample name pairings
combn(rownames(counts), 2)
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] "Saliva" "Saliva" "Saliva" "Gums" "Gums" "Nose"
#> [2,] "Gums" "Nose" "Stool" "Nose" "Stool" "Stool"

So, for instance, to use gums (sample 2) as the reference sample:

bray(counts, pairs = function (i, j) 2 %in% c(i, j))
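The same pair ordering can be reproduced outside R; Python’s `itertools.combinations` enumerates pairs in the same lexicographic order as `combn()`. The sketch below (an illustration, not ecodive code) lists the pairings and picks out the positions of the pairs that involve the gums sample, i.e. the integer indices one could pass as `pairs`:

```python
from itertools import combinations

samples = ["Saliva", "Gums", "Nose", "Stool"]

# combn(4, 2)-style ordering: (1,2), (1,3), (1,4), (2,3), (2,4), (3,4).
pairs = list(combinations(range(1, len(samples) + 1), 2))

# 1-based positions of the pairs involving the reference sample (Gums = 2).
ref = 2
keep = [k + 1 for k, (i, j) in enumerate(pairs) if ref in (i, j)]

print(pairs)
print(keep)
```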