Beta diversity is a measure of how different two samples are. The
different metrics described in this vignette quantify that difference,
referred to as the “distance” or “dissimilarity” between a pair of
samples. The distance is typically 0 for identical samples
and 1 for completely different samples.
We will use the ex_counts feature table included with
ecodive. It contains the number of observations of each bacterial genus
in each sample. In the text below, you can substitute the word ‘genera’
with the feature of interest in your own data.
library(ecodive)
counts <- rarefy(ex_counts)
t(counts)
#> Saliva Gums Nose Stool
#> Streptococcus 162 309 6 1
#> Bacteroides 2 2 0 341
#> Corynebacterium 0 0 171 1
#> Haemophilus 180 34 0 1
#> Propionibacterium 1 0 82 0
#> Staphylococcus 0 0 86 1

Looking at the matrix above, you can see that saliva and gums are similar, while saliva and stool are quite different.
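That impression can be checked by hand. The Python sketch below (not using ecodive) recomputes the Bray-Curtis dissimilarity, defined later in this vignette as the sum of absolute count differences over the total counts, on the rarefied sample vectors transcribed from the table above:

```python
# Per-genus rarefied counts, transcribed from t(counts) above:
# Streptococcus, Bacteroides, Corynebacterium, Haemophilus,
# Propionibacterium, Staphylococcus.
saliva = [162, 2, 0, 180, 1, 0]
gums   = [309, 2, 0, 34, 0, 0]
stool  = [1, 341, 1, 1, 0, 1]

def bray_curtis(x, y):
    """Sum of absolute differences divided by the sum of all counts."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den

print(round(bray_curtis(saliva, gums), 7))   # similar samples -> small distance
print(round(bray_curtis(saliva, stool), 7))  # dissimilar samples -> near 1
```

The two values match the `bray(counts)` output shown later: about 0.43 for saliva vs. gums, and about 0.99 for saliva vs. stool.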
Before selecting a metric, it is important to distinguish between weighted metrics, which use abundance values, and unweighted metrics, which use only presence/absence.
You can consult list_metrics() to see which category a
specific metric falls into or to list all available options
programmatically.
list_metrics('beta', 'id', weighted = FALSE)
#> [1] "sorensen" "hamming" "jaccard" "ochiai" "unweighted_unifrac"
list_metrics('beta', 'id', weighted = TRUE)
#> [1] "aitchison" "bhattacharyya" "bray"
#> [4] "canberra" "chebyshev" "chord"
#> [7] "clark" "divergence" "euclidean"
#> [10] "generalized_unifrac" "gower" "hellinger"
#> [13] "horn" "jensen" "jsd"
#> [16] "lorentzian" "manhattan" "matusita"
#> [19] "minkowski" "morisita" "motyka"
#> [22] "normalized_unifrac" "psym_chisq" "soergel"
#> [25] "squared_chisq" "squared_chord" "squared_euclidean"
#> [28] "topsoe" "variance_adjusted_unifrac" "wave_hedges"
#> [31] "weighted_unifrac"

The following tables detail the mathematical definitions for the
metrics available in ecodive.
Given \(n\) genera, with counts \(X_i\) and \(Y_i\) (totals \(X_T\) and \(Y_T\)) or proportional abundances \(P_i\) and \(Q_i\) for a pair of samples, where \(X_L\) and \(Y_L\) are the means of \(\ln{X}\) and \(\ln{Y}\), and \(R_i\) is the range of genus \(i\) across all samples:
| Metric | Formula |
|---|---|
| Aitchison distance `aitchison()` | \(\sqrt{\sum_{i=1}^{n} [(\ln{X_i} - X_L) - (\ln{Y_i} - Y_L)]^2}\) |
| Bhattacharyya distance `bhattacharyya()` | \(-\ln{\sum_{i=1}^{n}\sqrt{P_{i}Q_{i}}}\) |
| Bray-Curtis dissimilarity `bray()` | \(\frac{\sum_{i=1}^{n} \lvert X_i - Y_i \rvert}{\sum_{i=1}^{n} (X_i + Y_i)}\) |
| Canberra distance `canberra()` | \(\sum_{i=1}^{n} \frac{\lvert X_i - Y_i \rvert}{X_i + Y_i}\) |
| Chebyshev distance `chebyshev()` | \(\max_{i}(\lvert X_i - Y_i \rvert)\) |
| Chord distance `chord()` | \(\sqrt{\sum_{i=1}^{n} \left(\frac{X_i}{\sqrt{\sum_{j=1}^{n} X_j^2}} - \frac{Y_i}{\sqrt{\sum_{j=1}^{n} Y_j^2}}\right)^2}\) |
| Clark’s divergence distance `clark()` | \(\sqrt{\sum_{i=1}^{n}\left(\frac{X_i - Y_i}{X_i + Y_i}\right)^{2}}\) |
| Divergence `divergence()` | \(2\sum_{i=1}^{n} \frac{(P_i - Q_i)^2}{(P_i + Q_i)^2}\) |
| Euclidean distance `euclidean()` | \(\sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2}\) |
| Gower distance `gower()` | \(\frac{1}{n}\sum_{i=1}^{n}\frac{\lvert X_i - Y_i \rvert}{R_i}\) |
| Hellinger distance `hellinger()` | \(\sqrt{2\sum_{i=1}^{n}(\sqrt{P_i} - \sqrt{Q_i})^{2}}\) |
| Horn-Morisita dissimilarity `horn()` | \(1 - \frac{2\sum_{i=1}^{n}P_{i}Q_{i}}{\sum_{i=1}^{n}P_i^2 + \sum_{i=1}^{n}Q_i^2}\) |
| Jensen-Shannon distance `jensen()` | \(\sqrt{\frac{1}{2}\left[\sum_{i=1}^{n}P_i\ln\left(\frac{2P_i}{P_i + Q_i}\right) + \sum_{i=1}^{n}Q_i\ln\left(\frac{2Q_i}{P_i + Q_i}\right)\right]}\) |
| Jensen-Shannon divergence (JSD) `jsd()` | \(\frac{1}{2}\left[\sum_{i=1}^{n}P_i\ln\left(\frac{2P_i}{P_i + Q_i}\right) + \sum_{i=1}^{n}Q_i\ln\left(\frac{2Q_i}{P_i + Q_i}\right)\right]\) |
| Lorentzian distance `lorentzian()` | \(\sum_{i=1}^{n}\ln{(1 + \lvert X_i - Y_i \rvert)}\) |
| Manhattan distance `manhattan()` | \(\sum_{i=1}^{n} \lvert X_i - Y_i \rvert\) |
| Matusita distance `matusita()` | \(\sqrt{\sum_{i=1}^{n}\left(\sqrt{P_i} - \sqrt{Q_i}\right)^2}\) |
| Minkowski distance `minkowski()` | \(\sqrt[p]{\sum_{i=1}^{n} \lvert X_i - Y_i \rvert^p}\), where \(p\) is the order of the norm |
| Morisita dissimilarity (integers only) `morisita()` | \(1 - \frac{2\sum_{i=1}^{n}X_{i}Y_{i}}{\left(\frac{\sum_{i=1}^{n}X_i(X_i - 1)}{X_T(X_T - 1)} + \frac{\sum_{i=1}^{n}Y_i(Y_i - 1)}{Y_T(Y_T - 1)}\right)X_{T}Y_{T}}\) |
| Motyka dissimilarity `motyka()` | \(\frac{\sum_{i=1}^{n} \max(X_i, Y_i)}{\sum_{i=1}^{n} (X_i + Y_i)}\) |
| Probabilistic Symmetric \(\chi^2\) distance `psym_chisq()` | \(2\sum_{i=1}^{n}\frac{(P_i - Q_i)^2}{P_i + Q_i}\) |
| Soergel distance `soergel()` | \(\frac{\sum_{i=1}^{n} \lvert X_i - Y_i \rvert}{\sum_{i=1}^{n} \max(X_i, Y_i)}\) |
| Squared \(\chi^2\) distance `squared_chisq()` | \(\sum_{i=1}^{n}\frac{(P_i - Q_i)^2}{P_i + Q_i}\) |
| Squared Chord distance `squared_chord()` | \(\sum_{i=1}^{n}\left(\sqrt{P_i} - \sqrt{Q_i}\right)^2\) |
| Squared Euclidean distance `squared_euclidean()` | \(\sum_{i=1}^{n} (X_i - Y_i)^2\) |
| Topsoe distance `topsoe()` | \(\sum_{i=1}^{n}P_i\ln\left(\frac{2P_i}{P_i + Q_i}\right) + \sum_{i=1}^{n}Q_i\ln\left(\frac{2Q_i}{P_i + Q_i}\right)\) |
| Wave Hedges distance `wave_hedges()` | \(\sum_{i=1}^{n}\frac{\lvert X_i - Y_i \rvert}{\max(X_i, Y_i)}\) |
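To make the notation concrete, here is a small Python sketch (not using ecodive) that evaluates a few of the count-based formulas above on toy vectors. Note that Bray-Curtis reuses the Manhattan sum, normalized by the total counts:

```python
import math

# Toy count vectors for two samples across n = 3 genera (made-up values).
x = [3.0, 0.0, 1.0]
y = [1.0, 2.0, 1.0]

manhattan = sum(abs(a - b) for a, b in zip(x, y))               # sum |X_i - Y_i|
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))  # L2 distance
chebyshev = max(abs(a - b) for a, b in zip(x, y))               # largest per-genus gap
bray      = manhattan / sum(a + b for a, b in zip(x, y))        # Manhattan / total counts

print(manhattan, euclidean, chebyshev, bray)
```

Squared Euclidean is, as the name says, just `euclidean ** 2`, which is why it appears as a separate (cheaper, non-metric) entry in the table.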
Given \(A\), the number of genera present in the first sample; \(B\), the number present in the second sample; and \(J\), the number present in both:
| Metric | Formula |
|---|---|
| Dice-Sorensen dissimilarity `sorensen()` | \(1 - \frac{2J}{A + B}\) |
| Hamming distance `hamming()` | \((A + B) - 2J\) |
| Jaccard distance `jaccard()` | \(1 - \frac{J}{A + B - J}\) |
| Otsuka-Ochiai dissimilarity `ochiai()` | \(1 - \frac{J}{\sqrt{AB}}\) |
Note: `sorensen()` is equivalent to `bray(norm = 'binary')`, and `jaccard()` is equivalent to `soergel(norm = 'binary')`.
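As an illustration, the Python sketch below (not using ecodive) derives \(A\), \(B\), and \(J\) from the rarefied saliva and gums vectors shown earlier, then evaluates the binary metrics. It also confirms the equivalence noted above: on binarized data, Bray-Curtis reduces to Dice-Sorensen:

```python
# Rarefied counts from earlier in the vignette.
saliva = [162, 2, 0, 180, 1, 0]
gums   = [309, 2, 0, 34, 0, 0]

A = sum(v > 0 for v in saliva)                            # genera present in saliva
B = sum(v > 0 for v in gums)                              # genera present in gums
J = sum(a > 0 and b > 0 for a, b in zip(saliva, gums))    # genera present in both

sorensen = 1 - 2 * J / (A + B)
jaccard  = 1 - J / (A + B - J)
hamming  = (A + B) - 2 * J

# Bray-Curtis on the binarized (0/1) vectors equals Dice-Sorensen.
bx = [int(v > 0) for v in saliva]
by = [int(v > 0) for v in gums]
bray_binary = sum(abs(a - b) for a, b in zip(bx, by)) / sum(a + b for a, b in zip(bx, by))

print(A, B, J, sorensen, jaccard, hamming, bray_binary)
```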
Given \(n\) branches with lengths \(L\) and a pair of samples’ binary (\(A\) and \(B\)) or proportional abundances (\(P\) and \(Q\)) on each of those branches.
| Metric | Formula |
|---|---|
| Unweighted UniFrac `unweighted_unifrac()` | \(\frac{1}{n}\sum_{i=1}^{n} L_i\lvert A_i - B_i \rvert\) |
| Weighted UniFrac `weighted_unifrac()` | \(\sum_{i=1}^{n} L_i\lvert P_i - Q_i \rvert\) |
| Normalized Weighted UniFrac `normalized_unifrac()` | \(\frac{\sum_{i=1}^{n} L_i\lvert P_i - Q_i \rvert}{\sum_{i=1}^{n} L_i(P_i + Q_i)}\) |
| Generalized UniFrac (GUniFrac) `generalized_unifrac()` | \(\frac{\sum_{i=1}^{n} L_i(P_i + Q_i)^{\alpha}\left\lvert\frac{P_i - Q_i}{P_i + Q_i}\right\rvert}{\sum_{i=1}^{n} L_i(P_i + Q_i)^{\alpha}}\), where \(\alpha\) is a scalable weighting factor |
| Variance-Adjusted Weighted UniFrac `variance_adjusted_unifrac()` | \(\frac{\sum_{i=1}^{n} L_i\frac{\lvert P_i - Q_i \rvert}{\sqrt{(P_i + Q_i)(2 - P_i - Q_i)}}}{\sum_{i=1}^{n} L_i\frac{P_i + Q_i}{\sqrt{(P_i + Q_i)(2 - P_i - Q_i)}}}\) |
See https://cmmr.github.io/ecodive/articles/unifrac.html for detailed example UniFrac calculations.
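The sketch below evaluates the first three UniFrac formulas above in Python on hypothetical per-branch inputs (the branch lengths and abundances are made up for illustration; the real ecodive functions derive \(L\), \(A\), \(B\), \(P\), and \(Q\) by traversing a phylogenetic tree):

```python
# Hypothetical per-branch quantities for n = 4 branches.
L = [1.0, 2.0, 0.5, 1.5]   # branch lengths
A = [1, 1, 0, 1]           # branch has descendants in sample 1 (binary)
B = [1, 0, 1, 1]           # branch has descendants in sample 2 (binary)
P = [0.6, 0.4, 0.0, 0.4]   # proportional abundance under each branch, sample 1
Q = [0.5, 0.0, 0.5, 0.5]   # proportional abundance under each branch, sample 2

n = len(L)
unweighted = sum(l * abs(a - b) for l, a, b in zip(L, A, B)) / n
weighted   = sum(l * abs(p - q) for l, p, q in zip(L, P, Q))
normalized = weighted / sum(l * (p + q) for l, p, q in zip(L, P, Q))

print(unweighted, weighted, normalized)
```

Branches 2 and 3, present in only one of the two samples, contribute to all three distances; branch 1, shared with similar abundance, contributes little.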
By default (pairs = NULL), ecodive’s beta diversity functions compute
every pairwise distance, returning a completely filled-in all-vs-all
distance matrix.
bray(counts)
#> Saliva Gums Nose
#> Gums 0.4260870
#> Nose 0.9797101 0.9826087
#> Stool 0.9884058 0.9884058 0.9913043

If you are doing a reference-vs-all comparison, you can use the
pairs parameter to skip unwanted calculations and save some
CPU time. The larger the dataset, the more noticeable the improvement
will be.
bray(counts, pairs = 1:3)
#> Saliva Gums Nose
#> Gums 0.4260870
#> Nose 0.9797101 NA
#> Stool 0.9884058 NA NA

The pairs argument can be:

- an integer vector of the pair indices to compute,
- a logical vector with one element per pair, or
- a function(i,j) that returns whether rows i and j should be compared.

Therefore, all of the following are equivalent:
bray(counts, pairs = 1:3)
bray(counts, pairs = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE))
bray(counts, pairs = function (i, j) i == 1)

The ordering of pairs follows the pairings produced by
combn().
# Sample (row) index pairings
combn(nrow(counts), 2)
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 1 1 1 2 2 3
#> [2,] 2 3 4 3 4 4
# Sample name pairings
combn(rownames(counts), 2)
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] "Saliva" "Saliva" "Saliva" "Gums" "Gums" "Nose"
#> [2,] "Gums" "Nose" "Stool" "Nose" "Stool" "Stool"

So, for instance, to use gums (sample 2) as the reference sample:

bray(counts, pairs = function (i, j) 2 %in% c(i, j))
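The same pair ordering can be reproduced outside R; Python’s `itertools.combinations` enumerates pairs in the same lexicographic order as `combn()`. The sketch below (an illustration, not ecodive code) lists the pairings and picks out the positions of the pairs that involve the gums sample, i.e. the integer indices one could pass as `pairs`:

```python
from itertools import combinations

samples = ["Saliva", "Gums", "Nose", "Stool"]

# combn(4, 2)-style ordering: (1,2), (1,3), (1,4), (2,3), (2,4), (3,4).
pairs = list(combinations(range(1, len(samples) + 1), 2))

# 1-based positions of the pairs involving the reference sample (Gums = 2).
ref = 2
keep = [k + 1 for k, (i, j) in enumerate(pairs) if ref in (i, j)]

print(pairs)
print(keep)
```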