| Type: | Package |
| Title: | Parallel and Memory-Efficient Ecological Diversity Metrics |
| Version: | 2.2.6 |
| Description: | Computes alpha and beta diversity metrics using concurrent 'C' threads. Metrics include 'UniFrac', Faith's phylogenetic diversity, Bray-Curtis dissimilarity, Shannon diversity index, and many others. Also parses newick trees into 'phylo' objects and rarefies feature tables. |
| URL: | https://cmmr.github.io/ecodive/, https://github.com/cmmr/ecodive |
| BugReports: | https://github.com/cmmr/ecodive/issues |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| LazyData: | true |
| Depends: | R (≥ 3.6.0) |
| RoxygenNote: | 7.3.3 |
| Config/Needs/website: | rmarkdown |
| Config/testthat/edition: | 3 |
| Imports: | parallel, utils |
| Suggests: | knitr, Matrix, parallelly, rmarkdown, slam, testthat (≥ 3.0.0) |
| VignetteBuilder: | knitr |
| NeedsCompilation: | yes |
| Packaged: | 2026-04-14 16:52:58 UTC; Daniel |
| Author: | Daniel P. Smith |
| Maintainer: | Daniel P. Smith <dansmith01@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2026-04-14 17:30:22 UTC |
Abundance-based Coverage Estimator (ACE)
Description
A non-parametric estimator of species richness that separates features into abundant and rare groups.
Usage
ace(counts, cutoff = 10L, margin = 1L, cpus = n_cpus())
Arguments
counts |
A numeric matrix of count data (samples |
cutoff |
The maximum number of observations to consider "rare".
Default: |
margin |
The margin containing samples. |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The ACE metric separates features into "abundant" and "rare" groups based on a cutoff (usually 10 counts). It assumes that the presence of abundant species is certain, while the true number of rare species must be estimated.
Equations:
C_{ace} = 1 - \frac{F_1}{X_{rare}}
\gamma_{ace}^2 = \max\left[\frac{F_{rare} \sum_{i=1}^{r}i(i-1)F_i}{C_{ace}X_{rare}(X_{rare} - 1)} - 1, 0\right]
D_{ace} = F_{abund} + \frac{F_{rare}}{C_{ace}} + \frac{F_1}{C_{ace}}\gamma_{ace}^2
Where:
- r: Rare cutoff (default 10). Features with \le r counts are considered rare.
- F_i: Number of features with exactly i counts.
- F_1: Number of features where X_i = 1 (singletons).
- F_{rare}: Number of rare features where X_i \le r.
- F_{abund}: Number of abundant features where X_i > r.
- X_{rare}: Total counts belonging to rare features.
- C_{ace}: The sample abundance coverage estimator.
- \gamma_{ace}^2: The estimated squared coefficient of variation of the rare features.
Parameter: cutoff The integer threshold distinguishing rare from abundant species. Standard practice is to use 10.
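The equations above can be sketched in base R for a single sample. This is a minimal illustration, not the package's C implementation: the helper name ace_sketch and the example vector are hypothetical, and edge cases (for example, a sample whose rare features are all singletons, which makes C_{ace} zero) are not handled.

```r
# Hypothetical helper: ACE for one sample's integer counts.
ace_sketch <- function(x, cutoff = 10L) {
  x       <- x[x > 0]               # ignore unobserved features
  rare    <- x[x <= cutoff]         # counts of the rare features
  f_abund <- sum(x > cutoff)        # F_abund
  f_rare  <- length(rare)           # F_rare
  x_rare  <- sum(rare)              # X_rare
  f_1     <- sum(rare == 1)         # F_1 (singletons)
  c_ace   <- 1 - f_1 / x_rare       # C_ace

  # sum_{i=1}^{r} i(i-1)F_i equals the sum of X(X-1) over rare features.
  ssq     <- sum(rare * (rare - 1))
  gamma2  <- max(f_rare * ssq / (c_ace * x_rare * (x_rare - 1)) - 1, 0)

  f_abund + f_rare / c_ace + (f_1 / c_ace) * gamma2
}

x <- c(1, 1, 2, 3, 5, 12, 40)       # hypothetical sample
ace_sketch(x)                       # default cutoff of 10
ace_sketch(x, cutoff = 2)           # stricter definition of "rare"
```

Lowering cutoff moves features from the rare group to the abundant group, so the estimate depends on that choice.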
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Chao, A., & Lee, S. M. (1992). Estimating the number of classes via sample coverage. Journal of the American Statistical Association, 87(417), 210-217. doi:10.1080/01621459.1992.10475194
See Also
alpha_div(), vignette('adiv')
Other Richness metrics:
chao1(),
margalef(),
menhinick(),
observed(),
squares()
Examples
ace(ex_counts)
documentation
Description
documentation
Arguments
counts |
A numeric matrix of count data (samples |
Aitchison distance
Description
Calculates the Euclidean distance between centered log-ratio (CLR) transformed abundances.
Usage
aitchison(
counts,
margin = 1L,
pseudocount = NULL,
pairs = NULL,
cpus = n_cpus()
)
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
pseudocount |
Value added to counts to handle zeros when
|
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Aitchison distance is defined as:
\sqrt{\sum_{i=1}^{n} [(\ln{X_i} - X_L) - (\ln{Y_i} - Y_L)]^2}
Where:
- X_i, Y_i: Absolute counts for the i-th feature.
- X_L, Y_L: Mean log of abundances. X_L = \frac{1}{n}\sum_{i=1}^{n} \ln{X_i}.
- n: The number of features.
Base R Equivalent:
x <- log((x + pseudocount) / exp(mean(log(x + pseudocount))))
y <- log((y + pseudocount) / exp(mean(log(y + pseudocount))))
sqrt(sum((x-y)^2)) # Euclidean distance
Pseudocount
Zeros are undefined in the Aitchison (CLR) transformation. If
pseudocount is NULL (the default) and zeros are detected,
the function uses half the minimum non-zero value (min(x[x>0]) / 2)
and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
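To make the default fallback concrete, the sketch below derives the half-minimum pseudocount and applies the CLR transform in base R. Taking the minimum over the two pooled sample vectors is an assumption made for illustration; the package may determine it over a different set of values. The clr helper and the example vectors are hypothetical.

```r
x <- c(10, 0, 5, 85)
y <- c(20, 30, 0, 50)

# Half the minimum non-zero value (assumption: pooled over both samples).
pooled      <- c(x, y)
pseudocount <- min(pooled[pooled > 0]) / 2

# Centered log-ratio: log of each shifted value minus the mean log.
clr <- function(v) {
  lv <- log(v + pseudocount)
  lv - mean(lv)
}

# Aitchison distance is the Euclidean distance between CLR vectors.
d <- sqrt(sum((clr(x) - clr(y))^2))
d
```

Rerunning with a different pseudocount (e.g. 1) changes d, which is why an explicit, recorded choice is preferable to the silent default.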
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Aitchison, J. (1986). The statistical analysis of compositional data. Chapman and Hall. doi:10.1002/bimj.4710300705
Aitchison, J. (1982). The statistical analysis of compositional data. Journal of the Royal Statistical Society: Series B (Methodological), 44(2), 139-160. doi:10.1111/j.2517-6161.1982.tb01195.x
Costea, P. I., Zeller, G., Sunagawa, S., & Bork, P. (2014). A fair comparison. Nature Methods, 11(4), 359. doi:10.1038/nmeth.2897
Gloor, G. B., Macklaim, J. M., Pawlowsky-Glahn, V., & Egozcue, J. J. (2017). Microbiome datasets are compositional: and this is not optional. Frontiers in Microbiology, 8, 2224. doi:10.3389/fmicb.2017.02224
Kaul, A., Mandal, S., Davidov, O., & Peddada, S. D. (2017). Analysis of microbiome data in the presence of excess zeros. Frontiers in Microbiology, 8, 2114. doi:10.3389/fmicb.2017.02114
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
Examples
aitchison(ex_counts, pseudocount = 1)
Alpha Diversity Wrapper Function
Description
Alpha Diversity Wrapper Function
Usage
alpha_div(
counts,
metric,
norm = "percent",
cutoff = 10L,
digits = 3L,
tree = NULL,
margin = 1L,
cpus = n_cpus()
)
Arguments
counts |
A numeric matrix of count data (samples |
metric |
The name of an alpha diversity metric. One of |
norm |
Normalize the incoming counts. Options are:
Default: |
cutoff |
The maximum number of observations to consider "rare".
Default: |
digits |
Precision of the returned values, in number of decimal
places. E.g. the default |
tree |
A |
margin |
The margin containing samples. |
cpus |
How many parallel processing threads should be used. The
default, |
Details
Integer Count Requirements
A frequent and critical error in alpha diversity analysis is providing the wrong type of data to a metric's formula. Some indices are mathematically defined based on counts of individuals and require raw, integer abundance data. Others are based on proportional abundances and can accept either integer counts (which are then converted to proportions) or pre-normalized proportional data. Using proportional data with a metric that requires integer counts will return an error message.
Requires Integer Counts Only
Chao1
ACE
Squares Richness Estimator
Margalef's Index
Menhinick's Index
Fisher's Alpha
Brillouin Index
Can Use Proportional Data
Observed Features
Shannon Index
Gini-Simpson Index
Inverse Simpson Index
Berger-Parker Index
McIntosh Index
Faith's PD
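A quick base R check can tell the two kinds of input apart before a metric is chosen. The sketch below is illustrative only: is_whole and shannon_sketch are hypothetical helpers, not package functions, and the Shannon formula used is the standard -\sum P_i \ln P_i.

```r
counts <- matrix(c(4, 0, 6,
                   3, 3, 4), nrow = 2, byrow = TRUE)
props  <- counts / rowSums(counts)   # each row now sums to 1

# TRUE when every value is a whole number (raw counts).
is_whole <- function(m) all(m == round(m))

is_whole(counts)  # TRUE:  suitable for count-only metrics
is_whole(props)   # FALSE: restrict to proportion-based metrics

# A proportion-based index gives the same answer for both inputs.
shannon_sketch <- function(x) {
  p <- x[x > 0] / sum(x)
  -sum(p * log(p))
}
all.equal(shannon_sketch(counts[1, ]), shannon_sketch(props[1, ]))
```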
Value
A numeric vector.
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Examples
# Example counts matrix
ex_counts
# Shannon diversity values
alpha_div(ex_counts, 'Shannon')
# Chao1 diversity values
alpha_div(ex_counts, 'c')
# Faith PD values
alpha_div(ex_counts, 'faith', tree = ex_tree)
Berger-Parker Index
Description
A measure of the numerical importance of the most abundant species.
Usage
berger(counts, margin = 1L, cpus = n_cpus())
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Berger-Parker index is defined as the proportional abundance of the most dominant feature:
\max(P_i)
Where:
- P_i: Proportional abundance of the i-th feature.
Base R Equivalent:
x <- ex_counts[1,]
p <- x / sum(x)
max(p)
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Berger, W. H., & Parker, F. L. (1970). Diversity of planktonic foraminifera in deep-sea sediments. Science, 168(3937), 1345-1347. doi:10.1126/science.168.3937.1345
See Also
alpha_div(), vignette('adiv')
Other Dominance metrics:
mcintosh()
Examples
berger(ex_counts)
Beta Diversity Wrapper Function
Description
Beta Diversity Wrapper Function
Usage
beta_div(
counts,
metric,
margin = 1L,
norm = "none",
pseudocount = NULL,
power = 1.5,
alpha = 0.5,
tree = NULL,
pairs = NULL,
cpus = n_cpus()
)
Arguments
counts |
A numeric matrix of count data (samples |
metric |
The name of a beta diversity metric. One of |
margin |
The margin containing samples. |
norm |
Normalize the incoming counts. Options are:
Default: |
pseudocount |
Value added to counts to handle zeros when
|
power |
Only used when |
alpha |
Only used when |
tree |
Only used by phylogeny-aware metrics. A |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
List of Beta Diversity Metrics
| Option / Function Name | Metric Name |
| aitchison | Aitchison distance |
| bhattacharyya | Bhattacharyya distance |
| bray | Bray-Curtis dissimilarity |
| canberra | Canberra distance |
| chebyshev | Chebyshev distance |
| chord | Chord distance |
| clark | Clark's divergence distance |
| divergence | Divergence |
| euclidean | Euclidean distance |
| generalized_unifrac | Generalized UniFrac (GUniFrac) |
| gower | Gower distance |
| hamming | Hamming distance |
| hellinger | Hellinger distance |
| horn | Horn-Morisita dissimilarity |
| jaccard | Jaccard distance |
| jensen | Jensen-Shannon distance |
| jsd | Jensen-Shannon divergence (JSD) |
| lorentzian | Lorentzian distance |
| manhattan | Manhattan distance |
| matusita | Matusita distance |
| minkowski | Minkowski distance |
| morisita | Morisita dissimilarity |
| motyka | Motyka dissimilarity |
| normalized_unifrac | Normalized Weighted UniFrac |
| ochiai | Otsuka-Ochiai dissimilarity |
| psym_chisq | Probabilistic Symmetric Chi-Squared distance |
| soergel | Soergel distance |
| sorensen | Dice-Sorensen dissimilarity |
| squared_chisq | Squared Chi-Squared distance |
| squared_chord | Squared Chord distance |
| squared_euclidean | Squared Euclidean distance |
| topsoe | Topsoe distance |
| unweighted_unifrac | Unweighted UniFrac |
| variance_adjusted_unifrac | Variance-Adjusted Weighted UniFrac (VAW-UniFrac) |
| wave_hedges | Wave Hedges distance |
| weighted_unifrac | Weighted UniFrac |
Flexible name matching
Metric names are matched case-insensitively, and partial matching is
supported. Any runs of non-alpha characters are converted to underscores.
E.g. metric = 'Weighted UniFrac' selects weighted_unifrac.
UniFrac names can be shortened to the first letter plus "unifrac". E.g.
uunifrac, w_unifrac, or V UniFrac. These also support partial matching.
Finished code should always use the full primary option name to avoid ambiguity with future additions to the metrics list.
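The matching rules can be approximated in a few lines of base R. This is a hedged sketch of the described behavior, not the package's actual resolver: normalize_metric is a hypothetical helper, the metrics vector is a small subset of the table above, and the UniFrac first-letter shortcuts are not covered.

```r
# Subset of the primary option names, for illustration only.
metrics <- c("aitchison", "bray", "jensen", "jsd",
             "unweighted_unifrac", "weighted_unifrac")

# Hypothetical helper: normalize a user-supplied name, then partial-match it.
normalize_metric <- function(name) {
  key <- tolower(gsub("[^A-Za-z]+", "_", name))  # non-alpha runs -> "_"
  metrics[pmatch(key, metrics)]                  # NA when ambiguous or unknown
}

normalize_metric("Weighted UniFrac")  # "weighted_unifrac"
normalize_metric("BRA")               # "bray"
normalize_metric("j")                 # NA: matches both jensen and jsd
```

The ambiguous "j" case shows why the full option name is the safer choice: a prefix that is unique today can collide once the list grows.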
Value
A numeric vector.
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Pseudocount
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
Examples
# Example counts matrix
ex_counts
# Bray-Curtis distances
beta_div(ex_counts, 'bray')
# Generalized UniFrac distances
beta_div(ex_counts, 'GUniFrac', tree = ex_tree)
Bhattacharyya distance
Description
Measures the similarity of two probability distributions.
Usage
bhattacharyya(counts, margin = 1L, pairs = NULL, cpus = n_cpus())
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Bhattacharyya distance is defined as:
-\ln{\sum_{i=1}^{n}\sqrt{P_{i}Q_{i}}}
Where:
- P_i, Q_i: Proportional abundances of the i-th feature.
- n: The number of features.
Base R Equivalent:
x <- ex_counts[1,]; p <- x / sum(x)
y <- ex_counts[2,]; q <- y / sum(y)
-log(sum(sqrt(p * q)))
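Two boundary cases help in reading the values this formula produces. The sketch below wraps the base R equivalent in a hypothetical helper, bhatt_sketch, and uses made-up vectors; it illustrates the formula only and makes no claim about how the package reports these cases.

```r
# Hypothetical helper wrapping the formula above.
bhatt_sketch <- function(x, y) {
  p <- x / sum(x)
  q <- y / sum(y)
  -log(sum(sqrt(p * q)))
}

# Same composition at different sequencing depths: distance is 0.
bhatt_sketch(c(5, 5, 0), c(10, 10, 0))

# No shared features: the sum is 0, so the distance is Inf.
bhatt_sketch(c(5, 0), c(0, 5))
```

Because the inputs are converted to proportions first, the result does not depend on the total count of either sample.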
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Bhattacharyya, A. (1943). On a measure of divergence between two statistical populations defined by their probability distributions. Bulletin of the Calcutta Mathematical Society, 35, 99-109.
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
Examples
bhattacharyya(ex_counts)
Bray-Curtis dissimilarity
Description
A standard ecological metric quantifying the dissimilarity between communities.
Usage
bray(
counts,
margin = 1L,
norm = "none",
pseudocount = NULL,
pairs = NULL,
cpus = n_cpus()
)
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
norm |
Normalize the incoming counts. Options are:
Default: |
pseudocount |
Value added to counts to handle zeros when
|
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Bray-Curtis dissimilarity is defined as:
\frac{\sum_{i=1}^{n} |X_i - Y_i|}{\sum_{i=1}^{n} (X_i + Y_i)}
Where:
- X_i, Y_i: Absolute abundances of the i-th feature.
- n: The number of features.
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
sum(abs(x-y)) / sum(x+y)
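The pairwise formula extends to every combination of samples with combn(). The sketch below is a plain base R illustration on a hypothetical three-sample matrix with samples in rows; bray_pair is not a package function, and nothing here reflects how the package parallelizes the work.

```r
# Hypothetical counts: 3 samples (rows) by 3 features (columns).
counts <- matrix(c(10, 0,  5,
                    2, 8,  5,
                    0, 0, 15),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("A", "B", "C"), c("f1", "f2", "f3")))

bray_pair <- function(x, y) sum(abs(x - y)) / sum(x + y)

# Each column of `idx` holds the row indices of one sample pair.
idx <- combn(nrow(counts), 2)
d   <- apply(idx, 2, function(ij) bray_pair(counts[ij[1], ], counts[ij[2], ]))
names(d) <- apply(idx, 2, function(ij) paste(rownames(counts)[ij], collapse = "-"))
d
```

With n samples this evaluates n(n-1)/2 pairs, which is the quadratic cost that motivates parallel processing for large tables.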
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Pseudocount
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
References
Bray, J. R., & Curtis, J. T. (1957). An ordination of the upland forest communities of southern Wisconsin. Ecological Monographs, 27(4), 325-349. doi:10.2307/1942268
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
Examples
bray(ex_counts)
Brillouin Index
Description
A diversity index derived from information theory, appropriate for fully censused communities.
Usage
brillouin(counts, margin = 1L, cpus = n_cpus())
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Brillouin index is defined as:
\frac{\ln{[(\sum_{i = 1}^{n} X_i)!]} - \sum_{i = 1}^{n} \ln{(X_i!)}}{\sum_{i = 1}^{n} X_i}
Where:
- n: The number of features.
- X_i: Integer count of the i-th feature.
Base R Equivalent:
x <- ex_counts[1,]
# note: lgamma(x + 1) == log(x!)
(lgamma(sum(x) + 1) - sum(lgamma(x + 1))) / sum(x)
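The lgamma() form matters in practice because factorial() overflows double precision for arguments above 170, which realistic sample totals exceed easily. The sketch below contrasts the two on a hypothetical count vector; the variable names are illustrative.

```r
x <- c(120, 60, 20)   # hypothetical counts; sum(x) is 200

# Direct factorials: factorial(200) overflows to Inf (R may also warn).
naive <- suppressWarnings(
  (log(factorial(sum(x))) - sum(log(factorial(x)))) / sum(x)
)

# The log-gamma form never builds the factorial itself, so it stays finite.
stable <- (lgamma(sum(x) + 1) - sum(lgamma(x + 1))) / sum(x)

is.infinite(naive)
is.finite(stable)
```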
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Brillouin, L. (1956). Science and information theory. Academic Press.
See Also
alpha_div(), vignette('adiv')
Other Diversity metrics:
fisher(),
inv_simpson(),
shannon(),
simpson()
Examples
brillouin(ex_counts)
Canberra distance
Description
A weighted version of the Manhattan distance, sensitive to differences when both values are small.
Usage
canberra(
counts,
margin = 1L,
norm = "none",
pseudocount = NULL,
pairs = NULL,
cpus = n_cpus()
)
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
norm |
Normalize the incoming counts. Options are:
Default: |
pseudocount |
Value added to counts to handle zeros when
|
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Canberra distance is defined as:
\sum_{i=1}^{n} \frac{|X_i - Y_i|}{X_i + Y_i}
Where:
- X_i, Y_i: Absolute abundances of the i-th feature.
- n: The number of features.
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
sum(abs(x-y) / (x+y))
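The base R expression above returns NaN whenever a feature is absent from both samples, because that term evaluates to 0/0. A common convention is to skip such features; whether the package follows it is not stated here, so the sketch below should be read as an assumption, with hypothetical example vectors.

```r
x <- c(4, 0, 6, 0)
y <- c(2, 3, 0, 0)    # the last feature is zero in both samples

# Literal formula: the 0/0 term makes the whole sum NaN.
literal <- sum(abs(x - y) / (x + y))

# Assumed convention: drop features that are absent from both samples.
keep     <- (x + y) > 0
filtered <- sum(abs(x - y)[keep] / (x + y)[keep])

is.nan(literal)
filtered
```

Each retained term lies between 0 and 1, and a feature present in only one sample contributes exactly 1.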
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Pseudocount
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
References
Lance, G. N., & Williams, W. T. (1966). Computer programs for hierarchical polythetic classification ("similarity analyses"). The Computer Journal, 9(1), 60-64. doi:10.1093/comjnl/9.1.60
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
Examples
canberra(ex_counts)
Chao1 Richness Estimator
Description
A non-parametric estimator of the lower bound of species richness.
Usage
chao1(counts, margin = 1L, cpus = n_cpus())
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Chao1 estimator uses the ratio of singletons to doubletons to estimate the number of missing species:
n + \frac{(F_1)^2}{2 F_2}
Where:
- n: The number of observed features.
- F_1: Number of features observed once (singletons).
- F_2: Number of features observed twice (doubletons).
Base R Equivalent:
x <- ex_counts[1,]
sum(x > 0) + (sum(x == 1)^2) / (2 * sum(x == 2))
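The formula above is undefined when a sample has no doubletons, since F_2 appears in the denominator. The sketch below works one hypothetical sample through the formula and, for comparison, through the widely used bias-corrected variant F_1(F_1 - 1) / (2(F_2 + 1)). The variant is shown only as an illustration; this page does not say which form the package applies when F_2 is zero.

```r
x  <- c(1, 1, 1, 2, 2, 5, 9)   # hypothetical counts for one sample
n  <- sum(x > 0)               # observed features
f1 <- sum(x == 1)              # singletons
f2 <- sum(x == 2)              # doubletons

# Formula as documented above.
classic <- n + f1^2 / (2 * f2)

# Bias-corrected variant (assumption for illustration): defined even if f2 == 0.
corrected <- n + f1 * (f1 - 1) / (2 * (f2 + 1))

c(classic = classic, corrected = corrected)
```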
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Chao, A. (1984). Nonparametric estimation of the number of classes in a population. Scandinavian Journal of Statistics, 11, 265-270.
See Also
alpha_div(), vignette('adiv')
Other Richness metrics:
ace(),
margalef(),
menhinick(),
observed(),
squares()
Examples
chao1(ex_counts)
Chebyshev distance
Description
The largest absolute difference in any single feature between two samples.
Usage
chebyshev(
counts,
margin = 1L,
norm = "none",
pseudocount = NULL,
pairs = NULL,
cpus = n_cpus()
)
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
norm |
Normalize the incoming counts. Options are:
Default: |
pseudocount |
Value added to counts to handle zeros when
|
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Chebyshev distance is defined as:
\max(|X_i - Y_i|)
Where:
- X_i, Y_i: Absolute abundances of the i-th feature.
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
max(abs(x-y))
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Pseudocount
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
References
Cantrell, C. D. (2000). Modern mathematical methods for physicists and engineers. Cambridge University Press.
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
Examples
chebyshev(ex_counts)
Chord distance
Description
Euclidean distance between normalized vectors.
Usage
chord(counts, margin = 1L, pairs = NULL, cpus = n_cpus())
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Chord distance is defined as:
\sqrt{\sum_{i=1}^{n} \left(\frac{X_i}{\sqrt{\sum_{j=1}^{n} X_j^2}} - \frac{Y_i}{\sqrt{\sum_{j=1}^{n} Y_j^2}}\right)^2}
Where:
- X_i, Y_i: Absolute counts of the i-th feature.
- n: The number of features.
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
sqrt(sum(((x / sqrt(sum(x ^ 2))) - (y / sqrt(sum(y ^ 2))))^2))
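Because each sample is divided by its own Euclidean norm, the chord distance ignores overall sampling depth and, for non-negative data, ranges from 0 to \sqrt{2}. The sketch below checks both ends with a hypothetical helper, chord_sketch, and made-up vectors.

```r
# Hypothetical helper wrapping the formula above.
chord_sketch <- function(x, y) {
  sqrt(sum((x / sqrt(sum(x^2)) - y / sqrt(sum(y^2)))^2))
}

x <- c(3, 4, 0)
chord_sketch(x, 10 * x)          # 0: same composition, different depth
chord_sketch(c(1, 0), c(0, 1))   # sqrt(2): no shared features
```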
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Orlóci, L. (1967). An agglomerative method for classification of plant communities. Journal of Ecology, 55(1), 193-206. doi:10.2307/2257725
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
Examples
chord(ex_counts)
Clark's divergence distance
Description
Also known as the coefficient of divergence.
Usage
clark(
counts,
margin = 1L,
norm = "none",
pseudocount = NULL,
pairs = NULL,
cpus = n_cpus()
)
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
norm |
Normalize the incoming counts. Options are:
Default: |
pseudocount |
Value added to counts to handle zeros when
|
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
Clark's divergence distance is defined as:
\sqrt{\sum_{i=1}^{n}\left(\frac{X_i - Y_i}{X_i + Y_i}\right)^{2}}
Where:
- X_i, Y_i: Absolute abundances of the i-th feature.
- n: The number of features.
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
sqrt(sum((abs(x - y) / (x + y)) ^ 2))
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Pseudocount
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
References
Clark, P. J. (1952). An extension of the coefficient of divergence for use with multiple characters. Copeia, 1952(2), 61-64. doi:10.2307/1438598
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
Examples
clark(ex_counts)
Divergence
Description
A probabilistic divergence metric.
Usage
divergence(
counts,
margin = 1L,
norm = "none",
pseudocount = NULL,
pairs = NULL,
cpus = n_cpus()
)
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
norm |
Normalize the incoming counts. Options are:
Default: |
pseudocount |
Value added to counts to handle zeros when
|
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
Divergence is defined as:
2\sum_{i=1}^{n} \frac{(P_i - Q_i)^2}{(P_i + Q_i)^2}
Where:
- P_i, Q_i: Proportional abundances of the i-th feature.
- n: The number of features.
Base R Equivalent:
x <- ex_counts[1,]; p <- x / sum(x)
y <- ex_counts[2,]; q <- y / sum(y)
2 * sum((p - q)^2 / (p + q)^2)
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Pseudocount
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
References
Cha, S.-H. (2007). Comprehensive survey on distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences, 1(4), 300–307.
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
Examples
divergence(ex_counts)
documentation
Description
documentation
Arguments
alpha |
How much weight to give to relative abundances; a value
between 0 and 1, inclusive. Setting |
counts |
A numeric matrix of count data (samples |
cpus |
How many parallel processing threads should be used. The
default, |
cutoff |
The maximum number of observations to consider "rare".
Default: |
digits |
Precision of the returned values, in number of decimal
places. E.g. the default |
norm |
Normalize the incoming counts. Options are:
Default: |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
power |
Scaling factor for the magnitude of differences between
communities ( |
pseudocount |
Value added to counts to handle zeros when
|
margin |
The margin containing samples. |
tree |
A |
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Euclidean distance
Description
The straight-line distance between two points in multidimensional space.
Usage
euclidean(
counts,
margin = 1L,
norm = "none",
pseudocount = NULL,
pairs = NULL,
cpus = n_cpus()
)
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
norm |
Normalize the incoming counts. Options are:
Default: |
pseudocount |
Value added to counts to handle zeros when
|
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Euclidean distance is defined as:
\sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2}
Where:
- X_i, Y_i: Absolute abundances of the i-th feature.
- n: The number of features.
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
sqrt(sum((x-y)^2))
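To check the pairwise formula against every pair of samples at once, base R's stats::dist() can serve as a reference. The sketch below uses a made-up matrix with samples in rows; it illustrates the definition only and says nothing about how the package computes the result internally.

```r
# Toy matrix: 3 samples (rows) x 4 features (columns); values are made up.
m <- matrix(c(4, 0, 2, 9,
              1, 3, 2, 5,
              0, 0, 7, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("s1", "s2", "s3"), NULL))

# Pairwise formula for the first two samples.
d12 <- sqrt(sum((m["s1", ] - m["s2", ])^2))

# stats::dist() applies the same formula to every pair of rows.
d_all <- dist(m, method = "euclidean")
all.equal(d12, as.matrix(d_all)["s1", "s2"])
```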
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Pseudocount
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
References
Legendre, P., & Legendre, L. (2012). Numerical ecology. Elsevier.
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
Examples
euclidean(ex_counts)
Example counts matrix
Description
Genera found on four human body sites.
Usage
ex_counts
Format
A matrix of 4 samples (rows) x 6 genera (columns).
Source
Derived from The Human Microbiome Project dataset.
Example phylogenetic tree
Description
Companion tree for ex_counts.
Usage
ex_tree
Format
A phylo object.
Details
ex_tree encodes this tree structure:
      +----------44---------- Haemophilus
  +-2-|
  |   +----------------68---------------- Bacteroides
  |
  |             +---18---- Streptococcus
  |      +--12--|
  |      |      +--11-- Staphylococcus
  +--11--|
         |      +-----24----- Corynebacterium
         +--12--|
                +--13-- Propionibacterium
Faith's Phylogenetic Diversity (PD)
Description
Calculates the sum of the branch lengths for all species present in a sample.
Usage
faith(counts, tree = NULL, margin = 1L, cpus = n_cpus())
Arguments
counts |
A numeric matrix of count data (samples |
tree |
A |
margin |
The margin containing samples. |
cpus |
How many parallel processing threads should be used. The
default, |
Details
Faith's PD is defined as:
\sum_{i = 1}^{n} L_i A_i
Where:
- n: The number of branches in the phylogenetic tree.
- L_i: The length of the i-th branch.
- A_i: A binary value (1 if any descendants of branch i are present in the sample, 0 otherwise).
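The definition can be traced by hand on the ex_tree diagram from the "Example phylogenetic tree" entry. The base R sketch below encodes that diagram as an edge table and sums the branches that lead to at least one present tip. The node labels ("root", "n1", ...) and the helper names are hypothetical, and the package itself works on phylo objects rather than on a table like this.

```r
# One row per branch (parent -> child), lengths read off the ex_tree diagram.
edges <- data.frame(
  parent = c("root", "n1", "n1", "root", "n2", "n3", "n3", "n2", "n4", "n4"),
  child  = c("n1", "Haemophilus", "Bacteroides", "n2", "n3",
             "Streptococcus", "Staphylococcus", "n4",
             "Corynebacterium", "Propionibacterium"),
  length = c(2, 44, 68, 11, 12, 18, 11, 12, 24, 13),
  stringsAsFactors = FALSE
)

# Tips below a node; a node with no children is itself a tip.
tips_below <- function(node) {
  kids <- edges$child[edges$parent == node]
  if (length(kids) == 0) return(node)
  unlist(lapply(kids, tips_below))
}

# Sum L_i over branches with A_i = 1 (any descendant tip present).
faith_sketch <- function(present_tips) {
  a <- vapply(edges$child,
              function(ch) any(tips_below(ch) %in% present_tips),
              logical(1))
  sum(edges$length[a])
}

faith_sketch(c("Streptococcus", "Staphylococcus"))  # 18 + 11 + 12 + 11 = 52
```

Note that the shared branches back to the root (12 and 11) are counted once, which is why closely related tips add less diversity than distant ones.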
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Faith, D. P. (1992). Conservation evaluation and phylogenetic diversity. Biological Conservation, 61(1), 1-10. doi:10.1016/0006-3207(92)91201-3
See Also
alpha_div(), vignette('adiv')
Other Phylogenetic metrics:
generalized_unifrac(),
normalized_unifrac(),
unweighted_unifrac(),
variance_adjusted_unifrac(),
weighted_unifrac()
Examples
faith(ex_counts, tree = ex_tree)
Fisher's Alpha
Description
A parametric diversity index assuming species abundances follow a log-series distribution.
Usage
fisher(counts, digits = 3L, margin = 1L, cpus = n_cpus())
Arguments
counts |
A numeric matrix of count data (samples |
digits |
Precision of the returned values, in number of decimal
places. E.g. the default |
margin |
The margin containing samples. |
cpus |
How many parallel processing threads should be used. The
default, |
Details
Fisher's Alpha (\alpha) is the parameter in the equation:
\frac{n}{\alpha} = \ln{\left(1 + \frac{X_T}{\alpha}\right)}
Where:
- n: The number of features.
- X_T: Total of all counts (sequencing depth).
The value of \alpha is solved for iteratively.
Parameter: digits
The precision (number of decimal places) to use when solving the equation.
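One way to picture the iterative solve is with base R's uniroot(). The sketch below is not the package's solver: the helper name fisher_sketch(), the search interval, and the way digits maps to a tolerance are all assumptions, and n is taken to be the number of features with non-zero counts.

```r
# Hypothetical helper: solve n / alpha = log(1 + X_T / alpha) for alpha.
fisher_sketch <- function(x, digits = 3L) {
  x   <- x[x > 0]
  n   <- length(x)                          # observed features
  x_t <- sum(x)                             # sequencing depth
  f   <- function(a) a * log(1 + x_t / a) - n
  # f is negative near zero and positive for large alpha when x_t > n.
  root <- uniroot(f, interval = c(1e-6, 1e6), tol = 10^-(digits + 2))$root
  round(root, digits)
}

x <- c(50, 30, 10, 5, 3, 1, 1)              # toy counts for one sample
a <- fisher_sketch(x)
a * log(1 + sum(x) / a)                     # should be close to length(x), i.e. 7
```

A root exists only when the total count exceeds the number of features; a sample made entirely of singletons has no finite solution under this formulation.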
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Fisher, R. A., Corbet, A. S., & Williams, C. B. (1943). The relation between the number of species and the number of individuals in a random sample of an animal population. Journal of Animal Ecology, 12, 42-58. doi:10.2307/1411
See Also
alpha_div(), vignette('adiv')
Other Diversity metrics:
brillouin(),
inv_simpson(),
shannon(),
simpson()
Examples
fisher(ex_counts)
Generalized UniFrac (GUniFrac)
Description
A unified UniFrac distance that balances the weight of abundant and rare lineages.
Usage
generalized_unifrac(
counts,
tree = NULL,
alpha = 0.5,
margin = 1L,
pairs = NULL,
cpus = n_cpus()
)
Arguments
counts |
A numeric matrix of count data (samples |
tree |
A |
alpha |
How much weight to give to relative abundances; a value
between 0 and 1, inclusive. Setting |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Generalized UniFrac distance is defined as:
\frac{\sum_{i=1}^{n} L_i(P_i + Q_i)^{\alpha}\left|\frac{P_i - Q_i}{P_i + Q_i}\right|}{\sum_{i=1}^{n} L_i(P_i + Q_i)^{\alpha}}
Where:
- n: The number of branches in the tree.
- L_i: The length of the i-th branch.
- P_i, Q_i: The proportion of each community descending from branch i, in samples P and Q respectively.
- \alpha: A scalable weighting factor.
Parameter: alpha
The alpha parameter controls the weight given to abundant lineages. At \alpha = 1 the formula reduces to normalized Weighted UniFrac. At \alpha = 0 every branch is weighted by its length alone, which moves the distance toward Unweighted UniFrac.
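The effect of alpha can be seen by evaluating the formula directly. The sketch below assumes the per-branch lengths and proportions (L, P, Q) have already been derived from a tree; the values are made up and the helper gunifrac_sketch() is hypothetical. Branches absent from both samples are dropped to avoid dividing by zero, which is a choice of this sketch.

```r
# Hypothetical helper: Generalized UniFrac from per-branch vectors.
gunifrac_sketch <- function(L, P, Q, alpha = 0.5) {
  keep <- (P + Q) > 0                 # skip branches absent from both samples
  L <- L[keep]; P <- P[keep]; Q <- Q[keep]
  w <- L * (P + Q)^alpha              # branch weight
  sum(w * abs(P - Q) / (P + Q)) / sum(w)
}

# Toy per-branch values (not derived from ex_tree).
L <- c(2, 44, 68, 11)
P <- c(1.0, 0.7, 0.3, 0.0)
Q <- c(0.4, 0.4, 0.0, 0.6)

gunifrac_sketch(L, P, Q, alpha = 0.5)

# With alpha = 1 the (P + Q) factors cancel in the numerator.
all.equal(gunifrac_sketch(L, P, Q, alpha = 1),
          sum(L * abs(P - Q)) / sum(L * (P + Q)))
```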
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Chen, J., Bittinger, K., Charlson, E. S., Hoffmann, C., Lewis, J., Wu, G. D., ... & Li, H. (2012). Associating microbiome composition with environmental covariates using generalized UniFrac distances. Bioinformatics, 28(16), 2106-2113. doi:10.1093/bioinformatics/bts342
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Phylogenetic metrics:
faith(),
normalized_unifrac(),
unweighted_unifrac(),
variance_adjusted_unifrac(),
weighted_unifrac()
Examples
generalized_unifrac(ex_counts, tree = ex_tree, alpha = 0.5)
Gower distance
Description
A distance metric that normalizes differences by the range of the feature.
Usage
gower(
counts,
margin = 1L,
norm = "none",
pseudocount = NULL,
pairs = NULL,
cpus = n_cpus()
)
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
norm |
Normalize the incoming counts. Options are:
Default: |
pseudocount |
Value added to counts to handle zeros when
|
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Gower distance is defined as:
\frac{1}{n}\sum_{i=1}^{n}\frac{|X_i - Y_i|}{R_i}
Where:
- X_i, Y_i: Absolute abundances of the i-th feature.
- R_i: The range of the i-th feature across all samples (max - min).
- n: The number of features.
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
r <- apply(ex_counts, 2L, function(v) max(v) - min(v))
n <- length(x)
sum(abs(x-y) / r) / n
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Pseudocount
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
References
Gower, J. C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 27(4), 857-871. doi:10.2307/2528823
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
Examples
gower(ex_counts)
Hamming distance
Description
Counts the features present in exactly one of two samples; equivalently, the minimum number of substitutions required to change one presence/absence string into the other.
Usage
hamming(counts, margin = 1L, pairs = NULL, cpus = n_cpus())
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Hamming distance is defined as:
(A + B) - 2J
Where:
- A, B: Number of features in each sample.
- J: Number of features in common (intersection).
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
sum(xor(x, y))
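The set-count formula and the xor() one-liner describe the same quantity. A small check on made-up vectors (not from ex_counts) shows how A, B, and J are obtained from presence/absence:

```r
# Toy counts for two samples over six features.
x <- c(5, 0, 3, 0, 1, 0)
y <- c(2, 4, 0, 0, 7, 0)

A <- sum(x > 0)                 # features present in x: 3
B <- sum(y > 0)                 # features present in y: 3
J <- sum(x > 0 & y > 0)         # features present in both: 2

(A + B) - 2 * J                 # 2
sum(xor(x, y))                  # 2, features present in exactly one sample
```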
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Hamming, R. W. (1950). Error detecting and error correcting codes. Bell System Technical Journal, 29(2), 147-160. doi:10.1002/j.1538-7305.1950.tb00463.x
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Presence/Absence metrics:
jaccard(),
ochiai(),
sorensen()
Examples
hamming(ex_counts)
Hellinger distance
Description
A distance metric related to the Bhattacharyya distance, often used for community data with many zeros.
Usage
hellinger(counts, margin = 1L, pairs = NULL, cpus = n_cpus())
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Hellinger distance is defined as:
\sqrt{\sum_{i=1}^{n}(\sqrt{P_i} - \sqrt{Q_i})^{2}}
Where:
- P_i, Q_i: Proportional abundances of the i-th feature.
- n: The number of features.
Base R Equivalent:
x <- ex_counts[1,]; p <- x / sum(x)
y <- ex_counts[2,]; q <- y / sum(y)
sqrt(sum((sqrt(p) - sqrt(q)) ^ 2))
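The relation to Bhattacharyya mentioned in the description follows from expanding the square: because P and Q each sum to 1, the squared Hellinger distance as defined above equals 2 * (1 - BC), where BC = sum(sqrt(P_i * Q_i)) is the Bhattacharyya coefficient. The check below uses made-up counts; how bhattacharyya() itself is implemented is not assumed.

```r
# Toy counts for two samples.
x <- c(12, 0, 5, 3); p <- x / sum(x)
y <- c(2, 8, 5, 0);  q <- y / sum(y)

h  <- sqrt(sum((sqrt(p) - sqrt(q))^2))   # Hellinger distance, as defined above
bc <- sum(sqrt(p * q))                   # Bhattacharyya coefficient

all.equal(h^2, 2 * (1 - bc))
```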
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Rao, C. R. (1995). A review of canonical coordinates and an alternative to correspondence analysis using Hellinger distance. Qüestiió, 19, 23-63.
Hellinger, E. (1909). Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. Journal für die reine und angewandte Mathematik, 136, 210–271. doi:10.1515/crll.1909.136.210
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
Examples
hellinger(ex_counts)
Horn-Morisita dissimilarity
Description
A dissimilarity index based on Simpson's diversity index, suitable for abundance data.
Usage
horn(
counts,
margin = 1L,
norm = "none",
pseudocount = NULL,
pairs = NULL,
cpus = n_cpus()
)
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
norm |
Normalize the incoming counts. Options are:
Default: |
pseudocount |
Value added to counts to handle zeros when
|
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Horn-Morisita dissimilarity is defined as:
1 - \frac{2\sum_{i=1}^{n}P_{i}Q_{i}}{\sum_{i=1}^{n}P_i^2 + \sum_{i=1}^{n}Q_i^2}
Where:
- P_i, Q_i: Proportional abundances of the i-th feature.
- n: The number of features.
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
z <- sum(x^2) / sum(x)^2 + sum(y^2) / sum(y)^2
1 - ((2 * sum(x * y)) / (z * sum(x) * sum(y)))
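The equation is written in proportions while the base R code works on raw counts. Dividing numerator and denominator by sum(x) * sum(y) links the two forms, which the following check on made-up counts confirms:

```r
# Toy counts for two samples.
x <- c(12, 0, 5, 3)
y <- c(2, 8, 5, 0)

# Form 1: the equation, using proportions.
p <- x / sum(x); q <- y / sum(y)
from_props <- 1 - 2 * sum(p * q) / (sum(p^2) + sum(q^2))

# Form 2: the counts-based expression.
z <- sum(x^2) / sum(x)^2 + sum(y^2) / sum(y)^2
from_counts <- 1 - (2 * sum(x * y)) / (z * sum(x) * sum(y))

all.equal(from_props, from_counts)
```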
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Pseudocount
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
References
Horn, H. S. (1966). Measurement of "overlap" in comparative ecological studies. The American Naturalist, 100(914), 419-424. doi:10.1086/282436
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
Examples
horn(ex_counts)
Inverse Simpson Index
Description
A transformation of the Simpson index that represents the "effective number of species".
Usage
inv_simpson(counts, margin = 1L, cpus = n_cpus())
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Inverse Simpson index is defined as:
1 / \sum_{i = 1}^{n} P_i^2
Where:
- n: The number of features.
- P_i: Proportional abundance of the i-th feature.
Base R Equivalent:
x <- ex_counts[1,]
p <- x / sum(x)
1 / sum(p^2)
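The "effective number of species" reading is easiest to see at the extremes. In the sketch below (made-up counts, hypothetical helper name), a perfectly even sample of four features scores exactly 4, while a sample dominated by one feature scores close to 1.

```r
# Hypothetical helper wrapping the formula above.
inv_simpson_sketch <- function(x) {
  p <- x / sum(x)
  1 / sum(p^2)
}

even   <- inv_simpson_sketch(c(25, 25, 25, 25))  # four equally abundant features
skewed <- inv_simpson_sketch(c(97, 1, 1, 1))     # one dominant feature
c(even = even, skewed = skewed)
```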
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Simpson, E. H. (1949). Measurement of diversity. Nature, 163, 688. doi:10.1038/163688a0
See Also
alpha_div(), vignette('adiv')
Other Diversity metrics:
brillouin(),
fisher(),
shannon(),
simpson()
Examples
inv_simpson(ex_counts)
Jaccard distance
Description
Measures dissimilarity between sample sets.
Usage
jaccard(counts, margin = 1L, pairs = NULL, cpus = n_cpus())
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Jaccard distance is defined as:
1 - \frac{J}{(A + B - J)}
Where:
- A, B: Number of features in each sample.
- J: Number of features in common (intersection).
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
1 - sum(x & y) / sum(x | y)
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Jaccard, P. (1912). The distribution of the flora in the alpine zone. New Phytologist, 11(2), 37-50. doi:10.1111/j.1469-8137.1912.tb05611.x
Jaccard, P. (1908). Nouvelles recherches sur la distribution florale. Bulletin de la Societe Vaudoise des Sciences Naturelles, 44(163), 223-270. doi:10.5169/seals-268384
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Presence/Absence metrics:
hamming(),
ochiai(),
sorensen()
Examples
jaccard(ex_counts)
Jensen-Shannon distance
Description
The square root of the Jensen-Shannon divergence.
Usage
jensen(counts, margin = 1L, pairs = NULL, cpus = n_cpus())
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Jensen-Shannon distance is defined as:
\sqrt{\frac{1}{2}\left[\sum_{i=1}^{n}P_i\ln\left(\frac{2P_i}{P_i + Q_i}\right) + \sum_{i=1}^{n}Q_i\ln\left(\frac{2Q_i}{P_i + Q_i}\right)\right]}
Where:
- P_i, Q_i: Proportional abundances of the i-th feature.
- n: The number of features.
Base R Equivalent:
x <- ex_counts[1,]; p <- x / sum(x)
y <- ex_counts[2,]; q <- y / sum(y)
sqrt(sum(p * log(2 * p / (p+q)), q * log(2 * q / (p+q))) / 2)
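In plain R the expression above returns NaN whenever a feature is absent from either sample, because 0 * log(0) evaluates to NaN. The sketch below applies the usual convention that such terms contribute zero. The helper names are hypothetical, and it is an assumption that the package treats zeros the same way.

```r
# Hypothetical helper: one summation term, with 0 * log(0) taken as 0.
kl_term <- function(a, b) sum(ifelse(a > 0, a * log(2 * a / (a + b)), 0))

jsd_sketch <- function(x, y) {
  p <- x / sum(x); q <- y / sum(y)
  (kl_term(p, q) + kl_term(q, p)) / 2      # Jensen-Shannon divergence
}
jensen_sketch <- function(x, y) sqrt(jsd_sketch(x, y))  # Jensen-Shannon distance

# Toy samples with no shared features: divergence reaches its maximum, log(2).
x <- c(3, 1, 0, 0)
y <- c(0, 0, 2, 2)
jsd_sketch(x, y)
jensen_sketch(x, y)
```

Identical samples give 0, and samples with no features in common give log(2) for the divergence and sqrt(log(2)) for the distance.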
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Endres, D. M., & Schindelin, J. E. (2003). A new metric for probability distributions. IEEE Transactions on Information Theory, 49(7), 1858-1860. doi:10.1109/TIT.2003.813506
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
Examples
jensen(ex_counts)
Jensen-Shannon divergence (JSD)
Description
A symmetrized and smoothed version of the Kullback-Leibler divergence.
Usage
jsd(counts, margin = 1L, pairs = NULL, cpus = n_cpus())
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Jensen-Shannon divergence (JSD) is defined as:
\frac{1}{2}\left[\sum_{i=1}^{n}P_i\ln\left(\frac{2P_i}{P_i + Q_i}\right) + \sum_{i=1}^{n}Q_i\ln\left(\frac{2Q_i}{P_i + Q_i}\right)\right]
Where:
- P_i, Q_i: Proportional abundances of the i-th feature.
- n: The number of features.
Base R Equivalent:
x <- ex_counts[1,]; p <- x / sum(x)
y <- ex_counts[2,]; q <- y / sum(y)
sum(p * log(2 * p / (p+q)), q * log(2 * q / (p+q))) / 2
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1), 145-151. doi:10.1109/18.61115
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
Examples
jsd(ex_counts)
Find and Browse Available Metrics
Description
Programmatic access to the lists of available metrics, and their associated functions.
Usage
list_metrics(
div = c(NA, "alpha", "beta"),
val = c("data.frame", "list", "func", "id", "name", "div", "phylo", "weighted",
"int_only", "true_metric"),
nm = c(NA, "id", "name"),
phylo = NULL,
weighted = NULL,
int_only = NULL,
true_metric = NULL
)
match_metric(
metric,
div = NULL,
phylo = NULL,
weighted = NULL,
int_only = NULL,
true_metric = NULL
)
Arguments
div |
Filter by diversity type. One of |
val |
Sets the return value for this function call. See "Value"
section below. Default: |
nm |
What value to use for the names of the returned object.
Default is |
phylo |
Filter by whether a phylogenetic tree is required.
|
weighted |
Filter by whether relative abundance is used. |
int_only |
Filter by whether integer counts are required. |
true_metric |
Filter by whether the metric satisfies the triangle
inequality. |
metric |
The name of an alpha/beta diversity metric to search for. Supports partial matching. All non-alpha characters are ignored. |
Value
match_metric()
A list with the following elements.
- name: Metric name, e.g. "Faith's Phylogenetic Diversity"
- id: Metric ID - also the name of the function, e.g. "faith"
- div: Either "alpha" or "beta".
- phylo: TRUE if metric requires a phylogenetic tree; FALSE otherwise.
- weighted: TRUE if metric takes relative abundance into account; FALSE if it only uses presence/absence.
- int_only: TRUE if metric requires integer counts; FALSE otherwise.
- true_metric: TRUE if metric is a true metric and satisfies the triangle inequality; FALSE if it is a non-metric dissimilarity; NA for alpha diversity metrics.
- func: The function for this metric, e.g. ecodive::faith
- params: Formal args for func, e.g. c("counts", "norm", "tree", "cpus")
list_metrics()
The returned object's type and values are controlled with the val and nm arguments.
- val = "data.frame": The data.frame from which the below options are sourced.
- val = "list": A list of objects as returned by match_metric() (above).
- val = "func": A list of functions.
- val = "id": A character vector of metric IDs.
- val = "name": A character vector of metric names.
- val = "div": A character vector of "alpha" and/or "beta".
- val = "phylo": A logical vector indicating which metrics require a tree.
- val = "weighted": A logical vector indicating which metrics take relative abundance into account (as opposed to just presence/absence).
- val = "int_only": A logical vector indicating which metrics require integer counts.
- val = "true_metric": A logical vector indicating which metrics are true metrics and satisfy the triangle inequality, which work better for ordinations such as PCoA.
If nm is set, then the names of the vector or list will be the metric ID
(nm="id") or name (nm="name"). When val="data.frame", the names will be
applied to the rownames() property of the data.frame.
Examples
# A data.frame of all available metrics.
head(list_metrics())
# All alpha diversity function names.
list_metrics('alpha', val = 'id')
# Try to find a metric named 'otus'.
m <- match_metric('otus')
# The result is a list that includes the function.
str(m)
Lorentzian distance
Description
A log-based distance metric that is robust to outliers.
Usage
lorentzian(
counts,
margin = 1L,
norm = "none",
pseudocount = NULL,
pairs = NULL,
cpus = n_cpus()
)
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
norm |
Normalize the incoming counts. Options are:
Default: |
pseudocount |
Value added to counts to handle zeros when
|
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Lorentzian distance is defined as:
\sum_{i=1}^{n}\ln{(1 + |X_i - Y_i|)}
Where:
- X_i, Y_i: Absolute abundances of the i-th feature.
- n: The number of features.
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
sum(log(1 + abs(x - y)))
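The robustness to outliers comes from the logarithm, which compresses large per-feature differences. The sketch below, on made-up counts, inflates a single feature by 1000 and compares how far the Lorentzian distance and the plain sum of absolute differences each move.

```r
lorentzian_sketch <- function(x, y) sum(log(1 + abs(x - y)))  # formula above
abs_diff_sum      <- function(x, y) sum(abs(x - y))           # for comparison

# Toy samples; y_out differs from y only by one extreme value.
x     <- c(10, 20, 30, 40)
y     <- c(12, 18, 33, 40)
y_out <- c(12, 18, 33, 1040)

# Growth factor caused by the single outlier under each measure.
abs_growth <- abs_diff_sum(x, y_out) / abs_diff_sum(x, y)
lor_growth <- lorentzian_sketch(x, y_out) / lorentzian_sketch(x, y)
c(abs_diff = abs_growth, lorentzian = lor_growth)
```

The sum of absolute differences grows from 7 to 1007, while the Lorentzian distance only gains the single term log(1001).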
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Pseudocount
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
References
Cha, S.-H. (2007). Comprehensive survey on distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences, 1(4), 300–307.
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
Examples
lorentzian(ex_counts)
Manhattan distance
Description
The sum of absolute differences, also known as the taxicab geometry.
Usage
manhattan(
counts,
margin = 1L,
norm = "none",
pseudocount = NULL,
pairs = NULL,
cpus = n_cpus()
)
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
norm |
Normalize the incoming counts. Options are:
Default: |
pseudocount |
Value added to counts to handle zeros when
|
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Manhattan distance is defined as:
\sum_{i=1}^{n} |X_i - Y_i|
Where:
- X_i, Y_i: Absolute abundances of the i-th feature.
- n: The number of features.
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
sum(abs(x-y))
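For all sample pairs at once, base R's `stats::dist()` offers the same metric. The sketch below uses a hypothetical matrix `m` (samples in rows) and spot-checks two pairs against the formula above.

```r
# Hypothetical 3-sample x 4-feature matrix with samples in rows.
m <- matrix(c(4, 0, 1, 7,
              2, 2, 0, 7,
              0, 9, 3, 1), nrow = 3, byrow = TRUE,
            dimnames = list(c("S1", "S2", "S3"), paste0("F", 1:4)))

# All pairwise Manhattan distances via base R.
d <- stats::dist(m, method = "manhattan")
d

# Spot check against the formula for the first pair.
s1_s2 <- sum(abs(m["S1", ] - m["S2", ]))
s1_s2
```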
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Pseudocount
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
References
Krause, E. F. (1987). Taxicab geometry: An adventure in non-Euclidean geometry. Dover Publications.
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
Examples
manhattan(ex_counts)
Margalef's Richness Index
Description
A richness metric that normalizes the number of species by the log of the total sample size.
Usage
margalef(counts, margin = 1L, cpus = n_cpus())
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
cpus |
How many parallel processing threads should be used. The
default, |
Details
Margalef's index is defined as:
\frac{n - 1}{\ln{X_T}}
Where:
- n: The number of features.
- X_T: Total of all counts (sequencing depth).
Base R Equivalent:
x <- ex_counts[1,]
(sum(x > 0) - 1) / log(sum(x))
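A small worked example may help show how the index depends on sequencing depth. The counts in `x` are hypothetical.

```r
# Hypothetical counts for one sample.
x <- c(50, 30, 0, 15, 5)

n   <- sum(x > 0)  # observed features: 4
x_t <- sum(x)      # total counts: 100

margalef_x <- (n - 1) / log(x_t)
margalef_x

# Doubling every count leaves n unchanged but lowers the index.
margalef_2x <- (n - 1) / log(sum(2 * x))
margalef_2x
```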
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Margalef, R. (1958). Information theory in ecology. General Systems, 3, 36-71.
Gamito, S. (2010). Caution is needed when applying Margalef diversity index. Ecological Indicators, 10(2), 550-551. doi:10.1016/j.ecolind.2009.07.006
See Also
alpha_div(), vignette('adiv')
Other Richness metrics:
ace(),
chao1(),
menhinick(),
observed(),
squares()
Examples
margalef(ex_counts)
Matusita distance
Description
A distance measure closely related to the Hellinger distance.
Usage
matusita(counts, margin = 1L, pairs = NULL, cpus = n_cpus())
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Matusita distance is defined as:
\sqrt{\sum_{i=1}^{n}\left(\sqrt{P_i} - \sqrt{Q_i}\right)^2}
Where:
- P_i, Q_i: Proportional abundances of the i-th feature.
- n: The number of features.
Base R Equivalent:
x <- ex_counts[1,]; p <- x / sum(x)
y <- ex_counts[2,]; q <- y / sum(y)
sqrt(sum((sqrt(p) - sqrt(q)) ^ 2))
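Because the formula operates on proportions, the distance is bounded. The sketch below wraps the expression above in a hypothetical helper, `matusita_xy()`, and evaluates the two extremes: samples with identical proportions and samples with no shared features.

```r
# Hypothetical helper wrapping the base R expression above.
matusita_xy <- function(x, y) {
  p <- x / sum(x)
  q <- y / sum(y)
  sqrt(sum((sqrt(p) - sqrt(q))^2))
}

same     <- matusita_xy(c(5, 10, 0), c(10, 20, 0))  # same proportions
disjoint <- matusita_xy(c(5, 10, 0), c(0, 0, 7))    # no shared features

same      # 0
disjoint  # sqrt(2), the upper bound
```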
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Matusita, K. (1955). Decision rules, based on the distance, for problems of fit, two samples, and estimation. The Annals of Mathematical Statistics, 26(4), 631-640. doi:10.1214/aoms/1177728422
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
Examples
matusita(ex_counts)
McIntosh Index
Description
A dominance index based on the Euclidean distance from the origin.
Usage
mcintosh(counts, margin = 1L, cpus = n_cpus())
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The McIntosh index is defined as:
\frac{X_T - \sqrt{\sum_{i = 1}^{n} (X_i)^2}}{X_T - \sqrt{X_T}}
Where:
- n: The number of features.
- X_i: Integer count of the i-th feature.
- X_T: Total of all counts.
Base R Equivalent:
x <- ex_counts[1,]
(sum(x) - sqrt(sum(x^2))) / (sum(x) - sqrt(sum(x)))
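To get a feel for how the formula responds to dominance, the sketch below evaluates it for an even community and for one dominated by a single feature. The helper `mcintosh_x()` and both count vectors are hypothetical.

```r
# Hypothetical helper wrapping the base R expression above.
mcintosh_x <- function(x) {
  x_t <- sum(x)
  (x_t - sqrt(sum(x^2))) / (x_t - sqrt(x_t))
}

even      <- mcintosh_x(c(25, 25, 25, 25))  # (100 - 50) / (100 - 10)
dominated <- mcintosh_x(c(97, 1, 1, 1))

even
dominated  # much smaller: one feature holds nearly all counts
```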
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
McIntosh, R. P. (1967). An index of diversity and the relation of certain concepts to diversity. Ecology, 48(3), 392-404. doi:10.2307/1932674
See Also
alpha_div(), vignette('adiv')
Other Dominance metrics:
berger()
Examples
mcintosh(ex_counts)
Menhinick's Richness Index
Description
A richness metric that normalizes the number of species by the square root of the total sample size.
Usage
menhinick(counts, margin = 1L, cpus = n_cpus())
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
cpus |
How many parallel processing threads should be used. The
default, |
Details
Menhinick's index is defined as:
\frac{n}{\sqrt{X_T}}
Where:
- n: The number of features.
- X_T: Total of all counts.
Base R Equivalent:
x <- ex_counts[1,]
sum(x > 0) / sqrt(sum(x))
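A brief worked example with hypothetical counts shows the arithmetic and how the value changes with sequencing depth.

```r
# Hypothetical counts for one sample: 4 observed features, 100 counts.
x <- c(50, 30, 0, 15, 5)

menhinick_x <- sum(x > 0) / sqrt(sum(x))  # 4 / 10
menhinick_x

# Quadrupling every count doubles sqrt(X_T) and halves the index.
menhinick_4x <- sum(4 * x > 0) / sqrt(sum(4 * x))  # 4 / 20
menhinick_4x
```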
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Menhinick, E. F. (1964). A comparison of some species-individuals diversity indices applied to samples of field insects. Ecology, 45(4), 859-861. doi:10.2307/1934933
See Also
alpha_div(), vignette('adiv')
Other Richness metrics:
ace(),
chao1(),
margalef(),
observed(),
squares()
Examples
menhinick(ex_counts)
Minkowski distance
Description
A generalized metric that includes Euclidean and Manhattan distance as special cases.
Usage
minkowski(
counts,
margin = 1L,
power = 1.5,
norm = "none",
pseudocount = NULL,
pairs = NULL,
cpus = n_cpus()
)
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
power |
Scaling factor for the magnitude of differences between
communities ( |
norm |
Normalize the incoming counts. Options are:
Default: |
pseudocount |
Value added to counts to handle zeros when
|
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Minkowski distance is defined as:
\sqrt[p]{\sum_{i=1}^{n} |X_i - Y_i|^p}
Where:
- X_i, Y_i: Absolute abundances of the i-th feature.
- n: The number of features.
- p: The geometry of the space (power parameter).
Parameter: power
The power parameter (default 1.5) determines the value of p in the equation.
Special Cases
- Manhattan distance: When p = 1, the formula reduces to the sum of absolute differences.
- Euclidean distance: When p = 2, the formula reduces to the standard straight-line distance.
- Chebyshev distance: When p \to \infty, the formula reduces to the maximum absolute difference.
Base R Equivalent:
p <- 1.5
x <- ex_counts[1,]
y <- ex_counts[2,]
sum(abs(x - y)^p) ^ (1/p)
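The special cases listed above can be checked numerically. The sketch below uses a hypothetical helper, `minkowski_xy()`, and made-up count vectors; a large finite `p` stands in for the limit `p \to \infty`.

```r
# Hypothetical helper wrapping the base R expression above.
minkowski_xy <- function(x, y, p) sum(abs(x - y)^p)^(1 / p)

# Hypothetical counts; absolute differences are 2, 3, 0, 6.
x <- c(4, 0, 1, 7)
y <- c(2, 3, 1, 1)

m1  <- minkowski_xy(x, y, 1)   # Manhattan: 2 + 3 + 0 + 6
m2  <- minkowski_xy(x, y, 2)   # Euclidean: sqrt(4 + 9 + 0 + 36)
m50 <- minkowski_xy(x, y, 50)  # approaches the largest difference

c(m1, sum(abs(x - y)))
c(m2, sqrt(sum((x - y)^2)))
c(m50, max(abs(x - y)))
```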
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Pseudocount
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
References
Deza, M. M., & Deza, E. (2009). Encyclopedia of distances. Springer.
Minkowski, H. (1896). Geometrie der Zahlen. Teubner.
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
morisita(),
motyka(),
psym_chisq(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
Examples
minkowski(ex_counts, power = 2) # Equivalent to Euclidean
Morisita dissimilarity
Description
A measure of overlap between samples that is independent of sample size. Requires integer counts.
Usage
morisita(counts, margin = 1L, pairs = NULL, cpus = n_cpus())
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Morisita dissimilarity is defined as:
1 - \frac{2\sum_{i=1}^{n}X_{i}Y_{i}}{\left(\frac{\sum_{i=1}^{n}X_i(X_i - 1)}{X_T(X_T - 1)} + \frac{\sum_{i=1}^{n}Y_i(Y_i - 1)}{Y_T(Y_T - 1)}\right)X_{T}Y_{T}}
Where:
- X_i, Y_i: Absolute counts of the i-th feature.
- X_T, Y_T: Total counts in each sample, e.g. X_T = \sum_{i=1}^{n} X_i.
- n: The number of features.
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
simpson_x <- sum(x * (x - 1)) / (sum(x) * (sum(x) - 1))
simpson_y <- sum(y * (y - 1)) / (sum(y) * (sum(y) - 1))
1 - ((2 * sum(x * y)) / ((simpson_x + simpson_y) * sum(x) * sum(y)))
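Two boundary cases illustrate how the formula behaves. The helper `morisita_xy()` and the count vectors are hypothetical. Samples with no shared features give exactly 1, while identical samples give a value close to, but not exactly, 0 because of the `X_i - 1` terms.

```r
# Hypothetical helper wrapping the base R expression above.
morisita_xy <- function(x, y) {
  simpson_x <- sum(x * (x - 1)) / (sum(x) * (sum(x) - 1))
  simpson_y <- sum(y * (y - 1)) / (sum(y) * (sum(y) - 1))
  1 - (2 * sum(x * y)) / ((simpson_x + simpson_y) * sum(x) * sum(y))
}

identical_xy <- morisita_xy(c(10, 20, 30), c(10, 20, 30))
disjoint_xy  <- morisita_xy(c(10, 20, 0, 0), c(0, 0, 5, 25))

identical_xy  # slightly below zero for these counts
disjoint_xy   # 1
```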
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Morisita, M. (1959). Measuring of interspecific association and similarity between communities. Memoirs of the Faculty of Science, Kyushu University, Series E (Biology), 3, 65-80.
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
motyka(),
psym_chisq(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
Examples
morisita(ex_counts)
Motyka dissimilarity
Description
Also known as the Bray-Curtis dissimilarity when applied to abundance data, but formulated slightly differently.
Usage
motyka(
counts,
margin = 1L,
norm = "none",
pseudocount = NULL,
pairs = NULL,
cpus = n_cpus()
)
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
norm |
Normalize the incoming counts. Options are:
Default: |
pseudocount |
Value added to counts to handle zeros when
|
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Motyka dissimilarity is defined as:
\frac{\sum_{i=1}^{n} \max(X_i, Y_i)}{\sum_{i=1}^{n} (X_i + Y_i)}
Where:
- X_i, Y_i: Absolute abundances of the i-th feature.
- n: The number of features.
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
sum(pmax(x, y)) / sum(x, y)
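The relationship to Bray-Curtis can be made concrete. Since max(X_i, Y_i) = (X_i + Y_i + |X_i - Y_i|) / 2, the Motyka value equals (1 + BC) / 2, where BC is the Bray-Curtis dissimilarity, assumed here to follow its standard definition \sum |X_i - Y_i| / \sum (X_i + Y_i). The count vectors below are hypothetical.

```r
# Hypothetical counts for two samples.
x <- c(4, 0, 1, 7)
y <- c(2, 3, 1, 1)

motyka_xy <- sum(pmax(x, y)) / sum(x, y)

# Standard Bray-Curtis definition (an assumption of this sketch).
bray_xy <- sum(abs(x - y)) / sum(x, y)

motyka_xy
(1 + bray_xy) / 2  # same value
```

One consequence is that the Motyka value ranges from 0.5 (identical samples) to 1 (no shared features).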
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Pseudocount
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
References
Motyka, J. (1947). O celach i metodach badan geobotanicznych. Annales Universitatis Mariae Curie-Sklodowska, Sectio C, 3, 1-168.
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
psym_chisq(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
Examples
motyka(ex_counts)
Number of CPU Cores
Description
A thin wrapper around parallelly::availableCores(). If the parallelly
package is not installed, then it falls back to
parallel::detectCores(all.tests = TRUE, logical = TRUE). Returns 1 if
pthread support is unavailable or when the number of CPUs cannot be
determined.
Usage
n_cpus()
Value
A scalar integer, guaranteed to be at least 1.
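The fallback logic described above might look roughly like the following. This is a hypothetical re-implementation for illustration, not the package source; `n_cpus_sketch()` is an invented name, and the pthread check is omitted because it cannot be reproduced from R code alone.

```r
# Hypothetical re-implementation for illustration; not the package source.
n_cpus_sketch <- function() {
  if (requireNamespace("parallelly", quietly = TRUE)) {
    n <- parallelly::availableCores()
  } else {
    n <- parallel::detectCores(all.tests = TRUE, logical = TRUE)
  }
  # detectCores() can return NA when the count cannot be determined.
  if (length(n) != 1 || is.na(n) || n < 1) n <- 1L
  as.integer(n)
}

n_detected <- n_cpus_sketch()
n_detected
```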
Examples
n_cpus()
Normalized Weighted UniFrac
Description
Weighted UniFrac normalized by the tree length to allow comparison between trees.
Usage
normalized_unifrac(
counts,
tree = NULL,
margin = 1L,
pairs = NULL,
cpus = n_cpus()
)
Arguments
counts |
A numeric matrix of count data (samples |
tree |
A |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Normalized Weighted UniFrac distance is defined as:
\frac{\sum_{i=1}^{n} L_i|P_i - Q_i|}{\sum_{i=1}^{n} L_i(P_i + Q_i)}
Where:
- n: The number of branches in the tree.
- L_i: The length of the i-th branch.
- P_i, Q_i: The proportion of the community descending from branch i in samples P and Q.
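No base R equivalent is given for this metric, so the sketch below works through the formula on a tiny hand-coded tree. Everything here is hypothetical: the tree `((A:1,B:2):0.5,(C:1,D:3):1.5);`, the `descends` lookup matrix, and the helper `norm_unifrac_xy()`. It is meant to show the arithmetic only, not how the package traverses a `phylo` object.

```r
# Hypothetical rooted tree: ((A:1,B:2):0.5,(C:1,D:3):1.5);
branch_len <- c(A = 1, B = 2, AB = 0.5, C = 1, D = 3, CD = 1.5)

# Rows are branches, columns are tips; 1 means the tip descends from the branch.
descends <- rbind(
  A  = c(1, 0, 0, 0),
  B  = c(0, 1, 0, 0),
  AB = c(1, 1, 0, 0),
  C  = c(0, 0, 1, 0),
  D  = c(0, 0, 0, 1),
  CD = c(0, 0, 1, 1)
)
colnames(descends) <- c("A", "B", "C", "D")

# x and y are tip-level counts in the order A, B, C, D.
norm_unifrac_xy <- function(x, y) {
  P <- as.vector(descends %*% (x / sum(x)))  # proportion below each branch
  Q <- as.vector(descends %*% (y / sum(y)))
  sum(branch_len * abs(P - Q)) / sum(branch_len * (P + Q))
}

d_disjoint <- norm_unifrac_xy(c(5, 5, 0, 0), c(0, 0, 2, 2))  # no shared branches
d_same     <- norm_unifrac_xy(c(5, 5, 0, 0), c(1, 1, 0, 0))  # same proportions
d_partial  <- norm_unifrac_xy(c(5, 5, 0, 0), c(5, 0, 5, 0))  # partial overlap

c(d_disjoint, d_same, d_partial)
```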
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Lozupone, C. A., Hamady, M., Kelley, S. T., & Knight, R. (2007). Quantitative and qualitative beta diversity measures lead to different insights into factors that structure microbial communities. Applied and Environmental Microbiology, 73(5), 1576-1585. doi:10.1128/AEM.01996-06
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Phylogenetic metrics:
faith(),
generalized_unifrac(),
unweighted_unifrac(),
variance_adjusted_unifrac(),
weighted_unifrac()
Examples
normalized_unifrac(ex_counts, tree = ex_tree)
Observed Features
Description
The count of unique features (richness) in a sample.
Usage
observed(counts, margin = 1L, cpus = n_cpus())
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
cpus |
How many parallel processing threads should be used. The
default, |
Details
Observed features is defined simply as the number of features with non-zero abundance:
n
Base R Equivalent:
x <- ex_counts[1,]
sum(x > 0)
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
See Also
alpha_div(), vignette('adiv')
Other Richness metrics:
ace(),
chao1(),
margalef(),
menhinick(),
squares()
Examples
observed(ex_counts)
Otsuka-Ochiai dissimilarity
Description
The complement of the Otsuka-Ochiai coefficient, which is equivalent to cosine similarity on binary (presence/absence) data.
Usage
ochiai(counts, margin = 1L, pairs = NULL, cpus = n_cpus())
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Otsuka-Ochiai dissimilarity is defined as:
1 - \frac{J}{\sqrt{AB}}
Where:
- A, B: Number of features in each sample.
- J: Number of features in common (intersection).
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
1 - sum(x & y) / sqrt(sum(x>0) * sum(y>0))
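The link to cosine similarity can be verified directly: after converting counts to presence/absence vectors, J is their dot product and \sqrt{AB} is the product of their norms. The count vectors below are hypothetical.

```r
# Hypothetical counts for two samples.
x <- c(4, 0, 1, 7, 0)
y <- c(2, 3, 0, 1, 0)

# Presence/absence (binary) versions.
a <- as.numeric(x > 0)
b <- as.numeric(y > 0)

# Cosine similarity of the binary vectors.
cosine <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Otsuka-Ochiai dissimilarity from the formula above.
ochiai_xy <- 1 - sum(x & y) / sqrt(sum(x > 0) * sum(y > 0))

ochiai_xy
1 - cosine  # same value
```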
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Ochiai, A. (1957). Zoogeographic studies on the soleoid fishes found in Japan and its neighbouring regions. Bulletin of the Japanese Society of Scientific Fisheries, 22, 526-530.
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Presence/Absence metrics:
hamming(),
jaccard(),
sorensen()
Examples
ochiai(ex_counts)
Probabilistic Symmetric Chi-Squared distance
Description
A chi-squared based distance metric for comparing probability distributions.
Usage
psym_chisq(counts, margin = 1L, pairs = NULL, cpus = n_cpus())
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Probabilistic Symmetric \chi^2 distance is defined as:
2\sum_{i=1}^{n}\frac{(P_i - Q_i)^2}{P_i + Q_i}
Where:
- P_i, Q_i: Proportional abundances of the i-th feature.
- n: The number of features.
Base R Equivalent:
x <- ex_counts[1,]; p <- x / sum(x)
y <- ex_counts[2,]; q <- y / sum(y)
2 * sum((p - q)^2 / (p + q))
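One practical caveat when reproducing this in base R: a feature that is absent from both samples gives 0 / 0, which R evaluates to NaN. The sketch below, using hypothetical counts, drops such features before summing. Treating them as contributing zero is an assumption of this example, not a statement about the package internals.

```r
# Hypothetical counts; the fourth feature is absent from both samples.
x <- c(4, 0, 1, 0)
y <- c(2, 3, 0, 0)

p <- x / sum(x)
q <- y / sum(y)

# Direct evaluation hits 0 / 0 for the shared-absent feature.
naive <- 2 * sum((p - q)^2 / (p + q))
naive  # NaN

# Assumption: shared-absent features contribute nothing to the distance.
keep    <- (p + q) > 0
guarded <- 2 * sum((p[keep] - q[keep])^2 / (p[keep] + q[keep]))
guarded
```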
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Cha, S.-H. (2007). Comprehensive survey on distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences, 1(4), 300–307.
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
Examples
psym_chisq(ex_counts)
Rarefy Observation Counts
Description
Sub-sample observations from a feature table such that all samples have the same library size (depth). This is performed via random sampling without replacement.
Usage
rarefy(
counts,
depth = NULL,
seed = 0,
times = NULL,
drop = TRUE,
margin = 1L,
cpus = n_cpus(),
warn = interactive()
)
Arguments
counts |
A numeric matrix or sparse matrix object (e.g., |
depth |
The number of observations to keep per sample. If |
seed |
An integer seed for the random number generator. Providing
the same seed guarantees reproducible results. Default: |
times |
The number of independent rarefactions to perform. If set,
returns a list of matrices. Seeds for subsequent iterations are
sequential ( |
drop |
Logical. If |
margin |
The margin containing samples. |
cpus |
How many parallel processing threads should be used. The
default, |
warn |
Logical. If |
Value
A rarefied matrix. The output class (matrix, dgCMatrix, etc.)
matches the input class.
Auto-Depth Selection
If depth is NULL, the function defaults to the highest depth that retains
at least 10% of the total observations in the dataset.
Dropping vs. Retaining Samples
If a sample has fewer observations than the specified depth:
- drop = TRUE (Default): The sample is removed from the output matrix.
- drop = FALSE: The sample is returned unmodified (with its original counts). It is not rarefied or zeroed out.
Zero-Sum Features
Features (OTUs, ASVs, Genes) that lose all observations during rarefaction are always retained as columns/rows of zeros. This ensures the output matrix dimensions remain consistent with the input (barring dropped samples).
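Sampling without replacement for a single sample can be sketched in base R as follows. The helper `rarefy_one()` is hypothetical and uses R's own random number generator via `set.seed()`, so it will not reproduce the package's output for a given `seed`. Note that features with zero counts stay in the result, as described above.

```r
# Hypothetical helper: rarefy one sample's counts to a fixed depth.
rarefy_one <- function(x, depth) {
  # One entry per observation, labelled by feature index.
  obs  <- rep(seq_along(x), times = x)
  # Draw `depth` observations without replacement.
  kept <- obs[sample.int(length(obs), size = depth, replace = FALSE)]
  # Count the kept observations per feature, retaining zero-count features.
  out  <- tabulate(kept, nbins = length(x))
  names(out) <- names(x)
  out
}

set.seed(1)
x <- c(OTU1 = 0, OTU2 = 9, OTU3 = 5, OTU4 = 0, OTU5 = 7)  # 21 observations
r <- rarefy_one(x, depth = 18)

r
sum(r)  # 18
```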
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Examples
# A 4-sample x 5-OTU matrix with samples in rows.
counts <- matrix(c(0,0,0,0,0,8,9,10,5,5,5,5,2,0,0,0,6,5,7,0), 4, 5,
                 dimnames = list(LETTERS[1:4], paste0('OTU', 1:5)))
counts
rowSums(counts)
# Rarefy all samples to a depth of 18.
# Sample 'A' (13 counts) and 'D' (15 counts) will be dropped.
r_mtx <- rarefy(counts, depth = 18)
r_mtx
rowSums(r_mtx)
# Keep under-sampled samples by setting `drop = FALSE`.
# Samples 'A' and 'D' are returned with their original counts.
r_mtx <- rarefy(counts, depth = 18, drop = FALSE)
r_mtx
rowSums(r_mtx)
# Perform 3 independent rarefactions.
r_list <- rarefy(counts, times = 3)
length(r_list)
# Sparse matrices are supported and their class is preserved.
if (requireNamespace('Matrix', quietly = TRUE)) {
  counts_dgC <- Matrix::Matrix(counts, sparse = TRUE)
  str(rarefy(counts_dgC))
}
Read a Newick-formatted phylogenetic tree
Description
A phylogenetic tree is required for computing UniFrac distance matrices. You can load a tree from a file or by providing the tree string directly. This tree must be in Newick format, also known as parenthetic format and New Hampshire format.
Usage
read_tree(newick, underscores = FALSE)
Arguments
newick |
Input data as either a file path, URL, or Newick string. Compressed (gzip or bzip2) files are also supported. |
underscores |
If |
Value
A phylo class object representing the tree.
Examples
tree <- read_tree("
(A:0.99,((B:0.87,C:0.89):0.51,(((D:0.16,(E:0.83,F:0.96)
:0.94):0.69,(G:0.92,(H:0.62,I:0.85):0.54):0.23):0.74,J:0.1
2):0.43):0.67);")
class(tree)
Shannon Diversity Index
Description
A commonly used diversity index accounting for both abundance and evenness.
Usage
shannon(counts, margin = 1L, cpus = n_cpus())
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Shannon index (entropy) is defined as:
-\sum_{i = 1}^{n} P_i \times \ln(P_i)
Where:
- n: The number of features.
- P_i: Proportional abundance of the i-th feature.
Base R Equivalent:
x <- ex_counts[1,]
p <- x[x > 0] / sum(x)  # drop zeros; 0 * log(0) is NaN in R
-sum(p * log(p))
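To see how the index responds to evenness, the sketch below compares a perfectly even community with a heavily dominated one. The helper `shannon_x()` and the count vectors are hypothetical. For n equally abundant features the index reaches its maximum, ln(n).

```r
# Hypothetical helper; zero counts are dropped because 0 * log(0) is NaN in R.
shannon_x <- function(x) {
  p <- x[x > 0] / sum(x)
  -sum(p * log(p))
}

even      <- shannon_x(c(25, 25, 25, 25))   # equals log(4)
dominated <- shannon_x(c(97, 1, 1, 1, 0))   # one feature holds most counts

c(even, log(4))
dominated
```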
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379-423.
Shannon, C. E., & Weaver, W. (1949). The Mathematical Theory of Communication. University of Illinois Press.
See Also
alpha_div(), vignette('adiv')
Other Diversity metrics:
brillouin(),
fisher(),
inv_simpson(),
simpson()
Examples
shannon(ex_counts)
Gini-Simpson Index
Description
The probability that two entities taken at random from the dataset represent different types.
Usage
simpson(counts, margin = 1L, cpus = n_cpus())
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Gini-Simpson index is defined as:
1 - \sum_{i = 1}^{n} P_i^2
Where:
- n: The number of features.
- P_i: Proportional abundance of the i-th feature.
Base R Equivalent:
x <- ex_counts[1,]
p <- x / sum(x)
1 - sum(p^2)
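Two extreme cases, computed with a hypothetical helper `simpson_x()`, bracket the range of the index: a single-feature community gives 0, and n equally abundant features give 1 - 1/n.

```r
# Hypothetical helper wrapping the base R expression above.
simpson_x <- function(x) {
  p <- x / sum(x)
  1 - sum(p^2)
}

single <- simpson_x(c(100, 0, 0, 0))    # two random draws always match: 0
even   <- simpson_x(c(25, 25, 25, 25))  # 1 - 1/4

single
even
```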
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Simpson, E. H. (1949). Measurement of diversity. Nature, 163, 688. doi:10.1038/163688a0
See Also
alpha_div(), vignette('adiv')
Other Diversity metrics:
brillouin(),
fisher(),
inv_simpson(),
shannon()
Examples
simpson(ex_counts)
Soergel distance
Description
A distance metric related to the Bray-Curtis and Jaccard indices.
Usage
soergel(
counts,
margin = 1L,
norm = "none",
pseudocount = NULL,
pairs = NULL,
cpus = n_cpus()
)
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
norm |
Normalize the incoming counts. Options are:
Default: |
pseudocount |
Value added to counts to handle zeros when
|
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Soergel distance is defined as:
\frac{\sum_{i=1}^{n} |X_i - Y_i|}{\sum_{i=1}^{n} \max(X_i, Y_i)}
Where:
- X_i, Y_i: Absolute abundances of the i-th feature.
- n: The number of features.
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
sum(abs(x - y)) / sum(pmax(x, y))
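The relationship to the Jaccard index shows up on presence/absence data: with binary inputs the numerator counts features found in only one sample and the denominator counts features found in either, so the Soergel distance equals the Jaccard distance, assumed here to follow its standard definition 1 - |intersection| / |union|. The count vectors below are hypothetical.

```r
# Hypothetical counts for two samples.
x <- c(4, 0, 1, 7, 0)
y <- c(2, 3, 0, 1, 0)

# Soergel distance on the raw counts.
soergel_xy <- sum(abs(x - y)) / sum(pmax(x, y))

# Presence/absence (binary) versions.
a <- as.numeric(x > 0)
b <- as.numeric(y > 0)

soergel_bin <- sum(abs(a - b)) / sum(pmax(a, b))

# Standard Jaccard distance (an assumption of this sketch).
jaccard_bin <- 1 - sum(a & b) / sum(a | b)

soergel_xy
c(soergel_bin, jaccard_bin)  # identical
```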
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Pseudocount
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
References
Cha, S.-H. (2007). Comprehensive survey on distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences, 1(4), 300–307.
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
Examples
soergel(ex_counts)
Dice-Sorensen dissimilarity
Description
A statistic used for comparing the similarity of two samples.
Usage
sorensen(counts, margin = 1L, pairs = NULL, cpus = n_cpus())
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Dice-Sorensen dissimilarity is defined as:
\frac{2J}{(A + B)}
Where:
- A, B: Number of features in each sample.
- J: Number of features in common (intersection).
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
2 * sum(x & y) / sum(x>0, y>0)
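The terms A, B, and J can be computed separately to see how the compact expression is assembled. This sketch uses two hypothetical count vectors and evaluates the expression exactly as written in this section; it makes no claim about the value sorensen() itself returns.

```r
# Hypothetical count vectors for two samples over five features.
x <- c(5, 0, 3, 1, 0)
y <- c(2, 4, 0, 7, 0)

A <- sum(x > 0)           # features present in the first sample
B <- sum(y > 0)           # features present in the second sample
J <- sum(x > 0 & y > 0)   # features present in both samples

ratio <- 2 * J / (A + B)

# Same value from the compact form; `x & y` treats non-zero counts as TRUE.
compact <- 2 * sum(x & y) / sum(x > 0, y > 0)
c(ratio = ratio, compact = compact)
```

Only presence or absence matters here, so rescaling either vector leaves the result unchanged.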
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Sørensen, T. (1948). A method of establishing groups of equal amplitude in plant sociology based on similarity of species content. Kongelige Danske Videnskabernes Selskab, Biologiske Skrifter, 5, 1-34.
Dice, L. R. (1945). Measures of the amount of ecologic association between species. Ecology, 26(3), 297–302. doi:10.2307/1932409
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Presence/Absence metrics:
hamming(),
jaccard(),
ochiai()
Examples
sorensen(ex_counts)
Squared Chi-Squared distance
Description
The squared version of the Chi-Squared distance.
Usage
squared_chisq(counts, margin = 1L, pairs = NULL, cpus = n_cpus())
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Squared \chi^2 distance is defined as:
\sum_{i=1}^{n}\frac{(P_i - Q_i)^2}{P_i + Q_i}
Where:
- P_i, Q_i: Proportional abundances of the i-th feature.
- n: The number of features.
Base R Equivalent:
x <- ex_counts[1,]; p <- x / sum(x)
y <- ex_counts[2,]; q <- y / sum(y)
sum((p - q)^2 / (p + q))
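When a feature is absent from both samples, the term (P_i - Q_i)^2 / (P_i + Q_i) evaluates to 0/0 in base R. The sketch below drops such features before summing. Treating them as contributing zero is an assumption made for this illustration, and the vectors are hypothetical.

```r
# Hypothetical counts; the fourth feature is absent from both samples.
x <- c(10, 0, 30, 0)
y <- c(20, 20, 0, 0)

p <- x / sum(x)   # 0.25, 0.00, 0.75, 0.00
q <- y / sum(y)   # 0.50, 0.50, 0.00, 0.00

# Naive evaluation hits 0 / 0 for the fourth feature.
naive <- sum((p - q)^2 / (p + q))

# ASSUMPTION: features missing from both samples contribute nothing.
keep <- (p + q) > 0
d <- sum((p[keep] - q[keep])^2 / (p[keep] + q[keep]))
d
```

With these inputs the three retained terms are 1/12, 1/2, and 3/4, which sum to 4/3.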
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Cha, S.-H. (2007). Comprehensive survey on distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences, 1(4), 300–307.
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
soergel(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
Examples
squared_chisq(ex_counts)
Squared Chord distance
Description
The squared version of the Chord distance.
Usage
squared_chord(counts, margin = 1L, pairs = NULL, cpus = n_cpus())
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Squared Chord distance is defined as:
\sum_{i=1}^{n}\left(\sqrt{P_i} - \sqrt{Q_i}\right)^2
Where:
- P_i, Q_i: Proportional abundances of the i-th feature.
- n: The number of features.
Base R Equivalent:
x <- ex_counts[1,]; p <- x / sum(x)
y <- ex_counts[2,]; q <- y / sum(y)
sum((sqrt(p) - sqrt(q)) ^ 2)
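Because P and Q each sum to one, expanding the square in the definition gives 2 - 2 * sum(sqrt(P_i * Q_i)). The following sketch checks that identity numerically on two hypothetical count vectors; it is a worked illustration of the formula, not a description of how the package computes it.

```r
# Hypothetical count vectors for two samples.
x <- c(9, 16, 0, 25)
y <- c(4, 0, 36, 10)

# The definition is stated on proportional abundances.
p <- x / sum(x)
q <- y / sum(y)

# Direct evaluation of the definition.
direct <- sum((sqrt(p) - sqrt(q))^2)

# Expanded form: sum(p) + sum(q) - 2 * sum(sqrt(p * q)), with sum(p) = sum(q) = 1.
expanded <- 2 - 2 * sum(sqrt(p * q))

c(direct = direct, expanded = expanded)
```

The expanded form also shows the range: the value is 0 for identical compositions and 2 when the samples share no features.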
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Legendre, P., & Legendre, L. (2012). Numerical ecology. Elsevier.
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
soergel(),
squared_chisq(),
squared_euclidean(),
topsoe(),
wave_hedges()
Examples
squared_chord(ex_counts)
Squared Euclidean distance
Description
The squared Euclidean distance between two vectors.
Usage
squared_euclidean(
counts,
margin = 1L,
norm = "none",
pseudocount = NULL,
pairs = NULL,
cpus = n_cpus()
)
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
norm |
Normalize the incoming counts. Options are:
Default: |
pseudocount |
Value added to counts to handle zeros when
|
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Squared Euclidean distance is defined as:
\sum_{i=1}^{n} (X_i - Y_i)^2
Where:
- X_i, Y_i: Absolute abundances of the i-th feature.
- n: The number of features.
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
sum((x-y)^2)
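As a cross-check of the definition, the same quantity can be obtained by squaring the output of stats::dist(). This sketch uses a hypothetical three-sample matrix with samples in rows and no normalization.

```r
# Hypothetical count matrix: samples in rows, features in columns.
counts <- rbind(
  s1 = c(3, 0, 5),
  s2 = c(1, 4, 5),
  s3 = c(0, 0, 2)
)

# Direct evaluation for one pair: 2^2 + 4^2 + 0^2 = 20.
manual <- sum((counts["s1", ] - counts["s2", ])^2)

# stats::dist() returns Euclidean distances; squaring keeps the dist structure.
sq <- as.matrix(dist(counts, method = "euclidean")^2)

c(manual = manual, via_dist = sq["s1", "s2"])
```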
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Pseudocount
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
References
Legendre, P., & Legendre, L. (2012). Numerical ecology. Elsevier.
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
soergel(),
squared_chisq(),
squared_chord(),
topsoe(),
wave_hedges()
Examples
squared_euclidean(ex_counts)
Squares Richness Estimator
Description
A richness estimator based on the concept of "squares": it combines the number of features observed once (singletons) with the sum of squared feature counts.
Usage
squares(counts, margin = 1L, cpus = n_cpus())
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Squares estimator is defined as:
n + \frac{(F_1)^2 \sum_{i=1}^{n} (X_i)^2}{X_T^2 - nF_1}
Where:
- n: The number of observed features.
- X_T: Total of all counts.
- F_1: Number of features observed once (singletons).
- X_i: Integer count of the i-th feature.
Base R Equivalent:
x <- ex_counts[1,]
N <- sum(x)       # sampling depth
S <- sum(x > 0)   # observed features
F1 <- sum(x == 1) # singletons
S + ((sum(x^2) * (F1^2)) / ((N^2) - F1 * S))
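The formula can be wrapped in a function and applied to each sample. This is a minimal sketch on a hypothetical two-sample matrix with samples in rows; the helper name squares_est() is an assumption for illustration and is not part of the package.

```r
# Squares estimator for a single sample's integer counts.
squares_est <- function(x) {
  N  <- sum(x)        # sampling depth (X_T)
  S  <- sum(x > 0)    # observed features (n)
  F1 <- sum(x == 1)   # singletons (F_1)
  S + (F1^2 * sum(x^2)) / (N^2 - F1 * S)
}

# Hypothetical counts: s1 has three singletons, s2 has none.
counts <- rbind(
  s1 = c(1, 1, 2, 5, 0, 1),
  s2 = c(2, 3, 4, 0, 0, 0)
)

# One estimate per row (sample).
est <- apply(counts, 1, squares_est)
est
```

For s1 the terms are N = 10, S = 5, F1 = 3, and a sum of squares of 32, giving 5 + 288/85. For s2 there are no singletons, so the estimate equals the observed richness of 3.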
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Alroy, J. (2018). Limits to species richness estimates based on subsampling. Paleobiology, 44(2), 177-194. doi:10.1017/pab.2017.38
See Also
alpha_div(), vignette('adiv')
Other Richness metrics:
ace(),
chao1(),
margalef(),
menhinick(),
observed()
Examples
squares(ex_counts)
Topsoe distance
Description
A symmetric divergence measure based on the Jensen-Shannon divergence.
Usage
topsoe(counts, margin = 1L, pairs = NULL, cpus = n_cpus())
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Topsoe distance is defined as:
\sum_{i=1}^{n}P_i\ln\left(\frac{2P_i}{P_i + Q_i}\right) + \sum_{i=1}^{n}Q_i\ln\left(\frac{2Q_i}{P_i + Q_i}\right)
Where:
- P_i, Q_i: Proportional abundances of the i-th feature.
- n: The number of features.
Base R Equivalent:
x <- ex_counts[1,]; p <- x / sum(x)
y <- ex_counts[2,]; q <- y / sum(y)
sum(p * log(2 * p / (p+q)), q * log(2 * q / (p+q)))
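Each of the two sums in the definition is a Kullback-Leibler divergence from one sample to the midpoint M = (P + Q) / 2. In base R a zero proportion produces 0 * log(0), which is NaN. The sketch below applies the usual convention that such terms contribute zero. That convention, the helper kl(), and the example vectors are assumptions for illustration, not confirmed package behavior.

```r
# Kullback-Leibler divergence, skipping terms where a == 0.
# ASSUMPTION: 0 * ln(0) is treated as 0.
kl <- function(a, b) {
  nz <- a > 0
  sum(a[nz] * log(a[nz] / b[nz]))
}

# Hypothetical counts; each sample is missing some features.
x <- c(6, 0, 2, 2)
y <- c(0, 5, 5, 0)
p <- x / sum(x)
q <- y / sum(y)

# Naive evaluation of the formula is NaN because of the zeros.
naive <- sum(p * log(2 * p / (p + q)), q * log(2 * q / (p + q)))

# Zero-safe evaluation via the midpoint distribution.
m <- (p + q) / 2
topsoe_val <- kl(p, m) + kl(q, m)

# The Jensen-Shannon divergence (natural log) is half of this value.
js_divergence <- topsoe_val / 2
c(topsoe = topsoe_val, js = js_divergence)
```

With natural logarithms the result lies between 0 and 2 * log(2).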
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Topsoe, F. (2000). Some inequalities for information divergence and related measures of discrimination. IEEE Transactions on Information Theory, 46(4), 1602-1609. doi:10.1109/18.850703
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
wave_hedges()
Examples
topsoe(ex_counts)
Unweighted UniFrac
Description
A phylogenetic distance metric that accounts for the presence/absence of lineages.
Usage
unweighted_unifrac(
counts,
tree = NULL,
margin = 1L,
pairs = NULL,
cpus = n_cpus()
)
Arguments
counts |
A numeric matrix of count data (samples |
tree |
A |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Unweighted UniFrac distance is defined as:
\frac{1}{n}\sum_{i=1}^{n} L_i|A_i - B_i|
Where:
- n: The number of branches in the tree.
- L_i: The length of the i-th branch.
- A_i, B_i: Binary values (0 or 1) indicating whether descendants of branch i are present in sample A or B.
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Lozupone, C., & Knight, R. (2005). UniFrac: a new phylogenetic method for comparing microbial communities. Applied and Environmental Microbiology, 71(12), 8228-8235. doi:10.1128/AEM.71.12.8228-8235.2005
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Phylogenetic metrics:
faith(),
generalized_unifrac(),
normalized_unifrac(),
variance_adjusted_unifrac(),
weighted_unifrac()
Examples
unweighted_unifrac(ex_counts, tree = ex_tree)
Variance-Adjusted Weighted UniFrac
Description
A weighted UniFrac that adjusts for the expected variance of the metric.
Usage
variance_adjusted_unifrac(
counts,
tree = NULL,
margin = 1L,
pairs = NULL,
cpus = n_cpus()
)
Arguments
counts |
A numeric matrix of count data (samples |
tree |
A |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Variance-Adjusted Weighted UniFrac distance is defined as:
\frac{\sum_{i=1}^{n} L_i\frac{|P_i - Q_i|}{\sqrt{(P_i + Q_i)(2 - P_i - Q_i)}} }{\sum_{i=1}^{n} L_i\frac{P_i + Q_i}{\sqrt{(P_i + Q_i)(2 - P_i - Q_i)}} }
Where:
- n: The number of branches in the tree.
- L_i: The length of the i-th branch.
- P_i, Q_i: The proportion of the community descending from branch i in samples P and Q.
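The formula can be evaluated directly once branch lengths and per-branch proportions are known. The sketch below does so for hypothetical branch-level vectors. It skips branches where P_i + Q_i is 0 or 2, since the square-root term is zero there; that filter and the helper name va_wunifrac() are assumptions for this illustration, not the package's documented behavior.

```r
# Variance-adjusted weighted UniFrac from branch-level inputs.
va_wunifrac <- function(L, P, Q) {
  s <- P + Q
  # ASSUMPTION: skip branches whose variance term (s * (2 - s)) is zero.
  keep <- s > 0 & s < 2
  w <- sqrt(s[keep] * (2 - s[keep]))
  sum(L[keep] * abs(P[keep] - Q[keep]) / w) / sum(L[keep] * s[keep] / w)
}

# Hypothetical four-branch tree: lengths and per-branch community proportions.
L <- c(1, 2, 0.5, 3)
P <- c(0.5, 0.50, 1.00, 0.00)
Q <- c(0.0, 0.25, 0.25, 0.75)

d <- va_wunifrac(L, P, Q)
d
```

Identical samples give 0, and because |P_i - Q_i| never exceeds P_i + Q_i the result cannot exceed 1.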
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Chang, Q., Luan, Y., & Sun, F. (2011). Variance adjusted weighted UniFrac: a powerful beta diversity measure for comparing communities based on phylogeny. BMC Bioinformatics, 12, 118. doi:10.1186/1471-2105-12-118
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Phylogenetic metrics:
faith(),
generalized_unifrac(),
normalized_unifrac(),
unweighted_unifrac(),
weighted_unifrac()
Examples
variance_adjusted_unifrac(ex_counts, tree = ex_tree)
Wave Hedges distance
Description
A distance metric derived from the Hedges' distance.
Usage
wave_hedges(
counts,
margin = 1L,
norm = "none",
pseudocount = NULL,
pairs = NULL,
cpus = n_cpus()
)
Arguments
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
norm |
Normalize the incoming counts. Options are:
Default: |
pseudocount |
Value added to counts to handle zeros when
|
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Wave Hedges distance is defined as:
\sum_{i=1}^{n}\frac{|X_i - Y_i|}{\max(X_i, Y_i)}
Where:
- X_i, Y_i: Absolute abundances of the i-th feature.
- n: The number of features.
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
sum(abs(x - y) / pmax(x, y))
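Since |X_i - Y_i| equals max(X_i, Y_i) - min(X_i, Y_i), each term can also be written as 1 - min(X_i, Y_i) / max(X_i, Y_i). The sketch below confirms the two forms agree on hypothetical vectors and contrasts the result with the Soergel distance, which divides the summed differences by the summed maxima instead of summing per-feature ratios.

```r
# Hypothetical abundance vectors with no feature absent from both samples.
x <- c(4, 1, 6, 3)
y <- c(2, 5, 6, 1)

# Form given in this section: sum of per-feature ratios.
wave_hedges_a <- sum(abs(x - y) / pmax(x, y))

# Equivalent form: one minus the min/max ratio, summed over features.
wave_hedges_b <- sum(1 - pmin(x, y) / pmax(x, y))

# Soergel, for contrast: ratio of sums rather than sum of ratios.
soergel_val <- sum(abs(x - y)) / sum(pmax(x, y))

c(wave_hedges = wave_hedges_a, alt_form = wave_hedges_b, soergel = soergel_val)
```

Wave Hedges can grow with the number of features (here 0.5 + 0.8 + 0 + 2/3), whereas Soergel stays between 0 and 1 (here 8/18).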
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Pseudocount
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
References
Cha, S.-H. (2007). Comprehensive survey on distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences, 1(4), 300–307.
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe()
Examples
wave_hedges(ex_counts)
Weighted UniFrac
Description
A phylogenetic distance metric that accounts for the relative abundance of lineages.
Usage
weighted_unifrac(
counts,
tree = NULL,
margin = 1L,
pairs = NULL,
cpus = n_cpus()
)
Arguments
counts |
A numeric matrix of count data (samples |
tree |
A |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Details
The Weighted UniFrac distance is defined as:
\sum_{i=1}^{n} L_i|P_i - Q_i|
Where:
- n: The number of branches in the tree.
- L_i: The length of the i-th branch.
- P_i, Q_i: The proportion of the community descending from branch i in samples P and Q.
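To show where P_i and Q_i come from, the sketch below works through a hypothetical three-leaf tree, ((A:1,B:2):0.5,C:3), in base R. The branch-by-leaf membership matrix descends, the branch lengths, and the counts are all invented for illustration; the package itself takes a phylo object and does not necessarily build such a matrix.

```r
# One row per branch, one column per leaf; 1 marks a leaf below that branch.
# Hypothetical tree: ((A:1,B:2):0.5,C:3)
descends <- matrix(
  c(1, 0, 0,    # branch leading to leaf A
    0, 1, 0,    # branch leading to leaf B
    1, 1, 0,    # internal branch above A and B
    0, 0, 1),   # branch leading to leaf C
  nrow = 4, byrow = TRUE,
  dimnames = list(c("A", "B", "AB", "C"), c("A", "B", "C"))
)
L <- c(A = 1, B = 2, AB = 0.5, C = 3)   # branch lengths

# Hypothetical leaf counts for two samples.
x <- c(A = 2, B = 2, C = 0)
y <- c(A = 0, B = 1, C = 3)

# Proportion of each community descending from each branch.
P <- drop(descends %*% (x / sum(x)))   # 0.50, 0.50, 1.00, 0.00
Q <- drop(descends %*% (y / sum(y)))   # 0.00, 0.25, 0.25, 0.75

# Weighted UniFrac: branch lengths weighted by the difference in proportions.
d <- sum(L * abs(P - Q))
d
```

The four branch contributions are 0.5, 0.5, 0.375, and 2.25, for a total of 3.625.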
Input Types
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
- phyloseq
- rbiom
- SummarizedExperiment
- TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
References
Lozupone, C. A., Hamady, M., Kelley, S. T., & Knight, R. (2007). Quantitative and qualitative beta diversity measures lead to different insights into factors that structure microbial communities. Applied and Environmental Microbiology, 73(5), 1576-1585. doi:10.1128/AEM.01996-06
See Also
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Phylogenetic metrics:
faith(),
generalized_unifrac(),
normalized_unifrac(),
unweighted_unifrac(),
variance_adjusted_unifrac()
Examples
weighted_unifrac(ex_counts, tree = ex_tree)