The bioinformatic evaluation of gene co-expression often begins with correlation-based analyses. However, this approach lacks statistical validity when applied to relative data. This includes, for example, biological count data generated by high-throughput RNA-sequencing, chromatin immunoprecipitation (ChIP), ChIP-sequencing, Methyl-Capture sequencing, and other techniques. This package provides a set of functions for measuring dependence between relative features using compositional data analysis. Specifically, this package implements two measures of proportionality, \(\phi\) and \(\rho\), as introduced in Lovell et al. (2015) and expounded in Erb and Notredame (2016). Unlike correlation, these metrics give the same result for both relative and absolute data. Pairs that are strongly proportional in relative space are also strongly correlated in absolute space. In this way, proportionality avoids the pitfall of spurious correlation.
Let \(A_i\) and \(A_j\) each represent a log-ratio transformed feature vector (e.g., the transformed values of one gene, out of \(D\) genes in total, measured across \(N\) samples). We then define the metrics \(\phi\) and \(\rho\) accordingly:
\[\phi(A_i, A_j) = \frac{var(A_i - A_j)}{var(A_i)}\]
\[\rho(A_i, A_j) = 1 - \frac{var(A_i - A_j)}{var(A_i) + var(A_j)}\]
Above, we use the log-ratio transformation in order to transform the data in a manner that respects the nature of relative data. In other words, log-ratio transformation yields the same result whether applied to absolute or relative data. In this package, we consider two log-ratio transformations of the subject vector \(x\), the centered log-ratio transformation (clr) and the additive log-ratio transformation (alr). We define these accordingly:
\[\textrm{clr(x)} = \left[\ln\frac{x_1}{g(\textrm{x})};...;\ln\frac{x_D}{g(\textrm{x})}\right]\]
\[\textrm{alr(x)} = \left[\ln\frac{x_1}{x_D};...;\ln\frac{x_{D-1}}{x_D}\right]\]
In clr-transformation, sample vectors undergo a transformation based on the logarithm of the ratio between the individual elements and the geometric mean of the vector, \(g(\textrm{x}) = \sqrt[D]{x_1...x_D}\). In alr-transformation, sample vectors undergo a transformation based on the logarithm of the ratio between the individual elements and a chosen reference feature (written above as \(x_D\)). Although these transformations differ in definition, we will sometimes refer to them jointly with the acronym *lr.
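As an illustration of the two definitions above, the following base R sketch applies them row by row to a small matrix of positive values. The helper names clr_sketch and alr_sketch are hypothetical and are not the package's own functions; the sketch follows the formulas literally, so the alr result drops the reference column.

```r
# Hypothetical helpers written from the formulas above (not propr internals).
# Each row of `counts` is one subject vector x with D positive components.
clr_sketch <- function(counts) {
  logx <- log(as.matrix(counts))
  # Subtracting the row mean of the logs equals dividing by the geometric mean.
  sweep(logx, 1, rowMeans(logx), "-")
}

alr_sketch <- function(counts, ref) {
  logx <- log(as.matrix(counts))
  # Subtract the log of the reference feature, then drop that column.
  sweep(logx, 1, logx[, ref], "-")[, -ref, drop = FALSE]
}

# Two subjects with the same composition but a tenfold difference in scale.
x <- matrix(c(1, 2, 4,
              10, 20, 40), nrow = 2, byrow = TRUE)
clr_x <- clr_sketch(x)
alr_x <- alr_sketch(x, ref = 3)
```

Because both rows of x carry the same ratios, both transformations return identical rows for the two subjects, which is the sense in which a log-ratio transformation ignores the overall scale of each sample.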
We provide two principal functions for calculating proportionality. The first function, phit, calculates \(\phi\) as described in Lovell et al. (2015). This function makes use of the clr-transformation exclusively. The second function, perb, calculates \(\rho\) as introduced in Lovell et al. (2015) and expounded in Erb and Notredame (2016). This function makes use of either clr- or alr-transformation.
The first difference between \(\phi\) and \(\rho\) is scale. The values of \(\phi\) range over \([0, \infty)\), with lower \(\phi\) values indicating more proportionality. The values of \(\rho\) range over \([-1, 1]\), with greater \(|\rho|\) values indicating more proportionality and negative \(\rho\) values indicating inverse proportionality. A second difference is that \(\phi\) lacks symmetry. However, one can force symmetry by reflecting the lower left triangle of the matrix across the diagonal (toggled by the argument symmetrize = TRUE). A third difference is that \(\rho\) corrects for the individual variance of each feature in the pair, rather than for just one of the features.
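The differences in scale and symmetry can be checked directly from the two formulas. The base R sketch below is only an illustration: phi_sketch, rho_sketch, and pairwise_metric are hypothetical helpers written from the definitions of \(\phi\) and \(\rho\), not the package's implementation, and the toy data are made up so that feature "b" is exactly twice feature "a".

```r
# Hypothetical helpers written from the formulas for phi and rho.
pairwise_metric <- function(A, f) {
  D <- ncol(A)
  out <- matrix(0, D, D, dimnames = list(colnames(A), colnames(A)))
  for (i in seq_len(D)) for (j in seq_len(D)) out[i, j] <- f(A[, i], A[, j])
  out
}
phi_sketch <- function(A) {
  pairwise_metric(A, function(ai, aj) var(ai - aj) / var(ai))
}
rho_sketch <- function(A) {
  pairwise_metric(A, function(ai, aj) 1 - var(ai - aj) / (var(ai) + var(aj)))
}

# Toy data: "b" is exactly proportional to "a"; "c" and "d" are noise.
set.seed(1)
N <- 50
a <- seq(5, 15, length.out = N)
counts <- cbind(a = a, b = 2 * a, c = rnorm(N, 10), d = rnorm(N, 10))

# clr-transform each row (subject) before computing the metrics.
logx <- log(counts)
A <- logx - rowMeans(logx)

phi <- phi_sketch(A)
rho <- rho_sketch(A)
```

For the exactly proportional pair, phi["a", "b"] is zero and rho["a", "b"] is one, up to floating point error. The rho matrix is symmetric, whereas phi["a", "c"] and phi["c", "a"] differ because each is scaled by the variance of a different feature.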
For now, we will focus on the implementations that use clr-transformation, saving a discussion of alr-transformation for later. Let us begin by building an arbitrary dataset of 4 features (e.g., genes) measured across 100 subjects. In this example, the feature pair “a” and “b” will show proportional change, as will the feature pair “c” and “d”.
set.seed(12345)
N <- 100
X <- data.frame(a=(1:N), b=(1:N) * rnorm(N, 10, 0.1),
c=(N:1), d=(N:1) * rnorm(N, 10, 1.0))
Let \(D\) represent any number of features measured across \(N\) observations exposed to a binary or continuous event \(E\). For example, \(E\) could represent differences in case-control status, treatment status, treatment dose, or time. The phit and perb functions ultimately convert a “count matrix” with \(N\) rows and \(D\) columns into a proportionality matrix of \(D\) rows and \(D\) columns containing a \(\phi\) or \(\rho\) value for each feature pair. One can think of this matrix as analogous to a dissimilarity matrix (in the case of \(\phi\)) or a correlation matrix (in the case of \(\rho\)). Both functions return the proportionality matrix bundled within an object of the class propr. This object contains four slots:
@counts: A matrix. Stores the original “count matrix” input.
@logratio: A matrix. Stores the log-ratio transformed “count matrix”.
@matrix: A matrix. Stores the proportionality metrics, \(\phi\) or \(\rho\).
@pairs: A vector. Indexes the proportionality metrics of interest.
library(propr)
phi <- phit(X, symmetrize = TRUE)
rho <- perb(X, ivar = 0)
Note that log-ratio transformation, by its nature, fails if the input data contain any zero values. To avoid an error in this case, these functions automatically replace all zero values with 1. However, the topic of zero replacement is controversial. Proceed carefully when analyzing data that contain zero values.
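The effect of a zero on a log-ratio, and the replacement rule described above, can be seen with a few lines of base R. This is only a sketch of the stated behaviour (zeros replaced with 1) on made-up values; it does not call the package and does not claim to reproduce its internal code.

```r
# Made-up counts containing two zeros.
counts <- data.frame(a = c(0, 3, 5), b = c(2, 0, 7))

# log(0) is -Inf, which would propagate through any log-ratio.
has_zero <- any(counts == 0)
log_raw_finite <- all(is.finite(log(as.matrix(counts))))

# The replacement described in the text: every zero becomes 1.
replaced <- counts
replaced[replaced == 0] <- 1
log_ok <- all(is.finite(log(as.matrix(replaced))))
```

Replacing zeros before calling the package, with whatever strategy suits the data, keeps that choice explicit instead of relying on the default.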
We have provided methods for indexing and subsetting objects belonging to the propr class. Using the familiar [ method, we can efficiently index the proportionality matrix (@matrix) based on an inequality operator and a reference value.
In this first example, we use [ to index the matrix by \(\rho > .99\). This indexes the location of all values (i.e., in the lower left triangle of the matrix) satisfying that inequality, and saves those indices to the @pairs slot. Indexing helps guide some of the bundled visualization methods in lieu of copy-on-modify subsetting. Note that indexing an already indexed object appends the new index to the previous index.
rho99 <- rho[">", .99]
rho99@pairs
## [1] 2 12
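The values 2 and 12 are consistent with column-major positions in the lower left triangle of a 4 by 4 matrix. The base R sketch below reproduces that numbering on a made-up matrix; treating the @pairs index as column-major positions is an assumption inferred from the output above, and the matrix values are illustrative, not the real \(\rho\) values.

```r
# Made-up 4 x 4 symmetric matrix standing in for a rho matrix.
m <- diag(4)
m[2, 1] <- m[1, 2] <- 0.9999   # pair of features 1 and 2 ("a", "b")
m[4, 3] <- m[3, 4] <- 0.9990   # pair of features 3 and 4 ("c", "d")

# Positions in the lower left triangle that satisfy the inequality.
idx <- which(lower.tri(m) & m > 0.99)

# Convert the linear positions back to (row, column) coordinates.
coords <- arrayInd(idx, dim(m))
```

Here idx holds the positions 2 and 12, and coords maps them back to the feature pairs (2, 1) and (4, 3).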
Alternatively, using the subset method, we can subset an entire propr object by a vector of feature indices or names. However, this method does copy-on-modify the proportionality matrix, making it potentially unsuitable for large datasets.
In this second example, we subset by the feature names “a” and “b”.
rhoab <- subset(rho, select = c("a", "b"))
rhoab@matrix
## [,1] [,2]
## [1,] 1.0000000 0.9999151
## [2,] 0.9999151 1.0000000
The convenience function, simplify, can subset an entire propr object based on the index saved in its @pairs slot. This function converts the saved index into a paired list of coordinates and passes them along to the subset method. As such, this method does copy-on-modify the proportionality matrix, making it potentially unsuitable for large datasets. Unlike subset, simplify returns an object with the @pairs slot updated.
simplify(rho99)
## @counts summary: 100 subjects by 4 features
## @logratio summary: 100 subjects by 4 features
## @matrix summary: 4 features by 4 features
## @pairs summary: 2 feature pairs
The two features in a highly proportional pair should show *lr-transformed expression that stays approximately fixed relative to one another across all subjects. The method plot (or, equivalently, smear) provides a means to inspect visually whether this holds true. Since this function will plot all pairs unless indexed with the [ method, we recommend the user first index or subset the propr object before plotting. A “noisy” relationship between some feature pairs could suggest that the proportionality cutoff is too lenient. We include this plot as a handy “sanity check” when working with high-dimensional datasets.
plot(rho99)
## Alert: Generating plot using indexed feature pairs.
High-throughput genomic sequencing has the ability to measure tens of thousands of features for each subject. Since calculating proportionality generates a matrix with \(D^2\) entries, this method uses a lot of RAM when applied to real biological datasets. To address this issue, the newest version of propr harnesses the power of C++ (via the Rcpp package) to achieve a near 100-fold increase in computational speed and an 80% reduction in RAM overhead. Below, we provide a small table that estimates the approximate amount of RAM needed to render a proportionality matrix based on the number of features studied. The user should budget up to 25% additional RAM for subsequent [ indexing and visualization.
Features | Peak RAM (MiB) |
---|---|
1000 | 8 |
2000 | 31 |
4000 | 123 |
8000 | 491 |
16000 | 1959 |
24000 | 4405 |
32000 | 7829 |
64000 | 31301 |
100000 | 76406 |
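For feature counts not listed in the table, a rough lower bound can be computed by hand. The sketch below assumes a single dense \(D \times D\) matrix of 8-byte doubles and nothing else; this is an assumption about storage, not a statement of how the package allocates memory, and the measured peaks in the table sit slightly above it.

```r
# Feature counts from the table above.
features <- c(1000, 2000, 4000, 8000, 16000, 24000, 32000, 64000, 100000)

# Assumption: one dense D x D matrix of 8-byte doubles, converted to MiB.
lower_bound_mib <- features^2 * 8 / 2^20

estimate <- data.frame(features, lower_bound_mib = round(lower_bound_mib, 1))
```

Under this assumption 1000 features need about 7.6 MiB, and every doubling of the feature count multiplies the requirement by four.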
We recognize that this package uses concepts that are not necessarily intuitive. Since the log-ratio transformation of relative data comprises a major portion of proportionality analysis, we decided to dedicate some extra space to this topic specifically. In this section, we discuss the centered log-ratio (clr) and its limitations in the context of proportionality analysis. To this end, we begin by simulating count data for 5 features (e.g., genes) labeled “a”, “b”, “c”, “d”, and “e”, as measured across 100 subjects.
N <- 100
a <- seq(from = 5, to = 15, length.out = N)
b <- a * rnorm(N, mean = 1, sd = 0.1)
c <- rnorm(N, mean = 10)
d <- rnorm(N, mean = 10)
e <- rep(10, N)
X <- data.frame(a, b, c, d, e)
Let us assume that these data \(X\) represent absolute abundance counts (i.e., not relative data). We can build a relative dataset, \(Y\), by distorting \(X\) accordingly:
Y <- X / rowSums(X) * abs(rnorm(N))
As a “sanity check”, we will confirm that these new feature vectors do in fact contain relative quantities. We do this by calculating the ratio of the second feature vector to the first for both the absolute and relative datasets.
all(round(X[, 2] / X[, 1] - Y[, 2] / Y[, 1], 5) == 0)
## [1] TRUE
The following figures compare pairwise scatterplots for the absolute count data and the corresponding relative count data. We see quickly how these relative data suggest a spurious correlation: although genes “c” and “d” do not correlate with one another absolutely, their relative quantities do.
pairs(X)
pairs(Y)
Next, we will see that when we do calculate correlation, the coefficients differ for the absolute and relative datasets. This further demonstrates the spurious correlation.
suppressWarnings(cor(X))
## a b c d e
## a 1.00000000 0.9495487 -0.08429201 -0.1284406 NA
## b 0.94954870 1.0000000 -0.17278967 -0.1183455 NA
## c -0.08429201 -0.1727897 1.00000000 -0.1271698 NA
## d -0.12844062 -0.1183455 -0.12716985 1.0000000 NA
## e NA NA NA NA 1
cor(Y)
## a b c d e
## a 1.0000000 0.9918545 0.8606885 0.8700002 0.8630598
## b 0.9918545 1.0000000 0.8553602 0.8677473 0.8622694
## c 0.8606885 0.8553602 1.0000000 0.9857120 0.9923988
## d 0.8700002 0.8677473 0.9857120 1.0000000 0.9909547
## e 0.8630598 0.8622694 0.9923988 0.9909547 1.0000000
However, by calculating the variance of the log-ratios (VLR), defined as the variance of the logarithm of the ratio of two feature vectors, we can arrive at a single measure of dependence that (a) does not change with respect to the nature of the data (i.e., absolute or relative), and (b) does not change with respect to the number of features included in the computation. As such, the VLR, constituting the numerator portion of the \(\phi\) metric and a portion of the \(\rho\) metric as well, is considered sub-compositionally coherent. Yet, while VLR yields valid results for compositional data, it lacks a meaningful scale.
propr:::proprVLR(Y[, 1:4])
## a b c d
## a 0.000000000 0.009007394 0.11273963 0.11192702
## b 0.009007394 0.000000000 0.12431341 0.11769259
## c 0.112739635 0.124313413 0.00000000 0.01986009
## d 0.111927021 0.117692593 0.01986009 0.00000000
propr:::proprVLR(X)
## a b c d e
## a 0.000000000 0.009007394 0.112739635 0.111927021 0.097960496
## b 0.009007394 0.000000000 0.124313413 0.117692593 0.104219359
## c 0.112739635 0.124313413 0.000000000 0.019860086 0.009516737
## d 0.111927021 0.117692593 0.019860086 0.000000000 0.008167461
## e 0.097960496 0.104219359 0.009516737 0.008167461 0.000000000
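The two properties claimed for the VLR can also be verified without the package. In the base R sketch below, vlr_sketch is a hypothetical helper that follows the verbal definition (the variance of the logarithm of the ratio of two feature vectors); the simulated data mimic the construction of \(X\) and \(Y\) above but use their own seed, so the numbers will not match the printed matrices.

```r
# Hypothetical helper: variance of the log-ratio for every feature pair.
vlr_sketch <- function(counts) {
  logx <- log(as.matrix(counts))
  D <- ncol(logx)
  out <- matrix(0, D, D, dimnames = list(colnames(logx), colnames(logx)))
  for (i in seq_len(D)) for (j in seq_len(D)) {
    out[i, j] <- var(logx[, i] - logx[, j])
  }
  out
}

# Simulated absolute data and a relative version with arbitrary row totals.
set.seed(1)
N <- 100
a <- seq(from = 5, to = 15, length.out = N)
abs_counts <- data.frame(a = a,
                         b = a * rnorm(N, mean = 1, sd = 0.1),
                         c = rnorm(N, mean = 10),
                         d = rnorm(N, mean = 10))
rel_counts <- abs_counts / rowSums(abs_counts) * abs(rnorm(N))

vlr_abs <- vlr_sketch(abs_counts)
vlr_rel <- vlr_sketch(rel_counts)
vlr_sub <- vlr_sketch(rel_counts[, 1:2])
```

vlr_abs and vlr_rel agree, and vlr_sub equals the corresponding 2 by 2 corner of vlr_rel, because any per-subject scaling factor cancels inside each ratio.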
In the calculation of proportionality, we use the variance about the clr-transformed data to adjust the variance of the log-ratios (VLR). In other words, we divide the arbitrary VLR by the variance of its individual constituents. In this way, the use of clr-transformed data shifts the VLR-matrix onto a “standardized” scale that compares across pairs.
In the next figures, we compare pairwise scatterplots for the clr-transformed absolute count data and the corresponding clr-transformed relative count data. Although the transformed data are equivalent, we see a relationship between “c” and “d” that should not exist based on what we know from the non-transformed absolute count data. This relationship is ultimately reflected (at least partially) in the results of phit and perb alike.
pairs(propr:::proprCLR(Y[, 1:4]))
pairs(propr:::proprCLR(X))
However, division of the VLR by the variance of the clr lacks sub-compositional coherence. As such, neither \(\phi\) nor \(\rho\), at least when calculated via clr, yields the same result when features are missing from the data. This may explain why these methods do not, per se, prevent the possible discovery of spurious proportionality.
phit(Y[, 1:4])@matrix
## [,1] [,2] [,3] [,4]
## [1,] 0.000000 0.328171 4.1075015 4.0778951
## [2,] 0.328171 0.000000 3.9114296 3.7031104
## [3,] 4.107501 3.911430 0.0000000 0.5971697
## [4,] 4.077895 3.703110 0.5971697 0.0000000
phit(X)@matrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.0000000 0.2388549 2.9895895 2.9680409 2.5976815
## [2,] 0.2388549 0.0000000 2.9298206 2.7737810 2.4562436
## [3,] 2.9895895 2.9298206 0.0000000 0.8050362 0.3857646
## [4,] 2.9680409 2.7737810 0.8050362 0.0000000 0.3564512
## [5,] 2.5976815 2.4562436 0.3857646 0.3564512 0.0000000
perb(Y[, 1:4])@matrix
## [,1] [,2] [,3] [,4]
## [1,] 1.0000000 0.8479235 -0.8571942 -0.9020354
## [2,] 0.8479235 1.0000000 -0.9113638 -0.8627917
## [3,] -0.8571942 -0.9113638 1.0000000 0.6928331
## [4,] -0.9020354 -0.8627917 0.6928331 1.0000000
perb(X)@matrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.0000000 0.8876058 -0.8072883 -0.8462492 -0.8459643
## [2,] 0.8876058 1.0000000 -0.8526537 -0.8011329 -0.8035079
## [3,] -0.8072883 -0.8526537 1.0000000 0.5826229 0.7622388
## [4,] -0.8462492 -0.8011329 0.5826229 1.0000000 0.7865827
## [5,] -0.8459643 -0.8035079 0.7622388 0.7865827 1.0000000
Still, in comparing the dependence between “c” and “d” as calculated by \(cor(Y)\) with that of \(\rho(Y)\), it appears that proportionality analysis offers at least partial protection against spurious results.
cor(Y)
## a b c d e
## a 1.0000000 0.9918545 0.8606885 0.8700002 0.8630598
## b 0.9918545 1.0000000 0.8553602 0.8677473 0.8622694
## c 0.8606885 0.8553602 1.0000000 0.9857120 0.9923988
## d 0.8700002 0.8677473 0.9857120 1.0000000 0.9909547
## e 0.8630598 0.8622694 0.9923988 0.9909547 1.0000000
perb(Y)@matrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.0000000 0.8876058 -0.8072883 -0.8462492 -0.8459643
## [2,] 0.8876058 1.0000000 -0.8526537 -0.8011329 -0.8035079
## [3,] -0.8072883 -0.8526537 1.0000000 0.5826229 0.7622388
## [4,] -0.8462492 -0.8011329 0.5826229 1.0000000 0.7865827
## [5,] -0.8459643 -0.8035079 0.7622388 0.7865827 1.0000000
Unlike the centered log-ratio (clr), which adjusts each subject vector by the geometric mean of that vector, the additive log-ratio (alr) adjusts each subject vector by the value of one of its own components, chosen as a reference. If we select as a reference some feature \(D\) with an a priori known fixed absolute count across all subjects, we can effectively “back-calculate” absolute data from relative data. When initially crafting the data \(X\), we included “e” as this fixed value.
The following figures compare pairwise scatterplots for alr-transformed relative count data (i.e., \(alr(Y)\) with “e” as the reference) and the corresponding absolute count data. We see here how alr-transformation eliminates the spurious correlation between “c” and “d”.
pairs(propr:::proprALR(Y, ivar = 5))
pairs(X[, 1:4])
Again, this gets reflected in the results of perb when we select “e” as the reference.
perb(Y, ivar = 5)@matrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.00000000 0.95544861 -0.04896295 -0.05464219 0
## [2,] 0.95544861 1.00000000 -0.09299877 -0.04720992 0
## [3,] -0.04896295 -0.09299877 1.00000000 -0.12304138 0
## [4,] -0.05464219 -0.04720992 -0.12304138 1.00000000 0
## [5,] 0.00000000 0.00000000 0.00000000 0.00000000 1
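The “back-calculation” idea can be made explicit in base R. The sketch below uses its own simulated data (not the \(X\) and \(Y\) above) with a feature “e” fixed at 10 in every subject, computes the alr by hand with “e” as the reference, and then rescales by the known value of “e”. The variable names are illustrative only.

```r
# Simulated absolute data with a fixed feature "e", plus a relative version.
set.seed(1)
N <- 100
abs_counts <- cbind(a = seq(5, 15, length.out = N),
                    c = rnorm(N, mean = 10),
                    e = rep(10, N))
rel_counts <- abs_counts / rowSums(abs_counts) * abs(rnorm(N))

# alr with "e" as the reference: log of each feature over the reference.
alr_rel <- log(rel_counts[, c("a", "c")] / rel_counts[, "e"])

# Because "e" is known to equal 10 everywhere, undo the log and rescale.
recovered <- exp(alr_rel) * 10
```

recovered matches the absolute values of “a” and “c”, which is why a trustworthy fixed reference removes the spurious association introduced by the closure.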
Now, let us assume these same data, \(X\), actually measure relative counts. In other words, \(X\) is already relative and we do not know the real quantities which correspond to \(X\) absolutely. If we knew that “a” represented a known fixed quantity, we could use alr-transformation again to “back-calculate” the absolute abundances. In this case, we will see that “c”, “d”, and “e” actually do have proportional expression under these conditions. Although the measured quantities of “c”, “d”, and “e” do not change considerably across subjects, the measured quantity of the known fixed feature does change. As such, whenever “a” increases while “c”, “d”, and “e” remain the same, the latter three features have actually decreased. Since they all decreased together, they act as a highly proportional module.
pairs(propr:::proprALR(X, ivar = 1))
Again, this gets reflected in the results of perb when we select “a” as the reference.
perb(X, ivar = 1)@matrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 0.00000000 0.00000000 0.00000000 0.00000000
## [2,] 0 1.00000000 -0.02107964 0.02680645 0.02569491
## [3,] 0 -0.02107964 1.00000000 0.91160199 0.95483279
## [4,] 0 0.02680645 0.91160199 1.00000000 0.96108648
## [5,] 0 0.02569491 0.95483279 0.96108648 1.00000000
We can visualize this module using the bundled visualization method dendrogram.
dendrogram(perb(X, ivar = 1))
## Alert: Generating plot using all feature pairs.
## 'dendrogram' with 2 branches and 5 members total, at height 1
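A similar picture can be obtained with base R alone by treating \(1 - \rho\) as a dissimilarity and clustering hierarchically. This is one possible approach and is not necessarily what the bundled method does; the matrix below holds rounded, illustrative values patterned on the output of perb(X, ivar = 1) above.

```r
# Rounded, illustrative rho values for features "a" through "e".
feat <- c("a", "b", "c", "d", "e")
rho_toy <- matrix(c(1,  0.00,  0.00, 0.00, 0.00,
                    0,  1.00, -0.02, 0.03, 0.03,
                    0, -0.02,  1.00, 0.91, 0.95,
                    0,  0.03,  0.91, 1.00, 0.96,
                    0,  0.03,  0.95, 0.96, 1.00),
                  nrow = 5, byrow = TRUE, dimnames = list(feat, feat))

# Assumption: use 1 - rho as a distance, so proportional pairs sit close.
tree <- hclust(as.dist(1 - rho_toy))
dend <- as.dendrogram(tree)

# Cut the tree well below 1 to pull out tightly proportional modules.
groups <- cutree(tree, h = 0.5)
```

Calling plot(dend) draws the tree, and groups places “c”, “d”, and “e” in one module while “a” and “b” remain on their own.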
Resuming our initial claim that the matrix \(X\) contains absolute count data while the matrix \(Y\) contains relative count data, we can show that alr-transformation not only corrects for spurious proportionality, but also yields a sub-compositionally coherent measure of dependence. However, unlike the aforementioned VLR, \(\rho\) has a meaningful scale. In the example below, we calculate \(\rho\) using the alr-transformation with the reference “e” for a four-feature sub-composition of the relative count matrix, \(Y\), as well as for the full absolute count matrix, \(X\). We see here that, unlike clr-transformed proportionality metrics, the alr-transformed metric \(\rho\) yields identical results regardless of the nature of the data explored. Of course, this assumes that one knows the identity of a feature fixed across all subjects. Still, at this point, one might also consider “back-calculating” the absolute abundances and measuring dependence through more conventional means.
perb(Y[, 2:5], ivar = 4)@matrix
## [,1] [,2] [,3] [,4]
## [1,] 1.00000000 -0.09299877 -0.04720992 0
## [2,] -0.09299877 1.00000000 -0.12304138 0
## [3,] -0.04720992 -0.12304138 1.00000000 0
## [4,] 0.00000000 0.00000000 0.00000000 1
perb(X, ivar = 5)@matrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.00000000 0.95544861 -0.04896295 -0.05464219 0
## [2,] 0.95544861 1.00000000 -0.09299877 -0.04720992 0
## [3,] -0.04896295 -0.09299877 1.00000000 -0.12304138 0
## [4,] -0.05464219 -0.04720992 -0.12304138 1.00000000 0
## [5,] 0.00000000 0.00000000 0.00000000 0.00000000 1
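The same coherence can be checked from first principles. In the base R sketch below, rho_alr_sketch is a hypothetical helper that applies the \(\rho\) formula to alr-transformed data and drops the reference column (whereas the package output above keeps it as a row and column of zeros); the data are simulated with their own seed and are not the \(X\) and \(Y\) used above.

```r
# Hypothetical helper: rho computed on alr-transformed data.
rho_alr_sketch <- function(counts, ref) {
  logx <- log(as.matrix(counts))
  # alr: subtract the log of the reference, then drop the reference column.
  A <- (logx - logx[, ref])[, -ref, drop = FALSE]
  D <- ncol(A)
  out <- matrix(0, D, D, dimnames = list(colnames(A), colnames(A)))
  for (i in seq_len(D)) for (j in seq_len(D)) {
    out[i, j] <- 1 - var(A[, i] - A[, j]) / (var(A[, i]) + var(A[, j]))
  }
  out
}

# Simulated absolute data with fixed feature "e", and a relative version.
set.seed(1)
N <- 100
a <- seq(from = 5, to = 15, length.out = N)
abs_counts <- data.frame(a = a,
                         b = a * rnorm(N, mean = 1, sd = 0.1),
                         c = rnorm(N, mean = 10),
                         d = rnorm(N, mean = 10),
                         e = rep(10, N))
rel_counts <- abs_counts / rowSums(abs_counts) * abs(rnorm(N))

rho_abs <- rho_alr_sketch(abs_counts, ref = 5)
rho_rel <- rho_alr_sketch(rel_counts, ref = 5)
rho_sub <- rho_alr_sketch(rel_counts[, 2:5], ref = 4)
```

rho_abs and rho_rel agree, and rho_sub equals the block of rho_rel for features “b”, “c”, and “d”, since each alr-transformed feature depends only on itself and the reference.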
Although we developed this package with biological count data in mind, many of the ostensibly compositional biological datasets do not behave in a truly compositional manner. For example, in the setting of gene expression data, measuring the expression of “Gene A” as 1 in one subject and the expression of “Gene B” as 2 in another subject (i.e., the feature vector \([1, 2]\)), does not carry the same information as measuring the expression of “Gene A” as 1000 in one subject and the expression of “Gene B” as 2000 in another subject (i.e., the feature vector \([1000, 2000]\)). As such, these data do not strictly meet the criteria for compositional data. Unfortunately, we do not yet have a model to adequately address this drawback. Therefore, we advise the investigator to proceed with caution when working with such “count compositional” data.
Erb, Ionas, and Cedric Notredame. “How Should We Measure Proportionality on Relative Gene Expression Data?” Theory in Biosciences = Theorie in Den Biowissenschaften 135, no. 1-2 (June 2016): 21-36. http://dx.doi.org/10.1007/s12064-015-0220-8.
Lovell, David, Vera Pawlowsky-Glahn, Juan José Egozcue, Samuel Marguerat, and Jürg Bähler. “Proportionality: A Valid Alternative to Correlation for Relative Data.” PLoS Computational Biology 11, no. 3 (March 2015): e1004075. http://dx.doi.org/10.1371/journal.pcbi.1004075.