The bioinformatic evaluation of gene co-expression often begins with correlation-based analyses. However, as demonstrated thoroughly in a recent publication, this approach lacks statistical validity when applied to relative data (Lovell 2015). This includes, for example, some of the most frequently studied biological count data, such as those produced by microarray assays or high-throughput RNA-sequencing. As an alternative to correlation, Lovell et al propose a proportionality metric, \(\phi\), as derived from compositional data (CoDa) analysis. A subsequent publication expounded this work by elaborating on another proportionality metric, \(\rho\) (Erb 2016). This package introduces a programmatic framework for the calculation of feature dependence through proportionality, as discussed in the cited publications.
Let \(A_i\) and \(A_j\) each represent a log-ratio transformed feature vector (e.g., the transformed values of one of \(d\) genes, measured across \(n\) conditions). We then define the metrics \(\phi\) and \(\rho\) accordingly:
\[\phi(A_i, A_j) = \frac{var(A_i - A_j)}{var(A_i)}\]
\[\rho(A_i, A_j) = 1 - \frac{var(A_i - A_j)}{var(A_i) + var(A_j)}\]
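As a brief aside, expanding the variance of a difference shows how \(\rho\) relates to the covariance of the two transformed vectors. This uses only the standard variance identity and the symbols defined above:
\[var(A_i - A_j) = var(A_i) + var(A_j) - 2\,cov(A_i, A_j)\]
\[\rho(A_i, A_j) = \frac{2\,cov(A_i, A_j)}{var(A_i) + var(A_j)}\]
Because \(|cov(A_i, A_j)| \le \sqrt{var(A_i)\,var(A_j)} \le \frac{var(A_i) + var(A_j)}{2}\), it follows that \(\rho\) lies within \([-1, 1]\). Likewise, \(\phi \ge 0\), and both \(\phi = 0\) and \(\rho = 1\) occur exactly when \(var(A_i - A_j) = 0\).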
Above, we use the log-ratio transformation in order to normalize the data in a manner that respects the nature of relative data. In other words, log-ratio transformation yields the same result whether applied to absolute or relative data. In this package, we consider two log-ratio transformations of the subject vector \(x\), the centered log-ratio transformation (clr) and the additive log-ratio transformation (alr). We define the transformations \(clr(x)\) and \(alr(x)\) accordingly:
\[\textrm{clr(x)} = \left[\ln\frac{x_1}{g(\textrm{x})};...;\ln\frac{x_D}{g(\textrm{x})}\right]\]
\[\textrm{alr(x)} = \left[\ln\frac{x_1}{x_D};...;\ln\frac{x_{D-1}}{x_D}\right]\]
In clr-transformation, sample vectors undergo normalization based on the logarithm of the ratio between the individual elements and the geometric mean of the vector, \(g(\textrm{x}) = \sqrt[D]{x_1 \cdots x_D}\). In alr-transformation, sample vectors undergo normalization based on the logarithm of the ratio between the individual elements and a chosen reference feature (written above as \(x_D\)). Although these transformations differ in definition, we will sometimes refer to them jointly by the acronym *lr.
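The two transformations can be sketched in a few lines of base R. This is a minimal illustration, not the package's own implementation: the function names clr_sketch and alr_sketch and the toy matrix x are hypothetical, and the sketch assumes strictly positive values with subjects stored as rows.

```r
# Centered log-ratio: log of each element over the geometric mean of its row.
clr_sketch <- function(counts) {
  logs <- log(as.matrix(counts))
  logs - rowMeans(logs)  # log(x / g(x)) = log(x) - mean(log(x))
}

# Additive log-ratio: log of each element over a chosen reference column.
alr_sketch <- function(counts, ref) {
  logs <- log(as.matrix(counts))
  (logs - logs[, ref])[, -ref, drop = FALSE]  # the reference column is dropped
}

# Hypothetical toy data: 2 subjects (rows) by 4 features (columns).
x <- matrix(c(1, 2, 4, 8,
              3, 3, 3, 3), nrow = 2, byrow = TRUE)
clr_x <- clr_sketch(x)
alr_x <- alr_sketch(x, ref = 4)
rowSums(clr_x)  # each clr-transformed row sums to zero (up to rounding)
```

Working on the log scale avoids computing the geometric mean directly, since the logarithm of the geometric mean is the arithmetic mean of the logarithms.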
We provide two principal functions for calculating proportionality. The first function, phit, implements the calculation of \(\phi\) described in Lovell et al (2015). This function makes use of clr-transformation exclusively. The second function, perb, implements the calculation of \(\rho\) described initially in Lovell et al (2015) and expounded by Erb and Notredame (2016). This function makes use of either clr- or alr-transformation.
The first difference between \(\phi\) and \(\rho\) is scale. The values of \(\phi\) range from \([0, \infty)\), with lower \(\phi\) values indicating more proportionality. The values of \(\rho\) range from \([-1, 1]\), with greater \(|\rho|\) values indicating more proportionality and negative \(\rho\) values indicating inverse proportionality. A second difference is that \(\phi\) lacks symmetry. However, one can force symmetry by reflecting the lower left triangle of the matrix across the diagonal (toggled by the argument symmetrize = TRUE). A third difference is that \(\rho\) corrects for the individual variance of each feature in the pair, rather than for just one of the features as in \(\phi\).
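To make these differences concrete, the sketch below computes both metrics directly from their definitions in base R. It is an illustration under stated assumptions rather than the package's code path: the helper names phi_sketch and rho_sketch and the toy matrix counts are hypothetical.

```r
set.seed(1)
n <- 50
# Hypothetical toy counts: "b" tracks "a" closely; "c" and "d" are unrelated noise.
a <- seq(5, 15, length.out = n)
counts <- cbind(a = a,
                b = a * rnorm(n, mean = 1, sd = 0.05),
                c = rnorm(n, mean = 10),
                d = rnorm(n, mean = 10))

# clr-transform each subject (row).
logs <- log(counts)
clr <- logs - rowMeans(logs)

phi_sketch <- function(ai, aj) var(ai - aj) / var(ai)
rho_sketch <- function(ai, aj) 1 - var(ai - aj) / (var(ai) + var(aj))

phi_ab <- phi_sketch(clr[, "a"], clr[, "b"])
phi_ba <- phi_sketch(clr[, "b"], clr[, "a"])  # divides by var of "b", so it generally differs
rho_ab <- rho_sketch(clr[, "a"], clr[, "b"])
rho_ba <- rho_sketch(clr[, "b"], clr[, "a"])  # identical to rho_ab
```

Swapping the arguments changes only the denominator of \(\phi\), which is why \(\phi\) is not symmetric, whereas the denominator of \(\rho\) treats both features alike.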
For now, we will focus on the implementations that use clr-transformation, saving a discussion of alr-transformation for later. Let us begin by building an arbitrary dataset of 4 features (e.g., genes) measured across 100 subjects. In this example dataset, features “a” and “b” will show proportional change, as will features “c” and “d”.
set.seed(12345)
N <- 100
X <- data.frame(a=(1:N), b=(1:N) * rnorm(N, 10, 0.1),
                c=(N:1), d=(N:1) * rnorm(N, 10, 1.0))
Let \(d\) represent any number of features measured across \(n\) observations undergoing a binary or continuous event \(E\). For example, \(n\) could represent subjects differing in case-control status, treatment status, treatment dose, or time. The phit and perb functions ultimately convert a “count matrix” with \(n\) rows and \(d\) columns into a proportionality matrix of \(d\) rows and \(d\) columns containing a \(\phi\) or \(\rho\) measurement for each feature pair. One can think of this matrix as analogous to a dissimilarity matrix (in the case of \(\phi\)) or a correlation matrix (in the case of \(\rho\)). Both functions return the proportionality matrix bundled within an object of the class propr. This object contains four slots:
@counts A matrix. Stores the original “count matrix” input.
@logratio A matrix. Stores the log-ratio transformed “count matrix”.
@matrix A matrix. Stores the proportionality metrics, \(\phi\) or \(\rho\).
@pairs A vector. Indexes the proportionality metrics of interest.
library(propr)
phi <- phit(X, symmetrize = TRUE)
rho <- perb(X, ivar = 0)
We have provided methods for indexing and subsetting objects belonging to the propr class. Using the familiar [ method, we can efficiently index the proportionality matrix (@matrix) based on an inequality operator and a reference value. By design, this method never modifies the proportionality matrix, so it scales well with large datasets. Alternatively, using the subset method, we can subset the entire propr object by a vector of feature indices or names. The subset method also provides a convenient way to re-order feature and subject vectors for downstream visualization tools (e.g., image). However, this method does copy-on-modify the proportionality matrix, making it unsuitable for large datasets.
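The idea of indexing without copying can be pictured with plain base R. The sketch below is an assumption about the general technique, not a description of the package internals: it records the positions of matrix elements that satisfy an inequality, restricted to the lower-left triangle so that each feature pair is counted once, and then converts those positions back into row and column coordinates. The matrix M is a hypothetical stand-in for a proportionality matrix.

```r
# Hypothetical 4 x 4 symmetric matrix standing in for a proportionality matrix.
M <- matrix(0.1, nrow = 4, ncol = 4,
            dimnames = list(letters[1:4], letters[1:4]))
diag(M) <- 1
M["b", "a"] <- M["a", "b"] <- 0.9999
M["d", "c"] <- M["c", "d"] <- 0.9950

# Column-major positions in the lower-left triangle that exceed the cutoff.
idx <- which(lower.tri(M) & M > 0.99)

# Convert the positions back into (row, column) coordinates.
coords <- arrayInd(idx, .dim = dim(M))
```

Storing only a vector of positions leaves the matrix itself untouched, which is what keeps this style of indexing cheap for large matrices.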
In the first example below, we use [ to index the matrix by \(\rho > .99\). This saves to the @pairs slot the location of all values satisfying the inequality that fall within the lower left triangle of the matrix. Indexing works, in part, to guide bundled visualization methods in lieu of copy-on-modify subsetting.
rho99 <- rho[">", .99]
rho99@pairs
## [1]  2 12
In the second example, we subset by the feature names “a” and “b”.
rhoab <- subset(rho, select = c("a", "b"))
rhoab@matrix
##           [,1]      [,2]
## [1,] 1.0000000 0.9999151
## [2,] 0.9999151 1.0000000
Alternatively, we could subset by the index itself using the indexToCoord function as an intermediate. This function converts a vector of indices into a paired list of coordinates to subsequently use in the subset method. However, this function requires one additional argument, N, supplied as the size of each dimension in the proportionality matrix from which the indices derive.
coord <- propr:::indexToCoord(rho99@pairs, N = nrow(rho99@matrix))
coord.merge <- sort(union(coord[[1]], coord[[2]]))
subset(rho, select = coord.merge)
## @counts summary: 100 subjects by 4 features
## @logratio summary: 100 subjects by 4 features
## @matrix summary: 4 features by 4 features
## @pairs summary: index with `[` method
Each feature belonging to a highly proportional data pair should show approximately linearly correlated *lr-transformed expression with one another across all subjects. The method plot provides a means by which to visually inspect whether this holds true. Since this function will plot all pairs unless indexed with the [ method, we recommend the user first index or subset the propr object before plotting. “Noisy” correlation between some feature pairs could suggest that the proportionality cutoff is too lenient. We include this plot as a handy “sanity check” when working with high-dimensional datasets.
plot(rho99)
Both microarray technology and high-throughput genomic sequencing have the ability to measure tens of thousands of features for each subject. Since calculating proportionality generates a matrix of \(d^2\) elements, this method uses a lot of RAM when applied to real biological datasets. To address this issue, the newest version of propr harnesses the power of C++ (via the Rcpp package) to achieve a near 100-fold increase in computational speed and an 80% reduction in RAM overhead. Below, we provide a small table that estimates the approximate amount of RAM needed to render a proportionality matrix based on the number of features studied. The user should budget up to 25% additional RAM for subsequent [ indexing and visualization.
| Features | Peak RAM (MiB) | 
|---|---|
| 1000 | 8 | 
| 2000 | 31 | 
| 4000 | 123 | 
| 8000 | 491 | 
| 16000 | 1959 | 
| 24000 | 4405 | 
| 32000 | 7829 | 
| 64000 | 31301 | 
| 100000 | 76406 | 
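As a rough cross-check, the figures in the table sit slightly above the size of a single dense \(d \times d\) matrix of 8-byte double-precision values. The helper below is a hypothetical back-of-the-envelope estimate under that assumption only; it does not account for any other overhead incurred by the package.

```r
# Approximate size of a dense d x d matrix of 8-byte doubles, in MiB.
matrix_mib <- function(d) d^2 * 8 / 2^20

features <- c(1000, 2000, 4000, 8000, 16000)
estimate <- round(matrix_mib(features))
data.frame(features, estimate)
```

Because the requirement grows with the square of the number of features, doubling the feature count roughly quadruples the memory needed.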
We recognize that this package utilizes concepts largely unintuitive to many. Since the log-ratio transformation of relative data comprises a major portion of proportionality analysis, we decided to dedicate some extra space to this topic specifically. In this section, we discuss the centered log-ratio (clr) and its limitations in context of proportionality analysis. To this end, we begin by simulating count data for 5 features (e.g., genes) labeled “a”, “b”, “c”, “d”, and “e”, as measured across 100 subjects.
N <- 100
a <- seq(from = 5, to = 15, length.out = N)
b <- a * rnorm(N, mean = 1, sd = 0.1)
c <- rnorm(N, mean = 10)
d <- rnorm(N, mean = 10)
e <- rep(10, N)
X <- data.frame(a, b, c, d, e)
Let us assume that these data \(X\) represent absolute abundance counts (i.e., not relative data). We can build a relative dataset, \(Y\), by distorting \(X\) accordingly:
Y <- X / rowSums(X) * abs(rnorm(N))
As a “sanity check”, we will confirm that these new feature vectors do in fact contain relative quantities. We do this by calculating the ratio of the second feature vector to the first for both the absolute and relative datasets.
all(round(X[, 2] / X[, 1] - Y[, 2] / Y[, 1], 5) == 0)
## [1] TRUE
The following figures compare pairwise scatterplots for the absolute count data and the corresponding relative count data. We see quickly how these relative data suggest a spurious correlation: although genes “c” and “d” do not correlate with one another absolutely, their relative quantities do.
pairs(X)
pairs(Y)
Next, we will see that when we do calculate correlation, the coefficients differ for the absolute and relative datasets. This further demonstrates the spurious correlation.
cor(X)
##             a          b           c          d  e
## a  1.00000000  0.9495487 -0.08429201 -0.1284406 NA
## b  0.94954870  1.0000000 -0.17278967 -0.1183455 NA
## c -0.08429201 -0.1727897  1.00000000 -0.1271698 NA
## d -0.12844062 -0.1183455 -0.12716985  1.0000000 NA
## e          NA         NA          NA         NA  1
cor(Y)
##           a         b         c         d         e
## a 1.0000000 0.9918545 0.8606885 0.8700002 0.8630598
## b 0.9918545 1.0000000 0.8553602 0.8677473 0.8622694
## c 0.8606885 0.8553602 1.0000000 0.9857120 0.9923988
## d 0.8700002 0.8677473 0.9857120 1.0000000 0.9909547
## e 0.8630598 0.8622694 0.9923988 0.9909547 1.0000000
However, by calculating the variance of the log-ratios (vlr), defined as the variance of the logarithm of the ratio of two feature vectors, we can arrive at a single measure of dependence that (a) does not change with respect to the nature of the data (i.e., absolute or relative), and (b) does not change with respect to the number of features included in the computation. As such, the vlr, constituting the numerator portion of the \(\phi\) metric and a portion of the \(\rho\) metric as well, is sub-compositionally coherent. Yet, while vlr yields valid results for compositional data, it lacks a meaningful scale.
propr:::proprVLR(Y[, 1:4])
##             a           b          c          d
## a 0.000000000 0.009007394 0.11273963 0.11192702
## b 0.009007394 0.000000000 0.12431341 0.11769259
## c 0.112739635 0.124313413 0.00000000 0.01986009
## d 0.111927021 0.117692593 0.01986009 0.00000000
propr:::proprVLR(X)
##             a           b           c           d           e
## a 0.000000000 0.009007394 0.112739635 0.111927021 0.097960496
## b 0.009007394 0.000000000 0.124313413 0.117692593 0.104219359
## c 0.112739635 0.124313413 0.000000000 0.019860086 0.009516737
## d 0.111927021 0.117692593 0.019860086 0.000000000 0.008167461
## e 0.097960496 0.104219359 0.009516737 0.008167461 0.000000000
Similarly, transformation of a counts matrix by clr also makes the data sub-compositionally coherent. In the calculation of proportionality coefficients, we use the variance about the clr-transformed data to normalize the variance of the log-ratios (vlr). In other words, we adjust the arbitrarily defined vlr by the variance of its individual constituents. In this way, the use of clr-transformed data shifts the vlr-matrix onto a “standardized” scale that compares across all feature pairs.
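The two invariance properties of the vlr can be checked numerically in base R. The sketch below is illustrative only: vlr_sketch and clr_sketch are hypothetical helpers, and the data are simulated here rather than taken from the example above. It also checks that the vlr equals the variance of the difference between two clr-transformed features, since the geometric mean cancels in that difference.

```r
set.seed(1)
n <- 100
# Hypothetical absolute counts (rows are subjects, columns are features).
abs_counts <- cbind(a = runif(n, 5, 15), b = runif(n, 5, 15),
                    c = runif(n, 5, 15), d = runif(n, 5, 15))
# Relative version: every row rescaled by its own arbitrary positive factor.
rel_counts <- abs_counts / rowSums(abs_counts) * runif(n, 1, 100)

# Variance of the log-ratio for every feature pair.
vlr_sketch <- function(counts) {
  logs <- log(counts)
  d <- ncol(logs)
  out <- matrix(0, d, d, dimnames = list(colnames(logs), colnames(logs)))
  for (i in seq_len(d)) for (j in seq_len(d)) {
    out[i, j] <- var(logs[, i] - logs[, j])
  }
  out
}

clr_sketch <- function(counts) {
  logs <- log(counts)
  logs - rowMeans(logs)
}

vlr_abs <- vlr_sketch(abs_counts)
vlr_rel <- vlr_sketch(rel_counts)         # same as vlr_abs
vlr_sub <- vlr_sketch(rel_counts[, 1:3])  # same as the matching block of vlr_abs

# The geometric mean cancels, so the vlr equals the variance of a clr difference.
clr_rel <- clr_sketch(rel_counts)
vlr_from_clr <- var(clr_rel[, "a"] - clr_rel[, "b"])
```

The per-row scaling factor appears in both the numerator and the denominator of each ratio, which is why it drops out of every entry.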
In the next figures, we compare pairwise scatterplots for the clr-transformed absolute count data and the corresponding clr-transformed relative count data. While equivalent, we see a relationship between “c” and “d” that should not exist based on what we know from the non-transformed absolute count data. This relationship is ultimately reflected (at least partially) in the results of phit and perb alike.
pairs(propr:::proprCLR(Y[, 1:4]))
pairs(propr:::proprCLR(X))
However, division of the vlr by the variance of the clr lacks sub-compositional coherence. As such, neither \(\phi\) nor \(\rho\), at least when calculated via clr, yield the same result for absolute and relative data. This may explain why these methods do not, per se, prevent the possible discovery of spurious proportionality.
phit(Y[, 1:4])@matrix
## Calculating phi from "count matrix".
##          [,1]     [,2]      [,3]      [,4]
## [1,] 0.000000 0.328171 4.1075015 4.0778951
## [2,] 0.328171 0.000000 3.9114296 3.7031104
## [3,] 4.107501 3.911430 0.0000000 0.5971697
## [4,] 4.077895 3.703110 0.5971697 0.0000000
phit(X)@matrix
## Calculating phi from "count matrix".
##           [,1]      [,2]      [,3]      [,4]      [,5]
## [1,] 0.0000000 0.2388549 2.9895895 2.9680409 2.5976815
## [2,] 0.2388549 0.0000000 2.9298206 2.7737810 2.4562436
## [3,] 2.9895895 2.9298206 0.0000000 0.8050362 0.3857646
## [4,] 2.9680409 2.7737810 0.8050362 0.0000000 0.3564512
## [5,] 2.5976815 2.4562436 0.3857646 0.3564512 0.0000000
perb(Y[, 1:4])@matrix
## Calculating rho from "count matrix".
##            [,1]       [,2]       [,3]       [,4]
## [1,]  1.0000000  0.8479235 -0.8571942 -0.9020354
## [2,]  0.8479235  1.0000000 -0.9113638 -0.8627917
## [3,] -0.8571942 -0.9113638  1.0000000  0.6928331
## [4,] -0.9020354 -0.8627917  0.6928331  1.0000000
perb(X)@matrix
## Calculating rho from "count matrix".
##            [,1]       [,2]       [,3]       [,4]       [,5]
## [1,]  1.0000000  0.8876058 -0.8072883 -0.8462492 -0.8459643
## [2,]  0.8876058  1.0000000 -0.8526537 -0.8011329 -0.8035079
## [3,] -0.8072883 -0.8526537  1.0000000  0.5826229  0.7622388
## [4,] -0.8462492 -0.8011329  0.5826229  1.0000000  0.7865827
## [5,] -0.8459643 -0.8035079  0.7622388  0.7865827  1.0000000
Still, in comparing the dependence between “c” and “d” as calculated by \(cor(Y)\) with that of \(\rho(Y)\), it appears that proportionality analysis does offer at least partial protection against spurious results.
cor(Y)
##           a         b         c         d         e
## a 1.0000000 0.9918545 0.8606885 0.8700002 0.8630598
## b 0.9918545 1.0000000 0.8553602 0.8677473 0.8622694
## c 0.8606885 0.8553602 1.0000000 0.9857120 0.9923988
## d 0.8700002 0.8677473 0.9857120 1.0000000 0.9909547
## e 0.8630598 0.8622694 0.9923988 0.9909547 1.0000000
perb(Y)@matrix
## Calculating rho from "count matrix".
##            [,1]       [,2]       [,3]       [,4]       [,5]
## [1,]  1.0000000  0.8876058 -0.8072883 -0.8462492 -0.8459643
## [2,]  0.8876058  1.0000000 -0.8526537 -0.8011329 -0.8035079
## [3,] -0.8072883 -0.8526537  1.0000000  0.5826229  0.7622388
## [4,] -0.8462492 -0.8011329  0.5826229  1.0000000  0.7865827
## [5,] -0.8459643 -0.8035079  0.7622388  0.7865827  1.0000000
Finally, the reader should note that in this contrived example, \(\phi(X) = \phi(Y)\) and \(\rho(X) = \rho(Y)\), but only because the sum of the feature parts in the relative dataset can explain the whole of the absolute dataset. In other words, this comes from the fact that in crafting the relative dataset, we used information spanning the entire absolute dataset (i.e., rowSums). This is usually not the case when studying biological count data, and on its own does not imply sub-compositional coherence.
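One way to see why the agreement depends on using every feature is to compare clr values directly. The following base R sketch uses simulated data and a hypothetical helper, clr_sketch, and rests on the assumption that the relative data are the absolute data rescaled row by row. With the full set of features the per-row factor cancels against the geometric mean; once a feature is dropped, the geometric mean changes and the clr values shift.

```r
set.seed(1)
n <- 100
# Hypothetical absolute counts with a fifth feature "e" held constant.
abs_counts <- cbind(a = runif(n, 5, 15), b = runif(n, 5, 15),
                    c = runif(n, 5, 15), d = runif(n, 5, 15),
                    e = rep(10, n))
rel_counts <- abs_counts / rowSums(abs_counts) * runif(n, 1, 100)

clr_sketch <- function(counts) {
  logs <- log(counts)
  logs - rowMeans(logs)
}

# Same features, rescaled rows: the per-row factor cancels against the geometric mean.
same_parts <- all.equal(clr_sketch(abs_counts), clr_sketch(rel_counts))

# Dropping feature "e" changes the geometric mean, so the clr values shift.
full_clr <- clr_sketch(abs_counts)[, 1:4]
sub_clr  <- clr_sketch(rel_counts[, 1:4])
max_shift <- max(abs(full_clr - sub_clr))
```

Since \(\phi\) and \(\rho\) divide by the variance of clr-transformed features, any change in the clr values carries through to the metrics.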
Unlike the centered log-ratio (clr), which adjusts each subject vector by the geometric mean of that vector, the additive log-ratio (alr) adjusts each subject vector by the value of one of its own components, chosen as a reference. If we select as a reference some feature \(D\) with an a priori known fixed absolute count across all subjects, we can effectively “back-calculate” absolute data from relative data. When initially crafting the data \(X\), we included “e” as this fixed value.
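The “back-calculation” can be sketched in base R as follows. The data here are simulated for illustration, and the sketch assumes that the reference feature is known to equal 10 in every subject, mirroring how “e” was constructed.

```r
set.seed(1)
n <- 100
abs_counts <- cbind(a = runif(n, 5, 15), b = runif(n, 5, 15),
                    e = rep(10, n))  # "e" is fixed at 10 in every subject
rel_counts <- abs_counts / rowSums(abs_counts) * runif(n, 1, 100)

# alr with "e" as the reference: log of each feature over the reference.
alr_rel <- log(rel_counts[, c("a", "b")] / rel_counts[, "e"])

# Undo the log-ratio and multiply by the known absolute value of the reference.
recovered <- exp(alr_rel) * 10
```

The recovery works only because the absolute value of the reference is known and constant; with an unknown or varying reference, the same steps would return ratios rather than absolute abundances.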
The following figures compare pairwise scatterplots for alr-transformed relative count data (i.e., \(alr(Y)\) with “e” as the reference) and the corresponding absolute count data. We see here how alr-transformation eliminates the spurious correlation between “c” and “d”.
pairs(propr:::proprALR(Y, ivar = 5))
pairs(X[, 1:4])
Again, this gets reflected in the results of perb when we select “e” as the reference.
perb(Y, ivar = 5)@matrix
## Calculating rho from "count matrix".
##             [,1]        [,2]        [,3]        [,4] [,5]
## [1,]  1.00000000  0.95544861 -0.04896295 -0.05464219    0
## [2,]  0.95544861  1.00000000 -0.09299877 -0.04720992    0
## [3,] -0.04896295 -0.09299877  1.00000000 -0.12304138    0
## [4,] -0.05464219 -0.04720992 -0.12304138  1.00000000    0
## [5,]  0.00000000  0.00000000  0.00000000  0.00000000    1
Now, let us assume these same data, \(X\), actually measure relative counts. In other words, \(X\) is already relative and we do not know the real quantities which correspond to \(X\) absolutely. Well, if we knew that “a” represented a known fixed quantity, we could use alr-transformation again to “back-calculate” the absolute abundances. In this case, we will see that “c”, “d”, and “e” actually do have proportional expression under these conditions. Although the measured quantities of “c”, “d”, and “e” do not change considerably across subjects, the measured quantity of the known fixed feature does change. As such, whenever “a” increases while “c”, “d”, and “e” remain the same, the latter three features have actually decreased. Since they all decreased together, they act as a highly proportional module.
pairs(propr:::proprALR(X, ivar = 1))
Again, this gets reflected in the results of perb when we select “a” as the reference.
perb(X, ivar = 1)@matrix
## Calculating rho from "count matrix".
##      [,1]        [,2]        [,3]       [,4]       [,5]
## [1,]    1  0.00000000  0.00000000 0.00000000 0.00000000
## [2,]    0  1.00000000 -0.02107964 0.02680645 0.02569491
## [3,]    0 -0.02107964  1.00000000 0.91160199 0.95483279
## [4,]    0  0.02680645  0.91160199 1.00000000 0.96108648
## [5,]    0  0.02569491  0.95483279 0.96108648 1.00000000
We can visualize this module using the bundled visualization method dendrogram.
dendrogram(perb(X, ivar = 1))
## Calculating rho from "count matrix".
## Alert: Generating plot using all feature pairs.
## 'dendrogram' with 2 branches and 5 members total, at height 1
Resuming our initial claim that the matrix \(X\) contains absolute count data while the matrix \(Y\) contains relative count data, we can show that alr-transformation not only corrects for spurious proportionality, but it also serves as a sub-compositionally coherent metric of dependence. However, unlike the aforementioned vlr, \(\rho\) has a meaningful scale. In the example below, we calculate \(\rho\) using the alr-transformation about the reference “e” for four compositions of the relative count matrix, \(Y\), as well as for the absolute count matrix, \(X\). We see here that, unlike clr-transformed proportionality metrics, the alr-transformed metric \(\rho\) yields identical results regardless of the nature of the data explored. Of course, this assumes that one knows the identity of a feature fixed across all subjects. Still, at this point, one might also consider “back-calculating” the absolute abundances and measuring dependence through more conventional means.
perb(Y[, 2:5], ivar = 4)@matrix
## Calculating rho from "count matrix".
##             [,1]        [,2]        [,3] [,4]
## [1,]  1.00000000 -0.09299877 -0.04720992    0
## [2,] -0.09299877  1.00000000 -0.12304138    0
## [3,] -0.04720992 -0.12304138  1.00000000    0
## [4,]  0.00000000  0.00000000  0.00000000    1
perb(X, ivar = 5)@matrix
## Calculating rho from "count matrix".
##             [,1]        [,2]        [,3]        [,4] [,5]
## [1,]  1.00000000  0.95544861 -0.04896295 -0.05464219    0
## [2,]  0.95544861  1.00000000 -0.09299877 -0.04720992    0
## [3,] -0.04896295 -0.09299877  1.00000000 -0.12304138    0
## [4,] -0.05464219 -0.04720992 -0.12304138  1.00000000    0
## [5,]  0.00000000  0.00000000  0.00000000  0.00000000    1
Although we developed this package with biological count data in mind, many of the ostensibly compositional biological datasets do not behave in a truly compositional manner. For example, in the setting of gene expression data, measuring the expression of “Gene A” as 1 in one subject and the expression of “Gene B” as 2 in another subject (i.e., the feature vector \([1, 2]\)), does not carry the same information as measuring the expression of “Gene A” as 1000 in one subject and the expression of “Gene B” as 2000 in another subject (i.e., the feature vector \([1000, 2000]\)). As such, these data do not strictly meet the criteria for compositional data.
Unfortunately, we do not yet have a model to adequately address this drawback. Therefore, we advise the investigator to proceed with caution when working with such “count compositional” data.
Erb, I. & Notredame, C. 2016. How should we measure proportionality on relative gene expression data? Theory Biosci.
Lovell, D. et al. 2015. Proportionality: A Valid Alternative to Correlation for Relative Data. PLoS Comput Biol 11.