Data whitening is a widely used preprocessing step to remove correlation structure since statistical models often assume independence (Kessy, et al. 2018). The typical procedures transforms the observed data by an inverse square root of the sample correlation matrix (Figure 1). For low dimension data (i.e. \(n > p\)), this transformation produces transformed data with an identity sample covariance matrix. This procedure assumes either that the true covariance matrix is know, or is well estimated by the sample covariance matrix. Yet the use of the sample covariance matrix for this transformation can be problematic since 1) the complexity is \(\mathcal{O}(p^3)\) and 2) it is not applicable to the high dimensional (i.e. \(n \ll p\)) case since the sample covariance matrix is no longer full rank.
Here we use a probabilistic model of the observed data to apply a whitening transformation. Our Gaussian Inverse Wishart Empirical Bayes (GIW-EB) 1) model substantially reduces computational complexity, and 2) regularizes the eigen-values of the sample covariance matrix to improve out-of-sample performance.
Figure 1: Intuition for data whitening transformation. Given the singular value decomposition of \(Y/\sqrt{n}\) after centering columns is \(UDV^T\), the steps in the transformation are: A) Original data, B) Data rotated along principal components, C) Data rotated and scaled, D) Data rotated, scaled and rotated back to original axes. Green arrows indicate principal axes and lengths indicate eigen-values.
library(decorrelate)
library(Rfast)
n <- 500 # number of samples
p <- 200 # number of features
# create correlation matrix
Sigma <- autocorr.mat(p, 0.9)
# draw data from correlation matrix Sigma
Y <- rmvnorm(n, rep(0, p), sigma = Sigma * 5.1, seed = 1)
rownames(Y) <- paste0("sample_", 1:n)
colnames(Y) <- paste0("gene_", 1:p)
# eclairs decomposition implements GIW-EB method: *E*stimate
# *c*ovariance/correlation with *l*ow *r*ank and *s*hrinkage
ecl <- eclairs(Y)
# decorrelate data using eclairs decomposition
Y_whitened <- decorrelate(Y, ecl)
# the same whitening can be performed with one command where the eigen-value
# shrinkage is performed internally
Y_whitened2 <- whiten(Y)
Here plot A) the eigen-values of the covariance matrix before and after shrinkage. B) The correlation of the observed and C) whitened data.
oldpar <- par(mfrow = c(1, 3))
# plot shrinkage of eigen-values
plot(ecl)
# correlation between variables in observed data
image(cor(Y), axes = FALSE, main = "Correlation of observed data")
# decorrelate data using eclairs decomposition
image(cor(Y_whitened), axes = FALSE, main = "Correlation of whitened data")
par(oldpar)
The decorrelate
package has advanced features to examine
details of the whitening transformation.
While eclairs()
, decorrelate()
, and
whiten()
perform the probabilistic whitening transformation
efficiently without directly computing the whitening matrix,
getWhiteningMatrix()
can directly compute the matrix.
# compute whitening matrix from eclairs decomposition
W <- getWhiteningMatrix(ecl)
# transform observed data using whitening matrix
Z <- tcrossprod(Y, W)
# evalute difference between whitened computed 2 ways
max(abs(Z - Y_whitened))
## [1] 4.374279e-14
The difference between function is due only to machine precision.
The full covariance for correlation matrix implied by the
eclairs()
decomposition can be computed explicitly. Note
that computing and storing these matries is \(O(p^2)\), it may not feasable for large
datasets.
# compute correlation matrix from eclairs
getCor(ecl)
# compute covariance matrix from eclairs
getCov(ecl)
The form of the eclairs()
decomposition can be used to
efficiently sample from a multivariage normal distribution with
specified covariance.
# draw from multivariate normal
n <- 1000
mu <- rep(0, ncol(Y))
# using eclairs decomposition
X.draw1 <- rmvnorm_eclairs(n, mu, ecl)
A low rank eclairs()
decomposition can be computed more
efficiently when \(k\) is small
relative to \(min(n,p)\). Importantly,
the emprical Bayes estimate of the shrinkage parameter \(\lambda\) can still be computed accurately
for sufficiently large \(k\). Note that
the low rank method trades computational efficientcy for accuracy in the
whitening transform.
# use low rank decomposition with 50 components
ecl <- eclairs(Y, k = 60)
# decorrelate data using eclairs decomposition
Y_whitened <- decorrelate(Y, ecl)
In this case, the low rank whitening produces transformed features that are approximately independent. The approximation improves as the rank increases.
Compute the condition number (i.e. the ratio between the largest and
smallest eigen-value) of the correlation/covariance matrix from the
eclairs()
decomposition.
kappa(ecl)
## [1] 88.64936
By default eclairs()
computes the covariance
between columns by using the default
compute = "covariance"
. Running decorrelate()
using this removes the covariance between columns. Setting
compute = "correlation"
, evaluates and then removes
correlation between columns while retaining the variance.
library(clusterGeneration)
# generate covariance matrix, where the diagonals (i.e. variances) vary
Sigma <- genPositiveDefMat(p, rangeVar = c(1, 1e+06))$Sigma
Y <- rmvnorm(n, rep(0, p), sigma = Sigma)
# examine variances of the first 5 variables
apply(Y, 2, var)[1:5]
## [1] 6.279349 6.171918 2.320736 1.306302 10.851267
# transform removes covariance between columns so variance of transformed
# features are *approximately* equal
ecl_cov <- eclairs(Y, compute = "covariance")
Z1 <- decorrelate(Y, ecl_cov)
# variance are *approximately* equal
apply(Z1, 2, var)[1:5]
## [1] 0.9757741 0.9929288 0.9928508 0.9924159 0.9902491
# transform removes **correlation** between columns but variables are not
# scaled
ecl_cor <- eclairs(Y, compute = "correlation")
Z2 <- decorrelate(Y, ecl_cor)
# variances are not standardized
apply(Z2, 2, var)[1:5]
## [1] 6.173119 6.131217 2.296492 1.286559 10.779611
## R version 4.4.2 (2024-10-31)
## Platform: aarch64-apple-darwin23.6.0
## Running under: macOS Sonoma 14.7.1
##
## Matrix products: default
## BLAS: /Users/gabrielhoffman/prog/R-4.4.2/lib/libRblas.dylib
## LAPACK: /opt/homebrew/Cellar/r/4.5.1/lib/R/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Rfast_2.1.5.1 RcppParallel_5.1.10 zigg_0.0.2
## [4] RcppZiggurat_0.1.8 Rcpp_1.0.14 clusterGeneration_1.3.8
## [7] MASS_7.3-65 latex2exp_0.9.6 colorRamps_2.3.4
## [10] cowplot_1.1.3 ggplot2_3.5.2 mvtnorm_1.3-3
## [13] decorrelate_0.1.6.3
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.10 generics_0.1.4 stringi_1.8.7 lattice_0.22-7
## [5] digest_0.6.37 magrittr_2.0.3 evaluate_1.0.4 grid_4.4.2
## [9] RColorBrewer_1.1-3 fastmap_1.2.0 jsonlite_2.0.0 Matrix_1.7-3
## [13] formatR_1.14 scales_1.4.0 codetools_0.2-20 jquerylib_0.1.4
## [17] cli_3.6.5 rlang_1.1.6 CholWishart_1.1.4 withr_3.0.2
## [21] cachem_1.1.0 yaml_2.3.10 tools_4.4.2 parallel_4.4.2
## [25] dplyr_1.1.4 corpcor_1.6.10 vctrs_0.6.5 R6_2.6.1
## [29] lifecycle_1.0.4 stringr_1.5.1 irlba_2.3.5.1 pkgconfig_2.0.3
## [33] pillar_1.10.2 bslib_0.9.0 gtable_0.3.6 glue_1.8.0
## [37] whitening_1.4.0 xfun_0.52 tibble_3.3.0 tidyselect_1.2.1
## [41] knitr_1.50 farver_2.1.2 htmltools_0.5.8.1 rmarkdown_2.29
## [45] labeling_0.4.3 compiler_4.4.2