Data whitening is a widely used preprocessing step to remove correlation structure since statistical models often assume independence (Kessy, et al. 2018). The typical procedures transforms the observed data by an inverse square root of the sample correlation matrix (Figure 1). For low dimension data (i.e. \(n > p\)), this transformation produces transformed data with an identity sample covariance matrix. This procedure assumes either that the true covariance matrix is know, or is well estimated by the sample covariance matrix. Yet the use of the sample covariance matrix for this transformation can be problematic since 1) the complexity is \(\mathcal{O}(p^3)\) and 2) it is not applicable to the high dimensional (i.e. \(n \ll p\)) case since the sample covariance matrix is no longer full rank.

Here we use a probabilistic model of the observed data to apply a whitening transformation. Our Gaussian Inverse Wishart Empirical Bayes (GIW-EB) 1) model substantially reduces computational complexity, and 2) regularizes the eigen-values of the sample covariance matrix to improve out-of-sample performance.

**Figure 1: Intuition for data whitening transformation**. Given the singular value decomposition of $Y/\sqrt{n}$ after centering columns is $UDV^T$, the steps in the transformation are: **A)** Original data, **B)** Data rotated along principal components, **C)** Data rotated and scaled, **D)** Data rotated, scaled and rotated back to original axes. Green arrows indicate principal axes and lengths indicate eigen-values.

Figure 1: Intuition for data whitening transformation. Given the singular value decomposition of \(Y/\sqrt{n}\) after centering columns is \(UDV^T\), the steps in the transformation are: A) Original data, B) Data rotated along principal components, C) Data rotated and scaled, D) Data rotated, scaled and rotated back to original axes. Green arrows indicate principal axes and lengths indicate eigen-values.

Basic usage

library(decorrelate)
library(Rfast)

n <- 500  # number of samples
p <- 200  # number of features

# create correlation matrix
Sigma <- autocorr.mat(p, 0.9)

# draw data from correlation matrix Sigma
Y <- rmvnorm(n, rep(0, p), sigma = Sigma * 5.1, seed = 1)
rownames(Y) <- paste0("sample_", 1:n)
colnames(Y) <- paste0("gene_", 1:p)

# eclairs decomposition implements GIW-EB method: *E*stimate
# *c*ovariance/correlation with *l*ow *r*ank and *s*hrinkage
ecl <- eclairs(Y)

# decorrelate data using eclairs decomposition
Y_whitened <- decorrelate(Y, ecl)

# the same whitening can be performed with one command where the eigen-value
# shrinkage is performed internally
Y_whitened2 <- whiten(Y)

Here plot A) the eigen-values of the covariance matrix before and after shrinkage. B) The correlation of the observed and C) whitened data.

oldpar <- par(mfrow = c(1, 3))

# plot shrinkage of eigen-values
plot(ecl)

# correlation between variables in observed data
image(cor(Y), axes = FALSE, main = "Correlation of observed data")

# decorrelate data using eclairs decomposition
image(cor(Y_whitened), axes = FALSE, main = "Correlation of whitened data")

par(oldpar)

Advanced usage

The decorrelate package has advanced features to examine details of the whitening transformation.

Directly compute whitening matrix

While eclairs(), decorrelate(), and whiten() perform the probabilistic whitening transformation efficiently without directly computing the whitening matrix, getWhiteningMatrix() can directly compute the matrix.

# compute whitening matrix from eclairs decomposition
W <- getWhiteningMatrix(ecl)

# transform observed data using whitening matrix
Z <- tcrossprod(Y, W)

# evalute difference between whitened computed 2 ways
max(abs(Z - Y_whitened))
## [1] 4.374279e-14

The difference between function is due only to machine precision.

Explicit covariance or correlation

The full covariance for correlation matrix implied by the eclairs() decomposition can be computed explicitly. Note that computing and storing these matries is \(O(p^2)\), it may not feasable for large datasets.

# compute correlation matrix from eclairs
getCor(ecl)

# compute covariance matrix from eclairs
getCov(ecl)

Sample from multivariate normal

The form of the eclairs() decomposition can be used to efficiently sample from a multivariage normal distribution with specified covariance.

# draw from multivariate normal
n <- 1000
mu <- rep(0, ncol(Y))

# using eclairs decomposition
X.draw1 <- rmvnorm_eclairs(n, mu, ecl)

Low-rank models

A low rank eclairs() decomposition can be computed more efficiently when \(k\) is small relative to \(min(n,p)\). Importantly, the emprical Bayes estimate of the shrinkage parameter \(\lambda\) can still be computed accurately for sufficiently large \(k\). Note that the low rank method trades computational efficientcy for accuracy in the whitening transform.

# use low rank decomposition with 50 components
ecl <- eclairs(Y, k = 60)

# decorrelate data using eclairs decomposition
Y_whitened <- decorrelate(Y, ecl)

In this case, the low rank whitening produces transformed features that are approximately independent. The approximation improves as the rank increases.

Computing condition number

Compute the condition number (i.e. the ratio between the largest and smallest eigen-value) of the correlation/covariance matrix from the eclairs() decomposition.

kappa(ecl)
## [1] 88.64936

Removing correlation vs covariance

By default eclairs() computes the covariance between columns by using the default compute = "covariance". Running decorrelate() using this removes the covariance between columns. Setting compute = "correlation", evaluates and then removes correlation between columns while retaining the variance.

library(clusterGeneration)

# generate covariance matrix, where the diagonals (i.e. variances) vary
Sigma <- genPositiveDefMat(p, rangeVar = c(1, 1e+06))$Sigma

Y <- rmvnorm(n, rep(0, p), sigma = Sigma)

# examine variances of the first 5 variables
apply(Y, 2, var)[1:5]
## [1]  6.279349  6.171918  2.320736  1.306302 10.851267
# transform removes covariance between columns so variance of transformed
# features are *approximately* equal
ecl_cov <- eclairs(Y, compute = "covariance")
Z1 <- decorrelate(Y, ecl_cov)

# variance are *approximately* equal
apply(Z1, 2, var)[1:5]
## [1] 0.9757741 0.9929288 0.9928508 0.9924159 0.9902491
# transform removes **correlation** between columns but variables are not
# scaled
ecl_cor <- eclairs(Y, compute = "correlation")
Z2 <- decorrelate(Y, ecl_cor)

# variances are not standardized
apply(Z2, 2, var)[1:5]
## [1]  6.173119  6.131217  2.296492  1.286559 10.779611

Session Info

## R version 4.4.2 (2024-10-31)
## Platform: aarch64-apple-darwin23.6.0
## Running under: macOS Sonoma 14.7.1
## 
## Matrix products: default
## BLAS:   /Users/gabrielhoffman/prog/R-4.4.2/lib/libRblas.dylib 
## LAPACK: /opt/homebrew/Cellar/r/4.5.1/lib/R/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] Rfast_2.1.5.1           RcppParallel_5.1.10     zigg_0.0.2             
##  [4] RcppZiggurat_0.1.8      Rcpp_1.0.14             clusterGeneration_1.3.8
##  [7] MASS_7.3-65             latex2exp_0.9.6         colorRamps_2.3.4       
## [10] cowplot_1.1.3           ggplot2_3.5.2           mvtnorm_1.3-3          
## [13] decorrelate_0.1.6.3    
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.10        generics_0.1.4     stringi_1.8.7      lattice_0.22-7    
##  [5] digest_0.6.37      magrittr_2.0.3     evaluate_1.0.4     grid_4.4.2        
##  [9] RColorBrewer_1.1-3 fastmap_1.2.0      jsonlite_2.0.0     Matrix_1.7-3      
## [13] formatR_1.14       scales_1.4.0       codetools_0.2-20   jquerylib_0.1.4   
## [17] cli_3.6.5          rlang_1.1.6        CholWishart_1.1.4  withr_3.0.2       
## [21] cachem_1.1.0       yaml_2.3.10        tools_4.4.2        parallel_4.4.2    
## [25] dplyr_1.1.4        corpcor_1.6.10     vctrs_0.6.5        R6_2.6.1          
## [29] lifecycle_1.0.4    stringr_1.5.1      irlba_2.3.5.1      pkgconfig_2.0.3   
## [33] pillar_1.10.2      bslib_0.9.0        gtable_0.3.6       glue_1.8.0        
## [37] whitening_1.4.0    xfun_0.52          tibble_3.3.0       tidyselect_1.2.1  
## [41] knitr_1.50         farver_2.1.2       htmltools_0.5.8.1  rmarkdown_2.29    
## [45] labeling_0.4.3     compiler_4.4.2