Type: | Package |
Title: | Decorrelation Projection Scalable to High Dimensional Data |
Version: | 0.1.6.3 |
Date: | 2025-06-14 |
Description: | Data whitening is a widely used preprocessing step to remove correlation structure since statistical models often assume independence. Here we use a probabilistic model of the observed data to apply a whitening transformation. Our Gaussian Inverse Wishart Empirical Bayes model substantially reduces computational complexity, and regularizes the eigen-values of the sample covariance matrix to improve out-of-sample performance. |
License: | Artistic-2.0 |
Encoding: | UTF-8 |
URL: | https://gabrielhoffman.github.io/decorrelate/ |
BugReports: | https://github.com/GabrielHoffman/decorrelate/issues |
Suggests: | knitr, pander, whitening, CCA, yacca, mvtnorm, ggplot2, cowplot, colorRamps, RUnit, latex2exp, clusterGeneration, rmarkdown |
Depends: | R (≥ 4.2.0), methods |
Imports: | Rfast, irlba, graphics, Rcpp, CholWishart, Matrix, utils, stats |
RoxygenNote: | 7.3.2 |
VignetteBuilder: | knitr |
LinkingTo: | Rcpp, RcppArmadillo |
NeedsCompilation: | yes |
Packaged: | 2025-07-14 18:42:00 UTC; gabrielhoffman |
Author: | Gabriel Hoffman |
Maintainer: | Gabriel Hoffman <gabriel.hoffman@mssm.edu> |
Repository: | CRAN |
Date/Publication: | 2025-07-17 20:40:02 UTC |
Create auto-correlation matrix
Description
Create auto-correlation matrix
Usage
autocorr.mat(p, rho)
Arguments
p |
dimension of matrix |
rho |
autocorrelation value |
Value
auto-matrix of size p with parameter rho
Examples
# Create 4x4 matrix with correlation between adjacent enties is 0.9
autocorr.mat(4, .9)
Summarize correlation matrix
Description
Summarize correlation matrix as a scalar scalar value, given its SVD and shrinkage parameter. averageCorr()
computes the average correlation, and averageCorrSq()
computes the average squared correlation, where both exclude the diagonal terms. sumInverseCorr()
computes the sum of entries in the inverse correlation matrix to give the 'effective number of independent features'. effVariance()
evaluates effective variance of the correlation (or covariance) matrix. These values can be computed using the correlation matrix using standard MLE, or EB shrinkage.
Usage
averageCorr(ecl, method = c("EB", "MLE"))
averageCorrSq(ecl, method = c("EB", "MLE"))
sumInverseCorr(ecl, method = c("EB", "MLE"), absolute = TRUE)
effVariance(ecl, method = c("EB", "MLE"))
tr(ecl, method = c("EB", "MLE"))
Arguments
ecl |
estimate of correlation matrix from |
method |
compute average correlation for either the empirical Bayes (EB) shinken correlation matrix or the MLE correlation matrix |
absolute |
if |
Details
tr()
: trace of the matrix. Sum of diagonals is the same as the sum of the eigen-values.
averageCorr()
: The average correlation is computed by summing the off-diagonal values in the correlation matrix. The sum of all elements in a matrix is g = \sum_{i,j} C_{i,j} = 1^T C 1
, where 1
is a vector of p
elements with all entries 1. This last term is a quadratic form of the correlation matrix that can be computed efficiently using the SVD and shrinkage parameter from eclairs()
. Given the value of g
, the average is computed by subtracting the diagonal values and dividing by the number of off-diagonal values: (g - p) / (p(p-1))
.
averageCorrSq()
: The average squared correlation is computed using only the eigen-values. Surprisingly, this is a function of the variance of the eigen-values. The is reviewed by Watanabe (2022) and Durand and Le Roux (2017). Letting \lambda_i
be the i^{th}
sample or shrunk eigen-value, and \tilde{\lambda}
be the mean eigen-value, then \sum_i (\lambda_i - \tilde{\lambda})^2 / p(p-1)\tilde{\lambda}^2
.
sumInverseCorr()
: The 'effective number of independent features' is computed by summing the entires of the inverse covariance matrix. This has the form \sum_{i,j} C^{-1}_{i,j} = 1^T C^{-1} 1
. This last term is a quadratic form of the correlation matrix that can be computed efficiently using the SVD and shrinkage parameter from eclairs()
as described above.
effVariance()
: Compute a metric of the amount of variation represented by a correlation (or covariance) matrix that is comparable across matrices of difference sizes. Proposed by Peña and Rodriguez (2003), the 'effective variance' is |C|^\frac{1}{p}
where C
is a correlation (or covariance matrix) between p
variables. The effective variance is the mean of the log eigen-values.
Value
value of summary statistic
References
Durand, J. L., & Le Roux, B. (2017). Linkage index of variables and its relationship with variance of eigenvalues in PCA and MCA. Statistica Applicata-Italian Journal of Applied Statistics, (2-3), 123-135.
Peña, D., & Rodriguez, J. (2003). Descriptive measures of multivariate scatter and linear dependence. Journal of Multivariate Analysis, 85(2), 361-374.
Watanabe, J. (2022). Statistics of eigenvalue dispersion indices: Quantifying the magnitude of phenotypic integration. Evolution, 76(1), 4-28.
Examples
library(Rfast)
n <- 200 # number of samples
p <- 800 # number of features
# create correlation matrix
Sigma <- matrix(.2, p, p)
diag(Sigma) <- 1
# draw data from correlation matrix Sigma
Y <- rmvnorm(n, rep(0, p), sigma = Sigma, seed = 1)
rownames(Y) <- paste0("sample_", seq(n))
colnames(Y) <- paste0("gene_", seq(p))
# eclairs decomposition
ecl <- eclairs(Y, compute = "cor")
# Average correlation value
averageCorr(ecl)
# Average squared correlation value
averageCorrSq(ecl)
# Sum elements in inverse correlation matrix
# Gives the effective number of independent features
sumInverseCorr(ecl)
# Effective variance
effVariance(ecl)
Canonical correlation analysis
Description
Canonical correlation analysis that is scalable to high dimensional data. Uses covariance shrinkage and algorithmic speed ups to be linear time in p when p > n.
Usage
cca(X, Y, k = min(dim(X), dim(Y)), lambda.x = NULL, lambda.y = NULL)
Arguments
X |
first matrix (n x p1) |
Y |
first matrix (n x p2) |
k |
number of canonical components to return |
lambda.x |
optional shrinkage parameter for estimating covariance of X. If NULL, estimate from data. |
lambda.y |
optional shrinkage parameter for estimating covariance of Y. If NULL, estimate from data. |
Details
Results from standard CCA are based on the SVD of \Sigma_{xx}^{-\frac{1}{2}} \Sigma_{xy} \Sigma_{yy}^{-\frac{1}{2}}
.
Avoids computation of \Sigma_{xx}^{-\frac{1}{2}}
by using eclairs. Avoids cov(X,Y) by framing this as a matrix product that can be distributed. Uses low rank SVD.
Other regularized CCA adds lambda to covariance like Ridge. Here it is a mixture
Value
statistics summarizing CCA
Estimate covariance matrix after applying transformation
Description
Given covariance between features in the original data, estimate the covariance matrix after applying a transformation to each feature. Here we use the eclairs decomposition of the original covariance matrix, perform a parametric bootstrap and return the eclairs decomposition of the covariance matrix of the transformed data.
Usage
cov_transform(
ecl,
f,
n.boot,
lambda = NULL,
compute = c("covariance", "correlation"),
seed = NULL
)
Arguments
ecl |
covariance/correlation matrix as an eclairs object |
f |
function specifying the transformation. |
n.boot |
number of parametric bootstrap samples. Increasing n gives more precise estimates. |
lambda |
shrinkage parameter. If not specified, it is estimated from the data. |
compute |
evaluate either the |
seed |
If you want the same to be generated again use a seed for the generator, an integer number |
Details
When the transformation is linear, these covariance matrices are the same.
Value
eclairs decomposition representing correlation/covariance on the transformed data
Examples
library(Rfast)
n <- 800 # number of samples
p <- 200 # number of features
# create correlation matrix
Sigma <- autocorr.mat(p, .9)
# sample matrix from MVN with covariance Sigma
Y <- rmvnorm(n, rep(0, p), sigma = Sigma, seed = 1)
# perform eclairs decomposition
ecl <- eclairs(Y)
# Parametric boostrap to estimate covariance
# after transformation
# transformation function
f <- function(x) log(x^2 + 1e-3)
# number of bootstrap samples
n_boot <- 5000
# Evaluate eclairs decomposition on boostrap samples
ecl2 <- cov_transform(ecl, f = f, n_boot, lambda = 1e-4)
# Get full covariance matrix from eclairs decomposition
C1 <- getCov(ecl2)
# Parametric boostrap samples directly from full covariance matrix
X <- rmvnorm(n_boot, rep(0, p), getCov(ecl))
# get covariance of transformed data
C2 <- cov(f(X))
# Evaluate differences
# small differences are due to Monte Carlo error from boostrap sampling
range(C1 - C2)
# Plot entries from two covariance estimates
par(pty = "s")
plot(C1, C2, main = "Concordance between covariances")
abline(0, 1, col = "red")
# Same above but compute eclairs for correlation matrix
#-------------------------------------------------------
# Evaluate eclairs decomposition on boostrap samples
ecl2 <- cov_transform(ecl, f = f, n_boot, compute = "correlation", lambda = 1e-4)
# Get full covariance matrix from eclairs decomposition
C1 <- getCor(ecl2)
# Parametric boostrap samples directly from full covariance matrix
X <- rmvnorm(n_boot, rep(0, p), getCov(ecl))
# get correlation of transformed data
C2 <- cor(f(X))
# Evaluate differences
# small differences are due to Monte Carlo error from boostrap sampling
range(C1 - C2)
# Plot entries from two correlation estimates
oldpar <- par(pty = "s")
plot(C1, C2, main = "Correlation between covariances")
abline(0, 1, col = "red")
par(oldpar)
Decorrelation projection
Description
Efficient decorrelation projection using eclairs decomposition
Usage
decorrelate(X, ecl, lambda, transpose = FALSE, alpha = -1/2)
Arguments
X |
matrix to be transformed so *columns* are independent |
ecl |
estimate of covariance/correlation matrix from eclairs storing |
lambda |
specify lambda and override value from |
transpose |
logical, (default FALSE) indicating if X should be transposed first |
alpha |
default = -1/2. Exponent of eigen-values |
Details
Apply a decorrelation transform using the implicit covariance approach to avoid directly evaluating the covariance matrix
Value
a matrix following the decorrelation transformation
Examples
library(Rfast)
n <- 800 # number of samples
p <- 200 # number of features
# create correlation matrix
Sigma <- autocorr.mat(p, .9)
# draw data from correlation matrix Sigma
Y <- rmvnorm(n, rep(0, p), sigma = Sigma * 5.1, seed = 1)
# eclairs decomposition
ecl <- eclairs(Y)
# whitened Y
Y.transform <- decorrelate(Y, ecl)
#
Multiply by diagonal matrix
Description
Multiply by diagonal matrix using efficient algorithm
Usage
dmult(M, v, side = c("left", "right"))
Arguments
M |
matrix |
v |
vector with entries forming a diagonal matrix matching the dimensions of |
side |
is the matrix |
Details
Naively multiplying by a diagonal matrix with p
entries takes \mathcal{O}(p^2)
, even though must values in the diagonal matrix are zero. R has built in syntax to scale columns, so diag(v) %*% M
can be evaluated with v * M
. This is often difficult to remember, hard to look up, and scaling rows requires t(t(M) * v)
. This is hard to read and write, and requires two transpose operations.
Here, dmult()
uses Rcpp
to evaluate the right multiplication efficiently, and uses v * M
for left multiplication. This gives good performance and readability.
In principle, the Rcpp
code can be modified to use BLAS for better performance by treating a NumericMatrix
as a C
array. But this is not currently a bottleneck
Value
matrix product
Examples
# right multiply
# mat %*% diag(v)
n <- 30
p <- 20
mat <- matrix(n * p, n, p)
v <- rnorm(p)
A <- dmult(mat, v, side = "right")
B <- mat %*% diag(v)
range(A - B)
# left multiply
# diag(v) %*% mat
n <- 30
p <- 20
mat <- matrix(n * p, p, n)
v <- rnorm(p)
A <- dmult(mat, v, side = "left")
B <- diag(v) %*% mat
range(A - B)
Estimate covariance/correlation with low rank and shrinkage
Description
Estimate the covariance/correlation between columns as the weighted sum of a low rank matrix and a scaled identity matrix. The weight acts to shrink the sample correlation matrix towards the identity matrix or the sample covariance matrix towards a scaled identity matrix with constant variance. An estimate of this form is useful because it is fast, and enables fast operations downstream. The method is based on the Gaussian Inverse Wishart Empirical Bayes (GIW-EB) model.
Usage
eclairs(
X,
k,
lambda = NULL,
compute = c("covariance", "correlation"),
n.samples = nrow(X)
)
Arguments
X |
data matrix with n samples as rows and p features as columns |
k |
the rank of the low rank component |
lambda |
shrinkage parameter. If not specified, it is estimated from the data. |
compute |
evaluate either the |
n.samples |
number of samples data is from. Usually |
Details
Compute U
, d^2
to approximate the correlation matrix between columns of data matrix X by U diag(d^2 (1-\lambda)) U^T + I\nu\lambda
. When computing the covariance matrix, scale by the standard deviation of each feature.
Value
eclairs object storing:
- U:
orthonormal matrix with k columns representing the low rank component
- dSq:
eigen-values so that
U diag(d^2) U^T
is the low rank component- lambda:
shrinkage parameter
\lambda
for the scaled diagonal component- sigma:
standard deviations of input columns
- nu:
diagonal value,
\nu
, of target matrix in shrinkage- n:
number of samples (i.e. rows) in the original data
- p:
number of features (i.e. columns) in the original data
- k:
rank of low rank component
- rownames:
sample names from the original matrix
- colnames:
features names from the original matrix
- method:
method used for decomposition
- call:
the function call
Examples
library(Rfast)
n <- 800 # number of samples
p <- 200 # number of features
# create correlation matrix
Sigma <- autocorr.mat(p, .9)
# draw data from correlation matrix Sigma
Y <- rmvnorm(n, rep(0, p), sigma = Sigma * 5.1, seed = 1)
rownames(Y) <- paste0("sample_", seq(n))
colnames(Y) <- paste0("gene_", seq(p))
# eclairs decomposition: covariance
ecl <- eclairs(Y, compute = "covariance")
ecl
# eclairs decomposition: correlation
ecl <- eclairs(Y, compute = "correlation")
ecl
Class eclairs
Description
Class eclairs
Details
Object storing:
- U:
orthonormal matrix with k columns representing the low rank component
- dSq:
eigen-values so that
U diag(d^2) U^T
is the low rank component- lambda:
shrinkage parameter
\lambda
for the scaled diagonal component- sigma:
standard deviations of input columns
- nu:
diagonal value,
\nu
, of target matrix in shrinkage- n:
number of samples (i.e. rows) in the original data
- p:
number of features (i.e. columns) in the original data
- k:
rank of low rank component
- rownames:
sample names from the original matrix
- colnames:
features names from the original matrix
- method:
method used for decomposition
- call:
the function call
Estimate covariance/correlation with low rank and shrinkage
Description
Estimate covariance/correlation with low rank and shrinkage from the correlation matrix
Usage
eclairs_corMat(C, n, k = min(n, nrow(C)), lambda = NULL)
Arguments
C |
sample correlation matrix between features |
n |
number of samples used to estimate the sample correlation matrix |
k |
the rank of the low rank component. Defaults to min of sample size and feature number, |
lambda |
shrinkage parameter. If not specified, it is estimated from the data. |
Value
eclairs object storing:
- U:
orthonormal matrix with k columns representing the low rank component
- dSq:
eigen-values so that
U diag(d^2) U^T
is the low rank component- lambda:
shrinkage parameter
\lambda
for the scaled diagonal component- sigma:
standard deviations of input columns
- nu:
diagonal value,
\nu
, of target matrix in shrinkage- n:
number of samples (i.e. rows) in the original data
- p:
number of features (i.e. columns) in the original data
- k:
rank of low rank component
- rownames:
sample names from the original matrix
- colnames:
features names from the original matrix
- method:
method used for decomposition
- call:
the function call
Examples
library(Rfast)
n <- 800 # number of samples
p <- 200 # number of features
# create correlation matrix
Sigma <- autocorr.mat(p, .9)
# draw data from correlation matrix Sigma
Y <- rmvnorm(n, rep(0, p), sigma = Sigma * 5.1, seed = 1)
rownames(Y) <- paste0("sample_", 1:n)
colnames(Y) <- paste0("gene_", 1:p)
# eclairs decomposition
eclairs(Y, compute = "correlation")
# eclairs decomposition from correlation matrix
eclairs_corMat(cor(Y), n = n)
Compute eclairs decomp of squared correlation matrix
Description
Given the eclairs decomp of C, compute the eclairs decomp of C^2
Usage
eclairs_sq(ecl, rank1 = ecl$k, rank2 = Inf, varianceFraction = 1)
Arguments
ecl |
estimate of correlation matrix from |
rank1 |
use the first 'rank' singular vectors from the SVD. Using increasing rank1 will increase the accuracy of the estimation. But note that the computationaly complexity is O(P choose(rank, 2)), where P is the number of features in the dataset |
rank2 |
rank of |
varianceFraction |
fraction of variance to retain after truncating trailing eigen values |
Details
Consider a data matrix X_{N x P}
of P
features and N
samples where N << P
. Let the columns of X be scaled so that C_{P x P} = XX^T
. C is often too big to compute directly since it is O(P^2) and O(P^3) to invert. But we can compute the SVD of X in O(PN^2).
The goal is to compute the SVD of the matrix C^2, given only the SVD of C in less than O(P^2)
time. Here we compute this SVD of C^2 in O(PN^4)
time, which is tractible for small N.
Moreover, if we use an SVD X = UDV^T with of rank R, we can approximate the SVD of C^2 in O(PR^4) using only D and V
Value
compute the eclairs of C^2
Examples
# Compute correlations directly and using eclairs decomp
n <- 600 # number of samples
p <- 100 # number of features
# create correlation matrix
Sigma <- autocorr.mat(p, .9)
# draw data from correlation matrix Sigma
Y <- Rfast::rmvnorm(n, rep(0, p), sigma = Sigma, seed = 1)
rownames(Y) <- paste0("sample_", seq(n))
colnames(Y) <- paste0("gene_", seq(p))
# correlation computed directly
C <- cor(Y)
# correlation from eclairs decomposition
ecl <- eclairs(Y, compute = "cor")
C.eclairs <- getCor(ecl, lambda = 0)
all.equal(C, C.eclairs)
# Correlation of Y^2
#-------------------
# exact quadratic way
C <- 2 * cor(Y)^2
# faster low rank
ecl2 <- eclairs_sq(ecl)
C.eclairs <- 2 * getCor(ecl2, lambda = 0)
all.equal(C.eclairs, C)
Fast canonical correlation analysis
Description
Fast Canonical correlation analysis that is scalable to high dimensional data. Uses covariance shrinkage and algorithmic speed ups to be linear time in p when p > n.
Usage
fastcca(X, Y, k = min(dim(X), dim(Y)), lambda.x = NULL, lambda.y = NULL)
Arguments
X |
first matrix (n x p1) |
Y |
first matrix (n x p2) |
k |
number of canonical components to return |
lambda.x |
optional shrinkage parameter for estimating covariance of X. If NULL, estimate from data. |
lambda.y |
optional shrinkage parameter for estimating covariance of Y. If NULL, estimate from data. |
Details
summary statistics of CCA
Results from standard CCA are based on the SVD of \Sigma_{xx}^{-\frac{1}{2}} \Sigma_{xy} \Sigma_{yy}^{-\frac{1}{2}}
.
Uses eclairs()
and empirical Bayes covariance regularization, and applies speed up of RCCA (Tuzhilina, et al. 2023) to perform CCA on n PCs and instead of p features. Memory usage is \mathcal{O}(np)
instead of \mathcal{O}(p^2)
. Computation is \mathcal{O}(n^2p)
instead of \mathcal{O}(p^3)
or \mathcal{O}(np^2)
Value
fastcca
object
References
Tuzhilina, E., Tozzi, L., & Hastie, T. (2023). Canonical correlation analysis in high dimensions with structured regularization. Statistical modelling, 23(3), 203-227.
Class fastcca
Description
Class fastcca
Details
Object storing:
- n.comp:
number of canonical components
- cors:
canonical correlations
- x.coefs:
canonical coefficients for X
- x.vars:
canonical variates for X
- y.coefs:
canonical coefficients for Y
- y.vars:
canonical variates for Y
- lambdas:
shrinkage parameters from
eclairs
Get full covariance/correlation matrix from eclairs
Description
Get full covariance/correlation matrix from eclairs decomposition
Usage
getCov(ecl, lambda, ...)
getCor(ecl, lambda, ...)
## S4 method for signature 'eclairs'
getCov(ecl, lambda, ...)
## S4 method for signature 'eclairs'
getCor(ecl, lambda, ...)
Arguments
ecl |
eclairs decomposition |
lambda |
shrinkage parameter for the convex combination. |
... |
other arguments |
Details
The full matrix is computationally expensive to compute and uses a lot of memory for large p. So it is better to use decorrelate or mult_eclairs to perform projections in O(np)
time.
Value
p x p covariance/correlation matrix
Examples
library(Rfast)
n <- 800 # number of samples
p <- 200 # number of features
# create correlation matrix
Sigma <- autocorr.mat(p, .9)
# draw data from correlation matrix Sigma
Y <- rmvnorm(n, rep(0, p), sigma = Sigma * 5.1, seed = 1)
rownames(Y) <- paste0("sample_", seq(n))
colnames(Y) <- paste0("gene_", seq(p))
# eclairs decomposition
ecl <- eclairs(Y)
# extract covariance implied by eclairs decomposition
getCov(ecl)[1:3, 1:3]
Estimate shrinkage parameter by empirical Bayes
Description
Estimate shrinkage parameter by empirical Bayes
Usage
getShrinkageParams(ecl, k = ecl$k, minimum = 1e-04, lambda = NULL)
Arguments
ecl |
|
k |
number of singular vectors to use |
minimum |
minimum value of lambda |
lambda |
(default: NULL) If NULL, estimate lambda from data. Else evaluate logML using specified lambda value. |
Details
Estimate shrinkage parameter for covariance matrix estimation using empirical Bayes method (Hannart and Naveau, 2014; Leday and Richardson 2019). The shrinage estimate of the covariance matrix is (1-\lambda)\hat\Sigma + \lambda \nu I
, where \hat\Sigma
is the sample covariance matrix, given a value of \lambda
. A large value of \lambda
indicates more weight on the prior.
Value
value \lambda
and \nu
indicating the shrinkage between sample and prior covariance matrices.
References
Hannart, A., & Naveau, P. (2014). Estimating high dimensional covariance matrices: A new look at the Gaussian conjugate framework. Journal of Multivariate Analysis, 131, 149-162.
Leday, G. G., & Richardson, S. (2019). Fast Bayesian inference in large Gaussian graphical models. Biometrics, 75(4), 1288-1298.
Examples
library(Rfast)
n <- 800 # number of samples
p <- 200 # number of features
# create correlation matrix
Sigma <- autocorr.mat(p, .9)
# draw data from correlation matrix Sigma
Y <- rmvnorm(n, rep(0, p), sigma = Sigma * 5.1, seed = 1)
rownames(Y) <- paste0("sample_", seq(n))
colnames(Y) <- paste0("gene_", seq(p))
# eclairs decomposition: covariance
ecl <- eclairs(Y, compute = "correlation")
# For full SVD
getShrinkageParams(ecl)
# For truncated SVD at k = 20
getShrinkageParams(ecl, k = 20)
Get whitening matrix
Description
Get whitening matrix implied by eclairs decompostion
Usage
getWhiteningMatrix(ecl, lambda)
Arguments
ecl |
estimate of covariance/correlation matrix from eclairs storing |
lambda |
specify lambda and override value from |
Value
whitening matrix
Examples
library(Rfast)
n <- 2000
p <- 3
Y <- matrnorm(n, p, seed = 1) * 10
# decorrelate with implicit whitening matrix
# give same result as explicity whitening matrix
ecl <- eclairs(Y, compute = "covariance")
# get explicit whitening matrix
W <- getWhiteningMatrix(ecl)
# apply explicit whitening matrix
Z1 <- tcrossprod(Y, W)
# use implicit whitening matrix
Z2 <- decorrelate(Y, ecl)
range(Z1 - Z2)
Compute condition number
Description
Compute condition number of matrix from eclairs
decomposition
Usage
## S4 method for signature 'eclairs'
kappa(z, lambda = NULL)
Arguments
z |
|
lambda |
specify lambda to override value from |
Value
condition number of the correlation matrix. If z
is a covariance matrix, kappa is only computed for the correlation component
Examples
library(Rfast)
n <- 800 # number of samples
p <- 200 # number of features
# create correlation matrix
Sigma <- autocorr.mat(p, .9)
# draw data from correlation matrix Sigma
Y <- rmvnorm(n, rep(0, p), sigma = Sigma * 5.1, seed = 1)
rownames(Y) <- paste0("sample_", seq(n))
colnames(Y) <- paste0("gene_", seq(p))
# eclairs decomposition
ecl <- eclairs(Y, compute = "correlation")
# compute condition number
kappa(ecl)
Fit linear model on each feature after decorrelating
Description
Fit linear model on each feature after applying decorrelation projection to response and predictors.
Usage
lm_each_eclairs(
formula,
data,
X,
ecl,
subset,
weights,
na.action,
method = "qr",
model = TRUE,
x = FALSE,
y = FALSE,
qr = TRUE,
singular.ok = TRUE,
contrasts = NULL,
offset,
...
)
Arguments
formula |
an object of class 'formula' (or one that can be coerced to that class): a symbolic description of the model to be fitted. |
data |
a matrix or data.frame containing the variables in the model |
X |
matrix or data.frame where each column stores a predictor to be evaluated by the regression model one at a time. The |
ecl |
estimate of covariance/correlation matrix from eclairs storing |
subset |
same as for lm |
weights |
same as for lm |
na.action |
same as for lm |
method |
same as for lm |
model |
same as for lm |
x |
same as for lm |
y |
same as for lm |
qr |
same as for lm |
singular.ok |
same as for lm |
contrasts |
same as for lm |
offset |
same as for lm |
... |
other arguments passed to |
Value
data.frame with columns beta
, se
, tsat
, pvalue
storing results for regression model fit for each feature
Examples
library(Rfast)
n <- 800 # number of samples
p <- 200 # number of features
# create correlation matrix
Sigma <- autocorr.mat(p, .9)
# draw data from correlation matrix Sigma
Y <- rmvnorm(n, rep(0, p), sigma = Sigma * 5.1, seed = 1)
# eclairs decomposition
ecl <- eclairs(Y)
# simulate covariates
data <- data.frame(matrnorm(p, 2, seed = 1))
colnames(data) <- paste0("v", 1:2)
# simulate response
y <- rnorm(p)
# Simulate 1000 features to test
X <- matrnorm(p, 1000, seed = 1)
colnames(X) <- paste0("set_", seq(ncol(X)))
# Use linear model to test each feature stored as columns in X
res <- lm_each_eclairs(y ~ v1 + v2, data, X, ecl)
head(res)
# Analysis after non-linear transform
#------------------------------------
# Apply function to transforme data
f <- function(x) log(x^2 + 0.001)
# evaluate covariance of transformed data
ecl_transform <- cov_transform(ecl, f, 100)
# Use linear model to test each feature stored as columns in X
# in data transformed by f()
res2 <- lm_each_eclairs(f(y) ~ v1 + v2, data, X, ecl_transform)
head(res)
Fit linear model after decorrelating
Description
Fit linear model after applying decorrelation projection to response and predictors.
Usage
lm_eclairs(
formula,
data,
ecl,
subset,
weights,
na.action,
method = "qr",
model = TRUE,
x = FALSE,
y = FALSE,
qr = TRUE,
singular.ok = TRUE,
contrasts = NULL,
offset,
...
)
Arguments
formula |
an object of class 'formula' (or one that can be coerced to that class): a symbolic description of the model to be fitted. |
data |
a matrix or data.frame containing the variables in the model |
ecl |
estimate of covariance/correlation matrix from eclairs storing |
subset |
same as for lm |
weights |
same as for lm |
na.action |
same as for lm |
method |
same as for lm |
model |
same as for lm |
x |
same as for lm |
y |
same as for lm |
qr |
same as for lm |
singular.ok |
same as for lm |
contrasts |
same as for lm |
offset |
same as for lm |
... |
same as for lm |
Details
This function fit a linear regression to the transformed response, and transformed design matrix. Note that the design matrix, not just the data.frame of variables is transformed so that 1) factors are transformed and 2) the intercept term is transformed.
Value
Object of class lm
returned by function lm
Examples
library(Rfast)
n <- 800 # number of samples
p <- 200 # number of features
# create correlation matrix
Sigma <- autocorr.mat(p, .9)
# draw data from correlation matrix Sigma
Y <- rmvnorm(n, rep(0, p), sigma = Sigma * 5.1, seed = 1)
# eclairs decomposition
ecl <- eclairs(Y)
# simulate covariates
data <- data.frame(matrnorm(p, 2, seed = 1))
colnames(data) <- paste0("v", 1:2)
# simulate response
y <- rnorm(p)
# fit linear model on transformed data
lm_eclairs(y ~ v1 + v2, data, ecl)
Evaluate the log determinant
Description
Evaluate the log determinant of the matrix
Usage
logDet(ecl, alpha = 1)
Arguments
ecl |
estimate of covariance/correlation matrix from |
alpha |
exponent to be applied to eigen-values |
Value
log determinant
Examples
library(Rfast)
n <- 800 # number of samples
p <- 200 # number of features
# create correlation matrix
Sigma <- autocorr.mat(p, .9)
# draw data from correlation matrix Sigma
Y <- rmvnorm(n, rep(0, p), sigma = Sigma * 5.1, seed = 1)
rownames(Y) <- paste0("sample_", seq(n))
colnames(Y) <- paste0("gene_", seq(p))
# eclairs decomposition
ecl <- eclairs(Y)
logDet(ecl)
Mahalanobis Distance
Description
Mahalanobis Distance using eclairs()
decomposition
Usage
mahalanobisDistance(ecl, X, lambda, center = FALSE)
Arguments
ecl |
estimate of covariance/correlation matrix from |
X |
data matrix |
lambda |
specify lambda and override value from ‘ecl’ |
center |
logical: should columns be centered internally |
Details
Evaluate quadratic form (X-\mu)^T \Sigma^{-1} (X-\mu)
where covariance is estimated from finite sample
Value
array of distances
Examples
library(Rfast)
n <- 800 # number of samples
p <- 200 # number of features
# create correlation matrix
Sigma <- autocorr.mat(p, .9)
# draw data from correlation matrix Sigma
Y <- rmvnorm(n, rep(0, p), sigma = Sigma * 5.1, seed = 1)
# eclairs decomposition
ecl <- eclairs(Y)
# Mahalanobis distance after mean centering
Y_center <- scale(Y, scale = FALSE)
mu <- colMeans(Y)
# Standard R method
a <- mahalanobis(Y, mu, cov = cov(Y))
# distance using eclairs decomposition, no shrinage
b <- mahalanobisDistance(ecl, Y_center, lambda = 0)
range(a - b)
# with shrinkage
d <- mahalanobisDistance(ecl, Y_center)
# centering internally
e <- mahalanobisDistance(ecl, Y, center = TRUE)
range(d - e)
#
Multiply by eclairs matrix
Description
Multiply by eclairs matrix using special structure to achieve linear instead of cubic time complexity.
Usage
mult_eclairs(X, U1, dSq1, lambda, nu, alpha, sigma, transpose = FALSE)
Arguments
X |
matrix to be transformed so *columns* are independent |
U1 |
orthonormal matrix with k columns representing the low rank component |
dSq1 |
eigen values so that |
lambda |
shrinkage parameter for the convex combination. |
nu |
diagonal value of target matrix in shrinkage |
alpha |
exponent to be evaluated |
sigma |
standard deviation of each feature |
transpose |
logical, (default FALSE) indicating if X should be transposed first |
Details
Let \Sigma = U_1 diag(d_1^2) U_1^T * (1-\lambda) + diag(\nu\lambda, p)
, where \lambda
shrinkage parameter for the convex combination between a low rank matrix and the diagonal matrix with values \nu
.
Evaluate X \Sigma^\alpha
using special structure of the eclairs decomposition in O(k^2p)
when there are k
components in the decomposition.
Value
a matrix product
Optimal Hard Threshold for Singular Values
Description
A function for the calculation of the coefficient determining optimal location of hard threshold for matrix denoising by singular values hard thresholding when noise level is known or unknown. Recreation of MATLAB code by Matan Gavish and David Donoho.
Usage
optimal_SVHT_coef(beta, sigma_known = FALSE)
Arguments
beta |
A single value or a vector that represents aspect ratio m/n of the matrix to
be denoised. 0< |
sigma_known |
A logical value. TRUE if noise level known, FALSE if unknown. |
Value
Optimal location of hard threshold, up the median data singular value (sigma
unknown) or up to sigma*sqrt(n)
(sigma known); a vector of the same dimension
as beta
, where coef[i]
is the coefficient corresponding to beta[i]
.
References
Gavish, M., & Donoho, D. L. (2014). The optimal hard threshold for singular values is 4/sqrt(3). IEEE Transactions on Information Theory, 60(8), 5040-5053.
Plot eclairs object
Description
Plot eclairs object
Usage
## S4 method for signature 'eclairs'
plot(x, y, ...)
Arguments
x |
eclairs object |
y |
extra argument, not used |
... |
additional arguments |
Value
plot
Evaluate quadratic form
Description
Evaluate quadratic form
Usage
quadForm(ecl, A, alpha = -1/2)
Arguments
ecl |
estimate of covariance/correlation matrix from |
A |
matrix |
alpha |
default = -1/2. Exponent of eigen-values |
Details
Evaluate quadratic form A^T \Sigma^{2\alpha} A
Value
scalar value
Examples
library(Rfast)
n <- 800 # number of samples
p <- 200 # number of features
# create correlation matrix
Sigma <- autocorr.mat(p, .9)
# draw data from correlation matrix Sigma
Y <- rmvnorm(n, rep(0, p), sigma = Sigma * 5.1, seed = 1)
# eclairs decomposition
ecl <- eclairs(Y)
# return scalar
quadForm(ecl, Y[1, ])
# return matrix
quadForm(ecl, Y[1:2, ])
Recompute eclairs after dropping features
Description
Recompute eclairs after dropping features
Usage
reform_decomp(ecl, k = ecl$k, drop = NULL)
Arguments
ecl |
covariance/correlation matrix as an eclairs object |
k |
the rank of the low rank component |
drop |
array of variable names to drop. |
Details
Reform the dataset from the eclairs decomposition, drop features, then recompute the eclairs decomposition. If the original SVD/eigen was truncated, then the reconstruction of the original data will be approximate. Note that the target shrinkage matrix is the same as in ecl
, so \nu
is not recomputed from the retained features.
Value
eclairs decomposition for a subset of features
Examples
library(Rfast)
n <- 800 # number of samples
p <- 200 # number of features
# create correlation matrix
Sigma <- autocorr.mat(p, .9)
# draw data from correlation matrix Sigma
Y <- rmvnorm(n, rep(0, p), sigma = Sigma * 5.1, seed = 1)
rownames(Y) <- paste0("sample_", seq(n))
colnames(Y) <- paste0("gene_", seq(p))
# Correlation
#------------
# eclairs decomposition
Sigma.eclairs <- eclairs(Y, compute = "correlation")
# features to drop
drop <- paste0("gene_", 1:100)
# Compute SVD on subset of eclairs decomposition
ecl1 <- reform_decomp(Sigma.eclairs, drop = drop)
ecl1
Draw from multivariate normal and t distributions
Description
Draw from multivariate normal and t distributions using eclairs decomposition
Usage
rmvnorm_eclairs(n, mu, ecl, v = Inf, seed = NULL)
Arguments
n |
sample size |
mu |
mean vector |
ecl |
covariance matrix as an eclairs object |
v |
degrees of freedom, defaults to Inf. If finite, uses a multivariate t distribution |
seed |
If you want the same to be generated again use a seed for the generator, an integer number |
Details
Draw from multivariate normal and t distributions using eclairs decomposition. If the (implied) covariance matrix is p \times p
, the standard approach is O(p^3)
. Taking advantage of the previously computed eclairs decomposition of rank k
, this can be done in O(pk^2)
.
Value
matrix where rows are samples from multivariate normal or t distribution where columns have covariance specified by ecl
Examples
library(Rfast)
n <- 800 # number of samples
p <- 200 # number of features
# create correlation matrix
Sigma <- autocorr.mat(p, .9)
# draw data from correlation matrix Sigma
Y <- rmvnorm(n, rep(0, p), sigma = Sigma, seed = 1)
# perform eclairs decomposition
ecl <- eclairs(Y)
# draw from multivariate normal
n <- 10000
mu <- rep(0, ncol(Y))
# using eclairs decomposition
X.draw1 <- rmvnorm_eclairs(n, mu, ecl)
# using full covariance matrix implied by eclairs model
X.draw2 <- rmvnorm(n, mu, getCov(ecl))
# assess difference betweeen covariances from two methods
range(cov(X.draw1) - cov(X.draw2))
# compare covariance to the covariance matrix used to simulated the data
range(cov(X.draw1) - getCov(ecl))
Singular value thresholding
Description
Singular value thresholding evaluates the optimal number of singular values to retain
Usage
sv_threshold(n, p, d)
Arguments
n |
number of samples |
p |
number of features |
d |
singular values |
Value
Number of singular values to retain
References
Gavish, M., & Donoho, D. L. (2014). The optimal hard threshold for singular values is 4/sqrt(3). IEEE Transactions on Information Theory, 60(8), 5040-5053.
Examples
# simulate data
n <- 500
p <- 5000
Y <- Rfast::matrnorm(n, p, seed = 1)
# SVD
dcmp <- svd(Y)
# how many components to retain
sv_threshold(n, p, dcmp$d)
# in this case the data has no structure, so no components are retained
Decorrelation projection + eclairs
Description
Efficient decorrelation projection using eclairs decomposition
Usage
whiten(X, k = ncol(X), lambda = NULL)
Arguments
X |
matrix to be transformed so *columns* are independent |
k |
the rank of the low rank component |
lambda |
specify lambda and override value estimated by |
Value
data rotated and scaled according to the regularized sample covariance of the input data
Examples
library(Rfast)
n <- 800 # number of samples
p <- 200 # number of features
# create correlation matrix
Sigma <- autocorr.mat(p, .9)
# draw data from correlation matrix Sigma
Y <- rmvnorm(n, rep(0, p), sigma = Sigma * 5.1, seed = 1)
# eclairs decomposition
ecl <- eclairs(Y)
# whitened Y
Y.transform <- decorrelate(Y, ecl)
# Combine eclairs and decorrelate into one step
Y.transform2 <- whiten(Y)