---
title: 'Contrastive PCA: Finding What''s Different Between Groups'
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Contrastive PCA: Finding What's Different Between Groups}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
params:
  family: red
css: albers.css
resource_files:
  - albers.css
  - albers.js
includes:
  in_header: |-
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>",
                      fig.width = 6, fig.height = 4)
library(multivarious)
library(ggplot2)
```

## What is Contrastive PCA?

Imagine you're studying two groups: patients with a disease and healthy controls. Both groups show variation in their measurements, but you're specifically interested in what makes the patients *different*. Standard PCA would find the largest sources of variation across all samples, which might be dominated by age, sex, or other factors common to both groups.

**Contrastive PCA (cPCA++) finds patterns that are enriched in one group (foreground) compared to another (background).**

## A Simple Example

Let's start with a practical example to see why contrastive PCA is useful:

```{r motivation_example}
set.seed(123)
n_samples <- 100
n_features <- 50

# Create background data (e.g., healthy controls)
# Main variation is in features 1-10
background <- matrix(rnorm(n_samples * n_features), n_samples, n_features)
background[, 1:10] <- background[, 1:10] * 3  # Strong common variation

# Create foreground data (e.g., patients)
# Has the same common variation PLUS disease-specific signal in features 20-25
foreground <- background[1:60, ]  # Start with same structure
foreground[, 20:25] <- foreground[, 20:25] +
  matrix(rnorm(60 * 6, sd = 2), 60, 6)

# Standard PCA on combined data
all_data <- rbind(background, foreground)
regular_pca <- pca(all_data, ncomp = 2)

# Contrastive PCA
cpca_result <- cPCAplus(X_f = foreground, X_b = background, ncomp = 2)

# Compare what each method finds (first 30 features only)
loadings_df <- rbind(
  data.frame(
    feature = factor(1:30),
    value = abs(regular_pca$v[1:30, 1]),
    method = "Standard PCA"
  ),
  data.frame(
    feature = factor(1:30),
    value = abs(cpca_result$v[1:30, 1]),
    method = "Contrastive PCA"
  )
)

ggplot(loadings_df, aes(x = feature, y = value)) +
  geom_col(fill = "#1f78b4") +
  facet_wrap(~method, nrow = 1) +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(
    x = "Feature",
    y = "|Loading|",
    title = "Loading magnitudes for the first component"
  )
```

Notice how standard PCA focuses on features 1-10 (the common variation), while contrastive PCA correctly identifies features 20-25 (the group-specific signal).
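We can also confirm this numerically. The sketch below uses only base R and the objects created above; it assumes the columns of `v` are directions in the original feature space. Projecting each group onto the first contrastive direction and comparing variances gives the enrichment directly, and the ratio is invariant to how the direction is scaled:

```{r variance_check}
# Sanity check (sketch): the foreground should be much more variable
# than the background along the first contrastive direction
dir1 <- cpca_result$v[, 1]
var_f <- var(as.vector(scale(foreground, scale = FALSE) %*% dir1))
var_b <- var(as.vector(scale(background, scale = FALSE) %*% dir1))
round(c(foreground = var_f, background = var_b, ratio = var_f / var_b), 2)
```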
## Using cPCAplus()

The `cPCAplus()` function makes contrastive PCA easy to use:

```{r basic_usage}
# Basic usage
cpca_fit <- cPCAplus(
  X_f = foreground,  # Your group of interest (foreground)
  X_b = background,  # Your reference group (background)
  ncomp = 5          # Number of components to extract
)

# The result is a bi_projector object with familiar methods
print(cpca_fit)

# Project new data
new_samples <- matrix(rnorm(10 * n_features), 10, n_features)
new_scores <- project(cpca_fit, new_samples)

# Reconstruct using top components
reconstructed <- reconstruct(cpca_fit, comp = 1:2)
```

## Understanding the Output

`cPCAplus()` returns a `bi_projector` object containing:

- **`v`**: Loadings (feature weights) for each component
- **`s`**: Scores (sample projections) for the foreground data
- **`sdev`**: Standard deviations of the scores, one per contrastive component
- **`values`**: The eigenvalue ratios (foreground variance / background variance)

```{r understanding_output}
# Which features contribute most to the first contrastive component?
top_features <- order(abs(cpca_fit$v[, 1]), decreasing = TRUE)[1:10]
print(paste("Top contributing features:",
            paste(top_features, collapse = ", ")))

# How much more variable is each component in foreground vs background?
print(paste("Variance ratios:",
            paste(round(cpca_fit$values[1:3], 2), collapse = ", ")))
```

## Common Applications

### 1. Biomedical Studies

```{r biomedical_example, eval=FALSE}
# Identify disease-specific patterns
tumor_cpca <- cPCAplus(
  X_f = tumor_samples,
  X_b = healthy_tissue,
  ncomp = 10
)
```

### 2. Technical Variation Removal

```{r technical_example, eval=FALSE}
# Use technical replicates as background to find biological signal
bio_cpca <- cPCAplus(
  X_f = biological_samples,
  X_b = technical_replicates,
  ncomp = 5
)
```

### 3. Time-Based Contrasts

```{r time_example, eval=FALSE}
# Find patterns specific to the post-treatment timepoint
treatment_cpca <- cPCAplus(
  X_f = after_treatment,
  X_b = before_treatment,
  ncomp = 5
)
```

## Advanced Options

### Handling High-Dimensional Data

When you have more features than samples (p >> n), use the efficient sample-space strategy:

```{r high_dim}
# Create a high-dimensional example
n_f <- 50; n_b <- 80; p <- 1000
X_background_hd <- matrix(rnorm(n_b * p), n_b, p)
# Add foreground-specific signal to the first 20 features
X_foreground_hd <- X_background_hd[1:n_f, ] +
  matrix(c(rnorm(n_f * 20, sd = 2), rep(0, n_f * (p - 20))), n_f, p)

# Use the sample-space strategy for efficiency
cpca_hd <- cPCAplus(X_f = X_foreground_hd, X_b = X_background_hd,
                    ncomp = 5, strategy = "sample")
```

### Regularization for an Unstable Background

If your background covariance is nearly singular (for example, when the background has few samples relative to features), add regularization:

```{r regularization}
# A small background sample size can lead to instability
small_background <- matrix(rnorm(20 * 100), 20, 100)
small_foreground <- matrix(rnorm(30 * 100), 30, 100)

# Add regularization
cpca_regularized <- cPCAplus(
  X_f = small_foreground,
  X_b = small_background,
  ncomp = 5,
  lambda = 0.1  # Regularization parameter for the background covariance
)
```

## When to Use Contrastive PCA

✓ **Use contrastive PCA when:**

- You have two groups and want to find patterns specific to one
- Background variation obscures your signal of interest
- You want to remove technical/batch effects captured by control samples

✗ **Don't use contrastive PCA when:**

- You only have one group (use standard PCA)
- Groups differ mainly in mean levels (use t-tests or LDA)
- The interesting variation is non-linear (consider kernel methods)
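One practical question in all of these settings is how many components to keep. Since `values` holds the foreground/background variance ratios, a scree-style plot of those ratios is a natural guide: ratios near 1 indicate directions that are no more variable in the foreground than in the background. A minimal sketch using the simulated data from earlier (the plotting choices are illustrative, not part of the API):

```{r choosing_ncomp, eval=FALSE}
# Sketch: inspect variance ratios to decide how many contrastive
# components carry real signal
fit <- cPCAplus(X_f = foreground, X_b = background, ncomp = 10)
plot(fit$values, type = "b",
     xlab = "Component",
     ylab = "Foreground / background variance ratio")
abline(h = 1, lty = 2)  # a ratio near 1 means no contrastive signal
```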
## Technical Details

Contrastive PCA++ solves the generalized eigenvalue problem:

$$\mathbf{R}_f \mathbf{v} = \lambda \mathbf{R}_b \mathbf{v}$$

where:

- $\mathbf{R}_f$ is the foreground covariance matrix
- $\mathbf{R}_b$ is the background covariance matrix
- $\lambda$ is the variance ratio (foreground/background)
- $\mathbf{v}$ is a contrastive direction

This finds directions that maximize the ratio of foreground to background variance, effectively highlighting patterns enriched in the foreground group.

The `geneig()` function provides the underlying solver with multiple algorithm options:

- `"geigen"`: General purpose, handles non-symmetric matrices
- `"robust"`: Fast for well-conditioned problems
- `"primme"`: Efficient for very large sparse matrices
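For illustration, here is a minimal sketch of that computation done by hand with `geneig()` on the two covariance matrices. The argument order and the shape of the returned object (`v` for eigenvectors) are assumptions based on the description above, so check `?geneig` before relying on them; `cPCAplus()` also handles centering and strategy selection internally, so results may differ in detail:

```{r geneig_by_hand, eval=FALSE}
# Sketch (assumed interface): solve R_f v = lambda R_b v directly
Rf <- cov(foreground)  # foreground covariance (cov() centers internally)
Rb <- cov(background)  # background covariance
ge <- geneig(Rf, Rb, ncomp = 2, method = "robust")

# The leading generalized eigenvector should align (up to sign) with the
# first cPCAplus loading vector
abs(cor(ge$v[, 1], cpca_result$v[, 1]))
```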
## References

- Abid, A., Zhang, M. J., Bagaria, V. K., & Zou, J. (2018). Exploring patterns enriched in a dataset with contrastive principal component analysis. *Nature Communications*, 9(1), 2134.
- Salloum, R., & Kuo, C. C. J. (2022). cPCA++: An efficient method for contrastive feature learning. *Pattern Recognition*, 124, 108378.
- Wu, M., Sun, Q., & Yang, Y. (2025). PCA++: How uniformity induces robustness to background noise in contrastive learning. *arXiv preprint* arXiv:2511.12278.
- Woller, J. P., Menrath, D., & Gharabaghi, A. (2025). Generalized contrastive PCA is equivalent to generalized eigendecomposition. *PLOS Computational Biology*, 21(10), e1013555.

## See Also

- `pca()` for standard principal component analysis
- `discriminant_projector()` for supervised dimensionality reduction
- `geneig()` for solving generalized eigenvalue problems directly