---
title: 'Extending multiblock: CCA and glmnet Examples'
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Extending multiblock: CCA and glmnet Examples}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
params:
family: red
css: albers.css
resource_files:
- albers.css
- albers.js
includes:
in_header: |-
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.width = 7,
fig.height = 5
)
# Load necessary packages for the examples
# Ensure 'multiblock' package itself is loaded for testing vignettes,
# e.g., via devtools::load_all() or library(multiblock)
library(tibble)
library(dplyr)
library(stats) # For cancor
library(glmnet) # For glmnet example
library(multivarious)
# Helper k-fold function (replace with package internal if available)
kfold_split <- function(n, k = 5) {
idx <- sample(rep(1:k, length.out = n))
lapply(1:k, function(j) list(train = which(idx != j),
test = which(idx == j)))
}
```
This vignette demonstrates how to wrap existing statistical models within the `multiblock` framework, specifically using `cross_projector` for Canonical Correlation Analysis (CCA) and `projector` for `glmnet` (penalized regression).
# 1. Canonical Correlation Analysis with `cross_projector`
## 1.1 Data Setup
We use the classic Iris dataset and split its features into two blocks:
* **X-block**: Sepal.Length, Sepal.Width
* **Y-block**: Petal.Length, Petal.Width
```{r data_cca}
data(iris)
X <- as.matrix(iris[, 1:2])
Y <- as.matrix(iris[, 3:4])
# Show first few rows of combined data
head(cbind(X, Y))
```
## 1.2 Wrap `stats::cancor()` into a `cross_projector`
We create a fitting function that runs standard CCA using `stats::cancor()` and then stores the resulting coefficients (loadings) and preprocessing steps in a `cross_projector` object.
```{r fit_cca_wrapper}
fit_cca <- function(Xtrain, Ytrain, ncomp = 2, ...) {
# Step 1: Define and fit preprocessors with training data
# Use center()+z-score for both blocks (swap to center() only if preferred)
preproc_x <- fit(colscale(center(), type = "z"), Xtrain)
preproc_y <- fit(colscale(center(), type = "z"), Ytrain)
# Step 2: Transform training data with the fitted preprocessors
Xp <- transform(preproc_x, Xtrain)
Yp <- transform(preproc_y, Ytrain)
# Step 3: Run CCA on the preprocessed data (no additional centering here)
cc <- stats::cancor(Xp, Yp, xcenter = FALSE, ycenter = FALSE)
# Step 4: Store loadings and preprocessors in a cross_projector
cp <- cross_projector(
vx = cc$xcoef[, 1:ncomp, drop = FALSE],
vy = cc$ycoef[, 1:ncomp, drop = FALSE],
preproc_x = preproc_x,
preproc_y = preproc_y,
classes = "cca_cross_projector"
)
attr(cp, "can_cor") <- cc$cor[1:ncomp]
cp
}
# Fit the model
cp_cca <- fit_cca(X, Y)
print(cp_cca)
attr(cp_cca, "can_cor") # Show canonical correlations
```
**What does this `cross_projector` enable?**
* **Projection:** Map new X or Y data into the shared latent space defined by CCA (`project(cp_cca, newX, from = "X")`).
* **Transfer/Prediction:** Predict the Y block from the X block, or vice-versa (`transfer(cp_cca, newX, from = "X", to = "Y")`).
* **Partial Features:** Project or transfer using only a subset of features from X or Y (`partial_project`).
* **Integration:** Use standardized `multiblock` tools like cross-validation (`cv`) and permutation testing (`perm_test`).
## 1.3 Bidirectional Prediction (Transfer)
Let's predict the Y-block (Petal measurements) using the X-block (Sepal measurements).
```{r transfer_cca}
Y_hat <- transfer(cp_cca, X, from = "X", to = "Y")
head(round(Y_hat, 2))
measure_reconstruction_error(Y, Y_hat, metrics = c("rmse", "r2"))
```
## 1.4 Using Partial Features
What if we only have Sepal.Length (the first column of X) at prediction time? We can still project into the latent space:
```{r partial_project_cca}
new_x_partial <- X[, 1, drop = FALSE]
scores_partial <- partial_project(cp_cca, new_x_partial,
colind = 1,
source = "X")
head(round(scores_partial, 3))
```
These scores represent the best estimate of position in the latent space given only partial input. This is useful for missing data scenarios or when only some measurements are available.
## 1.5 Cross-validated Component Selection
Cross-validation for two-block models requires wrapper functions that handle both X and Y blocks. The pattern is:
```{r cv_cca, eval=FALSE}
set.seed(1)
folds <- kfold_split(nrow(X), k = 5)
cv_fit <- function(train_data, ncomp) {
fit_cca(train_data$X, train_data$Y, ncomp = ncomp)
}
cv_measure <- function(model, test_data) {
measure_interblock_transfer_error(
test_data$X, test_data$Y, model,
metrics = c("x2y.rmse", "y2x.rmse")
)
}
cv_res <- cv_generic(
data = list(X = X, Y = Y),
folds = folds,
.fit_fun = cv_fit,
.measure_fun = cv_measure,
fit_args = list(ncomp = 2)
)
```
The key is packaging X and Y together so fold indices apply to both blocks simultaneously.
## 1.6 Permutation Test
We can test the significance of the X-Y relationship using `perm_test()`. The default shuffles rows of Y relative to X and measures whether the observed transfer error is lower than expected by chance.
```{r perm_test_cca, eval=FALSE}
perm_res <- perm_test(
cp_cca, X, Y = Y,
nperm = 199,
alternative = "less"
)
print(perm_res)
```
The default statistic is `x2y.mse` (mean squared error when predicting Y from X). A significant result (low p-value) indicates the CCA relationship is unlikely due to chance.
## 1.7 CCA Take-aways
* `cross_projector` provides a convenient wrapper for two-block methods like CCA.
* It enables using standard `multiblock` verbs (`project`, `transfer`, `partial_project`, `cv`, `perm_test`) with the wrapped model.
* Preprocessing is handled internally, reducing risk of data leakage.
* Other methods (PLS, O2PLS) can be integrated by providing a different `fit_fun`.
---
# 2. Example: Wrapping `glmnet` as a `projector`
The `projector` object maps data from its original space to a lower-dimensional space defined by its components (`v`). This is useful for dimensionality reduction (like PCA) but can also represent coefficients from supervised models like penalized regression.
Here, we fit LASSO using `glmnet` and wrap the resulting coefficients (excluding the intercept) into a `projector`.
## 2.1 Data and Model Fitting
```{r setup_glmnet}
# Generate sample data
set.seed(123)
n_obs <- 100
n_pred <- 50
X_glm <- matrix(rnorm(n_obs * n_pred), n_obs, n_pred)
# True coefficients (sparse)
true_beta <- matrix(0, n_pred, 1)
true_beta[1:10, 1] <- runif(10, -1, 1)
# Response variable with noise
y_glm <- X_glm %*% true_beta + rnorm(n_obs, sd = 0.5)
# Fit glmnet (LASSO, alpha=1)
# Typically use cv.glmnet to find lambda, but using a fixed one for simplicity
glm_fit <- glmnet::glmnet(X_glm, y_glm, alpha = 1) # alpha=1 is LASSO
# Choose a lambda (e.g., one near the end of the path)
chosen_lambda <- glm_fit$lambda[length(glm_fit$lambda) * 0.8]
# Get coefficients for this lambda
beta_hat <- coef(glm_fit, s = chosen_lambda)
print(paste("Number of non-zero coefficients:", sum(beta_hat != 0)))
# Extract coefficients, excluding the intercept
v_glm <- beta_hat[-1, 1, drop = FALSE] # Drop intercept, ensure it's a matrix
dim(v_glm) # Should be n_pred x 1
```
## 2.2 Create the `projector`
We define a `projector` using these coefficients. The projection `X %*% v` gives a single score per observation, representing the LASSO linear predictor. We also include centering and scaling as preprocessing, often recommended for `glmnet`.
```{r wrap_glmnet}
# Define preprocessor
# Define and fit preprocessor with training data
preproc_glm <- fit(colscale(center(), type = "z"), X_glm)
# Create the projector
proj_glm <- projector(
v = v_glm,
preproc = preproc_glm,
classes = "glmnet_projector"
)
print(proj_glm)
```
## 2.3 Project New Data
We can now use this `projector` to calculate the LASSO linear predictor score for new data.
```{r project_glmnet}
# Generate some new test data
X_glm_test <- matrix(rnorm(20 * n_pred), 20, n_pred)
# Project the test data
# project() handles applying the centering/scaling from preproc_glm
lasso_scores <- project(proj_glm, X_glm_test)
head(round(lasso_scores, 3))
dim(lasso_scores) # Should be n_test x 1 (since v_glm has 1 column)
# Compare with direct calculation using transform
# Apply the same preprocessing used within project()
X_glm_test_processed <- transform(preproc_glm, X_glm_test)
# Calculate scores directly using the processed data and coefficients
direct_scores <- X_glm_test_processed %*% v_glm
head(round(direct_scores, 3))
# Check they are close
all.equal(c(lasso_scores), c(direct_scores))
```