--- title: 'Extending multiblock: CCA and glmnet Examples' output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Extending multiblock: CCA and glmnet Examples} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} params: family: red css: albers.css resource_files: - albers.css - albers.js includes: in_header: |- --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5 ) # Load necessary packages for the examples # Ensure 'multiblock' package itself is loaded for testing vignettes, # e.g., via devtools::load_all() or library(multiblock) library(tibble) library(dplyr) library(stats) # For cancor library(glmnet) # For glmnet example library(multivarious) # Helper k-fold function (replace with package internal if available) kfold_split <- function(n, k = 5) { idx <- sample(rep(1:k, length.out = n)) lapply(1:k, function(j) list(train = which(idx != j), test = which(idx == j))) } ``` This vignette demonstrates how to wrap existing statistical models within the `multiblock` framework, specifically using `cross_projector` for Canonical Correlation Analysis (CCA) and `projector` for `glmnet` (penalized regression). # 1. Canonical Correlation Analysis with `cross_projector` ## 1.1 Data Setup We use the classic Iris dataset and split its features into two blocks: * **X-block**: Sepal.Length, Sepal.Width * **Y-block**: Petal.Length, Petal.Width ```{r data_cca} data(iris) X <- as.matrix(iris[, 1:2]) Y <- as.matrix(iris[, 3:4]) # Show first few rows of combined data head(cbind(X, Y)) ``` ## 1.2 Wrap `stats::cancor()` into a `cross_projector` We create a fitting function that runs standard CCA using `stats::cancor()` and then stores the resulting coefficients (loadings) and preprocessing steps in a `cross_projector` object. ```{r fit_cca_wrapper} fit_cca <- function(Xtrain, Ytrain, ncomp = 2, ...) { # Step 1: Define and fit preprocessors with training data # Use center()+z-score for both blocks (swap to center() only if preferred) preproc_x <- fit(colscale(center(), type = "z"), Xtrain) preproc_y <- fit(colscale(center(), type = "z"), Ytrain) # Step 2: Transform training data with the fitted preprocessors Xp <- transform(preproc_x, Xtrain) Yp <- transform(preproc_y, Ytrain) # Step 3: Run CCA on the preprocessed data (no additional centering here) cc <- stats::cancor(Xp, Yp, xcenter = FALSE, ycenter = FALSE) # Step 4: Store loadings and preprocessors in a cross_projector cp <- cross_projector( vx = cc$xcoef[, 1:ncomp, drop = FALSE], vy = cc$ycoef[, 1:ncomp, drop = FALSE], preproc_x = preproc_x, preproc_y = preproc_y, classes = "cca_cross_projector" ) attr(cp, "can_cor") <- cc$cor[1:ncomp] cp } # Fit the model cp_cca <- fit_cca(X, Y) print(cp_cca) attr(cp_cca, "can_cor") # Show canonical correlations ``` **What does this `cross_projector` enable?** * **Projection:** Map new X or Y data into the shared latent space defined by CCA (`project(cp_cca, newX, from = "X")`). * **Transfer/Prediction:** Predict the Y block from the X block, or vice-versa (`transfer(cp_cca, newX, from = "X", to = "Y")`). * **Partial Features:** Project or transfer using only a subset of features from X or Y (`partial_project`). * **Integration:** Use standardized `multiblock` tools like cross-validation (`cv`) and permutation testing (`perm_test`). ## 1.3 Bidirectional Prediction (Transfer) Let's predict the Y-block (Petal measurements) using the X-block (Sepal measurements). ```{r transfer_cca} Y_hat <- transfer(cp_cca, X, from = "X", to = "Y") head(round(Y_hat, 2)) measure_reconstruction_error(Y, Y_hat, metrics = c("rmse", "r2")) ``` ## 1.4 Using Partial Features What if we only have Sepal.Length (the first column of X) at prediction time? We can still project into the latent space: ```{r partial_project_cca} new_x_partial <- X[, 1, drop = FALSE] scores_partial <- partial_project(cp_cca, new_x_partial, colind = 1, source = "X") head(round(scores_partial, 3)) ``` These scores represent the best estimate of position in the latent space given only partial input. This is useful for missing data scenarios or when only some measurements are available. ## 1.5 Cross-validated Component Selection Cross-validation for two-block models requires wrapper functions that handle both X and Y blocks. The pattern is: ```{r cv_cca, eval=FALSE} set.seed(1) folds <- kfold_split(nrow(X), k = 5) cv_fit <- function(train_data, ncomp) { fit_cca(train_data$X, train_data$Y, ncomp = ncomp) } cv_measure <- function(model, test_data) { measure_interblock_transfer_error( test_data$X, test_data$Y, model, metrics = c("x2y.rmse", "y2x.rmse") ) } cv_res <- cv_generic( data = list(X = X, Y = Y), folds = folds, .fit_fun = cv_fit, .measure_fun = cv_measure, fit_args = list(ncomp = 2) ) ``` The key is packaging X and Y together so fold indices apply to both blocks simultaneously. ## 1.6 Permutation Test We can test the significance of the X-Y relationship using `perm_test()`. The default shuffles rows of Y relative to X and measures whether the observed transfer error is lower than expected by chance. ```{r perm_test_cca, eval=FALSE} perm_res <- perm_test( cp_cca, X, Y = Y, nperm = 199, alternative = "less" ) print(perm_res) ``` The default statistic is `x2y.mse` (mean squared error when predicting Y from X). A significant result (low p-value) indicates the CCA relationship is unlikely due to chance. ## 1.7 CCA Take-aways * `cross_projector` provides a convenient wrapper for two-block methods like CCA. * It enables using standard `multiblock` verbs (`project`, `transfer`, `partial_project`, `cv`, `perm_test`) with the wrapped model. * Preprocessing is handled internally, reducing risk of data leakage. * Other methods (PLS, O2PLS) can be integrated by providing a different `fit_fun`. --- # 2. Example: Wrapping `glmnet` as a `projector` The `projector` object maps data from its original space to a lower-dimensional space defined by its components (`v`). This is useful for dimensionality reduction (like PCA) but can also represent coefficients from supervised models like penalized regression. Here, we fit LASSO using `glmnet` and wrap the resulting coefficients (excluding the intercept) into a `projector`. ## 2.1 Data and Model Fitting ```{r setup_glmnet} # Generate sample data set.seed(123) n_obs <- 100 n_pred <- 50 X_glm <- matrix(rnorm(n_obs * n_pred), n_obs, n_pred) # True coefficients (sparse) true_beta <- matrix(0, n_pred, 1) true_beta[1:10, 1] <- runif(10, -1, 1) # Response variable with noise y_glm <- X_glm %*% true_beta + rnorm(n_obs, sd = 0.5) # Fit glmnet (LASSO, alpha=1) # Typically use cv.glmnet to find lambda, but using a fixed one for simplicity glm_fit <- glmnet::glmnet(X_glm, y_glm, alpha = 1) # alpha=1 is LASSO # Choose a lambda (e.g., one near the end of the path) chosen_lambda <- glm_fit$lambda[length(glm_fit$lambda) * 0.8] # Get coefficients for this lambda beta_hat <- coef(glm_fit, s = chosen_lambda) print(paste("Number of non-zero coefficients:", sum(beta_hat != 0))) # Extract coefficients, excluding the intercept v_glm <- beta_hat[-1, 1, drop = FALSE] # Drop intercept, ensure it's a matrix dim(v_glm) # Should be n_pred x 1 ``` ## 2.2 Create the `projector` We define a `projector` using these coefficients. The projection `X %*% v` gives a single score per observation, representing the LASSO linear predictor. We also include centering and scaling as preprocessing, often recommended for `glmnet`. ```{r wrap_glmnet} # Define preprocessor # Define and fit preprocessor with training data preproc_glm <- fit(colscale(center(), type = "z"), X_glm) # Create the projector proj_glm <- projector( v = v_glm, preproc = preproc_glm, classes = "glmnet_projector" ) print(proj_glm) ``` ## 2.3 Project New Data We can now use this `projector` to calculate the LASSO linear predictor score for new data. ```{r project_glmnet} # Generate some new test data X_glm_test <- matrix(rnorm(20 * n_pred), 20, n_pred) # Project the test data # project() handles applying the centering/scaling from preproc_glm lasso_scores <- project(proj_glm, X_glm_test) head(round(lasso_scores, 3)) dim(lasso_scores) # Should be n_test x 1 (since v_glm has 1 column) # Compare with direct calculation using transform # Apply the same preprocessing used within project() X_glm_test_processed <- transform(preproc_glm, X_glm_test) # Calculate scores directly using the processed data and coefficients direct_scores <- X_glm_test_processed %*% v_glm head(round(direct_scores, 3)) # Check they are close all.equal(c(lasso_scores), c(direct_scores)) ```