---
title: "Pre-processing pipelines in multiblock"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Pre-processing pipelines in multiblock}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
params:
  family: red
css: albers.css
resource_files:
  - albers.css
  - albers.js
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", fig.width = 6, fig.height = 4)
library(multivarious)
library(dplyr)   # needed for %>% and tibble manipulation
library(tibble)
library(ggplot2)
```

# 1. Why a pipeline at all?

Code that mutates data in place (e.g. `scale(X)`) is convenient in a script but dangerous inside reusable functions:

* **Data-leak avoidance**: Fitted means/SDs live inside the pre-processor object, calculated only once (typically on training data).
* **Reversibility**: `inverse_transform()` gives you proper back-transforms (handy for reconstruction error or publication plots).
* **Composability**: You can nest simple steps together (e.g., `colscale(center())`); see the sketch at the end of this section.
* **Partial input**: The same pipeline can process just the columns you pass (`transform(..., colind = 1:3)`), perfect for region-of-interest or block workflows.

The grammar is tiny:

| Verb            | Role                          | Typical Call                  |
|-----------------|-------------------------------|-------------------------------|
| `pass()`        | do nothing (placeholder)      | `fit(pass(), X)`              |
| `center()`      | subtract column means         | `fit(center(), X)`            |
| `standardize()` | center and scale to unit SD   | `fit(standardize(), X)`       |
| `colscale()`    | user-supplied weights/scaling | `fit(colscale(type="z"), X)`  |
| `...`           | (write your own)              | any function returning a node |

The `fit()` verb is the bridge between defining your preprocessing steps (the *recipe*) and actually applying them. You call `fit()` on your recipe, providing your training dataset. `fit()` calculates and stores the necessary parameters (e.g., column means, standard deviations) from this data, returning a *fitted pre-processor* object.

Once you have a fitted pre-processor object, it exposes three key methods:

| Method                     | Role                                          | Typical Use Case           |
|----------------------------|-----------------------------------------------|----------------------------|
| `fit_transform(prep, X)`   | fits parameters *and* transforms `X`          | training set (convenience) |
| `transform(pp, Xnew)`      | applies stored parameters to new data         | test/new data              |
| `inverse_transform(pp, Y)` | back-transforms data using stored parameters  | interpreting results       |
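To make the train/test distinction in this table concrete, here is a minimal sketch on toy data (`X_train`, `X_test`, and `pp_tt` are invented names for illustration):

```{r train_test_sketch}
# Toy data standing in for a real train/test split
set.seed(42)
X_train <- matrix(rnorm(20 * 4), 20, 4)
X_test  <- matrix(rnorm(10 * 4), 10, 4)

# fit() estimates means/SDs from the training data only
pp_tt <- fit(standardize(), X_train)

# transform() re-uses the stored *training* parameters on new data,
# so nothing about X_test leaks into the fitted pre-processor
X_test_p <- transform(pp_tt, X_test)

# Test-set SDs are close to, but not exactly, 1: the columns were
# scaled by the training SDs, not their own
round(apply(X_test_p, 2, sd), 2)
```

On the training set itself, `fit_transform()` bundles the first two steps into a single call, as listed in the table above.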
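The composability bullet above mentions the nested form `colscale(center())` without showing it in action. Below is an unevaluated sketch; the `type = "z"` option is borrowed from the grammar table, and the behaviour of the nested recipe is an assumption rather than a documented guarantee:

```{r nested_recipe_sketch, eval=FALSE}
# Sketch only (not evaluated): a nested recipe that centers first,
# then scales columns. Assumes colscale() accepts an inner node and
# the type = "z" option from the grammar table above.
X_demo <- matrix(rnorm(10 * 4), 10, 4)

pp_nested <- fit(colscale(center(), type = "z"), X_demo)
X_nested  <- transform(pp_nested, X_demo)

# inverse_transform() should undo both steps in reverse order
all.equal(inverse_transform(pp_nested, X_nested), X_demo)
```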
# 2. The 60-second tour

## 2.1 No-op and sanity check

```{r setup_data_preproc}
set.seed(0)
X <- matrix(rnorm(10*4), 10, 4)

pp_pass <- fit(pass(), X)         # == do nothing
Xp_pass <- transform(pp_pass, X)  # applies nothing, just copies X
all.equal(Xp_pass, X)             # TRUE
```

## 2.2 Center → standardize

```{r standardize_example}
# Fit the preprocessor (calculates means & SDs from X) and transform
pp_std <- fit(standardize(), X)
Xs <- transform(pp_std, X)

# Check results
all(abs(colMeans(Xs)) < 1e-12)  # TRUE: data is centered
round(apply(Xs, 2, sd), 6)      # ~1: data is scaled

# Check back-transform
all.equal(inverse_transform(pp_std, Xs), X)  # TRUE
```

## 2.3 Partial input (region-of-interest)

Imagine a sensor fails and you only observe columns 2 and 4:

```{r partial_transform}
X_cols24 <- X[, c(2, 4), drop = FALSE]  # keep as matrix

# Apply the *already fitted* standardizer using only columns 2 & 4
Xs_cols24 <- transform(pp_std, X_cols24, colind = c(2, 4))

# Compare original columns 2, 4 with their transformed versions
head(cbind(X_cols24, Xs_cols24))

# Back-transform works too
X_rev_cols24 <- inverse_transform(pp_std, Xs_cols24, colind = c(2, 4))
all.equal(X_rev_cols24, X_cols24)  # TRUE
```

# 3. Composing preprocessing steps

Because preprocessing steps nest, you can build pipelines by composing them; the nested recipe `colscale(center())` sketched in Section 1 is one example. Here we use `standardize()`, which performs the same sequence, centering followed by unit-variance scaling, as a single step:

```{r pipe_example}
# Define and fit a pipeline: center, then scale to unit variance
pp_pipe <- fit(standardize(), X)

# Apply the pipeline
Xp_pipe <- transform(pp_pipe, X)
```

## 3.1 Quick visual

```{r plot_pipeline}
# Compare the first column before and after the pipeline
df_pipe <- tibble(raw = X[, 1], processed = Xp_pipe[, 1])

ggplot(df_pipe) +
  geom_density(aes(raw), colour = "red", linewidth = 1) +
  geom_density(aes(processed), colour = "blue", linewidth = 1) +
  ggtitle("Column 1 Density: Before (red) and After (blue) Pipeline") +
  theme_minimal()
```

# 4. Block-wise concatenation

Large multiblock models often want different preprocessing per block. `concat_pre_processors()` glues several *already fitted* pipelines into one wide transformer that understands global column indices.

```{r concat_example}
# Two fake blocks with distinct scales
X1 <- matrix(rnorm(10*5, 10, 5), 10, 5)  # block 1: high mean
X2 <- matrix(rnorm(10*7, 2, 7), 10, 7)   # block 2: low mean

# Fit separate preprocessors for each block
p1 <- fit(center(), X1)
p2 <- fit(standardize(), X2)

# Transform each block
X1p <- transform(p1, X1)
X2p <- transform(p2, X2)

# Concatenate the *fitted* preprocessors
block_indices_list <- list(1:5, 6:12)
pp_concat <- concat_pre_processors(
  list(p1, p2),
  block_indices = block_indices_list
)

# Apply the concatenated preprocessor to the combined data
X_combined <- cbind(X1, X2)
X_combined_p <- transform(pp_concat, X_combined)

# Check means (block 1 only centered, block 2 standardized)
round(colMeans(X_combined_p), 2)

# Need only block 1 processed later? Use colind with global indices
X1_later_p <- transform(pp_concat, X1, colind = block_indices_list[[1]])
all.equal(X1_later_p, X1p)  # TRUE

# Need block 2 processed? Same idea, with block 2's global indices
X2_later_p <- transform(pp_concat, X2, colind = block_indices_list[[2]])
all.equal(X2_later_p, X2p)  # TRUE
```

### Check reversibility of concatenated pipeline

```{r concat_reversibility}
back_combined <- inverse_transform(pp_concat, X_combined_p)

# Compare first few rows/cols of original vs round-trip
knitr::kable(
  head(cbind(orig = X_combined[, 1:6], recon = back_combined[, 1:6]), 3),
  digits = 2,
  caption = "First 3 rows, columns 1-6: Original vs Reconstructed"
)

all.equal(X_combined, back_combined)  # TRUE
```

# 5. Inside the weeds (for authors & power users)

| Helper | Purpose |
|--------|---------|
| `fresh(pp)` | Return the un-fitted recipe skeleton. **Crucial for tasks like cross-validation (CV)**: it lets you re-`fit()` the pipeline using *only* the current training fold's data, preventing data leakage from other folds or the test set. See the sketch below. |
| `concat_pre_processors()` | Build one big transformer out of already-fitted pieces. |
| `pass()` vs `fit(pass(), X)` | `pass()` is a recipe; `fit(pass(), X)` is a fitted identity transformer. |
| caching | Fitted preprocessor objects store parameters (means, SDs) for fast re-application. |

You rarely need to interact with these helpers directly; they exist so model authors (e.g. of new PCA flavours) can avoid boilerplate.
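To make the `fresh()` row concrete, here is a minimal sketch of a leakage-free two-fold loop over the `X` from Section 2. The fold bookkeeping is invented for illustration; only `fresh()`, `fit()`, and `transform()` come from the package:

```{r fresh_cv_sketch}
# Start from any fitted pipeline, e.g. a standardizer fitted earlier
pp_template <- fit(standardize(), X)

# Two toy folds over the rows of X
folds <- split(seq_len(nrow(X)), rep(1:2, length.out = nrow(X)))

for (test_idx in folds) {
  train_idx <- setdiff(seq_len(nrow(X)), test_idx)

  # fresh() discards the stored means/SDs, leaving the bare recipe,
  # which is then re-fitted on the training fold only
  pp_fold <- fit(fresh(pp_template), X[train_idx, , drop = FALSE])

  # The test fold is transformed with training-fold parameters only
  X_test_fold <- transform(pp_fold, X[test_idx, , drop = FALSE])
}
```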
# 6. Key take-aways

* **Write once**: Define a preprocessing recipe (e.g., `colscale(center())`) and reuse it safely across CV folds by calling `fit()` on each fold's training data.
* **No data leakage**: Parameters live inside the fitted preprocessor object, calculated only from training data.
* **Composable & reversible**: Nest preprocessing steps, extract the original recipe with `fresh()`, and back-transform whenever you need results in original units using `inverse_transform()`.
* **Block-aware**: The same mechanism powers multiblock PCA, CCA, ComDim, and more.

Happy projecting!

---

# Session info

```{r session_info_preproc}
sessionInfo()
```