---
title: Pre-processing pipelines in multiblock
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Pre-processing pipelines in multiblock}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", fig.width=6, fig.height=4)
library(multivarious)
library(dplyr)   # general data manipulation
library(tibble)
library(ggplot2)
```
# 1. Why a pipeline at all?
Code that mutates data in place (e.g. `scale(X)`) is convenient in a script
but dangerous inside reusable functions: the statistics are silently recomputed
on whatever data you pass, test set included. A fitted pre-processor object
avoids this and adds several guarantees (see the sketch after this list):
* **Data-leak avoidance**: Fitted means/SDs live inside the pre-processor object, calculated only once (typically on training data).
* **Reversibility**: `inverse_transform()` gives you proper back-transforms (handy for reconstruction error or publication plots).
* **Composability**: You can nest simple steps together (e.g., `colscale(center())`).
* **Partial input**: The same pipeline can process just the columns you pass (`transform(..., colind = 1:3)`), perfect for region-of-interest or block workflows.
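To make the first point concrete, here is a minimal sketch of the leakage that in-place scaling invites; the train/test row split is hypothetical, chosen only for illustration:
```{r leakage_sketch}
Xd <- matrix(rnorm(10 * 4), 10, 4)  # toy data: rows 1-7 train, rows 8-10 test

# Leaky: scale() computes means/SDs from ALL rows, test set included
X_leaky <- scale(Xd)

# Leak-free: parameters come from the training rows only
pp_tr  <- fit(standardize(), Xd[1:7, ])
Xtrain <- transform(pp_tr, Xd[1:7, ])
Xtest  <- transform(pp_tr, Xd[8:10, ])  # re-uses the training means/SDs
```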
The grammar is tiny:
| Verb | Role | Typical Call |
|---------------|--------------------------------|------------------------------------|
| `pass()` | do nothing (placeholder) | `fit(pass(), X)` |
| `center()` | subtract column means | `fit(center(), X)` |
| `standardize()` | center and scale to unit SD | `fit(standardize(), X)` |
| `colscale()` | user-supplied weights/scaling | `fit(colscale(type="z"), X)` |
| `...` | (write your own) | any function returning a node |
The `fit()` verb is the bridge between defining your preprocessing steps (the *recipe*) and actually applying them. You call `fit()` on your recipe, providing your training dataset. `fit()` calculates and stores the necessary parameters (e.g., column means, standard deviations) from this data, returning a *fitted pre-processor* object.
Once fitted, three verbs cover day-to-day use (`pp` denotes a fitted pre-processor, `prep` an un-fitted recipe):
| Method | Role | Typical Use Case |
|----------------------------|---------------------------------------------------|----------------------------|
| `fit_transform(prep, X)` | fits parameters *and* transforms `X` in one call | Training set (convenience) |
| `transform(pp, Xnew)` | applies stored parameters to new data | Test/new data |
| `inverse_transform(pp, Y)` | back-transforms `Y` using stored parameters | Interpreting results |
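For example, `fit_transform()` collapses the fit-then-transform sequence into a single call on training data. A sketch, assuming a numeric matrix `X` like the one created in the next section and the `fit_transform(prep, X)` signature from the table above:
```{r fit_transform_sketch, eval=FALSE}
# Two-step: fit the recipe, then transform the same data
pp_std <- fit(standardize(), X)
Xs_two <- transform(pp_std, X)
# One-step convenience
Xs_one <- fit_transform(standardize(), X)
all.equal(Xs_two, Xs_one)  # expected TRUE
```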
# 2. The 60-second tour
## 2.1 No-op and sanity check
```{r setup_data_preproc}
set.seed(0)
X <- matrix(rnorm(10*4), 10, 4)
pp_pass <- fit(pass(), X)        # fitted identity transformer
Xp_pass <- transform(pp_pass, X) # returns X unchanged
all.equal(Xp_pass, X)            # TRUE
```
## 2.2 Center → standardize
```{r standardize_example}
# Fit the preprocessor (calculates means & SDs from X) and transform
pp_std <- fit(standardize(), X)
Xs <- transform(pp_std, X)
# Check results
all(abs(colMeans(Xs)) < 1e-12) # TRUE: data is centered
round(apply(Xs, 2, sd), 6) # ~1: data is scaled
# Check back-transform
all.equal(inverse_transform(pp_std, Xs), X) # TRUE
```
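Under the hood this is just column-wise centering and scaling; a quick base-R cross-check (a sketch assuming `standardize()` uses the sample SD, consistent with the unit-SD check above):
```{r standardize_manual_check}
# Reproduce standardize() with sweep(): subtract means, divide by SDs
Xs_manual <- sweep(sweep(X, 2, colMeans(X), "-"), 2, apply(X, 2, sd), "/")
all.equal(Xs, Xs_manual)  # expected TRUE
```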
## 2.3 Partial input (region-of-interest)
Imagine a sensor fails and you only observe columns 2 and 4:
```{r partial_transform}
X_cols24 <- X[, c(2,4), drop=FALSE] # Keep as matrix
# Apply the *already fitted* standardizer using only columns 2 & 4
Xs_cols24 <- transform(pp_std, X_cols24, colind = c(2,4))
# Compare original columns 2, 4 with their transformed versions
head(cbind(X_cols24, Xs_cols24))
# Back-transform works too
X_rev_cols24 <- inverse_transform(pp_std, Xs_cols24, colind = c(2,4))
all.equal(X_rev_cols24, X_cols24) # TRUE
```
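Because the stored parameters are per-column, transforming a subset via `colind` is equivalent to slicing the full transform:
```{r partial_equals_slice}
all.equal(Xs_cols24, Xs[, c(2, 4)])  # expected TRUE: same parameters, fewer columns
```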
# 3. Composing preprocessing steps
Because preprocessing steps nest, you can build pipelines by composing them:
```{r pipe_example}
# Compose by nesting: center first, then scale each column by its SD
# (type = "z" requests SD scaling, as in the grammar table above)
pp_pipe <- fit(colscale(center(), type = "z"), X)
# Apply the composed pipeline
Xp_pipe <- transform(pp_pipe, X)
```
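Since centering followed by SD scaling is exactly what `standardize()` does in one step, the composed result should match `Xs` from section 2.2:
```{r pipe_vs_standardize}
all.equal(Xp_pipe, Xs)  # expected TRUE: composition reproduces standardize()
```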
## 3.1 Quick visual
```{r plot_pipeline}
# Compare first column before and after pipeline
df_pipe <- tibble(raw = X[,1], processed = Xp_pipe[,1])
ggplot(df_pipe) +
geom_density(aes(raw), colour = "red", linewidth = 1) +
geom_density(aes(processed), colour = "blue", linewidth = 1) +
ggtitle("Column 1 Density: Before (red) and After (blue) Pipeline") +
theme_minimal()
```
# 4. Block-wise concatenation
Large multiblock models often want different preprocessing per block.
`concat_pre_processors()` glues several *already fitted* pipelines into one
wide transformer that understands global column indices.
```{r concat_example}
# Two fake blocks with distinct scales
X1 <- matrix(rnorm(10*5, mean = 10, sd = 5), 10, 5) # block 1: high mean
X2 <- matrix(rnorm(10*7, mean = 2,  sd = 7), 10, 7) # block 2: low mean
# Fit separate preprocessors for each block
p1 <- fit(center(), X1)
p2 <- fit(standardize(), X2)
# Transform each block
X1p <- transform(p1, X1)
X2p <- transform(p2, X2)
# Concatenate the *fitted* preprocessors
block_indices_list <- list(1:5, 6:12)
pp_concat <- concat_pre_processors(
list(p1, p2),
block_indices = block_indices_list
)
# Apply the concatenated preprocessor to the combined data
X_combined <- cbind(X1, X2)
X_combined_p <- transform(pp_concat, X_combined)
# Check means (block 1 only centered, block 2 standardized)
round(colMeans(X_combined_p), 2)
# Need only block 1 processed later? Use colind with global indices
X1_later_p <- transform(pp_concat, X1, colind = block_indices_list[[1]])
all.equal(X1_later_p, X1p) # TRUE
# Need block 2 processed?
X2_later_p <- transform(pp_concat, X2, colind = block_indices_list[[2]])
all.equal(X2_later_p, X2p) # TRUE
```
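As an aside, `pass()` slots naturally into this block-wise design when one block should be left untouched; a sketch reusing the objects above:
```{r concat_with_pass, eval=FALSE}
# Block 1 stays raw (identity), block 2 keeps its fitted standardizer
pp_mixed <- concat_pre_processors(
  list(fit(pass(), X1), p2),
  block_indices = block_indices_list
)
head(transform(pp_mixed, X_combined))
```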
### Check reversibility of concatenated pipeline
```{r concat_reversibility}
back_combined <- inverse_transform(pp_concat, X_combined_p)
# Compare first few rows/cols of original vs round-trip
knitr::kable(
  head(cbind(X_combined[, 1:6], back_combined[, 1:6]), 3),
  digits = 2,
  col.names = c(paste0("orig_", 1:6), paste0("recon_", 1:6)),
  caption = "First 3 rows, columns 1-6: Original vs Reconstructed"
)
all.equal(X_combined, back_combined) # TRUE
```
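The `colind` mechanism from section 2.3 carries over here as well; a sketch, assuming `inverse_transform()` honours global `colind` indices for concatenated preprocessors the same way `transform()` does:
```{r concat_partial_inverse, eval=FALSE}
# Round-trip block 2 alone, addressing it by its global column indices
X2_rev <- inverse_transform(pp_concat, X2_later_p,
                            colind = block_indices_list[[2]])
all.equal(X2_rev, X2)  # expected TRUE
```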
# 5. Inside the weeds (for authors & power users)
| Helper | Purpose |
|---------------------------|----------------------------------------------------------------------|
| `fresh(pp)` | return the un-fitted recipe skeleton. **Crucial for tasks like cross-validation (CV)**, as it allows you to re-`fit()` the pipeline using *only* the current training fold's data, preventing data leakage from other folds or the test set. |
| `concat_pre_processors()` | build one big transformer out of already-fitted pieces. |
| `pass()` vs `fit(pass(), X)` | `pass()` is a recipe; `fit(pass(), X)` is a fitted identity transformer. |
| caching | Fitted preprocessor objects store parameters (means, SDs) for fast re-application. |
You rarely need to interact with these helpers directly; they exist so
model authors (e.g. of new PCA flavours) can avoid boilerplate.
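For instance, `fresh()` is what keeps a cross-validation loop leak-free; a minimal sketch (the two-fold split is hypothetical, and we assume `fresh()` returns a recipe that `fit()` accepts, as the table states):
```{r fresh_cv_sketch, eval=FALSE}
folds <- split(seq_len(nrow(X)), rep(1:2, length.out = nrow(X)))
for (test_idx in folds) {
  recipe  <- fresh(pp_std)                             # un-fitted recipe skeleton
  pp_fold <- fit(recipe, X[-test_idx, , drop = FALSE]) # fit on training fold only
  X_test  <- transform(pp_fold, X[test_idx, , drop = FALSE])
  # ... fit/evaluate a model on the processed folds ...
}
```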
# 6. Key take-aways
* **Write once**: Define a preprocessing recipe (e.g., `colscale(center())`) and reuse it safely across CV folds using `fit()` on each fold's training data.
* **No data leakage**: Parameters live inside the fitted preprocessor object, calculated only from training data.
* **Composable & reversible**: Nest preprocessing steps, extract the original recipe with `fresh()`, and back-transform whenever you need results in original units using `inverse_transform()`.
* **Block-aware**: The same mechanism powers multiblock PCA, CCA, ComDim…
Happy projecting!
---
# Session info
```{r session_info_preproc}
sessionInfo()
```