---
title: 'Partial projection: working with incomplete feature sets'
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Partial projection: working with incomplete feature sets}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
params:
  family: red
css: albers.css
resource_files:
  - albers.css
  - albers.js
includes:
  in_header: |-
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 4
)
library(multivarious)
library(dplyr)
library(ggplot2)
```
# 1. Why partial projection?
Suppose you trained a dimensionality-reduction model (PCA, PLS, ...)
on p variables but, at prediction time, only some of them are available:
* one sensor is broken,
* a block of variables is too expensive to measure,
* you need a quick first pass while the "heavy" data arrive later.
You still want the latent scores in the same component space,
so downstream models, dashboards, alarms, ... keep running.
That's exactly what
```
partial_project(model, new_data_subset, colind = which.columns)
```
does: it takes `new_data_subset` (n × q, with q ≤ p) and projects it
into the latent space (n × k).
If the loading vectors are orthonormal this is a simple dot product;
otherwise a ridge-regularised least-squares solve is used.
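
The least-squares case can be sketched in a few lines of base R. This is a minimal illustration, assuming the scores are obtained by regressing the centred observed columns on the matching rows of the loading matrix with a small ridge penalty. The helper `ls_partial_scores()` and the default `lambda` are hypothetical and not part of multivarious, whose internals may differ.

```r
# Hypothetical helper (not part of multivarious): ridge least-squares scores
# from a subset of columns.
#   x_sub  : n x q matrix holding only the observed columns
#   v      : p x k loading matrix
#   center : length-p vector of training means
#   colind : positions of the q observed columns in the original data
#   lambda : ridge penalty (assumed value, chosen for illustration)
ls_partial_scores <- function(x_sub, v, center, colind, lambda = 1e-6) {
  v_sub <- v[colind, , drop = FALSE]
  xc    <- sweep(x_sub, 2, center[colind], "-")
  gram  <- crossprod(v_sub) + lambda * diag(ncol(v))
  # Solve (V_sub' V_sub + lambda I) S' = V_sub' Xc' for the scores S
  t(solve(gram, t(xc %*% v_sub)))
}

set.seed(1)
X  <- matrix(rnorm(100 * 8), 100, 8)
mu <- colMeans(X)
V  <- svd(sweep(X, 2, mu, "-"), nu = 0, nv = 3)$v

full     <- sweep(X, 2, mu, "-") %*% V                  # ordinary projection
all_cols <- ls_partial_scores(X, V, mu, 1:8, lambda = 0)
part     <- ls_partial_scores(X[, 1:6], V, mu, 1:6)     # columns 7-8 missing
```

With every column present and `lambda = 0`, the Gram matrix of orthonormal loadings is the identity, so `all_cols` coincides with `full`. Once rows of the loading matrix are dropped, the Gram matrix is no longer the identity and the solve does real work.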
---
# 2. Walk-through with a toy PCA
```{r data_partial_proj}
set.seed(1)
n <- 100
p <- 8
X <- matrix(rnorm(n * p), n, p)
# Fit a centred 3-component PCA via SVD:
# centre the data manually, then wrap the result in a bi_projector
Xc <- scale(X, center = TRUE, scale = FALSE)
svd_res <- svd(Xc, nu = 0, nv = 3)
# Fitted centring preprocessor, so new data are centred with the training means
preproc_fitted <- fit(center(), X)
pca <- bi_projector(
  v = svd_res$v,                        # p × 3 loadings
  s = Xc %*% svd_res$v,                 # n × 3 scores
  sdev = svd_res$d[1:3] / sqrt(n - 1),  # singular values scaled to standard deviations
  preproc = preproc_fitted
)
```
## 2.1 Normal projection (all variables)
```{r project_full}
scores_full <- project(pca, X) # n × 3
head(round(scores_full, 2))
```
## 2.2 Missing two variables ➜ partial projection
Suppose columns 7 and 8 are unavailable for a new batch.
```{r project_partial}
X_miss <- X[, 1:6] # keep only first 6 columns
col_subset <- 1:6 # their positions in the **original** X
scores_part <- partial_project(pca, X_miss, colind = col_subset)
# How close are the results?
plot_df <- tibble(
  full = scores_full[, 1],
  part = scores_part[, 1]
)
ggplot(plot_df, aes(full, part)) +
  geom_point() +
  geom_abline(col = "red") +
  coord_equal() +
  labs(title = "Component 1: full vs. partial projection") +
  theme_minimal()
```
Even with two variables missing, the ridge least-squares step recovers
latent scores that lie close to the 1:1 line; the remaining scatter
reflects the information carried by the two missing columns.
---
# 3. Caching the operation with a partial projector
If you expect many rows with the same subset of features, create a
specialised projector once and reuse it:
```{r partial_projector_cache}
# Build a projector restricted to columns 1-6 of the original data
pca_1to6 <- partial_projector(pca, 1:6) # keeps a reference + cache
# project 1000 new observations that only have the first 6 vars
new_batch <- matrix(rnorm(1000 * 6), 1000, 6)
scores_fast <- project(pca_1to6, new_batch)
dim(scores_fast) # 1000 × 3
```
Internally, `partial_projector()` stores the mapping
`v[1:6, ]` and a pre-computed inverse, so calls to `project()` are
as cheap as a matrix multiplication.
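
The caching idea can be sketched as a closure in base R. This is only an illustration under assumptions: `make_partial_projector()` is a hypothetical name, the ridge value is arbitrary, and the real `partial_projector()` object may store its cache differently.

```r
# Hypothetical closure (not the multivarious implementation): build the
# q x k projection operator once, then reuse it for every batch that has
# the same subset of columns.
make_partial_projector <- function(v, center, colind, lambda = 1e-6) {
  v_sub <- v[colind, , drop = FALSE]
  gram  <- crossprod(v_sub) + lambda * diag(ncol(v))
  op    <- v_sub %*% solve(gram)   # computed once, q x k
  mu    <- center[colind]
  # Each call only centres the batch and does one matrix product
  function(x_sub) sweep(x_sub, 2, mu, "-") %*% op
}

set.seed(1)
X  <- matrix(rnorm(100 * 8), 100, 8)
mu <- colMeans(X)
V  <- svd(sweep(X, 2, mu, "-"), nu = 0, nv = 3)$v

project_1to6 <- make_partial_projector(V, mu, 1:6)
new_batch    <- matrix(rnorm(1000 * 6), 1000, 6)
scores_fast  <- project_1to6(new_batch)
dim(scores_fast)  # 1000 x 3
```

The inverse is paid for once when the closure is created; afterwards the cost per batch is the centring plus a single (n × q) by (q × k) product.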
---
# 4. Block-wise convenience
For multiblock fits (created with `multiblock_projector()`),
`project_block()` provides a convenient wrapper around `partial_project()`:
```{r multiblock_example}
# Create a multiblock projector from our PCA
# Suppose columns 1-4 are "Block A" (block 1) and columns 5-8 are "Block B" (block 2)
block_indices <- list(1:4, 5:8)
mb <- multiblock_projector(
  v = pca$v,
  preproc = pca$preproc,
  block_indices = block_indices
)
# Now we can project using only Block 2's data (columns 5-8)
X_block2 <- X[, 5:8]
scores_block2 <- project_block(mb, X_block2, block = 2)
# Compare to full projection
head(round(cbind(full = scores_full[,1], block2 = scores_block2[,1]), 2))
```
This is equivalent to calling `partial_project(mb, X_block2, colind = 5:8)`
but reads more naturally when working with block structures.
---
# 5. Not only "missing data": regions-of-interest & nested designs
Partial projection is handy even when all measurements exist:
1. **Region of interest (ROI).**
In neuro-imaging you might have 50,000 voxels but care only about the
motor cortex. Projecting just those columns shows how a
participant scores within that anatomical region without
refitting the whole PCA/PLS.
2. **Nested / multi-subject studies.**
For multi-block PCA (e.g. "participant × sensor"), you can ask
"where would subject i lie if I looked at block B only?"
Simply supply that block to `project_block()`.
3. **Feature probes or "what-if" analysis.**
Engineers often ask "What is the latent position if I vary only
temperature and hold everything else blank?" Pass a matrix that
contains the chosen variables and zeros elsewhere.
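
The "what-if" probe in item 3 can be contrasted with a partial projection in base R. The sketch below assumes that "zeros elsewhere" means zero in centred units (the unmeasured variables sit at their training means) and uses a plain least-squares solve with no ridge penalty. The variable names are illustrative and nothing here calls multivarious itself.

```r
set.seed(1)
X  <- matrix(rnorm(100 * 8), 100, 8)
Xc <- scale(X, center = TRUE, scale = FALSE)
V  <- svd(Xc, nu = 0, nv = 3)$v

probe <- 1:4                       # the variables we choose to vary
V_sub <- V[probe, , drop = FALSE]

# (a) Zero-fill: keep the probe columns, set every other centred column to 0,
#     then project with the full loading matrix.
Xc_zero <- Xc
Xc_zero[, -probe] <- 0
scores_zero <- Xc_zero %*% V       # identical to Xc[, probe] %*% V_sub

# (b) Least-squares fit that uses the probe columns only and treats the
#     remaining variables as unknown rather than as zero.
scores_ls <- Xc[, probe] %*% V_sub %*% solve(crossprod(V_sub))

# Per-observation difference in score length between the two approaches
summary(sqrt(rowSums(scores_ls^2)) - sqrt(rowSums(scores_zero^2)))
```

Zero-filling asserts that the other variables are known and equal to their means, so its score vectors are never longer than the least-squares ones. Which reading is appropriate depends on whether the unvaried variables are truly held fixed or simply unobserved.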
## 5.1 Mini-demo: projecting an ROI
Assume columns 1–5 of `X` form our ROI (a deliberately small stand-in, for brevity, for a realistic set of voxels).
```{r roi_project}
roi_cols <- 1:5 # pretend these are the ROI voxels
X_roi <- X[, roi_cols] # same matrix from Section 2
roi_scores <- partial_project(pca, X_roi, colind = roi_cols)
# Compare component 1 from full vs ROI
df_roi <- tibble(
  full = scores_full[, 1],
  roi = roi_scores[, 1]
)
ggplot(df_roi, aes(full, roi)) +
  geom_point(alpha = .6) +
  geom_abline(col = "red") +
  coord_equal() +
  labs(title = "Component 1 scores: full data vs ROI") +
  theme_minimal()
```
**Interpretation:**
If the two sets of scores align tightly, the ROI variables are
driving this component. A strong deviation would reveal that other
variables dominate the global pattern.
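
One way to put a number on "align tightly" is the per-component correlation between full and ROI scores. The base R sketch below is an assumption-laden stand-in: it recomputes both sets of scores with an unregularised least-squares solve instead of calling `partial_project()`, and the object names are illustrative.

```r
set.seed(1)
X  <- matrix(rnorm(100 * 8), 100, 8)
Xc <- scale(X, center = TRUE, scale = FALSE)
V  <- svd(Xc, nu = 0, nv = 3)$v

roi   <- 1:5
V_roi <- V[roi, , drop = FALSE]

full_scores <- Xc %*% V
roi_scores  <- Xc[, roi] %*% V_roi %*% solve(crossprod(V_roi))

# Correlation between matching components (diagonal of the 3 x 3 matrix)
agreement <- diag(cor(full_scores, roi_scores))
round(agreement, 2)
```

Values near 1 indicate that the ROI columns reproduce a component almost on their own; lower values point to influence from columns outside the ROI.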
## 5.2 Single-subject positioning in a multiblock design
Using the multiblock projector from Section 4, we can see how individual
observations score when viewed through just one block:
```{r block_single_subject}
# Get scores for observation 1 using only Block 1 variables (columns 1-4)
subject1_block1 <- project_block(mb, X[1, 1:4, drop = FALSE], block = 1)
# Get scores for the same observation using only Block 2 variables (columns 5-8)
subject1_block2 <- project_block(mb, X[1, 5:8, drop = FALSE], block = 2)
# Compare: do both blocks tell the same story about this observation?
cat("Subject 1 scores from Block 1:", round(subject1_block1, 2), "\n")
cat("Subject 1 scores from Block 2:", round(subject1_block2, 2), "\n")
cat("Subject 1 scores from full data:", round(scores_full[1,], 2), "\n")
```
This lets you assess whether an observation's position in the latent space
is consistent across blocks, or whether one block tells a different story.
---
# 6. Cheat-sheet: why you might call `partial_project()`
| Scenario                              | What you pass                  | Typical call                                   |
|---------------------------------------|--------------------------------|------------------------------------------------|
| Sensor outage / missing features      | matrix with observed cols only | `partial_project(mod, X_obs, colind = idx)`    |
| Region of interest (ROI)              | ROI columns of the data        | `partial_project(mod, X[, ROI], colind = ROI)` |
| Block-specific latent scores          | full block matrix              | `project_block(mb, blkData, block = b)`        |
| "What-if": vary a single variable set | varied cols + zeros elsewhere  | `partial_project()` with matching `colind`     |
The component space stays identical throughout, so downstream
analytics, classifiers, or control charts continue to work with no
re-training.
---
# Session info
```{r session-info-extra}
sessionInfo()
```