Ready to unlock the full potential of glyrepr? This vignette is for those who want to peek under the hood and master the art of efficient glycan computation. If you’re writing custom functions for glycan analysis or building the next great glycomics tool, you’re in the right place!
Fair warning: This guide assumes you’re comfortable with R programming and graph theory concepts. If you’re just getting started, check out our “Getting Started with glyrepr” vignette first.
library(glyrepr)
Before we dive into the smap functions, let’s understand why they exist and why they’re game-changing for glycan analysis.
Working with glycan structures means working with graphs, and graph operations are computationally expensive. When you’re analyzing thousands of glycans from a large-scale study, this becomes a real bottleneck.
glyrepr implements a clever optimization called unique structure storage. Instead of storing thousands of identical graphs, it stores only the unique ones and keeps track of which original positions they belong to.
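The idea can be sketched in base R, independently of how glyrepr implements it internally. The snippet below is a minimal illustration that assumes plain character strings as stand-ins for graphs; the names `keys`, `unique_keys`, and `codes` are hypothetical and not part of the package.

```r
# Hypothetical stand-in: short strings play the role of glycan graphs.
keys <- rep(c("core_a", "core_b", "core_c"), times = 4)  # 12 elements, 3 unique

# Store each distinct value once ...
unique_keys <- unique(keys)

# ... plus one integer code per element pointing into the unique store.
codes <- match(keys, unique_keys)

# The full vector can be rebuilt on demand from the two pieces.
restored <- unique_keys[codes]

length(unique_keys)        # number of values actually stored
identical(restored, keys)  # TRUE when the round trip is lossless
```

The integer codes are cheap to store, so the memory cost is dominated by the unique values rather than by the length of the vector.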
Let’s see this in action:
# Our test data: some common glycan structures
iupacs <- c(
"Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-", # N-glycan core
"Gal(b1-3)GalNAc(a1-", # O-glycan core 1
"Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-", # O-glycan core 2
"Man(a1-3)[Man(a1-6)]Man(a1-3)[Man(a1-6)]Man(a1-", # Branched mannose
"GlcNAc6Ac(b1-4)Glc3Me(a1-" # With decorations
)
struc <- as_glycan_structure(iupacs)
# Now let's create a realistic dataset with lots of repetition
large_struc <- rep(struc, 1000) # 5,000 total structures
large_struc
#> <glycan_structure[5000]>
#> [1] Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-
#> [2] Gal(b1-3)GalNAc(a1-
#> [3] Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-
#> [4] Man(a1-3)[Man(a1-6)]Man(a1-3)[Man(a1-6)]Man(a1-
#> [5] GlcNAc6Ac(b1-4)Glc3Me(a1-
#> [6] Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-
#> [7] Gal(b1-3)GalNAc(a1-
#> [8] Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-
#> [9] Man(a1-3)[Man(a1-6)]Man(a1-3)[Man(a1-6)]Man(a1-
#> [10] GlcNAc6Ac(b1-4)Glc3Me(a1-
#> ... (4990 more not shown)
#> # Unique structures: 5
Notice that magical “# Unique structures: 5”? That’s your performance booster right there!
Let’s verify this optimization is real:
# Only 5 unique graphs are stored internally
length(attr(large_struc, "structures"))
#> [1] 5
# But we have 5,000 total elements
length(large_struc)
#> [1] 5000
library(lobstr)
obj_sizes(struc, large_struc)
#> * 14.33 kB
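#> * 80.72 kB
The 5,000-element vector takes only about 80 kB. Storing 5,000 independent graphs would need roughly 1,000 × 14.33 kB ≈ 14 MB, so deduplication cuts memory use by roughly 180x. But the real magic happens with computation speed…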
smap Universe 🌌
Now here’s the problem: if you try to use regular lapply() or purrr::map() functions on glycan structures, you’ll hit a wall:
# This won't work and will throw an error
tryCatch(
purrr::map_int(large_struc, ~ igraph::vcount(.x)),
error = function(e) cat("💥 Error:", rlang::cnd_message(e))
)
#> 💥 Error: ℹ In index: 1.
#> Caused by error in `ensure_igraph()`:
#> ! Must provide a graph object (provided wrong object type).
Why does this fail? Because purrr functions don’t understand the internal structure optimization of glycan_structure objects.
smap Family to the Rescue!
The smap functions (think “structure map”) are drop-in replacements for purrr functions that are glycan-aware. They understand the unique structure optimization and work directly with the underlying graph objects.
# This works beautifully!
vertex_counts <- smap_int(large_struc, ~ igraph::vcount(.x))
vertex_counts[1:10]
#> [1] 5 2 3 5 2 5 2 3 5 2
The “s” stands for “structure”: these functions operate on the underlying igraph objects that represent your glycan structures.
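One way to picture what a structure-aware map has to do: apply the function once per unique structure, then expand the results back to the original positions. The base-R sketch below assumes this compute-then-broadcast strategy. `dedup_map_int()` is a hypothetical helper written for illustration, not a glyrepr function, and strings stand in for graphs.

```r
# Hypothetical helper: map over unique values only, then broadcast.
dedup_map_int <- function(x, f) {
  uniq <- unique(x)
  # One call to `f` per unique value.
  uniq_result <- vapply(uniq, f, integer(1), USE.NAMES = FALSE)
  # Expand back to the original length and order.
  uniq_result[match(x, uniq)]
}

# Stand-in data: strings instead of igraph objects.
x <- rep(c("abc", "de", "fghi"), times = 1000)
lengths_fast <- dedup_map_int(x, nchar)

# Same answer as mapping over every element individually.
lengths_naive <- vapply(x, nchar, integer(1), USE.NAMES = FALSE)
identical(lengths_fast, lengths_naive)
```

The result has the same length and order as the input, so the caller never needs to know that only three values were actually processed.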
smap Toolkit 🛠️
The smap family provides glycan-aware equivalents for virtually all purrr functions:
| purrr | smap | purrr | smap |
|---|---|---|---|
| map() | smap() | map2() | smap2() |
| map_lgl() | smap_lgl() | map2_lgl() | smap2_lgl() |
| map_int() | smap_int() | map2_int() | smap2_int() |
| map_dbl() | smap_dbl() | map2_dbl() | smap2_dbl() |
| map_chr() | smap_chr() | map2_chr() | smap2_chr() |
| some() | ssome() | pmap() | spmap() |
| every() | severy() | pmap_*() | spmap_*() |
| none() | snone() | imap() | simap() |
| | | imap_*() | simap_*() |
Simple rule: prefix the purrr name with “s”. map becomes smap, pmap becomes spmap, imap becomes simap, and some, every, and none become ssome, severy, and snone. Everything else works exactly like purrr!
Count vertices in each structure:
vertex_counts <- smap_int(large_struc, igraph::vcount)
summary(vertex_counts)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 2.0 2.0 3.0 3.4 5.0 5.0
Find structures with more than 4 vertices:
has_many_vertices <- smap_lgl(large_struc, ~ igraph::vcount(.x) > 4)
sum(has_many_vertices)
#> [1] 2000
Get the degree sequence of each structure:
degree_sequences <- smap(large_struc, ~ igraph::degree(.x))
degree_sequences[1:3] # Show first 3
#> [[1]]
#> 1 2 3 4 5
#> 1 1 3 2 1
#>
#> [[2]]
#> 1 2
#> 1 1
#>
#> [[3]]
#> 1 2 3
#> 1 1 2
Check if any structure has isolated vertices:
ssome(large_struc, ~ any(igraph::degree(.x) == 0))
#> [1] FALSE
Verify all structures are connected:
severy(large_struc, ~ igraph::is_connected(.x))
#> [1] TRUE
smap()
Quick examples of the extended family:
# smap2: Apply function with additional parameters
thresholds <- c(3, 4, 5)
large_enough <- smap2_lgl(struc[1:3], thresholds, function(g, threshold) {
igraph::vcount(g) >= threshold
})
large_enough
#> [1] TRUE FALSE FALSE
# simap: Include position information
indexed_report <- simap_chr(large_struc[1:3], function(g, i) {
paste0("#", i, ": ", igraph::vcount(g), " vertices")
})
indexed_report
#> [1] "#1: 5 vertices" "#2: 2 vertices" "#3: 3 vertices"
⚠️ Performance Warning: simap functions don’t benefit from the unique structure optimization! Since each element has a different index, the combination of (structure, index) is always unique, breaking the deduplication that makes other smap functions fast. Use simap only when you truly need position information.
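The warning follows from simple counting, which the base-R sketch below makes explicit. It assumes strings as stand-ins for glycan graphs; the variable names are hypothetical and nothing here calls glyrepr.

```r
# Stand-in data: strings instead of glycan graphs.
x <- rep(c("core_a", "core_b", "core_c"), times = 100)  # 300 elements

# Distinct inputs when the function depends on the structure alone.
n_unique_structures <- length(unique(x))

# Distinct inputs when the function also receives the position.
pairs <- data.frame(structure = x, index = seq_along(x))
n_unique_pairs <- nrow(unique(pairs))

n_unique_structures  # 3
n_unique_pairs       # 300
```

With the index included, there are as many distinct inputs as there are elements, so there is nothing left to deduplicate.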
The beauty of smap functions lies in automatic deduplication:
# Create a large dataset with high redundancy
huge_struc <- rep(struc, 5000) # 25,000 structures, only 5 unique
cat("Dataset size:", length(huge_struc), "structures\n")
#> Dataset size: 25000 structures
cat("Unique structures:", length(attr(huge_struc, "structures")), "\n")
#> Unique structures: 5
cat("Redundancy factor:", length(huge_struc) / length(attr(huge_struc, "structures")), "x\n")
#> Redundancy factor: 5000 x
library(tictoc)
# Optimized approach: smap only processes 5 unique structures
tic("smap_int (optimized)")
vertex_counts_optimized <- smap_int(huge_struc, igraph::vcount)
toc()
#> smap_int (optimized): 0.001 sec elapsed
# Naive approach: extract all graphs and process each one
tic("Naive approach (all graphs)")
all_graphs <- get_structure_graphs(huge_struc) # Extracts all 25,000 graphs
vertex_counts_naive <- purrr::map_int(all_graphs, igraph::vcount)
toc()
#> Naive approach (all graphs): 0.089 sec elapsed
# Verify results are equivalent (though data types may differ)
all.equal(vertex_counts_optimized, vertex_counts_naive)
#> [1] TRUE
The higher the redundancy, the bigger the performance gain! In the benchmark above, smap_int() finished in 0.001 s against 0.089 s for the naive approach. In real glycoproteomics datasets with repeated structures, this optimization can provide speedups of about 10x.
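The link between redundancy and speed can be made concrete by counting function calls instead of timing them. The sketch below is base R only and assumes the cost is dominated by calls to the mapped function; `make_counted()` is a hypothetical helper, and strings again stand in for graphs.

```r
# Hypothetical helper: wrap a function so it records how often it is called.
make_counted <- function(f) {
  calls <- 0L
  list(
    fn = function(...) {
      calls <<- calls + 1L
      f(...)
    },
    count = function() calls
  )
}

x <- rep(c("abc", "de", "fghi", "j", "klmno"), times = 5000)  # 25,000 elements

# Naive: one call per element.
naive <- make_counted(nchar)
res_naive <- vapply(x, naive$fn, integer(1), USE.NAMES = FALSE)

# Deduplicated: one call per unique value, then broadcast by position.
dedup <- make_counted(nchar)
uniq <- unique(x)
res_dedup <- vapply(uniq, dedup$fn, integer(1), USE.NAMES = FALSE)[match(x, uniq)]

naive$count()  # 25000
dedup$count()  # 5
identical(res_naive, res_dedup)
```

The number of calls in the deduplicated path depends only on the number of unique values, which is why the gain grows with the redundancy factor.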
The function you pass to smap must accept an igraph object as its first argument. You can use purrr-style lambda notation:
# Calculate clustering coefficient for each structure
clustering_coeffs <- smap_dbl(large_struc, ~ igraph::transitivity(.x, type = "global"))
summary(clustering_coeffs)
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> 0 0 0 0 0 0 2000
# Create a comprehensive analysis
structure_analysis <- smap(large_struc, function(g) {
list(
vertices = igraph::vcount(g),
edges = igraph::ecount(g),
diameter = if (igraph::is_connected(g)) igraph::diameter(g) else NA_real_,
clustering = igraph::transitivity(g, type = "global")
)
})
# Convert to a more usable format
analysis_df <- do.call(rbind, lapply(structure_analysis, data.frame))
head(analysis_df)
#> vertices edges diameter clustering
#> 1 5 4 3 0
#> 2 2 1 1 NaN
#> 3 3 2 1 0
#> 4 5 4 2 0
#> 5 2 1 1 NaN
#> 6 5 4 3 0
# Find only structures with exactly 5 vertices
has_five_vertices <- smap_lgl(large_struc, ~ igraph::vcount(.x) == 5)
five_vertex_structures <- large_struc[has_five_vertices]
cat("Found", sum(has_five_vertices), "structures with exactly 5 vertices\n")
#> Found 2000 structures with exactly 5 vertices
smap Functions
Use smap functions when:
igraph-based functions to glycan structures
Stick with regular R functions when:
⚠️ Special note on simap:
While simap functions are convenient for position-aware operations, they don’t provide performance benefits over regular imap functions. The inclusion of index information breaks the unique structure optimization, making each (structure, index) pair unique even when structures are identical.
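When the index is only needed to label or combine results, one option is to compute the per-structure values first and attach positions afterwards with ordinary vectorised code. The sketch below assumes the index does not influence the per-structure computation. To stay self-contained, `vertex_counts` is a hard-coded stand-in for what `smap_int(large_struc[1:3], igraph::vcount)` returned earlier.

```r
# Stand-in for the result of smap_int(large_struc[1:3], igraph::vcount),
# using the counts shown in the earlier output.
vertex_counts <- c(5L, 2L, 3L)

# Attach positions with vectorised base R instead of a position-aware map.
indexed_report <- paste0("#", seq_along(vertex_counts), ": ", vertex_counts, " vertices")
indexed_report
```

This produces the same labels as the simap_chr() example while keeping the expensive part of the work in a function that can be deduplicated.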
Here’s how you might build a custom glycan analysis pipeline using smap functions:
# Custom motif detector
detect_branching <- function(g) {
degrees <- igraph::degree(g)
any(degrees >= 3)
}
# Apply to large dataset - blazingly fast due to unique structure optimization
has_branching <- smap_lgl(large_struc, detect_branching)
cat("Structures with branching:", sum(has_branching), "out of", length(large_struc), "\n")
#> Structures with branching: 2000 out of 5000
# Use smap2 to check structures against complexity thresholds
complexity_thresholds <- rep(c(3, 4, 5, 2, 4), 1000) # Thresholds for each structure
meets_threshold <- smap2_lgl(large_struc, complexity_thresholds, function(g, threshold) {
igraph::vcount(g) >= threshold
})
cat("Structures meeting complexity threshold:", sum(meets_threshold), "out of", length(large_struc), "\n")
#> Structures meeting complexity threshold: 2000 out of 5000
Congratulations! You now understand the core optimization that makes glyrepr blazingly fast and how to leverage it with the smap family of functions.
Key takeaways:
- 🧠 Unique structure optimization is the secret sauce behind glyrepr’s performance
- 🚀 smap functions are drop-in replacements for purrr that understand glycan structures
- ⚡ Performance gains are dramatic with large datasets containing repeated structures
- 🛠️ Use smap for structures, regular R functions for everything else
You’re now equipped to build the next generation of glycomics analysis tools. Go forth and analyze! 🌟
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS 15.6.1
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: Asia/Shanghai
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] lobstr_1.1.2 dplyr_1.1.4 tibble_3.3.0 tictoc_1.2.1 purrr_1.1.0
#> [6] glyrepr_0.7.4
#>
#> loaded via a namespace (and not attached):
#> [1] jsonlite_1.8.8 compiler_4.4.1 highr_0.11 tidyselect_1.2.1
#> [5] stringr_1.5.2 jquerylib_0.1.4 yaml_2.3.10 fastmap_1.2.0
#> [9] R6_2.6.1 generics_0.1.4 igraph_2.1.4 knitr_1.48
#> [13] backports_1.5.0 checkmate_2.3.3 rstackdeque_1.1.1 bslib_0.8.0
#> [17] pillar_1.11.0 rlang_1.1.6 utf8_1.2.6 cachem_1.1.0
#> [21] stringi_1.8.7 xfun_0.46 sass_0.4.9 cli_3.6.5
#> [25] magrittr_2.0.4 digest_0.6.37 lifecycle_1.0.4 prettyunits_1.2.0
#> [29] vctrs_0.6.5 evaluate_1.0.3 glue_1.8.0 rmarkdown_2.27
#> [33] tools_4.4.1 pkgconfig_2.0.3 htmltools_0.5.8.1