--- title: "Shao2019_analysis" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Shao2019_analysis} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 12, fig.height = 6 ) options(tibble.print_min = 6L, tibble.print.max = 6L, digits = 3) ``` # Introduction In this vignette a PARAFAC model is created for the `Shao2019` data. This is done by first processing the count data. Subsequently, the appropriate number of components are determined. Then the PARAFAC model is created and visualized. ```{r setup} library(parafac4microbiome) library(dplyr) library(ggplot2) library(ggpubr) ``` # Processing the data cube The data cube in Shao2019$data contains unprocessed counts. The function `processDataCube()` performs the processing of these counts with the following steps: * It performs feature selection based on the sparsityThreshold setting. Sparsity is here defined as the fraction of samples where a microbial abundance (ASV/OTU or otherwise) is zero. Here, we do this by taking the sparsity per subject group into account by setting considerGroups=TRUE and telling the function that it should stratify based on the "Delivery_mode" factor. This then automatically applies the sparsityThreshold per group. As a result, microbiota are kept that are abundant in at least one group. * It performs a centered log-ratio transformation of each sample using the `compositions::clr()` function with a pseudo-count of one (on all features, prior to selection based on sparsity). * It centers and scales the three-way array. This is a complex subject, for which we refer to a [paper by Rasmus Bro and Age Smilde](https://doi.org/10.1002/cem.773). By centering across the subject mode, we make the subjects comparable to each other within each time point. Scaling within the feature mode avoids the PARAFAC model focusing on features with abnormally high variation. The outcome of processing is a new version of the dataset called `processedShao`. Please refer to the documentation of `processDataCube()` for more information. ```{r data processing} processedShao = processDataCube(Shao2019, sparsityThreshold=0.9, considerGroups=TRUE, groupVariable="Delivery_mode", CLR=TRUE, centerMode=1, scaleMode=2) ``` # Determining the correct number of components A critical aspect of PARAFAC modelling is to determine the correct number of components. We have developed the functions `assessModelQuality()` and `assessModelStability()` for this purpose. First, we will assess the model quality and specify the minimum and maximum number of components to investigate and the number of randomly initialized models to try for each number of components. Note: this vignette reflects a minimum working example for analyzing this dataset due to computational limitations in automatic vignette rendering. Hence, we only look at 1-4 components with 5 random initializations each. These settings are not ideal for real datasets. Please refer to the documentation of `assessModelQuality()` for more information. ```{r Shao2019 num comp selection} # Setup # For computational purposes we deviate from the default settings minNumComponents = 1 maxNumComponents = 4 numRepetitions = 5 # number of randomly initialized models numFolds = 5 # number of jack-knifed models maxit = 200 ctol= 1e-6 #1e-4 this is a really bad setting but is needed to save computational time numCores = 1 colourCols = c("Delivery_mode", "phylum", "") legendTitles = c("Delivery mode", "Phylum", "") xLabels = c("Subject index", "Feature index", "Time index") legendColNums = c(3,5,0) arrangeModes = c(TRUE, TRUE, FALSE) continuousModes = c(FALSE,FALSE,TRUE) # Assess the metrics to determine the correct number of components qualityAssessment = assessModelQuality(processedShao$data, minNumComponents, maxNumComponents, numRepetitions, ctol=ctol, maxit=maxit, numCores=numCores) ``` The overview plot showcases the number of iterations, the sum-of-squared error, the CORCONDIA and the variance explained for 1-4 components. ```{r overview plot} qualityAssessment$plots$overview ``` The overview plots shows that we can explain 8-10% of the variation in a three-component model. That is quite low. The CORCONDIA for that number of components is well above the minimum requirement of 60. However, a four-component model yields much lower CORCONDIA values. # Jack-knifed models Next, we investigate the stability of the models when jack-knifing out samples using `assessModelStability()`. This will give us more information to choose between 3 or 4 components. ```{r model stability} stabilityAssessment = assessModelStability(processedShao, minNumComponents=1, maxNumComponents=4, numFolds=numFolds, considerGroups=TRUE, groupVariable="Delivery_mode", colourCols, legendTitles, xLabels, legendColNums, arrangeModes, ctol=ctol, maxit=maxit, numCores=numCores) stabilityAssessment$modelPlots[[1]] stabilityAssessment$modelPlots[[2]] stabilityAssessment$modelPlots[[3]] stabilityAssessment$modelPlots[[4]] ``` The model is stable for 1-4 components. Hence a three-component is the appropriate number of components based on the CORCONDIA score. # Model selection We have decided that a three-component model is the most appropriate for the `Shao2019` dataset. We can now select one of the random initializations from the `assessNumComponents()` output as our final model. We're going to select the random initialisation that corresponded the maximum amount of variation explained for three components. ```{r model selection} numComponents = 3 modelChoice = which(qualityAssessment$metrics$varExp[,numComponents] == max(qualityAssessment$metrics$varExp[,numComponents])) finalModel = qualityAssessment$models[[numComponents]][[modelChoice]] ``` Finally, we visualize the model using `plotPARAFACmodel()`. ```{r model plot} plotPARAFACmodel(finalModel$Fac, processedShao, 3, colourCols, legendTitles, xLabels, legendColNums, arrangeModes, continuousModes = c(FALSE,FALSE,TRUE), overallTitle = "Shao PARAFAC model") ``` You will observe that the loadings for some modes in some components are all negative. This is due to sign flipping: two modes having negative loadings cancel out but describe the same thing as two positive loadings. The `flipLoadings()` function automatically performs this procedure and also sorts the components by how much variation they describe. ```{r flip loadings} finalModel = flipLoadings(finalModel, processedShao$data) plotPARAFACmodel(finalModel$Fac, processedShao, 3, colourCols, legendTitles, xLabels, legendColNums, arrangeModes, continuousModes = c(FALSE,FALSE,TRUE), overallTitle = "Shao PARAFAC model") ```