--- title: "Linear modeling in EZbakR" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Linear-modeling} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.dim = c(6, 4) ) ``` ## Introduction Typically, you will have multiple replicates of NR-seq data from two or more "conditions". In this case, it is common to want to compare kinetic parameter estimates across the various conditions. In addition, sometimes you will need to account for complex experimental designs and potential batch effects. This is the task of generalized linear modeling, and in this vignette, we will see how to perform this kind of modeling in EZbakR. EZbakR significantly improves upon its predecessor, bakR, by allowing users to specify any identifiable design matrix, thus supporting a wider arrange of comparative analyses than was possible in bakR. ```{r setup} library(EZbakR) library(dplyr) ``` ## Example scenarios In this section, I will blaze through some common examples. I will eventually write later sections to get into some of the details of linear regression in case you need to more deeply understand how to generalize beyond what I show in this section. In all cases, the specification of the linear model that you would like to fit will be done through the `formula_mean` argument of EZbakR's `AverageAndRegularize()` function. ### The simplest case: a single "condition" factor Often, the difference between samples can be described with a single factor. For example, you may have performed NR-seq in wild-type (WT) cells, and a variety of knock-out (KO) cell lines. In this, case, there will likely be a single experimental detail factor in your metadf: ```{r, echo = F, results = 'hide'} # Example metadf: metadf <- tibble( sample = c("sampleA", "sampleB", "sampleC", "sampleD", "sampleE", "sampleF"), tl = c(0, 2, 2, 0, 2, 2), genotype = c("WT", "WT", "WT", "KO", "KO", "KO") ) metadf ``` ```{r, echo = FALSE, warning = FALSE} knitr::kable(metadf, "pipe") ``` Here, I simulate and then fit the relevant model for both the log(kdeg)'s as well as the log(ksyn)'s: ```{r} metadf <- tibble(sample = c("sampleA", "sampleB", "sampleC", "sampleD", "sampleE", "sampleF"), tl = c(0, 2, 2, 0, 2, 2), genotype = c("WT", "WT", "WT", "KO", "KO", "KO")) # Simulate kdeg and ksyn differences set.seed(43) simdata <- EZSimulate(500, metadf = metadf %>% dplyr::rename(label_time = tl), mean_formula = ~genotype -1, pdiff_ks = 0.1) ezbdo <- EZbakRData(simdata$cB, metadf) ezbdo <- EstimateFractions(ezbdo) ezbdo <- EstimateKinetics(ezbdo) ### Linear modeling in EZbakR: # Estimate average log(kdeg) for each genotype ezbdo <- AverageAndRegularize(ezbdo, formula_mean = ~ genotype) # Estimate average log(ksyn) for each genotype ezbdo <- AverageAndRegularize(ezbdo, parameter = "log_ksyn", formula_mean = ~ genotype) # Calculate difference in log(kdeg) between the two genotypes ezbdo <- CompareParameters(ezbdo, parameter = "log_kdeg", design_factor = "genotype", reference = "WT", experimental = "KO") # Calculate differe in log(ksyn) between the two genotypes ezbdo <- CompareParameters(ezbdo, parameter = "log_ksyn", design_factor = "genotype", reference = "WT", experimental = "KO") ``` Everything is default EZbakR stuff discussed in other vignettes until we get to `AverageAndRegularize()`. There, I am using the `formula_mean` parameter to describe how I want to relate the parameter being averaged to factors in the metadf. `~ genotype` means that I want to calculate average values of the parameter for each value of `genotype` that appears in the metadf. `CompareParameters()` in this case can be used to get comparisons of the parameter averages in the two genotypes. `design_factor` describes the multi-leveled factor you included in your `formula_mean` that you would like to compare two levels of. The difference in parameters is calculated as `experimental` - `reference`. In this case, these differences will represent log-fold differences in the kdegs and ksyns (since a difference in logs is a log of a ratio; log(b) - log(a) = log(b/a)). If you have more than two conditions in your data, nothing changes in how you would run `AverageAndRegularize()`, but you would need to run `CompareParameters()` for each comparison you want to make. You can visualize the comparative analysis results using MA plots or volcano plots like so: ```{r, results = 'hold'} # Assess log(kdeg) differences EZVolcanoPlot(ezbdo, parameter = "log_kdeg") EZMAPlot(ezbdo, parameter = "log_kdeg") ``` ```{r, results = 'hold'} # Assess log(ksyn) differences EZVolcanoPlot(ezbdo, parameter = "log_ksyn") EZMAPlot(ezbdo, parameter = "log_ksyn") ``` ### Dealing with multiple factors: simple interaction model What if there is more than one variable you tweak from replicate to replicate? Maybe you have multiple cell lines with distinct genotypes, but you also treat these cell lines with a drug in some cases? Your metadf may look like: ```{r, echo = F, results = 'hide'} # Example metadf: metadf <- tibble( sample = c("sampleA", "sampleB", "sampleC", "sampleD", "sampleE", "sampleF", "sampleG", "sampleH", "sampleI", "sampleJ", "sampleK", "sampleL"), tl = c(0, 2, 2, 0, 2, 2, 0, 2, 2, 0, 2, 2), genotype = c("WT", "WT", "WT", "WT", "WT", "WT", "KO", "KO", "KO", "KO", "KO", "KO"), treatment = c("nodrug", "nodrug", "nodrug", "drug", "drug", "drug", "nodrug", "nodrug", "nodrug", "drug", "drug", "drug") ) metadf ``` ```{r, echo = FALSE, warning = FALSE} knitr::kable(metadf, "pipe") ``` The simplest thing to do in this case is to estimate average parameter values for each unique combination of your two (or more) factors. This can be done with a simple `formula_mean` of `~ genotype:treatment`: ```{r} metadf <- tibble( sample = c("sampleA", "sampleB", "sampleC", "sampleD", "sampleE", "sampleF", "sampleG", "sampleH", "sampleI", "sampleJ", "sampleK", "sampleL"), tl = c(0, 2, 2, 0, 2, 2, 0, 2, 2, 0, 2, 2), genotype = c("WT", "WT", "WT", "WT", "WT", "WT", "KO", "KO", "KO", "KO", "KO", "KO"), treatment = c("nodrug", "nodrug", "nodrug", "drug", "drug", "drug", "nodrug", "nodrug", "nodrug", "drug", "drug", "drug") ) # Simulate kdeg and ksyn differences set.seed(43) simdata <- EZSimulate(500, metadf = metadf %>% dplyr::rename(label_time = tl), mean_formula = ~genotype:treatment -1, pdiff_ks = 0.1) ezbdo <- EZbakRData(simdata$cB, metadf) ezbdo <- EstimateFractions(ezbdo) ezbdo <- EstimateKinetics(ezbdo) ### Linear modeling in EZbakR: # Estimate average log(kdeg) for each genotype ezbdo <- AverageAndRegularize(ezbdo, formula_mean = ~ genotype:treatment) # Estimate average log(ksyn) for each genotype ezbdo <- AverageAndRegularize(ezbdo, parameter = "log_ksyn", formula_mean = ~ genotype:treatment) # Calculate difference in log(kdeg) between WT drug and no drug ezbdo <- CompareParameters(ezbdo, parameter = "log_kdeg", design_factor = "", reference = "genotypeWT:treatmentnodrug", experimental = "genotypeWT:treatmentdrug") # Calculate difference in log(kdeg) between KO drug and no drug ezbdo <- CompareParameters(ezbdo, parameter = "log_kdeg", design_factor = "", reference = "genotypeKO:treatmentnodrug", experimental = "genotypeKO:treatmentdrug") # Calculate difference in log(kdeg) between KO and WT nodrug ezbdo <- CompareParameters(ezbdo, parameter = "log_kdeg", design_factor = "", reference = "genotypeWT:treatmentnodrug", experimental = "genotypeKO:treatmentnodrug") ### You could do same for log_ksyn ``` The quirk in this case is how to use `CompareParameters`. You have to set `design_factor` to a blank string, and then list the full parameter names in `reference` and `experimental`. You can see the parameter names by looking at the metadata for the relevant `AverageAndRegularize` object: ```{r} ezbdo$metadata$averages$logkdeg_feature$fit_params ``` It's a bit of a hack, and I hope to have a more intuitive option for this case in the near future. I'll note though, that there was an easier way in this case. This interaction model works just like a single parameter model with a factor with one level per unique combination of factors. So you could have defined a metadf that looks like: ```{r, results = 'hide'} metadf$condition <- factor(paste0(metadf$genotype, metadf$treatment)) metadf ``` ```{r, echo = FALSE, warning = FALSE} knitr::kable(metadf, "pipe") ``` and then set `formula_mean` in `AverageAndRegularize()` to `~condition`. This hack comes from the [DESeq2 documentation](https://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#interactions), so check that out for an alternate discussion. ### Dealing with multiple factors: estimating an actual interaction term Ok, but what if in the two or more factor case you want to specify and regress out an actual interaction effect? The interpretation of the parameters in such a model is a bit more complicated, but here's how you would fit such a model: ```{r} # factor() ensures that WT + nodrug is treated as the reference # which probably makes more intuitive sense metadf <- tibble( sample = c("sampleA", "sampleB", "sampleC", "sampleD", "sampleE", "sampleF", "sampleG", "sampleH", "sampleI", "sampleJ", "sampleK", "sampleL"), tl = c(0, 2, 2, 0, 2, 2, 0, 2, 2, 0, 2, 2), genotype = factor(c("WT", "WT", "WT", "WT", "WT", "WT", "KO", "KO", "KO", "KO", "KO", "KO"), levels = c("WT", "KO")), treatment = factor(c("nodrug", "nodrug", "nodrug", "drug", "drug", "drug", "nodrug", "nodrug", "nodrug", "drug", "drug", "drug"), levels = c("nodrug", "drug")) ) # Simulate kdeg and ksyn differences set.seed(43) simdata <- EZSimulate(500, metadf = metadf %>% dplyr::rename(label_time = tl), mean_formula = ~genotype*treatment -1, pdiff_ks = 0.1) ezbdo <- EZbakRData(simdata$cB, metadf) ezbdo <- EstimateFractions(ezbdo) ezbdo <- EstimateKinetics(ezbdo) ### Linear modeling in EZbakR: # Estimate average log(kdeg) for each genotype ezbdo <- AverageAndRegularize(ezbdo, formula_mean = ~ genotype*treatment) # Estimate average log(ksyn) for each genotype ezbdo <- AverageAndRegularize(ezbdo, parameter = "log_ksyn", formula_mean = ~ genotype*treatment) # Viewing parameters will be easier in the future ezbdo$metadata$averages$logkdeg_feature$fit_params ``` This model is a bit quirky and has the following 4 parameters: 1. `genotypeWT`: This represents the average WT, nodrug parameter value. 2. `genotypeKO`: This represents the average KO, nodrug parameter value. 3. `treatmentdrug`: This represents the average effect of drug treatment **on just the reference level genotype (WT in this case)**. 4. `genotypeKO:treatmentdrug`: This is the "interaction term", but more intuitively represents the difference in the KO and WT drug effects. With this in mind, some things you can do with `CompareParameters()` includes: ```{r} # Assess significance of interaction term ezbdo <- CompareParameters(ezbdo, parameter = "log_kdeg", param_name = "genotypeKO:treatmentdrug") # Assess significance of drug effect on WT ezbdo <- CompareParameters(ezbdo, parameter = "log_kdeg", param_name = "treatmentdrug") # Assess difference of WT and KO ezbdo <- CompareParameters(ezbdo, parameter = "log_kdeg", design_factor = "genotype", reference = "WT", experimental = "KO") ``` Assessing the impact of drug treatment on the non-reference genotype is currently not possible, but will be through the currently inactive `param_function` parameter in `CompareParameters()`. This is because this situation is unique in that you need to assess the sum of three parameters (`genotypeKO`, `treatmentdrug`, and `genotypeKO:treatmentdrug`). Alternatively, the drug impact on KO can be estimated more simply using the model discussed in the last section. ### Regressing out potential batch effects The final common situation to discuss is when you have multiple factors in your metadf, but some of them represent technical factors rather than biological ones. These technical factors constitute what are often called "batches", which are technical factors that can influence parameter estimates and thus risk being a confounder in your analyses. Linear modeling offers a simple strategy for attempting to deal with this problem. Consider the following metadf: ```{r, echo = F, hide = T} metadf <- tibble(sample = c("sampleA", "sampleB", "sampleC", "sampleD", "sampleE", "sampleF", "sampleG", "sampleH", "sampleI", "sampleJ"), tl = c(0, 2, 2, 2, 2, 0, 2, 2, 2, 2), genotype = c("WT","WT", "WT", "WT", "WT", "KO","KO", "KO", "KO", "KO"), batch = c("A", "A", "B", "A", "B", "A", "A", "B", "A", "B")) ``` ```{r, echo = FALSE, warning = FALSE} knitr::kable(metadf, "pipe") ``` There is a technical factor I am calling "batch" that we suspect may influence our NR-seq data, independent of any biological difference between the two genotypes. We would like to "regress out" this technical effect. This can be done by specifying `~genotype + batch` in the `formula_mean` of `AverageAndRegularize()`.: ```{r} # factor() ensures that WT is treated as the reference # which probably makes more intuitive sense metadf <- tibble(sample = c("sampleA", "sampleB", "sampleC", "sampleD", "sampleE", "sampleF", "sampleG", "sampleH", "sampleI", "sampleJ"), tl = c(0, 2, 2, 2, 2, 0, 2, 2, 2, 2), genotype = factor(c("WT", "WT", "WT", "WT", "WT", "KO", "KO", "KO", "KO", "KO"), levels = c("WT", "KO")), batch = c("A", "A", "B", "A", "B", "A", "A", "B", "A", "B")) # Simulate kdeg and ksyn differences set.seed(43) simdata <- EZSimulate(500, metadf = metadf %>% dplyr::rename(label_time = tl), mean_formula = ~genotype + batch -1, pdiff_ks = 0.1) ezbdo <- EZbakRData(simdata$cB, metadf) ezbdo <- EstimateFractions(ezbdo) ezbdo <- EstimateKinetics(ezbdo) ### Linear modeling in EZbakR: # Estimate average log(kdeg) for each genotype ezbdo <- AverageAndRegularize(ezbdo, formula_mean = ~ genotype + batch) # Estimate average log(ksyn) for each genotype ezbdo <- AverageAndRegularize(ezbdo, parameter = "log_ksyn", formula_mean = ~ genotype + batch) # Viewing parameters will be easier in the future ezbdo$metadata$averages$logkdeg_feature$fit_params ``` The three parameters are: 1. `genotypeWT`: Average WT parameter value, with the batch averages "regressed out". 1. `genotypeKO`: Average KO parameter value, with batch averages regressed out. 1. `batchB`: Average difference between the batches, when accounting for apparent biological differences between the batches. Unlike in the interaction model (`~genotype*treatment`), the parameters `genotypeWT` and `genotypeKO` are no longer simple averages of samples with the relevant factor value. Rather, they represent an average that accounts for differences between batches in the same biological conditions. **MAJOR NOTE:** `AverageAndRegularize()` will fail with an error if your model is not "identifiable". That means, if you don't have enough independent data points to estimate all parameters in your desired linear model, EZbakR won't let you fit said model. Consider the following metadf, which contains a technical factor "batch" as in the other examples discussed in this section: ```{r, echo = F, hide = T} metadf <- tibble(sample = c("sampleA", "sampleB", "sampleC", "sampleD", "sampleE", "sampleF", "sampleG", "sampleH", "sampleI", "sampleJ"), tl = c(0, 2, 2, 2, 2, 0, 2, 2, 2, 2), genotype = c("WT","WT", "WT", "WT", "WT", "KO","KO", "KO", "KO", "KO"), batch = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B")) ``` ```{r, echo = FALSE, warning = FALSE} knitr::kable(metadf, "pipe") ``` If you try and fit the model described in this section (`formula_mean = ~genotype + batch`), EZbakR will fail with an error. Why wasn't it able to fit the same model with this slightly different metadf? To fit this model, you need to be able to estimate the impact of "batch" independent of the impact of "genotype". In this case though, all batch A's are instances of "WT" datasets, and all batch B's are instances of "KO" datasets. Thus, the model cannot distinguish batch effects from real biological differences. It would be like me asking you to solve the equation 5 = x + z for x and z. There are an infinite number of combinations of x and z that satisfy this equation. **Conclusion:** If you suspect that batch effects could influence your measurements (e.g., the day a sample was collected on, who collected the sample, etc.) you need to ensure that these technical effects are evenly distributed among your various biological conditions. Don't process all "WT" datasets on one day and all "KO" on a different day. Make sure every batch of samples includes all biologically distinct conditions you are testing.