--- title: "Universal first step: EstimateFractions()" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{EstimateFractions} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.dim = c(6, 4) ) ``` ## Introduction The first step of almost any NR-seq analysis is estimating how much of each mutational population is present for each feature in each sample. For example, in a standard TimeLapse-seq/SLAM-seq/TUC-seq/etc. experiment, you need to first estimate the fraction of reads from each feature that have high T-to-C mutational content. `EstimateFractions()` is designed for exactly this task. In this vignette, I will walk through the basics of using `EstimateFractions()`, while also diving into some of its unique functionality. ```{r setup} library(EZbakR) # Going to show one tidyr trick to help with cB filtering library(tidyr) # Going to use a bit of dplyr magic in my demonstrations library(dplyr) ``` ## The basics First, let's simulate some data to showcase how `EstimateFractions()` works: ```{r} simdata <- EZSimulate(nfeatures = 300, nreps = 2) ezbdo <- EZbakRData(simdata$cB, simdata$metadf) ``` Technically, all you need to do is run `EstimateFractions()`, providing it your `EZbakRData` object, and all will be fine: ```{r} ezbdo <- EstimateFractions(ezbdo, features = "feature") # NOTE: # For real data from fastq2EZbakR, you will often # want to set `features = "XF"`, specifying analysis # of reads mapping to exonic regions of genes. ``` Despite this, it is important to realize that `EstimateFractions()` is automating a lot of things under the hood. In particular, it is making the following assumptions about your data: 1. You want to analyze every mutational population tracked in your cB file. In this case, that is just the "TC" column of the simulated cB. 2. All possible mutational populations are present in your data. 3. The feature sets for which you want to estimate fractions includes each unique combination of all feature columns in your cB. In this case, that is just the "feature" column of the simulated cB. 4. You want to filter out rows for which all feature columns are the string "__no_feature" or "NA". 5. Rows with multiple feature assignments in a given feature column should not be split into separate rows for each feature assignment. In the rest of this section, I will discuss how to adjust all of these behaviors. ### Tuning the mutational modeling If you have multiple mutation-type columns in your cB, but would only like to analyze a subset of them, this subset can be specified using the `mutrate_populations` argument: ```{r} ezbdo <- EstimateFractions(ezbdo, mutrate_populations = "TC") ``` If you check `?EstimateFractions()`, you will see that its default value is "all", which means to use all present mutation-type columns. In addition, under the hood, `EstimateFractions()` creates what I have termed a fractions design matrix. This "design matrix" is really just a data frame with n + 1 columns, where n is the number of mutation-types you are analyzing. EZbakR provides some lazily loaded examples for you to check out. "Standard" TC-only fractions design matrix: ```{r, echo = T, results = 'hide'} # Observe contents of cB standard_fraction_design ``` ```{r, echo = FALSE, warning = FALSE} knitr::kable(standard_fraction_design, "pipe") ``` There is one column for each mutation-type, and one column called `present`. 
Each value under a mutation-type column is either `TRUE` or `FALSE`. `TRUE` indicates that the row describes a population with high levels of that mutation type. So the first row in this example describes the high T-to-C mutation content population. The value under the `present` column denotes whether or not that population is expected to be present in your data. The first row has a `present` value of `TRUE` because we expect there to be high T-to-C mutation content reads in our +label data (don't worry about -label controls, they are properly handled automatically). The second row has a `TC` value of `FALSE`, indicating that it pertains to the population of reads with low T-to-C mutation content. This population is also expected to exist in a standard NR-seq dataset, so its `present` value is also `TRUE`.

For a more complicated example, consider the fraction design matrix for [TILAC](https://pubmed.ncbi.nlm.nih.gov/36018791/). TILAC is a method in which an s^4^U-labeled RNA population is mixed with an s^6^G-labeled population. In this case, there are expected to be reads with high T-to-C content (new reads from the s^4^U-labeled sample), reads with high G-to-A content (new reads from the s^6^G-labeled sample), and reads with low T-to-C and G-to-A mutational content (old reads from either sample). However, you will NEVER expect a read with **both** high T-to-C and high G-to-A content, as there are no samples subjected to both labels:

```{r, results = "hide"}
# Observe contents of the TILAC fraction design table
tilac_fraction_design
```

```{r, echo = FALSE, warning = FALSE}
knitr::kable(tilac_fraction_design, "pipe")
```

Because of this, the `present` value for the `TC == TRUE` and `GA == TRUE` row is `FALSE`. The dually high mutation population is not present in this data. All other populations are present though, so their `present` values are `TRUE`.

You can automatically generate a fraction design table as follows:

```{r, results = "hide"}
# Three populations for fun:
fd_table <- create_fraction_design(c("TC", "GA", "CG"))

fd_table
```

```{r, echo = FALSE, warning = FALSE}
knitr::kable(fd_table, "pipe")
```

By default, `create_fraction_design` assumes all possible populations are present. You can edit this table to better reflect the true circumstances of your particular experiment.

If you don't provide a fraction design table, `EstimateFractions()` will use the default output of `create_fraction_design()` to ensure that all possible populations are modeled. This is rarely going to be a truly accurate fraction design table for multi-label experiments, but it's a conservative default that you can easily adjust.

### Tuning the feature set choice and filtering

fastq2EZbakR, the upstream pipeline I developed to conveniently produce output compatible with EZbakR, is able to assign reads to lots of different "features". These include genes, exonic regions of genes, transcript equivalence classes, exon-exon junctions, and more. To make effective use of this diverse set of feature assignments, it's important to understand how `EstimateFractions()` will treat the various feature columns in your cB.

Often, you will be interested in analyzing subsets of these features separately. For example, it might make sense to estimate fractions for gene + transcript equivalence class combos, gene + exon-exon junction combos, and the standard exonic-gene feature, all with the same cB.
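
As a purely hypothetical sketch of what that could look like with a fastq2EZbakR-derived cB (the feature column names used here, "XF", "GF", "transcripts", "junction_start", and "junction_end", are assumptions about your pipeline configuration, and this chunk is not run):

```{r, eval = FALSE}
# Hypothetical sketch; assumes a cB with fastq2EZbakR-style feature columns.

# Standard exonic gene-level analysis
ezbdo <- EstimateFractions(ezbdo, features = "XF")

# Gene + transcript equivalence class combinations
ezbdo <- EstimateFractions(ezbdo, features = c("GF", "transcripts"))

# Gene + exon-exon junction combinations
ezbdo <- EstimateFractions(ezbdo, features = c("GF", "junction_start", "junction_end"))
```
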
You can specify which features to use in a given analysis by setting the `features` argument, as in the sketch above:

```{r}
ezbdo <- EstimateFractions(ezbdo, features = 'feature')
```

In this example, there is only one feature column, called `feature`, but in practice the `features` argument can be provided a vector of feature column names.

Sometimes a read cannot be assigned to a given feature. In that case, there is likely a characteristic string that denotes failed assignment. In the current version of `fastq2EZbakR`, this is `"__no_feature"` (though in older versions it was `NA`, which is a bit harder to deal with; more on that later). The default is thus for `EstimateFractions()` to filter out any rows of the cB for which all analyzed features have this value. This is set by the `remove_features` argument, a vector of strings that mark a feature assignment for filtering:

```{r}
ezbdo <- EstimateFractions(ezbdo, remove_features = c("__no_feature", "feature1"))
```

You can also use this to filter out particular features, like "feature1" in the simulated example. As I mentioned though, all analyzed feature columns need to have one of the `remove_features` strings in order for a row to get filtered. This behavior can be changed to the opposite extreme, where only a single feature column needs to fail this test for the entire row to get filtered out:

```{r}
ezbdo <- EstimateFractions(ezbdo, filter_condition = `|`)
```

As I mentioned, older versions of fastq2EZbakR and bam2bakR would denote failed assignment with an `NA`. While the default value for `remove_features` includes the string `"NA"`, this does not properly handle filtering out of columns with an actual `NA` value. You can convert NA's in your cB to whatever string you please with `tidyr::replace_na()`:

```{r, results = 'hide'}
example_df <- data.frame(feature = c('A', NA, 'C'),
                         other_feature = c(NA, 'Y', 'Z'))

replaced_df <- replace_na(example_df,
                          list(feature = '__no_feature',
                               other_feature = 'NA'))

replaced_df
```

```{r, echo = FALSE, warning = FALSE}
example_df <- data.frame(feature = c('A', NA, 'C'),
                         other_feature = c(NA, 'Y', 'Z'))

replaced_df <- replace_na(example_df,
                          list(feature = '__no_feature',
                               other_feature = 'NA'))

knitr::kable(replaced_df, "pipe")
```

Finally, some feature assignments will be ambiguous. That is, one read will assign to multiple instances of a given feature. For example, when assigning reads to exon-exon junctions, one read may overlap multiple exon-exon junctions. In fastq2EZbakR, such instances are handled by including the names of all features a read assigned to, separated by "+". In some cases, you will want to split these rows into multiple rows, one for each feature assignment. To do that, you can specify the `split_multi_features` and `multi_feature_cols` arguments:

```{r}
ezbdo <- EstimateFractions(ezbdo, split_multi_features = TRUE, multi_feature_cols = "feature")
```

`split_multi_features = TRUE` will copy the data of multi-feature assignment rows for any of the feature columns denoted in `multi_feature_cols`. If the separator used to combine feature names for multi-feature assigned reads is something other than "+", this can be addressed by altering the `multi_feature_sep` argument.

## Improving estimates

Two things can impact the accuracy of fraction estimation:

1. Dropout: the phenomenon by which metabolically labeled RNA, or sequencing reads derived from such RNA, is lost either during library preparation or during computational processing of the raw sequencing data.

   Several labs have independently identified and discussed the prevalence of dropout in NR-seq and other metabolic labeling RNA-seq experiments. Recent work suggests that there are three potential causes: 1) loss of labeled RNA on plastic surfaces during RNA extraction, 2) reverse transcriptase falloff caused by the modifications that some NR-seq chemistries make to the metabolic label, and 3) loss of reads with many NR-induced mutations due to poor read alignment.

2. Inaccurate mutation rate estimates.

In this section, we will discuss some strategies you can use to address these two challenges.

### Correcting for dropout

If you have -label data to compare to, EZbakR can use this data to quantify and correct for dropout. Currently, the default strategy used by EZbakR is similar to that implemented in [grandR](https://www.nature.com/articles/s41467-023-39163-4) and originally described [here](https://pubmed.ncbi.nlm.nih.gov/38381903/). Alternatively, EZbakR also implements the strategy previously implemented in bakR. Internal benchmarks suggest that the grandR strategy is a bit conservative (it overestimates dropout) but more robust than the bakR strategy. The bakR strategy has the advantage of being explicitly derived from a particular generative model of dropout (details [here](https://simonlabcode.github.io/bakR/articles/Dropout.html)), allowing for things like model fit assessment. This sort of functionality is not currently implemented in EZbakR though, so I have chosen to default to the grandR strategy for now.

To implement dropout correction, run `CorrectDropout()`, specifying which metadf sample characteristic columns should be used to associate -label and +label samples:

```{r}
ezbdo <- CorrectDropout(ezbdo, grouping_factors = "treatment")
```

`grouping_factors` is technically not a required argument, as EZbakR will try to infer it automatically. By default though, EZbakR assumes that all metadf sample characteristic columns should be used for grouping, so it's important to know that this default can be altered as necessary. To alter the correction strategy, set the `strategy` parameter of `CorrectDropout()` (options being "grandR" and "bakR").

### Inferring background mutation rates from -label data

In some cases (e.g., if label incorporation rates are low, or if a very short label time was used), it can be difficult to estimate the high and low mutation rates in a given sample. To address this challenge, you can use -label data to infer background mutation rates, and then estimate the label-induced mutation rate while fixing the background mutation rate to the -label rate. To do this, you just need to set `pold_from_nolabel` to `TRUE`:

```{r}
ezbdo <- EstimateFractions(ezbdo, pold_from_nolabel = TRUE)
```

By default, a single background mutation rate (pold) will be estimated using all of the -label data. Alternatively, you can specify sample characteristics by which to stratify the -label data, so that one pold is estimated per group of samples with the same value for the metadf columns specified in `grouping_factors`:

```{r}
ezbdo <- EstimateFractions(ezbdo, pold_from_nolabel = TRUE, grouping_factors = 'treatment')
```

## Fancy functionalities

In the first section, we discussed the basic functionality of `EstimateFractions()` and how to alter its key behaviors. In this section, I will discuss some of the cooler, more niche analysis strategies that `EstimateFractions()` can implement.

### Hierarchical mutation rate model

The two-component mixture modeling strategy implemented in tools like bakR and GRAND-SLAM is widely considered the "gold standard" for analyzing NR-seq data. Like any statistical method though, it makes assumptions, and real data will often violate these assumptions. Namely, these models typically assume that every RNA synthesized in the presence of metabolic label has an equal probability of incorporating said label. What if this is not true though? This may seem like a pedantic statistical question, but there is evidence that some RNAs deviate strongly from this assumption. For example, it has been noted (paper from the Churchman lab [here](https://pubmed.ncbi.nlm.nih.gov/38503286/)) that mitochondrial RNAs have much lower mutation rates in reads from new RNA than other RNAs do.

To account for this heterogeneity, `EstimateFractions()` can implement what I will term a "hierarchical" two-component mixture model. "Hierarchical" means that a new read mutation rate will be estimated for each feature, but this estimate will be strongly informed by the sample-wide average new read mutation rate. If a given feature has enough coverage and strong evidence that its new read mutation rate differs from the sample-wide average, then this feature-specific mutation rate will be estimated and used when estimating the fraction of high mutation content reads for that feature. If a feature has limited coverage though, its new read mutation rate estimate will be strongly pushed towards the sample-wide average. In other words, the feature-specific mutation rates are "regularized" towards the sample-wide average.

This strategy will take a bit longer to run and can be implemented like so:

```{r}
ezbdo <- EstimateFractions(ezbdo, strategy = 'hierarchical')
```

When doing this, the feature-specific and sample-wide mutation rates will both be saved in the `ezbdo$mutation_rates` list. As usual, let's check the results:

```{r}
est <- EZget(ezbdo, type = 'fractions')
truth <- simdata$PerRepTruth

compare <- dplyr::inner_join(est, truth, by = c('sample', 'feature'))

plot(compare$true_fraction_highTC, compare$fraction_highTC)
abline(0, 1)
```

Still good!

### Isoform deconvolution

For many years now, we have had tools to estimate transcript isoform abundances from RNA-seq data. These include RSEM, kallisto, Salmon, and many more. Wouldn't it be nice if we could similarly estimate synthesis and degradation rate constants for individual transcript isoforms with NR-seq data? EZbakR now implements such an analysis strategy!

To use it, you will need to have a cB in which reads have been assigned to their transcript equivalence class, which is just a fancy way of saying "the set of isoforms with which they are completely consistent". [fastq2EZbakR](https://github.com/isaacvock/fastq2EZbakR) is able to do this, so check out its documentation for details. We can simulate such a cB ourselves for demonstration purposes:

```{r}
# Simulates a single sample worth of data
simdata_iso <- SimulateIsoforms(nfeatures = 300)

# We have to manually create the metadf in this case
metadf <- data.frame(sample = 'sampleA', tl = 4, condition = 'A')

ezbdo <- EZbakRData(simdata_iso$cB, metadf)
```

If you inspect the cB simulated by `SimulateIsoforms()`, you will note that there is a column called `transcripts`, which represents transcript equivalence classes.
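
For example, here is a quick peek at the first few rows of the simulated table (just `head()` on the cB, nothing fancy):

```{r}
# First few rows of the simulated cB; the `transcripts` column holds
# "+"-separated transcript equivalence classes.
head(simdata_iso$cB)
```
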
The transcript IDs for all of the transcripts a read is consistent with are separated by "+"s. In this case, though, we don't want to split and copy the data for each isoform. Rather, we want to estimate fractions for each unique equivalence class:

```{r}
ezbdo <- EstimateFractions(ezbdo)
```

To get isoform-specific estimates, we will need to run a new function called `EstimateIsoformFractions()`. Before doing this though, we need to import transcript isoform quantification estimates from tools like RSEM, which will be used by `EstimateIsoformFractions()`. To do this, you can use `ImportIsoformQuant()`. `ImportIsoformQuant()` has three parameters:

* `obj`: The `EZbakRData` object to which you want to add isoform quantification information.
* `files`: A named vector of paths to each isoform quantification output file you want to import. The names should be the relevant sample names as they appear in your cB.
* `quant_tool`: The tool you used for isoform quantification. `ImportIsoformQuant()` is just a convenient wrapper for `tximport::tximport`, so I urge you to read that tool's documentation for more information (documentation [here](https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html)).

In this example, I will hack a solution, since I don't have any isoform quantification output to use. You could technically do something similar if you are having a tough time using `ImportIsoformQuant()`, but I wouldn't recommend it:

```{r}
### Hack in the true, simulated isoform levels
reads <- simdata_iso$ground_truth %>%
  dplyr::select(transcript_id, true_count, true_TPM) %>%
  dplyr::mutate(sample = 'sampleA',
                effective_length = 10000) %>%
  dplyr::rename(expected_count = true_count,
                TPM = true_TPM)

# Name of table needs to have "isoform_quant" in it
ezbdo[['readcounts']][['simulated_isoform_quant']] <- reads

### Perform deconvolution
ezbdo <- EstimateIsoformFractions(ezbdo)
```

We can then see how it did by comparing to the ground truth:

```{r, fig.align='center'}
est <- EZget(ezbdo, type = 'fractions', features = "transcript_id")
truth <- simdata_iso$ground_truth

compare <- truth %>%
  dplyr::inner_join(est, by = c("feature", "transcript_id"))

plot(compare$true_fn, compare$fraction_highTC)
abline(0, 1)
```

This simulation is a bit extreme, as around 50% of all isoform abundance differences are assumed to be completely driven by differences in isoform stability, which leads to underestimation of some of the most extreme fraction news. Other than that though, it looks good!

### Using the Apache Arrow backend

**NOTE:** The documentation for this section lacks good examples, as I don't yet have an example external dataset included with EZbakR's installation. Thus, I will be describing the details of how the analysis would go without showing a real analysis. Look for this to change in the not-too-distant future.

What if you have a massive dataset with tens of samples that you would like to analyze? Loading the entire cB table for such a dataset into memory will surely crash most laptops. If you have access to some sort of HPC cluster, you could of course use that, but there is nothing like the convenience and interactivity of working with a dataset on your personal computer. Because of this, EZbakR is able to use [Apache Arrow's](https://arrow.apache.org/) R frontend (i.e., the [arrow](https://arrow.apache.org/docs/r/) package) to help with large datasets. The steps for this process are described below.

#### Step 1: Create an Arrow Dataset

You need to create an Arrow Dataset partitioned by the "sample" column of your cB. This will create a set of .parquet files, one for each sample, allowing `EstimateFractions()` to load your data one sample at a time so that only a single sample's worth of the cB is held in RAM at any point:

```{r, eval=FALSE}
library(arrow)

### Move into the directory with your cB file
setwd("Path/to/cB/containing/directory")

### This will not load the cB into memory
# You can run `read_csv("cB.csv.gz", n_max = 1)` to check what
# the order of the columns is, as this order needs to match your provided
# schema.
ds <- open_dataset("cB.csv.gz",
                   format = "csv",
                   schema = schema(
                     sample = string(),
                     rname = string(),
                     GF = string(),
                     XF = string(),
                     exon_bin = string(),
                     bamfile_transcripts = string(),
                     junction_start = string(),
                     junction_end = string(),
                     TC = int64(),
                     nT = int32(),
                     sj = bool(),
                     n = int64()
                   ),
                   skip_rows = 1)

### Create Arrow dataset
setwd("Path/to/where/you/want/to/write/Arrow/Dataset")

ds %>%
  group_by(sample) %>%
  write_dataset("fulldataset/", format = "parquet")
```

See the [arrow documentation](https://arrow.apache.org/docs/r/articles/dataset.html) for many more details about how to tune the dataset creation process. This will never load the entire cB into memory, though it may use a fair bit of RAM if you try to add some of the custom filtering or summarization discussed in the arrow docs.

#### Step 2: Create EZbakRArrowData object

You can then create an `EZbakRArrowData` object similarly to how you would create a standard `EZbakRData` object:

```{r, eval = FALSE}
ds <- arrow::open_dataset("Path/to/where/you/want/to/Arrow/Dataset/")

metadf <- tibble(sample = c("WT_1", "WT_2", "WT_ctl",
                            "KO_1", "KO_2", "KO_ctl"),
                 tl = c(2, 2, 0, 2, 2, 0),
                 genotype = rep(c('WT', 'KO'), each = 3))

ezbado <- EZbakRArrowData(ds, metadf)
```

#### Step 3: Run EstimateFractions like normal

Finally, just run `EstimateFractions()` with all of the settings you would normally use if you were working with an `EZbakRData` object:

```{r, eval = FALSE}
ezbado <- EstimateFractions(ezbado,
                            features = c("GF", "XF", "junction_start", "junction_end"),
                            filter_cols = c("XF", "junction_start", "junction_end"),
                            filter_condition = `|`,
                            split_multi_features = TRUE,
                            multi_feature_cols = c("junction_start", "junction_end"))
```

#### Simulating the Arrow workflow

To give you the opportunity to explore using the Arrow backend in a controlled setting, here is a workflow where we simulate a cB, create a temporary Arrow dataset from it, and run `EstimateFractions()` on an `EZbakRArrowData` object:

```{r}
library(arrow)

simdata <- EZSimulate(nfeatures = 250)

outdir <- tempdir()
dataset_dir <- file.path(outdir, "arrow_dataset")

write_dataset(
  simdata$cB,
  path = dataset_dir,
  format = "parquet",
  partitioning = "sample"
)

ds <- open_dataset(dataset_dir)

ezbdo <- EZbakRArrowData(ds, simdata$metadf)

ezbdo <- EstimateFractions(ezbdo)
```

In this case, a single sample is analyzed at a time to reduce RAM usage, as the Arrow dataset allows EZbakR to load only the sample currently being analyzed into RAM. Note that this will result in a slight runtime hit relative to the non-Arrow analysis, so this approach is best reserved for large datasets.
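
As a quick sanity check, and assuming the Arrow-backed output stores its estimates the same way as the in-memory workflow shown earlier in this vignette, you can retrieve the fraction estimates with `EZget()`:

```{r}
# Retrieve the fraction estimates produced by the Arrow-backed run.
# This assumes the output behaves like a standard EstimateFractions() result.
arrow_est <- EZget(ezbdo, type = "fractions")

head(arrow_est)
```
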