--- title: "Coding-variant Allelic Series Test" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{COAST} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- Updated: 2025-03-26 ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ## Data To run an allelic series test, there are 4 key inputs: * A numeric annotation vector, of the same length as the number of variants, coded as 1 for benign missense variants (BMVs), 2 for damaging missense variants (DMVs), and 3 for protein truncating variants (PTVs). * A covariates matrix, with as many rows as subjects and including columns such as age and sex. If omitted, defaults to an intercept only. If provided, an intercept will automatically be added if not detected. * A genotype matrix, with subjects as rows and variants as columns. The number of columns should correspond to the length of the annotation vector. * A numeric phenotype vector, either continuous or binary. The example data used below were generated using the `DGP` function provided with the package. The data set includes 100 subjects, 300 variants, and a continuous phenotype. The true effect sizes follow an allelic series, with magnitudes proportional to `c(1, 2, 3)` for BMVs, DMVs, and PTVs respectively. ```{r} set.seed(101) n <- 100 data <- AllelicSeries::DGP( n = n, snps = 300, beta = c(1, 2, 3) / sqrt(n), ) # Annotations. anno <- data$anno head(anno) # Covariates. covar <- data$covar head(covar) # Genotypes. geno <- data$geno head(geno[,1:5]) # Phenotype. pheno <- data$pheno head(pheno) ``` The example data generated by the preceding are available under `vignettes/vignette_data`. ## Running the alleic series test The COding-variant Allelic Series Test (COAST) is run using the `COAST` function. By default, the output of `COAST` includes a data.frame of counts showing the number of alleles, variants, and carriers in each class that contributed to the test, and a data.frame of p-values, with the `omni` test denoting the final, overall p-value. Inspection of the component p-values may be useful for determining which model(s) detected evidence of an association. In the present case, the baseline count model provided the greatest power. ```{r} results <- AllelicSeries::COAST( anno = anno, geno = geno, pheno = pheno, covar = covar ) show(results) ``` The effect sizes data.frame is accessed via: ```{r} results@Betas ``` the counts data.frame via: ```{r} results@Counts ``` and the p-values data.frame via: ```{r} results@Pvals ``` ### Ultra-rare variants Effect size estimation for ultra-rare variants is challenging due to the scarcity of carriers on which to base the estimate. Following works such as [SAIGE-GENE+](https://doi.org/10.1038/s41588-022-01178-w), we recommend collapsing variants with a minor allele count (MAC) $\leq 10$ into an aggregate variant, separately within each annotation category. The `CollapseGeno` utility is provided for this purpose. The `min_mac` specifies the threshold for retention as an individual variant (e.g. `min_mac = 11` will collapse variants with a MAC $\leq 10$). The output is a list containing the annotation vector `anno` and genotype matrix `geno` after collapsing. In addition, a mapping `vars` is provided showing which variants were collapsed to create each aggregate variant. Note that: * `CollapseGeno` will change the order of the variants in the genotype matrix. * No aggregate variant will be created within annotation categories where no variants have a MAC below the threshold. * Collapsing ultra-rare variants does not guarantee that the resulting aggregate variant will itself have a MAC $\geq$ the threshold. To ensure that only variants with a minimum MAC enter the association test, a `min_mac` threshold should also be supplied to `COAST`. ```{r} set.seed(102) # Generate data. n <- 1e3 data <- AllelicSeries::DGP( n = n, snps = 10, prop_anno = c(1, 1, 1) / 3 ) # Collapse ultra-rare variants. collapsed <- AllelicSeries::CollapseGeno( anno = data$anno, geno = data$geno, min_mac = 11 ) # Variants collapsed to form each aggregate variant. head(collapsed$vars) ``` ```{r, eval=FALSE} # Run COAST on the collapsed data. results <- AllelicSeries::COAST( anno = collapsed$anno, covar = data$covar, geno = collapsed$geno, pheno = data$pheno, min_mac = 10 ) ``` ### Different numbers of annotation categories `COAST` was originally intended to operate on the benign missense variants, damaging missense variants, and protein truncating variants within a gene, but it has been generalized to allow for an arbitrary number of discrete annotation categories. The following example simulates and analyzes data with 4 annotation categories. The main difference when analyzing a different number of annotation categories is that the `weight` vector should be specified, and should have length equal to the number of *possible* annotation categories. `COAST` will run, albeit with a warning, if there are possible annotation categories to which no variants are assigned (e.g. a gene contains no PTVs). ```{r, cache=TRUE} set.seed(102) # Generate data. n <- 1e2 data <- AllelicSeries::DGP( n = n, snps = 400, beta = c(1, 2, 3, 4) / sqrt(n), prop_anno = c(0.4, 0.3, 0.2, 0.1), weights = c(1, 1, 1, 1) ) # Run COAST-SS. results <- AllelicSeries::COAST( anno = data$anno, covar = data$covar, geno = data$geno, pheno = data$pheno, weights = c(1, 2, 3, 4) ) show(results) ``` ### Test options * `apply_int = TRUE` applies the rank-based inverse normal transformation from the [RNOmni](https://CRAN.R-project.org/package=RNOmni) package. This transformation is expected to improve power for phenotypes that have a skewed or kurtotic (e.g. long-tailed) distribution. It is applied by default in the case of continuous phenotype, and is ignored in the case of a binary phenotype. ```{r, eval = FALSE} AllelicSeries::COAST( anno = anno, geno = geno, pheno = pheno, covar = covar, apply_int = TRUE ) ``` * `include_orig_skato_all = TRUE` includes standard SKAT-O applied to all variants as a component of the omnibus test, while `include_orig_skato_ptv = TRUE` includes standard SKAT-O applied to PTVs only. In cases where other annotation schemes are used, the annotation for the PTV class can be specified via `ptv_anno`. Including standard SKAT-O as a component of the omnibus test can improve power to detect associations between the phenotype and genes that may not be allelic series. ```{r, eval = FALSE} AllelicSeries::COAST( anno = anno, geno = geno, pheno = pheno, covar = covar, include_orig_skato_all = TRUE, include_orig_skato_ptv = TRUE, ptv_anno = 3 ) ``` * `is_pheno_binary = TRUE` is required to indicate that the supplied phenotype is binary, and should be analyzed using a logistic regression model. ```{r, eval = FALSE} AllelicSeries::COAST( anno = anno, geno = geno, pheno = 1 * (pheno > 0), covar = covar, is_pheno_binary = TRUE ) ``` * `min_mac` is used to filter the variant set to those containing a minimum minor allele count (MAC). The following example filters to only those variants with a MAC of at least 2: ```{r, eval = FALSE} AllelicSeries::COAST( anno = anno, geno = geno, pheno = pheno, covar = covar, min_mac = 2 ) ``` * `return_omni_only = TRUE` is used to return `p_omni` only when the component p-values are not of interest: ```{r, eval = FALSE} AllelicSeries::COAST( anno = anno, geno = geno, pheno = pheno, covar = covar, return_omni_only = TRUE ) ``` * `score_test = TRUE` specifies the use of a score-type allelic series burden test. The default of `score_test = FALSE` specifies a Wald-type allelic series burden test. ```{r, eval = FALSE} AllelicSeries::COAST( anno = anno, geno = geno, pheno = pheno, covar = covar, score_test = TRUE ) ``` * `weights` specifies the relative phenotypic effects of BMVs, DMVs, and PTVs. An increasing pattern such as the default setting of `weights = c(1, 2, 3)` targets allelic series. Setting `weights = c(1, 1, 1)` would target a genetic architecture where all variants have equivalent expected magnitudes. ```{r, eval = FALSE} AllelicSeries::COAST( anno = anno, geno = geno, pheno = pheno, covar = covar, weights = c(1, 2, 3) ) ```