---
title: "Getting started with gwasrapidd"
output: 
  bookdown::html_document2:
    base_format: rmarkdown::html_vignette
    fig_caption: yes
    number_sections: false
# as_is: for figure captions in pkgdown output
pkgdown:
  as_is: true
vignette: >
  %\VignetteIndexEntry{Getting started with gwasrapidd}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: gwasrapidd.bib
---

## The GWAS Catalog

The GWAS Catalog is a service provided by the EMBL-EBI and NHGRI that offers a
manually curated and freely available database of published genome-wide
association studies (GWAS). The Catalog website and infrastructure is hosted by
the [EMBL-EBI](https://www.ebi.ac.uk).

`{gwasrapidd}` facilitates the access to the Catalog via the REST API,
allowing you to programmatically retrieve data directly into R.

## GWAS Catalog Entities

The Catalog REST API is organized around four core entities:

- studies
- associations
- variants
- traits

`{gwasrapidd}` provides four corresponding functions to get each of the
entities: `get_studies()`, `get_associations()`, `get_variants()`, and
`get_traits()`.

Each function maps to an appropriately named S4 classed object: [studies](https://rmagno.eu/gwasrapidd/reference/studies-class.html), [associations](https://rmagno.eu/gwasrapidd/reference/associations-class.html), [variants](https://rmagno.eu/gwasrapidd/reference/variants-class.html), and [traits](https://rmagno.eu/gwasrapidd/reference/traits-class.html) (see Figure \@ref(fig:fns)).


```{r fns, out.width = "70%", echo=FALSE, fig.cap='`{gwasrapidd}` retrieval functions.'}
knitr::include_graphics('../man/figures/get_fns.png') 
```

You can use a combination of several search criteria with each retrieval
function as shown in Figure \@ref(fig:criteria). For example, if you want to get studies using
either one of these two criteria:

- study accession identifier (`study_id`)
- variant identifier (`variant_id`)

You could run the following code:

```r
library(gwasrapidd)

my_studies <- get_studies(study_id = 'GCST000858', variant_id = 'rs12752552')
my_studies@studies
```

This command returns all studies that match either `'GCST000858'` or
`'rs12752552'`. This is equivalent to running `get_studies` separately on each
criteria, and combining the results afterwards:


```r
s1 <- get_studies(study_id = 'GCST000858')
s2 <- get_studies(variant_id = 'rs12752552')
my_studies <- gwasrapidd::union(s1, s2)
```

All four retrieval functions accept the `set_operation` parameter which defines the way the results obtained with each criterion are combined. The two options for this parameter are `'union'` (default) or `'intersection'`, resulting, respectively, in an OR or AND operation.

```{r criteria, out.width = "70%", echo=FALSE, fig.cap='`{gwasrapidd}` arguments for retrieval functions. Colors indicate the criteria that can be used for retrieving GWAS Catalog entities: studies (green), associations (red), variants (purple), and traits (orange).'}
knitr::include_graphics('../man/figures/get_criteria.png') 
```

## Finding Risk Alleles Associated with Autoimmune Disease

As a first example, take the work by @Light2014. In this work the authors focused on variants that had been previously reported in genome-wide association studies (GWAS) for autoimmune disease.

With **gwasrapidd** we can interrogate the GWAS Catalog for the study/studies by searching by *autoimmune disease* (an EFO trait). To do that let's load gwasrapidd first:


```r
library(gwasrapidd)
```

Then query the GWAS Catalog by EFO trait:


```r
my_studies <- get_studies(efo_trait = 'autoimmune disease')
```

We can now check how many GWAS studies we got back:


```r
gwasrapidd::n(my_studies)
#> [1] 9
my_studies@studies$study_id
#> [1] "GCST003097"   "GCST011008"   "GCST007071"   "GCST009873"   "GCST011005"   "GCST011009"   "GCST90029015"
#> [8] "GCST90029016" "GCST90012738"
```

Apparently only 9 studies: GCST003097, GCST011008, GCST007071, GCST009873, GCST011005, GCST011009, GCST90029015, GCST90029016, GCST90012738. Let's see the associated publication titles:


```r
my_studies@publications$title
#> [1] "Meta-analysis of shared genetic architecture across ten pediatric autoimmune diseases."                                                            
#> [2] "Genetic factors underlying the bidirectional relationship between autoimmune and mental disorders - findings from a Danish population-based study."
#> [3] "Leveraging Polygenic Functional Enrichment to Improve GWAS Power."                                                                                 
#> [4] "Meta-analysis of Immunochip data of four autoimmune diseases reveals novel single-disease and cross-phenotype associations."                       
#> [5] "Genetic factors underlying the bidirectional relationship between autoimmune and mental disorders - findings from a Danish population-based study."
#> [6] "Genetic factors underlying the bidirectional relationship between autoimmune and mental disorders - findings from a Danish population-based study."
#> [7] "Mixed-model association for biobank-scale datasets."                                                                                               
#> [8] "Mixed-model association for biobank-scale datasets."                                                                                               
#> [9] "Clinical laboratory test-wide association scan of polygenic scores identifies biomarkers of complex disease."
```

If you want to further inspect these publications, you can quickly browse the respective PubMed entries:

``` r
# This launches your web browser at https://www.ncbi.nlm.nih.gov/pubmed/26301688
open_in_pubmed(my_studies@publications$pubmed_id)
```

Now if we want to know the variants previously associated with autoimmune disease, as used by @Light2014, we need to retrieve statistical association information on these variants, and then filter them based on the same level of significance $P < 1\times 10^{-6}$ [@Light2014].

So let's start by getting the associations by `study_id`:


```r
# You could have also used get_associations(efo_trait = 'autoimmune disease')
my_associations <- get_associations(study_id = my_studies@studies$study_id)
```

Seemingly, there are 182 associations.


```r
gwasrapidd::n(my_associations)
#> [1] 182
```

However, not all variants meet the level of significance, as required by @Light2014:


```r
# Get association ids for which pvalue is less than 1e-6.
dplyr::filter(my_associations@associations, pvalue < 1e-6) %>% # Filter by p-value
  tidyr::drop_na(pvalue) %>%
  dplyr::pull(association_id) -> association_ids # Extract column association_id
```

Here we subset the `my_associations` object by a vector of association identifiers (`association_ids`) into a smaller object, `my_associations2`:


```r
# Extract associations by association id
my_associations2 <- my_associations[association_ids]
gwasrapidd::n(my_associations2)
#> [1] 180
```

Of the 182 associations found in GWAS Catalog, 180 meet the p-value threshold of $1\times 10^{-6}$. Here are the variants, and their respective risk allele and risk frequency:


```r
my_associations2@risk_alleles[c('variant_id', 'risk_allele', 'risk_frequency')]
#> # A tibble: 180 × 3
#>    variant_id risk_allele risk_frequency
#>    <chr>      <chr>                <dbl>
#>  1 rs11580078 G                     0.43
#>  2 rs6679677  A                     0.09
#>  3 rs34884278 C                     0.3 
#>  4 rs6689858  C                     0.29
#>  5 rs2075184  T                     0.23
#>  6 rs36001488 C                     0.48
#>  7 rs4676410  A                     0.19
#>  8 rs4625     G                     0.31
#>  9 rs62324212 A                     0.42
#> 10 rs7725052  C                     0.43
#> # … with 170 more rows
```

## References