---
title: "PLINK cheatsheet"
output: rmarkdown::html_vignette
        #pdf_document
vignette: >
  %\VignetteIndexEntry{PLINK cheatsheet}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r, include = FALSE}
library(tidypopgen)
# Create gen_tibble for examples
bed_path <- system.file("extdata/pop_a.bed", package = "tidypopgen")
bigsnp_path <- bigsnpr::snp_readBed(bed_path, backingfile = tempfile())
data <- gen_tibble(bigsnp_path, quiet = TRUE)
```

The following flags and corresponding tidypopgen functions are based on 
plink version 1.9.

## File management and reading data:

| PLINK 1.9                        | `tidypopgen`                                             |
|----------------------------------|----------------------------------------------------------|
| --make-bed --out                 | `gt_as_plink(data, file = my_file, type = "bed")`        |
| --recode                         | `gt_as_plink(data, file = my_file, type = "ped")`        |
| --recode vcf                     | `gt_as_vcf(data, file = my_file)`                        |
| --allele1234 and --alleleACGT    | See `gen_tibble()` parameter 'valid_alleles'             |

PLINK flags --update-alleles, --allele1234, and --alleleACGT, all alter the coding of alleles. In `tidypopgen`, valid alleles are supplied when reading in a `gen_tibble`.

## Quality control:

| PLINK              | `tidypopgen`                    |
|--------------------|---------------------------------|
| --maf              | `data %>% loci_maf()`          |
| --geno             | `data %>% loci_missingness()`  |
| --hwe              | `data %>% loci_hwe()`          |
| --freq --a2-allele | `data %>% loci_alt_freq()`     |
| --mind             | `data %>% indiv_missingness()` |
| --het              | `data %>% indiv_het_obs()`     |

To filter out variants in `tidypopgen`, in a similar way to PLINK flags such as --extract or --autosome, it is necessary to use the `gen_tibble` with `select_loci_if()`. For example:

```{r}
data %>% select_loci_if(loci_chromosomes(genotypes) %in% c(1:22))
```

will select autosomal loci in the same way as --autosome. Or alternatively:

```{r}
my_snps <- c("rs4477212", "rs3094315", "rs3131972", "rs12124819", "rs11240777")

data %>%
  select_loci_if(loci_names(genotypes) %in% my_snps) %>%
  show_loci()
```

will select loci from a previously defined set in the same way as --extract.

Similarly, to filter out individuals, as might be performed with --keep in PLINK, requires using filter:

```{r}
my_individuals <- c("GRC14300079", "GRC14300142", "GRC14300159")

data %>% filter(id %in% my_individuals)
```

## Handling linkage

Linkage disequilibrium is managed through clumping in `tidypopgen` with `loci_ld_clump()`.

This option is similar to the --indep-pairwise flag in PLINK, but results in a more even distribution of loci when compared to LD pruning.

To explore why clumping is preferable to pruning, see <https://privefl.github.io/bigsnpr/articles/pruning-vs-clumping.html>

## Quality control for relatedness (KING)

| KING        | `tidypopgen`                |
|-------------|-----------------------------|
| --kinship   | `pairwise_king()`           |
| --distance  | `pairwise_ibs()`            |
| --unrelated | `filter_high_relatedness()` |

`pairwise_king()` implements the KING-robust estimator of kinship, equivalent to --kinship in KING. To remove related individuals, the user can pass the kinship matrix of `pairwise_king()` and a relatedness threshold (a numeric KING kinship coefficient) to `filter_high_relatedness()`, which will return the largest possible set of individuals with no relationships above the threshold.

`pairwise_king()` also forms part of `qc_report_indiv()` through the parameter `kings_threshold`. To remove related individuals in `qc_report_indiv()`, the user can pass either a relatedness threshold, or a string of either "first" or "second" to remove any first degree or second degree relationships from the dataset. This second option is similar to using  --unrelated --degree 1 or  --unrelated --degree 2 in KING. `qc_report_indiv()` returns a dataframe including columns 'id' and 'to_keep', showing the ID of individuals and a logical column of whether the individual should be removed to retain the largest possible set of individuals with no relationships above the threshold.


## Merging datasets:

| PLINK       | `tidypopgen`                          |
|-------------|---------------------------------------|
| --bmerge    | `rbind()`                             |
| --flip-scan | `rbind_dry_run()`                     |
| --flip      | use 'flip_strand = TRUE' in `rbind()` |

In PLINK, data merging can fail due to strand inconsistencies that are not addressed prior to merging. PLINK documentation suggests to users to try a 'trial flip' of data to address this, and then to 'unflip' any errors that remain. In `tidypopgen`, when data are merged with rbind, strand inconsistencies are identified and automatically flipped, avoiding multiple rounds of flipping before merging.

PLINK does allow users to identify inconsistencies prior to merging with --flip-scan, and this functionality is included in the `tidypopgen` rbind_dry_run(). rbind_dry_run() reports the numeric overlap of datasets, alongside the number of SNPs to 'flip' in the new target dataset, as well as the number of ambiguous SNPs.

Data are only merged one set at a time, there is no equivalent to --merge-list.

## Analysis:

| PLINK     | `tidypopgen`                           |
|-----------|----------------------------------------|
| --pca     | See `gt_pca()` for pca options         |
| --fst     | `pairwise_pop_fst()` with `group_by()` |
| --homozyg | `windows_indiv_roh()`                      |