--- title: "tidypopgen" output: rmarkdown::html_vignette #pdf_document vignette: > %\VignetteIndexEntry{tidypopgen} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` This vignette gives a short introduction on how to use the `tidypopgen` package to perform basic data cleaning and a PCA on a dataset of European lobsters (*Homarus gammarus*). The original data are available at https://datadryad.org/dataset/doi:10.5061/dryad.2v1kr38, but for the purpose of this vignette we will use a subset of the data stored as a .bed file, available in the `inst/extdata/lobster` directory of the package. To install `tidypopgen` from r-universe, use: ```{r eval = FALSE} install.packages("tidypopgen", repos = c( "https://evolecolgroup.r-universe.dev", "https://cloud.r-project.org" ) ) ``` Next, load the `tidypopgen` package and the `ggplot2` package for plotting. ```{r} library(tidypopgen) library(ggplot2) ``` # Creating a `gen_tibble` In `tidypopgen`, we represent data as a `gen_tibble`, a subclass of `tibble` containing the columns `id` and `genotypes` for each individual. Genotypes are stored in a compressed format as a File-Backed Matrix that can be easily accessed by functions in `tidypopgen`. The `genotypes` column contains a vector of the row indices of each individual in the File-Backed Matrix, and when printed, the `genotypes` column shows the first two genotypes for each individual. Additionally, if data are loaded from a .bed file with information in the `FID` column, this is treated as population information and is automatically added to the `gen_tibble` as column `population`. As with a normal tibble, this information can be changed, updated, or removed from the `gen_tibble` if needed. `tidypopgen` can also read data form `packedancestry` files and `vcf`. Let's start by creating a `gen_tibble` from the `lobster.bed` file. ```{r} lobsters <- gen_tibble( x = system.file("extdata/lobster/lobster.bed", package = "tidypopgen"), quiet = TRUE, backingfile = tempfile() ) head(lobsters) ``` We can see the structure of the `gen_tibble` above, with the expected columns. In this case, our .bed file contains population information corresponding to the sampling site of each lobster in the dataset. If we want to take a look at the genotypes of our lobsters, we can use the `show_genotypes` function to return a matrix of genotypes, where rows are individuals and columns are loci. This is a big table, so we will just look at the first ten loci for the first 5 individuals: ```{r} lobsters %>% show_genotypes(indiv_indices = 1:5, loci_indices = 1:10) ``` And, similarly, if we want to see information about the loci we can use the `show_loci` function, which returns a tibble with information about each locus. Again this is a big table, so we will use `head()` to only look at the first few: ```{r} head(lobsters %>% show_loci()) ``` Now we have a `gen_tibble` to work with, we can start to clean the data. # Quality control Let's start by checking the quality of the data for each individual lobster in our dataset. We can use the `qc_report_indiv` function to generate a report which contains information about missingness and heterozygosity for each individual. ```{r} indiv_qc_lobsters <- lobsters %>% qc_report_indiv() ``` We can take a look at this data using the `autoplot` function: ```{r, fig.alt = "Scatter plot of missingness proportion and observed heterozygosity for each individual"} autoplot(indiv_qc_lobsters, type = "scatter") ``` We can see that most individuals have low missingness and heterozygosity, but there are a few individuals missing >20% of their genotypes. We can remove these individuals using the `filter` function, specifying to keep only individuals with missingness under 20%. ```{r} lobsters <- lobsters %>% filter(indiv_missingness(genotypes) < 0.2) ``` Now lets check our loci. We can use the `qc_report_loci` function to generate a report of the loci quality. This function will return another `qc_report` object, which contains information about missingness, minor allele frequency, and Hardy-Weinberg Equilibrium for each locus. ```{r} loci_qc_lobsters <- lobsters %>% qc_report_loci() ``` Here, we get a message because the `qc_report_loci` function calculates Hardy-Weinberg equilibrium assuming that all individuals are part of a single population. As our dataset contains multiple lobster populations, we should group our data by population first: ```{r} lobsters <- lobsters %>% group_by(population) loci_qc_lobsters <- lobsters %>% qc_report_loci() ``` That's better. Now, lets take a look at minor allele frequency for all loci: ```{r, fig.alt = "Histogram of minor allele frequency"} autoplot(loci_qc_lobsters, type = "maf") ``` And we can see that we don't have any monomorphic SNPs. Now let's look at missingness. ```{r, fig.alt = "Histogram of the proportion of missing data"} autoplot(loci_qc_lobsters, type = "missing") ``` Our data mostly have low missingness, but we can see that some loci have > 5% missingness across individuals, and we want to remove these individuals. In tidypopgen, there are two functions to subset the loci in a `gen_tibble` object: `select_loci` and `select_loci_if`. The function `select_loci` operates on information about the loci (e.g filtering by chromosome or by rsID), while `select_loci_if` operates on the genotypes at those loci (e.g filtering by minor allele frequency or missingness). In this case, we want to remove loci with >5% missingness, so we can use `select_loci_if` with the `loci_missingness` function, operating on the `genotypes` column of our `gen_tibble`. ```{r} lobsters <- lobsters %>% select_loci_if(loci_missingness(genotypes) < 0.05) ``` Through our filtering, we have removed a few individuals and loci. We should now update the file backing matrix to reflect these changes, using the function `gt_update_backingfile`: ```{r} lobsters <- gt_update_backingfile(lobsters, backingfile = tempfile()) ``` Now our data are clean and the backingfile is updated, we are ready to create a PCA. # Impute First, we need to impute any remaining missing data using the `gt_impute_simple` function. ```{r} lobsters <- gt_impute_simple(lobsters, method = "random") ``` # PCA Then we can run a PCA. There are a number of PCA algorithms, here, we will use the `gt_pca_partialSVD` function: ```{r} partial_pca <- gt_pca_partialSVD(lobsters) ``` And we can create a simple plot using `autoplot`: ```{r, fig.alt = "Score plot of individuals across the first and second Principal Components"} autoplot(partial_pca, type = "scores") ``` That was easy! The `autoplot` gives us a quick idea of the explained variance and rough distribution of the samples, but we need to see the different populations within our dataset. For a quick overview, we could add an aesthetic to our plot: ```{r, fig.alt = "Score plot of individuals across the first and second Principal Components, with individual samples coloured by population"} autoplot(partial_pca, type = "scores") + aes(color = lobsters$population) + labs(color = "population") ``` However, if we want to fully customise our plot, we can wrangle the data directly and use `ggplot2`. # Plot with ggplot2 For a customised plot, we can extract the information on the scores of each individual using the `augment` method for `gt_pca`. ```{r} pcs <- augment(x = partial_pca, data = lobsters) ``` Then we can extract the eigenvalues for each principal component with the `tidy` function, using the "eigenvalues" argument: ```{r} eigenvalues <- tidy(partial_pca, "eigenvalues") xlab <- paste("Axis 1 (", round(eigenvalues[1, 3], 1), " %)", sep = "" ) ylab <- paste("Axis 2 (", round(eigenvalues[2, 3], 1), " %)", sep = "" ) ``` And finally plot: ```{r, fig.alt = "Score plot of individuals across the first and second Principal Components, with individual samples coloured by population"} ggplot(data = pcs, aes(x = .fittedPC1, y = .fittedPC2)) + geom_hline(yintercept = 0) + geom_vline(xintercept = 0) + geom_point(aes(fill = population), shape = 21, size = 3, show.legend = FALSE ) + scale_fill_distruct() + labs(x = xlab, y = ylab) + ggtitle("Lobster PCA") ``` Or we could create a labelled version of our PCA by determining the centre of each group to place the labels ```{r} # Calculate centre for each population centroid <- aggregate(cbind(.fittedPC1, .fittedPC2, .fittedPC3) ~ population, data = pcs, FUN = mean ) # Add these coordinates to our augmented pca object pcs <- left_join(pcs, centroid, by = "population", suffix = c("", ".cen")) ``` And then add labels to the plot: ```{r, fig.alt = "Score plot of individuals across the first and second Principal Components. Individual samples are coloured by population, the centroid of points for each group is labelled with the population name, and lines are drawn from the centroid to each point."} ggplot(data = pcs, aes(x = .fittedPC1, y = .fittedPC2)) + geom_hline(yintercept = 0) + geom_vline(xintercept = 0) + geom_segment(aes(xend = .fittedPC1.cen, yend = .fittedPC2.cen), show.legend = FALSE ) + geom_point(aes(fill = population), shape = 21, size = 3, show.legend = FALSE ) + geom_label( data = centroid, aes(label = population, fill = population), size = 4, show.legend = FALSE ) + scale_fill_distruct() + labs(x = xlab, y = ylab) + ggtitle("Lobster PCA") ``` That's it! There are more functions to run different types of analyses (DAPC, ADMIXTURE, f statistics with admixtools2, etc.); each resulting object has `autoplot`, `tidy`, and `augment` to explore the results and integrated into the information from the `gen_tibble`.