Coalescent simulation refers to the idea of simulating the evolution of biological sequences like DNA by tracing their ancestry back in time. The coala package is an interface for calling a number of commonly used coalescent simulators from R. It should be able to use the simulator scrm out of the box. Other simulators need to be installed separately and must explicitly be activated. This is described in the installation vignette:
vignette("coala-install", package = "coala")In this introduction we’ll stick to using scrm.
In order to conduct simulations, we first need to specify a model of evolution that is used for the simulations. The function coal_model creates a basic coalescent model:
library(coala)
model <- coal_model(sample_size = 3, loci_number = 1)This creates a basic model with one population of constant size. One genetic locus for three haplotypes is sampled from the population.
Printing the model gives a short summary of this content:
model## Features: Sampling of 3 (pop 1) individuals with ploidy 1 at time par(0) 
## 
## Parameter: None
## 
## Loci: 1 locus of length 1000 
## 
## Summary Statistics: None
## 
## Simulator: scrm 
## Command: scrm 3 1Models consist of
In order to simulate the model we need to add a few more features and at least one summary statistic.
For simulating sequences, we need to add mutations to the model. To do so, we create a corresponding feature using feat_mutation and add it to the existing model using the plus operator:
model <- model + feat_mutation(rate = 1, model = "IFS")
model## Features: 
## * Sampling of 3 (pop 1) individuals with ploidy 1 at time par(0) 
## * Mutations with rate `` following a IFS mutation model
## 
## Parameter: None
## 
## Loci: 1 locus of length 1000 
## 
## Summary Statistics: None
## 
## Simulator: scrm 
## Command: scrm 3 1 -t 1Now, mutations occur with rate 1 and according to an infinite-sites mutation model (IFS). Details on both rate and mutation model are given the help page of the feature (?feat_mutation).
Coala currently support the features
## [1] "feat_growth"        "feat_migration"     "feat_mutation"     
## [4] "feat_outgroup"      "feat_pop_merge"     "feat_recombination"
## [7] "feat_selection"     "feat_size_change"   "feat_unphased"However, not all combination of all features might be possible. Please refer to the features help pages for detailed information.
If we a model consisting of multiple populations, we need to state the sample sizes as a vector of sample sizes for the different populations. The lines
model <- coal_model(sample_size = c(5, 2, 0), loci_number = 1) +
  feat_migration(rate = 0.5, symmetric = TRUE) +
  feat_pop_merge(0.5, 3, 2) +
  feat_pop_merge(0.8, 2, 1)create a model of three populations, with a symmetric migration rate of 0.5 between them. View backwards in time, population 3 merges into population 2 0.5 coalescent time units in the past and population 2 into population 1 0.3 time units further into the past. Looking forwards in time, this represents two speciation events with migration going on afterwards. At time 0 five haploids of population 1 and two ones of population 2 are sampled. Please note that sample sizes for all populations must be given, even if no haploid is sampled from a population, as it is the case for population 3 here.
Adding summary statistics works in a similar fashion as adding features:
model <- coal_model(3, 1) +
  feat_mutation(rate = 1) +
  sumstat_seg_sites()
model## Features: 
## * Sampling of 3 (pop 1) individuals with ploidy 1 at time par(0) 
## * Mutations with rate `` following a IFS mutation model
## * Generating Seg. Sites
## 
## Parameter: None
## 
## Loci: 1 locus of length 1000 
## 
## Summary Statistics: stat_segsites 
## 
## Simulator: scrm 
## Command: scrm 3 1 -t 1This adds the segregating sites summary statistic to the model, which is a basic summary statistic in population genetics. Again, refer to ?sumstat_seg_sites for details.
Available summary statistics are:
##  [1] "sumstat_class"          "sumstat_dna"           
##  [3] "sumstat_file"           "sumstat_four_gamete"   
##  [5] "sumstat_ihh"            "sumstat_jsfs"          
##  [7] "sumstat_mcmf"           "sumstat_nucleotide_div"
##  [9] "sumstat_omega"          "sumstat_seg_sites"     
## [11] "sumstat_sfs"            "sumstat_tajimas_d"     
## [13] "sumstat_trees"We now can simulate the model. The printed output of a model contains information which program will be used for the simulation and which arguments will be used. As coala is in a early stage, please make sure to always check both.
The function simulate will call the program with the printed options, parse its output and calculated the added summary statistics:
set.seed(123)
sumstats <- simulate(model)The returned object sumstats a list, in which each entry corresponds to one summary statistic. As there is only one summary statistic in our model, it has only one entry:
names(sumstats)## [1] "pars"      "cmds"      "simulator" "seg_sites"The structure in sumstats$seg_sites is given by the segregating sites statistic. It is again a list, where each entry represents one locus. For each locus, it contains a matrix as specified in ?sumstat_seg_sites:
sumstats$seg_sites[[1]]##      0.1041606 0.4223301
## [1,]         0         0
## [2,]         1         1
## [3,]         1         1If we want to have more loci in a model, we can add them using the locus_ functions. The most basic option is to add an additional locus with a different length:
model <- model + locus_single(500)
model## Features: 
## * Sampling of 3 (pop 1) individuals with ploidy 1 at time par(0) 
## * Mutations with rate `` following a IFS mutation model
## * Generating Seg. Sites
## 
## Parameter: None
## 
## Loci: 
## * 1 locus of length 1000 
## * 1 locus of length 500 
## 
## Summary Statistics: stat_segsites 
## 
## Simulator: scrm 
## Command: scrm 3 1 -t 1Now the model consists of two loci, the first with length 1000, the second with 500. Simulation now produces a segregating sites list with two entries corresponding to the loci:
sumstats <- simulate(model)
sumstats$seg_sites[[1]]##      0.3144767 0.4034762
## [1,]         0         0
## [2,]         0         0
## [3,]         1         1sumstats$seg_sites[[2]]##      0.919158
## [1,]        1
## [2,]        0
## [3,]        0Another possibility is to add multiple loci with the same length using locus_averaged, which gives better performance than adding the loci one by one. For example
model <- model + locus_averaged(2, 750)
sumstats <- simulate(model)
length(sumstats$seg_sites)## [1] 4adds two more loci with length of 750bp to the model.
So far, we have used a model without parameters that can vary between simulations. In particular for fitting a model to data via ABC or Jaatha it is useful to such parameters so that you don’t need to create a new model for each simulation.
Named parameters values can be specified in the simulation command. If we want to simulate a model for different values of the mutation rate, we can use a named parameter:
model <- coal_model(5, 1) +
  feat_mutation(rate = par_named("theta")) +
  sumstat_seg_sites()
sumstats <- simulate(model, pars=c(theta=2.5))A parameter distributed according to a prior can be specified using the par_prior function. The functions first argument is a name for the parameter, the second an expression that when evaluated produces a sample from the prior distribution.
So if we want the mutation to follow a uniform distribution between 0 and 10, we can use:
model <- coal_model(5, 1) +
  feat_mutation(rate = par_prior("theta", runif(1, 0, 10))) +
  sumstat_seg_sites()
sumstats <- simulate(model)
sumstats$pars##  theta 
## 3.0772sumstats2 <- simulate(model)
sumstats2$pars##    theta 
## 6.499852For Jaatha, you need to give a range of possible values for each parameters. This is done using par_range. For instance
model <- coal_model(5, 1) +
  feat_mutation(rate = par_range("theta", 0.1, 5)) +
  sumstat_seg_sites()sets a possible range from 0.1 to 5 for the mutation rate. The actual rate is given in simulate just as with named parameters.
Finally, there is a very powerful type of parameters generated with par_expr. Similar to parameters with priors, the value of the parameter is given as an R expression, which is evaluated before simulation. Different to par_prior, this expression can contain other named parameters. For example
model <- coal_model(4, 2) +
  feat_mutation(rate = par_named("theta")) +
  feat_recombination(rate = par_expr(theta * 2))creates a model with a a recombination rate that always is twice as high as the mutation rate.