\newpage
This vignette describes how the R package rehh can be applied to perform whole genome scans for footprints of selection using statistics related to the Extended Haplotype Homozygosity (EHH) [@Sabeti2002].
The current implementation of the package needs biallelic genetic markers as input. Typically, albeit not necessarily, these will be SNPs.
The rehh package is currently available for most platforms (Linux, MS Windows and MacOSX) from the CRAN repository https://cran.r-project.org/ and may be installed using a standard procedure. Once the package has been successfully installed on your system, it can be loaded by:
library(rehh)
#Input Files {#make.example.files}
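The package rehh requires two input files:

- a haplotype data file (see \ref{hap.file});
- a SNP information file (see \ref{map.file}).
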
Important Note: For a given chromosome, SNPs are assumed to be ordered in the same way in the haplotype and SNP information files.
For illustration purposes, example files that originate from a previously published study on the Creole cattle breed from Guadeloupe (CGU) [@Gautier2011] are provided in the package and can be copied into the working directory with the command:
make.example.files()
Throughout this vignette, this command is assumed to have been run so that the example files are in the working directory.
##Haplotype data file {#hap.file}
Three haplotype input file formats are supported:

- the standard format, as in the example file bta12_cgu.hap containing 280 haplotypes (identifier 1 to 280) with 1424 SNPs each (see section \ref{LoadDataEx1});
- the transposed format, as in the example file bta12_cgu.thap. This format is similar to the one produced by the phasing program SHAPEIT2 [@OConnell2014] (see section \ref{LoadDataEx2});
- the fastPHASE output format, as in the bta12_hapguess_switch.out example file. Note that this file format can contain haplotypes from several populations (i.e., if the -u fastPHASE option was used) (see section \ref{LoadDataEx3}).

By default alleles are assumed to be coded as 0 (missing data), 1 (ancestral allele) or 2 (derived allele). Other encodings can be recoded into this format using the SNP information data file (see \ref{map.file}) and the recode.allele option of the function data2haplohh() (see \ref{LoadData}).
##SNP information file {#map.file}
The SNP information file should contain columns without header as in the map.inp example:
head(read.table("map.inp"))
>         V1 V2     V3 V4 V5
> 1 F0100190  1 113642  T  A
> 2 F0100220  1 244699  C  G
> 3 F0100250  1 369419  G  C
> 4 F0100270  1 447278  A  T
> 5 F0100280  1 487654  T  A
> 6 F0100290  1 524507  C  G
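For each SNP, the five columns correspond to:

- the SNP name;
- the chromosome;
- the position on the chromosome;
- the ancestral allele;
- the derived allele.
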
The fourth and fifth columns (allele coding) should always be filled in, although the corresponding information is only relevant when the recode.allele option of the function data2haplohh() is used (see \ref{LoadData}). In that case, the allele encoding in the haplotype file is replaced by 1 (ancestral) if the allele is found in the fourth column of the info file, 2 (derived) if it is found in the fifth column and 0 (missing data) otherwise. The information about ancestral or derived status is relevant only for within-population tests (based on iHS). If one is interested exclusively in cross-population tests (based on Rsb or XP-EHH), the two alleles may be assigned to the fourth and fifth columns at random.
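The recoding rule described above can be sketched in base R. This is only an illustration of the rule, not the code used by data2haplohh(); the helper name recode_alleles and the toy data are hypothetical.

```r
# Hypothetical helper illustrating the recoding rule; not part of rehh.
# hap: character matrix (haplotypes in rows, SNPs in columns)
# anc, der: ancestral and derived allele of each SNP (fourth and fifth column)
recode_alleles <- function(hap, anc, der) {
  out <- matrix(0L, nrow(hap), ncol(hap), dimnames = dimnames(hap))
  for (j in seq_len(ncol(hap))) {
    out[hap[, j] == anc[j], j] <- 1L  # allele listed as ancestral
    out[hap[, j] == der[j], j] <- 2L  # allele listed as derived
  }
  out                                 # any other symbol stays 0 (missing data)
}

# Toy data: 3 haplotypes (rows), 2 SNPs (columns); "N" matches neither allele
hap <- matrix(c("T", "A", "N",
                "C", "C", "G"), nrow = 3)
recoded <- recode_alleles(hap, anc = c("T", "C"), der = c("A", "G"))
print(recoded)
```

In this toy example the first SNP is recoded to 1, 2, 0 and the second to 1, 1, 2.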
##Loading data files {#LoadData}
The data2haplohh() function converts data files into an R object of class haplohh subsequently used by several functions of the rehh package. The following main options are available to recode alleles or select SNPs (based on Minor Allele Frequency or percentage of missing data) and haplotypes (based on percentage of missing data): 

- recode.allele: if recode.allele=TRUE, the function recodes the haplotypes. The allele coding of the haplotype file is replaced by 0 (missing data), 1 (ancestral allele) or 2 (derived allele) according to whether the allele is given in the fourth or fifth column of the corresponding row of the SNP information file.
- min_perc_geno.hap: discard haplotypes with less than min_perc_geno.hap percent of SNPs genotyped. By default min_perc_geno.hap=100 meaning that only completely phased haplotypes are retained.
- min_perc_geno.snp: discard SNPs genotyped on less than min_perc_geno.snp percent of haplotypes. By default min_perc_geno.snp=100 meaning that only fully genotyped SNPs are retained.
- min_maf: discard SNPs with a minor allele frequency below min_maf. This is generally not recommended and by default min_maf=0 meaning that all SNPs are retained.\footnote{The arguments \texttt{min\_perc\_geno.hap}, \texttt{min\_perc\_geno.snp} and \texttt{min\_maf} are evaluated in this order.}

More details about the different arguments of the function are available in the documentation accessible using the command:
?data2haplohh
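The order in which the three filters are evaluated matters, as the following base R sketch illustrates. It is a simplified stand-in, not the rehh implementation: the helper name filter_sketch is hypothetical, the thresholds are assumed to be inclusive, and every retained SNP is assumed to be genotyped on at least one haplotype.

```r
# Hypothetical base R sketch of the three filters; not the rehh implementation.
# hap: numeric matrix coded 0 (missing), 1 or 2, with haplotypes in rows.
# Assumption: thresholds are inclusive, i.e. items reaching them are kept.
filter_sketch <- function(hap, min_perc_geno.hap = 100,
                          min_perc_geno.snp = 100, min_maf = 0) {
  # 1. keep haplotypes with enough genotyped SNPs
  hap <- hap[100 * rowMeans(hap != 0) >= min_perc_geno.hap, , drop = FALSE]
  # 2. keep SNPs genotyped on enough of the remaining haplotypes
  hap <- hap[, 100 * colMeans(hap != 0) >= min_perc_geno.snp, drop = FALSE]
  # 3. keep SNPs whose minor allele frequency reaches min_maf
  freq1 <- colSums(hap == 1) / colSums(hap != 0)
  hap[, pmin(freq1, 1 - freq1) >= min_maf, drop = FALSE]
}

hap <- rbind(c(1, 2, 1),
             c(2, 2, 1),
             c(1, 1, 2),
             c(0, 1, 2))   # the last haplotype has one missing allele
strict  <- filter_sketch(hap)                          # default thresholds
lenient <- filter_sketch(hap, min_perc_geno.hap = 50)  # keeps haplotype 4
```

With the default thresholds the incomplete haplotype is removed first, so all three SNPs survive; with the relaxed haplotype threshold the haplotype is kept and the first SNP is removed instead.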
###Example 1: reading haplotype file in standard format {#LoadDataEx1}
In this example, the haplotype input file bta12_cgu.hap (in the standard haplotype format) and the SNP information input file map.inp are converted into a haplohh object named hap. Because the map file contains information for SNPs mapping to multiple chromosomes, we have to specify that the haplotype input file refers to chromosome 12 by setting the option chr.name=12. Allele recoding is activated (recode.allele=TRUE) to recode the alleles, given in the haplotype file as nucleotides, to 0, 1 or 2.
hap<-data2haplohh(hap_file="bta12_cgu.hap",map_file="map.inp",
                  recode.allele=TRUE,chr.name=12)
> Map file seems OK: 1424  SNPs declared for chromosome 12 
> Standard rehh input file assumed
> Alleles are being recoded according to map file as:
>   0 (missing data), 1 (ancestral allele) or 2 (derived allele)
> Discard Haplotype with less than  100 % of genotyped SNPs
> No haplotype discarded
> Discard SNPs genotyped on less than  100 % of haplotypes
> No SNP discarded
> Data consists of 280 haplotypes and 1424 SNPs
If no value is specified for the chr.name argument and more than one chromosome is detected in the map file, the function asks interactively which chromosome to choose:
hap<-data2haplohh(hap_file="bta12_cgu.hap",map_file="map.inp",
                  recode.allele=TRUE) 
> More than one chromosome name in Map file: map.inp
> Which chromosome should be considered among:
> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
> 1:
12
> Map file seems OK: 1424  SNPs declared for chromosome 12 
> Standard rehh input file assumed
> Alleles are being recoded according to map file as:
>   0 (missing data), 1 (ancestral allele) or 2 (derived allele)
> Discard Haplotype with less than  100 % of genotyped SNPs
> No haplotype discarded
> Discard SNPs genotyped on less than  100 % of haplotypes
> No SNP discarded
> Data consists of 280 haplotypes and 1424 SNPs
Finally, as an example of an error message, the following message is printed if the number of SNPs for the chromosome in the info file does not correspond to the one in the haplotype file (for instance when a wrong chromosome is specified):
hap<-data2haplohh(hap_file="bta12_cgu.hap",map_file="map.inp",
                  recode.allele=TRUE,chr.name=18)
> Map file seems OK: 1123  SNPs declared for chromosome 18 
> Standard rehh input file assumed
> The number of snp in the haplotypes 1424  is not equal
> to the number of snps declared in the map file 1123
> Error in data2haplohh(hap_file = "bta12_cgu.hap", map_file = "map.inp", : Conversion stopped
###Example 2: reading haplotype file in transposed format (SHAPEIT2-like) {#LoadDataEx2}
In this example, the haplotype input file bta12_cgu.thap (transposed format) and the SNP information input file map.inp are converted to a haplohh object named hap. Setting haplotype.in.columns=TRUE informs the function that the haplotype file is in transposed format:
hap<-data2haplohh(hap_file="bta12_cgu.thap",map_file="map.inp",haplotype.in.columns=TRUE,
                  recode.allele=TRUE,chr.name=12)
> Map file seems OK: 1424  SNPs declared for chromosome 12 
> Haplotype are in columns with no header
> Alleles are being recoded according to map file as:
>   0 (missing data), 1 (ancestral allele) or 2 (derived allele)
> Discard Haplotype with less than  100 % of genotyped SNPs
> No haplotype discarded
> Discard SNPs genotyped on less than  100 % of haplotypes
> No SNP discarded
> Data consists of 280 haplotypes and 1424 SNPs
###Example 3: reading haplotype file in fastPHASE output format {#LoadDataEx3}
In this example, the fastPHASE output file bta12_hapguess_switch.out and the SNP information input file map.inp are 
converted into a haplohh object named hap. As explained above we use the option chr.name=12. 
Because haplotypes originated here from several populations (the -u fastPHASE option was used), we specify the population of interest 
(in our example the 280 haplotypes from the CGU population, see above) using the option popsel=7
(7 corresponding to the code of CGU in the example fastPHASE input file).
hap<-data2haplohh(hap_file="bta12_hapguess_switch.out",map_file="map.inp",
                  recode.allele=TRUE,popsel=7,chr.name=12)
> Map file seems OK: 1424  SNPs declared for chromosome 12 
> Looks like a FastPHASE haplotype file
> Haplotypes originate from  8  different populations in the fastPhase output file
> Alleles are being recoded according to map file as:
>   0 (missing data), 1 (ancestral allele) or 2 (derived allele)
> Discard Haplotype with less than  100 % of genotyped SNPs
> No haplotype discarded
> Discard SNPs genotyped on less than  100 % of haplotypes
> No SNP discarded
> Data consists of 280 haplotypes and 1424 SNPs
If no value is specified for the popsel argument and more than one population is detected in the fastPHASE output file, the function asks interactively which population to choose:
hap<-data2haplohh(hap_file="bta12_hapguess_switch.out",map_file="map.inp",
                  recode.allele=TRUE,chr.name=12)
> Map file seems OK: 1424  SNPs declared for chromosome 12
> Looks like a FastPHASE haplotype file
> Haplotypes originate from  8  different populations in the fastPhase output file
> Chosen pop. is not in the list of pop. number: 1 2 3 4 5 6 7 8
> Which population should be considered among: 1 2 3 4 5 6 7 8
> 1:
7
> Map file seems OK: 1424  SNPs declared for chromosome 12 
> Looks like a FastPHASE haplotype file
> Haplotypes originate from  8  different populations in the fastPhase output file
> Alleles are being recoded according to map file as:
>   0 (missing data), 1 (ancestral allele) or 2 (derived allele)
> Discard Haplotype with less than  100 % of genotyped SNPs
> No haplotype discarded
> Discard SNPs genotyped on less than  100 % of haplotypes
> No SNP discarded
> Data consists of 280 haplotypes and 1424 SNPs
#Computing EHH, EHHS and their “integrals” iHH and iES: the calc_ehh(), calc_ehhs() and scan_hh() functions
##Definition and computation
###The (allele-specific) Extended Haplotype Homozygosity (EHH) {#EHH}
For a given core allele (either ancestral or derived) at a focal SNP, the (allele-specific) extended haplotype homozygosity (EHH) is defined as the probability that two randomly chosen chromosomes (carrying the core allele considered) are identical by descent (as assayed by homozygosity at all SNPs) over a given surrounding chromosomal region [@Sabeti2002]. The EHH aims to measure the extent to which an extended haplotype is transmitted without mutation and recombination. It is computed for a given core allele \(a\) (ancestral or derived) of a focal SNP \(s\) over the chromosomal stretch extending to some SNP \(t\):
\begin{equation} \label{ehh.def} \mathrm{EHH}^a_{s,t}=\frac{1}{n_a(n_a-1)}\sum\limits_{k=1}^{K^a_{s,t}}n_k(n_k-1) \end{equation}
where \(n_a\) represents the number of chromosomes carrying the core allele \(a\), \(K^a_{s,t}\) represents the number of different extended haplotypes that can be discerned among these chromosomes from SNP \(s\) to SNP \(t\), and \(n_k\) refers to the number of chromosomes pertaining to the \(k\)-th such extended haplotype (so that \(n_a=\sum\limits_{k=1}^{K^a_{s,t}}n_k\)).
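The EHH formula can be evaluated by hand on a small haplotype matrix. The following base R sketch is a direct transcription of the definition under the assumption of no missing data; the helper name ehh_allele and the toy data are hypothetical, and this is not how rehh computes EHH internally.

```r
# Minimal sketch of allele-specific EHH; hypothetical helper, not rehh code.
# hap: matrix coded 1/2 without missing data (haplotypes in rows)
# s, t: column indices of the focal SNP and of the SNP delimiting the stretch
# allele: core allele at the focal SNP (1 = ancestral, 2 = derived)
ehh_allele <- function(hap, s, t, allele) {
  carriers <- hap[hap[, s] == allele, s:t, drop = FALSE]
  n_a <- nrow(carriers)
  if (n_a < 2) return(NA_real_)
  # label each distinct extended haplotype and count its copies (n_k)
  n_k <- table(apply(carriers, 1, paste, collapse = ""))
  sum(n_k * (n_k - 1)) / (n_a * (n_a - 1))
}

# Toy data: 6 haplotypes, 3 SNPs; the focal SNP is column 1
hap <- rbind(c(1, 1, 1),
             c(1, 1, 1),
             c(1, 1, 2),
             c(1, 2, 2),
             c(2, 1, 1),
             c(2, 1, 1))
ehh_anc <- sapply(1:3, function(t) ehh_allele(hap, 1, t, allele = 1))
ehh_der <- sapply(1:3, function(t) ehh_allele(hap, 1, t, allele = 2))
```

In this toy example the EHH of the ancestral allele decays from 1 to 1/2 and then to 1/6, whereas the two derived chromosomes remain identical, so their EHH stays at 1.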
###The integrated (allele-specific) EHH (iHH) {#iHH}
By definition, irrespective of the allele considered, EHH starts at 1 and decays monotonically to 0 with increasing distance from the focal SNP. For a given core allele, the integrated EHH (iHH) is defined as the area under the EHH curve with respect to map position [@Voight2006]\footnote{In their seminal paper, Voight et al. considered genetic distances and applied a penalty (proportional to physical distances) for successive SNPs separated by more than 20 kb. In addition, they did not compute iHH if any physical distance between a pair of neighboring SNPs was above 200 kb. Only the latter can be specified in rehh.}. In rehh, iHH is computed using the trapezoid method. In practice, the integral is often truncated when the value of EHH reaches a certain lower threshold, e.g. 0.05.
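The trapezoid rule with truncation can be sketched in a few lines of base R. The helper name ihh_trapezoid is hypothetical, and the sketch assumes that only SNPs with EHH at or above the threshold enter the integral; how rehh treats the last interval before truncation is not specified here.

```r
# Sketch of iHH as a truncated trapezoid integral; hypothetical helper.
# pos: map positions in increasing order; ehh: EHH values at those positions
# lim: lower threshold below which EHH is no longer integrated (assumption:
#      positions with EHH below lim are simply left out)
ihh_trapezoid <- function(pos, ehh, lim = 0.05) {
  keep <- ehh >= lim
  pos <- pos[keep]
  ehh <- ehh[keep]
  # area of each trapezoid: interval width times mean of the two EHH values
  sum(diff(pos) * (head(ehh, -1) + tail(ehh, -1)) / 2)
}

pos <- c(100, 200, 300, 400, 500)    # toy positions, focal SNP at 300
ehh <- c(0.02, 0.5, 1, 0.6, 0.04)    # toy EHH values
ihh <- ihh_trapezoid(pos, ehh)
```

Here the two outermost SNPs fall below the threshold, so the integral covers positions 200 to 400 and equals 75 + 80 = 155.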
###The (site-specific) Extended Haplotype Homozygosity (EHHS) {#EHHS}
An extended homozygosity can be defined for the whole set of chromosomes of a sample. In this case, the quantity is meant to reflect the probability that any two randomly chosen chromosomes are identical by descent over a given chromosomal region surrounding a focal SNP. In contrast to the allele-specific EHH defined above, the chromosomes are not divided with respect to their allele at the focal SNP. In order to distinguish this quantity from that defined in the previous section, we adopt the naming by [@Tang2007] as site-specific EHH, abbreviated by EHHS. Note, however, that this quantity is sometimes referred to as EHH, too, and there is no agreed notation in the literature.
EHHS was used in genome scans in two versions: un-normalized by [@Sabeti2007] and normalized by [@Tang2007].
In line with [@Sabeti2007] we define
\begin{equation} \label{ehhssab.def} \mathrm{EHHS}^{\text{Sab}}_{s,t}=\frac{1}{n_s(n_s-1)}\left(\sum\limits_{k=1}^{K_{s,t}}n_k(n_k-1)\right) \end{equation}
where we re-use notation from above and let \(n_s\) refer to the number of chromosomes at SNP \(s\). If there are no missing values at that SNP, this is simply the number of chromosomes in the sample.
[@Tang2007] proposed an apparently different estimator for the normalized EHHS, namely
\begin{equation} \label{ehhstang.def} \mathrm{EHHS}^{\text{Tang}}_{s,t}=\frac{1-h_{s,t}}{1-h_s} \end{equation}
where:

- \(h_s=\frac{n_s}{n_s-1}\left(1-\frac{1}{n_s^2}\left(\sum\limits_{k=1}^{K_{s,s}}n_k^2 \right )\right)=\frac{n_s}{n_s-1}\left(1-\frac{1}{n_s^2}\left(n_{a1}^2+n_{a2}^2\right) \right)\) is an estimator of the focal SNP heterozygosity, with \(n_{a1}\) and \(n_{a2}\) referring to the numbers of chromosomes carrying each of the two alleles at SNP \(s\) (\(n_{a1}+n_{a2}=n_s\));
- \(h_{s,t}=\frac{n_s}{n_s-1}\left(1-\frac{1}{n_s^2}\left(\sum\limits_{k=1}^{K_{s,t}}n_k^2 \right )\right)\) is an estimator of haplotype heterozygosity across the chromosome region extending from SNP \(s\) to SNP \(t\).

However, both definitions are in fact equivalent, because \(\mathrm{EHHS}^{\text{Sab}}_{s,t}=1-h_{s,t}\) and hence
\begin{equation} \mathrm{EHHS}^{\text{Tang}}_{s,t}=\frac{\mathrm{EHHS}^{\text{Sab}}_{s,t}}{\mathrm{EHHS}^{\text{Sab}}_{s,s}}\;. \end{equation}
Thus \(\mathrm{EHHS}^{\text{Tang}}_{s,t}\) is just normalized in order to yield 1 at the focal SNP \(s\). Note that the normalization factor depends on the frequency of the two alleles at the focal SNP and consequently is not constant over the whole data set.
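For completeness, the identity between the un-normalized EHHS and \(1-h_{s,t}\) can be checked directly from the definitions. The short derivation below only uses \(\sum_{k}n_k=n_s\), i.e. it assumes that there are no missing data between SNP \(s\) and SNP \(t\):

```latex
\begin{align*}
1-h_{s,t}
  &= 1-\frac{n_s}{n_s-1}+\frac{1}{n_s(n_s-1)}\sum_{k=1}^{K_{s,t}}n_k^2 \\
  &= \frac{1}{n_s(n_s-1)}\left(\sum_{k=1}^{K_{s,t}}n_k^2-n_s\right) \\
  &= \frac{1}{n_s(n_s-1)}\sum_{k=1}^{K_{s,t}}n_k(n_k-1)
   = \mathrm{EHHS}^{\text{Sab}}_{s,t}
\end{align*}
```

Setting \(t=s\) gives \(1-h_s=\mathrm{EHHS}^{\text{Sab}}_{s,s}\), which is the normalization factor.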
Furthermore, we note that EHHS and EHH are related by
\begin{equation} \mathrm{EHHS}^{\text{Sab}}_{s,t}=\frac{n_{a1}(n_{a1}-1)}{n_s(n_s-1)}\mathrm{EHH}^{a1}_{s,t}+\frac{n_{a2}(n_{a2}-1)}{n_s(n_s-1)}\mathrm{EHH}^{a2}_{s,t}\;. \end{equation}
EHHS might hence be viewed as a linear combination of the EHH values of the two alternative focal alleles, weighted by roughly the square of the focal allele frequencies.
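The relations between EHH, the un-normalized EHHS and the normalized EHHS can also be checked numerically. The following base R sketch does so on toy haplotype counts; the helper names homozygosity and het are hypothetical, and no missing data are assumed.

```r
# Numerical check of the relations between EHH and EHHS on toy counts.
# Assumption: no missing data, so every chromosome is counted at SNP s.
homozygosity <- function(n_k) {
  n <- sum(n_k)
  sum(n_k * (n_k - 1)) / (n * (n - 1))
}
# heterozygosity estimator h used in the normalized (Tang et al.) version
het <- function(n_k) {
  n <- sum(n_k)
  n / (n - 1) * (1 - sum(n_k^2) / n^2)
}

# Copies of each distinct extended haplotype (s to t), split by focal allele
n_k_a1 <- c(3, 1)   # chromosomes carrying allele a1
n_k_a2 <- c(2)      # chromosomes carrying allele a2
n_a1 <- sum(n_k_a1); n_a2 <- sum(n_k_a2); n_s <- n_a1 + n_a2

ehh_a1 <- homozygosity(n_k_a1)
ehh_a2 <- homozygosity(n_k_a2)
ehhs_sab_st <- homozygosity(c(n_k_a1, n_k_a2))  # all chromosomes pooled
ehhs_sab_ss <- homozygosity(c(n_a1, n_a2))      # at the focal SNP itself
ehhs_tang <- (1 - het(c(n_k_a1, n_k_a2))) / (1 - het(c(n_a1, n_a2)))

# linear combination of the two allele-specific EHH values
weighted <- (n_a1 * (n_a1 - 1) * ehh_a1 + n_a2 * (n_a2 - 1) * ehh_a2) /
  (n_s * (n_s - 1))
```

In this example ehhs_sab_st and weighted both equal 8/30, and ehhs_tang equals the ratio ehhs_sab_st / ehhs_sab_ss = 8/14.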
###The integrated EHHS (iES) {#iES}
As for the EHH (see \ref{iHH}), \(\mathrm{EHHS}^{\text{Tang}}\) starts at 1 and decays monotonically to 0 with increasing distance from the focal SNP. For a given focal SNP, analogously to iHH, iES is defined as the integrated EHHS [@Tang2007]. Depending on whether un-normalized or normalized EHHS is used (respectively, \(\mathrm{EHHS}^{\text{Sab}}\) or \(\mathrm{EHHS}^{\text{Tang}}\)), we obtain two different values for iES that we denote by \(\mathrm{iES}^{\text{Sab}}\) and \(\mathrm{iES}^{\text{Tang}}\) respectively. As for iHH, the iES integral is computed using the trapezoid method and is often computed only for the region where EHHS lies above a given threshold (e.g., EHHS>0.05).
###Dealing with missing data
In the computation of both EHH and EHHS from a focal SNP \(s\) to a SNP \(t\), only extended haplotypes with no missing data are considered. As a consequence, the number of extended haplotypes retained to compute these two statistics might decrease with increasing distance of \(t\) from the focal SNP \(s\). If the number of available extended haplotypes falls below a threshold, computation of EHH and EHHS stops. Note, however, that most phasing programs (such as fastPHASE or SHAPEIT2) can impute missing genotypes, resulting in phased haplotypes with no missing data.
##The function calc_ehh() {#calcehh}
The calc_ehh() function computes EHH for both the ancestral and derived alleles of a focal SNP \(s\) relative to any other SNP \(t\) upstream or downstream. The corresponding integral iHH of these EHH values is returned as well. The two options limehh and limhaplo specify the conditions under which the computation of EHH is truncated (see \ref{EHH}). By default limehh=0.05 and limhaplo=2. If the border of the chromosome is reached but EHH has not yet decayed below limehh, the calculation of iHH is discarded in order to avoid border effects. This behaviour can be turned off by setting discard_integration_at_border to FALSE. Large “gaps” between consecutive SNPs can be caused by problems in sequencing or SNP calling and may lead to spuriously long extended haplotypes. To account for this, the option maxgap can be used to stop integration at gaps that are greater than the specified size. Again, if discard_integration_at_border is set to TRUE, no iHH value is reported in that case. Finally, if plotehh=TRUE, the decay of EHH for both the ancestral and derived alleles is plotted against the SNP map position. More details are available in the R documentation by using the command:
?calc_ehh
In the following example, EHH is computed around the SNP with name “F1205400”. Note that the haplohh_cgu_bta12 object was generated using the data2haplohh() function with the example input files (\ref{LoadDataEx1}). For convenience, it is stored as an example object (accessible with the R function data) as shown below:
#example haplohh object (280 haplotypes, 1424 SNPs) see ?haplohh_cgu_bta12 for details
data(haplohh_cgu_bta12)
#computing EHH statistics for the focal SNP with name "F1205400" 
#which displays a strong signal of selection
res.ehh<-calc_ehh(haplohh_cgu_bta12,mrk="F1205400") 
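The output contained in res.ehh is a list with four elements:

- ehh: a matrix of two rows containing the EHH values of the ancestral and the derived allele at each SNP;
- nhaplo_eval: a matrix of two rows containing the number of evaluated haplotypes for each allele at each SNP. Once the minimal number of haplotypes (limhaplo) or the minimal value of EHH (limehh) is reached, no chromosomes are evaluated any more;
- freq_all1: the frequency of the ancestral allele at the focal SNP;
- ihh: the iHH of the ancestral and the derived allele.

res.ehh$ehh[1:2,454:458]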
>                   F1205380  F1205390 F1205400  F1205420  F1205440
> Ancestral allele 0.2764706 0.5529412        1 0.8879552 0.6422969
> Derived allele   1.0000000 1.0000000        1 1.0000000 1.0000000
res.ehh$nhaplo_eval[1:2,454:458]
>                  F1205380 F1205390 F1205400 F1205420 F1205440
> Ancestral allele       85       85       85       85       85
> Derived allele        195      195      195      195      195
res.ehh$freq_all1
> [1] 0.3035714
res.ehh$ihh
> Ancestral allele   Derived allele 
>         284429.9        2057107.4
In addition, as plotehh=TRUE by default, we obtain Figure \ref{fig:ehhplot}.
##The function calc_ehhs()
The calc_ehhs() function computes \(\mathrm{EHHS}^{\text{Sab}}\) and \(\mathrm{EHHS}^{\text{Tang}}\) around the focal SNP \(s\) relative to each other SNP \(t\). This function also computes the corresponding integrals \(\mathrm{iES}^{\text{Sab}}\) and \(\mathrm{iES}^{\text{Tang}}\) respectively. The two options limehhs and limhaplo specify the conditions under which the computation of EHHS is truncated (see \ref{EHHS}). By default limehhs=0.05 and limhaplo=2. If the border of the chromosome is reached but EHHS has not yet decayed below limehhs, the calculation of iES is discarded in order to avoid border effects. This behaviour can be turned off by setting discard_integration_at_border to FALSE. Large “gaps” between consecutive SNPs can be caused by problems in sequencing or SNP calling and may lead to spuriously long extended haplotypes. To account for this, the option maxgap can be used to stop integration at gaps that are greater than the specified size. Again, if discard_integration_at_border is set to TRUE, no iES value is reported in that case. Finally, if plotehhs=TRUE, the decay of EHHS is plotted against the SNP map position. More details are available in the R documentation by using the command:
?calc_ehhs
In the following example, the EHHS statistics are computed around the SNP with name “F1205400” on the haplohh_cgu_bta12 object already mentioned above.
data(haplohh_cgu_bta12)
res.ehhs<-calc_ehhs(haplohh_cgu_bta12,mrk="F1205400")
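The output contained in res.ehhs is a list with five elements:

- EHHS_Sabeti_et_al_2007: a vector containing the \(\mathrm{EHHS}^{\text{Sab}}\) values at each SNP;
- EHHS_Tang_et_al_2007: a vector containing the \(\mathrm{EHHS}^{\text{Tang}}\) values at each SNP;
- nhaplo_eval: a vector containing the number of evaluated haplotypes at each SNP. Once the minimal number of haplotypes (limhaplo) or the minimal value of EHHS (limehhs) is reached, no chromosomes are evaluated any more;
- IES_Tang_et_al_2007: the integral \(\mathrm{iES}^{\text{Tang}}\);
- IES_Sabeti_et_al_2007: the integral \(\mathrm{iES}^{\text{Sab}}\).

res.ehhs$EHHS_Sabeti_et_al_2007[453:459]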
>  F1205370  F1205380  F1205390  F1205400  F1205420  F1205440  F1205450 
> 0.5017153 0.5095238 0.5347926 0.5756528 0.5654122 0.5429595 0.5386841
res.ehhs$EHHS_Tang_et_al_2007[453:459] 
>  F1205370  F1205380  F1205390  F1205400  F1205420  F1205440  F1205450 
> 0.8715588 0.8851234 0.9290193 1.0000000 0.9822104 0.9432066 0.9357794
res.ehhs$nhaplo_eval[453:459] 
> F1205370 F1205380 F1205390 F1205400 F1205420 F1205440 F1205450 
>      280      280      280      280      280      280      280
res.ehhs$IES_Tang_et_al_2007
> [1] 1760565
res.ehhs$IES_Sabeti_et_al_2007
> [1] 936407.6
In addition, as plotehhs=TRUE by default, we obtain Figure \ref{fig:ehhsplot}.
##The function scan_hh() {#fun.scanhh}
The scan_hh() function efficiently computes iHH for both the ancestral and derived alleles as well as \(\mathrm{iES}^{\text{Sab}}\) and \(\mathrm{iES}^{\text{Tang}}\) for all SNPs in the haplohh object. The options limehh, limehhs and limhaplo specify the conditions under which the computation of EHH and EHHS is stopped. By default limehh=limehhs=0.05 and limhaplo=2. If the border of the chromosome is reached but EHH(S) has not yet decayed below limehh(s), the calculation of iHH or iES, respectively, is discarded in order to avoid border effects. This behaviour can be turned off by setting discard_integration_at_border to FALSE. Large “gaps” between consecutive SNPs can be caused by problems in sequencing or SNP calling and may lead to spuriously long extended haplotypes. To account for this, the option maxgap can be used to stop integration at gaps that are greater than the specified size. Again, if discard_integration_at_border is set to TRUE, no value is reported in that case. Finally, the option threads, set by default to 1, specifies the number of threads used to parallelize the computation (parallelization being carried out over SNPs).
For instance in order to scan the haplohh_cgu_bta12 object (containing data on 1424 SNPs for 280 haplotypes), one may use the following command:
data(haplohh_cgu_bta12)
res.scan<-scan_hh(haplohh_cgu_bta12)
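The resulting object res.scan is a data frame with haplohh_cgu_bta12@nsnp (the number of SNPs declared in the haplohh object) rows and seven columns giving for each focal SNP in turn:

- CHR: the chromosome;
- POSITION: the position of the SNP;
- freq_A: the frequency of the ancestral allele;
- iHH_A: the iHH of the ancestral allele;
- iHH_D: the iHH of the derived allele;
- iES_Tang_et_al_2007: \(\mathrm{iES}^{\text{Tang}}\);
- iES_Sabeti_et_al_2007: \(\mathrm{iES}^{\text{Sab}}\).
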
As an example, the following R code provides the dimension and a segment of the res.scan data frame obtained above:
dim(res.scan)
> [1] 1424    7
res.scan[453:459,]
>          CHR POSITION     freq_A     iHH_A     iHH_D iES_Tang_et_al_2007
> F1205370  12 28925117 0.06071429  765720.1 1121814.8             1119699
> F1205380  12 28947722 0.82500000 1477031.7  454891.9             1433143
> F1205390  12 28967990 0.90000000 1211909.5  574301.6             1204234
> F1205400  12 28993983 0.30357143  284429.9 2057107.4             1760565
> F1205420  12 29101326 0.01785714  269322.2 1086530.3             1086279
> F1205440  12 29147373 0.05714286  336600.7 1185898.6             1182946
> F1205450  12 29197279 0.04285714  528732.6 1149951.0             1148696
>          iES_Sabeti_et_al_2007
> F1205370              971728.6
> F1205380              966416.3
> F1205390              954903.3
> F1205400              936407.6
> F1205420             1041943.5
> F1205440             1036274.2
> F1205450             1039850.3
Note that scan_hh() is more efficient than calc_ehh() and calc_ehhs() applied consecutively for each SNP as can be seen by running the two code snippets below:
system.time(res.scan<-scan_hh(haplohh_cgu_bta12))
>    user  system elapsed 
>   0.233   0.000   0.233
foo<-function(haplo){
  res.ihh=res.ies=matrix(0,haplo@nsnp,2)
  for(i in 1:haplo@nsnp){
    res.ihh[i,]=calc_ehh(haplo,mrk=haplo@snp.name[i],plotehh=FALSE)$ihh
    tmp=calc_ehhs(haplo,mrk=haplo@snp.name[i],plotehhs=FALSE)
    res.ies[i,1]=tmp$IES_Tang_et_al_2007
    res.ies[i,2]=tmp$IES_Sabeti_et_al_2007  
  }
  list(res.ies=res.ies,res.ihh=res.ihh)
}
system.time(res.scan2<-foo(haplohh_cgu_bta12))
>    user  system elapsed 
>  13.165   0.136  13.302
Nevertheless, results are the same:
identical(res.scan2$res.ihh[,1],res.scan[,4])
> [1] TRUE
identical(res.scan2$res.ihh[,2],res.scan[,5])
> [1] TRUE
identical(res.scan2$res.ies[,1],res.scan[,6])
> [1] TRUE
identical(res.scan2$res.ies[,2],res.scan[,7])
> [1] TRUE
#Computing iHS, Rsb and XP-EHH: the ihh2ihs(), ies2rsb() and ies2xpehh() functions
##The iHS within-population statistic
###Definition {#def.ihs}
The abbreviation iHS refers to “integrated haplotype homozygosity score”. Let \(\mathrm{uniHS}\) represent the un-standardized log-ratio of ancestral iHH\(_a\) to derived iHH\(_d\) of a certain focal SNP \(s\):
\[\mathrm{uniHS}=\log\left(\frac{\mathrm{iHH}_a}{\mathrm{iHH}_d}\right)\]
Following [@Voight2006] we perform a standardization by setting:
\[\mathrm{iHS}=\frac{\mathrm{uniHS} - \mu^{p_s}_\mathrm{uniHS}}{\sigma^{p_s}_\mathrm{uniHS}}\]
where \(\mu^{p_s}_\mathrm{uniHS}\) and \(\sigma^{p_s}_\mathrm{uniHS}\) represent the average and standard deviation of the \(\mathrm{uniHS}\) computed over all the SNPs with a derived allele frequency \(p_s\) similar to that of the SNP \(s\). In practice, the derived allele frequencies are binned so that each bin contains a large enough number of SNPs (e.g., >10) to obtain reliable estimates of \(\mu^{p_s}_\mathrm{uniHS}\) and \(\sigma^{p_s}_\mathrm{uniHS}\).
Note that the iHS is constructed to have an approximately standard Gaussian distribution and to be comparable across SNPs regardless of their underlying allele frequencies. Hence, one may further transform iHS into \(p_\mathrm{iHS}\) [@Gautier2011]: \[p_\mathrm{iHS}=-\log_{10}\left(1-2|\Phi\left(\mathrm{iHS}\right)-0.5|\right)\] where \(\Phi\left(x\right)\) represents the Gaussian cumulative distribution function. Assuming most of the genotyped SNPs behave neutrally (i.e., the genome-wide empirical iHS distribution is a fair approximation of the neutral distribution), \(p_\mathrm{iHS}\) might thus be interpreted as a two-sided P-value (on a \(-\log_{10}\) scale) associated with the neutral hypothesis of no selection.
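The standardization within allele frequency bins and the transformation into \(p_\mathrm{iHS}\) can be sketched in base R on simulated values. The helper name standardize_by_bin, the bin width and the simulated data are assumptions made for illustration only; this is not the code of ihh2ihs().

```r
# Sketch of the binned standardization and of the p-value transformation.
# Hypothetical helper on simulated values; not the code of ihh2ihs().
standardize_by_bin <- function(unihs, freq, breaks) {
  bin <- droplevels(cut(freq, breaks = breaks, include.lowest = TRUE))
  mu <- ave(unihs, bin, FUN = mean)   # mean of the bin each SNP belongs to
  sigma <- ave(unihs, bin, FUN = sd)  # standard deviation of that bin
  (unihs - mu) / sigma
}

set.seed(1)
freq <- runif(2000, 0.05, 0.95)                      # derived allele frequencies
unihs <- rnorm(2000, mean = 1 - 2 * freq, sd = 0.5)  # toy log(iHH_a / iHH_d)
ihs <- standardize_by_bin(unihs, freq, breaks = seq(0, 1, by = 0.05))

# two-sided p-value on a -log10 scale
p_ihs <- -log10(1 - 2 * abs(pnorm(ihs) - 0.5))
```

The simulated un-standardized values depend on the allele frequency on purpose, so that the effect of the bin-wise centering is visible: after standardization each bin has mean 0 and standard deviation 1.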
###The function ihh2ihs() {#cgu.ihs}
The ihh2ihs() function computes iHS using a matrix of iHH statistics (for both the ancestral and derived alleles) as obtained by the scan_hh() function (see \ref{fun.scanhh}). The argument minmaf removes SNPs according to their MAF (by default SNPs with a MAF<minmaf=0.05 are discarded from the standardization). The argument freqbin controls the size (or number) of the allele frequency bins used to perform standardization (see \ref{def.ihs}). More precisely, allele frequency bins are built from minmaf to 1-minmaf in steps of size freqbin (by default freqbin=0.025). If instead an integer of 1 or greater is specified, a corresponding number of equally spaced bins is created. If freqbin is set to 0, standardization is performed considering each observed frequency as a discrete frequency class, which is useful when there are only a few different haplotypes.
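The three ways of specifying freqbin can be illustrated by constructing the corresponding bin boundaries in base R. The helper name bin_breaks is hypothetical, and the sketch only mirrors the description above; how ihh2ihs() treats SNPs lying exactly on a boundary is not covered.

```r
# Hypothetical sketch of the bin boundaries implied by minmaf and freqbin.
bin_breaks <- function(minmaf = 0.05, freqbin = 0.025) {
  if (freqbin == 0) {
    NULL                                               # one class per observed frequency
  } else if (freqbin >= 1) {
    seq(minmaf, 1 - minmaf, length.out = freqbin + 1)  # freqbin equally spaced bins
  } else {
    seq(minmaf, 1 - minmaf, by = freqbin)              # bins of width freqbin
  }
}

breaks_default <- bin_breaks()             # 0.05, 0.075, ..., 0.95
breaks_four <- bin_breaks(freqbin = 4)     # four bins between 0.05 and 0.95
```

With the default values this gives 36 bins of width 0.025, in line with the bin labels of the frequency.class output shown further below.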
For instance, to perform a whole genome scan one might run scan_hh() on haplotype data from each chromosome and concatenate the resulting matrices before standardization. In the following example, we assume that the haplotype files are named as hap_chr_i.pop1 where the chromosome number \(i\) goes from 1 to 29 and the SNP information file is named snp.info. The R code below then generates a matrix wg.res with \(iHH_a\) and \(iHH_d\) estimates for all SNPs in an appropriate format to perform standardization with the ihh2ihs() function:
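for(i in 1:29){
  # name of the haplotype file of chromosome i
  hap_file=paste("hap_chr_",i,".pop1",sep="")
  # read the haplotypes of chromosome i
  data<-data2haplohh(hap_file=hap_file,map_file="snp.info",chr.name=i)
  res<-scan_hh(data)
  # append the results of chromosome i to those of the previous chromosomes
  if(i==1){wg.res<-res}else{wg.res<-rbind(wg.res,res)}
}
wg.ihs<-ihh2ihs(wg.res)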
For illustration, \(iHH\) values of a whole genome scan [@Gautier2011] are provided as example data. The following R code computes the iHS for the CGU population:
data(wgscan.cgu)
## results from a genome scan (44,057 SNPs) see ?wgscan.eut and ?wgscan.cgu for details
ihs.cgu<-ihh2ihs(wgscan.cgu)
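The resulting object ihs.cgu is a list with two elements, the data frames iHS and frequency.class. The iHS data frame contains, for each SNP, the iHS and the corresponding \(p_\mathrm{iHS}\) (see \ref{def.ihs}). The first rows of the ihs.cgu$iHS data frame are displayed below using the R command:
head(ihs.cgu$iHS)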
>          CHR POSITION        iHS -log10(p-value)
> F0100190   1   113642 -0.5582992       0.2390952
> F0100220   1   244699  0.2723337       0.1049282
> F0100250   1   369419  0.4810736       0.2003396
> F0100270   1   447278  1.0618710       0.5401640
> F0100280   1   487654  0.8184060       0.3839181
> F0100290   1   524507 -0.3897024       0.1569189
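The frequency.class data frame summarizes, for each allele frequency bin, the number of SNPs together with the mean and the standard deviation of the un-standardized iHS used for the standardization. The first rows of the ihs.cgu$frequency.class data frame are displayed below:
head(ihs.cgu$frequency.class)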
>              #mrk mean(log(iHHA/iHHD)) sd(log(iHHA/iHHD))
> 0.05 - 0.075 1635            0.7286087          0.6457742
> 0.075 - 0.1  1316            0.5804760          0.5556798
> 0.1 - 0.125  1478            0.4710504          0.5079392
> 0.125 - 0.15 1593            0.3720585          0.4708235
> 0.15 - 0.175 1078            0.3263215          0.4524270
> 0.175 - 0.2  1325            0.2721166          0.4533404
###Manhattan plot of the results: the function ihsplot()
The function ihsplot() draws a Manhattan plot of the Whole Genome Scan results returned by the function ihh2ihs().  Various options are available to modify the graphical display (see ?ihsplot). Figure \ref{fig:ihsplot} was drawn using the following R code:
layout(matrix(1:2,2,1))
ihsplot(ihs.cgu,plot.pval=TRUE,ylim.scan=2,main="iHS (CGU cattle breed)")
##The Rsb pairwise population statistic
###Definition {#def.rsb}
The abbreviation Rsb stands for “ratio of EHHS between populations”. For a given SNP \(s\), let \[\mathrm{LRiES}^{\text{Tang}}=\log\left(\frac{\mathrm{iES}_\text{pop1}^{\text{Tang}}}{\mathrm{iES}_\text{pop2}^{\text{Tang}}}\right)\] represent the log-ratio of the \(\mathrm{iES}^{\text{Tang}}\) values computed in the pop1 and pop2 populations (see \ref{iES}).
The Rsb for a given focal SNP is then defined as the standardized \(\mathrm{LRiES}^{\text{Tang}}\) [@Tang2007]:
\begin{equation} \mathrm{Rsb}=\frac{\mathrm{LRiES}^{\text{Tang}} - \text{med}_{\mathrm{LRiES}^{\text{Tang}}}}{\sigma_{\mathrm{LRiES}^{\text{Tang}}}} \end{equation}
where \(\text{med}_{\mathrm{LRiES}^{\text{Tang}}}\) and \(\sigma_{\mathrm{LRiES}^{\text{Tang}}}\) represent the median and standard deviation of the \(\mathrm{LRiES}^{\text{Tang}}\) computed over all analyzed SNPs. Note that we follow [@Tang2007] in using the median instead of the mean (hence in contrast to the definitions of iHS and XP-EHH). They assume that this might increase the robustness against different demographic scenarios. It should also be noted that the information about the ancestral or derived status of alleles at the focal SNP does not figure in the formula. Furthermore, in contrast with iHS, no binning is performed.
Like iHS (see \ref{def.ihs}), Rsb is constructed to have an approximately standard Gaussian distribution and may further be transformed into \(p_\mathrm{Rsb}\):
\begin{equation} p_\mathrm{Rsb}=-\log_{10}\left(1-2|\Phi\left(\mathrm{Rsb}\right)-0.5|\right) \end{equation}
where \(\Phi\left(x\right)\) represents the Gaussian cumulative distribution function. Assuming most of the genotyped SNPs behave neutrally (i.e., the genome-wide empirical Rsb distribution is a fair approximation of the corresponding neutral distribution), \(p_\mathrm{Rsb}\) might thus be interpreted as a two-sided P-value (on a \(-\log_{10}\) scale) associated with the neutral hypothesis of no selection. Alternatively, \(p\prime_\mathrm{Rsb}\) might also be computed [@Gautier2011]:
\begin{equation} p\prime_\mathrm{Rsb}=-\log_{10}\left(|\Phi\left(\mathrm{Rsb}\right)|\right) \end{equation}
\(p\prime_\mathrm{Rsb}\) might then be interpreted as a one-sided P-value (on a \(-\log_{10}\) scale) allowing the identification of those sites displaying a significantly high extended haplotype homozygosity in population pop2 (represented in the denominator of the corresponding \(\mathrm{LRiES}\)) relative to the pop1 reference population.
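The definition of Rsb and of its two P-value transformations can be sketched in base R. The helper name rsb_sketch and the simulated iES values are hypothetical; this is a transcription of the formulas above, not the code of ies2rsb().

```r
# Sketch of Rsb and its p-values from two vectors of iES (Tang) values.
# Hypothetical helper on simulated values; not the code of ies2rsb().
rsb_sketch <- function(ies_pop1, ies_pop2) {
  lr <- log(ies_pop1 / ies_pop2)                  # log-ratio per SNP
  # standardize with the median (not the mean) and the standard deviation
  rsb <- (lr - median(lr, na.rm = TRUE)) / sd(lr, na.rm = TRUE)
  data.frame(rsb = rsb,
             p_two_sided = -log10(1 - 2 * abs(pnorm(rsb) - 0.5)),
             p_one_sided = -log10(pnorm(rsb)))    # large when Rsb is very negative
}

set.seed(2)
ies1 <- rlnorm(1000, meanlog = 13, sdlog = 0.3)   # toy iES values, pop1
ies2 <- rlnorm(1000, meanlog = 13.2, sdlog = 0.3) # toy iES values, pop2
res <- rsb_sketch(ies1, ies2)
```

A strongly negative Rsb corresponds to a larger iES in pop2 than in pop1, which is why the one-sided transformation highlights sites with high extended haplotype homozygosity in pop2.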
###The function ies2rsb() {#res.rsb}
The ies2rsb() function computes Rsb using two data frames containing the iES statistics for each of the two populations considered in the same format as obtained by running the scan_hh() function (see \ref{fun.scanhh}).
In order to perform a genome-wide scan, one might first run scan_hh() for each population on the haplotype data of each chromosome and then concatenate the results. In the following example, we assume that the haplotype files are named hap_chr_i.pop1 and hap_chr_i.pop2, where \(i\) is the chromosome number (going from 1 to 29), the suffixes pop1 and pop2 indicate the population of origin and the SNP information file is named snp.info. The R code below then generates two data frames (wg.res.pop1 and wg.res.pop2) containing the results for all SNPs in the appropriate format to compute Rsb with the ies2rsb() function:
for(i in 1:29){
  # population 1: build the file name, load the haplotypes and scan chromosome i
  hap_file=paste("hap_chr_",i,".pop1",sep="")
  data<-data2haplohh(hap_file=hap_file,"snp.info",chr.name=i)
  res<-scan_hh(data)
  if(i==1){wg.res.pop1<-res}else{wg.res.pop1<-rbind(wg.res.pop1,res)}
  # population 2: same steps
  hap_file=paste("hap_chr_",i,".pop2",sep="")
  data<-data2haplohh(hap_file=hap_file,"snp.info",chr.name=i)
  res<-scan_hh(data)
  if(i==1){wg.res.pop2<-res}else{wg.res.pop2<-rbind(wg.res.pop2,res)}
}
# compute Rsb from the concatenated per-population results
wg.rsb<-ies2rsb(wg.res.pop1,wg.res.pop2)
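The loop above grows the result tables with repeated calls to rbind(). An alternative idiom is to collect the per-chromosome results in a list and bind them in one step. The sketch below shows only that pattern: `toy_scan()` is a hypothetical stand-in for loading one chromosome and running scan_hh() on it, and its column names and values are made up so that the example runs without any haplotype files.

```r
# toy_scan() is a hypothetical stand-in for data2haplohh() followed by scan_hh();
# it only returns a small data frame with made-up values
toy_scan <- function(chr) {
  data.frame(CHR = chr,
             POSITION = c(1000, 2000, 3000),
             IES = c(1.5, 2.0, 2.5) * chr)
}

# Collect the per-chromosome results in a list, then bind them in one step
res_list <- lapply(1:3, toy_scan)
wg.res <- do.call(rbind, res_list)

print(dim(wg.res))
```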
For illustration, we take \(iES\) values from a genome scan [@Gautier2011] provided as example data and compute for each SNP the Rsb between the CGU and EUT populations as follows:
data(wgscan.cgu) ; data(wgscan.eut)
## results from a genome scan (44,057 SNPs) see ?wgscan.eut and ?wgscan.cgu for details
cguVSeut.rsb<-ies2rsb(wgscan.cgu,wgscan.eut,"CGU","EUT")
The resulting object cguVSeut.rsb is a data frame containing, for each SNP, its Rsb and the corresponding P-value, assuming Rsb values are normally distributed under the neutral hypothesis. The P-value can be either bilateral, i.e. two-sided (default), or unilateral, i.e. one-sided (as specified by the method argument). The first rows of the cguVSeut.rsb data frame are displayed below using the following R command:
head(cguVSeut.rsb)
>          CHR POSITION Rsb (CGU vs. EUT) -log10(p-value) [bilateral]
> F0100190   1   113642        -0.3398574                  0.13432529
> F0100220   1   244699        -1.0566283                  0.53658299
> F0100250   1   369419        -0.1468326                  0.05390941
> F0100270   1   447278        -1.8191608                  1.16186336
> F0100280   1   487654        -0.2193069                  0.08280392
> F0100290   1   524507        -0.7941300                  0.36945032
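The bilateral column above can be checked against the definition of \(p_\mathrm{Rsb}\) using base R only. The sketch below copies the Rsb values from the output shown above; the one-sided variant \(p\prime_\mathrm{Rsb}\) is computed alongside for comparison.

```r
# Rsb values copied from the first rows of the output shown above
rsb <- c(-0.3398574, -1.0566283, -0.1468326, -1.8191608, -0.2193069, -0.7941300)

# Two-sided P-value on a -log10 scale, following the definition of p_Rsb
p_two_sided <- -log10(1 - 2 * abs(pnorm(rsb) - 0.5))

# One-sided variant p'_Rsb: it grows as the score becomes more negative
p_one_sided <- -log10(pnorm(rsb))

print(round(cbind(rsb, p_two_sided, p_one_sided), 4))
```

The `p_two_sided` column should agree with the `-log10(p-value) [bilateral]` column up to the rounding of the printed Rsb values.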
###Manhattan plot of the results: the function rsbplot()
The rsbplot() function draws a Manhattan plot of the whole-genome scan results obtained with the function ies2rsb(). Various options are available to modify the graphical display (see ?rsbplot). As an example, Figure \ref{fig:rsbplot} below provides the output of the function rsbplot() for the Rsb computed above across the CGU and EUT populations (see \ref{res.rsb}). It was drawn using the following R code:
layout(matrix(1:2,2,1))
rsbplot(cguVSeut.rsb,plot.pval=TRUE)
##The XP-EHH pairwise population statistic
###Definition
The XP-EHH (cross-population EHH) statistic [@Sabeti2007] is similar to Rsb except that it is based on \(\mathrm{iES}^{\text{Sab}}\) instead of \(\mathrm{iES}^{\text{Tang}}\) (see \ref{iES}). Hence, for a given SNP \(s\), let \[\mathrm{LRiES}^{\text{Sab}}=\log\left(\frac{\mathrm{iES}_\text{pop1}^{\text{Sab}}}{\mathrm{iES}_\text{pop2}^{\text{Sab}}}\right)\] represent the log-ratio of the \(\mathrm{iES}^{\text{Sab}}\) values computed in the pop1 and pop2 populations (see \ref{iES}).
The XP-EHH for a given focal SNP is then defined as the standardized \(\mathrm{LRiES}^{\text{Sab}}\) [@Sabeti2007]:
\begin{equation} \mathrm{XP-EHH}=\frac{\mathrm{LRiES}^{\text{Sab}} - \text{mean}_{\mathrm{LRiES}^{\text{Sab}}}}{\sigma_{\mathrm{LRiES}^{\text{Sab}}}} \end{equation} where \(\text{mean}_{\mathrm{LRiES}^{\text{Sab}}}\) and \(\sigma_{\mathrm{LRiES}^{\text{Sab}}}\) represent the mean and standard deviation of \(\mathrm{LRiES}^{\text{Sab}}\) computed over all analyzed SNPs. As with Rsb, the information about the ancestral and derived status of alleles at the focal SNP does not figure in the formula and no binning is performed.
As with iHS (see \ref{def.ihs}) and Rsb (see \ref{def.rsb}), XP-EHH is constructed to have an approximately standard Gaussian distribution and may further be transformed into \(p_\mathrm{XP-EHH}\): \begin{equation} p_\mathrm{XP-EHH}=-\log_{10}\left(1-2|\Phi\left(\mathrm{XP-EHH}\right)-0.5|\right) \end{equation} where \(\Phi\left(x\right)\) represents the Gaussian cumulative distribution function. Assuming most of the genotyped SNPs behave neutrally (i.e., the genome-wide empirical XP-EHH distribution is a fair approximation of the corresponding neutral distribution), \(p_\mathrm{XP-EHH}\) might thus be interpreted as a two-sided P-value (in a \(-\log_{10}\) scale) associated with the neutral hypothesis of no selection. Alternatively, \(p\prime_\mathrm{XP-EHH}\) might also be computed [@Gautier2011]: \begin{equation} p\prime_\mathrm{XP-EHH}=-\log_{10}\left(|\Phi\left(\mathrm{XP-EHH}\right)|\right) \end{equation} \(p\prime_\mathrm{XP-EHH}\) might then be interpreted as a one-sided P-value (in a \(-\log_{10}\) scale) allowing the identification of those sites displaying a significantly high extended haplotype homozygosity in population \(pop2\) (represented in the denominator of the corresponding \(\mathrm{LRiES}\)) relative to the \(pop1\) reference population.
###The function ies2xpehh() {#res.xpehh}
The ies2xpehh() function computes XP-EHH using two data frames containing the iES statistics for each of the two populations in the format as obtained by running the scan_hh() function (see \ref{fun.scanhh}).
For instance, to perform a genome scan, one might first run scan_hh() for each population in turn on the haplotype data of each chromosome and concatenate the results. In the following example, we assume that the haplotype files are named hap_chr_i.pop1 and hap_chr_i.pop2, where \(i\) is the chromosome number (going from 1 to 29), the suffixes pop1 and pop2 indicate the population of origin and the SNP information file is named snp.info. The R code below then generates two data frames (wg.res.pop1 and wg.res.pop2) containing the results for all SNPs in the appropriate format to compute XP-EHH with the ies2xpehh() function:
for(i in 1:29){
  # population 1: build the file name, load the haplotypes and scan chromosome i
  hap_file=paste("hap_chr_",i,".pop1",sep="")
  data<-data2haplohh(hap_file=hap_file,"snp.info",chr.name=i)
  res<-scan_hh(data)
  if(i==1){wg.res.pop1<-res}else{wg.res.pop1<-rbind(wg.res.pop1,res)}
  # population 2: same steps
  hap_file=paste("hap_chr_",i,".pop2",sep="")
  data<-data2haplohh(hap_file=hap_file,"snp.info",chr.name=i)
  res<-scan_hh(data)
  if(i==1){wg.res.pop2<-res}else{wg.res.pop2<-rbind(wg.res.pop2,res)}
}
# compute XP-EHH from the concatenated per-population results
wg.xpehh<-ies2xpehh(wg.res.pop1,wg.res.pop2)
For illustration, we consider the \(iES\) values of a genome scan [@Gautier2011] provided as example data and compute for each SNP the XP-EHH between the CGU and EUT populations as follows:
data(wgscan.cgu) ; data(wgscan.eut)
## results from a genome scan (44,057 SNPs) see ?wgscan.eut and ?wgscan.cgu for details
cguVSeut.xpehh<-ies2xpehh(wgscan.cgu,wgscan.eut,"CGU","EUT")
The resulting object cguVSeut.xpehh is a data frame containing for each SNP the XP-EHH and corresponding P-value assuming XP-EHH are normally distributed under the neutral hypothesis. The P-value might be either bilateral (default) or unilateral (as specified by the method argument). The first rows of this data frame are displayed below:
head(cguVSeut.xpehh)
>          CHR POSITION XPEHH (CGU vs. EUT) -log10(p-value) [bilateral]
> F0100190   1   113642          -0.5943673                   0.2578513
> F0100220   1   244699          -0.7903997                   0.3672448
> F0100250   1   369419          -0.9273568                   0.4513142
> F0100270   1   447278          -0.3858354                   0.1551387
> F0100280   1   487654          -0.9570604                   0.4703941
> F0100290   1   524507          -0.7908863                   0.3675322
###Manhattan plot of the results: the function xpehhplot()
The xpehhplot() function draws a Manhattan plot of the whole-genome scan results produced by the function ies2xpehh(). Various options are available to modify the graphical display (see ?xpehhplot).
As an example, Figure \ref{fig:xpehhplot} provides the output of the function xpehhplot() for the XP-EHH computed above across the CGU and EUT populations (see \ref{res.xpehh}). It was drawn by:
layout(matrix(1:2,2,1))
xpehhplot(cguVSeut.xpehh,plot.pval=TRUE)
###Rsb vs. XP-EHH comparison:
A plot of the Rsb against XP-EHH values across the CGU and EUT populations (see \ref{res.rsb} and \ref{res.xpehh} respectively) is represented in Figure \ref{fig:comp}. Marked in red is the SNP that was used repeatedly in the examples above. The plot was generated using the following R code:
plot(cguVSeut.rsb[,3],cguVSeut.xpehh[,3],xlab="Rsb",ylab="XP-EHH",pch=".",
     xlim=c(-7.5,7.5),ylim=c(-7.5,7.5))
points(cguVSeut.rsb["F1205400",3],cguVSeut.xpehh["F1205400",3],col="red")
abline(a=0,b=1,lty=2)
\clearpage \newpage
##Visual inspection of the standardized scores distribution: the function distribplot()
The distribplot() function makes it possible to visualize the distribution of the standardized scores (either iHS, Rsb or XP-EHH) and to compare it to the standard Gaussian distribution. As an example, Figure \ref{fig:distribplot} below provides the output of the function distribplot() when considering the iHS estimates obtained for the CGU population (see \ref{cgu.ihs}) using the following R code:
layout(matrix(1:2,2,1))
distribplot(ihs.cgu$iHS[,3],xlab="iHS")
\clearpage
\newpage 
#Visualizing haplotype structure around a focal SNP: the bifurcation.diagram() function
The function bifurcation.diagram() draws haplotype bifurcation diagrams [@Sabeti2002] that visualize the decay of EHH around a focal SNP. A stark contrast of ancestral and derived bifurcation diagrams should correspond to outlier values of iHS. Within the plot, the root (focal SNP) is identified by a vertical dashed line. 
The diagram is bi-directional, portraying decay along both sides of the focal SNP. Moving in one direction, each marker is an opportunity for a bifurcation to occur, further differentiating between extended haplotypes. The thickness of the lines corresponds to the number of chromosomes with the same haplotype.
Several options are available to modify the aspect of the plots (see command ?bifurcation.diagram).
As an illustration, Figure \ref{fig:bifdia} shows the bifurcation diagrams for both the derived and ancestral alleles at the SNP with name “F1205400” on BTA12 CGU haplotypes. This SNP is associated with a strong signal of selection (using both iHS and Rsb statistics) and is located close (<5 kb) to a strong candidate gene involved in horn development [@Gautier2011]. Figure \ref{fig:bifdia} was obtained by the following R code:
data(haplohh_cgu_bta12)
layout(matrix(1:2,2,1))
bifurcation.diagram(haplohh_cgu_bta12,mrk_foc="F1205400",all_foc=1,nmrk_l=20,nmrk_r=20,
                    main="Bifurcation diagram (RXFP2 SNP on BTA12): Ancestral Allele")
bifurcation.diagram(haplohh_cgu_bta12,mrk_foc="F1205400",all_foc=2,nmrk_l=20,nmrk_r=20,
                    main="Bifurcation diagram (RXFP2 SNP on BTA12): Derived Allele")
\clearpage
\newpage
The C++ program hapbin [@Maclean2015] is an alternative implementation of the statistics iHS and XP-EHH. The calculated values vary between hapbin and rehh. As far as we can tell, this is due to the following differences in implementation:
Hapbin disregards the SNPs directly at the border of chromosomes.
If an ancestral or derived focal allele is present only on a single chromosome, hapbin assigns an EHH value of 1 for that allele at the focal SNP and zero otherwise. rehh assigns zero at the focal SNP, too.
If run in default mode (without option --binom or -a), hapbin calculates EHH by (notation as in \ref{EHH})
\begin{equation}
\mathrm{EHH}^a_{s,t}=\sum_{k=1}^{K^a_{s,t}}\left(\frac{n_k}{n_a}\right)^2\;.
\end{equation}
Note that this formula yields a homozygosity of 0.5 for a sample of two chromosomes with non-identical alleles. This number reflects the situation of a population with a frequency of 0.5 for two alleles and the average homozygosity of two chromosomes taken at random (with replacement). The formula (\ref{ehh.def}) used by rehh and applied by hapbin if run with option --binom or -a returns in this case zero, hence the homozygosity of the sample proper. For increasing sample size both approaches converge.
The same holds for EHHS.
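The difference between the two homozygosity estimators can be illustrated with a minimal sketch in base R. The function names are hypothetical, and the without-replacement form is assumed here to correspond to formula (\ref{ehh.def}); neither function is taken from the code of hapbin or rehh.

```r
# n_k: number of chromosomes carrying each distinct extended haplotype

# With replacement: the formula given above for hapbin in default mode
ehh_with_replacement <- function(n_k) {
  n_a <- sum(n_k)
  sum((n_k / n_a)^2)
}

# Without replacement: assumed here to be the form of (\ref{ehh.def})
ehh_without_replacement <- function(n_k) {
  n_a <- sum(n_k)
  if (n_a < 2) return(0)  # a single chromosome is assigned zero
  sum(n_k * (n_k - 1)) / (n_a * (n_a - 1))
}

# Two chromosomes carrying different haplotypes
two <- c(1, 1)
print(c(ehh_with_replacement(two), ehh_without_replacement(two)))

# Larger sample with the same haplotype proportions
large <- c(500, 500)
print(c(ehh_with_replacement(large), ehh_without_replacement(large)))
```

For the sample of two chromosomes the estimators give 0.5 and 0 respectively, while for the sample of 1000 chromosomes they differ by less than 0.001.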
For the standardization of iHS resp. XP-EHH, hapbin uses the estimator \(\sqrt{\frac{1}{n}\sum(x_i-\bar{x})^2}\) for the standard deviation while rehh uses \(\sqrt{\frac{1}{n-1}\sum(x_i-\bar{x})^2}\).
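The size of this effect is easy to check in base R. The scores below are made up for illustration:

```r
# Made-up unstandardized scores
x <- c(-1.2, 0.4, 0.9, -0.3, 2.1, 0.0, -0.8, 1.5)
n <- length(x)

sd_n   <- sqrt(sum((x - mean(x))^2) / n)  # divisor n, as described for hapbin
sd_nm1 <- sd(x)                           # divisor n - 1, as computed by sd() in R

# The two estimates differ by the factor sqrt((n - 1) / n)
print(c(sd_n = sd_n, sd_nm1 = sd_nm1, ratio = sd_n / sd_nm1))
```

Since the ratio is \(\sqrt{(n-1)/n}\), the difference becomes negligible for the tens of thousands of SNPs of a typical genome scan.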
The bins used for the standardization of \(iHS\) span the whole interval ]0,1] in hapbin, while in rehh they span the interval [minmaf,1-minmaf[. Hapbin includes the upper endpoint in each bin, while rehh includes the lower endpoint. The default number of bins in hapbin is 50, yielding a bin width of 0.02. The corresponding default bin width in rehh is 0.025.
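The effect of the endpoint convention can be sketched with the base R function cut(). The bin width of 0.25 is deliberately coarse and chosen only for illustration; it is not the default of either program.

```r
# Coarse illustrative bins of width 0.25 (not the default of either program)
breaks <- c(0, 0.25, 0.5, 0.75, 1)
freq <- 0.5  # allele frequency lying exactly on a bin boundary

# Right-closed bins ]a,b], the convention described for hapbin
bin_right_closed <- cut(freq, breaks, right = TRUE, labels = FALSE)

# Left-closed bins [a,b[, the convention described for rehh
bin_left_closed <- cut(freq, breaks, right = FALSE, labels = FALSE)

print(c(right_closed = bin_right_closed, left_closed = bin_left_closed))
```

A SNP whose allele frequency falls exactly on a boundary is thus standardized against a different set of SNPs depending on the convention.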
Integration over EHH resp. EHHS is performed by hapbin on the area between the curve spanned by these quantities and the x-axis (y=0), while rehh integrates only over the part of that area that is above the threshold set by the parameters limehh resp. limehhs, i.e. the area between the curve and the line y=threshold. This is not to be confused with the condition for truncation at the left and right ends of the curve (i.e. the termination of the integration once EHH resp. EHHS decays below a given threshold), which is (for all practical purposes) identical in both programs.
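The two integration areas can be compared with a small trapezoidal-rule sketch in base R. The helper `area_above()`, the positions and the EHH values are all made up for illustration, and the code is not taken from either program.

```r
# Trapezoidal area between a curve y(x) and the horizontal line y = baseline
area_above <- function(x, y, baseline = 0) {
  h <- y - baseline
  sum(diff(x) * (head(h, -1) + tail(h, -1)) / 2)
}

# Made-up EHH decay to the right of a focal SNP (positions in bp)
pos <- c(0, 10000, 20000, 30000, 40000)
ehh <- c(1.0, 0.6, 0.3, 0.1, 0.05)
threshold <- 0.05  # illustrative value playing the role of limehh

area_from_zero      <- area_above(pos, ehh)             # down to the x-axis
area_from_threshold <- area_above(pos, ehh, threshold)  # down to y = threshold

print(c(area_from_zero, area_from_threshold))
```

The two areas differ by the rectangle of height `threshold` under the integrated stretch, so the discrepancy grows with the length of the region over which EHH stays above the threshold.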
By far the largest discrepancies arise from the different handling of gaps during the integration of EHH resp. EHHS yielding iHH resp. iES. Hapbin has an optional parameter -s or --scale to “down-weight” large gaps by capping them to the specified value. For example, with its default value of \(20000\), any gap greater than this value is reduced to it. By contrast, using the option maxgap within rehh stops the integration at the gap and, if the parameter discard_integration_at_border is set to TRUE, no value is reported.
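The two gap-handling strategies can be contrasted with the following sketch in base R. Both functions are hypothetical illustrations of the ideas described above (capping versus stopping) and do not reproduce the actual code of hapbin or rehh; the positions and EHH values are made up.

```r
# Trapezoidal integration where every gap wider than `scale` is capped to `scale`
# (sketch of the capping idea described for hapbin, not its actual code)
integrate_capped <- function(pos, y, scale) {
  gaps <- pmin(diff(pos), scale)
  sum(gaps * (head(y, -1) + tail(y, -1)) / 2)
}

# Trapezoidal integration that stops at the first gap wider than `maxgap`
# (sketch of the stopping idea described for rehh, not its actual code)
integrate_until_gap <- function(pos, y, maxgap) {
  gaps <- diff(pos)
  too_wide <- which(gaps > maxgap)
  last <- if (length(too_wide) > 0) too_wide[1] - 1 else length(gaps)
  if (last < 1) return(0)
  idx <- seq_len(last)
  sum(gaps[idx] * (y[idx] + y[idx + 1]) / 2)
}

# Made-up EHH values with one large gap between 20 kb and 120 kb
pos <- c(0, 10000, 20000, 120000, 130000)
ehh <- c(1.0, 0.8, 0.6, 0.4, 0.2)

area_capped  <- integrate_capped(pos, ehh, scale = 20000)
area_stopped <- integrate_until_gap(pos, ehh, maxgap = 20000)
print(c(capped = area_capped, stopped = area_stopped))
```

With capping, the markers beyond the gap still contribute to the integral, whereas with stopping only the stretch before the gap is integrated.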
\clearpage
\newpage
#References