---
title: "QFASA Workflow Example"
author: "Shelley Lang"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{QFASA Workflow Example}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
# Load Package

```{r, eval=TRUE}
library(QFASA)
library(plyr)
```

# Modeling Inputs
Before starting, make sure that:

* Fatty acid names in all files are the same (contain the exact same
  numbers/characters and punctuation)
* There are no fatty acids in the prey file that do not appear in the
  predator file and vice versa
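
The two checks above can be sketched in base R. This is a minimal illustration on made-up data, not part of the QFASA package: the data frames `pred` and `prey` and their FA names are hypothetical, and they contain FA columns only (real files would need their tombstone columns removed first).

```r
# Toy data frames standing in for the predator and prey files
# (FA names and values are made up for illustration).
pred <- data.frame(c16.0 = c(10, 12), c18.0 = c(5, 4), c20.5w3 = c(8, 9))
prey <- data.frame(c16.0 = c(11, 13), c18.0 = c(6, 5), c22.6w3 = c(7, 6))

# FAs present in one file but missing from the other
only.in.pred <- setdiff(colnames(pred), colnames(prey))
only.in.prey <- setdiff(colnames(prey), colnames(pred))

only.in.pred
only.in.prey

# TRUE only when both files contain exactly the same FA names
names.match <- length(only.in.pred) == 0 && length(only.in.prey) == 0
names.match
```

Because `setdiff` compares the names as exact strings, differences in punctuation or capitalization are reported as mismatches too.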


## Distance Measure
Choose one of three distance measures by setting dist.meas to the
corresponding number:

1) KL (Kullback-Leibler)
2) AIT (Aitchison)
3) CSD (Chi-Squared) 

```{r, eval=TRUE}
dist.meas=1
```
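
For orientation, the first two measures can be sketched from their textbook definitions. The helper names `kl.sketch` and `ait.sketch` are hypothetical (they are not QFASA functions), the KL form shown is the symmetric one, and the package's internal implementation of all three measures may differ.

```r
# Symmetric Kullback-Leibler distance between two proportion vectors
# (hypothetical helper, not a QFASA function)
kl.sketch <- function(y, yhat) sum((y - yhat) * log(y / yhat))

# Aitchison distance: Euclidean distance between centred log-ratios
# (hypothetical helper, not a QFASA function)
ait.sketch <- function(y, yhat) {
  clr <- function(x) log(x) - mean(log(x))
  sqrt(sum((clr(y) - clr(yhat))^2))
}

# Made-up observed and modelled signatures
y    <- c(0.5, 0.3, 0.2)
yhat <- c(0.4, 0.4, 0.2)

kl.sketch(y, yhat)
ait.sketch(y, yhat)
```

Both sketches return 0 when the two signatures are identical and both assume strictly positive proportions, since they take logarithms.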

## Fatty Acid Set
* This is the list of FAs to be used in the modelling.
* The simplest alternative is to load a .csv file which contains a
  single column with a header row and the names of the fatty acids
  listed below (see example file __"FAset.csv"__).
* A more complicated alternative is to load a .csv file with the full
  set of FAs and then add code to subset the FAs you wish to use from
  that set. This alternative is useful if you are planning to test
  multiple FA sets.
* Regardless of how you load the FA set it must be converted to a
  vector.

```{r, eval=TRUE}
data(FAset)
fa.set = as.vector(unlist(FAset))
```
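
The more complicated alternative (load the full set, then subset) might look like the sketch below. The objects `full.set` and `keep` and the FA names are hypothetical stand-ins for your own files.

```r
# Hypothetical full FA list, as it might look after reading a one-column .csv
full.set <- data.frame(FA = c("c14.0", "c16.0", "c16.1w7", "c18.0", "c20.5w3"),
                       stringsAsFactors = FALSE)

# Hypothetical subset of FAs to keep for this model run
keep <- c("c16.0", "c18.0", "c20.5w3")

# Subsetting the column directly already yields a plain character vector
fa.sub <- full.set$FA[full.set$FA %in% keep]
fa.sub
```

Changing `keep` is then all that is needed to test a different FA set.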

## Matrix of Predator FA signatures
* The FA signatures in the originating .csv file should sum to 100 or 1.
  
* Each predator signature is a row with the FAs in columns (see
  example file __"predatorFAs.csv"__).
  
* The FA signatures are subsetted for the chosen FA set (created
  above) and renormalized during the modelling, so there is no need to
  subset and/or renormalize prior to loading the .csv file or running
  p.QFASA, BUT make sure that the same FAs appear in the predator
  and prey files (if a FA appears in one but not the other the code
  will give you an error).

* Unlike the original QFASApack code, the predator FA .csv file can
  contain as much tombstone data in columns as you wish, but the
  predator FA signatures must be extracted as a separate input in
  order to run in p.QFASA. For example: in the code below the predator
  .csv file ("predatorFAs.csv") has 4 tombstone columns (SampleCode,
  AnimalCode, SampleGroup, Biopsy). Prior to running QFASA the
  tombstone (columns 1-4) and FA data (columns 5 onward) are each
  extracted from the original data frame. The FA data become the
  predator.matrix (which is passed to p.QFASA) and the tombstone data
  is retained so that it can be recombined with the model output later
  on.
  
```{r, eval=TRUE }
data(predatorFAs)
tombstone.info = predatorFAs[,1:4]
predator.matrix = predatorFAs[,5:(ncol(predatorFAs))]

# number of predator FA signatures; used to create the matrix of CC values
# (see the Calibration Coefficients section below)
npredators = nrow(predator.matrix)
```
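
Before modelling it can be worth confirming that each signature sums to roughly 100 or 1. The sketch below uses a made-up table with two tombstone columns; the object names and the tolerance `tol` are arbitrary assumptions.

```r
# Toy predator table: 2 tombstone columns followed by 3 FA columns
# (all names and values are made up for illustration).
pred.toy <- data.frame(SampleCode = c("P1", "P2"),
                       SampleGroup = c("A", "B"),
                       c16.0 = c(50, 0.5),
                       c18.0 = c(30, 0.3),
                       c20.5w3 = c(20, 0.1))

# Separate tombstone and FA columns by position
tomb.toy <- pred.toy[, 1:2]
fa.toy   <- pred.toy[, 3:ncol(pred.toy)]

# Flag signatures that sum to neither (about) 100 nor (about) 1
tol <- 0.01  # arbitrary relative tolerance, chosen for illustration
totals <- rowSums(fa.toy)
ok <- abs(totals - 100) < 100 * tol | abs(totals - 1) < tol
ok
```

Here the second toy signature is flagged because it sums to 0.9.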


## Matrix of Prey FA signatures
* The FA signatures in the originating .csv file should sum to 100 or 1.
  
* The prey file should contain all of the individual FA signatures of
  the prey and their lipid contents (where appropriate). A matrix of
  the mean values for the FAs (prey.matrix) by the designated prey
  modelling group is then calculated using the MEANmeth function
  from the QFASA package (loaded above).
  
* Like the predator .csv file you can have as many tombstone data
  columns as required but there must be at least one column that
  identifies the modelling group to be used (in the example file used
  below __"preyFAs.csv"__ it is the "Species" column).
  
* Unlike the predator data, the prey data is not subsetted and
  renormalized during the modelling, so the prey file needs to be
  subsetted for the desired FA set (created above) and renormalized to
  sum to 1 prior to calculating the mean values (see code
  below). Example: in the code below the "preyFAs.csv" file has 3
  tombstone columns. The full FA set is extracted from the data frame
  (columns 4 onward), subsetted for the FA set in use and then
  renormalized to sum to 1. The modelling group names (the "Species"
  column in this case) are then added back to the subsetted and
  renormalized data (as the first column) and the average values are
  calculated using the MEANmeth function. Note that for the MEANmeth
  function to work the modelling group name must be in the first column.
    
```{r, eval=TRUE}
#full file
data(preyFAs)

#extract prey FA only from data frame and subset them for the FA set designated above
prey.sub=(preyFAs[,4:(ncol(preyFAs))])[fa.set]

#renormalize over 1
prey.sub=prey.sub/apply(prey.sub,1,sum) 

#extract the modelling group names from the full file
group=as.vector(preyFAs$Species)

#add modelling group names to the subsetted and renormalized FAs
prey.matrix=cbind(group,prey.sub)

#create an average value for the FA signature for each designated modelling group
prey.matrix=MEANmeth(prey.matrix) 
```
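
To see what the renormalize-then-average steps produce, here is a small base R sketch on made-up data. `aggregate` stands in for the averaging step purely for illustration; it is not necessarily how MEANmeth is implemented, and all object names are hypothetical.

```r
# Toy prey table: modelling group in the first column, then 2 FA columns
prey.toy <- data.frame(group = c("cod", "cod", "herring"),
                       c16.0 = c(60, 40, 30),
                       c18.0 = c(40, 60, 70))

# Renormalize each row to sum to 1
fa.cols <- prey.toy[, -1]
fa.cols <- fa.cols / rowSums(fa.cols)

# Mean signature per modelling group (one row per group)
means.toy <- aggregate(fa.cols, by = list(group = prey.toy$group), FUN = mean)
means.toy
```

Each row of `means.toy` still sums to 1, because every signature was renormalized before averaging.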

## Prey Lipid Content
* Mean lipid content by modelling group is calculated from the full
  prey file using the modelling group as a summary variable (see code
  below).
* **Note:** if no lipid content correction is going to be applied then
  a vector of '1's of length equal to the number of modelling groups
  is used as the vector instead, i.e. `FC=rep(1,nrow(prey.matrix))`

```{r, eval=TRUE}
#columns 2 and 3 hold the modelling group (Species) and the lipid content (lipid)
FC = preyFAs[,c(2,3)] 
FC = as.vector(tapply(FC$lipid,FC$Species,mean,na.rm=TRUE))
```
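
The sketch below, on made-up data, shows what the `tapply` call returns before it is converted to a plain vector. The object names are hypothetical.

```r
# Toy prey table with a modelling group and a lipid content column
lipid.toy <- data.frame(Species = c("herring", "cod", "cod", "herring"),
                        lipid   = c(8, 2, NA, 10))

# Mean lipid per group; na.rm = TRUE drops the missing value
fc.named <- tapply(lipid.toy$lipid, lipid.toy$Species, mean, na.rm = TRUE)
fc.named

# Groups come back sorted by factor level, so the names can be compared
# against the row names of the mean prey matrix
names(fc.named)

# Plain numeric vector, as in the chunk above
fc.toy <- as.vector(fc.named)
fc.toy
```

Inspecting the names before calling `as.vector` is a cheap way to confirm that the lipid values line up with the modelling groups in the prey matrix.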


## Calibration Coefficients
* Originating .csv file should contain 2 columns (with headers). The
  first contains the FA names, the second the value of the CC for each
  FA (see example file __"CC.csv"__).
* __IMPORTANT:__ the FAs in the CC.csv file __MUST__ be exactly the
  same as the FAs in the originating predator.csv file __AND__ they
  __MUST__ BE IN THE __*EXACT*__ SAME ORDER.
  
```{r, eval=TRUE}
data(CC)
cal.vec = CC[,2]
cal.mat = replicate(npredators, cal.vec)
```
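
Since the order of the FAs matters, a quick check such as the one sketched below may help. The toy objects `cc.toy` and `pred.fa` are made up for illustration.

```r
# Toy CC table and predator FA matrix (names and values are made up)
cc.toy <- data.frame(FA = c("c16.0", "c18.0", "c20.5w3"),
                     CC = c(0.9, 1.2, 1.0),
                     stringsAsFactors = FALSE)
pred.fa <- data.frame(c16.0 = c(50, 45), c18.0 = c(30, 35), c20.5w3 = c(20, 20))

# TRUE only if the CC rows list the same FAs in the same order as the columns
same.order <- identical(cc.toy$FA, colnames(pred.fa))
same.order

# One column of CCs per predator: rows are FAs, columns are predators
cal.toy <- replicate(nrow(pred.fa), cc.toy$CC)
dim(cal.toy)
```

`identical` is strict about both content and order, which matches the requirement stated above.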


# Run QFASA

```{r, eval=TRUE}
Q = p.QFASA(predator.matrix, prey.matrix, cal.mat, dist.meas, gamma=1, FC,
            start.val=rep(1,nrow(prey.matrix)), fa.set)
```

## p.QFASA Output
The QFASA output is a list with 2 components:

* Diet Estimates
* Additional Measures

### Diet Estimates
This is a matrix of the diet estimate for each predator (by rows, in
the same order as the input file) by the modelling groups (by column,
in the same order as the prey.matrix file). The estimates are
expressed as a proportion (they will sum to 1). In the code below the
Diet Estimate matrix is extracted from the QFASA output and the
modelling group identities and predator tombstone data (created above)
are added to the matrix:

```{r, eval=TRUE}
DietEst = Q$'Diet Estimates'

#estimates changed from proportions to percentages
DietEst = round(DietEst*100,digits=2)
DietEst = cbind(tombstone.info,DietEst)
``` 
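
As a sanity check, the proportions can be confirmed to sum to 1 before they are converted to percentages. The sketch below uses a made-up estimate matrix and made-up tombstone columns rather than real p.QFASA output.

```r
# Toy diet estimates: 2 predators (rows) by 3 prey groups (columns)
diet.toy <- matrix(c(0.70, 0.20, 0.10,
                     0.25, 0.25, 0.50),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(NULL, c("cod", "herring", "sandlance")))

# Proportions should sum to 1 for every predator
stopifnot(all(abs(rowSums(diet.toy) - 1) < 1e-8))

# Convert to percentages and attach toy tombstone columns
tomb.toy <- data.frame(SampleCode = c("P1", "P2"), SampleGroup = c("A", "B"))
diet.pct <- cbind(tomb.toy, round(diet.toy * 100, digits = 2))
diet.pct
```

Because one argument to `cbind` is a data frame, the result is a data frame with the tombstone columns first and one column per prey group after them.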

### Additional Measures
This is a list with one element per predator; each element is itself a list of four outputs:

* __ModFAS__: the values of the modelled FAs. These are expressed as proportions (they will sum to 1).
  
* __DistCont__: the contribution of each FA to the final minimized distance.

* __PropDistCont__: the contribution of each FA to the final minimized
  distance as a proportion of the total.
  
* __MinDist__: the final minimized distance.

In the code below the 'ldply' function from the plyr package is used
to compile the lists within 'Additional Measures' into a data frame
with one row per predator (in the same order as the input predator
matrix) and the values for each of the 4 lists arranged into
columns. The 'ldply' function automatically names the columns of the
data frame with a concatenation of the originating list name and the
FA name so that the 4 sets of outputs can be easily identified within
the data frame.
  
```{r, eval=TRUE}

Add.meas = ldply(Q$'Additional Measures', data.frame)
``` 
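
The flattening idea can also be sketched in base R on a made-up list. The structure of `add.toy` (named vectors inside each predator's list) is an assumption for illustration and may not match the real p.QFASA output exactly; this is not a replacement for the 'ldply' call above.

```r
# Toy stand-in for the 'Additional Measures' list: one element per predator,
# each holding named vectors (structure and values are assumptions)
add.toy <- list(
  list(ModFAS = c(c16.0 = 0.6, c18.0 = 0.4),
       DistCont = c(c16.0 = 0.01, c18.0 = 0.03),
       PropDistCont = c(c16.0 = 0.25, c18.0 = 0.75),
       MinDist = 0.04),
  list(ModFAS = c(c16.0 = 0.5, c18.0 = 0.5),
       DistCont = c(c16.0 = 0.02, c18.0 = 0.02),
       PropDistCont = c(c16.0 = 0.5, c18.0 = 0.5),
       MinDist = 0.04)
)

# unlist() joins the list name and the FA name (e.g. "ModFAS.c16.0");
# each predator then becomes one row of the combined data frame
flat.toy <- do.call(rbind,
                    lapply(add.toy, function(p) as.data.frame(t(unlist(p)))))
flat.toy
```

The resulting column names (for example `ModFAS.c16.0`) follow the same list-name-plus-FA-name pattern described above.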

**Note:** the function "conf.meth" will return approximate simultaneous confidence intervals for the diet estimates.