---
title: "Get started"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Get started}
%\VignetteEncoding{UTF-8}
%\VignetteEngine{knitr::rmarkdown}
editor_options:
markdown:
wrap: 72
bibliography: references.bib
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
This is a brief introduction to AMBI index calculations with ambiR.
## Structuring species observation data
Species counts (or abundance) should be organized in a data frame
arranged in *long* format. That is, species names are in one column and
species counts in another column. If the data represents several
stations and/or if there are replicates, then the data should include
columns with this information.
Looking at the `test_data` example dataset included with the ambiR package,
one can see how the data should be arranged:
```{r test_data}
library(ambiR)
head(test_data)
```
If your data look like this, you can go directly to [Calculate AMBI]. If
not, the following examples show how to reorganize your data.
### Species names in columns {#data-with-species-names-in-columns}
Here is an example dataframe where there are counts for species in
separate columns:
```{r transpose_wide, include=FALSE}
library(dplyr)
library(tidyr)
wide_data_species <- test_data %>%
pivot_wider(names_from = "species",
values_from = "count",
values_fill = 0)
```
```{r data_wide_spec}
head(wide_data_species)
```
To arrange the data in the correct form, use `tidyr::pivot_longer()`:
```{r pivot_species}
# columns 1 and 2 contain station and replicate information
# so, select all columns from 3 to be pivoted
long_data <- wide_data_species %>%
pivot_longer(cols = 3:ncol(wide_data_species),
names_to = "species",
values_to = "count")
head(long_data)
```
### Stations and replicates in columns
Here is an example where each column contains species counts for a
station and replicate. The first row of the table contains the station
ID *1, 2, ...* and the second row contains the replicate ID *a, b, ...*.
The first column of the table contains species names.
```{r transpose_wide_stns, include=FALSE}
df <- test_data %>%
pivot_wider(names_from = c("station","replicate"),
values_from = "count", values_fill = 0)
stns <- names(df)[2:ncol(df)]
stns <- stns %>%
sapply(strsplit,"_")
reps <- stns %>%
sapply(function(x){x[2]})
stns <- stns %>%
sapply(function(x){x[1]})
n <- 1+length(stns)
df2 <- matrix(nrow=2, ncol=n) %>%
as.data.frame()
df2[1,2:n] <- stns
df2[2,2:n] <- reps
names(df) <- names(df2)
df <- df %>%
mutate(across(all_of(names(df)), as.character))
df <- df2 %>%
bind_rows(df)
names(df) <- rep("", ncol(df))
wide_data_stns <- df
```
```{r data_wide_stns}
head(wide_data_stns)
```
Note that if the if the observation data *only* has stations *or*
replicates, then rearranging the data can be done in one step, as in the
[previous example](#data-with-species-names-in-columns). But in this
case, the station and replicate information are in separate rows. The
restructuring process is a little more complicated.
#### Combine station and replicate values
Before we can use a pivot function, we need to combine the station ID
and replicate ID for each column into a single value. In this example,
each station ID and replicate ID are joined into a single character
value, with an underscore to separate them *`_`*. The underscore will be
used again after pivoting the table to identify where to split the
combined station/replicate values back into separate values again.
If there are station names which already contain underscores, then
another suitable character should be used when joining and splitting
station/replicate IDs.
```{r combine_stn_rep}
sep_character <- "_"
# get the station IDs from row 1, excluding column 1 (this contains species names)
stations <- wide_data_stns[1, 2:ncol(wide_data_stns)]
# get the replicate IDs from row 2
replicates <- wide_data_stns[2, 2:ncol(wide_data_stns)]
# combine the station and replicate for each column using an underscore
station_replicate <- paste(stations, replicates, sep=sep_character)
# now we have extracted the station/replicate information, we can drop the
# first two rows of the data frame
wide_data_stns <- wide_data_stns[3:nrow(wide_data_stns),]
# apply the station_replicates as column names
names(wide_data_stns) <- c("species", station_replicate)
head(wide_data_stns)
```
#### Transpose using the combined station/replicate variable
We can see that the column names now contain the combined station and
replicate information. We are ready to transpose the data.
```{r pivot_stations}
# column 1 contains species names
# so, select all columns from 2 to be pivoted
long_data <- wide_data_stns %>%
pivot_longer(cols = 2:ncol(wide_data_stns),
names_to = "stn_rep",
values_to = "count")
head(long_data)
```
#### Retrieve station and replicate variables from the transposed data
Now we can split the *stn_rep* column into separate columns for
*station* and *replicate*, We will use the underscore we introduced
earlier to identify where the split should occur:
```{r split_stations}
long_data <- long_data %>%
separate_wider_delim(cols="stn_rep",
delim = sep_character,
names=c("station", "replicate"))
head(long_data)
```
Now we are ready to calculate AMBI...
## Calculate AMBI
We have now ensured that our species abundance/count data have the
correct structure, as in the example `test_data` provided:
```{r test_data2}
head(test_data)
```
Call the `AMBI()` function:
```{r run_ambi}
res <- AMBI(test_data, by="station", var_rep="replicate",
var_species="species", var_count="count")
```
### Results
The `AMBI()` function returns a list of three dataframes:
- `.$AMBI`
- `.$AMBI_rep`
- `.$matched`
`.$AMBI`- the main results with the `AMBI` index calculated for each
unique combination of `by` variables, in our case the results are per
`station`.
In addition to the `AMBI` index, the results also include the *Shannon diversity
index* `H'` and the *Species richness* `S`, the three
metrics which are necessary to calculate [M-AMBI](#calculate-m-ambi).
```{r show_ambi}
res$AMBI
```
`.$AMBI_rep` - if the observations include replicates, then the
function also returns results calculated for each replicate, within each
unique combination of `by` variables:
```{r show_ambi_rep}
res$AMBI_rep
```
`.$matched` - for each observation, this dataframe shows which species
in the `AMBI` list the observed species was matched with, if any. It
also shows the `AMBI` species group assigned. This dataframe has the
same number of rows as the input data.
```{r show_matched}
head(res$matched)
```
For more information about the principles underlying the `AMBI` calculations and
the grouping of species according to sensitivity to pollution,
see `vignette("background")`.
## Calculate M-AMBI {#calculate-m-ambi}
Calculate M-AMBI the multivariate AMBI index, based on the three
separate species diversity metrics:
- AMBI index `AMBI`.
- Shannon diversity index `H`
- Species richness `S`.
All three indices required to calculate `MAMBI()`are included in the
results returned by the `AMBI()` function.
### Limit values
In addition to index values calculated from observed species data, the
M-AMBI factor analysis requires values defining the limits for the three
metrics, corresponding to the best and worst possible conditions.
See @MUXIKA200716 for more details.
The default limit values used by `MAMBI()` are:
```{r limits_mambi}
limits_AMBI <- c("bad" = 6, "high" = 0)
limits_H <- c("bad" = 0, "high" = NA)
limits_S <- c("bad" = 0, "high" = NA)
```
By default, upper limit values for `H` and `S` are not provided (`= NA`).
If they are not provided explicitly, then the maximum values found in
the input data will be used. The results returned by `MAMBI()` show the
limits used.
### Status class boundaries
`MAMBI()` also returns the status class (*Bad, Poor, Moderate, Good* or
*High*) for each M-AMBI index value. To do this, it compares the
calculated M-AMBI index value with values defining the boundaries
between status classes.
- `PB` - *Poor* / *Bad*
- `MP` - *Moderate* / *Poor*
- `GM` - *Good* / *Moderate*
- `HG` - *High* / *Good*
The default values for the class boundaries are:
```{r bounds_mambi}
bounds <-c("PB" = 0.2, "MP" = 0.39, "GM" = 0.53, "HG" = 0.77)
```
### Call `MAMBI()`
We call `MAMBI()` using the previously generated `AMBI` results:
```{r calc_mambi}
res_mambi <- MAMBI(res$AMBI, var_H = "H", var_S = "S", var_AMBI = "AMBI")
```
### M-AMBI results
In addition to retaining the values for `AMBI`. `H` and `S`, the
results include the following information:
- `x`, `y`, `z` - factor scores giving coordinates in the new factor
space.
- `MAMBI` - the calculated M-AMBI index value
- `Status` - the status class for the M-AMBI index value
- `EQR` - the normalised [EQR](#eqr-values) index
The dataframe returned contains 2 more rows than the input data. Our
input data contained 3 rows (1 for each `station`). The results contain
5 rows. The 2 *additional* rows show the limit values for each of the
three metrics used in the M-AMBI calculations.
```{r mambi_results}
res_mambi %>%
select(station, AMBI, H, S, x, y, z, MAMBI, Status, EQR)
```
#### EQR values {#eqr-values}
As well as the status class, the function also returns a normalised
index value from 0 to 1 (*EQR*), calculated using the `bounds` boundary
values for the M-AMBI index. The following EQR values correspond to
status class boundaries:
- `0.2` - *Poor* / *Bad*
- `0.4` - *Moderate* / *Poor*
- `0.6` - *Good* / *Moderate*
- `0.8` \_ *High* / *Good*
For example, the M-AMBI value for station 3 is `0.478`. This lies
between the M-AMBI value corresponding to the *Moderate/Poor* status
class boundary (`"MP" = 0.39`) and the M-AMBI value corresponding to the
*Good/Moderate* status class boundary (`"GM" = 0.53`).
The normalised EQR value is given by linear interpolation:
$$ EQR = 0.4 + (0.6 - 0.4) \frac{0.478 - 0.39}{0.53 - 0.39 } $$
## References