--- title: "Introduction to ROCnGO" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to ROCnGO} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ROCnGO is an R package which allows to analyze the performance of a classifier by using receiver operating characteristic ($ROC$) curves. Conventional $ROC$ based analyses just tend to use area under $ROC$ curve ($AUC$) as a metric of global performance, besides this functionality, the package allows deeper analysis options by calculating partial area under $ROC$ curve ($pAUC$) when prioritizing local performance is preferred. Furthermore, ROCnGO implements different $pAUC$ transformations described in literature which: * Make local performance interpretation easier. * Allow to work with $ROC$ curves which are not completely concave or not at all (improper). * Provide additional discrimination power when comparing classifiers with identical local performance (equal $pAUC$). This document provides an introduction to ROCnGO tools and workflow to study the global and local performance of a classifier. ## Prerequisites In order to reproduce the example, following packages are needed: ```{r setup, warning=FALSE, message=FALSE} library(ROCnGO) library(dplyr) library(forcats) ``` ## Data To explore basic tools in the package we will be using `iris` dataset. The dataset contains 5 variables for 150 flowers of 3 different species: *setosa*, *versicolor* and *virginica*. For the purpose of simplicity, we will only work with a subset of `iris`, considering only *setosa* and *virginica* species. In the following sections, performance of different variables to classify cases in the different species will be evaluated. ```{r} # Filter cases of versicolor species iris_subset <- as_tibble(iris) %>% filter(Species != "versicolor") iris_subset ``` ## Global performance ### Calculate ROC curve The foundation of this type of analyses implies to plot the $ROC$ curve of a classifier. This type of curves represent a classifier probability of correctly classify a case with a condition of interest, also known as *true positive rate* or $\text{Sensitivity}$ ($TPR$), and the complementary probability of correctly classify a case without the condition; also known as *false positive rate*, $1 - \text{Specificity}$, or $1 - TNR$, ($FPR$). When working with a classifier that returns a series of numeric values, it can be complex to say when it is classifying a case as having the condition of interest (positive) or not (negative). To solve this problem, $ROC$ curves represent $(FPR, TPR)$ points considering hypothetical thresholds ($c$) where a case is considered as positive if its value is higher than the defined threshold ($X > c$). These curve points can be calculated by using `roc_points()`. As most functions in the package, it takes a dataset, a data frame, as its first argument. The second and third argument refer to variables in the data frame, corresponding the variable that will be used as a classifier (`predictor`) and the response variable we want to predict (`response`). For example, we can calculate $ROC$ points for Sepal.Length as a classifier of *setosa* species. 
These curve points can be calculated, for all thresholds at once, by using `roc_points()`. Like most functions in the package, it takes a dataset (a data frame) as its first argument. The second and third arguments refer to variables in the data frame: the variable that will be used as a classifier (`predictor`) and the response variable we want to predict (`response`). For example, we can calculate the $ROC$ points for Sepal.Length as a classifier of the *setosa* species.

```{r, warning=FALSE}
# Calculate ROC points for Sepal.Length
points <- roc_points(
  data = iris_subset,
  predictor = Sepal.Length,
  response = Species
)
points

# Plot points
plot(points$fpr, points$tpr)
```

As we can see, Sepal.Length does not perform very well at predicting when a flower belongs to the *setosa* species; in fact, it is the other way around: the lower the Sepal.Length, the more likely we are looking at a *setosa* flower. This can be tested by changing the condition of interest to *virginica*.

### Changing condition of interest

By default, the condition of interest is automatically set to the first value in `levels(response)`, so we can change it by changing the order of the levels in the data.

```{r, warning=FALSE}
# Check response levels
levels(iris_subset$Species)

# Set virginica as first value in levels
iris_subset$Species <- fct_relevel(iris_subset$Species, "virginica")
levels(iris_subset$Species)

# Plot ROC curve
points <- roc_points(
  data = iris_subset,
  predictor = Sepal.Length,
  response = Species
)
plot(points$fpr, points$tpr)
```

## Local performance

Sometimes a task may require prioritizing, e.g., high sensitivity over global performance. In these scenarios, it is preferable to work in specific regions of the $ROC$ curve.

We can calculate the points in a specific region using `calc_partial_roc_points()`. The function takes the same arguments as `roc_points()`, plus `lower_threshold`, `upper_threshold` and `ratio`, which delimit the region in which we want to work. For example, if we need to work under high-sensitivity conditions, we could check the points in the region $TPR \in (0.9, 1)$.

```{r, warning=FALSE}
# Calculate partial ROC points
p_points <- calc_partial_roc_points(
  data = iris_subset,
  predictor = Sepal.Length,
  response = Species,
  lower_threshold = 0.9,
  upper_threshold = 1,
  ratio = "tpr"
)
p_points

# Plot partial ROC curve
plot(p_points$fpr, p_points$tpr)
```

## Automating analysis

### Performance metrics

When working with a large number of classifiers, it can be difficult to check each $ROC$ curve individually. In these scenarios, metrics such as $AUC$ and $pAUC$ may be of more interest. Thus, by using the function `summarize_predictor()` we can obtain an overview of the performance of a classifier. For example, we can assess the performance of Sepal.Length over a high-sensitivity region, $TPR \in (0.9, 1)$, and a high-specificity region, $FPR \in (0, 0.1)$.

```{r, warning=FALSE}
# Summarize predictor in high-sensitivity region
summarize_predictor(
  data = iris_subset,
  predictor = Sepal.Length,
  response = Species,
  threshold = 0.9,
  ratio = "tpr"
)

# Summarize predictor in high-specificity region
summarize_predictor(
  data = iris_subset,
  predictor = Sepal.Length,
  response = Species,
  threshold = 0.1,
  ratio = "fpr"
)
```

Besides $AUC$ and $pAUC$, the function also returns other partial indexes derived from $pAUC$ which are easier to interpret than the raw $pAUC$.

Furthermore, if we are interested in computing these metrics simultaneously for several classifiers, `summarize_dataset()` can be used, which also provides some aggregate metrics of the analysed classifiers.

```{r, warning=FALSE}
summarize_dataset(
  data = iris_subset,
  response = Species,
  threshold = 0.9,
  ratio = "tpr"
)
```
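As a quick sanity check on these summaries, the global $AUC$ can also be approximated directly from the `points` computed in the previous section using the trapezoidal rule. This is a generic numerical approximation, not necessarily the method ROCnGO uses internally, but it should be comparable to the reported $AUC$.

```{r}
# Trapezoidal-rule approximation of the AUC from the ROC points
# computed above; a generic numerical estimate, not necessarily
# ROCnGO's internal method
ord <- order(points$fpr, points$tpr)
fpr <- points$fpr[ord]
tpr <- points$tpr[ord]
sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
```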
### Plotting

As we have seen, by using the output of `roc_points()` we can plot the $ROC$ curve. Nevertheless, these plots can also be generated using the `plot_*()` and `add_*()` functions, which provide further options to customize the plot for classifier comparison. For example, we can plot the $ROC$ points of Sepal.Length in this way.

```{r, warning=FALSE}
# Plot ROC points of Sepal.Length
sepal_length_plot <- plot_roc_points(
  data = iris_subset,
  predictor = Sepal.Length,
  response = Species
)
sepal_length_plot
```

Now, by using the `+` operator, we can add further elements to the plot: for example, including the chance line, adding the $ROC$ points or curves of other classifiers, etc.

```{r, warning=FALSE}
sepal_length_plot +
  add_roc_curve(
    data = iris_subset,
    predictor = Sepal.Width,
    response = Species
  ) +
  add_roc_points(
    data = iris_subset,
    predictor = Petal.Width,
    response = Species
  ) +
  add_partial_roc_curve(
    data = iris_subset,
    predictor = Petal.Length,
    response = Species,
    ratio = "tpr",
    threshold = 0.7
  ) +
  add_threshold_line(
    threshold = 0.7,
    ratio = "tpr"
  ) +
  add_chance_line()
```
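Since the plot composes with the `+` operator, it behaves like a ggplot object. Assuming it is one (an assumption based on that behaviour, not something stated in this vignette), standard ggplot2 tooling should apply for final styling and export; the chunk below is left unevaluated for that reason.

```{r, eval=FALSE}
# Assumes the returned plot is a ggplot object (suggested by the use
# of `+` above); if so, ggplot2 theming and saving apply as usual
library(ggplot2)

sepal_length_plot +
  add_chance_line() +
  labs(title = "ROC points for Sepal.Length", x = "FPR", y = "TPR") +
  theme_minimal()

# ggsave("roc_sepal_length.png", width = 5, height = 4)
```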