--- title: "Selecting condition of interest" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Selecting condition of interest} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ROC based analyses aim to evaluate **binary classification performance** of a classifier. In other words, this type of analyses evaluate a classifier performance on differentiating two different outcomes, classes or categories. In real world scenarios, classification processes usually present more than two possible outcomes. Thus, these scenarios can be dichotomized by selecting one outcome as the **condition of interest**, or the one to be predicted, and others as not being it. Following vignette aims to show: * How ROCnGO select a condition of interest or its absence. * How this condition be selected through `.condition` argument. * How an outcome can be manually selected as the condition of interest. We'll start by loading ROCnGO and some other libraries which will help in the analysis. ```{r setup, message=FALSE, warning=FALSE} library(ROCnGO) library(dplyr) ``` ## Selection of the condition of interest As mentioned before, the outcomes of these analyses can be dichotomized in being a condition of interest $(D=1)$ or not $(D=0)$. In this way, ROCnGO internally transform the variable with each case outcome (`response`) to a factor of values 1 and 0, representing presence or absence of the condition. Taking the following example with three different outcomes, if we considered *setosa* as the condition of interest, the following factor would be generated. | Case | Response | Factor | |------ |:-----------:|--------:| | 1 | Setosa | 1 | | 2 | Versicolor | 0 | | 3 | Virginica | 0 | `response` may be of different types, so in order to select by default which class will correspond to the condition of interest among its values, library functions follow some criteria based on the variable type: * **integer**. When working with an integer vector, functions will consider the smallest one as the class to predict. * **character**. When working with a character vector, functions will consider the first value after using `sort()` over all posible options. * **factor**. When working with a factor variable, functions will select first class in `levels()`. All other classes not identified as the class to predict will be combined into a common category, labelled as 0. ## `.condition` argument Sometimes, default criteria used by functions may not be desirable. Thus, if we want to change the category identified as the condition of interest we can use `.condition` argument. This argument takes as an input one of the values of `response`, setting it as the condition of interest of the classifier. ### Examples These behaviours can be tested with the following examples. In the first place we will create an small dataset by using a small subset of `iris` dataset. ```{r warning=FALSE} # Create a small subset of iris with 5 random flowers of each species iris_subset <- as_tibble(iris) %>% group_by(Species) %>% slice_sample(n = 5) %>% ungroup() iris_subset ``` Once we have created our dataset, we can check the performance of the different variables as predictors for the species, for this task we may use `summarize_dataset()` function. ```{r, warning=FALSE} # Check levels in Species levels(iris_subset$Species) # Summarize dataset classifiers iris_results <- summarize_dataset( iris_subset, response = Species, ratio = "tpr", threshold = 0.9 ) iris_results$data ``` As we may see Sepal.Width scores the best performance in the dataset, at least for *setosa* species. As we have mentioned before, this class has been selected as the condition of interest since it is the first element in species levels. Furthermore, the performance of Sepal.Width as a *setosa* classifier may be addressed since it presents slightly higher scores. Now, if we want to repeat the analysis but considering *virginica* as the species of interest, we can consider `.condition` argument. ```{r, warning=FALSE} # Summarize dataset classifiers with virginica species as D=1 virginica_results <- summarize_dataset( iris_subset, response = Species, ratio = "tpr", threshold = 0.9, .condition = "virginica" ) virginica_results$data ``` As we may see, new results highly differ from previous ones. Now Sepal.Length, Petal.Length and Petal.Width behave as better classifiers instead of Sepal.Width. In the same way, these results can be qualitatively matched with values in dataset, where variables score higher for this species. ## Manual selection of the condition of interest Sometimes, it may be more useful to select manually the condition of interest. This may be the case, e.g. when working with a variable type than cannot be easily treated. In order to manually select this condition, we could simply transform `response` to another type that can be recognized by the library, even `.condition` may be used to specify which class to use. Alternatively, we can transform `response` to a factor of 0 and 1 values, where its first item in `levels()` will be 0. Library recognizes this variable as not needing any treatment, so it can be used to easily define this new responses. ### Examples We can check this manual selection with the following example. In this scenario, we will be supposing that we cannot make directly calculations over Species and we will need to define new variables to do it. ```{r warning=FALSE} # Create new variables to evaluate "virginica" species classifiers iris_subset <- iris_subset %>% mutate( Species_int = ifelse(Species == "virginica", 2L, 1L), Species_fct = factor( ifelse(Species == "virginica", 1, 0), levels = c(0, 1) ) ) # Check new variables iris_subset[, c("Species", "Species_int", "Species_fct")] ``` Now we can evaluate the classifier performance. ```{r warning=FALSE} # Select predictors predictors <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width") # Check performance of virginica classifiers with .condition = 2 int_results <- summarize_dataset( iris_subset, predictors = predictors, response = Species_int, ratio = "tpr", threshold = 0.9, .condition = 2 ) int_results$data # Check performance of virginica classifiers with factor fct_results <- summarize_dataset( iris_subset, predictors = predictors, response = Species_fct, ratio = "tpr", threshold = 0.9 ) fct_results$data ``` As we may see results for each scenario correspond to ones obtained in the previous section, where we evaluated Species variable using `.condition = "virginica"` directly.