--- title: "Introduction to the SONO package" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to the SONO package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r logo, echo=FALSE, out.width="20%", fig.align="right"} knitr::include_graphics("../man/figures/logo.png") ``` The SONO (Scores Of Nominal Outlyingness) package is used to compute scores of outlyingness for data sets consisting of nominal features. The scores are computed using the framework of [Costa, E., & Papatsouma, I. (2025)](https://arxiv.org/abs/2408.07463). SONO also includes several functions that compute evaluation metrics typically used for outlier identification algorithms that produce scores. The main function of the package is `sono` which computes the scores. The score of nominal outlyingness for an observation $\boldsymbol{x}_i$ is given by: $$s(\boldsymbol{x}_i)=\sum_{\substack{d \subseteq \boldsymbol{x}_{i}: \\ \text{supp}(d) \notin (\sigma_d, n], \\ \lvert d \rvert \leq \mathrm{MAXLEN}}} \frac{\sigma_d}{\text{supp}(d) \times \lvert d \rvert^r}, \ r> 0, \ i=1,\dots,n,$$ for highly infrequent itemsets and: $$s(\boldsymbol{x}_i)=\sum_{\substack{d \subseteq \boldsymbol{x}_{i}: \\ \text{supp}(d) \notin [0, \sigma_d), \\ \lvert d \rvert \leq \mathrm{MAXLEN}}} \frac{\text{supp}(d)}{\sigma_d \times \left( \text{MAXLEN} - \lvert d \rvert + 1 \right)^r}, \ r> 0, \ i=1,\dots,n,$$ for highly frequent itemsets. In the above, $\text{supp}(d)$ is the support of itemset $d$, $\sigma_d$ is the the maximum/minimum support threshold and $\text{MAXLEN}$ is the maximum length of sequences considered, while $r$ is an exponent term to be determined by the user. The `sono` function only requires two input arguments; a data frame `data` that needs to contain factors only and a list of probability vectors `probs`. Each element in `probs` must be a probability vector consisting of as many elements as the number of unique factors in the corresponding column of `data`. The length of `probs` must be equal to `ncol(data)`. Additional input arguments include `alpha`, `r`, `MAXLEN`, `frequent` and `verbose`. These are set by default equal to 0.01, 2, 0, `FALSE` and `TRUE`, respectively but they can be changed by the user. `alpha` is the significance level of the confidence interval constructed for determining the minimum/maximum support threshold values $\sigma_d$. The exponent `r` is as in the definition of the score and same for `MAXLEN`. Setting `MAXLEN = 0` (the default option) triggers an automatic search for the value of `MAXLEN` so as to ensure that no redundant computations are being done, while accounting for the sparsity in the contingency tables introduced by considering several nominal variables. This can be estimated beforehand using the `MAXLEN_est` function, so that the user can set it to a lower value (but not larger) if needed. Finally, `frequent = FALSE` considers highly infrequent itemsets as outlying, whereas `frequent = TRUE` treats highly frequent itemsets as more likely to be outliers. Progress messages are printed by setting `verbose = TRUE`. Below, we generate an artificial data set and illustrate how `sono` works. ```{r setup} library(SONO) # Generate data set.seed(1) X <- sample(c(1:3), 500, replace = TRUE, prob = c(0.2, 0.3, 0.5)) X <- cbind(X, sample(c(1:2), 500, replace = TRUE, prob = c(0.1, 0.9))) X <- cbind(X, sample(c(1:5), 500, replace = TRUE, prob = rep(0.2, 5))) X <- data.frame(X) # Ensure every column is a factor for (i in 1:ncol(X)){ X[, i] <- factor(X[, i]) } # Run SONO with probability vectors matching data generating process prob_vecs <- list(c(0.2, 0.3, 0.5), c(0.1, 0.9), rep(0.2, 5)) # Run SONO with true probabilities and r = 2 sono_res1 <- sono(data = X, probs = prob_vecs, alpha = 0.01, r = 2, MAXLEN = 0, frequent = FALSE, verbose = TRUE) # See summary of scores summary(sono_res1[[2]][, 2]) ``` As expected, since the probabilities match the ones used to generate the data, all scores are equal to 0. The rest of the output elements of `sono` are `MAXLEN`, the matrix of variable contributions and the Nominal Outlyingness Depth for each observation. These additional concepts can be used to assess which and how many variables contribute to the outlier score of each observation. In order to showcase the rest of the functions of the `SONO` package, we use misspecified probabilities for `probs`. ```{r msspecification} # Run SONO with misspecified probability vectors prob_vecs_mis <- list(c(0.4, 0.4, 0.2), c(0.9, 0.1), rep(0.2, 5)) # Run SONO with true probabilities and r = 2 sono_res2 <- sono(data = X, probs = prob_vecs_mis, alpha = 0.01, r = 2, MAXLEN = 0, frequent = FALSE, verbose = TRUE) # See summary of scores summary(sono_res2[[2]][, 2]) ``` Based on the misspecified probability vectors, we expect that the first 2 levels of the first variable are infrequent, as well as the first level of the second variable. The largest misspecification is for the latter, so we expect these observations to have the largest scores; we print summaries for the scores of each observation possessing these aforementioned levels, confirming our claim. Also notice how the minimum score and the mean score for observations possessing the first level of the first nominal variable are higher than the respective values for the second level of the same feature; this is because the misspecification is larger in the former case. ```{r summaries} # See summary of scores for each case summary(sono_res2[[2]][which(X[, 1] == 1), 2]) summary(sono_res2[[2]][which(X[, 1] == 2), 2]) summary(sono_res2[[2]][which(X[, 2] == 1), 2]) ``` We can also visualise the contribution matrix for some observations. The function `vis_contribs` offers this possibility. The arguments include `contribs_mat` for the matrix of contributions, `subset` which only plots the rows for specific observations (default is `NULL` plots all rows of the matrix) and `scale` for scaling the values. The possible scaling options are `"none"` for no scaling (which is the default option), `"row"` for row-wise scaling so that each row sums to a unit and `"max"`, so that each row has a maximum value of a unit. Here we apply max scaling and only plot the observations for which a non-zero score is obtained. ```{r contributions_plot} # Plot matrix of contributions vis_contribs(contribs_mat = sono_res2[[3]], subset = which(sono_res2[[2]][, 2] > 0), scale = "max") ``` As can be seen, the third variable does not contribute at all to the clustering. The largest contribution is for the second variable, which is the one with the largest misspecification. We finally illustrate how the evaluation functions included in the package work. The first one is `avg_rank_outs`, which computes the average rank of outliers. We will only treat the observations for which `V2 = 1` as outlying. The function takes as input arguments the `scores`, a vector of outlier indices `outs` and a way to handle ties in scores. The default setting for `ties` is `"min"` (see the documentation for more options). We also compute the proportion of outliers correctly detected at the Top $K\%$ scores; this is the function `recall_at_k` that takes as input argument the `scores`, the outlier indices `outs` and a `grid` for the values of $K$. The ROC-AUC is also computed using the `roc_auc` function, with the exact same input arguments. We see that the ROC-AUC is constantly equal to a unit, showing that the outliers are indeed ranked so that they have the largest outlier scores. ```{r eval_funs} outliers <- which(X[, 2] == 1) # Compute average rank of outliers avg_rank <- avg_rank_outs(scores = sono_res2[[2]][, 2], outs = outliers, ties = "min") cat('Average rank of outliers:', avg_rank, '\n') grid_vals <- c(1, 2.5, seq(5, 100, by = 5))/100 recall <- recall_at_k(scores = sono_res2[[2]][, 2], outs = outliers, grid = grid_vals) for (i in 1:length(grid_vals)){ cat('Recall at', grid_vals[i], ':', recall[i], '\n') } roc_auc_vals <- roc_auc(scores = sono_res2[[2]][, 2], outs = outliers, grid = grid_vals) for (i in 1:length(grid_vals)){ cat('ROC AUC at', grid_vals[i], ':', roc_auc_vals[i], '\n') } ```