---
title: "Comparing many probability density functions"
author: Jakub Nowosad
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Comparing many probability density functions}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

The **philentropy** package has several mechanisms to calculate distances between probability density functions. The main one is to use the `distance()` function, which can compute 46 different distances/similarities between probability density functions (see `?philentropy::distance` and [a companion vignette](Distances.html) for details). Alternatively, it is possible to call each distance/dissimilarity function directly. For example, the `euclidean()` function computes the Euclidean distance, while `jaccard()` computes the Jaccard distance. The complete list of available distance measures is returned by the `philentropy::getDistMethods()` function.

Both of the above approaches have their pros and cons. The `distance()` function is more flexible, as it accepts any implemented distance measure and can return either a `matrix` or a `dist` object. It also has several defensive programming checks implemented and is therefore more appropriate for regular users. Single distance functions, such as `euclidean()` or `jaccard()`, can, on the other hand, be slightly faster, as they call the underlying C++ code directly.

Now, we introduce three new low-level functions that are intermediaries between `distance()` and the single distance functions. They are fairly flexible, allowing the use of any implemented distance measure, but are also usually faster than calling `distance()` (especially when a distance needs to be computed many times).
These functions are:

- `dist_one_one()` - expects two vectors (probability density functions), returns a single value
- `dist_one_many()` - expects one vector (a probability density function) and one matrix (a set of probability density functions), returns a vector of values
- `dist_many_many()` - expects two matrices (two sets of probability density functions), returns a matrix of values

Let's start testing them by attaching the **philentropy** package.

```{r}
library(philentropy)
```

## `dist_one_one()`

`dist_one_one()` is a lower-level equivalent of `distance()`. However, instead of accepting a numeric `data.frame` or `matrix`, it expects two vectors representing probability density functions. In this example, we create two vectors, `P` and `Q`.

```{r}
P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)
```

To calculate the Euclidean distance between them, we can use several approaches - (a) the built-in R `dist()` function, (b) `philentropy::distance()`, (c) `philentropy::euclidean()`, or (d) the new `dist_one_one()`.

```{r}
# install.packages("microbenchmark")
microbenchmark::microbenchmark(
  dist(rbind(P, Q), method = "euclidean"),
  distance(rbind(P, Q), method = "euclidean", test.na = FALSE, mute.message = TRUE),
  euclidean(P, Q, FALSE),
  dist_one_one(P, Q, method = "euclidean", testNA = FALSE)
)
```

All of them return the same, single value. However, as you can see in the benchmark above, some are more flexible, while others are faster.

## `dist_one_many()`

The role of `dist_one_many()` is to calculate distances between one probability density function (in the form of a `vector`) and a set of probability density functions (as rows in a `matrix`).

Firstly, let's create our example data.

```{r}
set.seed(2020-08-20)
P <- 1:10 / sum(1:10)
M <- t(replicate(100, sample(1:10, size = 10) / 55))
```

`P` is our input vector and `M` is our input matrix.

Distances between the `P` vector and the probability density functions in `M` can be calculated using several approaches.
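To make the idea concrete, here is a plain base R sketch of the row-by-row computation (the `apply()` loop below is our own illustration of the naive approach, not a philentropy function):

```{r}
# Naive base R approach: compute the Euclidean distance between P and
# each row of M, one row at a time.
set.seed(2020-08-20)
P <- 1:10 / sum(1:10)
M <- t(replicate(100, sample(1:10, size = 10) / 55))
d <- apply(M, 1, function(row) sqrt(sum((P - row)^2)))
length(d)  # one distance per row of M
```

This works, but it iterates in interpreted R code; `dist_one_many()` performs the same row-wise iteration in C++.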
For example, we could write a `for` loop (adding new code) or just use the existing `distance()` function and extract only one row (or column) from the results. `dist_one_many()` performs this calculation directly: it goes through each row in `M` and calculates a given distance measure between `P` and the values in that row.

```{r}
# install.packages("microbenchmark")
microbenchmark::microbenchmark(
  as.matrix(dist(rbind(P, M), method = "euclidean"))[1, ][-1],
  distance(rbind(P, M), method = "euclidean", test.na = FALSE, mute.message = TRUE)[1, ][-1],
  dist_one_many(P, M, method = "euclidean", testNA = FALSE)
)
```

`dist_one_many()` returns a vector of values. In this case, it is much faster than `distance()` and visibly faster than `dist()`, while allowing a wider range of distance measures to be used.

## `dist_many_many()`

`dist_many_many()` calculates distances between two sets of probability density functions (as rows in two `matrix` objects). This is useful when you have two different sets of distributions, say `M1` and `M2`, and you want to compute the distance from every distribution in `M1` to every distribution in `M2`. The main `distance()` function cannot do this, as it only computes pairwise distances *within* a single matrix.

Let's create two `matrix` examples.

```{r}
set.seed(2020-08-20)
M1 <- t(replicate(10, sample(1:10, size = 10) / 55))
M2 <- t(replicate(10, sample(1:10, size = 10) / 55))
```

`M1` is our first input matrix and `M2` is our second input matrix. I am not aware of any function built into base R that calculates distances between the rows of two matrices, and thus, to solve this problem, we can create our own - `many_dists()`...
```{r}
many_dists = function(m1, m2){
  r = matrix(nrow = nrow(m1), ncol = nrow(m2))
  for (i in seq_len(nrow(m1))){
    for (j in seq_len(nrow(m2))){
      x = rbind(m1[i, ], m2[j, ])
      r[i, j] = distance(x, method = "euclidean", mute.message = TRUE)
    }
  }
  r
}
```

... and compare it to `dist_many_many()`. `dist_many_many()` is fully implemented in C++ and can use multiple threads. For this comparison, we will use the default `num.threads = NULL`, which uses 2 threads unless the `RCPP_PARALLEL_NUM_THREADS` environment variable is set. There are trade-offs when selecting the number of threads: using more threads than the workload needs incurs additional overhead and may not reach the result any faster than a sequential approach.

```{r}
# install.packages("microbenchmark")
bm <- microbenchmark::microbenchmark(
  `many_dists` = many_dists(M1, M2),
  `dist_many_many` = dist_many_many(M1, M2, method = "euclidean", testNA = FALSE)
)
bm
```

Both `many_dists()` and `dist_many_many()` return a matrix.

```{r echo=FALSE}
x <- aggregate(bm$time, by = list(expr = bm$expr), FUN = mean)
y <- x$x
names(y) <- x$expr
```

The above benchmark concludes that `dist_many_many()` is about `r round(y["many_dists"] / y["dist_many_many"], -1)` times faster than our custom `many_dists()` approach. If `many_dists()` computed each distance manually in R, instead of calling the optimized `distance()` function, which runs compiled code, the difference would be even larger.
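As a side note, base R can emulate the cross-matrix case by stacking both sets, computing all pairwise distances with `dist()`, and then slicing out the `M1`-by-`M2` block. A sketch of this workaround (our own illustration, and a wasteful one: it also computes all the within-set distances that `dist_many_many()` never touches):

```{r}
set.seed(2020-08-20)
M1 <- t(replicate(10, sample(1:10, size = 10) / 55))
M2 <- t(replicate(10, sample(1:10, size = 10) / 55))
# Stack both sets, compute every pairwise Euclidean distance, and keep
# only the block with M1 rows against M2 rows.
full <- as.matrix(dist(rbind(M1, M2), method = "euclidean"))
cross <- full[seq_len(nrow(M1)), nrow(M1) + seq_len(nrow(M2))]
dim(cross)  # 10 x 10, same shape as the many_dists()/dist_many_many() output
```

For small inputs this is fine, but the wasted within-set work grows quadratically with the number of rows, which is why a dedicated cross-matrix function pays off.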