---
title: "Comparing many probability density functions"
author: Jakub Nowosad
date: 2021-08-20
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Comparing many probability density functions}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

The **philentropy** package has several mechanisms to calculate distances between probability density functions.
The main one is the `distance()` function, which can compute 46 different distances/similarities between probability density functions (see `?philentropy::distance` and [a companion vignette](Distances.html) for details).
Alternatively, it is possible to call each distance/dissimilarity function directly.
For example, the `euclidean()` function computes the Euclidean distance, while `jaccard()` computes the Jaccard distance.
The complete list of available distance measures is returned by the `philentropy::getDistMethods()` function.
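
As a quick illustration of the direct route, the sketch below lists the first few available measures and calls two single distance functions on a pair of example vectors. The vectors `p_example` and `q_example` are made up for this illustration only.

```{r}
# Two small example probability vectors (illustrative values only)
p_example <- 1:10 / sum(1:10)
q_example <- 20:29 / sum(20:29)

# Names of the first few distance measures known to the package
head(philentropy::getDistMethods())

# Call two single distance functions directly; testNA = FALSE skips the NA check
philentropy::euclidean(p_example, q_example, testNA = FALSE)
philentropy::jaccard(p_example, q_example, testNA = FALSE)
```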

Both of the above approaches have their pros and cons. 
The `distance()` function is more flexible as it allows users to use any distance measure and can return either a `matrix` or a `dist` object. 
It also has several defensive programming checks implemented, and thus, it is more appropriate for regular users.
Single distance functions, such as `euclidean()` or `jaccard()`, can, on the other hand, be slightly faster as they directly call the underlying C++ code.
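
To make the two return types of `distance()` concrete, here is a minimal sketch using three made-up probability vectors stored as rows of `x_example`. It assumes the `as.dist.obj` argument of `distance()` is available in the installed version of the package.

```{r}
# Three example probability vectors as rows of a matrix (illustrative values only)
x_example <- rbind(1:10 / sum(1:10), 20:29 / sum(20:29), 30:39 / sum(30:39))

# Default behavior: a symmetric distance matrix
m_example <- philentropy::distance(x_example, method = "euclidean", mute.message = TRUE)
class(m_example)

# With as.dist.obj = TRUE: an object of class "dist"
d_example <- philentropy::distance(x_example, method = "euclidean",
                                   as.dist.obj = TRUE, mute.message = TRUE)
class(d_example)
```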

Now, we introduce three new low-level functions that are intermediaries between `distance()` and single distance functions. 
They are fairly flexible, allowing the use of any implemented distance measure, and are usually faster than calling `distance()`, especially when they need to be called many times.
These functions are:

- `dist_one_one()` - expects two vectors (probability density functions), returns a single value
- `dist_one_many()` - expects one vector (a probability density function) and one matrix (a set of probability density functions), returns a vector of values
- `dist_many_many()` - expects two matrices (two sets of probability density functions), returns a matrix of values

Let's start testing them by attaching the **philentropy** package.

```{r}
library(philentropy)
```

## `dist_one_one()`

`dist_one_one()` is a lower-level equivalent to `distance()`.
However, instead of accepting a numeric `data.frame` or `matrix`, it expects two vectors representing probability density functions.
In this example, we create two vectors, `P` and `Q`.

```{r}
P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)
```

To calculate the Euclidean distance between them, we can use several approaches: (a) the built-in R `dist()` function, (b) `philentropy::distance()`, (c) `philentropy::euclidean()`, or (d) the new `dist_one_one()`.

```{r}
# install.packages("microbenchmark")
microbenchmark::microbenchmark(
  dist(rbind(P, Q), method = "euclidean"),
  distance(rbind(P, Q), method = "euclidean", test.na = FALSE, mute.message = TRUE),
  euclidean(P, Q, FALSE),
  dist_one_one(P, Q, method = "euclidean", testNA = FALSE)
)
```

All of them return the same single value.
However, as you can see in the benchmark above, some are more flexible, and others are faster.
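
The agreement between the approaches can be checked directly. The sketch below recreates `P` and `Q`, computes the Euclidean distance by hand in base R, and compares it against each of the four calls; `as.numeric()` is used only to drop the class and names that `dist()` and `distance()` attach to their results.

```{r}
P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)

# Euclidean distance written out by hand in base R
d_manual <- sqrt(sum((P - Q)^2))

# The four approaches from the benchmark, reduced to plain numbers
d_dist <- as.numeric(dist(rbind(P, Q), method = "euclidean"))
d_distance <- as.numeric(philentropy::distance(rbind(P, Q), method = "euclidean",
                                               test.na = FALSE, mute.message = TRUE))
d_euclidean <- philentropy::euclidean(P, Q, FALSE)
d_one_one <- philentropy::dist_one_one(P, Q, method = "euclidean", testNA = FALSE)

# Each comparison should report TRUE (equality up to numerical tolerance)
all.equal(d_manual, d_dist)
all.equal(d_manual, d_distance)
all.equal(d_manual, d_euclidean)
all.equal(d_manual, d_one_one)
```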

## `dist_one_many()`

The role of `dist_one_many()` is to calculate distances between one probability density function (in the form of a `vector`) and a set of probability density functions (as rows in a `matrix`).

First, let's create our example data.

```{r}
set.seed(2020-08-20)
P <- 1:10 / sum(1:10)
M <- t(replicate(100, sample(1:10, size = 10) / 55))
```

`P` is our input vector and `M` is our input matrix.

Distances between the `P` vector and the probability density functions in `M` can be calculated using several approaches.
For example, we could write a `for` loop (which requires additional code) or just use the existing `distance()` function and extract only one row (or column) from the results.
`dist_one_many()` performs this calculation directly: it goes through each row in `M` and calculates a given distance measure between `P` and the values in that row.

```{r}
# install.packages("microbenchmark")
microbenchmark::microbenchmark(
  as.matrix(dist(rbind(P, M), method = "euclidean"))[1, ][-1],
  distance(rbind(P, M), method = "euclidean", test.na = FALSE, mute.message = TRUE)[1, ][-1],
  dist_one_many(P, M, method = "euclidean", testNA = FALSE)
)
```

`dist_one_many()` returns a vector of values.
It is, in this case, much faster than `distance()`, and visibly faster than `dist()`, while supporting many more distance measures.
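
The row-by-row behavior described above can be sketched as an explicit loop over the rows of `M` that calls `dist_one_one()` each time. This is only an illustration of the idea, not the package's internal implementation; the data are recreated here so the chunk stands on its own.

```{r}
set.seed(2020-08-20)
P <- 1:10 / sum(1:10)
M <- t(replicate(100, sample(1:10, size = 10) / 55))

# One dist_one_one() call per row of M, collected into a numeric vector
loop_result <- vapply(
  seq_len(nrow(M)),
  function(i) philentropy::dist_one_one(P, M[i, ], method = "euclidean", testNA = FALSE),
  numeric(1)
)

# The same calculation in a single call
direct_result <- philentropy::dist_one_many(P, M, method = "euclidean", testNA = FALSE)

# One distance per row of M, and both routes should agree
length(direct_result)
all.equal(loop_result, direct_result)
```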

## `dist_many_many()`

`dist_many_many()` calculates distances between two sets of probability density functions (as rows in two `matrix` objects).

Let's create two example matrices.

```{r}
set.seed(2020-08-20)
M1 <- t(replicate(10, sample(1:10, size = 10) / 55))
M2 <- t(replicate(10, sample(1:10, size = 10) / 55))
```

`M1` is our first input matrix and `M2` is our second input matrix.
I am not aware of any built-in R function that calculates distances between the rows of two matrices, and thus, to solve this problem, we can create our own, `many_dists()`...

```{r}
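many_dists <- function(m1, m2) {
  # result: one row per row of m1, one column per row of m2
  r <- matrix(nrow = nrow(m1), ncol = nrow(m2))
  for (i in seq_len(nrow(m1))) {
    for (j in seq_len(nrow(m2))) {
      # stack the two rows and compute their Euclidean distance
      x <- rbind(m1[i, ], m2[j, ])
      r[i, j] <- distance(x, method = "euclidean", mute.message = TRUE)
    }
  }
  r
}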
```

... and compare it to `dist_many_many()`.

```{r}
# install.packages("microbenchmark")
microbenchmark::microbenchmark(
  many_dists(M1, M2),
  dist_many_many(M1, M2, method = "euclidean", testNA = FALSE)
)
```

Both `many_dists()` and `dist_many_many()` return a matrix.
The benchmark above shows that `dist_many_many()` is about 30 times faster than our custom `many_dists()` approach, although exact timings depend on the machine.
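
As a sanity check on the layout of the result, the sketch below recreates the two matrices and compares one entry of the `dist_many_many()` output with the Euclidean distance computed by hand in base R. It assumes that entry `[i, j]` holds the distance between row `i` of the first matrix and row `j` of the second; the indices 2 and 3 are arbitrary.

```{r}
set.seed(2020-08-20)
M1 <- t(replicate(10, sample(1:10, size = 10) / 55))
M2 <- t(replicate(10, sample(1:10, size = 10) / 55))

res <- philentropy::dist_many_many(M1, M2, method = "euclidean", testNA = FALSE)

# One row per row of M1, one column per row of M2
dim(res)

# Hand-computed distance between row 2 of M1 and row 3 of M2
manual_2_3 <- sqrt(sum((M1[2, ] - M2[3, ])^2))
all.equal(res[2, 3], manual_2_3)
```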