---
title: "Distances"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Distances}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

## How to use `distance()`

The `distance()` function implemented in `philentropy` can compute 46 different distance/similarity measures between probability density functions (see `?philentropy::distance` for details).

### Simple Example

The `distance()` function follows the same _logic_ as R's base function `stats::dist()` and takes a `matrix` or `data.frame` object as input. This `matrix` or `data.frame` should store the probability density functions (as rows) between which distances shall be computed.


```r
# define a probability density function P
P <- 1:10/sum(1:10)
# define a probability density function Q
Q <- 20:29/sum(20:29)

# combine P and Q as matrix object
x <- rbind(P,Q)
```

Please note that when defining a `matrix` from vectors, probability vectors should be combined as rows (`rbind()`).
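Since `distance()` expects each row of the input to be a probability vector, a quick sanity check is that all rows sum to 1:

```r
# sanity check: each row of x should sum to 1
rowSums(x)
```

```
P Q 
1 1 
```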

```r
library(philentropy)

# compute the Euclidean Distance with default parameters
distance(x, method = "euclidean")
```

```
euclidean 
0.1280713
```

For this simple case, you can compare the result with R's base function `stats::dist()`, which also computes the Euclidean distance.

```r
# compute the Euclidean Distance using R's base function
stats::dist(x, method = "euclidean")
```

```
          P
Q 0.1280713
```
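Both functions agree with the textbook formula, which you can also verify by hand:

```r
# the Euclidean distance is the square root of the summed squared differences
sqrt(sum((P - Q)^2))
```

```
[1] 0.1280713
```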

However, the base R function `stats::dist()` computes only six distance measures: `"euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski"`. In contrast, `distance()` lets you choose from 46 distance/similarity measures, and calling the native (low-level) distance functions underlying `distance()` can speed up computations by a factor of 3-5.

To find out which `method`s are implemented in `distance()` you can consult
the `getDistMethods()` function.

```r
# names of implemented distance/similarity functions
getDistMethods()
```

```
 [1] "euclidean"         "manhattan"         "minkowski"         "chebyshev"        
 [5] "sorensen"          "gower"             "soergel"           "kulczynski_d"     
 [9] "canberra"          "lorentzian"        "intersection"      "non-intersection" 
[13] "wavehedges"        "czekanowski"       "motyka"            "kulczynski_s"     
[17] "tanimoto"          "ruzicka"           "inner_product"     "harmonic_mean"    
[21] "cosine"            "hassebrook"        "jaccard"           "dice"             
[25] "fidelity"          "bhattacharyya"     "hellinger"         "matusita"         
[29] "squared_chord"     "squared_euclidean" "pearson"           "neyman"           
[33] "squared_chi"       "prob_symm"         "divergence"        "clark"            
[37] "additive_symm"     "kullback-leibler"  "jeffreys"          "k_divergence"     
[41] "topsoe"            "jensen-shannon"    "jensen_difference" "taneja"           
[45] "kumar-johnson"     "avg"
```
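Since `distance()` accepts the method name as a character string, you can, for example, screen several measures in one call with `sapply()` (a minimal sketch):

```r
# compute a few different measures for the same pair of probability vectors
sapply(c("euclidean", "manhattan", "jaccard"),
       function(m) distance(x, method = m))
```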

Now you can choose any distance/similarity `method` that suits your analysis.

```r
# compute the Jaccard Distance with default parameters
distance(x, method = "jaccard")
```

```
 jaccard 
0.133869
```
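If you want to see where this value comes from, you can reproduce it by hand using the Jaccard distance definition from Cha (2007), the survey on which `philentropy`'s measures are based:

```r
# manual computation of the Jaccard distance (definition from Cha 2007)
1 - sum(P * Q) / (sum(P^2) + sum(Q^2) - sum(P * Q))
```

```
[1] 0.133869
```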

Analogously, when a probability matrix is supplied, `distance()` computes all pairwise comparisons and generates the following output.

```r
# combine three probability vectors into a probability matrix 
ProbMatrix <- rbind(1:10/sum(1:10), 20:29/sum(20:29), 30:39/sum(30:39))
rownames(ProbMatrix) <- paste0("Example", 1:3)

# compute the euclidean distance between all 
# pairwise comparisons of probability vectors
distance(ProbMatrix, method = "euclidean")
```

```
#> Metric: 'euclidean'; comparing: 3 vectors.
          v1         v2         v3
v1 0.0000000 0.12807130 0.13881717
v2 0.1280713 0.00000000 0.01074588
v3 0.1388172 0.01074588 0.00000000
```

Alternatively, users can specify the argument `use.row.names = TRUE` to retain the row names of the input matrix and pass them on as row and column names of the output distance matrix.

```r
# compute the euclidean distance between all 
# pairwise comparisons of probability vectors
distance(ProbMatrix, method = "euclidean", use.row.names = TRUE)
```

```
#> Metric: 'euclidean'; comparing: 3 vectors.
          Example1   Example2   Example3
Example1 0.0000000 0.12807130 0.13881717
Example2 0.1280713 0.00000000 0.01074588
Example3 0.1388172 0.01074588 0.00000000
```
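Because the returned object is a plain symmetric `matrix` with dimension names, individual pairwise comparisons can be looked up by name:

```r
d <- distance(ProbMatrix, method = "euclidean", use.row.names = TRUE)
# look up a single pairwise distance by row/column name
d["Example2", "Example3"]
```

```
[1] 0.01074588
```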

Note that the output of `distance()` differs from that of `stats::dist()`.

```r
# compute the euclidean distance between all 
# pairwise comparisons of probability vectors
# using stats::dist()
stats::dist(ProbMatrix, method = "euclidean")
```

```
           1          2
2 0.12807130           
3 0.13881717 0.01074588
```

Whereas `distance()` returns the full symmetric distance matrix, `stats::dist()` returns only the lower triangle of that matrix. 
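If you only need the unique pairwise values from the symmetric matrix, base R's `lower.tri()` extracts them (reusing the distance matrix `d` from the sketch above):

```r
# keep only the unique pairwise distances below the diagonal
d[lower.tri(d)]
```

```
[1] 0.12807130 0.13881717 0.01074588
```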

However, users can also specify the argument `as.dist.obj = TRUE` in `philentropy::distance()`
to retrieve the output as an object of class `dist`, i.e. in the same format that `stats::dist()` returns.

```r
ProbMatrix <- rbind(1:10/sum(1:10), 20:29/sum(20:29),30:39/sum(30:39))
rownames(ProbMatrix) <- paste0("test", 1:3)
distance(ProbMatrix, method = "euclidean", use.row.names = TRUE, as.dist.obj = TRUE)
```

```
Metric: 'euclidean'; comparing: 3 vectors.
           test1      test2
test2 0.12807130           
test3 0.13881717 0.01074588
```
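A `dist` object plugs directly into downstream functions such as `stats::hclust()`, and can be converted back to a full symmetric matrix if needed, for example:

```r
d_dist <- distance(ProbMatrix, method = "euclidean",
                   use.row.names = TRUE, as.dist.obj = TRUE)
# cluster the probability vectors based on their pairwise distances
hclust(d_dist)
# recover the full symmetric matrix
as.matrix(d_dist)
```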

Now let's compare the run times of base R and `philentropy`. For this purpose you need to install the `microbenchmark` package.

> Note: Please make sure to pass vector objects (in our example `P`, `Q`) when directly calling the low-level functions such as `euclidean()`. Otherwise, computational overhead is introduced that
[significantly slows down computations when using large vectors](https://github.com/drostlab/philentropy/issues/30).
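For instance, if your probability vectors are stored as rows of a matrix, extract them as plain numeric vectors first (a minimal sketch):

```r
# pass plain numeric vectors, not matrix or data.frame rows
P_vec <- as.numeric(x[1, ])
Q_vec <- as.numeric(x[2, ])
euclidean(P_vec, Q_vec, FALSE)
```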

```r
# install.packages("microbenchmark")
library(microbenchmark)

microbenchmark(
  distance(x, method = "euclidean", test.na = FALSE),
  dist(x, method = "euclidean"),
  euclidean(P, Q, FALSE)
)
```

```
Unit: microseconds
                                               expr    min      lq     mean  median      uq    max neval
 distance(x, method = "euclidean", test.na = FALSE) 26.518 28.3495 29.73174 29.2210 30.1025 62.096   100
                      dist(x, method = "euclidean") 11.073 12.9375 14.65223 14.3340 15.1710 65.130   100
                             euclidean(P, Q, FALSE)  4.329  4.9605  5.72378  5.4815  6.1240 22.510   100
```

As you can see, although the `distance()` function is quite fast, its internal checks make it about 2x slower than the base `dist()` function (for the `euclidean` example). Nevertheless, if you need a faster version of a particular distance measure, you can type `philentropy::` followed by `TAB` to select one of the low-level distance computation functions (written in C++), e.g. `philentropy::euclidean()`, which is almost 3x faster than the base `dist()` function.

The advantage of `distance()` is that it implements 46 distance measures based on low-level C++ functions that can also be accessed individually by typing `philentropy::` and then `TAB`. In future versions of `philentropy` I will optimize the `distance()` function so that the internal checks for data type correctness and valid input data take up less computation time than in the base `dist()` function.

<!--
## Detailed assessment of individual similarity and distance metrics


The vast amount of available similarity metrics raises the immediate question which metric should be used for which application. Here, I will review the origin of each individual metric and will discuss the most recent literature that aims to compare these measures. I hope that users will find valuable insights and might be stimulated to conduct their own comparative research since this is a field of ongoing research. 

### $L_p$ Minkowski Family

#### Euclidean distance

The [euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) (named after [Euclid](https://en.wikipedia.org/wiki/Euclid)) is the straight-line distance between two points.
Euclid argued that the __shortest__ path between two points is always a straight line.

> $d = \sqrt{\sum_{i = 1}^N | P_i - Q_i |^2}$

#### Manhattan distance

> $d = \sum_{i = 1}^N | P_i - Q_i |$

#### Minkowski distance

> $d = ( \sum_{i = 1}^N | P_i - Q_i |^p)^{1/p}$

#### Chebyshev distance

> $d = \max_i | P_i - Q_i |$
-->