distr6 is a unified, self-contained and scalable interface to probability distributions in R. Making use of the R6 paradigm, distr6 implements a fully object-oriented (OO) interface complete with distribution construction, full inheritance and more complex design patterns. The API is built to be scalable and intuitive, ensuring that every distribution has the same interface and that more complex properties are abstracted from the core functionality. A full set of tutorials is available on the package website. In this introductory vignette we briefly demonstrate how to construct a distribution, view and edit its parameters, and evaluate different in-built methods. The website covers more complex use-cases, including composite distributions and decorators for numeric methods.
We think the best place to get started is to pick a probability distribution and work through constructing the distribution via different parameterisations and querying the distribution for different methods. Below is a running example with the Normal distribution.
All distributions are constructed using R6 dollar-sign notation. The default Normal distribution is the Standard Normal, parameterised with mean and var:
library(distr6)
Normal$new()
#> Norm(mean = 0, var = 1, sd = 1, prec = 1)
But we could also parameterise with standard deviation or precision. Note that the printed output lists all four parameters, with the related ones derived from whichever we supplied.
Normal$new(mean = 2, sd = 2)
#> Norm(mean = 2, var = 4, sd = 2, prec = 0.25)
Normal$new(mean = 3, prec = 0.5)
#> Norm(mean = 3, var = 2, sd = 1.4142135623731, prec = 0.5)
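The three scale parameters printed above are different views of one quantity: var = sd^2 and prec = 1/var. The following is a minimal base R sketch of that arithmetic, not distr6 code; the helper name scale_views is hypothetical and exists only for illustration.

```r
# Hypothetical helper: derive all three scale views from exactly one of them.
scale_views <- function(var = NULL, sd = NULL, prec = NULL) {
  supplied <- sum(!is.null(var), !is.null(sd), !is.null(prec))
  stopifnot(supplied == 1)
  if (!is.null(sd)) var <- sd^2        # variance is the squared scale
  if (!is.null(prec)) var <- 1 / prec  # precision is the inverse variance
  c(var = var, sd = sqrt(var), prec = 1 / var)
}

scale_views(sd = 2)      # var = 4, sd = 2, prec = 0.25
scale_views(prec = 0.5)  # var = 2, sd = sqrt(2), prec = 0.5
```

These values agree with the two printed distributions above.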
All parameters are also available via the parameters method, which additionally shows the support and a description of each parameter.
N <- Normal$new()
N$print()
#> Norm(mean = 0, var = 1, sd = 1, prec = 1)
N$parameters()
#>      id value support                                 description
#> 1: mean     0       ℝ                   Mean - Location Parameter
#> 2:  var     1      ℝ+          Variance - Squared Scale Parameter
#> 3:   sd     1      ℝ+        Standard Deviation - Scale Parameter
#> 4: prec     1      ℝ+ Precision - Inverse Squared Scale Parameter
Parameters are accessed with getParameterValue and edited with setParameterValue:
N$setParameterValue(prec = 2)
N$getParameterValue("prec")
#> [1] 2
Note how all related parameters are updated as well:
N$parameters()
#>      id     value support                                 description
#> 1: mean         0       ℝ                   Mean - Location Parameter
#> 2:  var       0.5      ℝ+          Variance - Squared Scale Parameter
#> 3:   sd 0.7071068      ℝ+        Standard Deviation - Scale Parameter
#> 4: prec         2      ℝ+ Precision - Inverse Squared Scale Parameter
To view every column of the parameter set, including whether each parameter is settable, add the following:
N$parameters()$print(hide_cols = NULL)
#>      id     value support settable                                 description
#> 1: mean         0       ℝ     TRUE                   Mean - Location Parameter
#> 2:  var       0.5      ℝ+     TRUE          Variance - Squared Scale Parameter
#> 3:   sd 0.7071068      ℝ+     TRUE        Standard Deviation - Scale Parameter
#> 4: prec         2      ℝ+     TRUE Precision - Inverse Squared Scale Parameter
The line above introduces 'method chaining', which occurs when one method is called directly on the result of another. As another example, let's edit and then access another parameter of the Normal distribution:
N$setParameterValue(var = 3)$getParameterValue("var")
#> [1] 3
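Chaining of this kind works when a setter returns the object it modified. The sketch below illustrates the idea with a toy object built on a base R environment; it is an assumption-laden illustration, not how distr6 or R6 are implemented, and the names make_toy, set and get are hypothetical.

```r
# Toy object with reference semantics; NOT the distr6 implementation.
make_toy <- function() {
  self <- new.env()
  self$values <- list(var = 1)
  # The setter returns the object itself (invisibly) so calls can be chained.
  self$set <- function(name, value) {
    self$values[[name]] <- value
    invisible(self)
  }
  self$get <- function(name) self$values[[name]]
  self
}

toy <- make_toy()
toy$set("var", 3)$get("var")  # set, then read back in one expression
```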
In keeping with R conventions, distributions have print and summary methods to view key details. We have already seen how the print method displays the distribution short_name and the parameterisation.
N$print()
#> Norm(mean = 0, var = 3, sd = 1.73205080756888, prec = 0.333333333333333)
The summary method also shows basic statistics and the distribution's properties and traits. Adding the argument full = FALSE produces a shorter output.
N$summary()
#> Normal Probability Distribution. Parameterised with:
#>   mean = 0, var = 3, sd = 1.73205080756888, prec = 0.333333333333333
#> 
#>   Quick Statistics 
#>  Mean:       0
#>  Variance:   3
#>  Skewness:   0
#>  Ex. Kurtosis:   0
#> 
#>  Support: ℝ    Scientific Type: ℝ 
#> 
#>  Traits: continuous; univariate
#>  Properties: symmetric; mesokurtic; no skew
N$summary(full = FALSE)
#> Norm(mean = 0, var = 3, sd = 1.73205080756888, prec = 0.333333333333333)
#> Scientific Type: ℝ      See $traits for more
#> Support: ℝ      See $properties for more
All distributions also have properties and traits. Traits describe a class, whereas properties describe an object. In simpler terms, a trait is present regardless of the distribution's parameterisation, whereas a property depends on the constructed parameters.
N$properties
#> $support
#> ℝ 
#> 
#> $symmetry
#> [1] "symmetric"
N$traits
#> $valueSupport
#> [1] "continuous"
#> 
#> $variateForm
#> [1] "univariate"
#> 
#> $type
#> ℝ
distr6 is intended not to replace the R stats distributions but to provide a different interface to them. All distributions in R stats can be found in distr6, and their d/p/q/r functions (density, cumulative distribution, quantile and random generation) are all available in distr6. Continuing our Normal distribution example, recall that N now has var = 3:
N$pdf(2) # dnorm(2, sd = sqrt(3))
#> [1] 0.1182551
N$cdf(2) # pnorm(2, sd = sqrt(3))
#> [1] 0.8758935
N$quantile(0.42) # qnorm(0.42, sd = sqrt(3))
#> [1] -0.3496898
N$rand(2) # rnorm(2, sd = sqrt(3))
#> [1]  2.3745697 -0.9780859
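Because N was last set to var = 3, the base R equivalents need the matching standard deviation, sqrt(3), rather than the Standard Normal defaults. The sketch below uses only base R and can be compared against the distr6 results above; the seed is an arbitrary choice, so the random draws will not match those shown.

```r
s <- sqrt(3)  # N currently has var = 3, so sd = sqrt(3)

dnorm(2, mean = 0, sd = s)     # density, compare with N$pdf(2)
pnorm(2, mean = 0, sd = s)     # cumulative probability, compare with N$cdf(2)
qnorm(0.42, mean = 0, sd = s)  # quantile, compare with N$quantile(0.42)

set.seed(1)                    # arbitrary seed; draws differ between runs otherwise
rnorm(2, mean = 0, sd = s)     # two random draws
```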
distr6 makes it easy to query these results: the distribution is constructed once, after which the specific parameterisation can be forgotten. For the Normal distribution this may not seem very different from R stats, but compare the calls when we construct a distribution with non-default parameters:
B <- Beta$new(shape1 = 0.582, shape2 = 1.2490)
B$pdf(2) # dbeta(2, 0.582, 1.2490)
#> [1] 0
B$cdf(2) # pbeta(2, 0.582, 1.2490)
#> [1] 1
B$quantile(0.42) # qbeta(0.42, 0.582, 1.2490)
#> [1] 0.1790523
B$rand(2) # rbeta(2, 0.582, 1.2490)
#> [1] 0.14613311 0.07374848
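The pdf of 0 and cdf of 1 at x = 2 follow from the Beta distribution's support being [0, 1]. A short base R check of this, and of the quantile and cdf being inverses of each other, might look as follows; the variable names are illustrative only.

```r
a <- 0.582   # shape1
b <- 1.2490  # shape2

dbeta(2, a, b)  # 0: x = 2 lies outside the support [0, 1]
pbeta(2, a, b)  # 1: all probability mass lies below 2

q <- qbeta(0.42, a, b)  # the 42% quantile
pbeta(q, a, b)          # recovers 0.42 up to numerical error
```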
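Finally, distr6 includes log/log.p and lower.tail arguments to be consistent with R stats. Note that N now has var = 3, so the comparison below against the Standard Normal call pnorm(3, ...) returns FALSE; the matching R stats parameterisation would pass sd = sqrt(3) to pnorm.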
N$cdf(3, lower.tail = FALSE, log.p = TRUE) == pnorm(3, lower.tail = FALSE, log.p = TRUE)
#> [1] FALSE
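For reference, the sketch below shows in base R what the lower.tail and log.p arguments compute and why they are useful far in the tail. It uses only stats functions; comparing floating-point results with all.equal rather than == is a general precaution, not something the distr6 documentation prescribes.

```r
x <- 3

# Log of the upper-tail probability P(X > x) for the Standard Normal.
upper_log <- pnorm(x, lower.tail = FALSE, log.p = TRUE)
all.equal(upper_log, log(1 - pnorm(x)))  # TRUE, up to numerical tolerance

# The R stats call with the same scale as N (var = 3).
pnorm(x, sd = sqrt(3), lower.tail = FALSE, log.p = TRUE)

# Far in the tail the naive formula breaks down, the arguments do not.
pnorm(40, lower.tail = FALSE, log.p = TRUE)  # finite
log(1 - pnorm(40))                           # -Inf, since pnorm(40) rounds to 1
```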
The final part of this tutorial looks at how to access mathematical and statistical results for probability distributions. This is another advantage of distr6 as it collects not only the results for the 17 distributions in R stats but also for all others implemented in distr6. Continuing with the Normal distribution:
N$mean()
#> [1] 0
N$variance()
#> [1] 3
N$entropy() # Note default is base 2
#> [1] 2.839577
N$mgf(2)
#> [1] 403.4288
N$cf(1)
#> [1] 0.2231302+0i
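These values can be checked against the standard closed-form results for a Normal distribution with mean mu and variance sigma2. The sketch below is plain R arithmetic written for this check; the function and variable names are illustrative and are not part of distr6.

```r
mu <- 0
sigma2 <- 3  # N currently has var = 3

# Differential entropy in bits (base 2): 0.5 * log2(2 * pi * e * sigma^2)
entropy_bits <- 0.5 * log2(2 * pi * exp(1) * sigma2)

# Moment generating function: exp(mu * t + sigma^2 * t^2 / 2)
mgf <- function(t) exp(mu * t + sigma2 * t^2 / 2)

# Characteristic function: exp(i * mu * t - sigma^2 * t^2 / 2)
cf <- function(t) exp(1i * mu * t - sigma2 * t^2 / 2)

entropy_bits  # about 2.839577
mgf(2)        # exp(6), about 403.4288
cf(1)         # exp(-1.5), about 0.2231302+0i
```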
For a full list of available methods, see the help documentation for any distribution.
Instead of having to remember every distribution in R, distr6 provides a way of listing them all and filtering by traits or package. We only show the first 6 rows of each listing to save space.
head(listDistributions())
#>    ShortName      ClassName Type ValueSupport VariateForm Package   Tags
#> 1:       Arc        Arcsine    ℝ   continuous  univariate       - limits
#> 2:      Bern      Bernoulli   ℕ0     discrete  univariate   stats       
#> 3:      Beta           Beta   ℝ+   continuous  univariate   stats       
#> 4:    BetaNC BetaNoncentral   ℝ+   continuous  univariate   stats       
#> 5:     Binom       Binomial   ℕ0     discrete  univariate   stats limits
#> 6:       Cat    Categorical    V     discrete  univariate       -
head(listDistributions(simplify = TRUE))
#> [1] "Arcsine"        "Bernoulli"      "Beta"           "BetaNoncentral"
#> [5] "Binomial"       "Categorical"
# Lists discrete distributions only
head(listDistributions(filter = list(valuesupport = "discrete")))
#>    ShortName       ClassName Type ValueSupport VariateForm    Package   Tags
#> 1:      Bern       Bernoulli   ℕ0     discrete  univariate      stats       
#> 2:     Binom        Binomial   ℕ0     discrete  univariate      stats limits
#> 3:       Cat     Categorical    V     discrete  univariate          -       
#> 4:     Degen      Degenerate    ℝ     discrete  univariate          - limits
#> 5:     DUnif DiscreteUniform    ℤ     discrete  univariate extraDistr limits
#> 6:       Emp       Empirical    ℝ     discrete  univariate          -
# Multiple filters can be used, note this is case-insensitive
head(listDistributions(filter = list(VaLueSupport = "continuous", package = "extraDistr")))
#>    ShortName    ClassName    Type ValueSupport  VariateForm    Package     Tags
#> 1:      Diri    Dirichlet [0,1]^K   continuous multivariate extraDistr         
#> 2:      Frec      Frechet       ℝ   continuous   univariate extraDistr locscale
#> 3:      Gumb       Gumbel      ℝ+   continuous   univariate extraDistr locscale
#> 4:  InvGamma InverseGamma      ℝ+   continuous   univariate extraDistr         
#> 5:       Lap      Laplace       ℝ   continuous   univariate extraDistr locscale
#> 6:      Pare       Pareto      ℝ+   continuous   univariate extraDistr
As a final point, distr6 allows methods to be called with either R6 or S3 syntax, which means that the magrittr package can also be used for 'piping'. Returning to the Normal distribution:
library(magrittr)
N$print()
#> Norm(mean = 0, var = 3, sd = 1.73205080756888, prec = 0.333333333333333)
print(N)
#> Norm(mean = 0, var = 3, sd = 1.73205080756888, prec = 0.333333333333333)
N %>% print()
#> Norm(mean = 0, var = 3, sd = 1.73205080756888, prec = 0.333333333333333)
N$pdf(2)
#> [1] 0.1182551
pdf(N, 2)
#> [1] 0.1182551
N %>% pdf(2)
#> [1] 0.1182551
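One way to picture how the same method can be reached through both syntaxes is an S3 generic that forwards to a function stored on the object. The sketch below is a base R illustration under that assumption and is not the distr6 implementation; toy_dist and density_at are hypothetical names.

```r
# Toy object holding its density as a function (hypothetical, not distr6 internals).
toy <- structure(
  list(pdf = function(x) dnorm(x, mean = 0, sd = sqrt(3))),
  class = "toy_dist"
)

# S3 generic plus a method that forwards to the object's own function.
density_at <- function(object, ...) UseMethod("density_at")
density_at.toy_dist <- function(object, x, ...) object$pdf(x)

toy$pdf(2)          # call through the object
density_at(toy, 2)  # same result via S3 dispatch
```

Because the S3 form takes the object as its first argument, it also fits a pipe, which passes the left-hand side as the first argument of the call on the right.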