---
title: "Estimating Distributional Models with brms"
author: "Paul Bürkner"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: yes
vignette: >
  %\VignetteIndexEntry{Estimating Distributional Models with brms}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
params:
  EVAL: !r identical(Sys.getenv("NOT_CRAN"), "true")
---

```{r, SETTINGS-knitr, include=FALSE}
stopifnot(require(knitr))
options(width = 90)
opts_chunk$set(
  comment = NA,
  message = FALSE,
  warning = FALSE,
  eval = if (isTRUE(exists("params"))) params$EVAL else FALSE,
  dev = "jpeg",
  dpi = 100,
  fig.asp = 0.8,
  fig.width = 5,
  out.width = "60%",
  fig.align = "center"
)
library(brms)
ggplot2::theme_set(theme_default())
```

## Introduction

This vignette provides an introduction to fitting distributional regression
models with **brms**. We use the term *distributional model* to refer to a
model in which we can specify predictor terms for all parameters of the assumed
response distribution. In the vast majority of regression model implementations,
only the location parameter (usually the mean) of the response distribution
depends on the predictors and corresponding regression parameters. Other
parameters (e.g., scale or shape parameters) are estimated as auxiliary
parameters that are assumed to be constant across observations. This assumption
is so common that most researchers applying regression models are (in my
experience) not even aware of the possibility of relaxing it. This is
understandable insofar as relaxing this assumption drastically increases model
complexity and thus makes models harder to fit. Fortunately, **brms** uses
**Stan** on the backend, which is an incredibly flexible and powerful tool for
estimating Bayesian models, so that model complexity is much less of an issue.

Suppose we have a normally distributed response variable. Then, in basic linear
regression, we specify a predictor term $\eta_{\mu}$ for the mean parameter
$\mu$ of the normal distribution. The second parameter of the normal
distribution -- the residual standard deviation $\sigma$ -- is assumed to be
constant across observations. We estimate $\sigma$ but do not try to *predict*
it. In a distributional model, however, we do exactly this by specifying a
predictor term $\eta_{\sigma}$ for $\sigma$ in addition to the predictor term
$\eta_{\mu}$. Ignoring group-level effects for the moment, the linear predictor
of a parameter $\theta$ for observation $n$ has the form

$$\eta_{\theta n} = \sum_{i = 1}^{K_{\theta}} b_{\theta i} x_{\theta i n}$$
where $x_{\theta i n}$ denotes the value of the $i$th predictor of parameter
$\theta$ for observation $n$ and $b_{\theta i}$ is the $i$th regression
coefficient of parameter $\theta$. A distributional normal model with response
variable $y$ can then be written as

$$y_n \sim \mathcal{N}\left(\eta_{\mu n}, \, \exp(\eta_{\sigma n}) \right)$$
We apply the exponential function to $\eta_{\sigma}$ because $\sigma$ is a
standard deviation and thus can only take on positive values, whereas a
linear predictor can be any real number.
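
To make this concrete, here is a small simulation from such a model (purely
illustrative; the predictor `x` and all coefficient values are made up), in
which both the mean and the standard deviation of the response depend on `x`:

```{r}
set.seed(1234)
x <- runif(100, -2, 2)
eta_mu <- 1 + 0.5 * x          # linear predictor of the mean
eta_sigma <- -0.5 + 0.8 * x    # linear predictor of log(sigma)
y <- rnorm(100, mean = eta_mu, sd = exp(eta_sigma))
```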

## A simple distributional model

Unequal variance models are arguably the simplest, but nevertheless a very
important, application of distributional models. Suppose we have two groups of
patients: one group receives a treatment (e.g., an antidepressive drug) and
the other group receives a placebo. Since the treatment may not work equally well
for all patients, the symptom variance of the treatment group may be larger than
the symptom variance of the placebo group after some weeks of treatment. For
simplicity, assume that we only investigate the post-treatment values.

```{r}
group <- rep(c("treat", "placebo"), each = 30)
symptom_post <- c(rnorm(30, mean = 1, sd = 2), rnorm(30, mean = 0, sd = 1))
dat1 <- data.frame(group, symptom_post)
head(dat1)
```
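
A quick descriptive check (plain base R) of the group-wise means and standard
deviations in the simulated data:

```{r}
aggregate(symptom_post ~ group, data = dat1,
          FUN = function(x) round(c(mean = mean(x), sd = sd(x)), 2))
```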

The following model estimates the effect of `group` on both the mean and the
residual standard deviation of the normal response distribution.

```{r, results='hide'}
fit1 <- brm(bf(symptom_post ~ group, sigma ~ group),
            data = dat1, family = gaussian())
```
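
Before looking at the results, it can be instructive to check which parameter
classes this distributional formula creates and which priors **brms** assigns
to them by default; `get_prior` does this without fitting the model:

```{r}
get_prior(bf(symptom_post ~ group, sigma ~ group),
          data = dat1, family = gaussian())
```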

Useful summary statistics and plots can be obtained via

```{r, results='hide'}
summary(fit1)
plot(fit1, N = 2, ask = FALSE)
plot(conditional_effects(fit1), points = TRUE)
```

The population-level effect `sigma_grouptreat`, which is the contrast of the two
residual standard deviations on the log-scale, reveals that the variances of
both groups are indeed different. This impression is confirmed when looking at
the `conditional_effects` of `group`. Going one step further, we can compute the
residual standard deviations on the original scale using the `hypothesis`
method.

```{r}
hyp <- c("exp(sigma_Intercept) = 0",
         "exp(sigma_Intercept + sigma_grouptreat) = 0")
hypothesis(fit1, hyp)
```
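
Equivalently, the same quantities can be computed directly from the posterior
draws. A small sketch, assuming the usual **brms** parameter names
(`b_sigma_Intercept` and `b_sigma_grouptreat`; check `variables(fit1)` if in
doubt):

```{r}
draws <- as_draws_df(fit1)
sigma_placebo <- exp(draws$b_sigma_Intercept)
sigma_treat   <- exp(draws$b_sigma_Intercept + draws$b_sigma_grouptreat)
posterior_summary(cbind(sigma_placebo, sigma_treat))
```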

We may also directly compare them and plot the posterior distribution of their
difference.

```{r}
hyp <- "exp(sigma_Intercept + sigma_grouptreat) > exp(sigma_Intercept)"
(hyp <- hypothesis(fit1, hyp))
plot(hyp, chars = NULL)
```

Indeed, the residual standard deviation of the treatment group seems to be larger
than that of the placebo group. Moreover, the magnitude of this difference is
close to what we would expect given the values we used when simulating the
data.
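
As an additional sanity check, we can compare data simulated from the fitted
model to the observed data via a posterior predictive check (`ndraws` limits
the number of posterior draws used for the overlay):

```{r}
pp_check(fit1, ndraws = 100)
```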

## Zero-Inflated Models

Another important application of the distributional regression framework is
so-called zero-inflated models. These models are helpful whenever there are more
zeros in the response variable than one would naturally expect. For example, if
one seeks to predict the number of cigarettes people smoke per day and also
includes non-smokers, there will be a huge number of zeros which, when not
modeled appropriately, can seriously distort parameter estimates. Here, we
consider an example dealing with the number of fish caught by various groups of
people. On the UCLA website
(https://stats.idre.ucla.edu/stata/dae/zero-inflated-poisson-regression),
the data are described as follows: "The state wildlife biologists want to model
how many fish are being caught by fishermen at a state park. Visitors are asked
how long they stayed, how many people were in the group, were there children in
the group and how many fish were caught. Some visitors do not fish, but there is
no data on whether a person fished or not. Some visitors who did fish did not
catch any fish so there are excess zeros in the data because of the people that
did not fish."

```{r}
zinb <- read.csv("https://paul-buerkner.github.io/data/fish.csv")
head(zinb)
```

As predictors we choose the number of people per group, the number of children,
as well as whether the group consists of campers. Many groups may not even try
catching any fish at all (thus leading to many zero responses) and so we fit a
zero-inflated Poisson model to the data. For now, we assume a constant
zero-inflation probability across observations.

```{r, results='hide'}
fit_zinb1 <- brm(count ~ persons + child + camper,
                 data = zinb, family = zero_inflated_poisson())
```

Again, we summarize the results using the usual methods.

```{r}
summary(fit_zinb1)
plot(conditional_effects(fit_zinb1), ask = FALSE)
```

According to the parameter estimates, larger groups catch more fish, campers
catch more fish than non-campers, and groups with more children catch fewer fish.
The zero-inflation probability `zi` is pretty large with a mean of 41%. Please
note that the probability of catching no fish is actually higher than 41%, but
parts of this probability are already modeled by the Poisson distribution itself
(hence the name zero-*inflation*). If you want to treat all zeros as originating
from a separate process, you can use hurdle models instead (see the sketch below).
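
A minimal sketch of such a hurdle model (not evaluated here); `hurdle_poisson()`
is the corresponding family in **brms** and `hu` is the name of the hurdle
probability parameter, which could be predicted in the same way as `zi` below.

```{r, eval=FALSE}
fit_hurdle1 <- brm(count ~ persons + child + camper,
                   data = zinb, family = hurdle_poisson())
```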

Now, we try to additionally predict the zero-inflation probability by the number
of children. The underlying reasoning is that we expect groups with more
children to not even try catching fish. Most children are just terribly bad at
waiting for hours until something happens. From a purely statistical
perspective, zero-inflated (and hurdle) distributions are a mixture of two
processes, and predicting both parts of the model is natural and often very
reasonable in order to make full use of the data.

```{r, results='hide'}
fit_zinb2 <- brm(bf(count ~ persons + child + camper, zi ~ child),
                 data = zinb, family = zero_inflated_poisson())
```

```{r}
summary(fit_zinb2)
plot(conditional_effects(fit_zinb2), ask = FALSE)
```

To relate the linear predictor of `zi` to the zero-inflation probability,
**brms** applies the logit link:

$$\text{logit}(zi) = \log\left(\frac{zi}{1-zi}\right) = \eta_{zi}$$

The logit link maps probabilities in $(0, 1)$ to the whole real line and thus
allows the transition between probabilities and linear predictors.
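
In R, `qlogis()` and `plogis()` implement the logit and its inverse, so we can
move between the two scales by hand (using the roughly 41% from above as an
example value):

```{r}
qlogis(0.41)          # probability -> linear predictor (logit) scale
plogis(qlogis(0.41))  # and back to the probability scale
```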

According to the model, trying to fish with children not only decreases the
overall number of fish caught (as implied by the Poisson part of the model) but
also drastically increases your chance of catching no fish at all (as implied by
the zero-inflation part), most likely because groups with more children are not
even trying.
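
To look at the zero-inflation part in isolation, `conditional_effects` can plot
a single distributional parameter via its `dpar` argument, here the effect of
`child` on `zi`:

```{r}
plot(conditional_effects(fit_zinb2, effects = "child", dpar = "zi"), ask = FALSE)
```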

## Additive Distributional Models

In the examples so far, we did not have multilevel data and thus did not fully
use the capabilities of the distributional regression framework of **brms**. In
the example presented below, we will not only show how to deal with multilevel
data in distributional models, but also how to incorporate smooth terms (i.e.,
splines) into the model. In many applications, we have no or only a very vague
idea of what the relationship between a predictor and the response looks like. A
very flexible approach to tackle this problem is to use splines and let them
figure out the form of the relationship themselves. For illustration purposes, we simulate
some data with the **mgcv** package, which is also used in **brms** to prepare
smooth terms.

```{r}
dat_smooth <- mgcv::gamSim(eg = 6, n = 200, scale = 2, verbose = FALSE)
head(dat_smooth[, 1:6])
```

The data contain the predictors `x0` to `x3` as well as the grouping factor
`fac` indicating the nested structure of the data. We predict the response
variable `y` using smooth terms of `x1` and `x2` and a varying intercept over
`fac`. In addition, we allow the residual standard deviation `sigma` to vary by
a smooth term of `x0` and a varying intercept over `fac`.

```{r, results='hide'}
fit_smooth1 <- brm(
  bf(y ~ s(x1) + s(x2) + (1|fac), sigma ~ s(x0) + (1|fac)),
  data = dat_smooth, family = gaussian(),
  chains = 2, control = list(adapt_delta = 0.95)
)
```

```{r}
summary(fit_smooth1)
plot(conditional_effects(fit_smooth1), points = TRUE, ask = FALSE)
```
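
The estimated spline parts of the model, including the smooth of `x0` entering
the predictor of `sigma`, can also be inspected separately via
`conditional_smooths`:

```{r}
plot(conditional_smooths(fit_smooth1), ask = FALSE)
```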

This model is likely overkill for the data at hand, but it nicely demonstrates
the ease with which one can specify complex models with **brms** and fit them
using **Stan** on the backend.