---
title: "Using SAMBA"
author: "Created by Dr. Lauren J Beesley. Contact: lbeesley@umich.edu"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{UsingSAMBA}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

In this vignette, we provide a brief introduction to using the R package *SAMBA*
(selection and misclassification bias adjustment). The purpose of this package is
to provide resources for fitting a logistic regression for a binary outcome in
the presence of outcome misclassification (in particular, imperfect sensitivity)
along with potential selection bias. We will not include technical details about
this estimation approach and instead focus on implementation. For additional
details about the estimation algorithm, we refer the reader to ``Statistical
inference for association studies using electronic health records: handling both
selection bias and outcome misclassification'' by Lauren J Beesley and
Bhramar Mukherjee on medRvix.


## Model structure

Let binary $D$ represent a patient's true disease status for a disease of
interest, e.g. diabetes. Suppose we are interested in the relationship between
$D$ and and person-level information, $Z$. $Z$ may contain genetic information,
lab results, age, gender, or any other characteristics of interest. We call this
relationship the *disease mechanism* as in **Figure 1**.

Suppose we consider a large health care system-based database with the goal of
making inference about some defined general population. Let $S$ indicate whether
a particular subject in the population is sampled into our dataset (for example,
by going to a particular hospital and consenting to share biosamples), where
the probability of an individual being included in the current dataset may
depend on the underlying lifetime disease status, $D$, along with additional
covariates, $W$. $W$ may also contain some or all adjustment factors in $Z$.
In the EHR setting, we may often expect the sampled and non-sampled patients to
have different rates of the disease, and other factors such as patient age,
residence, access to care and general health state may also impact whether
patients are included in the study base or not. We will call this mechanism
the *selection mechanism.*

Instances of the disease are recorded in hospital or administrative records.
We might expect factors such as the patient age, the length of follow-up, and
the number of hospital visits to impact whether we actually *observe/record* the
disease of interest for a given person. Let $D^*$ be the *observed* disease
status. $D^*$ is a potentially misclassified version of $D$. We will call the
mechanism generating $D^*$ the *observation mechanism.* We will assume that
misclassification is primarily through underreporting of disease. In other
words, we will assume that $D^*$ has perfect specificity and potentially
imperfect sensitivity with respect to $D$. Let $X$ denote patient and
provider-level predictors related to the true positive rate (sensitivity).


```{r, echo=FALSE, fig.cap="Figure 1: Model Structure", out.width = '80%'}
knitr::include_graphics("images/ModelDiagram.png")
```

We express the conceptual model as follows:

$$\text{Disease Model}: \text{logit}(P(D=1 \vert Z; \theta )) = \theta_0 + \theta_Z Z $$
$$\text{Selection Model}:  P(S=1 \vert D, W; \phi )$$
$$\text{Sensitivity/Observation Model}: \text{logit}(P(D^*=1 \vert D=1, S=1, X; \beta)) = \beta_0 + \beta_X X $$

## Simulate data

We start our exploration by simulating some binary data subject to
misclassification of the outcome and selection bias. Variables related to
selection are D and W. Variables related to sensitivity are X. Variables of
interest are Z, which are related to W and X. In this simulation, W is also
independently related to D.

```{r, echo = TRUE, eval = TRUE,  fig.width = 7, fig.height = 4}

library(SAMBA)
library(MASS)
expit <- function(x) exp(x) / (1 + exp(x))
logit <- function(x) log(x / (1 - x))

nobs <- 5000

### Generate Predictors and Follow-up Information
set.seed(1234)
cov <- mvrnorm(n = nobs, mu = rep(0, 3), Sigma = rbind(c(1,   0, 0.4),
                                                       c(0,   1,   0),
                                                       c(0.4, 0,   1)))

data <- data.frame(Z = cov[, 1], X = cov[, 2], W = cov[, 3])

# Generate random uniforms
set.seed(5678)
U1 <- runif(nobs)
set.seed(4321)
U2 <- runif(nobs)
set.seed(8765)
U3 <- runif(nobs)

# Generate Disease Status
DISEASE <- expit(-2 + 0.5 * data$Z)
data$D   <- ifelse(DISEASE > U1, 1, 0)

# Relate W and D
data$W <- data$W + 1 * data$D

# Generate Misclassification
SENS <- expit(-0.4 + 1 * data$X)
SENS[data$D == 0] = 0
data$Dstar <- ifelse(SENS > U2, 1, 0)

# Generate Sampling Status
SELECT <- expit(-0.6 + 1 * data$D + 0.5 * data$W)
S  <- ifelse(SELECT > U3, T, F)

# Observed Data
data.samp <- data[S,]

# True marginal sampling ratio
prob1 <- expit(-0.6 + 1 * 1 + 0.5 * data$W)
prob0 <- expit(-0.6 + 1 * 0 + 0.5 * data$W)
r.marg.true <- mean(prob1[data$D == 1]) / mean(prob0[data$D == 0])

# True inverse probability of sampling weights
prob.WD <- expit(-0.6 + 1 * data.samp$D + 0.5 * data.samp$W)
weights <- nrow(data.samp) * (1  / prob.WD) / (sum(1 / prob.WD))

# True associations with D in population
trueX <- glm(D ~ X, binomial(), data = data)
trueZ <- glm(D ~ Z, binomial(), data = data)

# Initial Parameter Values
fitBeta  <- glm(Dstar ~ X, binomial(), data = data.samp)
fitTheta <- glm(Dstar ~ Z, binomial(), data = data.samp)
```

## Estimating sensitivity

In our paper, we develop several strategies
for estimating either marginal sensitivity or sensitivity as a function of covariates X. Here, we apply several of the proposed strategies.

```{r, results='hide', message=F}
# Using marginal sampling ratio r and P(D=1)
sens1 <- sensitivity(data.samp$Dstar, data.samp$X, mean(data$D),
                      r = r.marg.true)

# Using inverse probability of selection weights and P(D=1)
sens2 <- sensitivity(data.samp$Dstar, data.samp$X, prev = mean(data$D),
                     weights = weights)

# Using marginal sampling ratio r and P(D=1|X)
prev  <- predict(trueX, newdata = data.samp, type = 'response')
sens3 <- sensitivity(data.samp$Dstar, data.samp$X, prev, r = r.marg.true)

# Using inverse probability of selection weights and P(D=1|X)
prev  <- predict(trueX, newdata = data.samp, type = 'response')
sens4 <- sensitivity(data.samp$Dstar, data.samp$X, prev, weights = weights)
```
## Estimating log-odds ratio for D|Z

We propose several strategies for estimating the association between D and Z
from a logistic regression model in the presence of different biasing factors.
Below, we provide code for implementing these methods.

```{r, results='hide'}
# Approximation of D*|Z
approx1 <- approxdist(data.samp$Dstar, data.samp$Z, sens1$c_marg,
                      weights = weights)

# Non-logistic link function method
nonlog1 <- nonlogistic(data.samp$Dstar, data.samp$Z, c_X = sens3$c_X,
                       weights = weights)

# Direct observed data likelihood maximization without fixed intercept
start <- c(coef(fitTheta), logit(sens1$c_marg), coef(fitBeta)[2])
fit1 <- obsloglik(data.samp$Dstar, data.samp$Z, data.samp$X, start = start,
                 weights = weights)
obsloglik1 <- list(param = fit1$param, variance = diag(fit1$variance))

# Direct observed data likelihood maximization with fixed intercept
fit2   <- obsloglik(data.samp$Dstar, data.samp$Z, data.samp$X, start = start,
                 beta0_fixed = logit(sens1$c_marg), weights = weights)
obsloglik2 <- list(param = fit2$param, variance = diag(fit2$variance))

# Expectation-maximization algorithm without fixed intercept
fit3 <- obsloglikEM(data.samp$Dstar, data.samp$Z, data.samp$X, start = start,
                 weights = weights)
obsloglik3 <- list(param = fit3$param, variance = diag(fit3$variance))

# Expectation-maximization algorithm with fixed intercept
fit4 <- obsloglikEM(data.samp$Dstar, data.samp$Z, data.samp$X, start = start,
                  beta0_fixed = logit(sens1$c_marg), weights = weights)
obsloglik4 <- list(param = fit4$param, variance = diag(fit4$variance))
```
## Plotting sensitivity estimates

**Figure 2** shows the estimated individual-level sensitivity values when the
marginal sampling ratio (r-tilde) is correctly specified. We can see that there
is strong concordance with the true sensitivity values.
```{r, echo = FALSE, eval = TRUE,  fig.width = 5, fig.height= 5}
plot(sort(sens3$c_X), xlab = 'Patients', ylab = 'Sensitivity',
     main = 'Figure 2: Sensitivity Estimates', type = 'l', col = 'red', lwd = 2)
lines(sort(expit(obsloglik1$param[3] + obsloglik1$param[4]*data.samp$X)), col = 'blue', lwd = 2)
lines(sort(expit(obsloglik2$param[3] + obsloglik2$param[4]*data.samp$X)), col = 'green', lwd = 2)
abline(h=sens1$c_marg, col = 'purple', lwd = 2)
lines(sort(expit(-0.4 + 1*data.samp$X)), col = 'black', lwd = 2)
legend(x='topleft', fill = c('purple', 'red','blue', 'green', 'black'),
       legend = c('Estimated marginal sensitivity',
                  'Using non-logistic link method',
                  'Using obs. data log-lik',
                  'Using obs. data log-lik (fixed intercept)',
                  'Truth'), cex = 0.7)
```

**Figure 3** shows the estimated individual-level sensitivity values across
different marginal sampling ratio values. In reality, we will rarely know the
truth, and this strategy can help us obtain reasonable values for sensitivity
across plausible sampling ratio values.

```{r, echo = FALSE, eval = TRUE,  fig.width = 5, fig.height= 5,  results='hide', message=F}
rvals = c(1,1.5,2,2.5,5,10)
COL = c('red', 'orange', 'yellow', 'green', 'blue', 'purple')
true_prevs = predict(trueX, newdata = data.samp, type = 'response')
plot(sort(expit(-0.4 + 1*data.samp$X)), xlab = 'Patients', ylab = 'Sensitivity',
main = 'Figure 3: Estimated sensitivity across \n marginal sampling ratios',
type = 'l', col = 'black', lwd = 2, ylim = c(0,1))

for (i in 1:length(rvals)) {
  TEMP <- sensitivity(X = data.samp$X, Dstar = data.samp$Dstar,  r = rvals[i], prev = true_prevs)
  lines(sort(TEMP$c_X), col = COL[i])
}
legend(x='topleft', legend = c(rvals, 'Truth'), title = 'Sampling Ratio',
       fill = c(COL, 'black'), cex = 0.8)
```


## Evaluating observed data log-likelihood maximization

**Figure 4** shows the maximized log-likelihood values for fixed values of $\beta_0$ obtained using direct maximization of the observed data log-likelihood. The provide likelihood does an excellent job at recovering the true value of $\beta_0$.

**Figure 5** shows the log-likelihood values across iterations of the expectation-maximization algorithm estimation method when we do and do not fix $\beta_0$. We see faster and cleaner convergence to the MLE when we fix the intercept $\beta_0$.

```{r, echo = FALSE, eval = TRUE,  fig.width = 5, fig.height= 5}
plot(fit1$beta0_fixed, fit1$loglik.seq, xlab = 'Beta_0', ylab = 'Log-likelihood',
     main = 'Figure 4: Profile Log-Likelihood Values \n for Direct Maximization', pch = 16)
abline(v=-0.4, col = 'blue', lwd = 2)
points(logit(sens1$c_marg), fit2$loglik.seq, pch = 17, col = 'red')
legend(x='topright', legend = c('No fixed beta_0', 'Fixed beta_0'), col = c('black', 'red'), pch = c(16,17))
text(x=-0.1, y=mean(fit1$loglik.seq),label = 'True beta_0', srt = 90)

plot(fit3$loglik.seq[-c(1)], xlab = 'EM algorithm iteration', ylab = 'Log-likelihood',
     main = 'Figure 5: Log-Likelihood Values \n Across EM Iterations', pch = 16)
points(fit4$loglik.seq[-c(1)], pch = 17, col = 'red')
legend(x='bottomright', legend = c('No fixed beta_0', 'Fixed beta_0'), pch = c(16,17),
       col = c('black', 'red'))
```


## Plotting log-odds ratio estimates
```{r, echo = FALSE, eval = TRUE,  fig.width = 7, fig.height= 4, message=FALSE}
library(ggplot2)
library(scales)
```

**Figure 6** shows the estimated log-odds ratio relating $D$ and $Z$ for the
various analysis methods. Uncorrected (complete case with misclassified outcome)
analysis produces bias, and some methods reduce this bias. Recall, this is a
single simulated dataset, and the corrected estimators may not always equal
the truth for a given simulation. When $W$ and $D$ are independently associated,
the method using the approximated $D^* \vert Z$ relationship and marginal
sensitivity can sometimes perform poorly.

```{r, echo = FALSE, eval = TRUE,  fig.width = 7, fig.height= 4}
METHODS = c('True',  'Uncorrected', 'Approx D*|Z + IPW','Non-logistic Link + IPW','Obs. log-lik + IPW', 'Fixed intercept obs. log-lik + IPW')
PARAM = c( coef(trueZ)[2], coef(fitTheta)[2],approx1$param,  nonlog1$param[2], obsloglik1$param[2], obsloglik2$param[2] )
VARIANCE = c(diag(summary(trueZ)$cov.scaled)[2],diag(summary(fitTheta)$cov.scaled)[2],
             approx1$variance,nonlog1$variance[2],obsloglik1$variance[2], obsloglik2$variance[2])
pd = position_dodge(width=0.6)
a <- ggplot(data = data.frame(METHODS = METHODS, PARAM = PARAM, VARIANCE = VARIANCE),
       aes(xmin= METHODS, xmax = METHODS, ymin = PARAM - 1.96*sqrt(VARIANCE), ymax =  PARAM + 1.96*sqrt(VARIANCE),
           col = METHODS,x = METHODS, y = PARAM)) +
  geom_point(position = position_dodge(.7), size = 2) +
  geom_linerange(position = position_dodge(.7), size = 1.2) +
  xlab('') + ylab('logOR')+ggtitle('Figure 6: Estimated Log-Odds Ratio Across Methods')+
  scale_x_discrete(limits=METHODS)+
  geom_hline(yintercept = PARAM[1], linetype = 1, color = 'black')+
  guides(fill=guide_legend(nrow=2,byrow=TRUE))+
  theme(legend.position="top",panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), legend.text=element_text(size=8), legend.title = element_blank(),
        axis.text.x=element_text(angle=20,hjust=1,vjust=1), text = element_text(size=12))
print(a)
```

## Reference
Statistical inference for association studies using electronic health records:
handling both selection bias and outcome misclassification
Lauren J Beesley and Bhramar Mukherjee
medRxiv [2019.12.26.19015859](https://doi.org/10.1101/2019.12.26.19015859)