Health-related quality of life is a key outcome in health technology assessments because it is patient-relevant and it is needed to calculate quality-adjusted life years. Quality of life instruments typically measure health problems in multiple domains using ordinal Likert scales. Value sets or valuation functions convert these profiles of ordinal measures into cardinal health state utilities between 1 (perfect health) and minus infinity, where 0 represents death and negative values represent health states worse than death. Because 100% quality of life represents perfect health, health state utilities are bounded above at 1. The lowest possible utility in a local value set further defines a lower limit of health state utilities in a local population. Thus, health state utilities are limited dependent variables. In addition, health state utilities often show gaps between 1 and the next smaller utility in the value set. These gaps occur more frequently in quality of life instruments with few levels in the Likert scales, such as the EQ-5D-3L (Mulhern et al. 2018). A final important particularity of health state utilities is that they can be generated by multiple latent classes and can exhibit multi-modal marginal densities (Hernández Alava et al. 2014).
Adjusted limited dependent variable mixture models are finite mixtures of normal distributions that account for limits, gaps between 1 and the next smaller utility value, and multi-modality (Hernández Alava, Wailoo, and Ara 2012; Hernández Alava et al. 2013, 2014; Hernández Alava and Wailoo 2015; Mukuria et al. 2019). These features can improve empirical fit, parameter identification and predictive accuracy compared to standard regression models. Thus, adjusted limited dependent variable mixture models are particularly useful for mapping studies (Gray, Wailoo, and Hernández Alava 2018; Gray, Hernández Alava, and Wailoo 2018; Dixon, Hollingworth, and Sparrow 2020; Yang et al. 2019; Xu et al. 2020; Fuller et al. 2017; Pennington et al. 2020).
The R ‘aldvmm’ package is an implementation of the adjusted limited dependent variable mixture model proposed by Hernández Alava and Wailoo (2015) using normal component distributions and a multinomial logit model of probabilities of component membership.
The objectives of this vignette are to demonstrate the usage of the ‘aldvmm’ package, show important challenges of fitting adjusted limited dependent variable mixture models and validate the R implementation against the STATA package (Hernández Alava and Wailoo 2015) using publicly available data.
Adjusted limited dependent variable mixture models are finite mixtures of normal distributions in \(C\) components \(c\) with conditional expectations \(E[y|X, c] = X\beta^{c}\) and standard deviations \(\sigma^{c}\). Probabilities of component membership are estimated using a multinomial logit model \(P[c|X]=\exp(X\delta^{c})/\sum_{k=1}^{C}\exp(X\delta^{k})\). The model accumulates the density mass of outcomes per component \(y_{i}|c\) below a minimum value \(\Psi_1\) at the value \(\Psi_1\), and the density mass above a maximum value \(\Psi_{2}\) at 1. If the maximum value \(\Psi_2\) is smaller than 1, the model emulates a value set with a gap between 1 and the next smaller value.
\[\begin{equation} \label{eq:limits} \begin{array}{ll} y_{i}|c =& \begin{cases} \begin{array}{ll} 1 & \text{if } y_{i}|c > \Psi_{2}\\ \Psi_{1} & \text{if } y_{i}|c \leq \Psi_{1}\\ y_{i}|c & \text{if } \Psi_{1} < y_{i}|c \leq \Psi_{2}\\ \end{array} \end{cases} \end{array} \end{equation}\]
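For intuition, this censoring rule can be expressed as a small R helper; the function name censor and its arguments are illustrative, not part of the package:

censor <- function(y, psi1, psi2) {
  # y: latent component outcome; psi1 < psi2 <= 1 are the value set limits
  ifelse(y > psi2, 1,
         ifelse(y <= psi1, psi1, y))
}
censor(c(-0.7, 0.5, 0.95), psi1 = -0.594, psi2 = 0.883)
# returns -0.594 0.500 1.000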
In this vignette, we estimate the same models of post-operative EQ-5D-3L utilities as Hernández Alava and Wailoo (2015) and include post-operative Oxford Hip Scores (divided by 10) as the only explanatory variable \(x\).
\[\begin{equation} \label{eq:models} \begin{array}{lrl} \text{Model 1:}& E[y|c, X] &= \beta_{0}^{c} + \beta_{1}^{c}x\\ & P[c|X] &= \text{mlogit}(\delta_{0}^{c})\\ &&\\ \text{Model 2:}& E[y|c, X] &= \beta_{0}^{c} + \beta_{1}^{c}x\\ & P[c|X] &= \text{mlogit}(\delta_{0}^{c} + \delta_{1}^{c}x) \end{array} \end{equation}\]
The aldvmm() function fits an adjusted limited dependent variable mixture model using the likelihood function from Hernández Alava and Wailoo (2015) (appendix). The function calls optimx::optimr() to minimize the negative log-likelihood using analytical gradients from aldvmm.gr(). The aldvmm() function accepts all optimization methods available in optimx::optimr() except for “nlm”, which requires a different implementation of the likelihood function. The default optimization method is “BFGS”.
The model formula supplied to aldvmm() is an object of class “formula” with two parts on the right-hand side of ~. The first part on the left of the | delimiter represents the model of expected values of \(C\) normal distributions. The second part on the right of the | delimiter represents the model of probabilities of component membership from a multinomial logit model.
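As an illustration, the two models estimated in this vignette (see the previous section) translate into the following formulas; eq5d and hr are the variable names used below:

# Model 1: hr enters the component means; membership probabilities are constant
f1 <- eq5d ~ hr | 1
# Model 2: hr also enters the multinomial logit of component membership
f2 <- eq5d ~ hr | hr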
The ‘aldvmm’ package provides four options for the generation of starting values of the optimization algorithm.
“zero”: A vector of zeroes (default).
“random”: A vector of standard normal random values.
“constant”: Parameter estimates of a constant-only model as starting values for intercepts and standard deviations, and zeroes for all other parameters. The auxiliary models for obtaining starting values are fitted using zero starting values.
“sann”: Parameter estimates of a simulated annealing algorithm.
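As a sketch, the method is selected via the ‘init.method’ argument; the call below assumes the data and model introduced in the following sections:

fit_const <- aldvmm::aldvmm(eq5d ~ hr | 1,
                            data = df,
                            psi = c(0.883, -0.594),
                            ncmp = 2,
                            init.method = "constant")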
The ‘aldvmm’ package obtains fitted values using the expected value function from Hernández Alava and Wailoo (2015). Covariance matrices and standard errors of parameters are obtained from a numerical approximation of the Hessian matrix using numDeriv::hessian(). Standard errors of fitted values in the estimation data \(SE^{fit}_{i}\) and standard errors of predicted values in new data \(SE^{pred}_{i}\) are calculated using the delta method (Dowd, Greene, and Norton 2014; Whitmore 1986). \(G_{i}\) denotes the gradient of a fitted value with respect to changes in parameter estimates, \(\Sigma\) denotes the covariance matrix of parameters, and \(MSE\) denotes the mean squared error of fitted versus observed values in the estimation data.
\[\begin{equation}
\begin{array}{rl}
SE^{fit}_{i} &= \sqrt{G_{i}'\Sigma G_{i}}
\end{array}
\end{equation}\]
\[\begin{equation}
\begin{array}{rl}
SE^{pred}_{i} &= \sqrt{MSE + G_{i}'\Sigma G_{i}}
\end{array}
\end{equation}\]
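Both formulas translate directly into R; G, Sigma and mse are placeholder names for the gradient vector of a fitted value, the parameter covariance matrix and the in-sample mean squared error:

# Delta-method standard errors for a single observation i
se_fit <- function(G, Sigma) sqrt(t(G) %*% Sigma %*% G)
se_pred <- function(G, Sigma, mse) sqrt(mse + t(G) %*% Sigma %*% G)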
The aldvmm() function returns an object of S3 class “aldvmm” for which methods for generic functions print(), summary(), stats::predict(), stats::coef(), stats::nobs(), stats::vcov(), stats::model.matrix(), stats::formula(), stats::residuals(), stats::update() and sandwich::estfun() are available. Objects of class “aldvmm” can be supplied to sandwich::sandwich(), sandwich::vcovCL(), sandwich::vcovPL(), sandwich::vcovHAC() and sandwich::vcovBS() to estimate robust or clustered standard errors (see example in the appendix). sandwich::estfun() calls aldvmm.sc() to return analytical gradients of the log-likelihood with respect to parameters for each observation (see gradient function in appendix), which allows fast computation of robust standard errors. sandwich::vcovBS() allows re-estimating the covariance matrix using bootstrapping with and without clustering, which can be particularly useful in cases with no valid covariance matrix from model fitting.
The latest version of the ‘aldvmm’ package can be installed from CRAN.
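In a fresh R session, the released version can be installed with:

install.packages("aldvmm")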
We analyze the same publicly available EQ-5D-3L utility data from English patients after hip replacement in 2011 and 2012 (NHS Digital 2013) as Hernández Alava and Wailoo (2015) in their description of the STATA ALDVMM package.
temp <- tempfile()
url <- paste0("https://files.digital.nhs.uk/publicationimport/pub11xxx/",
"pub11359/final-proms-eng-apr11-mar12-data-pack-csv.zip")
download.file(url, temp)
rm(url)
df <- read.table(unz(description = temp,
filename = "Hip Replacement 1112.csv"),
sep = ",",
header = TRUE)
unlink(temp)
rm(temp)
df <- df[, c("AGEBAND", "SEX", "Q2_EQ5D_INDEX", "HR_Q2_SCORE")]
df <- df[df$AGEBAND != "*" & df$SEX != "*", ]
df$eq5d <- df$Q2_EQ5D_INDEX
df$hr <- df$HR_Q2_SCORE/10
df <- df[stats::complete.cases(df), ]
set.seed(101010101)
df <- df[sample(1:nrow(df), size = nrow(df)*0.3), ]

The data includes 35’166 observations with complete information on patients’ post-operative utilities, Oxford Hip Scores, age and sex. Like Hernández Alava and Wailoo (2015), we draw a 30% sub-sample of 10’549 observations from the population of complete observations of these variables. Although we follow a similar approach to data preparation as Hernández Alava and Wailoo (2015), our random sample is not identical to the data used in their study. Post-operative EQ-5D-3L utilities from English value sets (Dolan 1997) show a bimodal distribution, limits at -0.594 and 1 and a gap between 1 and 0.883 (figure 4.1).
Figure 4.1: Frequency distribution of observed EQ-5D-3L utilities
We first fit model 1 with the default “BFGS” optimization method and “zero” initial values. The values 0.883 and -0.594 in the argument ‘psi’ represent the maximum utility value smaller than 1 and the minimum utility value in the English value set (Dolan 1997). As the data shows a bimodal distribution (figure 4.1), we estimate a mixture of 2 normal distributions (‘ncmp’ = 2). aldvmm() returns an object of class “aldvmm”.
library("aldvmm")
fit <- aldvmm::aldvmm(eq5d ~ hr | 1,
data = df,
psi = c(0.883, -0.594),
ncmp = 2)
summary(fit)
pred <- predict(fit,
se.fit = TRUE,
type = "fit")We obtain a summary table of regression results using the generic function summary(). The model converges at a log-likelihood of 706 and an Akaike information criterion value of 1’399 (table 5.1). Larger values of the log-likelihood (ll), the Akaike information criterion (AIC), and the Bayesian information criterion (BIC) suggest better fit. The summary table returned by the generic function summary() reports negative goodness of fit measures.
The coefficients of the intercepts and covariates for the expected values \(E[y|c, X]\) of the normal distributions can be interpreted as marginal effects on component means. ‘lnsigma’ denotes the natural logarithm of the estimated standard deviation \(\sigma^{c}\). The coefficients of covariates in the multinomial logit model of probabilities of component membership are log-transformed relative probabilities. Our model only includes two components, so the multinomial logit model collapses to a binomial logit model. The intercept of -0.728 means that the average probability that an observation in the data belongs to component 1 is \(\text{exp}(-0.728)\), or 0.483, times the probability that it belongs to component 2.
| | | Estimate | Std. Error | t value | Pr(>|t|) | 2.5 % | 97.5 % |
|---|---|---|---|---|---|---|---|
| E[y|X, c] | | | | | | | |
| Comp1 | (Intercept) | -0.4307 | 0.0224 | -19.2351 | 0 | -0.4746 | -0.3868 |
| | hr | 0.3135 | 0.0065 | 47.9498 | 0 | 0.3007 | 0.3263 |
| | lnsigma | -1.2481 | 0.0215 | -58.0012 | 0 | -1.2903 | -1.2059 |
| Comp2 | (Intercept) | 0.2358 | 0.0069 | 34.184 | 0 | 0.2223 | 0.2494 |
| | hr | 0.1459 | 0.0019 | 76.3631 | 0 | 0.1421 | 0.1496 |
| | lnsigma | -2.4624 | 0.0178 | -138.1263 | 0 | -2.4973 | -2.4274 |
| P[c|X] | | | | | | | |
| Comp1 | (Intercept) | -0.728 | 0.0607 | -12.0018 | 0 | -0.8469 | -0.6091 |
| N = 10549 | ll = 706 | AIC = -1399 | BIC = -1348 |
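The reported relative probability can be verified directly from the intercept of the binomial logit; the numbers below reproduce the figures in the text:

delta0 <- -0.728   # intercept of the binomial logit (table 5.1)
exp(delta0)        # relative probability of component 1 vs. component 2: 0.483
plogis(delta0)     # implied average probability of component 1: 0.326
1 - plogis(delta0) # implied average probability of component 2: 0.674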
We obtain expected values of observations in the estimation data using the generic function predict(). Standard errors of fitted (estimation data) or predicted (new data) values are calculated using the delta method. Expected values exhibit a smoother distribution than observed values and do not show a gap between 1 and 0.883, because they are weighted averages of component distributions and 1 (figure 5.1).
Figure 5.1: Expected values from base case model
Expected values of observations in the estimation data can also be used to calculate average incremental effects or average treatment effects on the treated. Average treatment effects on the treated compare predictions for treated individuals to predictions for the same individuals without the effect of the treatment indicator. Standard errors of average treatment effects on the treated can be calculated using the delta method (see example code in the appendix).
A visual inspection of mean residuals over deciles of expected values (see example code of the modified Hosmer-Lemeshow test in the appendix) shows that model 1 fits the data poorly (figure 5.2). Although the test is overpowered with large data, and confidence bands should not be interpreted directly, the absolute over-prediction among individuals with low expected values is quite large.
Figure 5.2: Mean residuals over deciles of expected values, BFGS with zero starting values
We can use the ‘sandwich’ package (Zeileis 2006) to calculate robust or clustered standard errors from objects of class “aldvmm” (see example code in the appendix).
Hernández Alava and Wailoo (2015) suggested that the likelihood function of the adjusted limited dependent variable mixture model with the English EQ-5D-3L data might have multiple local optima, and that the estimation is sensitive to initial values. We thus fit model 1 with all optimization algorithms and methods for generating initial values available in aldvmm() to assess the sensitivity of model fits to optimization settings and to find the maximum likelihood estimates.
init.method <- c("zero", "random", "constant", "sann")
optim.method <- c("Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "nlminb", "Rcgmin",
"Rvmmin","hjn")
fit1_all <- list()
for (i in init.method) {
for (j in optim.method) {
set.seed(101010101) # Seed for random starting values
fit1_all[[i]][[j]] <- aldvmm::aldvmm(eq5d ~ hr | 1,
data = df,
psi = c(0.883, -0.594),
ncmp = 2,
init.method = i,
optim.method = j)
}
}

The maximized log-likelihood varies considerably across optimization methods and initial values, which confirms the sensitivity of the model to changes in optimization settings (table 5.2). In 18 of 32 scenarios, the algorithm converges at the maximum value of 706.32. The “CG” algorithm never converges at 706.32, while the “nlminb” algorithm always converges at the maximum value. Starting values from a simulated annealing algorithm lead to convergence at 706.32 most frequently but also require much more computation time (table 5.3).
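Table 5.2 can be assembled from the list of fits along the following lines; we assume here that each “aldvmm” object stores its log-likelihood in the element gof$ll, which should be verified with str(fit1_all[[1]][[1]]):

# Tabulate maximized log-likelihoods across all 32 fits
ll_tab <- t(sapply(fit1_all, function(by_init) {
  sapply(by_init, function(fit) round(fit$gof$ll, 1))
}))
ll_tab # rows: init.method, columns: optim.method (cf. table 5.2)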
The stabilized multinomial logit likelihood implemented in version 0.9.0 of the ‘aldvmm’ package considerably increased the probability that the example model converges at the maximum log-likelihood.
| | Nelder-Mead | BFGS | CG | L-BFGS-B | nlminb | Rcgmin | Rvmmin | hjn |
|---|---|---|---|---|---|---|---|---|
| zero | -245.7 | 706.3 | -634.8 | -4513.3 | 706.3 | -634.8 | 706.3 | 706.3 |
| random | 213.9 | -612.7 | 698.3 | -2830.2 | 706.3 | 706.3 | -606.3 | -627.8 |
| constant | 706.3 | 706.3 | -634.8 | -634.8 | 706.3 | -634.8 | 706.3 | 706.3 |
| sann | 706.3 | 706.3 | 698.7 | 706.3 | 706.3 | 706.3 | 706.3 | 706.3 |
Computation times vary considerably across optimization methods. The “BFGS”, “L-BFGS-B”, “nlminb”, and “Rvmmin” algorithms converge faster than the other algorithms. The “hjn” algorithm and simulated annealing (“sann”) starting values require more than average computation time (table 5.3).
The stabilized multinomial logit likelihood implemented in version 0.9.0 of the ‘aldvmm’ package decreases computation time for the optimization methods that did not converge to the maximum value with version 0.8.8 but do with version 0.9.0.
| | Nelder-Mead | BFGS | CG | L-BFGS-B | nlminb | Rcgmin | Rvmmin | hjn |
|---|---|---|---|---|---|---|---|---|
| zero | 0.42 | 0.10 | 0.16 | 0.06 | 0.11 | 0.11 | 0.09 | 1.71 |
| random | 0.65 | 0.34 | 3.11 | 0.09 | 0.16 | 0.56 | 0.29 | 0.46 |
| constant | 0.37 | 0.11 | 0.21 | 0.07 | 0.12 | 0.11 | 0.10 | 0.99 |
| sann | 2.66 | 4.00 | 7.00 | 3.53 | 2.71 | 2.57 | 2.63 | 4.27 |
Parameter estimates from optimization methods that did not converge at the maximum log-likelihood differ considerably from the values from methods that did converge at the maximum log-likelihood (table 5.4). The solution of the “hjn” algorithm is identical to the solution of the “BFGS” algorithm, but the components are swapped in the table.
When the model converges at the maximum log-likelihood, a complete covariance matrix is available (TRUE), but rarely otherwise (table 7.1 in the appendix).
| | | Nelder-Mead | BFGS | CG | L-BFGS-B | nlminb | Rcgmin | Rvmmin | hjn |
|---|---|---|---|---|---|---|---|---|---|
| E[y|X, c] | | | | | | | | | |
| Comp1 | (Intercept) | -0.0426 | -0.4307 | -0.0886 | 0.006 | -0.4307 | -0.0886 | -0.4307 | 0.2358 |
| | hr | 0.2194 | 0.3135 | 0.2318 | 0.2935 | 0.3135 | 0.2318 | 0.3135 | 0.1459 |
| | lnsigma | -1.8595 | -1.2481 | -1.6363 | -0.8361 | -1.2481 | -1.6363 | -1.2481 | -2.4624 |
| Comp2 | (Intercept) | 4.1853 | 0.2358 | -0.0886 | 0.006 | 0.2358 | -0.0886 | 0.2358 | -0.4307 |
| | hr | -1.0081 | 0.1459 | 0.2318 | 0.2935 | 0.1459 | 0.2318 | 0.1459 | 0.3135 |
| | lnsigma | -0.0249 | -2.4624 | -1.6363 | -0.8361 | -2.4624 | -1.6363 | -2.4624 | -1.2481 |
| P[c|X] | | | | | | | | | |
| Comp1 | (Intercept) | 3.5636 | -0.728 | 0 | 0 | -0.728 | 0 | -0.728 | 0.728 |
As adjusted limited dependent variable mixture models are frequently used for tasks that rely on predictions, we also compare expected values across algorithms with “zero” starting values. Predictions of models that do not converge at the maximum log-likelihood deviate from the predictions based on “BFGS” results. The deviation is least pronounced for the Nelder-Mead solution and most pronounced for the “L-BFGS-B” solution (figure 5.3).
Figure 5.3: Deviation of expected values from BFGS, Nelder-Mead and hjn versus nlminb with zero starting values
Based on the comparison of log-likelihoods, parameter estimates and predicted values, we deem the “nlminb” algorithm the most robust approach for the used data and model and use the “nlminb” algorithm with “zero” values for the remaining examples in this vignette.
We can also fit model 1 with user-defined starting values and box constraints. When constraints are imposed, the aldvmm() function uses the optimization method “L-BFGS-B”, which turns out to be very sensitive to starting values. We use zero initial values for all parameters except for the intercept in the multinomial logit, which we set to 0.7283, the magnitude of the corresponding estimate from the “nlminb” optimization method with “zero” starting values (table 5.4). We impose a lower limit of -3 on the log-standard deviations in both components.
The aldvmm() function returns a warning that the covariance matrix includes negative values on the diagonal. These values are the variances of the intercept and the log-standard deviation in component 2 (table 5.5). The log-likelihood amounts to -628, which is smaller than the value of the unconstrained model 1 (706.32), suggesting worse fit of the constrained model. The parameter estimates in component 1 resemble the coefficients in the poorly converged models with “zero” starting values (table 5.4), but the other coefficients do not. These results further emphasize the difficulty of finding a global optimum of the likelihood with English EQ-5D-3L utilities after hip replacement.
init <- c(0, 0, 0, 0, 0, 0, 0.7283)
lo <- c(-Inf, -Inf, -3, -Inf, -Inf, -3, -Inf)
hi <- c(Inf, Inf, Inf, Inf, Inf, Inf, Inf)
fit1_cstr <- aldvmm::aldvmm(eq5d ~ hr | 1,
data = df,
psi = c(0.883, -0.594),
ncmp = 2,
init.est = init,
init.lo = lo,
init.hi = hi)
summary(fit1_cstr)

| | | Estimate | Std. Error | t value | Pr(>|t|) | 2.5 % | 97.5 % |
|---|---|---|---|---|---|---|---|
| E[y|X, c] | | | | | | | |
| Comp1 | (Intercept) | -0.092 | 0.0085 | -10.8048 | 0 | -0.1087 | -0.0753 |
| | hr | 0.2325 | 0.0023 | 101.8493 | 0 | 0.228 | 0.237 |
| | lnsigma | -1.6406 | 0.0089 | -183.9809 | 0 | -1.6581 | -1.6231 |
| Comp2 | (Intercept) | 0.3931 | NaN | NaN | NaN | NaN | NaN |
| | hr | 2.5249 | 7.4795 | 0.3376 | 0.7357 | -12.1363 | 17.1861 |
| | lnsigma | -0.7651 | NaN | NaN | NaN | NaN | NaN |
| P[c|X] | | | | | | | |
| Comp1 | (Intercept) | 6.852 | 0.5388 | 12.7168 | 0 | 5.7958 | 7.9082 |
| N = 10549 | ll = -628 | AIC = 1270 | BIC = 1320 |
As three algorithms converge at solutions in which both components are identical and the mixture effectively collapses to a single distribution (table 5.4), we also estimate a single-component model.
fit <- aldvmm::aldvmm(eq5d ~ hr,
data = df,
psi = c(0.883, -0.594),
ncmp = 1,
init.method = "zero",
optim.method = "nlminb")
summary(fit)

The coefficients of the single-component model (table 5.6) are very similar to the parameters in the first component of the poorly converged model 1 from the “CG” and “Rcgmin” algorithms (table 5.4). The Akaike information criterion amounts to 1’276, which is larger than the Akaike information criterion of the “nlminb” solution of the two-component model (-1’399) and thus suggests worse fit of the single-component model even when model complexity is accounted for.
| | | Estimate | Std. Error | t value | Pr(>|t|) | 2.5 % | 97.5 % |
|---|---|---|---|---|---|---|---|
| E[y|X, c] | | | | | | | |
| Comp1 | (Intercept) | -0.0886 | 0.0085 | -10.4099 | 0 | -0.1053 | -0.0719 |
| | hr | 0.2318 | 0.0023 | 101.4681 | 0 | 0.2273 | 0.2362 |
| | lnsigma | -1.6363 | 0.0089 | -184.2422 | 0 | -1.6537 | -1.6189 |
| N = 10549 | ll = -635 | AIC = 1276 | BIC = 1297 |
As an alternative specification, we explore model 2 with a coefficient of the Oxford Hip score in the multinomial logit model of component membership. For this fit, we use the method “nlminb” with estimates from Hernández Alava and Wailoo (2015) as starting values.
init <- c(-.40293118, .30502755, .22614716, .14801581, -.70755741, 0,
-1.2632051, -2.4541401)
fit2 <- aldvmm::aldvmm(eq5d ~ hr | hr,
data = df,
psi = c(0.883, -0.594),
ncmp = 2,
init.est = init,
optim.method = "nlminb")
summary(fit2)

The Akaike information criterion of model 2 fitted using the “nlminb” method and user-defined starting values amounts to -1’867, which is smaller than the Akaike information criterion of model 1 (-1’399). The smaller Akaike information criterion suggests that the increase in the log-likelihood after inclusion of a coefficient of the Oxford Hip Score on the probability of component membership is sufficiently large to justify the extra parameter.
| | | Estimate | Std. Error | t value | Pr(>|t|) | 2.5 % | 97.5 % |
|---|---|---|---|---|---|---|---|
| E[y|X, c] | | | | | | | |
| Comp1 | (Intercept) | 0.0031 | 0.0281 | 0.1085 | 0.9136 | -0.0521 | 0.0582 |
| | hr | 0.0965 | 0.0116 | 8.352 | 0 | 0.0739 | 0.1192 |
| | lnsigma | -1.2676 | 0.0314 | -40.4144 | 0 | -1.3291 | -1.2061 |
| Comp2 | (Intercept) | 0.1816 | 0.0074 | 24.5104 | 0 | 0.1671 | 0.1961 |
| | hr | 0.1609 | 0.0019 | 85.4201 | 0 | 0.1573 | 0.1646 |
| | lnsigma | -2.2809 | 0.0126 | -181.079 | 0 | -2.3056 | -2.2563 |
| P[c|X] | | | | | | | |
| Comp1 | (Intercept) | 2.4455 | 0.1719 | 14.2293 | 0 | 2.1086 | 2.7824 |
| | hr | -1.3902 | 0.0566 | -24.5697 | 0 | -1.5011 | -1.2793 |
| N = 10549 | ll = 941 | AIC = -1867 | BIC = -1809 |
To validate the R implementation of adjusted limited dependent variable mixture models, we estimate the four models presented in Hernández Alava and Wailoo (2015) as reference cases in R and STATA. The STATA and R code for model estimation is included in the appendix.
Model 1 with default options
Model 1 with parameter constraints
Model 1 with initial values from constant-only model
Model 2 with user-defined initial values
| | | Ref. case 1 | | Ref. case 2 | | Ref. case 3 | | Ref. case 4 | |
|---|---|---|---|---|---|---|---|---|---|
| | | R | STATA | R | STATA | R | STATA | R | STATA |
| E[y|X, c] | | | | | | | | | |
| Comp1 | (Intercept) | -0.4307 | -0.427 | -0.092 | -0.092 | 0.2358 | 0.236 | 0.0031 | 0.006 |
| | hr | 0.3135 | 0.312 | 0.2325 | 0.232 | 0.1459 | 0.146 | 0.0965 | 0.095 |
| | lnsigma | -1.2481 | -1.251 | -1.6406 | -1.641 | -2.4624 | -2.463 | -1.2676 | -1.274 |
| Comp2 | (Intercept) | 0.2358 | 0.236 | 100 | 100.000 | -0.4307 | -0.427 | 0.1816 | 0.182 |
| | hr | 0.1459 | 0.146 | 0 | 0.000 | 0.3135 | 0.312 | 0.1609 | 0.161 |
| | lnsigma | -2.4624 | -2.463 | 0 | 0.000 | -1.2481 | -1.251 | -2.2809 | -2.280 |
| P[c|X] | | | | | | | | | |
| Comp1 | (Intercept) | -0.728 | -0.725 | 6.8564 | 6.855 | 0.728 | 0.725 | 2.4455 | 2.448 |
| | hr | NA | NA | NA | NA | NA | NA | -1.3902 | -1.393 |
| N = 10549 | ll | 706 | 715.84 | -628 | -613.7 | 706 | 715.84 | 941 | 953.2 |
The parameter estimates and standard errors obtained in R are very similar to the results from STATA (table 6.1 and table 6.2). R does not obtain any standard errors in reference case 2, while STATA returns standard errors for the first component and the probability of belonging to component 1. Although reference cases 1 and 3 yield different parameter estimates, they converge at the same log-likelihood, which further supports the hypothesis of multiple local optima of the likelihood. The relative ordering of models is consistent across platforms.
| | | Ref. case 1 | | Ref. case 2 | | Ref. case 3 | | Ref. case 4 | |
|---|---|---|---|---|---|---|---|---|---|
| | | R | STATA | R | STATA | R | STATA | R | STATA |
| E[y|X, c] | | | | | | | | | |
| Comp1 | (Intercept) | 0.0224 | 0.022 | NA | 0.009 | 0.0069 | 0.007 | 0.0281 | 0.028 |
| | hr | 0.0065 | 0.006 | NA | 0.002 | 0.0019 | 0.002 | 0.0116 | 0.012 |
| | lnsigma | 0.0215 | 0.021 | NA | 0.009 | 0.0178 | 0.018 | 0.0314 | 0.032 |
| Comp2 | (Intercept) | 0.0069 | 0.007 | NA | | 0.0224 | 0.022 | 0.0074 | 0.007 |
| | hr | 0.0019 | 0.002 | NA | | 0.0065 | 0.006 | 0.0019 | 0.002 |
| | lnsigma | 0.0178 | 0.018 | NA | | 0.0215 | 0.021 | 0.0126 | 0.013 |
| P[c|X] | | | | | | | | | |
| Comp1 | (Intercept) | 0.0607 | 0.061 | NA | 0.540 | 0.0607 | 0.061 | 0.1719 | 0.172 |
| | hr | NA | NA | NA | NA | NA | NA | 0.0566 | 0.056 |
| N = 10549 | ll | 706 | 715.84 | -628 | -613.7 | 706 | 715.84 | 941 | 953.2 |
Fitted values show very similar marginal distributions on both platforms (figure 6.1). R does not return fitted values in reference case 2. The summary statistics of differences in fitted values between R and STATA suggest that individual predictions are quite similar across platforms as well (table 6.3).
Standard errors of fitted values differ visibly between platforms (figure 6.2 and table 6.4). The difference is particularly pronounced in reference case 1, but the standard errors from STATA seem quite extreme compared to all other reference cases. R does not return standard errors of fitted values in reference case 2.
Figure 6.1: Fitted values in R and STATA
Figure 6.2: Standard errors of fitted values in R and STATA
| | Ref. case 1 | Ref. case 2 | Ref. case 3 | Ref. case 4 |
|---|---|---|---|---|
| Min. | -0.000038 | NA | -0.000038 | -0.000051 |
| 1st Qu. | -0.000036 | NA | -0.000036 | -0.000044 |
| Median | -0.000033 | NA | -0.000033 | 0.000008 |
| Mean | 0.000037 | NaN | 0.000037 | 0.000035 |
| 3rd Qu. | 0.000047 | NA | 0.000047 | 0.000070 |
| Max. | 0.000567 | NA | 0.000567 | 0.002319 |
| | Ref. case 1 | Ref. case 2 | Ref. case 3 | Ref. case 4 |
|---|---|---|---|---|
| Min. | 0.005897 | NA | -0.002063 | -0.016984 |
| 1st Qu. | 0.006898 | NA | -0.000551 | -0.000823 |
| Median | 0.008690 | NA | 0.000189 | -0.000364 |
| Mean | 0.008917 | NaN | 0.000171 | -0.000515 |
| 3rd Qu. | 0.011050 | NA | 0.001054 | 0.000354 |
| Max. | 0.013830 | NA | 0.001530 | 0.000792 |
The comparison of the R and STATA packages shows that the R implementation sometimes behaves differently than the STATA package, but the results are not indicative of technical errors in the R implementation.
Adjusted limited dependent variable mixture models are powerful tools for regression analysis of health state utilities. Unlike standard regression models, adjusted limited dependent variable mixture models account for limits, gaps and multi-modal distributions.
The comparison of different optimization methods with EQ-5D-3L utility data from English patients after hip replacement in 2011 and 2012 (NHS Digital 2013) showed that the likelihood function can be challenging to maximize and can converge at local optima or extreme solutions. Parameter estimates varied considerably across optimization methods and even across methods with the same maximum log-likelihood. However, fitted values were very similar across the four reference cases which suggests that the model is more robust for the identification of incremental and average marginal effects than for parameter identification.
The ‘aldvmm’ package offers a variety of optimization algorithms and methods for generating initial values which is an important strength in such challenging situations. It is essential to assess different optimization algorithms and methods for generating initial values before interpreting the parameter estimates or predictions of adjusted limited dependent variable mixture models. The analysis of the EQ-5D-3L utility data also suggests that simpler models with fewer components should be considered when multi-component models are difficult to fit. Even single-component adjusted limited dependent variable mixture models can improve fit compared to traditional regression techniques because they account for limits and gaps.
Although coefficients can be interpreted as marginal effects within each component, they cannot be interpreted in terms of overall expected values. Thus, average marginal effects and average treatment effects need to be calculated from predictions using the generic function predict(). Standard errors of marginal effects or average treatment effects can be calculated using the delta method (see example code in the appendix).
When no valid covariance matrix of parameters can be obtained or the user questions the validity of standard errors, the function sandwich::vcovBS() can be used to obtain bootstrapped standard errors (see example code in the appendix). Due to the large variability of results across model fits, a large number of iterations might be needed which increases computation time. Other functions from the ‘sandwich’ package can also be used to estimate robust or clustered standard errors.
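A minimal sketch, assuming a fitted object fit and the cluster variable grp from the appendix example; the number of bootstrap replicates is set via the R argument of sandwich::vcovBS():

vc_bs <- sandwich::vcovBS(fit, R = 500)                      # plain bootstrap
vc_bs_cl <- sandwich::vcovBS(fit, cluster = ~ grp, R = 500)  # clustered bootstrap
lmtest::coeftest(fit, vcov = vc_bs)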
In situations with repeated utility measures, the ‘aldvmm’ package only allows fixed effects estimations with group and time fixed effects which can be an important limitation in the analysis of clinical data. However, time fixed effects can be an appropriate modeling strategy in the presence of general time trends and dynamic selection in the population, e.g. because health state utilities decrease over time and treated individuals survive longer and thus are over-represented in later measurements.
Possible extensions of the ‘aldvmm’ package could include additional component distributions, a mixed model implementation for repeated measures and an implementation of functions for calculating average marginal effects and their standard errors.
| | Nelder-Mead | BFGS | CG | L-BFGS-B | nlminb | Rcgmin | Rvmmin | hjn |
|---|---|---|---|---|---|---|---|---|
| zero | FALSE | TRUE | FALSE | FALSE | TRUE | FALSE | TRUE | TRUE |
| random | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE | FALSE |
| constant | TRUE | TRUE | FALSE | FALSE | TRUE | FALSE | TRUE | TRUE |
| sann | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE |
# Create treatment indicator
#---------------------------
df$treated <- as.numeric(df$SEX == "2")
# Fit model
#----------
formula <- eq5d ~ treated + hr | 1
fit <- aldvmm(formula,
data = df,
psi = c(-0.594, 0.883))
# Predict treated
#----------------
# Subsample of treated observations
tmpdf1 <- df[df$treated == 1, ]
# Design matrix for treated observations
X1 <- aldvmm.mm(data = tmpdf1,
formula = fit$formula,
ncmp = fit$k,
lcoef = fit$label$lcoef)
# Average expected outcome of treated observations
mean1 <- mean(predict(fit,
newdata = tmpdf1,
type = "fit",
se.fit = TRUE)[["yhat"]], na.rm = TRUE)
# Predict counterfactual
#-----------------------
# Subsample of counterfactual observations
tmpdf0 <- tmpdf1
rm(tmpdf1)
tmpdf0$treated <- 0
# Design matrix for counterfactual observations
X0 <- aldvmm.mm(data = tmpdf0,
formula = fit$formula,
ncmp = fit$k,
lcoef = fit$label$lcoef)
# Average expected outcome of counterfactual observations
mean0 <- mean(predict(fit,
newdata = tmpdf0,
type = "fit",
se.fit = TRUE)[["yhat"]], na.rm = TRUE)
rm(tmpdf0)
# Standard error of ATET
#-----------------------
atet.grad <- numDeriv::jacobian(func = function(z) {
yhat1 <- aldvmm.pred(par = z,
X = X1,
y = rep(0, nrow(X1[[1]])),
psi = fit$psi,
ncmp = fit$k,
dist = fit$dist,
lcoef = fit$label$lcoef,
lcmp = fit$label$lcmp,
lcpar = fit$label$lcpar)[["yhat"]]
yhat0 <- aldvmm.pred(par = z,
X = X0,
y = rep(0, nrow(X0[[1]])),
psi = fit$psi,
ncmp = fit$k,
dist = fit$dist,
lcoef = fit$label$lcoef,
lcmp = fit$label$lcmp,
lcpar = fit$label$lcpar)[["yhat"]]
mean(yhat1 - yhat0, na.rm = TRUE)
},
x = fit$coef)
se.atet <- sqrt(atet.grad %*% fit$cov %*% t(atet.grad))
# Summarize
#----------
out <- data.frame(atet = mean1 - mean0,
se = se.atet,
z = (mean1 - mean0) / se.atet)
out$p <- 2*stats::pnorm(abs(out$z), lower.tail = FALSE)
out$ul <- out$atet + stats::qnorm((1 + 0.95)/2) * out$se
out$ll <- out$atet - stats::qnorm((1 + 0.95)/2) * out$se
print(out)

# Create cluster indicator
#-------------------------
df$grp <- as.factor(round(0.5 + runif(nrow(df)) * 5, 0))
# Fit model
#----------
formula <- eq5d ~ hr | 1
fit <- aldvmm(formula,
data = df,
psi = c(-0.594, 0.883))
# Calculate robust and clustered standard errors
#-----------------------------------------------
vc1 <- sandwich::sandwich(fit)
vc2 <- sandwich::vcovCL(fit, cluster = ~ grp)
vc3 <- sandwich::vcovPL(fit, cluster = ~ grp)
vc4 <- sandwich::vcovHAC(fit, cluster = ~ grp)
vc5 <- sandwich::vcovBS(fit)
vc6 <- sandwich::vcovBS(fit, cluster = ~ grp)
# Calculate test statistics
#--------------------------
lmtest::coeftest(fit)
lmtest::coeftest(fit, vcov = vc1)
lmtest::coeftest(fit, vcov = vc2)
lmtest::coeftest(fit, vcov = vc3)
lmtest::coeftest(fit, vcov = vc4)
lmtest::coeftest(fit, vcov = vc5)
lmtest::coeftest(fit, vcov = vc6)

# Number of percentiles
ngroup <- 10
# Extract expected values and residuals
yhat <- fit1_all[["zero"]][["Nelder-Mead"]][["pred"]][["yhat"]]
res <- fit1_all[["zero"]][["Nelder-Mead"]][["pred"]][["res"]]
# Make groups
group <- as.numeric(cut(yhat, breaks = ngroup))
# Auxiliary regression
aux <- stats::lm(res ~ factor(group))
# Data set of predictions from auxiliary regressions
newdf <- data.frame(group = unique(group)[order(unique(group))])
pred <- predict(aux,
                newdata = newdf,
                se.fit = TRUE,
                interval = 'confidence',
                level = 0.95)
plotdat <- as.data.frame(rbind(
  cbind(group = newdf$group,
        outcome = "mean",
        value = pred$fit[ , 'fit']),
  cbind(group = newdf$group,
        outcome = "ll",
        value = pred$fit[ , 'lwr']),
  cbind(group = newdf$group,
        outcome = "ul",
        value = pred$fit[ , 'upr'])
))
# Make plot
plot <- ggplot2::ggplot(plotdat,
                        ggplot2::aes(x = factor(as.numeric(group)),
                                     y = as.numeric(value),
                                     group = factor(outcome))) +
  ggplot2::geom_line(ggplot2::aes(linetype = factor(outcome)))

# (1) Reference case 1: Model 1 with default optimization settings
fit1_default <- aldvmm::aldvmm(eq5d ~ hr | 1,
data = df,
psi = c(0.883, -0.594),
ncmp = 2,
init.method = "zero",
optim.method = "nlminb")
# (2) Reference case 2: Model 1 with user-defined initial values and constraints on parameters
init <- c(0, 0, 0, 0, 0, 0, 0.7283)
lo <- c(-Inf, -Inf, -3, -Inf, -Inf, -3, -Inf)
hi <- c(Inf, Inf, Inf, Inf, Inf, Inf, Inf)
fit1_cstr <- aldvmm::aldvmm(eq5d ~ hr | 1,
data = df,
psi = c(0.883, -0.594),
ncmp = 2,
init.est = init,
init.lo = lo,
init.hi = hi)
# (3) Reference case 3: Model 1 with initial values from constant-only model
fit1_const <- aldvmm::aldvmm(eq5d ~ hr | 1,
data = df,
psi = c(0.883, -0.594),
ncmp = 2,
init.method = "constant",
optim.method = "nlminb")
# (4) Reference case 4: Model 2 with user-defined initial values
init <- c(-.40293118, .30502755, .22614716, .14801581, -.70755741, 0,
-1.2632051, -2.4541401)
fit2 <- aldvmm::aldvmm(eq5d ~ hr | hr,
data = df,
psi = c(0.883, -0.594),
ncmp = 2,
init.est = init,
optim.method = "nlminb")* (1) Reference case 1: Model 1 with default optimization settings
aldvmm eq5d hr, ncomponents(2)
* (2) Reference case 2: Model 1 with user-defined initial values and constraints on parameters
matrix input a = (0, 0, 0, 0, 0, 0, 0.7283)
constraint 1 [Comp_2]:hr10 = 0
constraint 2 [Comp_2]:_cons = 100
constraint 3 [lns_2]:_cons = 1e-30
aldvmm eq5d hr, ncomp(2) from(a) c(1 2 3)
* (3) Reference case 3: Model 1 with initial values from constant-only model
aldvmm eq5d hr, ncomp(2) inim(cons)
* (4) Reference case 4: Model 2 with user-defined initial values
matrix input start = (.14801581, .22614716, .30502755, -.40293118, 0,
-.70755741, -2.4541401, -1.2632051)
aldvmm eq5d hr, ncomp(2) prob(hr) from(start)

We write the ALDVMM log-likelihood function (Hernández Alava and Wailoo 2015) as the sum across observations of the log-transformed mixture density, i.e. the sum across components of the products of the multinomial logit probability of component membership \(A_{ic}\) and the density of expected values per component \(f_{ic}\).
\[\begin{equation} \label{eq:loglik} ll=\sum_{i=1}^{n}\ln\left(\sum_{c=1}^{C}A_{ic}f_{ic}\right) \end{equation}\]
We write the multinomial logit model as a softmax function and subtract the maximum linear predictor per observation across components from the observation’s linear predictors to improve numerical stability of the likelihood function. This stabilized softmax density function is equivalent to the original multinomial logit density but has better numerical properties.
\[\begin{equation} \label{eq:apart} \begin{array}{ll} A_{ic}&=\frac{\exp(w_{i}\delta_{c})}{\sum_{s=1}^{C}\exp(w_{i}\delta_{s})}\\ &=\frac{\exp(-\max(w_{i}\delta))}{\exp(-\max(w_{i}\delta))}\frac{\exp(w_{i}\delta_{c})}{\sum_{s=1}^{C}\exp(w_{i}\delta_{s})}\\ &=\frac{\exp(w_{i}\delta_{c} - \max(w_{i}\delta))}{\sum_{s=1}^{C}\exp(w_{i}\delta_{s} - \max(w_{i}\delta))}\\ \end{array} \end{equation}\]
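The same stabilization in R, where eta is an n x C matrix of linear predictors \(w_{i}\delta_{c}\); the function name is ours:

softmax_stable <- function(eta) {
  # Subtracting the row-wise maximum leaves the probabilities unchanged
  # but prevents overflow of exp() for large linear predictors
  eta <- eta - apply(eta, 1, max)
  exp(eta) / rowSums(exp(eta))
}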
The normal density of expected values per component depends on the range in which \(y_{i}\) falls relative to the maximum value \(\Psi_{2}\) and minimum value \(\Psi_{1}\) smaller than one in the value set.
\[\begin{equation} \label{eq:fpart} \begin{array}{ll} f_{ic} =& \begin{cases} \begin{array}{ll} 1 - \Phi(\frac{\Psi_{2}-x_{i}\beta_{c}}{\sigma_{c}}) & \text{if } y_{i}|c > \Psi_{2}\\ \Phi(\frac{\Psi_{1}-x_{i}\beta_{c}}{\sigma_{c}}) & \text{if } y_{i}|c \leq \Psi_{1}\\ \frac{\phi(\frac{y_{i} - x_{i}\beta_{c}}{\sigma_{c}})}{\sigma_{c}} & \text{if } \Psi_{1} < y_{i}|c \leq \Psi_{2}\\ \end{array} \end{cases} \end{array} \end{equation}\]
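A direct transcription of this density; mu and sigma denote a component's linear predictor and standard deviation, psi1 < psi2 are the value set limits, and all names are ours rather than the package's:

f_comp <- function(y, mu, sigma, psi1, psi2) {
  ifelse(y > psi2,
         1 - pnorm((psi2 - mu) / sigma),          # mass accumulated at 1
         ifelse(y <= psi1,
                pnorm((psi1 - mu) / sigma),       # mass accumulated at the lower limit
                dnorm((y - mu) / sigma) / sigma)) # density between the limits
}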
The gradient of the log-likelihood function with respect to all model parameters is the vector of first derivatives. The underlying score matrix includes the first derivatives of all observations’ likelihood contributions to the log-likelihood.
\[\begin{equation} \label{eq:gradll} \Delta = \left[\frac{\partial ll}{\partial \delta_{1}}, ..., \frac{\partial ll}{\partial \delta_{C}}, \frac{\partial ll}{\partial \beta_{1}}, ..., \frac{\partial ll}{\partial \beta_{C}}, \frac{\partial ll}{\partial \sigma_{1}}, ..., \frac{\partial ll}{\partial \sigma_{C}}\right] \end{equation}\]
The first derivative of the log-likelihood with respect to the parameters \(\delta_{c}\) must consider the first derivatives of all elements of the softmax denominator and depends on whether an element belongs to the same component (\(s = c\)) or not (\(s \neq c\)).
\[\begin{equation} \label{eq:dlldd} \begin{array}{ll} \frac{\partial ll}{\partial \delta_{c}} &= \sum_{i=1}^{n}\frac{1}{\sum_{k=1}^{C}A_{ik}f_{ik}}\sum_{s=1}^{C}f_{is}\frac{\partial A_{is}}{\partial \delta_{c}}\\ \end{array} \end{equation}\]
\[\begin{equation} \label{eq:dadd} \begin{array}{ll} \frac{\partial A_{is}}{\partial \delta_{c}} =& \begin{cases} \begin{array}{ll} w_{i} A_{ic}(1-A_{ic}) & \text{if } s = c\\ -w_{i} A_{is}A_{ic}& \text{if } s \neq c\\ \end{array} \end{cases} \end{array} \end{equation}\]
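The sign in the case \(s \neq c\) can be checked numerically; the snippet compares the analytical Jacobian of the softmax, \(\partial A_{s}/\partial \eta_{c} = A_{s}(1[s=c] - A_{c})\), with a numerical Jacobian from the ‘numDeriv’ package:

eta <- c(0.2, -0.4, 0.7)
A <- exp(eta) / sum(exp(eta))
J_analytic <- diag(A) - A %*% t(A) # off-diagonal elements are -A_s * A_c
J_numeric <- numDeriv::jacobian(function(e) exp(e) / sum(exp(e)), eta)
all.equal(J_analytic, J_numeric, tolerance = 1e-6)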
The first derivative of the log-likelihood with respect to the parameters \(\beta_c\) depends on the range in which \(y_{i}\) falls relative to the maximum value \(\Psi_{2}\) and minimum value \(\Psi_{1}\) smaller than one in the value set.
\[\begin{equation} \label{eq:dllb} \begin{array}{ll} \frac{\partial ll}{\partial \beta_{c}} &=\sum_{i=1}^{n}\frac{A_{ic}}{\sum_{k=1}^{C}A_{ik}f_{ik}}\frac{\partial f_{ic}}{\partial \beta_{c}} \\ \end{array} \end{equation}\]
\[\begin{equation} \label{eq:dfdb} \begin{array}{ll} \frac{\partial f_{ic}}{\partial \beta_{c}} =& \begin{cases} \begin{array}{ll} \frac{x_{i}}{\sigma_{c}}\phi (\frac{\Psi_{2} - x_{i}\beta_{c}}{\sigma_{c}}) & \text{if } y_{i}|c > \Psi_{2}\\ -\frac{x_{i}}{\sigma_{c}}\phi (\frac{\Psi_{1} - x_{i}\beta_{c}}{\sigma_{c}})& \text{if } y_{i}|c \leq \Psi_{1}\\ \frac{x_{i}}{\sigma_{c}} \frac{y_{i} - x_{i}\beta_{c}}{\sigma_{c}^{2}} \phi (\frac{y_{i} - x_{i}\beta_{c}}{\sigma_{c}}) & \text{if } \Psi_{1} < y_{i}|c \leq \Psi_{2}\\ \end{array} \end{cases} \end{array} \end{equation}\]
The first derivative of the log-likelihood with respect to the parameters \(\sigma_c\) depends on the range in which \(y_{i}\) falls relative to the maximum value \(\Psi_{2}\) and minimum value \(\Psi_{1}\) smaller than one in the value set.
\[\begin{equation} \label{eq:dllds} \begin{array}{ll} \frac{\partial ll}{\partial \sigma_{c}} &=\sum_{i=1}^{n}\frac{A_{ic}}{\sum_{k=1}^{C}A_{ik}f_{ik}}\frac{\partial f_{ic}}{\partial \sigma_{c}} \\ \end{array} \end{equation}\]
\[\begin{equation} \label{eq:dfds} \begin{array}{ll} \frac{\partial f_{ic}}{\partial \sigma_{c}} =& \begin{cases} \begin{array}{ll} \phi (\frac{\Psi_{2} - x_{i}\beta_{c}}{\sigma_{c}})\frac{\Psi_{2} - x_{i}\beta_{c}}{\sigma_{c}^{2}} & \text{if } y_{i}|c > \Psi_{2}\\ -\phi (\frac{\Psi_{1} - x_{i}\beta_{c}}{\sigma_{c}})\frac{\Psi_{1} - x_{i}\beta_{c}}{\sigma_{c}^{2}} & \text{if } y_{i}|c \leq \Psi_{1}\\ \phi (\frac{y_{i} - x_{i}\beta_{c}}{\sigma_{c}})\left[ (\frac{y_{i} - x_{i}\beta_{c}}{\sigma_{c}})^2 - 1 \right]\frac{1}{\sigma_{c}^2} & \text{if } \Psi_{1} < y_{i}|c \leq \Psi_{2}\\ \end{array} \end{cases} \end{array} \end{equation}\]