Exercise 8. Diet data: Using Poisson regression to study the effect of energy intake adjusting for confounders on two different timescales
Use Poisson regression to study the association between energy intake (hieng) and CHD adjusted for potential confounders (job, BMI). We know that people who expend a lot of energy (i.e., are physically active) require a higher energy intake. We do not have data on physical activity but we are hoping that occupation (job) will serve as a surrogate measure of work-time physical activity (conductors on London double-decker busses expend energy walking up and down the stairs all day).
Fit models both without adjusting for ‘time’ and by adjusting for attained age (you will need to split the data) and time-since-entry and compare the results.
Load the diet data using time-on-study as the timescale.
You may have to install the required packages the first time you use them. You can install a package by install.packages("package_of_interest") for each package you require.
library(biostat3) 
library(dplyr)     # for data manipulation
library(bshazard)  # for bshazardLoad diet data and explore it.
diet <- biostat3::diet
head(diet)
summary(diet)The rates plotted on timescale attained age show a clear increasing trend as age increases, which is to be expected (older persons are more likely to suffer from CHD). The rates plotted on timescale time-since-entry are almost constant (if you have some imagination you can see that the rates are flat).
(a)
scale <- 365.24
plot.bshazard(bshazard(Surv((doe - dob)/scale, (dox - dob)/scale, chd) ~ 1, data=subset(diet,hieng=="low")),
                         ylim=c(0,0.03), conf.int=FALSE, xlab="Attained age (years)")## Iterations: relative error in phi-hat = 1e-04 
## phi= 2.123004   sv2= 0.02485179   df= 5.280754   lambda= 85.4266 
## phi= 2.160747   sv2= 0.003641273   df= 3.533208   lambda= 593.4042 
## phi= 2.124866   sv2= 0.0009705935   df= 2.570406   lambda= 2189.244 
## phi= 2.107271   sv2= 0.0003506369   df= 2.218929   lambda= 6009.838 
## phi= 2.103164   sv2= 0.0001396506   df= 2.090053   lambda= 15060.19lines(bshazard(Surv((doe - dob)/scale, (dox - dob)/scale, chd) ~ 1, data=subset(diet,hieng=="high")),
      col="red", conf.int=FALSE)## Iterations: relative error in phi-hat = 1e-04 
## phi= 1.89958   sv2= 0.01820315   df= 5.406518   lambda= 104.3544 
## phi= 2.016237   sv2= 0.001851403   df= 3.443732   lambda= 1089.032 
## phi= 2.070698   sv2= 0.000138238   df= 2.351894   lambda= 14979.23legend("topleft", legend=c('hieng=="low"','hieng=="high"'), col=1:2, lty=1, bty="n")scale <- 365.24
plot(muhaz2(Surv((dox - doe)/scale, chd) ~ hieng, data=diet), lty=2,
     xlab="Time since study entry (years)", ylim=c(0,0.025),
     legend=FALSE)
lines(bshazard(Surv((dox - doe)/scale, chd) ~ 1, data=subset(diet,hieng=="low")),
      conf.int=FALSE)## Iterations: relative error in phi-hat = 1e-04 
## phi= 2.235959   sv2= 0.07486988   df= 6.59953   lambda= 29.86459 
## phi= 2.417459   sv2= 0.02283781   df= 5.288999   lambda= 105.8534 
## phi= 2.562882   sv2= 0.004970585   df= 4.113921   lambda= 515.6097 
## phi= 2.775805   sv2= 0.0009967037   df= 3.051224   lambda= 2784.986 
## phi= 3.104534   sv2= 0.0002592808   df= 2.34643   lambda= 11973.64lines(bshazard(Surv((dox - doe)/scale, chd) ~ 1, data=subset(diet,hieng=="high")),
      col="red", conf.int=FALSE)## Iterations: relative error in phi-hat = 1e-04 
## phi= 0.9665759   sv2= 0.02663575   df= 5.684564   lambda= 36.28866 
## phi= 1.016078   sv2= 0.006372641   df= 4.404778   lambda= 159.4438 
## phi= 1.053943   sv2= 0.002188221   df= 3.358888   lambda= 481.644 
## phi= 1.09428   sv2= 0.001338729   df= 2.80305   lambda= 817.4017 
## phi= 1.119151   sv2= 0.001015902   df= 2.592819   lambda= 1101.633 
## phi= 1.133828   sv2= 0.0008474499   df= 2.49013   lambda= 1337.929 
## phi= 1.143282   sv2= 0.00074534   df= 2.429672   lambda= 1533.906 
## phi= 1.149776   sv2= 0.0006777671   df= 2.390227   lambda= 1696.418 
## phi= 1.154442   sv2= 0.0006304254   df= 2.362804   lambda= 1831.212 
## phi= 1.157905   sv2= 0.0005959215   df= 2.342908   lambda= 1943.05 
## phi= 1.160537   sv2= 0.0005700457   df= 2.328029   lambda= 2035.867 
## phi= 1.162574   sv2= 0.0005502227   df= 2.316652   lambda= 2112.915 
## phi= 1.164172   sv2= 0.0005347879   df= 2.307805   lambda= 2176.886 
## phi= 1.16544   sv2= 0.0005226174   df= 2.300835   lambda= 2230.007 
## phi= 1.166455   sv2= 0.0005129252   df= 2.295288   lambda= 2274.123 
## phi= 1.167273   sv2= 0.0005051456   df= 2.290838   lambda= 2310.765 
## phi= 1.167936   sv2= 0.0004988616   df= 2.287245   lambda= 2341.202 
## phi= 1.168475   sv2= 0.0004937596   df= 2.284328   lambda= 2366.485 
## phi= 1.168916   sv2= 0.0004896003   df= 2.28195   lambda= 2387.49 
## phi= 1.169277   sv2= 0.0004861979   df= 2.280006   lambda= 2404.94 
## phi= 1.169573   sv2= 0.0004834071   df= 2.278411   lambda= 2419.437 
## phi= 1.169817   sv2= 0.0004811128   df= 2.277101   lambda= 2431.482 
## phi= 1.170018   sv2= 0.0004792231   df= 2.276021   lambda= 2441.49 
## phi= 1.170185   sv2= 0.0004776643   df= 2.27513   lambda= 2449.805 
## phi= 1.170322   sv2= 0.0004763768   df= 2.274395   lambda= 2456.715 
## phi= 1.170435   sv2= 0.0004753124   df= 2.273787   lambda= 2462.455 
## phi= 1.170529   sv2= 0.0004744315   df= 2.273284   lambda= 2467.225 
## phi= 1.170607   sv2= 0.0004737021   df= 2.272867   lambda= 2471.189 
## phi= 1.170672   sv2= 0.0004730977   df= 2.272522   lambda= 2474.482 
## phi= 1.170725   sv2= 0.0004725967   df= 2.272236   lambda= 2477.219 
## phi= 1.17077   sv2= 0.0004721812   df= 2.271999   lambda= 2479.493 
## phi= 1.170807   sv2= 0.0004718365   df= 2.271802   lambda= 2481.382 
## phi= 1.170837   sv2= 0.0004715505   df= 2.271638   lambda= 2482.952 
## phi= 1.170863   sv2= 0.0004713131   df= 2.271503   lambda= 2484.257 
## phi= 1.170884   sv2= 0.000471116   df= 2.27139   lambda= 2485.341 
## phi= 1.170901   sv2= 0.0004709523   df= 2.271297   lambda= 2486.242 
## phi= 1.170916   sv2= 0.0004708164   df= 2.271219   lambda= 2486.99 
## phi= 1.170928   sv2= 0.0004707035   df= 2.271155   lambda= 2487.612 
## phi= 1.170938   sv2= 0.0004706098   df= 2.271101   lambda= 2488.129 
## phi= 1.170946   sv2= 0.0004705319   df= 2.271057   lambda= 2488.559 
## phi= 1.170953   sv2= 0.0004704672   df= 2.27102   lambda= 2488.915 
## phi= 1.170959   sv2= 0.0004704135   df= 2.270989   lambda= 2489.212 
## phi= 1.170964   sv2= 0.0004703688   df= 2.270963   lambda= 2489.458legend("topleft", legend=c('hieng=="low" (bshazard)','hieng=="high" (bshazard)',
                           'hieng=="low" (muhaz)','hieng=="high" (muhaz)'), col=1:2, lty=c(1,1,2,2), bty="n")(b)
Patients with high energy intake have 48% less CHD rate. The underlying shape of the rates is assumed to be constant (i.e. the baseline is flat) over time.
diet <- mutate(diet, y1k = y / 1000)
poisson8b <- glm( chd ~ hieng + offset( log( y1k ) ), family=poisson, data=diet)
summary(poisson8b)## 
## Call:
## glm(formula = chd ~ hieng + offset(log(y1k)), family = poisson, 
##     data = diet)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7382  -0.6337  -0.4899  -0.3891   3.0161  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.6098     0.1890  13.811   <2e-16 ***
## hienghigh    -0.6532     0.3021  -2.162   0.0306 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 262.82  on 336  degrees of freedom
## Residual deviance: 258.00  on 335  degrees of freedom
## AIC: 354
## 
## Number of Fisher Scoring iterations: 6eform(poisson8b)##              exp(beta)     2.5 %     97.5 %
## (Intercept) 13.5959916 9.3877130 19.6907371
## hienghigh    0.5203599 0.2878432  0.9407012(c)
The effect of high energy intake is slightly confounded by bmi and job, since the point estimate changes a little.
## Create BMI variable
diet$bmi <- diet$weight/((diet$height/100)^2)
## Create orderly varable instead of categorical, start at zero
levels(diet$job)## [1] "driver"    "conductor" "bank"diet <- mutate(diet, jobNumber = unclass(job) - 1)
poisson8c <- glm( chd ~ hieng + jobNumber + bmi + offset( log( y1k ) ), family=poisson, data=diet)
summary(poisson8c)## 
## Call:
## glm(formula = chd ~ hieng + jobNumber + bmi + offset(log(y1k)), 
##     family = poisson, data = diet)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8376  -0.6007  -0.4815  -0.3919   3.0614  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  1.49616    1.18209   1.266   0.2056  
## hienghigh   -0.69995    0.30986  -2.259   0.0239 *
## jobNumber   -0.08706    0.17170  -0.507   0.6121  
## bmi          0.05091    0.04757   1.070   0.2845  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 255.01  on 331  degrees of freedom
## Residual deviance: 249.03  on 328  degrees of freedom
##   (5 observations deleted due to missingness)
## AIC: 347.03
## 
## Number of Fisher Scoring iterations: 6eform(poisson8c)##             exp(beta)     2.5 %     97.5 %
## (Intercept) 4.4644913 0.4401212 45.2868008
## hienghigh   0.4966098 0.2705588  0.9115256
## jobNumber   0.9166234 0.6546985  1.2833364
## bmi         1.0522316 0.9585556  1.1550622##########################################################(d)
The y variable is not correct since it is kept for all splitted records, and contains the complete follow-up rather than the risktime in that specific timeband. The risktime variable contains the correct amount of risktime for each timeband. The event variable chd is not correct since it is kept constant for all splitted records, while it should only be 1 for the last record (if the person has the event). For all other records (timebands) for that person it should be 0.
## Split time at 30,50,60 and 72 with time scale age at entry to attained age
scale <- 365.24
age.cuts <- c(30,50,60,72)
diet.spl.dob <- survSplit(Surv((doe - dob)/scale, (dox - dob)/scale, chd) ~ ., data=diet, cut=age.cuts,start="tstart",end="tstop")
## Tabulate ageband
diet.spl.dob %>% select(id, tstart, tstop, y) %>% filter(id<=3) %>% arrange(id, tstart)##   id   tstart    tstop        y
## 1  1 49.61669 50.00000 12.29295
## 2  1 50.00000 60.00000 12.29295
## 3  1 60.00000 61.90998 12.29295
## 4  2 50.53937 60.00000 11.95893
## 5  2 60.00000 62.49863 11.95893
## 6  3 58.78600 60.00000 11.04175
## 7  3 60.00000 69.82806 11.04175## Create an agespan variable
diet.spl.dob <- mutate(diet.spl.dob,
                       agespan = cut(tstop, age.cuts))
## Make the numeric variables factors since we want to model them with dummie variables and calculate time at risk
diet.spl.dob <- mutate(diet.spl.dob,
                       jobNumber = as.factor(jobNumber),
                       risk_time = (tstop-tstart))
## Tabulate ageband including risk_time
diet.spl.dob %>% select(id,  tstart, tstop, y,risk_time) %>% filter(id<=3) %>% arrange(id, tstart)##   id   tstart    tstop        y  risk_time
## 1  1 49.61669 50.00000 12.29295  0.3833096
## 2  1 50.00000 60.00000 12.29295 10.0000000
## 3  1 60.00000 61.90998 12.29295  1.9099770
## 4  2 50.53937 60.00000 11.95893  9.4606286
## 5  2 60.00000 62.49863 11.95893  2.4986310
## 6  3 58.78600 60.00000 11.04175  1.2139963
## 7  3 60.00000 69.82806 11.04175  9.8280583## Tabulate number of events per agespan
xtabs(~agespan+chd,diet.spl.dob)##          chd
## agespan     0   1
##   (30,50] 190   6
##   (50,60] 275  18
##   (60,72] 218  22The effect of high energy intake is somewhat confounded by age, but also confounded by job and bmi.
poisson8d <- glm( chd ~ hieng + agespan + offset( log( risk_time) ),
                 family=poisson,
                 data=diet.spl.dob)
summary(poisson8d)## 
## Call:
## glm(formula = chd ~ hieng + agespan + offset(log(risk_time)), 
##     family = poisson, data = diet.spl.dob)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6253  -0.4312  -0.3246  -0.2050   2.8938  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -4.7798     0.4320 -11.065   <2e-16 ***
## hienghigh       -0.6233     0.3026  -2.060   0.0394 *  
## agespan(50,60]   0.3025     0.4721   0.641   0.5217    
## agespan(60,72]   0.8451     0.4613   1.832   0.0670 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 321.05  on 728  degrees of freedom
## Residual deviance: 311.40  on 725  degrees of freedom
## AIC: 411.4
## 
## Number of Fisher Scoring iterations: 6eform(poisson8d)##                  exp(beta)       2.5 %     97.5 %
## (Intercept)    0.008397598 0.003601145 0.01958257
## hienghigh      0.536168916 0.296269765 0.97032212
## agespan(50,60] 1.353254649 0.536460711 3.41366685
## agespan(60,72] 2.328213936 0.942643672 5.75040208Fitting the model for CHD, with adjustment for job and bmi.
poisson8d <- glm( chd ~ hieng + agespan + jobNumber + bmi + offset( log( risk_time) ),
                 family=poisson,
                 data=diet.spl.dob)
summary(poisson8d)## 
## Call:
## glm(formula = chd ~ hieng + agespan + jobNumber + bmi + offset(log(risk_time)), 
##     family = poisson, data = diet.spl.dob)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7714  -0.4027  -0.3083  -0.1940   3.0174  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -6.79183    1.31326  -5.172 2.32e-07 ***
## hienghigh      -0.71303    0.31387  -2.272   0.0231 *  
## agespan(50,60]  0.53692    0.50868   1.056   0.2912    
## agespan(60,72]  1.07421    0.49668   2.163   0.0306 *  
## jobNumber1      0.43510    0.40670   1.070   0.2847    
## jobNumber2     -0.13791    0.37184  -0.371   0.7107    
## bmi             0.07388    0.04852   1.523   0.1278    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 313.24  on 718  degrees of freedom
## Residual deviance: 298.77  on 712  degrees of freedom
##   (10 observations deleted due to missingness)
## AIC: 402.77
## 
## Number of Fisher Scoring iterations: 6eform(poisson8d)##                  exp(beta)        2.5 %     97.5 %
## (Intercept)    0.001122909 8.560401e-05 0.01472974
## hienghigh      0.490157708 2.649506e-01 0.90678996
## agespan(50,60] 1.710737260 6.312339e-01 4.63635138
## agespan(60,72] 2.927691811 1.105992e+00 7.74994872
## jobNumber1     1.545112410 6.962690e-01 3.42880771
## jobNumber2     0.871175509 4.203362e-01 1.80557060
## bmi            1.076677746 9.790156e-01 1.18408219(e)
time.cuts <- c(0, 5, 10, 15, 22)
diet.spl.t_entry <- survSplit(Surv((dox-doe)/365.24, chd) ~ ., data=diet, cut=time.cuts, end="tstop", start="tstart", event="chd")
##Tabulate ageband
diet.spl.t_entry %>% select(id, tstart, tstop, y) %>% filter(id<=3) %>% arrange(id, tstart)##   id tstart    tstop        y
## 1  1      0  5.00000 12.29295
## 2  1      5 10.00000 12.29295
## 3  1     10 12.29329 12.29295
## 4  2      0  5.00000 11.95893
## 5  2      5 10.00000 11.95893
## 6  2     10 11.95926 11.95893
## 7  3      0  5.00000 11.04175
## 8  3      5 10.00000 11.04175
## 9  3     10 11.04205 11.04175diet.spl.t_entry <- mutate(diet.spl.t_entry,
                           fu = cut(tstop, time.cuts),
                           risk_time = (tstop-tstart))
##Tabulate ageband including risk_time
diet.spl.t_entry %>% select(id, fu, tstart, y, risk_time) %>% filter(id<=3) %>% arrange(id, tstart)##   id      fu tstart        y risk_time
## 1  1   (0,5]      0 12.29295  5.000000
## 2  1  (5,10]      5 12.29295  5.000000
## 3  1 (10,15]     10 12.29295  2.293287
## 4  2   (0,5]      0 11.95893  5.000000
## 5  2  (5,10]      5 11.95893  5.000000
## 6  2 (10,15]     10 11.95893  1.959260
## 7  3   (0,5]      0 11.04175  5.000000
## 8  3  (5,10]      5 11.04175  5.000000
## 9  3 (10,15]     10 11.04175  1.042055poisson8e1 <- glm( chd ~ fu + hieng + offset( log( risk_time) ),
                 family=poisson,
                 data=diet.spl.t_entry )
summary(poisson8e1)## 
## Call:
## glm(formula = chd ~ fu + hieng + offset(log(risk_time)), family = poisson, 
##     data = diet.spl.t_entry)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3994  -0.3344  -0.2717  -0.2417   4.2937  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.25958    0.26933 -15.815   <2e-16 ***
## fu(5,10]    -0.23369    0.37705  -0.620   0.5354    
## fu(10,15]    0.12151    0.36841   0.330   0.7415    
## fu(15,22]   -0.05012    0.55562  -0.090   0.9281    
## hienghigh   -0.64923    0.30212  -2.149   0.0316 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 391.17  on 1099  degrees of freedom
## Residual deviance: 385.52  on 1095  degrees of freedom
## AIC: 487.52
## 
## Number of Fisher Scoring iterations: 6eform(poisson8e1)##              exp(beta)       2.5 %     97.5 %
## (Intercept) 0.01412826 0.008333566 0.02395225
## fu(5,10]    0.79160514 0.378061629 1.65750408
## fu(10,15]   1.12919965 0.548506063 2.32466319
## fu(15,22]   0.95111381 0.320099619 2.82604985
## hienghigh   0.52244901 0.288986787 0.94451712There seems to be no confounding by time-since-entry, but there is confounding by bmi and job.