Exercise 6. Diet data: tabulating incidence rates and modelling with Poisson regression
You may have to install the required packages the first time you use them. You can install a package by install.packages("package_of_interest") for each package you require.
Load the diet data using time-on-study as the timescale. We look at the first six rows of the data using head and look at a summary for the variables using summary:
##    id chd         y hieng  energy       job month   height   weight        doe        dox
## 1 127   0 16.791239   low 2023.25 conductor     2 173.9900 61.46280 1960-02-16 1976-12-01
## 2 200   0 19.958933   low 2448.68      bank    12 177.8000 73.48320 1956-12-16 1976-12-01
## 3 198   0 19.958933   low 2281.38      bank    12       NA       NA 1956-12-16 1976-12-01
## 4 222   0 15.394935   low 2467.95      bank     2 158.7500 58.24224 1957-02-16 1972-07-10
## 5 305   1  1.494866   low 2362.93      bank     1       NA       NA 1960-01-16 1961-07-15
## 6 173   0 15.958932   low 2380.11 conductor    12 164.4904 79.01712 1960-12-16 1976-12-01
##          dob      yoe      yox      yob         y1k      bmi jobNumber
## 1 1910-09-27 1960.126 1976.917 1910.737 0.016791239 20.30316         1
## 2 1909-06-18 1956.958 1976.917 1909.460 0.019958933 23.24473         2
## 3 1910-06-30 1956.958 1976.917 1910.493 0.019958933       NA         2
## 4 1902-07-11 1957.126 1972.523 1902.523 0.015394935 23.11057         2
## 5 1913-06-30 1960.041 1961.534 1913.493 0.001494866       NA         2
## 6 1915-06-28 1960.958 1976.917 1915.487 0.015958932 29.20385         1##        id           chd               y            hieng         energy    
##  Min.   :  1   Min.   :0.0000   Min.   : 0.2875   low :155   Min.   :1748  
##  1st Qu.: 85   1st Qu.:0.0000   1st Qu.:10.7762   high:182   1st Qu.:2537  
##  Median :169   Median :0.0000   Median :15.4606              Median :2803  
##  Mean   :169   Mean   :0.1365   Mean   :13.6607              Mean   :2829  
##  3rd Qu.:253   3rd Qu.:0.0000   3rd Qu.:17.0431              3rd Qu.:3110  
##  Max.   :337   Max.   :1.0000   Max.   :20.0411              Max.   :4396  
##                                                                            
##         job          month            height          weight            doe            
##  driver   :102   Min.   : 1.000   Min.   :152.4   Min.   : 46.72   Min.   :1956-11-16  
##  conductor: 84   1st Qu.: 3.000   1st Qu.:168.9   1st Qu.: 64.64   1st Qu.:1959-01-16  
##  bank     :151   Median : 6.000   Median :173.0   Median : 72.80   Median :1960-02-16  
##                  Mean   : 6.231   Mean   :173.4   Mean   : 72.54   Mean   :1960-06-22  
##                  3rd Qu.:10.000   3rd Qu.:177.8   3rd Qu.: 79.83   3rd Qu.:1961-06-16  
##                  Max.   :12.000   Max.   :190.5   Max.   :106.14   Max.   :1966-09-16  
##                                   NA's   :5       NA's   :4                            
##       dox                  dob                  yoe            yox            yob      
##  Min.   :1958-08-29   Min.   :1892-01-10   Min.   :1957   Min.   :1959   Min.   :1892  
##  1st Qu.:1972-09-29   1st Qu.:1906-01-18   1st Qu.:1959   1st Qu.:1973   1st Qu.:1906  
##  Median :1976-12-01   Median :1911-02-25   Median :1960   Median :1977   Median :1911  
##  Mean   :1974-02-19   Mean   :1911-01-04   Mean   :1960   Mean   :1974   Mean   :1911  
##  3rd Qu.:1976-12-01   3rd Qu.:1915-01-30   3rd Qu.:1961   3rd Qu.:1977   3rd Qu.:1915  
##  Max.   :1976-12-01   Max.   :1930-09-19   Max.   :1967   Max.   :1977   Max.   :1931  
##                                                                                        
##       y1k                 bmi          jobNumber    
##  Min.   :0.0002875   Min.   :15.88   Min.   :0.000  
##  1st Qu.:0.0107762   1st Qu.:21.59   1st Qu.:0.000  
##  Median :0.0154606   Median :24.11   Median :1.000  
##  Mean   :0.0136607   Mean   :24.12   Mean   :1.145  
##  3rd Qu.:0.0170431   3rd Qu.:26.50   3rd Qu.:2.000  
##  Max.   :0.0200411   Max.   :33.29   Max.   :2.000  
##                      NA's   :5(a)
diet <- biostat3::diet
diet$y1k <- diet$y/1000
diet.ir6a <- survRate(Surv(y/1000,chd) ~ hieng, data=diet)
## or
diet %>%
    group_by(hieng) %>%
    summarise(Event = sum(chd), Time = sum(y1k), Rate = Event/Time,      # group sums
              CI_low = poisson.test(Event,Time)$conf.int[1],
              CI_high = poisson.test(Event,Time)$conf.int[2]) ## # A tibble: 2 × 6
##   hieng Event  Time  Rate CI_low CI_high
##   <fct> <int> <dbl> <dbl>  <dbl>   <dbl>
## 1 low      28  2.06 13.6    9.03    19.6
## 2 high     18  2.54  7.07   4.19    11.2## 
##  Comparison of Poisson rates
## 
## data:  event time base: tstop
## count1 = 28, expected count1 = 20.578, p-value = 0.03681
## alternative hypothesis: true rate ratio is not equal to 1
## 95 percent confidence interval:
##  1.026173 3.688904
## sample estimates:
## rate ratio 
##   1.921747## 
##  Comparison of Poisson rates
## 
## data:  rev(event) time base: rev(tstop)
## count1 = 18, expected count1 = 25.422, p-value = 0.03681
## alternative hypothesis: true rate ratio is not equal to 1
## 95 percent confidence interval:
##  0.2710832 0.9744943
## sample estimates:
## rate ratio 
##  0.5203599We see that individuals with a high energy intake have a lower CHD incidence rate. The estimated crude incidence rate ratio is 0.52 (95% CI: 0.27, 0.97).
(b)
The code is:
## 
## Call:
## glm(formula = chd ~ hieng + offset(log(y1k)), family = poisson, 
##     data = diet)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7382  -0.6337  -0.4899  -0.3891   3.0161  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.6098     0.1890  13.811   <2e-16 ***
## hienghigh    -0.6532     0.3021  -2.162   0.0306 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 262.82  on 336  degrees of freedom
## Residual deviance: 258.00  on 335  degrees of freedom
## AIC: 354
## 
## Number of Fisher Scoring iterations: 6##              exp(beta)     2.5 %     97.5 %
## (Intercept) 13.5959916 9.3877130 19.6907371
## hienghigh    0.5203599 0.2878432  0.9407012## Waiting for profiling to be done...##              exp(beta)     2.5 %     97.5 %
## (Intercept) 13.5959916 9.1614552 19.2715805
## hienghigh    0.5203599 0.2829171  0.9328392The point estimate for the IRR calculated by the Poisson regression is the same as the IRR calculated in 6(a). The regression model can be defined by:
\[\begin{align*} E(\text{chd}) &= \frac{y}{1000}\exp\left(\beta_0 + \beta_1I(\text{hieng}="high") \right) \\ &= \exp\left(\beta_0 + \beta_1I(\text{hieng}="high") + \log(y/1000) \right) \end{align*}\]
where \(E(\text{chd})\) is the expected count for CHD, \(\beta_0\) is the intercept parameter for the log rate and \(\beta_1\) is the parameter for the log rate ratio for high energy diets. We have also used \(I(\text{condition})\) as an indicator function, which will take a value of 1 when the condition is true and 0 otherwise.
A theoretical observation: if we consider the data as being cross-classified solely by hieng then the Poisson regression model with one parameter is a saturated model (that is, the number of observations is equal to the number of parameters) so the IRR estimated from the model will be identical to the ‘observed’ IRR. That is, the model is a perfect fit.
(c)
hist6c <- hist(diet$energy, breaks=25, probability=TRUE, xlab="Energy (units)")
curve(dnorm(x, mean=mean(diet$energy), sd=sd(diet$energy)), col = "red", add=TRUE)##       1%       5%      10%      25%      50%      75%      90%      95%      99% 
## 1887.268 2177.276 2314.114 2536.690 2802.980 3109.660 3365.644 3588.178 4046.820The histogram gives us an idea of the distribution of energy intake. We can also tabulate moments and percentiles of the distribution.
(d)
diet$eng3 <- cut(diet$energy, breaks=c(1500,2500,3000,4500),labels=c("low","medium","high"), 
                 right = FALSE)
cbind(Freq=table(diet$eng3),
      Prop=table(diet$eng3)/nrow(diet))##        Freq      Prop
## low      75 0.2225519
## medium  150 0.4451039
## high    112 0.3323442(e)
##               eng3     tstop event      rate    lower     upper
## eng3=low       low 0.9466338    16 16.901995 9.660951 27.447781
## eng3=medium medium 2.0172621    22 10.905871 6.834651 16.511619
## eng3=high     high 1.6397728     8  4.878725 2.106287  9.613033## [1] 0.6452416## 
##  Comparison of Poisson rates
## 
## data:  event time base: tstop
## count1 = 22, expected count1 = 25.863, p-value = 0.2221
## alternative hypothesis: true rate ratio is not equal to 1
## 95 percent confidence interval:
##  0.3237007 1.3143509
## sample estimates:
## rate ratio 
##  0.6452416## [1] 0.2886479## 
##  Comparison of Poisson rates
## 
## data:  event time base: tstop
## count1 = 8, expected count1 = 15.216, p-value = 0.004579
## alternative hypothesis: true rate ratio is not equal to 1
## 95 percent confidence interval:
##  0.1069490 0.7148284
## sample estimates:
## rate ratio 
##  0.2886479We see that the CHD incidence rate decreases as the level of total energy intake increases.
(f)
diet <- mutate(diet, 
               X1 = as.numeric(eng3 == "low"),
               X2 = as.numeric(eng3 == "medium"),
               X3 = as.numeric(eng3 == "high"))
# or
diet <- biostat3::addIndicators(diet, ~eng3+0) %>%
    mutate(X1 = eng3low, X2 = eng3medium, X3 = eng3high)
colSums(diet[c("X1","X2","X3")])##  X1  X2  X3 
##  75 150 112(g)
##    energy eng3 X1 X2 X3
## 1 2023.25  low  1  0  0
## 2 2448.68  low  1  0  0
## 3 2281.38  low  1  0  0
## 4 2467.95  low  1  0  0
## 5 2362.93  low  1  0  0
## 6 2380.11  low  1  0  0##     energy   eng3 X1 X2 X3
## 76 2664.64 medium  0  1  0
## 77 2533.33 medium  0  1  0
## 78 2854.08 medium  0  1  0
## 79 2673.77 medium  0  1  0
## 80 2766.88 medium  0  1  0
## 81 2586.69 medium  0  1  0##      energy eng3 X1 X2 X3
## 226 3067.36 high  0  0  1
## 227 3298.95 high  0  0  1
## 228 3147.60 high  0  0  1
## 229 3180.47 high  0  0  1
## 230 3045.81 high  0  0  1
## 231 3060.03 high  0  0  1(h)
poisson6h <- glm( chd ~ X2 + X3 + offset( log( y1k ) ), family=poisson, data=diet )
summary(poisson6h)## 
## Call:
## glm(formula = chd ~ X2 + X3 + offset(log(y1k)), family = poisson, 
##     data = diet)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8231  -0.6052  -0.4532  -0.3650   2.9434  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.8274     0.2500  11.312  < 2e-16 ***
## X2           -0.4381     0.3285  -1.334  0.18233    
## X3           -1.2425     0.4330  -2.870  0.00411 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 262.82  on 336  degrees of freedom
## Residual deviance: 253.62  on 334  degrees of freedom
## AIC: 351.62
## 
## Number of Fisher Scoring iterations: 6##              exp(beta)      2.5 %     97.5 %
## (Intercept) 16.9019960 10.3556032 27.5867534
## X2           0.6452416  0.3389050  1.2284761
## X3           0.2886478  0.1235407  0.6744143Level 1 of the categorized total energy is the reference category. The estimated rate ratio comparing level 2 to level 1 is 0.6452 and the estimated rate ratio comparing level 3 to level 1 is 0.2886.
The regression equation can be represented by:
\[\begin{align*} E(\text{chd}) &= \frac{y}{1000}\exp\left(\beta_0 + \beta_2 X_2 + \beta_3 X_3 \right) \\ &= \exp\left(\beta_0 + \beta_2 X_2 + \beta_3 X_3 + \log(y/1000) \right) \end{align*}\] where \(\beta_2\) and \(\beta_3\) are the log rate ratios for \(X_2=\text{X2}\) and \(X_3=\text{X3}\), respectively.
(i)
poisson6i <- glm( chd ~ X1 + X3 + offset( log( y1k ) ), family=poisson, data=diet )
# or 
poisson6i <- glm( chd ~ I(eng3=="low") + I(eng3=="high") + offset( log( y1k ) ), family=poisson, data=diet )
summary( poisson6i )## 
## Call:
## glm(formula = chd ~ I(eng3 == "low") + I(eng3 == "high") + offset(log(y1k)), 
##     family = poisson, data = diet)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8231  -0.6052  -0.4532  -0.3650   2.9434  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             2.3893     0.2132  11.207   <2e-16 ***
## I(eng3 == "low")TRUE    0.4381     0.3285   1.334   0.1823    
## I(eng3 == "high")TRUE  -0.8044     0.4129  -1.948   0.0514 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 262.82  on 336  degrees of freedom
## Residual deviance: 253.62  on 334  degrees of freedom
## AIC: 351.62
## 
## Number of Fisher Scoring iterations: 6##                        exp(beta)     2.5 %    97.5 %
## (Intercept)           10.9058706 7.1810131 16.562846
## I(eng3 == "low")TRUE   1.5498071 0.8140167  2.950679
## I(eng3 == "high")TRUE  0.4473485 0.1991681  1.004783Now use level 2 as the reference (by omitting X2 but including X1 and X3). The estimated rate ratio comparing level 1 to level 2 is 1.5498 and the estimated rate ratio comparing level 3 to level 2 is 0.4473.
(j)
poisson6j <- glm( chd ~ eng3 + offset( log( y1k ) ), family=poisson, data=diet )
summary( poisson6j )## 
## Call:
## glm(formula = chd ~ eng3 + offset(log(y1k)), family = poisson, 
##     data = diet)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8231  -0.6052  -0.4532  -0.3650   2.9434  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.8274     0.2500  11.312  < 2e-16 ***
## eng3medium   -0.4381     0.3285  -1.334  0.18233    
## eng3high     -1.2425     0.4330  -2.870  0.00411 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 262.82  on 336  degrees of freedom
## Residual deviance: 253.62  on 334  degrees of freedom
## AIC: 351.62
## 
## Number of Fisher Scoring iterations: 6##              exp(beta)      2.5 %     97.5 %
## (Intercept) 16.9019960 10.3556032 27.5867534
## eng3medium   0.6452416  0.3389050  1.2284761
## eng3high     0.2886478  0.1235407  0.6744143The estimates are identical (as we would hope) when we have R create indicator variables for us.
(k)
Somehow (there are many different alternatives) you’ll need to calculate the total number of events and the total person-time at risk and then calculate the incidence rate as events/person-time. For example,
##          rate
## 1 0.009992031The estimated incidence rate is 0.00999 events per person-year.