Exercise 6. Diet data: tabulating incidence rates and modelling with Poisson regression
You may have to install the required packages the first time you use them. You can install a package by install.packages("package_of_interest") for each package you require.
library(biostat3)
library(dplyr)    # for data manipulationLoad the diet data using time-on-study as the timescale.
head(diet)##    id chd         y hieng  energy       job month   height   weight
## 1 127   0 16.791239   low 2023.25 conductor     2 173.9900 61.46280
## 2 200   0 19.958933   low 2448.68      bank    12 177.8000 73.48320
## 3 198   0 19.958933   low 2281.38      bank    12       NA       NA
## 4 222   0 15.394935   low 2467.95      bank     2 158.7500 58.24224
## 5 305   1  1.494866   low 2362.93      bank     1       NA       NA
## 6 173   0 15.958932   low 2380.11 conductor    12 164.4904 79.01712
##          doe        dox        dob         y1k      bmi jobNumber
## 1 1960-02-16 1976-12-01 1910-09-27 0.016791239 20.30316         1
## 2 1956-12-16 1976-12-01 1909-06-18 0.019958933 23.24473         2
## 3 1956-12-16 1976-12-01 1910-06-30 0.019958933       NA         2
## 4 1957-02-16 1972-07-10 1902-07-11 0.015394935 23.11057         2
## 5 1960-01-16 1961-07-15 1913-06-30 0.001494866       NA         2
## 6 1960-12-16 1976-12-01 1915-06-28 0.015958932 29.20385         1summary(diet)##        id           chd               y            hieng    
##  Min.   :  1   Min.   :0.0000   Min.   : 0.2875   low :155  
##  1st Qu.: 85   1st Qu.:0.0000   1st Qu.:10.7762   high:182  
##  Median :169   Median :0.0000   Median :15.4606             
##  Mean   :169   Mean   :0.1365   Mean   :13.6607             
##  3rd Qu.:253   3rd Qu.:0.0000   3rd Qu.:17.0431             
##  Max.   :337   Max.   :1.0000   Max.   :20.0411             
##                                                             
##      energy            job          month            height     
##  Min.   :1748   driver   :102   Min.   : 1.000   Min.   :152.4  
##  1st Qu.:2537   conductor: 84   1st Qu.: 3.000   1st Qu.:168.9  
##  Median :2803   bank     :151   Median : 6.000   Median :173.0  
##  Mean   :2829                   Mean   : 6.231   Mean   :173.4  
##  3rd Qu.:3110                   3rd Qu.:10.000   3rd Qu.:177.8  
##  Max.   :4396                   Max.   :12.000   Max.   :190.5  
##                                                  NA's   :5      
##      weight            doe                  dox            
##  Min.   : 46.72   Min.   :1956-11-16   Min.   :1958-08-29  
##  1st Qu.: 64.64   1st Qu.:1959-01-16   1st Qu.:1972-09-29  
##  Median : 72.80   Median :1960-02-16   Median :1976-12-01  
##  Mean   : 72.54   Mean   :1960-06-22   Mean   :1974-02-19  
##  3rd Qu.: 79.83   3rd Qu.:1961-06-16   3rd Qu.:1976-12-01  
##  Max.   :106.14   Max.   :1966-09-16   Max.   :1976-12-01  
##  NA's   :4                                                 
##       dob                  y1k                 bmi          jobNumber    
##  Min.   :1892-01-10   Min.   :0.0002875   Min.   :15.88   Min.   :0.000  
##  1st Qu.:1906-01-18   1st Qu.:0.0107762   1st Qu.:21.59   1st Qu.:0.000  
##  Median :1911-02-25   Median :0.0154606   Median :24.11   Median :1.000  
##  Mean   :1911-01-04   Mean   :0.0136607   Mean   :24.12   Mean   :1.145  
##  3rd Qu.:1915-01-30   3rd Qu.:0.0170431   3rd Qu.:26.50   3rd Qu.:2.000  
##  Max.   :1930-09-19   Max.   :0.0200411   Max.   :33.29   Max.   :2.000  
##                                           NA's   :5(a)
diet <- biostat3::diet
diet$y1k <- diet$y/1000
diet.ir6a <- survRate(Surv(y/1000,chd) ~ hieng, data=diet)
## or
diet %>%
    group_by(hieng) %>%
    summarise(Event = sum(chd), Time = sum(y1k), Rate = Event/Time,      # group sums
              CI_low = poisson.test(Event,Time)$conf.int[1],
              CI_high = poisson.test(Event,Time)$conf.int[2]) ## # A tibble: 2 x 6
##   hieng Event  Time  Rate CI_low CI_high
##   <fct> <int> <dbl> <dbl>  <dbl>   <dbl>
## 1 low      28  2.06 13.6    9.03    19.6
## 2 high     18  2.54  7.07   4.19    11.2## IRR
with(diet.ir6a, poisson.test(event,tstop)) ## 
##  Comparison of Poisson rates
## 
## data:  event time base: tstop
## count1 = 28, expected count1 = 20.578, p-value = 0.03681
## alternative hypothesis: true rate ratio is not equal to 1
## 95 percent confidence interval:
##  1.026173 3.688904
## sample estimates:
## rate ratio 
##   1.921747with(diet.ir6a, poisson.test(rev(event),rev(tstop)))## 
##  Comparison of Poisson rates
## 
## data:  rev(event) time base: rev(tstop)
## count1 = 18, expected count1 = 25.422, p-value = 0.03681
## alternative hypothesis: true rate ratio is not equal to 1
## 95 percent confidence interval:
##  0.2710832 0.9744943
## sample estimates:
## rate ratio 
##  0.5203599We see that individuals with a high energy intake have a lower CHD incidence rate. The estimated crude incidence rate ratio is 0.52 (95% CI: 0.27, 0.97).
(b)
poisson6b <- glm( chd ~ hieng + offset( log( y1k ) ), family=poisson, data=diet)
summary(poisson6b)## 
## Call:
## glm(formula = chd ~ hieng + offset(log(y1k)), family = poisson, 
##     data = diet)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7382  -0.6337  -0.4899  -0.3891   3.0161  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.6098     0.1890  13.811   <2e-16 ***
## hienghigh    -0.6532     0.3021  -2.162   0.0306 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 262.82  on 336  degrees of freedom
## Residual deviance: 258.00  on 335  degrees of freedom
## AIC: 354
## 
## Number of Fisher Scoring iterations: 6eform(poisson6b)##              exp(beta)     2.5 %     97.5 %
## (Intercept) 13.5959916 9.3877130 19.6907371
## hienghigh    0.5203599 0.2878432  0.9407012eform(poisson6b, method="Profile")## Waiting for profiling to be done...##              exp(beta)     2.5 %     97.5 %
## (Intercept) 13.5959916 9.1614552 19.2715805
## hienghigh    0.5203599 0.2829171  0.9328392The point estimate for the IRR calculated by the Poisson regression is the same as the IRR calculated in 6(a). A theoretical observation: if we consider the data as being cross-classified solely by hieng then the Poisson regression model with one parameter is a saturated model so the IRR estimated from the model will be identical to the ‘observed’ IRR. That is, the model is a perfect fit.
(c)
hist6c <- hist(diet$energy, breaks=25, probability=TRUE, xlab="Energy (units)")
curve(dnorm(x, mean=mean(diet$energy), sd=sd(diet$energy)), col = "red", add=TRUE)quantile(diet$energy, probs=c(0.01,0.05,0.1,0.25,0.5,0.75,0.90,0.95,0.99))##       1%       5%      10%      25%      50%      75%      90%      95% 
## 1887.268 2177.276 2314.114 2536.690 2802.980 3109.660 3365.644 3588.178 
##      99% 
## 4046.820# For kurtosis and skewness, see package e1071The histogram gives us an idea of the distribution of energy intake. We can also tabulate moments and percentiles of the distribution.
(d)
diet$eng3 <- cut(diet$energy, breaks=c(1500,2500,3000,4500),labels=c("low","medium","high"), 
                 right = FALSE)
cbind(Freq=table(diet$eng3),
      Prop=table(diet$eng3)/nrow(diet))##        Freq      Prop
## low      75 0.2225519
## medium  150 0.4451039
## high    112 0.3323442(e)
## rates and IRRs
diet.ir6e <- survRate(Surv(y/1000,chd) ~ eng3, data=diet)
print(diet.ir6e)##               eng3     tstop event      rate    lower     upper
## eng3=low       low 0.9466338    16 16.901995 9.660951 27.447781
## eng3=medium medium 2.0172621    22 10.905871 6.834651 16.511619
## eng3=high     high 1.6397728     8  4.878725 2.106287  9.613033# calculate IRR and confidence intervals
with(diet.ir6e, rate[eng3=="medium"] / rate[eng3=="low"])## [1] 0.6452416with(diet.ir6e[c(2,1),], { # compare second row with first row
  poisson.test(event, tstop)
})## 
##  Comparison of Poisson rates
## 
## data:  event time base: tstop
## count1 = 22, expected count1 = 25.863, p-value = 0.2221
## alternative hypothesis: true rate ratio is not equal to 1
## 95 percent confidence interval:
##  0.3237007 1.3143509
## sample estimates:
## rate ratio 
##  0.6452416with(diet.ir6e, rate[eng3=="high"] / rate[eng3=="low"])## [1] 0.2886479with(diet.ir6e[c(3,1),], { # compare third row with first row
  poisson.test(event, tstop)
})## 
##  Comparison of Poisson rates
## 
## data:  event time base: tstop
## count1 = 8, expected count1 = 15.216, p-value = 0.004579
## alternative hypothesis: true rate ratio is not equal to 1
## 95 percent confidence interval:
##  0.1069490 0.7148284
## sample estimates:
## rate ratio 
##  0.2886479We see that the CHD incidence rate decreases as the level of total energy intake increases.
(f)
diet <- mutate(diet, 
               X1 = as.numeric(eng3 == "low"),
               X2 = as.numeric(eng3 == "medium"),
               X3 = as.numeric(eng3 == "high"))
# or
diet <- biostat3::addIndicators(diet, ~eng3+0) %>%
    mutate(X1 = eng3low, X2 = eng3medium, X3 = eng3high)
colSums(diet[c("X1","X2","X3")])##  X1  X2  X3 
##  75 150 112(g)
filter(diet, eng3=="low")    %>% select(c(energy,eng3,X1,X2,X3)) %>% head##    energy eng3 X1 X2 X3
## 1 2023.25  low  1  0  0
## 2 2448.68  low  1  0  0
## 3 2281.38  low  1  0  0
## 4 2467.95  low  1  0  0
## 5 2362.93  low  1  0  0
## 6 2380.11  low  1  0  0filter(diet, eng3=="medium") %>% select(c(energy,eng3,X1,X2,X3)) %>% head##    energy   eng3 X1 X2 X3
## 1 2664.64 medium  0  1  0
## 2 2533.33 medium  0  1  0
## 3 2854.08 medium  0  1  0
## 4 2673.77 medium  0  1  0
## 5 2766.88 medium  0  1  0
## 6 2586.69 medium  0  1  0filter(diet, eng3=="high")   %>% select(c(energy,eng3,X1,X2,X3)) %>% head##    energy eng3 X1 X2 X3
## 1 3067.36 high  0  0  1
## 2 3298.95 high  0  0  1
## 3 3147.60 high  0  0  1
## 4 3180.47 high  0  0  1
## 5 3045.81 high  0  0  1
## 6 3060.03 high  0  0  1(h)
poisson6h <- glm( chd ~ X2 + X3 + offset( log( y1k ) ), family=poisson, data=diet )
summary(poisson6h)## 
## Call:
## glm(formula = chd ~ X2 + X3 + offset(log(y1k)), family = poisson, 
##     data = diet)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8231  -0.6052  -0.4532  -0.3650   2.9434  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.8274     0.2500  11.312  < 2e-16 ***
## X2           -0.4381     0.3285  -1.334  0.18233    
## X3           -1.2425     0.4330  -2.870  0.00411 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 262.82  on 336  degrees of freedom
## Residual deviance: 253.62  on 334  degrees of freedom
## AIC: 351.62
## 
## Number of Fisher Scoring iterations: 6eform(poisson6h)##              exp(beta)      2.5 %     97.5 %
## (Intercept) 16.9019960 10.3556032 27.5867534
## X2           0.6452416  0.3389050  1.2284761
## X3           0.2886478  0.1235407  0.6744143Level 1 of the categorized total energy is the reference category. The estimated rate ratio comparing level 2 to level 1 is 0.6452 and the estimated rate ratio comparing level 3 to level 1 is 0.2886.
(i)
poisson6i <- glm( chd ~ X1 + X3 + offset( log( y1k ) ), family=poisson, data=diet )
# or 
poisson6i <- glm( chd ~ I(eng3=="low") + I(eng3=="high") + offset( log( y1k ) ), family=poisson, data=diet )
summary( poisson6i )## 
## Call:
## glm(formula = chd ~ I(eng3 == "low") + I(eng3 == "high") + offset(log(y1k)), 
##     family = poisson, data = diet)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8231  -0.6052  -0.4532  -0.3650   2.9434  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             2.3893     0.2132  11.207   <2e-16 ***
## I(eng3 == "low")TRUE    0.4381     0.3285   1.334   0.1823    
## I(eng3 == "high")TRUE  -0.8044     0.4129  -1.948   0.0514 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 262.82  on 336  degrees of freedom
## Residual deviance: 253.62  on 334  degrees of freedom
## AIC: 351.62
## 
## Number of Fisher Scoring iterations: 6eform( poisson6i )##                        exp(beta)     2.5 %    97.5 %
## (Intercept)           10.9058706 7.1810131 16.562846
## I(eng3 == "low")TRUE   1.5498071 0.8140167  2.950679
## I(eng3 == "high")TRUE  0.4473485 0.1991681  1.004783Now use level 2 as the reference (by omitting X2 but including X1 and X3). The estimated rate ratio comparing level 1 to level 2 is 1.5498 and the estimated rate ratio comparing level 3 to level 2 is 0.4473.
(j)
poisson6j <- glm( chd ~ eng3 + offset( log( y1k ) ), family=poisson, data=diet )
summary( poisson6j )## 
## Call:
## glm(formula = chd ~ eng3 + offset(log(y1k)), family = poisson, 
##     data = diet)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8231  -0.6052  -0.4532  -0.3650   2.9434  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.8274     0.2500  11.312  < 2e-16 ***
## eng3medium   -0.4381     0.3285  -1.334  0.18233    
## eng3high     -1.2425     0.4330  -2.870  0.00411 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 262.82  on 336  degrees of freedom
## Residual deviance: 253.62  on 334  degrees of freedom
## AIC: 351.62
## 
## Number of Fisher Scoring iterations: 6eform( poisson6j )##              exp(beta)      2.5 %     97.5 %
## (Intercept) 16.9019960 10.3556032 27.5867534
## eng3medium   0.6452416  0.3389050  1.2284761
## eng3high     0.2886478  0.1235407  0.6744143The estimates are identical (as we would hope) when we have R create indicator variables for us.
(k)
Somehow (there are many different alternatives) you’ll need to calculate the total number of events and the total person-time at risk and then calculate the incidence rate as events/person-time. For example,
summarise(diet, rate = sum(chd) / sum(y))##          rate
## 1 0.009992031The estimated incidence rate is 0.00999 events per person-year.