--- title: "Getting Started with Risk Differences" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Getting Started with Risk Differences} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 4, fig.align = "center" ) ``` ```{r load libraries} library(riskdiff) library(dplyr) library(ggplot2) ``` ## Why Risk Differences Matter in Public Health When communicating health risks to policymakers, patients, or the public, **absolute measures** like risk differences are often more meaningful than **relative measures** like odds ratios or risk ratios. Consider these two statements about betel nut (areca nut) chewing and cancer risk: 1. "Betel nut chewing increases the odds of cancer by 5.2 times" 2. "Betel nut chewing increases cancer risk by 12 percentage points" The second statement is immediately actionable: in a screening group of 1,000 people, you would expect to find approximately 120 additional cancer cases among betel nut chewers compared to non-chewers. This directly informs: - Resource allocation for screening programs - Expected yield from targeted interventions - Number needed to screen calculations - Public health messaging priorities > **Key Concept**: Risk differences represent the **additional burden of disease** attributable to an exposure. They are on the same scale as the outcome, making them intuitive for non-statistical audiences. ## Understanding the Example Data The `riskdiff` package includes example data from a cancer screening program in Northeast India: ```{r load example data} data(cachar_sample) # Basic structure glimpse(cachar_sample) # Key variables for our analysis cachar_sample %>% select(abnormal_screen, areca_nut, smoking, alcohol, age, sex, residence) %>% summary() ``` This dataset represents a cross-sectional screening study where: - `abnormal_screen` = 1 indicates an abnormal cancer screening result - `areca_nut` indicates areca (or betel) nut chewing status - Other variables capture demographics and risk factors ## Your First Risk Difference Analysis Let's calculate the risk difference for cancer associated with betel nut chewing: ```{r simple unadjusted analysis} # Simple unadjusted analysis rd_simple <- calc_risk_diff( data = cachar_sample, outcome = "abnormal_screen", exposure = "areca_nut" ) print(rd_simple) ``` ### Understanding the Output The output shows: - **rd**: The risk difference (e.g., 0.082 = 8.2 percentage points) - **ci_lower, ci_upper**: 95% confidence interval bounds - **p_value**: Test of whether the risk difference equals zero - **model_type**: Which GLM link function successfully converged - **n_obs**: Number of observations used in the analysis ### Interpreting Risk Differences ```{r visualisation} # Let's visualize what this risk difference means exposure_summary <- cachar_sample %>% group_by(areca_nut) %>% summarise( n = n(), cases = sum(abnormal_screen), risk = mean(abnormal_screen), se = sqrt(risk * (1 - risk) / n) ) %>% mutate( risk_percent = risk * 100, se_percent = se * 100 ) print(exposure_summary) # The risk difference is simply: rd_value <- diff(exposure_summary$risk) cat("Risk difference:", round(rd_value * 100, 1), "percentage points\n") ``` ### Adjusted Analyses Real-world associations are often confounded. Let's adjust for age and sex: ```{r adjusted analysis} # Age and sex adjusted analysis rd_adjusted <- calc_risk_diff( data = cachar_sample, outcome = "abnormal_screen", exposure = "areca_nut", adjust_vars = c("age", "sex") ) print(rd_adjusted) # Show comparison in a readable format cat("\n=== COMPARISON: UNADJUSTED vs ADJUSTED ===\n") cat("Unadjusted Risk Difference:", sprintf("%.2f%%", rd_simple$rd * 100), sprintf("(%.2f%%, %.2f%%)", rd_simple$ci_lower * 100, rd_simple$ci_upper * 100), "\n") cat("Adjusted Risk Difference: ", sprintf("%.2f%%", rd_adjusted$rd * 100), sprintf("(%.2f%%, %.2f%%)", rd_adjusted$ci_lower * 100, rd_adjusted$ci_upper * 100), "\n") cat("Difference in estimates: ", sprintf("%.2f%%", (rd_adjusted$rd - rd_simple$rd) * 100), "percentage points\n") ``` > **Note**: Adjustment often changes the risk difference estimate. This indicates that age and/or sex were confounders of the chewing-cancer association. ### Stratified Analysis Sometimes we want to know if effects differ across subgroups: ```{r stratified analysis} # Stratified by residence (urban vs rural) rd_stratified <- calc_risk_diff( data = cachar_sample, outcome = "abnormal_screen", exposure = "areca_nut", strata = "residence" ) print(rd_stratified) ``` This shows separate risk differences for urban and rural areas, which might reflect: - Different exposure patterns - Access to healthcare - Environmental co-factors ### Visualizing Your Results A picture is worth a thousand p-values: ```{r visualise stratified results} # Create a forest plot of risk differences plot_data <- rd_stratified %>% mutate( label = paste0(residence, "\n(n=", n_obs, ")"), rd_percent = rd * 100, ci_lower_percent = ci_lower * 100, ci_upper_percent = ci_upper * 100 ) ggplot(plot_data, aes(x = rd_percent, y = label)) + geom_point(size = 3) + geom_errorbarh(aes(xmin = ci_lower_percent, xmax = ci_upper_percent), height = 0.2) + geom_vline(xintercept = 0, linetype = "dashed", alpha = 0.5) + labs( title = "Risk Difference for Cancer by Betel Nut Chewing", subtitle = "Stratified by Urban/Rural Residence", x = "Risk Difference (percentage points)", y = "" ) + theme_minimal() + theme( plot.title = element_text(face = "bold"), axis.text.y = element_text(size = 11) ) ``` ### Common Pitfalls and Solutions #### 1. Model Convergence Issues The identity link GLM (which directly estimates risk differences) often fails to converge. The `riskdiff` package handles this automatically: ```{r convergence handling} # Force different link functions rd_identity <- calc_risk_diff( cachar_sample, "abnormal_screen", "areca_nut", link = "identity" ) rd_log <- calc_risk_diff( cachar_sample, "abnormal_screen", "areca_nut", link = "log" ) # Compare model types used cat("Identity link model type:", rd_identity$model_type, "\n") cat("Log link model type:", rd_log$model_type, "\n") ``` #### 2. Very Rare Outcomes With rare outcomes, risk differences become very small: ```{r rare outcomes} # Create a rare outcome (1% prevalence) cachar_sample$rare_outcome <- rbinom(nrow(cachar_sample), 1, 0.01) rd_rare <- calc_risk_diff( cachar_sample, "rare_outcome", "areca_nut" ) print(rd_rare) ``` > **Tip**: For very rare outcomes (\<1%), consider whether risk ratios might be more appropriate for your research question. #### 3. Missing Data The package automatically handles missing data by complete case analysis: ```{r missing data} # Create a copy with some missing data for demonstration set.seed(123) # For reproducibility cachar_with_missing <- cachar_sample %>% mutate( # Introduce more modest missing data (~3% in age, ~2% in alcohol) age_with_missing = ifelse(runif(n()) < 0.03, NA, age), alcohol_with_missing = ifelse(runif(n()) < 0.02, NA, alcohol) ) # Check the missing data patterns missing_summary <- cachar_with_missing %>% summarise( total_observations = n(), age_missing = sum(is.na(age_with_missing)), alcohol_missing = sum(is.na(alcohol_with_missing)), total_missing_any = sum(!complete.cases(select(., age_with_missing, alcohol_with_missing, abnormal_screen, areca_nut))), complete_cases = sum(complete.cases(select(., age_with_missing, alcohol_with_missing, abnormal_screen, areca_nut))) ) print(missing_summary) # Analysis with variables that have missing data rd_missing <- calc_risk_diff( cachar_with_missing, "abnormal_screen", "areca_nut", adjust_vars = c("age_with_missing", "alcohol_with_missing") ) # Compare with complete case analysis rd_complete <- calc_risk_diff( cachar_sample, "abnormal_screen", "areca_nut", adjust_vars = c("age", "alcohol") ) cat("\n=== IMPACT OF MISSING DATA ===\n") cat("Complete data analysis (n=", rd_complete$n_obs, "): ", sprintf("%.2f%%", rd_complete$rd * 100), sprintf(" (%.2f%%, %.2f%%)", rd_complete$ci_lower * 100, rd_complete$ci_upper * 100), "\n") # Check if missing data analysis succeeded if (!is.na(rd_missing$rd)) { cat("Missing data analysis (n=", rd_missing$n_obs, "): ", sprintf("%.2f%%", rd_missing$rd * 100), sprintf(" (%.2f%%, %.2f%%)", rd_missing$ci_lower * 100, rd_missing$ci_upper * 100), "\n") cat("Cases lost to missing data: ", rd_complete$n_obs - rd_missing$n_obs, "\n") } else { cat("Missing data analysis: FAILED (insufficient data or convergence issues)\n") cat("Attempted to use n =", rd_missing$n_obs, "complete cases\n") cat("Cases lost to missing data: ", nrow(cachar_with_missing) - rd_missing$n_obs, "\n\n") cat("📚 LESSON: This demonstrates why missing data can be problematic:\n") cat(" • Listwise deletion can dramatically reduce sample size\n") cat(" • Small samples may cause model convergence failures\n") cat(" • Consider multiple imputation for better missing data handling\n") cat(" • The riskdiff package gracefully handles these failures\n") } ``` ### Quick Reference #### Basic Syntax ```{r basic syntax example, eval=FALSE} # Example usage: result <- calc_risk_diff( data = cachar_sample, # Your dataset outcome = "abnormal_screen", # Binary outcome variable (0/1) exposure = "areca_nut", # Exposure of interest adjust_vars = c("age", "sex"), # Variables to adjust for strata = "residence", # Stratification variables link = "auto", # Link function: "auto", "identity", "log", "logit" alpha = 0.05, # Significance level (0.05 = 95% CI) verbose = FALSE # Print diagnostic messages if TRUE ) ``` #### Interpretation Guide | Risk Difference | Interpretation | Public Health Meaning | |------------------------|------------------------|------------------------| | 0.05 (5%) | 5 percentage point increase | 50 extra cases per 1,000 screened | | 0.01 (1%) | 1 percentage point increase | 10 extra cases per 1,000 screened | | -0.03 (-3%) | 3 percentage point decrease | 30 fewer cases per 1,000 screened | ### When to Use Risk Differences ✅ **Use risk differences when:** - Communicating to non-statistical audiences - Making policy decisions about interventions - The outcome is relatively common (\>5%) - You need to calculate number needed to treat/screen - Comparing absolute impact across different populations ❌ **Consider alternatives when:** - Outcome is very rare (\<1%) - You need to compare across populations with very different baseline risks - Your research question is about biological/etiological mechanisms ## Next Steps Now that you understand the basics: 1. See the **"Complete Example"** vignette for a full analysis workflow 2. Check the **"Technical Details"** vignette for statistical methodology 3. Use `?calc_risk_diff` for detailed function documentation ## Getting Help - **Issues or bugs**: [https://github.com/jackmurphy2351/riskdiff/issues](https://github.com/jackmurphy2351/riskdiff) - **Function help**: `?calc_risk_diff`, `?format_risk_diff` - **All vignettes**: `browseVignettes("riskdiff")` ------------------------------------------------------------------------ *This vignette is part of the riskdiff package (v0.2.0), developed to make risk difference calculations accessible to public health researchers.*