--- title: "GRM Forests for Robust DIF Detection" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{GRM Forests for Robust DIF Detection} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5 ) ``` # Introduction GRM Forests extend GRM Trees by creating ensembles of trees to provide more robust variable importance measures. This vignette covers: * GRM Forest implementation * Variable importance and variable importance plot. See the vignette on getting started with grmtree package for a more detailed walkthrough of the tree-based graded response theory model (GRMTree). # Install and Load required packages To implement the tree-based GRM (GRMTree), you will install the following packages if not previously installed. ```{r, eval=F} ## Install packages from CRAN repository install.packages(c("dplyr", "grmtree")) ``` Once installed, load the packages as follows: ```{r, message=FALSE, warning=FALSE} library(dplyr) # For data manipulation library(grmtree) # For tree-based GRM DIF Test ``` # Import and prepare the data The data set used in this demonstration is a test/sample data for the package. ```{r, message=FALSE} ## Load the data data("grmtree_data", package = "grmtree") ## Take a glimpse at the data glimpse(grmtree_data) ## Prepare the data resp.data <- grmtree_data %>% mutate_at(vars(starts_with("MOS")), as.ordered) %>% mutate_at(vars(c(sex, residency, depressed, Education, job, smoker, multimorbidity)), as.factor) ## Explore the data head(resp.data) ## Check the structure of the data glimpse(resp.data) ## Create response as outcomes resp.data$resp <- data.matrix(resp.data[, 1:8]) ``` # GRM Forests Implementation ## Define the forest control parameters ```{r} ## Get help on the control parameter # ?grmforest.control ## GRMTree control parameters with Benjamini-Hochberg grm_control <- grmtree.control( minbucket = 350, p_adjust = "BH", alpha = 0.05) ## Define the forest control parameters forest_control <- grmforest.control( n_tree = 3, # Number of trees (Reduced for vignette build time) sampling = "bootstrap", # Bootstrap method; resampling also available sample_fraction = 0.632, mtry = sqrt(9), # Usually the square root of the number of covariates control = grm_control, remove_dead_trees = TRUE, # Remove any null GRMTree seed = 123 ) ``` ## Grow the GRM Forest ```{r, eval=FALSE} ## Fit the GRM forest mos_forest <- grmforest( resp ~ sex + age + bmi + Education + residency + depressed + job + multimorbidity + smoker, data = resp.data, control = forest_control ) ## Get the summary of the fitted forest summary(mos_forest) print(mos_forest) ## Plot a tree in the forest plot(mos_forest$trees[[1]]) ``` # Variable Importance ## Compute the variable importance of each covariate ```{r, eval=FALSE} ## Calculate the variable importance importance <- varimp(mos_forest, seed = 123, verbose = T) ## Print the result of the variable importance print(importance) ``` Example output: ``` age smoker bmi multimorbidity sex 403.07554 220.37908 39.02621 37.00120 32.06389 Education residency depressed job 0.00000 0.00000 0.00000 0.00000 ``` ## Plot the variable importance of each variable Here `plot.varimp` creates a bar plot of variable importance scores with options for both ggplot2 and base R graphics. 
## Plot the variable importance of each variable

Here `plot.varimp` creates a bar plot of the variable importance scores, with options for both ggplot2 and base R graphics.

```{r, eval=FALSE}
## Plot the variable importance scores (ggplot2 is the default)
plot(importance)

## Plot only the top 5 most important variables
plot(importance, xlab = "", top_n = 5)

## Base R version of the plot
plot(importance, use_ggplot = FALSE)

## Custom colors
plot(importance, col = c("green", "red"))

## Rename the variables, following the order in the variable importance result
names(importance) <- c("Age", "Smoking Status", "BMI", "Multimorbidity",
                       "Sex", "Education", "Residency", "Depression",
                       "Employment")

## Now create the plot with the informative names
plot(importance)
```

# Conclusion

GRM Forests provide more stable DIF detection by aggregating across many trees. Key advantages include robust variable importance measures, reduced overfitting, and better handling of complex interactions.
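One quick empirical check of that stability claim is to regrow the forest with a different seed and compare the resulting importance rankings. The sketch below reuses only the interfaces demonstrated in this vignette; in practice `n_tree` should be much larger than 3 before the rankings can be expected to agree:

```{r, eval=FALSE}
## Sketch: regrow the forest with a different seed and compare rankings.
## A realistic stability check would use a much larger n_tree.
forest_control2 <- grmforest.control(
  n_tree = 3, sampling = "bootstrap", sample_fraction = 0.632,
  mtry = sqrt(9), control = grm_control,
  remove_dead_trees = TRUE, seed = 456
)

mos_forest2 <- grmforest(
  resp ~ sex + age + bmi + Education + residency + depressed +
    job + multimorbidity + smoker,
  data = resp.data, control = forest_control2
)

importance2 <- varimp(mos_forest2, seed = 456)

## Similar rankings across seeds indicate a stable importance measure
rank(-unclass(importance))
rank(-unclass(importance2))
```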