An R package for random-forest-empowered imputation of missing data
This is the repository for the R package RfEmpImp, which performs multiple imputation using random forests (RF).
The package implements the RfPred and RfNode algorithms and currently operates within the mice multiple imputation framework.
The R package contains both newly proposed and improved existing algorithms for random-forest-based multiple imputation of missing data.
For details of the newly proposed algorithms, please refer to: arXiv:2004.14823 (further updates pending).
With version 2.0.0, the names of parameters were further simplified; please refer to the documentation for details.
For data with mixed variable types, the RfEmp method is a shortcut that uses RfPred.Emp for continuous variables and RfPred.Cate for categorical variables (of type logical or factor).
Example:
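A minimal sketch of what this shortcut amounts to, written out by hand with mice. It assumes that RfEmpImp and mice are installed, uses the nhanes example data shipped with mice, and relies on the method strings "rfpred.emp" and "rfpred.cate" documented in this README. The choice of which variables to treat as categorical is for illustration only.

```r
library(mice)      # provides mice(), make.method(), complete() and nhanes
library(RfEmpImp)  # provides the rfpred.* imputation methods for mice

# Prepare data: treat hyp as categorical, keep bmi and chl continuous
df <- nhanes
df$hyp <- as.factor(df$hyp)

# Start from the default method vector, then assign methods per variable;
# age has no missing values, so its method stays empty
meth <- make.method(df)
meth[c("bmi", "chl")] <- "rfpred.emp"
meth["hyp"] <- "rfpred.cate"

# Run multiple imputation with 5 imputed datasets
imp <- mice(df, method = meth, m = 5, printFlag = FALSE)

# Extract the first completed dataset
completed <- complete(imp, 1)
```

The resulting object is a regular mice imputation object, so it can be analysed with with() and pool() in the usual way.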
For continuous variables, the RfPred.Emp method uses the empirical distribution of the random forest’s out-of-bag prediction errors to construct the conditional distribution of the variable under imputation, providing conditional distributions of better quality. Users can set method = "rfpred.emp" in the call to mice to use it.
Alternatively, the RfPred.Norm method assumes normality for the RF prediction errors, as proposed by Shah et al.; users can set method = "rfpred.norm" in the call to mice to use it.
For categorical variables, the RfPred.Cate method uses probability machine theory: the imputed categories are drawn based on the predicted probabilities for each missing observation. Users can set method = "rfpred.cate" in the call to mice to use it.
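The difference between the two continuous-variable methods can be illustrated with a small sketch that calls ranger directly. This is not the package's implementation, only one reading of the idea under simplifying assumptions: the toy data, the variable names, and the use of the out-of-bag mean squared error as the variance in the normal variant are all assumptions made for illustration.

```r
library(ranger)

set.seed(1)
# Toy data: y depends on x1 and x2; 30 of the 200 values of y are missing
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 * dat$x1 - dat$x2 + rnorm(n)
miss <- sample(n, 30)
dat$y[miss] <- NA

obs <- dat[-miss, ]               # cases with y observed
mis <- dat[miss, c("x1", "x2")]   # predictors of cases with y missing

# Fit a regression forest on the observed cases
rf <- ranger(y ~ x1 + x2, data = obs, num.trees = 100)

# For a fitted regression forest, rf$predictions holds out-of-bag predictions
oob_err <- obs$y - rf$predictions

# Point predictions for the cases with missing y
pred <- predict(rf, data = mis)$predictions

# Empirical variant: add errors resampled from the OOB error distribution
imp_emp <- pred + sample(oob_err, length(pred), replace = TRUE)

# Normal variant: add N(0, sigma^2) noise with sigma^2 = OOB mean squared error
imp_norm <- pred + rnorm(length(pred), mean = 0, sd = sqrt(mean(oob_err^2)))
```

The empirical variant keeps any skewness or heavy tails present in the out-of-bag errors, whereas the normal variant replaces them with a symmetric bell curve.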
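As a rough illustration of drawing categories from predicted probabilities, the sketch below fits a probability forest with ranger and samples one category per incomplete case. It is a simplified stand-in rather than the package's code; the toy data and all object names are hypothetical.

```r
library(ranger)

set.seed(1)
# Toy data: a three-level factor g related to x1 and x2
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$g <- factor(ifelse(dat$x1 + rnorm(n) > 0, "a",
                       ifelse(dat$x2 > 0, "b", "c")))
miss <- sample(n, 30)

obs <- dat[-miss, ]               # cases with g observed
mis <- dat[miss, c("x1", "x2")]   # predictors of cases treated as missing

# Fit a probability forest on the observed cases
rf <- ranger(g ~ x1 + x2, data = obs, num.trees = 100, probability = TRUE)

# Matrix of predicted probabilities: one row per case, one column per category
prob <- predict(rf, data = mis)$predictions

# Draw one category per case according to its predicted probabilities
draw <- apply(prob, 1, function(p) sample(colnames(prob), size = 1, prob = p))
imp_cat <- factor(draw, levels = levels(obs$g))
```

Sampling from the probabilities, instead of always taking the most likely category, keeps the imputations variable across imputed datasets.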
For both continuous and categorical variables, the observations under the predicting nodes of the random forest are used as candidates for imputation.
Two methods are currently available for the RfNode algorithm.
Example:
# Load packages (mice provides nhanes, with() and pool())
library(mice)
library(RfEmpImp)
# Prepare data
df <- nhanes
df[, c("age", "hyp")] <- lapply(X = nhanes[, c("age", "hyp")], FUN = as.factor)
# Do imputation
imp <- imp.rfnode.cond(df)
# Or: imp <- imp.rfnode.prox(df)
# Do analyses
regObj <- with(imp, lm(chl ~ bmi + hyp))
# Pool analyzed results
poolObj <- pool(regObj)
# Extract estimates
res <- reg.ests(poolObj)
RfNode.Cond uses the conditional distribution formed by the prediction nodes, i.e. the weight changes of observations caused by the bootstrapping of the random forest are taken into account, and only “in-bag” observations are used. Users can set method = "rfnode.cond" in the call to mice to use it.
RfNode.Prox uses the concept of random forest proximity matrices: observations that fall under the same predicting nodes are used as candidates for imputation. Users can set method = "rfnode.prox" in the call to mice to use it.
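The node-based idea can be sketched with ranger's terminal-node predictions. The code below is closest in spirit to the proximity variant: for each incomplete case it picks a random tree and draws a donor among the observed cases that land in the same terminal node. It is an illustrative sketch under assumptions (toy data, hypothetical object names, one random tree per case), not the package's implementation, and it does not reproduce the in-bag weighting described for RfNode.Cond.

```r
library(ranger)

set.seed(1)
# Toy data: y depends on x1 and x2; 30 of the 200 cases are treated as missing
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 * dat$x1 - dat$x2 + rnorm(n)
miss <- sample(n, 30)

obs <- dat[-miss, ]               # cases with y observed
mis <- dat[miss, c("x1", "x2")]   # predictors of cases with y missing

# Fit a regression forest on the observed cases
rf <- ranger(y ~ x1 + x2, data = obs, num.trees = 50, min.node.size = 5)

# Terminal node IDs: one row per case, one column per tree
nodes_obs <- predict(rf, data = obs, type = "terminalNodes")$predictions
nodes_mis <- predict(rf, data = mis, type = "terminalNodes")$predictions

# For each incomplete case: pick a tree, collect observed cases in the same
# terminal node, and use a randomly drawn donor's observed value
imp_node <- vapply(seq_len(nrow(nodes_mis)), function(i) {
  tree <- sample(ncol(nodes_mis), 1)
  donors <- which(nodes_obs[, tree] == nodes_mis[i, tree])
  obs$y[donors[sample.int(length(donors), 1)]]
}, numeric(1))
```

Because every imputed value is copied from an observed case, the imputations always stay within the range of the observed data.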
Model building for the random forests is accelerated using parallel computation powered by ranger, which implements parallelism in native C++. In our simulations, parallel computation substantially sped up the multiple imputation process (about 4x faster on a quad-core laptop).
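To get a feel for the effect on a given machine, a forest can be timed with different thread counts using ranger's num.threads argument. The sketch below calls ranger directly on simulated data; the data size, number of trees, and thread counts are arbitrary assumptions, and the timings will vary with hardware.

```r
library(ranger)

set.seed(1)
# Simulated data: 5000 cases, 10 predictors named X1..X10
n <- 5000
dat <- data.frame(matrix(rnorm(n * 10), ncol = 10))
dat$y <- dat$X1 + rnorm(n)

# Time the same forest with one thread and with four threads
t_serial <- system.time(
  ranger(y ~ ., data = dat, num.trees = 200, num.threads = 1)
)
t_parallel <- system.time(
  ranger(y ~ ., data = dat, num.trees = 200, num.threads = 4)
)

# Compare elapsed wall-clock times in seconds
print(rbind(serial = t_serial, parallel = t_parallel)[, "elapsed"])
```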