First we can split the nhanes3_newborn dataset into
training data and test data.
library(mixgb)
set.seed(2022)
n <- nrow(nhanes3_newborn)
idx <- sample(1:n, size = round(0.7 * n), replace = FALSE)
train.data <- nhanes3_newborn[idx, ]
test.data <- nhanes3_newborn[-idx, ]

We can use the training data to obtain m imputed
datasets and save their imputation models. To do this, users need
to set save.models = TRUE. By default,
save.vars = NULL, so only imputation models for variables with
missing data in the training data are saved. However, the unseen
data may also have missing values in other variables. Users can be
comprehensive and save models for all variables by setting
save.vars = colnames(train.data). Note that this takes
much longer, as a model must be trained and saved for each variable. If
users are confident that only certain variables will have missing values
in the new data, we recommend specifying the names or indices of these
variables in save.vars instead of saving models for all
variables.
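
When a sample of the new data is available in advance, one way to choose save.vars is to compare which columns contain missing values in each dataset. The following is a minimal sketch using base R only; the helper missing_vars() and the toy data frames are hypothetical and are not part of mixgb.

```r
# Hypothetical helper (not part of mixgb): names of columns with at least one NA.
missing_vars <- function(df) {
  names(df)[colSums(is.na(df)) > 0]
}

# Toy stand-ins for train.data and test.data (illustrative values only).
toy.train <- data.frame(a = c(1, NA, 3), b = c(4, 5, 6), c = c("x", NA, "z"))
toy.test <- data.frame(a = c(1, 2), b = c(NA, 5), c = c("x", "y"))

miss.train <- missing_vars(toy.train)
miss.new <- missing_vars(toy.test)

# Variables that are incomplete only in the new data; models for these
# would not be saved under the default save.vars = NULL.
extra.vars <- setdiff(miss.new, miss.train)

# Candidate value for save.vars: every variable incomplete in either dataset.
candidate.vars <- union(miss.train, miss.new)
```

Applied to the vignette's data, missing_vars(train.data) and missing_vars(test.data) would give the corresponding sets, and a non-empty extra.vars would suggest passing a wider save.vars to mixgb().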
# obtain m imputed datasets for train.data and save imputation models
mixgb.obj <- mixgb(data = train.data, m = 5, save.models = TRUE, save.vars = NULL)

When save.models = TRUE, mixgb() will
return an object containing the following:

imputed.data: a list of m imputed datasets for the training data
XGB.models: a list of m sets of XGBoost models for the variables specified in save.vars
params: a list of parameters required for imputing new data with impute_new() later on

We can extract the m imputed datasets from the saved imputer
object with $imputed.data.
train.imputed <- mixgb.obj$imputed.data
# the 5th imputed dataset
head(train.imputed[[5]])
#>    HSHSIZER HSAGEIR HSSEX DMARACER DMAETHNR DMARETHN BMPHEAD BMPRECUM BMPSB1
#> 1:        7       2     1        1        1        3    43.0     67.1    9.2
#> 2:        4       3     2        2        3        2    42.6     67.1    8.8
#> 3:        3       9     2        2        3        2    46.5     64.3    8.6
#> 4:        3       9     2        1        3        1    46.2     68.5   10.8
#> 5:        5       4     1        1        3        1    44.7     63.0    6.0
#> 6:        5      10     1        1        3        1    45.2     72.0    5.4
#>    BMPSB2 BMPTR1 BMPTR2 BMPWT DMPPIR HFF1 HYD1
#> 1:    8.5    8.8    8.8  7.80  1.701    2    1
#> 2:    8.8   13.3   12.2  8.70  0.102    2    1
#> 3:    8.0   10.4    9.2  8.00  0.359    1    3
#> 4:   10.0   16.6   16.0  8.98  0.561    1    3
#> 5:    5.8    9.0    9.0  7.60  2.379    2    1
#> 6:    5.4    9.2    9.4  9.00  2.173    2    2

To impute new data with this saved imputer object, we use the
impute_new() function. Users can also specify whether to use the
new data for initial imputation. By default,
initial.newdata = FALSE, so information from the
training data is used to initially impute the new data. The new data are then imputed
with the saved models. This process is considerably faster because the
imputation models do not need to be built again.
test.imputed <- impute_new(object = mixgb.obj, newdata = test.data)

If PMM is used when we call mixgb(), predicted values of
missing entries in the new dataset are matched with donors from the training
data. Users can also set the number of donors for PMM when imputing new
data. By default, pmm.k = NULL, which means the same
setting as the training object will be used.

Similarly, users can set the number of imputed datasets
m. Note that this value has to be smaller than or equal to
the m in mixgb(). If it is not specified, impute_new()
will use the same m value as the saved object.
test.imputed <- impute_new(object = mixgb.obj, newdata = test.data, initial.newdata = FALSE, pmm.k = 3, m = 4)
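
Because the requested m cannot exceed the number of imputations saved during training, it can be convenient to check this before calling impute_new(). The sketch below assumes that the saved m equals the length of the imputed.data list described earlier; check_m() is a hypothetical helper, not a mixgb function, and toy.obj is a stand-in for a real imputer object.

```r
# Hypothetical guard (not part of mixgb): stop early if more imputations
# are requested than were saved in the imputer object.
check_m <- function(m.new, imputer) {
  # Assumption: imputed.data holds one element per saved imputation.
  m.saved <- length(imputer$imputed.data)
  if (m.new > m.saved) {
    stop("m = ", m.new, " exceeds the ", m.saved,
         " imputations saved in the imputer object")
  }
  invisible(m.new)
}

# Toy stand-in for an imputer object trained with m = 5.
toy.obj <- list(imputed.data = vector("list", 5))

check_m(4, toy.obj)  # passes silently
```

With the vignette's objects, check_m(4, mixgb.obj) could be called immediately before the impute_new() call above.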