install.packages("simBKMRdata")
Simulating Multivariate Environmental Exposure Data and Estimating Feature Selection Thresholds
5/14/25
In environmental health studies, the relationship between multiple environmental exposures and health outcomes, such as cognitive development, is often investigated using complex datasets that exhibit non-normality. This paper introduces an R package designed to simulate multivariate environmental exposure data and estimate feature selection thresholds to aid in research to apply and extend the Bayesian Kernel Machine Regression (BKMR) methodology. The package enables researchers to generate data from multivariate normal and multivariate skewed Gamma distributions, common in environmental exposure research, and calculate multivariate statistical features such as mean, variance, skewness, shape, and rate vectors, and correlation matrices. This package also facilitates the calculation of a Posterior Inclusion Probability (PIP) threshold for feature selection in BKMR, offering an approach that accounts for non-normal data (based on the forthcoming work of Hasan et al. (2025)). The effectiveness of the package is demonstrated through a real-world application using data from an adolescent environmental exposure study, where metal exposures are related to IQ scores.
R package, Bayesian Kernel Machine Regression, multivariate simulation, feature selection, environmental exposure, statistical moments, multivariate Gamma distribution, skewed data
Environmental exposure to metals such as Cadmium, Mercury, Arsenic, and Lead is linked to various adverse health outcomes, especially among children (Horton et al. 2013). In particular, these exposures may influence cognitive functions, with effects that often vary based on the concentration of metals in the environment (Sanders, Claus Henn, and Wright 2015; Trasande and Liu 2011; Tyler and Allan 2014; Tolins, Ruchirawat, and Landrigan 2014). Research into these relationships frequently involves complex, multivariate datasets where environmental factors are correlated, and data distributions often exhibit skewness rather than normality (Bobb et al. 2014, 2018; Hasan et al. 2024).
The Bayesian Kernel Machine Regression (BKMR) model is a powerful method for identifying relevant environmental factors and interactions (Bobb et al. 2014, 2018; Wilson et al. 2022; Devick et al. 2022; Gerwinn 2010; Ibrahimou et al. 2024). However, its effectiveness is typically constrained by its assumption of normality in the data, which is frequently violated in environmental health studies (Bobb et al. 2014, 2018; Hasan et al. 2024; Meulen EA 2021). To allow BKMR to flexibly handle multivariate skewed data, ongoing methodological innovation is needed, and synthetic data sets are a critical component of this research. To fill this need for realistic synthetic data sets, we have developed an R package that facilitates parametric bootstrap data simulation for multivariate environmental exposures. Further, this package includes computational results to estimate a Posterior Inclusion Probability (PIP) threshold which enables feature selection for BKMR via a hypothesis testing framework with controlled test size, which is particularly useful when data deviate from multivariate normality (Hasan et al. 2024). In this chapter, we show how to estimate key statistical moments from a real data set which includes effects of metals exposures on children’s IQ scores. Then, to facilitate future methodological innovation, we show how to use these moments as parameter estimates to simulate multiple synthetic environmental exposure data sets from multivariate normal and multivariate (skewed) Gamma distributions. Finally, for the real data analysis application examining exposure to mixture metals and IQ in a cohort of Italian children, we show how to apply our threshold equation and to use BKMR results for feature selection.
The package aims to provide a robust framework for simulating non-normal environmental exposure data and estimating feature selection thresholds in the context of BKMR. The function names were carefully chosen to reflect their purpose and actions within the package. We chose simple, descriptive names to maintain consistency with R’s convention of function names being indicative of their functionality, ensuring that users can quickly grasp their purpose without extensive documentation. The core functions of the package include:
simulate_group_data()
function; this generates data using multivariate Normal or Gamma distributions.calculate_stats_gaussian()
calculates key statistical moments, including mean, variance, skewness, and correlation matrix, which are essential for understanding the underlying structure of the normally distributed data. This function computes the sample size, gamma distribution parameters (shape and rate), and Spearman correlation matrix for each group, based on the grouping column.trans_ratio()
function scales data by either the standard deviation (SD) or the median absolute deviation (MAD).trans_log()
function applies a logarithmic transformation to the data, and the trans_root()
function allows for fractional root transformations, helping to manage skewed data.calculate_pip_threshold()
function. This function estimates the threshold based on a Four-Parameter Logistic Regression model (Richard’s curve), which adjusts for variations in data and sample size(RICHARDS 1959). The formula used to calculate the PIP threshold is:In this equation:
A dynamic threshold function was derived from the logistic model to adjust thresholds based on CV and sample size(Hasan et al. 2024).
The simBKMRdata software is implemented as an R package. It depends on the following packages: bkmr(Bobb 2022), fields( et al. 2021), base(2024) and MASS(Venables and Ripley 2002). The R software and these required packages can be obtained from the CRAN website. Furthermore, daily builds of package simBKMRdata are provided on the CRAN website [https://cran.r-project.org/web/packages/simBKMRdata/index.html]. It has been published under GPL version 3. The source code is available on GitHub at [https://github.com/khasa006/simBKMRdata].
To install the package from CRAN, use:
install.packages("simBKMRdata")
For GitHub installation of development version (potentially unstable), use:
::install_github("khasa006/simBKMRdata") devtools
Now we load the necessary libraries:
library(tidyverse)
library(simBKMRdata)
library(gt)
We began by examining the real-world environmental exposure dataset derived from the cohort study led by Lucchini and colleagues. The dataset comprises measurements of five key metal concentrations in children—Cadmium, Mercury and Arsenic from urine, Lead from blood and Manganese from hair, along with corresponding intelligence quotient (IQ) scores, evaluated using the Wechsler Intelligence Scale for Children (WISC-IV), a standardized instrument for assessing children’s Full-Scale IQ (FSIQ)(Wechsler 2003; Orsini, Pezzuti, and Picone 2012; Grizzle 2011; Renzetti et al. 2021).
Our example pediatric heavy metals exposure data set is included in the package as metalExposChildren_df
. We include biological sex as a covariate. The data set includes other variables, but we ignore these for the scope of this work.
data("metalExposChildren_df")
<- metalExposChildren_df %>%
analysisData_df select(QI, Cadmium:Manganese, Sex)
head(analysisData_df) %>%
gt() %>%
tab_header(
title = "Top Rows of the Data"
%>%
) fmt_number(
decimals = 2
%>%
) opt_table_outline()
Top Rows of the Data | ||||||
QI | Cadmium | Mercury | Arsenic | Lead | Manganese | Sex |
---|---|---|---|---|---|---|
123.00 | 0.36 | 0.03 | 4.41 | 6.33 | 439.74 | Female |
123.00 | 1.01 | 4.12 | 6.21 | 10.12 | 119.55 | Female |
104.00 | 0.28 | 0.62 | 16.96 | 13.46 | 176.16 | Female |
135.00 | 0.27 | 0.53 | 2.69 | 2.94 | 110.49 | Male |
94.00 | 0.36 | 1.03 | 24.39 | 6.09 | 630.43 | Male |
119.00 | 0.44 | 0.03 | 5.68 | 8.22 | 170.04 | Female |
We show the density plots of these raw exposure levels:
ggplot(
data = analysisData_df %>%
select(Cadmium:Manganese) %>%
pivot_longer(
cols = everything(), names_to = "Metal", values_to = "Value"
)+
) aes(x = Value, fill = Metal) +
theme_minimal() +
theme(legend.position = "none") +
labs(
title = "Histogram with Density Plot for Metals",
x = "Concentration",
y = "Density"
+
) geom_histogram(
aes(y = ..density..), bins = 30, alpha = 0.5, position = "identity"
+
) geom_density(aes(color = Metal), linewidth = 1) +
facet_wrap(~Metal, scales = "free")
These exposure profiles are not Normally distributed, as we see in Figure 1. Visualization of the marginal distributions of these metal concentrations using histograms and kernel density plots revealed marked right-skewness across all analytes, suggesting the need for distribution-sensitive modeling approaches.
To capture the empirical distributional features of the observed data, we implemented the calculate_stats_gamma()
function to estimate parameters for Gamma distributions within male and female subgroups. This method-of-moments approach yielded group-specific descriptors, including sample sizes, means, Gamma shape and rate parameters, and Spearman correlation matrices among the metals. These parameters served as foundational elements for generating realistic synthetic data that preserve sex-specific distribution and correlation structures of the environmental exposures.
Given the skewness present in the metals data, the Multivariate Gamma distribution is a better fit if we plan to use parametric bootstrap techniques downstream in the analysis. The calculate_stats_gamma()
function allows us to calculate group-specific estimators using the group_col
argument (in our case, for Males and Females), and it has options with the using
argument to estimate parameters via Method of Moments ("MoM"
) and generalized Maximum Likelihood Estimation ("gMLE"
)(Weiss and Wolfowitz 1966; El Adlouni et al. 2007). For this example, we calculate the Method of Moments parameter estimates of this estimated distribution via "MoM"
.
<- calculate_stats_gamma(
param_list data_df = analysisData_df,
group_col = "Sex",
using = "MoM"
)
# Examine list contents
str(
param_list,max.level = 2,
give.attr = FALSE
)
List of 2
$ Female:List of 5
..$ sampSize : int 242
..$ mean_vec : Named num [1:6] 101.893 0.554 0.701 24.566 10.439 ...
..$ sampCorr_mat: num [1:6, 1:6] 1 -0.0942 -0.0625 -0.114 -0.019 ...
..$ shape_num : Named num [1:6] 44.706 0.343 0.242 0.897 1.033 ...
..$ rate_num : Named num [1:6] 0.4388 0.6185 0.3444 0.0365 0.0989 ...
$ Male :List of 5
..$ sampSize : int 186
..$ mean_vec : Named num [1:6] 101.108 0.506 0.723 25.67 11.042 ...
..$ sampCorr_mat: num [1:6, 1:6] 1 -0.08159 -0.00122 -0.00365 0.16943 ...
..$ shape_num : Named num [1:6] 46.1 0.757 0.282 0.174 0.544 ...
..$ rate_num : Named num [1:6] 0.45595 1.49634 0.38964 0.00679 0.04928 ...
With the estimated parameters, we used the simulate_group_data()
function to generate synthetic multivariate Gamma-distributed exposure data and preserve the distributional properties of the original data. The simulated data maintained inter-variable correlations and group-specific skewness patterns, effectively replicating real-world exposure profiles. Based on the Method of Moments estimates that were calculated, we can draw a parametric bootstrap sample by passing in the list of group-specific parameter estimates found above:
<- simulate_group_data(
gamma_samples param_list = param_list,
data_gen_fn = generate_mvGamma_data,
group_col_name = "Sex"
)
head(gamma_samples) %>%
gt() %>%
tab_header(
title = "Top Rows of the gamma Sample"
%>%
) fmt_number(
decimals = 2
%>%
) opt_table_outline()
Top Rows of the gamma Sample | ||||||
V1 | V2 | V3 | V4 | V5 | V6 | Sex |
---|---|---|---|---|---|---|
113.07 | 0.18 | 0.06 | 22.13 | 7.95 | 1,970.83 | Female |
88.04 | 0.03 | 0.08 | 21.92 | 3.51 | 1,025.22 | Female |
104.84 | 5.61 | 1.29 | 12.82 | 6.09 | 105.75 | Female |
113.67 | 0.01 | 0.06 | 19.65 | 9.84 | 635.99 | Female |
118.07 | 0.00 | 0.00 | 0.78 | 0.95 | 642.75 | Female |
97.72 | 0.05 | 0.02 | 6.14 | 0.68 | 836.97 | Female |
We account for the grouping variable column name with group_col_name
, and we specify the Multivariate Gamma distribution by invoking the helper function generate_mvGamma_data()
. This function can be repeatedly invoked to create hundreds of parametric bootstrap resamples to test future methodological developments. These simulation results are extensively discussed in Hasan et al. (2024) and Hasan et al. (2025).
As with any hypothesis-testing paradigm, we must set the threshold of rejection before we perform an experiment. For variable selection with BKMR, this is no different. In previous studies(Ibrahimou et al. 2024; Pesenti et al. 2023; Li et al. 2022; Zhang et al. 2019; Coker et al. 2018; Barbieri and Berger 2004; Zheng, Luo, and Xue 2024), a static threshold 50% was used to delineate which exposures were statistically related to the outcome. However, as shown in Hasan et al. (2024) and Hasan et al. (2025), using a static threshold often results in either highly conservative (underpowered) tests, or tests that have a poorly controlled α size (greater than the expected 5%).
To address the challenge of determining meaningful PIP thresholds under skewed data conditions, we applied the calculate_pip_threshold()
function. This function dynamically estimates a threshold based on the dataset’s coefficient of variation (CV) and sample size using an estimate for the Four-Parameter Logistic Regression model (Richard’s Curve) described above.
<- calculate_pip_threshold(
pipThresh_num y = na.omit(analysisData_df$QI)
) pipThresh_num
[1] 0.11728
Therefore, based on this response vector’s distribution, we have shown that a threshold of approximately 0.117 should yield a better controlled, more powerful test than using a static threshold of 0.5.
To assess the relationship between metal exposures and cognitive outcomes, we will need to execute BKMR on the real data. To assess the relationship between metal exposures and cognitive outcomes, we employed BKMR using the kmbayes() function from the bkmr package. Prior to analysis, we omit the missing values and the metal concentration variables were log-transformed using the trans_log()
function to mitigate skewness. Because we want to transform by log10(x+1), we will use the function trans_log()
which uses base 10 and a one-unit positive shift by default. The exposure matrix thus comprised log-transformed values of the five metals, with IQ as the continuous outcome variable.
<-
bkmrAnalysisData_df %>%
analysisData_df drop_na() %>%
mutate_at(
vars(Cadmium:Manganese), ~ trans_log(.) %>% as.vector
)
The BKMR package expects the response as a vector and the exposure as a data.matrix()
object. Covariates must be numeric, so we will recode Sex as a binary variable. To enhance computational efficiency, we select a matrix of knots that span the exposure space, allowing the Gaussian process to be approximated using a reduced set of representative points.
set.seed(2025)
<- bkmrAnalysisData_df %>%
expos select(Cadmium:Manganese) %>%
data.matrix()
<- (bkmrAnalysisData_df$Sex == "Female") + 0L
is_female
# Generate knot points using a cover design for Bayesian modeling
<- fields::cover.design(expos, nd = 50)$design knots50
Now that we have set up our data and search grid, we can run BKMR using MCMC. The BKMR model was fit with 2,000 MCMC iterations, allowing for variable selection via computation of Posterior Inclusion Probabilities (PIPs).
library(bkmr)
set.seed(2025)
# Fit the BKMR model using MCMC
<- kmbayes(
modelFit # Response variable
y = bkmrAnalysisData_df$QI,
# Exposure matrix (metal concentrations)
Z = expos,
# Sex at birth covariate
X = is_female,
# Number of MCMC iterations; set to at least 10000 for real analysis
iter = 2000,
family = "gaussian", # Gaussian response
verbose = FALSE, # Output progress for each iteration
varsel = TRUE, # Perform variable selection
knots = knots50 # Knot points generated earlier
)
To further explore the nature of these associations, we examined the exposure-response relationships between each metal and IQ using univariate predictor-response functions. The PredictorResponseUnivar() function generated marginal effects, where the value of each metal varied while keeping the remaining exposures fixed at the 75th percentile.
# Generate univariate predictor-response relationships
<- PredictorResponseUnivar(fit = modelFit)
predRespUnivar
# Plot univariate predictor-response functions
<- ggplot(
predRespUnivarPlot
predRespUnivar, aes(
z,
est, ymin = est - 1.96 * se,
ymax = est + 1.96 * se
)+
) geom_smooth(stat = "identity") + # Add smooth lines with confidence intervals
facet_wrap(~ variable) + # Create separate plots for each variable
ylab("h(z)") # Label the y-axis
We want to know which metals are related to cognitive ability, as measured by IQ. To extraction the PIPs from the modelFit
model object, we use the ExtractPIPs()
function.
Posterior Inclusion Probabilities (PIPs) | |
variable | PIP |
---|---|
Cadmium | 0.993 |
Lead | 0.611 |
Manganese | 0.307 |
Mercury | 0.276 |
Arsenic | 0.151 |
Notably, Cadmium and Lead emerged as the strongest contributors to cognitive development outcomes, as indicated by elevated PIP values (Table 1) substantially above the baseline. Cadmium demonstrated a near-certain inclusion probability (PIP = 0.993), suggesting a strong association with IQ outcomes, while Lead also displayed a high PIP (0.611), reinforcing previous epidemiological findings on its neurotoxic effects(Bakulski et al. 2020; Sanders, Claus Henn, and Wright 2015; Henn et al. 2012; Ciesielski et al. 2012).
However, proper feature selection in the BKMR framework requires a calibrated, data-driven threshold rather than a fixed conventional cutoff (e.g., 0.5). Using the calculate_pip_threshold()
function, we computed a dynamic PIP threshold of 0.117. This threshold adapts to sample size and characteristics and adjusts the selection criterion accordingly, making it especially suitable for non-normal data distributions.
By comparing each metal’s PIP against the calculated threshold, we identified a refined subset of significant exposures, which included not only cadmium and lead but also manganese, mercury and arsenic. This approach allows for a statistically principled method of feature selection within the BKMR framework, ensuring robust and interpretable results even when the data deviate from normality.
Figure 2 shows the relationship of each individual metals with the outcome, where all of the other exposures are fixed to a particular quantile (e.g., 75th percentile). The results indicate non-linear associations for several metals. For instance, Cadmium exhibited a consistently negative association with IQ across a wide range of concentrations, whereas the relationship for Lead suggested a decline at higher concentrations. Associations for Manganese, Mercury, and Arsenic were comparatively weaker but statistically distinguishable from zero, consistent with their PIP values exceeding the dynamic threshold. The consistent patterns identified between PIP-based feature selection and univariate exposure-response functions validate the efficacy of the dynamic threshold method for BKMR feature selection in practical, non-normal environmental datasets.
# Plot univariate predictor-response functions
<- ggplot(
predRespUnivarPlot
predRespUnivar, aes(
z,
est, ymin = est - 1.96 * se,
ymax = est + 1.96 * se
)+
) geom_smooth(stat = "identity") + # Add smooth lines with confidence intervals
facet_wrap(~ variable) + # Create separate plots for each variable
ylab("h(z)") # Label the y-axis
predRespUnivarPlot
The R package we developed is a valuable tool for simulating multivariate environmental exposure data and estimating feature selection thresholds for BKMR. By supporting multivariate normal and multivariate skewed Gamma distributions, the package is particularly well-suited for handling non-normal environmental exposure data, which is common in health studies.
In addition to describing the software package, we present an analysis of an anonymous but original real-world dataset (which includes measurements of metal exposure and children’s IQ scores). Using BKMR within this framework allowed us to rigorously assess the relationships between multiple environmental exposures and cognitive development. The analysis of the original dataset adds to the growing body of literature on the impacts of heavy metal exposure on cognitive outcomes(Porru et al. 2024; Karri, Schuhmacher, and Kumar 2016; Rafiee et al. 2020). However, supporting literature is sparse when applied to pediatric exposure cases(Ouyang et al. 2023; Saxena et al. 2022; Kim et al. 2023; Bora et al. 2019), so this analysis offers a valuable corroborating perspective.
The results highlighted cadmium, and lead as significant contributors to cognitive decline—findings well supported by prior epidemiological research highlighting their neurotoxic effects (Bakulski et al. 2020; Sanders, Claus Henn, and Wright 2015; Henn et al. 2012; Ciesielski et al. 2012). In addition, the dynamic, data-driven PIP threshold, included additional metals - manganese, mercury and arsenic, which showed strong evidence for neurotoxicity(Balachandran et al. 2020; Levin-Schwartz et al. 2021; Oliveira et al. 2018; Grandjean and Landrigan 2006; Tian et al. 2025; Rosado et al. 2007). Notably, if we relied on a static threshold (e.g., PIP > 0.5), these additional exposures would have been incorrectly excluded, potentially underestimating the true burden of environmental neurotoxicity. The univariate exposure-response plots further corroborated these associations, providing visualization of dose-response relationships for each metal with IQ outcomes. These results underscore the advantage of the dynamic threshold approach: it enhances the sensitivity of feature selection while maintaining control over false positive rates, thereby improving the robustness and interpretability of BKMR analyses in complex, real-world exposure settings.
In this analysis, we show how to simulate multiple datasets to assess the robustness of new methodological findings under various conditions of environmental exposure. The purpose of generating these datasets is to test the applicability of the BKMR model and feature selection process in the presence of non-normal (skewed) data. By simulating multiple datasets from both multivariate normal and skewed Gamma distributions, we can compare how the model performs across different data distributions. This approach ensures that the conclusions drawn from a target (real-world) dataset are robust and generalizable, not merely an artifact of that specific dataset. By simulating multiple datasets and applying the same analysis framework, we test the reproducibility of the results and gain confidence in the model’s ability to handle different data conditions.
We showed how this package can be used to generate multiple new simulated datasets which allow us to explore the model’s performance under controlled, synthetic conditions that mimic real-world skewed exposure data. We could then analyze the multiple generated datasets using the same statistical and modeling procedures we would plan to apply to any new data. Specifically, we calculate the statistical moments for each simulated dataset, then use those to generate exposure data. The results from the multiple datasets can then be summarized by comparing the feature selection outcomes (PIPs) across different simulations. These PIPs would help identify which metals have the most significant association with cognitive outcomes. We could aggregate these findings and provide a summary of how consistent or variable the results are across different simulations, strengthening the validity of newly developed modeling algorithms.
This R package provides essential tools for simulating multivariate environmental exposure data, estimating statistical moments, and calculating PIP thresholds for feature selection in Bayesian Kernel Machine Regression. While BKMR is an excellent tool for feature selection, its reliance on normality can limit its application to real-world datasets. The ability to simulate data from multivariate normal and skewed Gamma distributions makes the package particularly valuable for environmental health studies where data often deviate from normality. Our package fills this gap by enabling researchers to model and analyze non-normally distributed data, ensuring that feature selection remains robust even in complex scenarios.