---
title: "Helper Functions for Bayesian Kernel Machine Regression"
subtitle: "Simulating Multivariate Environmental Exposure Data and Estimating Feature Selection Thresholds"
date: "`r Sys.Date()`"
author: "Kazi Tanvir Hasan, Dr. Gabriel Odom, Cristian Guerini, Dr. Zoran Bursac, Dr. Roberto Lucchini,  and Dr. Boubakari Ibrahimou"
output:
  rmarkdown::html_document:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{Helper Functions for Bayesian Kernel Machine Regression}
  %\VignetteEngine{quarto::html}
  %\VignetteLanguage{en}
bibliography: references.bib
---

# Abstract

In environmental health studies, the relationship between multiple environmental exposures and health outcomes, such as cognitive development, is often investigated using complex datasets that exhibit non-normality. This paper introduces an R package designed to simulate multivariate environmental exposure data and estimate feature selection thresholds to aid in research to apply and extend the Bayesian Kernel Machine Regression (BKMR) methodology. The package enables researchers to generate data from multivariate normal and multivariate skewed Gamma distributions, common in environmental exposure research, and calculate multivariate statistical features such as mean, variance, skewness, shape, and rate vectors, and correlation matrices. This package also facilitates the calculation of a Posterior Inclusion Probability (PIP) threshold for feature selection in BKMR, offering an approach that accounts for non-normal data (based on the forthcoming work of @hasan2025). The effectiveness of the package is demonstrated through a real-world application using data from an adolescent environmental exposure study, where metal exposures are related to IQ scores.

## Keywords

R package, Bayesian Kernel Machine Regression, multivariate simulation, feature selection, environmental exposure, statistical moments, multivariate Gamma distribution, skewed data

# Introduction

Environmental exposure to metals such as Cadmium, Mercury, Arsenic, and Lead is linked to various adverse health outcomes, especially among children [@Horton2013]. In particular, these exposures may influence cognitive functions, with effects that often vary based on the concentration of metals in the environment [@Sanders2015; @trasande2011; @tyler2014; @tolins2014]. Research into these relationships frequently involves complex, multivariate datasets where environmental factors are correlated, and data distributions often exhibit skewness rather than normality [@bobb2014; @bobb2018; @hasan2024].

The Bayesian Kernel Machine Regression (BKMR) model is a powerful method for identifying relevant environmental factors and interactions [@bobb2014; @bobb2018; @wilson2022; @devick2022; @gerwinn2010; @ibrahimou2024]. However, its effectiveness is typically constrained by its assumption of normality in the data, which is frequently violated in environmental health studies [@bobb2014; @bobb2018; @hasan2024; @Van2021]. To allow BKMR to flexibly handle multivariate skewed data, ongoing methodological innovation is needed, and synthetic data sets are a critical component of this research. To fill this need for realistic synthetic data sets, we have developed an R package that facilitates parametric bootstrap data simulation for multivariate environmental exposures. Further, this package includes computational results to estimate a Posterior Inclusion Probability (PIP) threshold which enables feature selection for BKMR via a hypothesis testing framework with controlled test size, which is particularly useful when data deviate from multivariate normality [@hasan2024]. In this chapter, we show how to estimate key statistical moments from a real data set which includes effects of metals exposures on children's IQ scores. Then, to facilitate future methodological innovation, we show how to use these moments as parameter estimates to simulate multiple synthetic environmental exposure data sets from multivariate normal and multivariate (skewed) Gamma distributions. Finally, for the real data analysis application examining exposure to mixture metals and IQ in a cohort of Italian children, we show how to apply our threshold equation and to use BKMR results for feature selection.

# Materials and Methods

## Functionality in this R Package

The package aims to provide a robust framework for simulating non-normal environmental exposure data and estimating feature selection thresholds in the context of BKMR. The function names were carefully chosen to reflect their purpose and actions within the package. We chose simple, descriptive names to maintain consistency with R's convention of function names being indicative of their functionality, ensuring that users can quickly grasp their purpose without extensive documentation. The core functions of the package include:

1.  Data Simulation: Functions for generating multivariate data from Normal, Gamma, and skewed Gamma distributions with the `simulate_group_data()` function; this generates data using multivariate Normal or Gamma distributions.
2.  Statistical Parameters Estimation: The function `calculate_stats_gaussian()` calculates key statistical moments, including mean, variance, skewness, and correlation matrix, which are essential for understanding the underlying structure of the normally distributed data. This function computes the sample size, gamma distribution parameters (shape and rate), and Spearman correlation matrix for each group, based on the grouping column.
3.  Data Transformation Functions:
    -   Scaling by Standard Deviation or MAD: The `trans_ratio()` function scales data by either the standard deviation (SD) or the median absolute deviation (MAD).
    -   Logarithmic and Root Transformations: The `trans_log()` function applies a logarithmic transformation to the data, and the `trans_root()` function allows for fractional root transformations, helping to manage skewed data.
4.  Feature Selection Threshold Calculation: The PIP threshold for feature selection is computed using the `calculate_pip_threshold()` function. This function estimates the threshold based on a Four-Parameter Logistic Regression model (Richard's curve), which adjusts for variations in data and sample size[@richards1959]. The formula used to calculate the PIP threshold is:

```{=tex}
\begin{equation}
y = A + \frac{K - A}{\left( C + e^{-\beta_1 x_1} \right)^{\beta_2 x_2}}.\label{eq1}
\end{equation}
```
In this equation:

-   $y$: the 95th percentile of the PIP values $PIP_{q_{95}}$
-   $A$: Left asymptote, fixed at 0.
-   $K$: Right asymptote, bounded above by 1.
-   $C$: Scaling fit constant.
-   $\beta_1, \beta_2$: Midpoint shift parameters for CV and sample size.
-   $x_1$: Log-transformed CV $\log_2(\text{CV})$.
-   $x_2$: Log-transformed sample size $\log_{10} (\text{Sample Size})$.

A dynamic threshold function was derived from the logistic model to adjust thresholds based on CV and sample size[@hasan2024].

```{=tex}
\begin{equation}
    \text{Threshold} = \frac{1}{\left( 1.30460 + e^{-0.59867 \log_2 (\text{CV})} \right)^{0.43565 \log_{10} (\text{SampleSize})}}
\end{equation}
```
## Software Implementation

The simBKMRdata software is implemented as an R package. It depends on the following packages: bkmr[@bkmr], fields[@fields], base[@base] and MASS[@MASS]. The R software and these required packages can be obtained from the CRAN website. Furthermore, daily builds of package simBKMRdata are provided on the CRAN website \[https://cran.r-project.org/web/packages/simBKMRdata/index.html\]. It has been published under GPL version 3. The source code is available on GitHub at \[https://github.com/khasa006/simBKMRdata\].

To install the package from CRAN, use:

```{r}
#| eval: false
install.packages("simBKMRdata")
```

For GitHub installation of development version (potentially unstable), use:

```{r}
#| eval: false
devtools::install_github("khasa006/simBKMRdata")
```

Now we load the necessary libraries:

```{r}
#| message: false
library(tidyverse)
library(simBKMRdata)
library(gt)
```

# Example Analysis of Heavy Metal Exposure

We began by examining the real-world environmental exposure dataset derived from the cohort study led by Lucchini and colleagues. The dataset comprises measurements of five key metal concentrations in children---Cadmium, Mercury and Arsenic from urine, Lead from blood and Manganese from hair, along with corresponding intelligence quotient (IQ) scores, evaluated using the Wechsler Intelligence Scale for Children (WISC-IV), a standardized instrument for assessing children's Full-Scale IQ (FSIQ)[@wechsler2003; @orsini2012; @grizzle2011; @renzetti2021].

## Data Exploration

Our example pediatric heavy metals exposure data set is included in the package as `metalExposChildren_df`. We include biological sex as a covariate. The data set includes other variables, but we ignore these for the scope of this work.

```{r}
data("metalExposChildren_df")

analysisData_df <- metalExposChildren_df %>%
  select(QI, Cadmium:Manganese, Sex)

head(analysisData_df) %>%
  gt() %>%
  tab_header(
    title = "Top Rows of the Data"
  ) %>%
  fmt_number(
    decimals = 2
  ) %>%
  opt_table_outline()
```

We show the density plots of these raw exposure levels:

```{r}
#| label: fig-metals-density
#| fig-cap: Histogram with Density Plot for Metals
#| warning: false

ggplot(
  data = analysisData_df %>%
    select(Cadmium:Manganese) %>%
    pivot_longer(
      cols = everything(), names_to = "Metal", values_to = "Value"
    )
) + 
  aes(x = Value, fill = Metal) + 
  theme_minimal() + 
  theme(legend.position = "none") + 
  labs(
    title = "Histogram with Density Plot for Metals", 
    x = "Concentration", 
    y = "Density"
  ) +
  geom_histogram(
    aes(y = ..density..), bins = 30, alpha = 0.5, position = "identity"
  ) + 
  geom_density(aes(color = Metal), linewidth = 1) + 
  facet_wrap(~Metal, scales = "free")
```

These exposure profiles are not Normally distributed, as we see in @fig-metals-density. Visualization of the marginal distributions of these metal concentrations using histograms and kernel density plots revealed marked right-skewness across all analytes, suggesting the need for distribution-sensitive modeling approaches.

## Parameter Estimation

To capture the empirical distributional features of the observed data, we implemented the `calculate_stats_gamma()` function to estimate parameters for Gamma distributions within male and female subgroups. This method-of-moments approach yielded group-specific descriptors, including sample sizes, means, Gamma shape and rate parameters, and Spearman correlation matrices among the metals. These parameters served as foundational elements for generating realistic synthetic data that preserve sex-specific distribution and correlation structures of the environmental exposures.

Given the skewness present in the metals data, the Multivariate Gamma distribution is a better fit if we plan to use parametric bootstrap techniques downstream in the analysis. The `calculate_stats_gamma()` function allows us to calculate group-specific estimators using the `group_col` argument (in our case, for Males and Females), and it has options with the `using` argument to estimate parameters via Method of Moments (`"MoM"`) and generalized Maximum Likelihood Estimation (`"gMLE"`)[@weiss1966; @eladlouni2007]. For this example, we calculate the Method of Moments parameter estimates of this estimated distribution via `"MoM"`.

```{r}
param_list <- calculate_stats_gamma(
  data_df = analysisData_df, 
  group_col = "Sex", 
  using = "MoM"
)

# Examine list contents
str(
  param_list,
  max.level = 2,
  give.attr = FALSE
)
```

## Data Simulation

With the estimated parameters, we used the `simulate_group_data()` function to generate synthetic multivariate Gamma-distributed exposure data and preserve the distributional properties of the original data. The simulated data maintained inter-variable correlations and group-specific skewness patterns, effectively replicating real-world exposure profiles. Based on the Method of Moments estimates that were calculated, we can draw a parametric bootstrap sample by passing in the list of group-specific parameter estimates found above:

```{r}
gamma_samples <- simulate_group_data(
  param_list = param_list, 
  data_gen_fn = generate_mvGamma_data, 
  group_col_name = "Sex"
)

head(gamma_samples) %>% 
  gt() %>%
  tab_header(
    title = "Top Rows of the gamma Sample"
  ) %>%
  fmt_number(
    decimals = 2
  ) %>%
  opt_table_outline()
```

We account for the grouping variable column name with `group_col_name`, and we specify the Multivariate Gamma distribution by invoking the helper function `generate_mvGamma_data()`. This function can be repeatedly invoked to create hundreds of parametric bootstrap resamples to test future methodological developments. These simulation results are extensively discussed in @hasan2024 and @hasan2025.

## Dynamic Threshold Calculation

As with any hypothesis-testing paradigm, we must set the threshold of rejection before we perform an experiment. For variable selection with BKMR, this is no different. In previous studies[@ibrahimou2024; @pesenti2023; @li2022; @zhang2019; @coker2018; @barbieri2004; @zheng2024], a static threshold 50% was used to delineate which exposures were statistically related to the outcome. However, as shown in @hasan2024 and @hasan2025, using a static threshold often results in either highly conservative (underpowered) tests, or tests that have a poorly controlled $\alpha$ size (greater than the expected 5%).

To address the challenge of determining meaningful PIP thresholds under skewed data conditions, we applied the `calculate_pip_threshold()` function. This function dynamically estimates a threshold based on the dataset's coefficient of variation (CV) and sample size using an estimate for the Four-Parameter Logistic Regression model (Richard's Curve) described above.

```{r}
pipThresh_num <- calculate_pip_threshold(
  y = na.omit(analysisData_df$QI)
)
pipThresh_num
```

Therefore, based on this response vector's distribution, we have shown that a threshold of approximately `r round(pipThresh_num, 3)` should yield a better controlled, more powerful test than using a static threshold of 0.5.

## Bayesian Kernel Machine Regression Analysis

To assess the relationship between metal exposures and cognitive outcomes, we will need to execute BKMR on the real data. To assess the relationship between metal exposures and cognitive outcomes, we employed BKMR using the kmbayes() function from the bkmr package. Prior to analysis, we omit the missing values and the metal concentration variables were log-transformed using the `trans_log()` function to mitigate skewness. Because we want to transform by $\log_{10}(x+1)$, we will use the function `trans_log()` which uses base 10 and a one-unit positive shift by default. The exposure matrix thus comprised log-transformed values of the five metals, with IQ as the continuous outcome variable.

```{r}
#| label: transform-data-for-bkmr

bkmrAnalysisData_df <- 
  analysisData_df %>%
  drop_na() %>%
  mutate_at(
    vars(Cadmium:Manganese), ~ trans_log(.) %>% as.vector
  )
```

The BKMR package expects the response as a vector and the exposure as a `data.matrix()` object. Covariates must be numeric, so we will recode Sex as a binary variable. To enhance computational efficiency, we select a matrix of knots that span the exposure space, allowing the Gaussian process to be approximated using a reduced set of representative points.

```{r}
#| label: create-bkmr-data-objects
set.seed(2025)
expos <- bkmrAnalysisData_df %>%
  select(Cadmium:Manganese) %>% 
  data.matrix()
is_female <- (bkmrAnalysisData_df$Sex == "Female") + 0L

# Generate knot points using a cover design for Bayesian modeling
knots50 <- fields::cover.design(expos, nd = 50)$design
```

Now that we have set up our data and search grid, we can run BKMR using MCMC. The BKMR model was fit with 2,000 MCMC iterations, allowing for variable selection via computation of Posterior Inclusion Probabilities (PIPs).

```{r}
#| label: run-bkmr
#| message: false
library(bkmr)

set.seed(2025)

# Fit the BKMR model using MCMC
modelFit <- kmbayes(
  # Response variable
  y = bkmrAnalysisData_df$QI,
  # Exposure matrix (metal concentrations)
  Z = expos,     
  # Sex at birth covariate
  X = is_female,   
  # Number of MCMC iterations; set to at least 10000 for real analysis
  iter = 2000,         
  family = "gaussian", # Gaussian response
  verbose = FALSE,     # Output progress for each iteration
  varsel = TRUE,       # Perform variable selection
  knots = knots50      # Knot points generated earlier
)
```

To further explore the nature of these associations, we examined the exposure-response relationships between each metal and IQ using univariate predictor-response functions. The PredictorResponseUnivar() function generated marginal effects, where the value of each metal varied while keeping the remaining exposures fixed at the 75th percentile.

```{r}
#| label: fig-Univariate-exposure-response-function
#| warning: false
# Generate univariate predictor-response relationships
predRespUnivar <- PredictorResponseUnivar(fit = modelFit)

# Plot univariate predictor-response functions
predRespUnivarPlot <- ggplot(
  predRespUnivar, 
  aes(
    z, 
    est, 
    ymin = est - 1.96 * se, 
    ymax = est + 1.96 * se
  )
) + 
  geom_smooth(stat = "identity") +  # Add smooth lines with confidence intervals
  facet_wrap(~ variable) +          # Create separate plots for each variable
  ylab("h(z)")                      # Label the y-axis

```

# Results

We want to know which metals are related to cognitive ability, as measured by IQ. To extraction the PIPs from the `modelFit` model object, we use the `ExtractPIPs()` function.

```{r}
#| label: tbl-PIPs
#| tbl-cap: Posterior Inclusion Probabilities (PIPs) for Metal Exposures
#| echo: false

pipFit <- ExtractPIPs(modelFit) %>% arrange(desc(PIP))
pipFit %>%
  gt() %>%
  tab_header(
    title = "Posterior Inclusion Probabilities (PIPs)"
  )
```

Notably, Cadmium and Lead emerged as the strongest contributors to cognitive development outcomes, as indicated by elevated PIP values (@tbl-PIPs) substantially above the baseline. Cadmium demonstrated a near-certain inclusion probability (PIP = 0.993), suggesting a strong association with IQ outcomes, while Lead also displayed a high PIP (0.611), reinforcing previous epidemiological findings on its neurotoxic effects[@bakulski2020; @Sanders2015; @henn2012; @ciesielski2012].

However, proper feature selection in the BKMR framework requires a calibrated, data-driven threshold rather than a fixed conventional cutoff (e.g., 0.5). Using the `calculate_pip_threshold()` function, we computed a dynamic PIP threshold of `r round(pipThresh_num, 3)`. This threshold adapts to sample size and characteristics and adjusts the selection criterion accordingly, making it especially suitable for non-normal data distributions.

By comparing each metal's PIP against the calculated threshold, we identified a refined subset of significant exposures, which included not only cadmium and lead but also manganese, mercury and arsenic. This approach allows for a statistically principled method of feature selection within the BKMR framework, ensuring robust and interpretable results even when the data deviate from normality.

@fig-Univariate-exposure-response-plot shows the relationship of each individual metals with the outcome, where all of the other exposures are fixed to a particular quantile (e.g., 75th percentile). The results indicate non-linear associations for several metals. For instance, Cadmium exhibited a consistently negative association with IQ across a wide range of concentrations, whereas the relationship for Lead suggested a decline at higher concentrations. Associations for Manganese, Mercury, and Arsenic were comparatively weaker but statistically distinguishable from zero, consistent with their PIP values exceeding the dynamic threshold. The consistent patterns identified between PIP-based feature selection and univariate exposure-response functions validate the efficacy of the dynamic threshold method for BKMR feature selection in practical, non-normal environmental datasets.

```{r}
#| label: fig-Univariate-exposure-response-plot
#| fig-cap:  Univariate exposure-response function plot
#| warning: false

# Plot univariate predictor-response functions
predRespUnivarPlot <- ggplot(
  predRespUnivar, 
  aes(
    z, 
    est, 
    ymin = est - 1.96 * se, 
    ymax = est + 1.96 * se
  )
) + 
  geom_smooth(stat = "identity") +  # Add smooth lines with confidence intervals
  facet_wrap(~ variable) +          # Create separate plots for each variable
  ylab("h(z)")                      # Label the y-axis

predRespUnivarPlot
```

# Discussion

The R package we developed is a valuable tool for simulating multivariate environmental exposure data and estimating feature selection thresholds for BKMR. By supporting multivariate normal and multivariate skewed Gamma distributions, the package is particularly well-suited for handling non-normal environmental exposure data, which is common in health studies.

## Implications of Pediatric Heavy Metal Exposure

In addition to describing the software package, we present an analysis of an anonymous but original real-world dataset (which includes measurements of metal exposure and children's IQ scores). Using BKMR within this framework allowed us to rigorously assess the relationships between multiple environmental exposures and cognitive development. The analysis of the original dataset adds to the growing body of literature on the impacts of heavy metal exposure on cognitive outcomes[@porru2024; @karri2016; @rafiee2020a]. However, supporting literature is sparse when applied to pediatric exposure cases[@ouyang2023; @saxena2022; @Kim2023; @bora2019], so this analysis offers a valuable corroborating perspective.

The results highlighted cadmium, and lead as significant contributors to cognitive decline---findings well supported by prior epidemiological research highlighting their neurotoxic effects [@bakulski2020; @Sanders2015; @henn2012; @ciesielski2012]. In addition, the dynamic, data-driven PIP threshold, included additional metals - manganese, mercury and arsenic, which showed strong evidence for neurotoxicity[@balachandran2020; @levin-schwartz2021; @oliveira2018; @grandjean2006; @tian2025; @rosado2007]. Notably, if we relied on a static threshold (e.g., PIP \> 0.5), these additional exposures would have been incorrectly excluded, potentially underestimating the true burden of environmental neurotoxicity. The univariate exposure-response plots further corroborated these associations, providing visualization of dose-response relationships for each metal with IQ outcomes. These results underscore the advantage of the dynamic threshold approach: it enhances the sensitivity of feature selection while maintaining control over false positive rates, thereby improving the robustness and interpretability of BKMR analyses in complex, real-world exposure settings.

## Utility in Future Methodological Research

In this analysis, we show how to simulate multiple datasets to assess the robustness of new methodological findings under various conditions of environmental exposure. The purpose of generating these datasets is to test the applicability of the BKMR model and feature selection process in the presence of non-normal (skewed) data. By simulating multiple datasets from both multivariate normal and skewed Gamma distributions, we can compare how the model performs across different data distributions. This approach ensures that the conclusions drawn from a target (real-world) dataset are robust and generalizable, not merely an artifact of that specific dataset. By simulating multiple datasets and applying the same analysis framework, we test the reproducibility of the results and gain confidence in the model's ability to handle different data conditions.

We showed how this package can be used to generate multiple new simulated datasets which allow us to explore the model's performance under controlled, synthetic conditions that mimic real-world skewed exposure data. We could then analyze the multiple generated datasets using the same statistical and modeling procedures we would plan to apply to any new data. Specifically, we calculate the statistical moments for each simulated dataset, then use those to generate exposure data. The results from the multiple datasets can then be summarized by comparing the feature selection outcomes (PIPs) across different simulations. These PIPs would help identify which metals have the most significant association with cognitive outcomes. We could aggregate these findings and provide a summary of how consistent or variable the results are across different simulations, strengthening the validity of newly developed modeling algorithms.

# Conclusion

This R package provides essential tools for simulating multivariate environmental exposure data, estimating statistical moments, and calculating PIP thresholds for feature selection in Bayesian Kernel Machine Regression. While BKMR is an excellent tool for feature selection, its reliance on normality can limit its application to real-world datasets. The ability to simulate data from multivariate normal and skewed Gamma distributions makes the package particularly valuable for environmental health studies where data often deviate from normality. Our package fills this gap by enabling researchers to model and analyze non-normally distributed data, ensuring that feature selection remains robust even in complex scenarios.