---
title: "Cram Bandit Helpers"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Cram Bandit Helpers}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(cramR)
library(data.table)
```

# 🌟 What is this article about?

In order to use `cram_bandit()`, users must supply a matrix of **action selection probabilities** \( \pi_t(X_j, A_j) \) for each combination of policy update \( t \) and context \( j \) in the historical dataset.

While some environments log these probabilities directly, many contextual bandit libraries (such as [`contextual`](https://github.com/Nth-iteration-labs/contextual)) only store **policy parameters** (e.g., regression coefficients) without explicit probability tracking.

This article explains how **Cram Bandit Helpers** reconstruct \( \pi_t(X_j, A_j) \) from these parameters for common policies:

| Policy Type | Class Name |
|------------------|------------------------------------------|
| Epsilon-Greedy | `BatchContextualEpsilonGreedyPolicy` |
| LinUCB Disjoint with \( \varepsilon \)-greedy exploration | `BatchLinUCBDisjointPolicyEpsilon` |
| Thompson Sampling | `BatchContextualLinTSPolicy` |

Both **theoretical formulas** and **practical code snippets** are provided.

---

# 🛠️ Policy parameters explained

When using linear bandit algorithms like Epsilon-Greedy, LinUCB, or Thompson Sampling, each arm \( k \) maintains **summary statistics** (parameters) to estimate the expected reward:

- \( A_k \) is the **Gram matrix**:

\[
A_k = X_k^T X_k
\]

where \( X_k \) is the matrix of feature vectors (contexts) for all rounds where arm \( k \) was selected.

➔ **Interpretation**: \( A_k \) captures the amount of information (and correlation structure) about the features for arm \( k \). It plays the role of a "feature covariance matrix."

- \( b_k \) is the **response vector**:

\[
b_k = X_k^T y_k
\]

where \( y_k \) are the observed rewards for arm \( k \).

➔ **Interpretation**: \( b_k \) captures the relationship between the observed rewards and the contexts for arm \( k \).

These sufficient statistics allow the policy to compute the **Least Squares estimate** for the reward model:

\[
\theta_k = A_k^{-1} b_k
\]

where:

- \( \theta_k \) is the estimated coefficient vector that predicts the expected reward of arm \( k \) as a function of the context.

Thus:

- \( A_k \) tells us **how confident** we are about \( \theta_k \) (smaller eigenvalues of \( A_k \) imply more uncertainty).
- \( b_k \) provides the **observed signal** (reward-weighted context features) used to fit \( \theta_k \).

The policy selects an action based on the \( \theta_k \) of each arm \( k \) and then observes the reward associated with this choice, which is used to update the parameters \( A_k \) and \( b_k \) of the policy.

---

# ✨ Epsilon-Greedy Policy

### 🤖 Theoretical computation

In Epsilon-Greedy, with exploration rate \( \varepsilon \), the probability of selecting one of the best arms is:

\[
P(A_t | X_t) = (1 - \varepsilon) \times \frac{1}{\# \text{best arms}} + \varepsilon \times \frac{1}{K}
\]

while the probability of selecting an arm that is not among the best arms is:

\[
P(A_t | X_t) = \varepsilon \times \frac{1}{K}
\]

where:

- Best arms are those with maximal estimated rewards.
- \( K \) is the total number of available arms.
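To make these formulas concrete, here is a minimal R sketch with made-up inputs (`eps`, `x_t`, `A_list`, and `b_list` below are purely illustrative; this is not the package's internal code) that computes the selection probabilities for a single context:

```r
# Minimal illustration of the epsilon-greedy probabilities above.
# All inputs are made up for the example.
eps <- 0.1                              # exploration rate
K   <- 3                                # number of arms
x_t <- c(1, 0.5)                        # context vector at time t

# Hypothetical per-arm statistics A_k (Gram matrices) and b_k (response vectors)
A_list <- replicate(K, diag(2), simplify = FALSE)
b_list <- list(c(1.0, 0.2), c(0.5, 0.1), c(1.0, 0.2))

# Least squares estimates theta_k = A_k^{-1} b_k and expected rewards x_t' theta_k
theta    <- Map(solve, A_list, b_list)
expected <- vapply(theta, function(th) sum(x_t * th), numeric(1))

# Epsilon-greedy: uniform over the best arms with prob. (1 - eps),
# uniform over all K arms with prob. eps
best  <- which(expected == max(expected))
probs <- rep(eps / K, K)
probs[best] <- probs[best] + (1 - eps) / length(best)
probs                                   # selection probability of each arm
```

With these particular inputs, two arms tie for the best estimated reward, so they split the \( (1 - \varepsilon) \) mass while every arm keeps the uniform \( \varepsilon / K \) share.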
We define the least squares estimate as:

\[
\theta_k = A_k^{-1} b_k
\]

where:

- \( A_k \) is the Gram matrix \( X_k^T X_k \)
- \( b_k \) is the vector \( X_k^T y_k \)

Best arms are identified via the estimated expected reward:

\[
\text{Expected reward} = X_t^T \theta_k
\]

### 📊 Code helper

In `cramR`, this is implemented by:

```r
get_proba_c_eps_greedy(eps, A_list, b_list, contexts, chosen_arms)
```

This function:

- Computes \( \theta_k \) for each arm
- Calculates expected rewards \( X_t^T \theta_k \)
- Identifies the best arms
- Applies the above formula

---

# 🔒 LinUCB Disjoint Policy with \( \varepsilon \)-Greedy

### 🤖 Theoretical computation

LinUCB selects arms based on **Upper Confidence Bounds (UCBs)**:

\[
\text{UCB}_k(X_t) = \mu_k(X_t) + \alpha \sigma_k(X_t)
\]

where:

- \( \mu_k(X_t) = X_t^T \theta_k \)
- \( \sigma_k(X_t) = \sqrt{X_t^T A_k^{-1} X_t} \) measures uncertainty
- \( \alpha \) controls the exploration strength

The action probabilities follow the same structure as in Epsilon-Greedy, but with UCB scores in place of plain expected rewards. The probability of selecting one of the best arms is:

\[
P(A_t | X_t) = (1 - \varepsilon) \times \frac{1}{\# \text{best arms}} + \varepsilon \times \frac{1}{K}
\]

while the probability of selecting an arm that is not among the best arms is:

\[
P(A_t | X_t) = \varepsilon \times \frac{1}{K}
\]

where "best arms" are those with the highest UCB scores.

### 📊 Code helper

In `cramR`, this is implemented by:

```r
get_proba_ucb_disjoint(alpha, eps, A_list, b_list, contexts, chosen_arms)
```

This function:

- Computes \( \theta_k \)
- Computes \( \mu_k(X_t) \) and \( \sigma_k(X_t) \)
- Identifies arms maximizing \( \text{UCB}_k(X_t) \)
- Applies the Epsilon-Greedy selection formula

---

# 🤓 Thompson Sampling (LinTS)

### 🤖 Theoretical computation

In Thompson Sampling, a reward estimate is drawn from the posterior of each arm and the arm with the largest draw is chosen. The probability of selecting arm \( A_t \) is therefore the probability that its draw exceeds all others:

\[
P(A_t | X_t) = \mathbb{P}\left( \theta_{A_t}^T X_t > \theta_{k}^T X_t \quad \forall k \neq A_t \right)
\]

where \( \theta_k \sim \mathcal{N}(A_k^{-1} b_k, \sigma^2 A_k^{-1}) \).

This requires **computing a multivariate probability**, which we approximate via **adaptive numerical integration**.

### 📊 Code helper

In `cramR`, this is implemented by:

```r
get_proba_thompson(sigma, A_list, b_list, contexts, chosen_arms)
```

This function:

- Computes posterior means and variances
- Integrates over the region where the chosen arm \( A_t \) has the highest sampled reward
- Returns clipped probabilities for numerical stability

---

# 👨‍💻 Practical Workflow

When using your bandit policy in practice:

1. Record action choices, contexts, and policy parameters (e.g., \( A \), \( b \)).
2. Calculate the action selection probabilities. If your policy is among the ones presented above, feel free to rely on our helper functions to build \( \pi \).
3. Feed `pi`, `arm`, and `reward` into `cram_bandit()` to evaluate your policy, as sketched below.
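As a rough illustration of steps 2 and 3 for the Epsilon-Greedy case, the sketch below assumes you kept one snapshot of the per-arm parameters per policy update. Here `snapshots`, `contexts`, `chosen_arms`, and `rewards` are hypothetical logged objects, and the way probabilities are stacked into `pi_hat` is an assumption; check the documentation of `get_proba_c_eps_greedy()` and `cram_bandit()` for the exact input formats they expect.

```r
# Sketch only: `snapshots`, `contexts`, `chosen_arms`, and `rewards` are
# hypothetical logged objects, and the shape of `pi_hat` is an assumption.
eps <- 0.1

# One column per policy update t: probability, under the parameters logged
# at update t, of the arm that was actually chosen in each context j.
pi_hat <- sapply(snapshots, function(s) {
  get_proba_c_eps_greedy(
    eps         = eps,
    A_list      = s$A_list,      # per-arm Gram matrices at update t
    b_list      = s$b_list,      # per-arm response vectors at update t
    contexts    = contexts,      # historical contexts X_j
    chosen_arms = chosen_arms    # arms A_j actually played
  )
})

# Cram evaluation of the logged policy
result <- cram_bandit(pi = pi_hat, arm = chosen_arms, reward = rewards)
```

For the LinUCB and Thompson Sampling policies, `get_proba_ucb_disjoint()` and `get_proba_thompson()` play the same role in step 2.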
---

# 🧪 Estimand Calculation in `cram_bandit_sim()`

The following only concerns the simulation framework we implemented for benchmarking purposes.

Once the policies are reconstructed, we compute their true expected value, referred to as the estimand, by applying the learned policy to independent contexts and evaluating it against the known reward function used in the simulation.

This is done via:

```r
compute_estimand(data_group, list_betas, policy, policy_name, batch_size, bandit)
```

Accurately computing the estimand is critical for properly assessing the bias and confidence interval coverage of the Cram estimate in our simulations.

# 📂 Useful Links

- [`contextual`](https://github.com/Nth-iteration-labs/contextual) package: original framework
- `cram_bandit()`: Cram evaluation for contextual bandits
- `cram_bandit_sim()`: Full simulation engine with automatic `pi` estimation

---

# 🌟 Acknowledgments

These helper functions were designed to faithfully reconstruct action probabilities for the policies implemented in [`contextual`](https://github.com/Nth-iteration-labs/contextual), while enabling reproducible Cram-based evaluation.

```{r cleanup-autograph, include=FALSE}
autograph_files <- list.files(tempdir(), pattern = "^__autograph_generated_file.*\\.py$", full.names = TRUE)
if (length(autograph_files) > 0) {
  try(unlink(autograph_files, recursive = TRUE, force = TRUE), silent = TRUE)
}
```