Cram Bandit Simulation

This vignette demonstrates the simulation capabilities included in the cramR package. The simulation code is primarily intended for reproducing experimental results from the associated theoretical papers and for validating the performance of the Cram method under controlled data-generating processes. While not intended for direct use in practical applications, these simulations allow users to benchmark and understand the empirical behavior of the method in synthetic environments.

What is cram_bandit_sim()?

The cram_bandit_sim() function runs on-policy simulations for contextual bandit algorithms using the Cram method. It evaluates the statistical properties of the resulting policy value estimates, such as:

- the prediction error of the estimate relative to the true policy value (estimand)
- the variance estimate and associated standard error
- the empirical coverage of the confidence intervals

This is useful for benchmarking bandit policies under controlled, simulated environments.


📋 Inputs

You need to provide:

- horizon: the number of time steps per simulation
- simulations: the number of Monte Carlo replications
- bandit: a bandit environment object (e.g., cramR::ContextualLinearBandit)
- policy: a policy object (e.g., cramR::BatchContextualEpsilonGreedyPolicy)
- alpha: the significance level for the confidence intervals
- do_parallel: whether to run the simulations in parallel

Example: Cram Bandit Simulation


# Number of time steps
horizon       <- 500L

# Number of simulations 
simulations   <- 100L

# Number of arms
k <- 4L

# Number of context features
d <- 3L

# Reward parameters of the linear outcome models (one linear model per arm, with arm-specific coefficients betas)
list_betas <- cramR::get_betas(simulations, d, k)

# Define the contextual linear bandit, where sigma is the scale of the noise in the outcome linear model
bandit        <- cramR::ContextualLinearBandit$new(k = k, d = d, list_betas = list_betas, sigma = 0.3)

# Define the policy object (choose between Contextual Epsilon Greedy, UCB Disjoint and Thompson Sampling)
policy <- cramR::BatchContextualEpsilonGreedyPolicy$new(epsilon = 0.1, batch_size = 5)
# policy <- cramR::BatchLinUCBDisjointPolicyEpsilon$new(alpha = 1.0, epsilon = 0.1, batch_size = 1)
# policy <- cramR::BatchContextualLinTSPolicy$new(v = 0.1, batch_size = 1)


sim <- cram_bandit_sim(horizon, simulations,
                       bandit, policy,
                       alpha = 0.05, do_parallel = FALSE)
#> Simulation horizon: 500
#> Number of simulations: 101
#> Number of batches: 1
#> Starting main loop.
#> Finished main loop.
#> Completed simulation in 0:00:04.500
#> Computing statistics.
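For intuition, the data-generating process described above can be sketched in plain R. This is a conceptual illustration under the stated assumptions (arm-specific linear models with Gaussian noise of scale sigma), not the package's internal code, and all object names here are hypothetical:

```r
# Conceptual sketch of the outcome model: each arm a has its own coefficient
# vector beta_a, and pulling arm a in context x yields reward x' beta_a plus
# Gaussian noise of scale sigma.
set.seed(42)
d <- 3L; k <- 4L; sigma <- 0.3
betas  <- matrix(rnorm(d * k), nrow = d, ncol = k)  # one coefficient column per arm
x      <- rnorm(d)                                  # a context vector
arm    <- 2L                                        # the arm pulled at this step
reward <- sum(x * betas[, arm]) + rnorm(1, sd = sigma)
```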

(Figure: cumulative regret curve over time for the selected policy.)


What Does It Return?

The output contains:

A data.table (sim$estimates) with one row per simulation, including:

- estimate: the Cram policy value estimate
- variance_est and std_error: the estimated variance and standard error
- estimand: the true policy value
- prediction_error and est_rel_error: the absolute and relative estimation error
- ci_lower and ci_upper: the confidence interval bounds at level 1 - alpha

Result tables (raw and interactive, e.g., sim$interactive_table), reporting summary statistics such as the average prediction error and the empirical confidence interval coverage.

Example Output Preview

head(sim$estimates)
#>      sim  estimate variance_est  estimand prediction_error est_rel_error
#>    <int>     <num>        <num>     <num>            <num>         <num>
#> 1:     1 0.5934946  0.007485342 0.5008213      0.092673298    0.18504265
#> 2:     2 0.5738572  0.004488658 0.5472302      0.026626936    0.04865765
#> 3:     3 0.3082747  0.005971414 0.2685309      0.039743854    0.14800479
#> 4:     4 0.4875583  0.001785894 0.5025160     -0.014957724   -0.02976567
#> 5:     5 0.6279068  0.002812175 0.6234795      0.004427278    0.00710092
#> 6:     6 0.6756592  0.001788615 0.6278958      0.047763481    0.07606913
#>    variance_prediction_error  std_error  ci_lower  ci_upper
#>                        <num>      <num>     <num>     <num>
#> 1:                -0.1543576 0.08651787 0.4239227 0.7630665
#> 2:                -0.4929024 0.06699744 0.4425446 0.7051697
#> 3:                -0.3253907 0.07727493 0.1568186 0.4597308
#> 4:                -0.7982420 0.04225984 0.4047305 0.5703860
#> 5:                -0.6822998 0.05302995 0.5239700 0.7318436
#> 6:                -0.7979345 0.04229202 0.5927684 0.7585501
sim$interactive_table
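As a quick sanity check, the columns in the preview satisfy prediction_error = estimate - estimand and ci = estimate ± z * std_error, and empirical coverage can be computed directly from the table. A minimal sketch on a toy stand-in for sim$estimates (column names taken from the preview above):

```r
# Toy stand-in for sim$estimates, using the first three rows of the preview
est <- data.frame(
  estimate  = c(0.5935, 0.5739, 0.3083),
  estimand  = c(0.5008, 0.5472, 0.2685),
  std_error = c(0.0865, 0.0670, 0.0773)
)
z <- qnorm(1 - 0.05 / 2)                      # ~1.96 for alpha = 0.05
est$prediction_error <- est$estimate - est$estimand
est$ci_lower <- est$estimate - z * est$std_error
est$ci_upper <- est$estimate + z * est$std_error
# Empirical coverage: fraction of simulations whose CI contains the estimand
coverage <- mean(est$estimand >= est$ci_lower & est$estimand <= est$ci_upper)
```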

Notes


References

This simulation builds on: