---
title: "Introduction to CDsampling"
author: "Y.H."
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to CDsampling}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{css, echo=FALSE}
.note-box {
  border: 1px solid #ccc;
  background: white;
  padding: 10px;
  margin: 10px 0;
  border-radius: 5px;
}
```

# Table of Contents

- [Computation of Fisher information matrix](#Computation_Fisher)
  - [Example 1: GLM Fisher information matrix](#example1_GLM_Fisher)
  - [Example 2: MLM Fisher information matrix](#example2_MLM_Fisher)
- [Applications in paid research studies](#Applications)
  - [Example 3: trial_data & constrained sampling with GLM](#example3_trial_data)
  - [Example 4: trauma_data & constrained sampling with MLM](#example4_trauma_data)
- [References](#References)

```{r setup}
library(CDsampling)
```

In paid research studies and clinical trials, budgets and the sampling of patients from available populations are subject to inherent constraints. **CDsampling** integrates optimal design theory within the framework of constrained sampling.

* The package finds both approximate and exact D-optimal allocations for sampling with or without constraints.
* It provides functions to find constrained uniform sampling allocations as a robust sampling strategy when model information is limited.
* It also provides tools for computing the Fisher information matrix of generalized linear models (GLMs), including the regular linear regression model, and of multinomial logistic models (MLMs).
* Two datasets are embedded in the package for application examples.
# Computation of Fisher information matrix

## Example 1: GLM Fisher information matrix

Consider a research study with a simple logistic regression model
$$\log\left(\frac{\mu_i}{1-\mu_i}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}$$
where $\mu_i = E(Y_i\mid {\mathbf x}_i)$, ${\mathbf x}_i = (x_{i1}, x_{i2})^\top \in \{(-1, -1), (-1, +1), (+1, -1)\}$ and parameters $\boldsymbol \beta = (\beta_0, \beta_1, \beta_2)^\top = (0.5, 0.5, 0.5)^\top$.

In this example, we have $m=3$ design points, so the design matrix $\mathbf X$ (with the first column for the intercept) is
$$\mathbf X = \begin{bmatrix} 1 & -1 & -1 \\ 1 & -1 & 1 \\ 1 & 1 & -1\\ \end{bmatrix}.$$

To calculate the Fisher information matrix of the design under a GLM, we can use *F_func_GLM( )* in the package with the approximate allocation $\mathbf w$, coefficients $\boldsymbol \beta$, and design matrix $\mathbf X$ as input.

```{r}
beta = c(0.5, 0.5, 0.5) #coefficients
X = matrix(data=c(1,-1,-1,1,-1,1,1,1,-1), byrow=TRUE, nrow=3) #design matrix
w = c(1/3,1/3,1/3) #approximate allocation
CDsampling::F_func_GLM(w=w, beta=beta, X=X, link='logit')
```

## Example 2: MLM Fisher information matrix

Consider a research study with a cumulative non-proportional odds multinomial logit model with $J=5$ response levels and covariates $(x_{i1}, x_{i2}) \in \{(1,0),(2,0),(3,0), (4,0),(1,1),(2,1),(3,1),(4,1)\}$. The model can be written as:
$$\log\left(\frac{\pi_{i1}}{\pi_{i2}+\dots+\pi_{i5}}\right) = \beta_{11}+\beta_{12}x_{i1}+\beta_{13}x_{i2}$$
$$\log\left(\frac{\pi_{i1}+\pi_{i2}}{\pi_{i3}+\pi_{i4}+\pi_{i5}}\right) = \beta_{21}+\beta_{22}x_{i1}+\beta_{23}x_{i2}$$
$$\log\left(\frac{\pi_{i1}+\pi_{i2}+\pi_{i3}}{\pi_{i4}+\pi_{i5}}\right) = \beta_{31}+\beta_{32}x_{i1}+\beta_{33}x_{i2}$$
$$\log\left(\frac{\pi_{i1}+\dots+\pi_{i4}}{\pi_{i5}}\right) = \beta_{41}+\beta_{42}x_{i1}+\beta_{43}x_{i2}$$
where $i=1,\dots,8$. We have $m=8$ design points and $p=12$ parameters.
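
As a short worked step, added here for illustration: the symbols $d$ (number of covariates) and $\eta_{ij}$ (right-hand side of the $j$-th equation) are notation introduced only for this remark. Each of the $J-1=4$ equations has one intercept and $d=2$ slopes, which gives the parameter count
$$p = (J-1)(d+1) = 4 \times 3 = 12.$$
Inverting each cumulative logit gives the cumulative probabilities, and the category probabilities follow by differencing:
$$\pi_{i1}+\dots+\pi_{ij} = \frac{\exp(\eta_{ij})}{1+\exp(\eta_{ij})}, \quad j=1,\dots,4,$$
$$\pi_{i1} = \frac{\exp(\eta_{i1})}{1+\exp(\eta_{i1})}, \quad \pi_{ij} = \frac{\exp(\eta_{ij})}{1+\exp(\eta_{ij})} - \frac{\exp(\eta_{i,j-1})}{1+\exp(\eta_{i,j-1})} \ (j=2,3,4), \quad \pi_{i5} = 1 - \frac{\exp(\eta_{i4})}{1+\exp(\eta_{i4})}.$$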

We assume the parameters $\boldsymbol \beta = (\beta_{11}, \beta_{12}, \beta_{13}, \beta_{21}, \beta_{22}, \beta_{23}, \beta_{31}, \beta_{32}, \beta_{33}, \beta_{41}, \beta_{42}, \beta_{43})^\top = (-4.047, -0.131, 4.214, -2.225, -0.376, 3.519, -0.302, -0.237, 2.420, 1.386, -0.120, 1.284)^\top$. The approximate allocation for the eight design points is $\mathbf w = (1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8)^\top$.

To calculate the Fisher information matrix under an MLM, we can use *F_func_MLM( )* with the approximate allocation $\mathbf w$, coefficients $\boldsymbol \beta$, design matrix $\mathbf X$, and the type of multinomial logit model (cumulative for this example) as input.

The design matrix is stored as a $J \times p \times m$ array with one $5 \times 12$ slice for each of the $8$ design points $(x_{i1}, x_{i2})$; the first four rows of a slice correspond to the four equations of the cumulative logit model. For example, when $x_{i1}=1$, $x_{i2}=0$, the slice takes the following form:
$$\begin{bmatrix}
1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\end{bmatrix}.$$
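
The block pattern above can also be generated programmatically. The following is a minimal sketch in base R; `build_Xi_cumulative` is a hypothetical helper written for this illustration and is not part of **CDsampling**. It assumes that every equation has an intercept followed by the same covariates, as in this example.

```{r}
# Hypothetical helper, not part of CDsampling: builds the J x p design
# matrix of one design point for a cumulative non-proportional odds model.
build_Xi_cumulative <- function(x, J) {
  d <- length(x) + 1                      # intercept plus covariates per equation
  Xi_one <- matrix(0, nrow = J, ncol = (J - 1) * d)
  for (j in seq_len(J - 1)) {
    cols <- ((j - 1) * d + 1):(j * d)     # columns holding the j-th equation's parameters
    Xi_one[j, cols] <- c(1, x)
  }
  Xi_one                                  # the last row stays zero, as in the matrix above
}

# Design point (x_{i1}, x_{i2}) = (1, 0) with J = 5 response levels
build_Xi_cumulative(c(1, 0), J = 5)
```

Applying such a helper to each of the eight design points and stacking the results along a third dimension would give an array with the same $5 \times 12 \times 8$ shape as the one defined by hand in the vignette.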

```{r}
J=5; p=12; m=8; #response levels; num of parameters; num of design points
beta = c(-4.047, -0.131, 4.214, -2.225, -0.376, 3.519,
         -0.302, -0.237, 2.420, 1.386, -0.120, 1.284)
Xi=rep(0,J*p*m) #design matrix
dim(Xi)=c(J,p,m)
Xi[,,1] = rbind(c( 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
                c( 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0),
                c( 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0),
                c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0),
                c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0))
Xi[,,2] = rbind(c( 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
                c( 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0),
                c( 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0),
                c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0),
                c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0))
Xi[,,3] = rbind(c( 1, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
                c( 0, 0, 0, 1, 3, 0, 0, 0, 0, 0, 0, 0),
                c( 0, 0, 0, 0, 0, 0, 1, 3, 0, 0, 0, 0),
                c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 0),
                c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0))
Xi[,,4] = rbind(c( 1, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
                c( 0, 0, 0, 1, 4, 0, 0, 0, 0, 0, 0, 0),
                c( 0, 0, 0, 0, 0, 0, 1, 4, 0, 0, 0, 0),
                c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 4, 0),
                c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0))
Xi[,,5] = rbind(c( 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
                c( 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0),
                c( 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0),
                c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1),
                c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0))
Xi[,,6] = rbind(c( 1, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
                c( 0, 0, 0, 1, 2, 1, 0, 0, 0, 0, 0, 0),
                c( 0, 0, 0, 0, 0, 0, 1, 2, 1, 0, 0, 0),
                c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 1),
                c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0))
Xi[,,7] = rbind(c( 1, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
                c( 0, 0, 0, 1, 3, 1, 0, 0, 0, 0, 0, 0),
                c( 0, 0, 0, 0, 0, 0, 1, 3, 1, 0, 0, 0),
                c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 1),
                c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0))
Xi[,,8] = rbind(c( 1, 4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
                c( 0, 0, 0, 1, 4, 1, 0, 0, 0, 0, 0, 0),
                c( 0, 0, 0, 0, 0, 0, 1, 4, 1, 0, 0, 0),
                c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 4, 1),
                c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0))
alloc = rep(1/8,m) #approximate allocation
CDsampling::F_func_MLM(w=alloc, beta=beta, X=Xi, link='cumulative')
```

# Applications in paid research studies

The **CDsampling** package addresses constrained sampling problems in paid research studies or clinical trials where the number of qualified volunteers exceeds what the available budget can cover. The package implements the designer’s sampling strategy and uses the constrained lift-one algorithm to find the optimal sample sizes for different subgroups, with the goal of achieving the most accurate model estimates.

## Example 3: trial_data & constrained sampling with GLM

The *trial_data* is a simulated dataset containing gender, age, and final efficacy information for $N=500$ volunteers. The covariates considered in this example are:

1). **Gender:**

   - $0$: female
   - $1$: male

2). **Age group:**

   - $age\_1 = 0$ and $age\_2 = 0$: ages $18\sim25$
   - $age\_1 = 1$ and $age\_2 = 0$: ages $26\sim64$
   - $age\_1 = 0$ and $age\_2 = 1$: ages $65$ and above.

There are $m=6$ design points, that is, the number of combinations of gender and age groups $(x_{gender\_i}, x_{age\_1i}, x_{age\_2i})$:

1). $(0,0,0)$: Female, 18-25 ($N_1=50$)

2). $(0,1,0)$: Female, 26-64 ($N_2=40$)

3). $(0,0,1)$: Female, 65+ ($N_3=10$)

4). $(1,0,0)$: Male, 18-25 ($N_4=200$)

5). $(1,1,0)$: Male, 26-64 ($N_5=150$)

6). 
$(1,0,1)$: Male, 65+ ($N_6=50$)

Suppose that a sample of $n=200$ participants is required due to the budget limit. Our goal is to find the constrained D-optimal allocation $(w_1, w_2, \dots, w_6)$ within the set of feasible allocations
$$S = \{(w_1, \ldots, w_m)^\top \in S_0 \mid n w_i \leq N_i, i=1, \ldots, m\}.$$

We use the constrained lift-one algorithm *liftone_constrained_GLM( )* to find the locally D-optimal approximate sampling allocation. Its inputs are the design matrix $\mathbf X$, the $\mathbf W$ matrix returned by *W_func_GLM( )*, the constraint setup (*g.con*, *g.dir*, and *g.rhs*), and the bounds used when searching for the lift-one weight (step 3 in the constrained lift-one algorithm, see reference).

We consider the logistic regression model for $j=1,\dots,m$, $i=1,\dots,n_j$ with $\boldsymbol \beta=(\beta_0, \beta_{1}, \beta_{21}, \beta_{22})^\top = (0,3,3,3)^\top$:

\begin{equation}\label{eq:trial_logistic_model}
{\rm logit} \{P(Y_{ij}=1 \mid x_{gender\_i}, x_{age\_1i}, x_{age\_2i})\} = \beta_0 + \beta_1 x_{gender\_i} + \beta_{21} x_{age\_1i} + \beta_{22} x_{age\_2i}
\end{equation}

Use the following R code to define the coefficients, sample size, and design matrix:

```{r}
beta = c(0, 3, 3, 3) #coefficients
#design matrix X
X=matrix(data=c(1,0,0,0,1,0,1,0,1,0,0,1,1,1,0,0,1,1,1,0,1,1,0,1), ncol=4, byrow=TRUE)
nsample=200 #sample size
```

To run *liftone_constrained_GLM( )*, we also need the $\mathbf W$ matrix from the calculation of the Fisher information matrix, which we can obtain with *W_func_GLM( )* in the package:

```{r}
W_matrix=CDsampling::W_func_GLM(X=X, b=beta, link="logit") #W matrix
```

Lastly, we need to define the constraints (the number of available patients in each gender and age group) and the bounds for constrained sampling (please see the reference for details of the lower bound $r_{i1}$ and upper bound $r_{i2}$):

```{r}
rc = c(50, 40, 10, 200, 150, 50)/nsample #constraints for each subgroup
m = 6
g.con = matrix(0,nrow=(2*m+1), ncol=m)
g.con[1,] = rep(1, m)
g.con[2:(m+1),] = diag(m)
g.con[(m+2):(2*m+1), ] = diag(m)
g.dir = c("==", rep("<=", m), rep(">=", m))
g.rhs = c(1, rc, rep(0, m))

lower.bound=function(i, w){
  nsample = 200
  rc = c(50, 40, 10, 200, 150, 50)/nsample
  m=length(w) #num of categories
  temp = rep(0,m)
  temp[w>0]=1-pmin(1,rc[w>0])*(1-w[i])/w[w>0];
  temp[i]=0;
  max(0,temp);
}

upper.bound=function(i, w){
  nsample = 200
  rc = c(50, 40, 10, 200, 150, 50)/nsample
  m=length(w) #num of categories
  rc[i];
  min(1,rc[i])
}
```

We can define optional subgroup labels for the constrained approximate allocation:

```{r}
label = c("F, 18-25", "F, 26-64", "F, >=65", "M, 18-25", "M, 26-64", "M, >=65")
```

Now we can run the constrained lift-one algorithm to find the optimal approximate allocation for the defined subgroups:

```{r}
set.seed(2025)
approximate_design = CDsampling::liftone_constrained_GLM(X=X, W=W_matrix,
  g.con=g.con, g.dir=g.dir, g.rhs=g.rhs,
  lower.bound=lower.bound, upper.bound=upper.bound,
  label=label, reltol=1e-10, maxit=100, random=TRUE, nram=4,
  w00=NULL, epsilon=1e-8)
print(approximate_design)
```

The output contains several key components:

- $w$: the converged D-optimal approximate allocation
- $w_0$: the random initial weights used for the optimization
- *maximum*: the maximum determinant of the Fisher information matrix that was achieved
- *reason*: the criterion that terminated the lift-one loop, either "all derivative <=0" or "gmax <=0" (see reference).

To find the exact allocation (integer values of the allocation), we can use *approxtoexact_constrained_func( )* with the sample size $n$, the approximate allocation found by the constrained lift-one algorithm, the number of design points $m$, the coefficients *beta*, the link type of the GLM ("logit" in this example), **Fdet_func=Fdet_func_GLM**, and the design matrix $\mathbf X$ as input:

```{r}
exact_design = CDsampling::approxtoexact_constrained_func(n=200,
  w=approximate_design$w, m=6, beta=beta, link='logit', X=X,
  Fdet_func=Fdet_func_GLM, iset_func=iset_func_trial, label=label)
print(exact_design)
```

The *allocation* component is the final D-optimal exact allocation of the constrained sampling, while *allocation.real* is the approximate allocation found in the previous step.
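
A quick feasibility check can make the constraints concrete. The sketch below is an illustration only: `check_exact_allocation` is a hypothetical helper, not a **CDsampling** function, and it assumes that the exact allocation is stored in the *allocation* component described above. It tests that the subgroup sample sizes are whole numbers, add up to $n=200$, and do not exceed the available numbers $N_i$.

```{r}
# Hypothetical helper, not part of CDsampling: checks an exact allocation
# against the sample size n and the subgroup supplies N.
check_exact_allocation <- function(alloc, n, N) {
  c(whole_numbers = all(abs(alloc - round(alloc)) < 1e-8),
    sums_to_n     = isTRUE(all.equal(sum(alloc), n)),
    within_supply = all(alloc <= N))
}

N_avail <- c(50, 40, 10, 200, 150, 50)  # available volunteers N_1, ..., N_6

# Assumes exact_design was created by the previous chunk and has an
# `allocation` component; skipped otherwise.
if (exists("exact_design")) {
  check_exact_allocation(as.numeric(exact_design$allocation), n = 200, N = N_avail)
}
```

If any entry is `FALSE`, the allocation would violate the corresponding requirement and the inputs to the previous steps deserve a second look.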