---
title: "`allocation`: Exact Optimal Allocation Algorithms for Stratified Sampling"
author:
  - name: Andrew M. Raim
    email: andrew.raim@census.gov
    affiliations:
      - name: U.S. Census Bureau
        department: Center for Statistical Research and Methodology
        address: 4600 Silver Hill Road
        city: Washington DC
        country: U.S.A.
format:
  pdf:
    fontsize: 10pt
    indent: false
    toc: true
    number-sections: true
    colorlinks: true
    link-citations: true
    prompt: false
    include-in-header:
      text: |
        \usepackage{common}
        \usepackage{algorithm}
        \usepackage{algorithmicx}
        \usepackage{algpseudocode}
vignette: >
  %\VignetteIndexEntry{allocation}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{quarto::pdf}
bibliography: references.bib
editor_options:
  chunk_output_type: console
abstract: |
  When sampling from a finite population, the $N$ units are often partitioned
  into strata based on predetermined criteria. The survey designer must
  determine the number of units to sample from each stratum. The `allocation`
  package implements several algorithms from @Wright2012 and @Wright2017 which
  reconsider Neyman's classic method of allocating a given sample size $n$
  among such strata [@Neyman1934]. These algorithms provide optimal
  integer-valued solutions to minimize the variance of an estimator for the
  population total.
thanks: |
  Document was compiled `{r} format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z")` and
  corresponds to `allocation` version `{r} packageVersion("allocation")`.
  **Contact**`:` , Center for Statistical Research & Methodology,
  U.S. Census Bureau, Washington, DC, 20233, U.S.A.
geometry:
  - left=0.75in
  - right=0.75in
  - top=0.75in
  - bottom=1.00in
code-block-bg: "#FAFAFA"
code-block-border-left: "#008CFF"
callout-icon: false
filters:
  - include-code-files
execute:
  eval: true
---

```{r}
#| include: false
library(allocation)
set.seed(1234)
```

# Disclaimer and Acknowledgments {-}

This document is released to inform interested parties of ongoing research and to encourage discussion of work in progress. Any views expressed are those of the author and not those of the U.S. Census Bureau.

Thanks to Tommy Wright (U.S. Census Bureau) for discussions which prompted the development of this package, and for providing a review of the materials.

Although there are no guarantees of correctness of the `allocation` package, reasonable efforts will be made to address shortcomings. Comments, questions, corrections, and possible improvements can be communicated through the GitHub repository for the package ().

# Introduction {#sec-intro}

Suppose there are $N$ units in a population which is partitioned into $H$ strata of known sizes $N_1, \ldots, N_H$. A sample consisting of $n_1, \ldots, n_H$ units will be taken from the corresponding strata, with $n_h \geq 1$ for all $h$, so that the overall sample size is $n = \sum_{h=1}^H n_h$. Denote the population mean and variance for the $h$th stratum as
$$
\bar{Y}_h = \frac{1}{N_h} \sum_{j=1}^{N_h} Y_{hj}
\quad \text{and} \quad
S_h^2 = \frac{1}{N_h - 1} \sum_{j=1}^{N_h} (Y_{hj} - \bar{Y}_h)^2,
$$
where $Y_{hj}$ is the value of the variable of interest for the $j$th unit in stratum $h$. To estimate the population total $T_Y = \sum_{h=1}^H N_h \bar{Y}_h$ from the sampled units, consider the estimator
$$
\hat{T}_Y = \sum_{h=1}^H N_h \bar{y}_h,
$$
where $\bar{y}_h$ is the sample mean from the $h$th stratum.
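For a quick illustration of this estimator, the following toy example (constructed for this vignette and not taken from @Wright2017) simulates a small stratified population and computes $\hat{T}_Y$ under an arbitrary allocation.

```{r}
# Toy illustration: simulate a small stratified population and compute the
# estimator of the population total under an arbitrary allocation.
Nh = c(40, 25, 10)                                # stratum sizes N_1, ..., N_H
pop = lapply(Nh, function(N) rnorm(N, mean = 10)) # population values Y_{hj}
nh = c(5, 4, 2)                                   # an allocation n_1, ..., n_H
ybar = mapply(function(y, n) mean(sample(y, n)), pop, nh)
sum(Nh * ybar)          # estimate of T_Y
sum(sapply(pop, sum))   # true total T_Y, for comparison
```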
The variance of $\hat{T}_Y$ is
$$
\text{Var}(\hat{T}_Y) = \sum_{h=1}^H \frac{N_h}{n_h} (N_h - n_h) S_h^2.
$$ {#eq-var-total}
Neyman's allocation method minimizes ([-@eq-var-total]) with respect to $n_1, \ldots, n_H$ as the optimization variables, subject to the constraint $n = \sum_{h=1}^H n_h$ for a given $n$. Regarding the variables $n_1, \ldots, n_H$ as real numbers, Lagrange's method can be used to obtain the solution
$$
n_h = n \frac{N_h S_h}{\sum_{\ell=1}^H N_{\ell} S_{\ell}}, \quad h = 1, \ldots, H.
$$
Rounding is then used to obtain the integer-valued $n_1, \ldots, n_H$ which are needed in practice; however, rounding may not yield an optimal integer solution. The allocation methods in @Wright2012 and @Wright2017 address this by directly obtaining integer solutions. Exploiting the structure of ([-@eq-var-total]), units are iteratively placed into strata to yield an optimal integer solution. Additionally, these methods support real-valued bounds $a_h$ and $b_h$ such that $a_h \leq n_h \leq b_h$, with $a_h > 0$ and $b_h \leq N_h$.

We consider two methods in particular. Algorithm III of @Wright2017 assumes a target sample size $n_0$ and finds an optimal allocation such that $\sum_{h=1}^H n_h = n_0$. Algorithm IV of @Wright2017 assumes a target variance $V_0$ and finds an optimal allocation with the smallest overall sample size $\sum_{h=1}^H n_h$ such that ([-@eq-var-total]) is no larger than $V_0$. Algorithms III and IV of @Wright2017 are summarized here as Algorithms \ref{alg:fixn} and \ref{alg:prec}, respectively. See @Wright2017 for further details.

The remainder of the vignette proceeds as follows. @sec-overview gives an overview of the `allocation` package and its interface. @sec-examples demonstrates use of the package on several examples from @Wright2017.

\algblock{Inputs}{EndInputs}
\algblock{Outputs}{EndOutputs}
\algrenewtext{Inputs}{\textbf{inputs}}
\algrenewtext{EndInputs}{\textbf{end inputs}}

\begin{algorithm}
\caption{Optimal allocation for a fixed overall sample size.}
\label{alg:fixn}
\begin{algorithmic}[1]
\Inputs
  \State $n_0$: desired overall sample size.
  \State $N_1, \ldots, N_H$: population sizes.
  \State $S_1, \ldots, S_H$: standard deviations.
  \State $a_1, \ldots, a_H$: lower bounds.
  \State $b_1, \ldots, b_H$: upper bounds.
\EndInputs
\State Let $n_h = a_h$ for $h = 1, \ldots, H$.
\While{$\sum_{h=1}^H n_h < n_0$}
  \State $P_h \leftarrow N_h S_h / \sqrt{n_h (n_h + 1)}$ if $n_h + 1 \leq b_h$, $P_h \leftarrow 0$ otherwise, for $h = 1, \ldots, H$.
  \State $h \leftarrow \argmax (P_1, \ldots, P_H)$.
  \State $n_h \leftarrow n_h + 1$.
\EndWhile
\State \Return $n_1, \ldots, n_H$.
\end{algorithmic}
\end{algorithm}

\begin{algorithm}
\caption{Optimal allocation for a desired precision.}
\label{alg:prec}
\begin{algorithmic}[1]
\Inputs
  \State $V_0$: desired variance target.
  \State $N_1, \ldots, N_H$: population sizes.
  \State $S_1, \ldots, S_H$: standard deviations.
  \State $a_1, \ldots, a_H$: lower bounds.
  \State $b_1, \ldots, b_H$: upper bounds.
\EndInputs
\State Let $n_h = a_h$ for $h = 1, \ldots, H$.
\State Let $V = \sum_{h=1}^H N_h (N_h - n_h) S_h^2 / n_h$.
\While{$V > V_0$ and $\sum_{h=1}^H n_h < \sum_{h=1}^H N_h$}
  \State $P_h \leftarrow N_h S_h / \sqrt{n_h (n_h + 1)}$ if $n_h + 1 \leq b_h$, $P_h \leftarrow 0$ otherwise, for $h = 1, \ldots, H$.
  \State $h \leftarrow \argmax (P_1, \ldots, P_H)$.
  \State $n_h \leftarrow n_h + 1$.
  \State $V \leftarrow \sum_{h=1}^H N_h (N_h - n_h) S_h^2 / n_h$.
\EndWhile
\State \Return $n_1, \ldots, n_H$.
\end{algorithmic}
\end{algorithm}
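To make the greedy selection step concrete, the following is a minimal R sketch of Algorithm \ref{alg:fixn}. It is written for intuition only; the function `fixn_sketch` and its default bounds are purely illustrative, and `allocate_fixn` in the package (described in @sec-overview) is the supported implementation. The call shown uses the same inputs as the first example in @sec-examples.

```{r}
# Minimal sketch of the fixed-n algorithm, for intuition only; not the
# package implementation.
fixn_sketch = function(n0, N, S, lo = rep(1, length(N)), hi = N) {
  stopifnot(sum(lo) <= n0, n0 <= sum(hi))
  n = lo
  while (sum(n) < n0) {
    # Priority of adding one more unit to each stratum; zero if at the bound.
    P = ifelse(n + 1 <= hi, N * S / sqrt(n * (n + 1)), 0)
    h = which.max(P)
    n[h] = n[h] + 1
  }
  n
}
fixn_sketch(n0 = 10, N = c(47, 61, 41), S = sqrt(c(100, 36, 16)),
  lo = c(1, 2, 3), hi = c(5, 6, 4))
```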
# Overview of the Package {#sec-overview}

The `allocation` package makes use of the `Rmpfr` package to handle very large numbers while avoiding loss of precision. Furthermore, users may consider encoding such values with `Rmpfr` rather than as standard floating point numbers, especially for quantities such as target variances which may be very large.

```{r}
#| message: false
#| warning: false
library(Rmpfr)
```

The following functions implement Neyman's allocation method, Algorithm \ref{alg:fixn}, and Algorithm \ref{alg:prec}, respectively.

```{r}
#| echo: false
print_interface = function(f) {
  fname = deparse(substitute(f))
  cat(fname, "=\n")
  str(f, give.attr = F, nchar.max = 4096, width = 80)
}
```

```{r}
#| echo: false
print_interface(allocate_neyman)
print_interface(allocate_fixn)
print_interface(allocate_prec)
```

The arguments are as follows:

- `n0`: target sample size $n_0$,
- `v0`: target variance $V_0$,
- `N`: the vector $(N_1, \ldots, N_H)$,
- `S`: the vector $(S_1, \ldots, S_H)$,
- `lo`: the vector $(a_1, \ldots, a_H)$,
- `hi`: the vector $(b_1, \ldots, b_H)$.

The argument `control` contains additional arguments and can be created with the following function. See its manual page for further information.

```{r}
print_interface(allocation_control)
```

Several accessors are provided to operate on results from the allocation methods.

```{r}
#| eval: false
out = allocate_fixn(n0, N, S)
allocation(out)  ## Extract allocation (n[1], ..., n[H]).
print(out)       ## Print table with allocation and other information.
```

# Examples {#sec-examples}

## Allocation for Fixed Overall Sample Size

Here we demonstrate Algorithm \ref{alg:fixn} using an example from @Wright2017.

```{r}
N = c(47, 61, 41)
S = sqrt(c(100, 36, 16))
lo = c(1, 2, 3)
hi = c(5, 6, 4)
n0 = 10
out1 = allocate_fixn(n0, N, S, lo, hi)
print(out1)
```

Note that the row labels are the stratum indices $1, \ldots, H$. The columns `lo`, `hi`, and `n` correspond to the vectors $a_1, \ldots, a_H$, $b_1, \ldots, b_H$, and $n_1, \ldots, n_H$, respectively. To see details justifying each selection, run `allocate_fixn` with the `verbose` option enabled.

```{r}
#| eval: false
out1 = allocate_fixn(n0, N, S, lo, hi,
  control = allocation_control(verbose = TRUE))
```

Let us compare the above results to Neyman allocation.

```{r}
out2 = allocate_neyman(n0, N, S)
print(out2)
```

The number of decimal places in the output can be changed using the control object.

```{r}
print(out2, control = allocation_control(digits = 2))
```

Extract the allocation as a numeric vector using the `allocation` accessor function.

```{r}
allocation(out1)  ## allocate_fixn result
allocation(out2)  ## allocate_neyman result
```

## Allocation for a Desired Precision

Run Algorithm \ref{alg:prec} using an example from @Wright2017. Since our target variance `v0` is a very large number, we pass it as an `mpfr` object to avoid loss of precision.

```{r}
H = 10
v0 = mpfr(388910760, 256)^2
N = c(819, 672, 358, 196, 135, 83, 53, 40, 35, 13)
lo = c(3, 3, 3, 3, 3, 3, 3, 3, 3, 13)
S = c(330000, 518000, 488000, 634000, 1126000,
      2244000, 2468000, 5869000, 29334000, 1233311000)
print(data.frame(N, S, lo))
out1 = allocate_prec(v0, N, S, lo)
print(out1)
```

To see details justifying each selection, we can run `allocate_prec` with the `verbose` option enabled.

```{r}
#| eval: false
out1 = allocate_prec(v0, N, S, lo,
  control = allocation_control(verbose = TRUE))
```
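As a quick check (this directly evaluates ([-@eq-var-total]) and is not a function provided by the package), we can verify that the variance achieved by the returned allocation does not exceed the target `v0`.

```{r}
# Evaluate the variance formula at the returned allocation; mpfr arithmetic
# is used since the individual terms are very large.
n1 = allocation(out1)
V1 = sum(mpfr(N, 256) * (N - n1) * mpfr(S, 256)^2 / n1)
V1 <= v0
```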
Compare the above results to Neyman allocation. Here, we first need to compute a target sample size. This is done using a given coefficient of variation (`cv`) and total revenue; see @Wright2017 for details. We also exclude the 10th stratum from the allocation procedure, as it is a certainty stratum; its allocation is considered fixed at 13.

```{r}
cv = 0.042
rev = mpfr(9259780000, 256)
n = sum(N[-10] * S[-10])^2 / ((cv * rev)^2 + sum(N[-10] * S[-10]^2))
out2 = allocate_neyman(n, N[-10], S[-10])
print(out2)
```

Extract the final allocations.

```{r}
allocation(out1)  ## allocate_prec result
allocation(out2)  ## allocate_neyman result
```

# References {-}

::: {#refs}
:::