---
title: "`allocation`: Exact Optimal Allocation Algorithms for Stratified Sampling"
author:
  - name: Andrew M. Raim
    email: andrew.raim@census.gov
    affiliations:
      - name: U.S. Census Bureau
        department: Center for Statistical Research and Methodology
        address: 4600 Silver Hill Road
        city: Washington DC
        country: U.S.A.
format:
  pdf:
    fontsize: 10pt
    indent: false
    toc: true
    number-sections: true
    colorlinks: true
    link-citations: true
    prompt: false
    include-in-header:
      text: |
        \usepackage{common}
        \usepackage{algorithm}
        \usepackage{algorithmicx}
        \usepackage{algpseudocode}
vignette: >
  %\VignetteIndexEntry{allocation}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{quarto::pdf}
bibliography: references.bib
editor_options:
  chunk_output_type: console
abstract: |
  When sampling from a finite population, the $N$ units are often partitioned
  into strata based on predetermined criteria. The survey designer must
  determine the number of units to sample from each stratum. The `allocation`
  package implements several algorithms from @Wright2012 and @Wright2017 which
  reconsider Neyman's classic method of allocating a given sample size $n$
  among such strata [@Neyman1934]. These algorithms provide optimal
  integer-valued solutions to minimize the variance of an estimator for the
  population total.
thanks: |
  Document was compiled `{r} format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z")` and
  corresponds to `allocation` version `{r} packageVersion("allocation")`.
  **Contact**`:` , Center for Statistical Research & Methodology,
  U.S. Census Bureau, Washington, DC, 20233, U.S.A.
geometry:
  - left=0.75in
  - right=0.75in
  - top=0.75in
  - bottom=1.00in
code-block-bg: "#FAFAFA"
code-block-border-left: "#008CFF"
callout-icon: false
filters:
  - include-code-files
execute:
  eval: true
---

```{r}
#| include: false
library(allocation)
set.seed(1234)
```

# Disclaimer and Acknowledgments {-}

This document is released to inform interested parties of ongoing research and to encourage discussion of work in progress. Any views expressed are those of the author and not those of the U.S. Census Bureau.

Thanks to Tommy Wright (U.S. Census Bureau) for discussions which prompted the development of this package, and for providing a review of the materials.

Although there are no guarantees of correctness of the `allocation` package, reasonable efforts will be made to address shortcomings. Comments, questions, corrections, and possible improvements can be communicated through the GitHub repository for the package ().

# Introduction {#sec-intro}

Suppose there are $N$ units in a population which is partitioned into $H$ strata of known sizes $N_1, \ldots, N_H$. A sample consisting of $n_1, \ldots, n_H$ units will be taken from the corresponding strata, with $n_h \geq 1$ for all $h$, so that the overall sample size is $n = \sum_{h=1}^H n_h$. Denote the population mean and variance for the $h$th stratum as
$$
\bar{Y}_h = \frac{1}{N_h} \sum_{j=1}^{N_h} Y_{hj}
\quad \text{and} \quad
S_h^2 = \frac{1}{N_h - 1} \sum_{j=1}^{N_h} (Y_{hj} - \bar{Y}_h)^2,
$$
where $Y_{hj}$ is the value of the variable of interest for the $j$th unit in stratum $h$. To estimate the population total $T_Y = \sum_{h=1}^H N_h \bar{Y}_h$ from the sampled units, consider the estimator
$$
\hat{T}_Y = \sum_{h=1}^H N_h \bar{y}_h,
$$
where $\bar{y}_h$ is the sample mean from the $h$th stratum.
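For a quick illustration of this estimator, the following toy example (constructed for this vignette and not taken from @Wright2017) simulates a small stratified population and computes $\hat{T}_Y$ under an arbitrary allocation.

```{r}
# Toy illustration: simulate a small stratified population and compute the
# estimator of the population total under an arbitrary allocation.
Nh = c(40, 25, 10)                                # stratum sizes N_1, ..., N_H
pop = lapply(Nh, function(N) rnorm(N, mean = 10)) # population values Y_{hj}
nh = c(5, 4, 2)                                   # an allocation n_1, ..., n_H
ybar = mapply(function(y, n) mean(sample(y, n)), pop, nh)
sum(Nh * ybar)          # estimate of T_Y
sum(sapply(pop, sum))   # true total T_Y, for comparison
```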
The variance of $\hat{T}_Y$ is
$$
\text{Var}(\hat{T}_Y) = \sum_{h=1}^H \frac{N_h}{n_h} (N_h - n_h) S_h^2.
$$ {#eq-var-total}
Neyman's allocation method minimizes ([-@eq-var-total]) with respect to $n_1, \ldots, n_H$ as the optimization variables, subject to the constraint $n = \sum_{h=1}^H n_h$ for a given $n$. Regarding the variables $n_1, \ldots, n_H$ as real numbers, Lagrange's method can be used to obtain the solution
$$
n_h = n \frac{N_h S_h}{\sum_{\ell=1}^H N_{\ell} S_{\ell}}, \quad h = 1, \ldots, H.
$$
Rounding is then used to obtain the integer-valued $n_1, \ldots, n_H$ which are needed in practice; however, rounding may not yield an optimal integer solution. The allocation methods in @Wright2012 and @Wright2017 address this by directly obtaining integer solutions. Exploiting the structure of ([-@eq-var-total]), units are iteratively placed into strata to yield an optimal integer solution. Additionally, these methods support real-valued bounds $a_h$ and $b_h$ such that $a_h \leq n_h \leq b_h$, with $a_h > 0$ and $b_h \leq N_h$.

We consider two methods in particular. Algorithm III of @Wright2017 assumes a target sample size $n_0$ and finds an optimal allocation such that $\sum_{h=1}^H n_h = n_0$. Algorithm IV of @Wright2017 assumes a target variance $V_0$ and finds an optimal allocation with the smallest overall sample size $\sum_{h=1}^H n_h$ such that ([-@eq-var-total]) is no larger than $V_0$. Algorithms III and IV of @Wright2017 are summarized here as Algorithms \ref{alg:fixn} and \ref{alg:prec}, respectively. See @Wright2017 for further details.

The remainder of the vignette proceeds as follows. @sec-overview gives an overview of the `allocation` package and its interface. @sec-examples demonstrates use of the package on several examples from @Wright2017.

\algblock{Inputs}{EndInputs}
\algblock{Outputs}{EndOutputs}
\algrenewtext{Inputs}{\textbf{inputs}}
\algrenewtext{EndInputs}{\textbf{end inputs}}

\begin{algorithm}
\caption{Optimal allocation for a fixed overall sample size.}
\label{alg:fixn}
\begin{algorithmic}[1]
\Inputs
  \State $n_0$: desired overall sample size.
  \State $N_1, \ldots, N_H$: population sizes.
  \State $S_1, \ldots, S_H$: standard deviations.
  \State $a_1, \ldots, a_H$: lower bounds.
  \State $b_1, \ldots, b_H$: upper bounds.
\EndInputs
\State Let $n_h = a_h$ for $h = 1, \ldots, H$.
\While{$\sum_{h=1}^H n_h < n_0$}
  \State $P_h \leftarrow N_h S_h / \sqrt{n_h (n_h + 1)}$ if $n_h + 1 \leq b_h$, $P_h \leftarrow 0$ otherwise, for $h = 1, \ldots, H$.
  \State $h \leftarrow \argmax (P_1, \ldots, P_H)$.
  \State $n_h \leftarrow n_h + 1$.
\EndWhile
\State \Return $n_1, \ldots, n_H$.
\end{algorithmic}
\end{algorithm}

\begin{algorithm}
\caption{Optimal allocation for a desired precision.}
\label{alg:prec}
\begin{algorithmic}[1]
\Inputs
  \State $V_0$: desired variance target.
  \State $N_1, \ldots, N_H$: population sizes.
  \State $S_1, \ldots, S_H$: standard deviations.
  \State $a_1, \ldots, a_H$: lower bounds.
  \State $b_1, \ldots, b_H$: upper bounds.
\EndInputs
\State Let $n_h = a_h$ for $h = 1, \ldots, H$.
\State Let $V = \sum_{h=1}^H N_h (N_h - n_h) S_h^2 / n_h$.
\While{$V > V_0$ and $\sum_{h=1}^H n_h < \sum_{h=1}^H N_h$}
  \State $P_h \leftarrow N_h S_h / \sqrt{n_h (n_h + 1)}$ if $n_h + 1 \leq b_h$, $P_h \leftarrow 0$ otherwise, for $h = 1, \ldots, H$.
  \State $h \leftarrow \argmax (P_1, \ldots, P_H)$.
  \State $n_h \leftarrow n_h + 1$.
  \State $V \leftarrow \sum_{h=1}^H N_h (N_h - n_h) S_h^2 / n_h$.
\EndWhile
\State \Return $n_1, \ldots, n_H$.
\end{algorithmic}
\end{algorithm}
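To make the greedy selection step concrete, the following is a minimal R sketch of Algorithm \ref{alg:fixn}. It is written for intuition only; the function `fixn_sketch` and its default bounds are purely illustrative, and `allocate_fixn` in the package (described in @sec-overview) is the supported implementation. The call shown uses the same inputs as the first example in @sec-examples.

```{r}
# Minimal sketch of the fixed-n algorithm, for intuition only; not the
# package implementation.
fixn_sketch = function(n0, N, S, lo = rep(1, length(N)), hi = N) {
  stopifnot(sum(lo) <= n0, n0 <= sum(hi))
  n = lo
  while (sum(n) < n0) {
    # Priority of adding one more unit to each stratum; zero if at the bound.
    P = ifelse(n + 1 <= hi, N * S / sqrt(n * (n + 1)), 0)
    h = which.max(P)
    n[h] = n[h] + 1
  }
  n
}
fixn_sketch(n0 = 10, N = c(47, 61, 41), S = sqrt(c(100, 36, 16)),
  lo = c(1, 2, 3), hi = c(5, 6, 4))
```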
# Overview of the Package {#sec-overview}

The `allocation` package makes use of the `Rmpfr` package to handle very large numbers while avoiding loss of precision. Furthermore, users may consider encoding such values with `Rmpfr` rather than as standard floating point numbers, especially for quantities such as target variances which may be very large.

```{r}
#| message: false
#| warning: false
library(Rmpfr)
```

The following functions implement Neyman's allocation method, Algorithm \ref{alg:fixn}, and Algorithm \ref{alg:prec}, respectively.

```{r}
#| echo: false
print_interface = function(f) {
  fname = deparse(substitute(f))
  cat(fname, "=\n")
  str(f, give.attr = F, nchar.max = 4096, width = 80)
}
```

```{r}
#| echo: false
print_interface(allocate_neyman)
print_interface(allocate_fixn)
print_interface(allocate_prec)
```

The arguments are as follows:

- `n0`: target sample size $n_0$,
- `v0`: target variance $V_0$,
- `N`: the vector $(N_1, \ldots, N_H)$,
- `S`: the vector $(S_1, \ldots, S_H)$,
- `lo`: the vector $(a_1, \ldots, a_H)$,
- `hi`: the vector $(b_1, \ldots, b_H)$.

The argument `control` contains additional arguments and can be created with the following function. See its manual page for further information.

```{r}
print_interface(allocation_control)
```

Several accessors are provided to operate on results from the allocation methods.

```{r}
#| eval: false
out = allocate_fixn(n0, N, S)
allocation(out)  ## Extract allocation (n[1], ..., n[H]).
print(out)       ## Print table with allocation and other information.
```

# Examples {#sec-examples}

## Allocation for Fixed Overall Sample Size

Here we demonstrate Algorithm \ref{alg:fixn} using an example from @Wright2017.

```{r}
N = c(47, 61, 41)
S = sqrt(c(100, 36, 16))
lo = c(1, 2, 3)
hi = c(5, 6, 4)
n0 = 10
out1 = allocate_fixn(n0, N, S, lo, hi)
print(out1)
```

Note that the row labels are the stratum indices $1, \ldots, H$. The columns `lo`, `hi`, and `n` correspond to the vectors $a_1, \ldots, a_H$, $b_1, \ldots, b_H$, and $n_1, \ldots, n_H$, respectively. To see details justifying each selection, run `allocate_fixn` with the `verbose` option enabled.

```{r}
#| eval: false
out1 = allocate_fixn(n0, N, S, lo, hi,
  control = allocation_control(verbose = TRUE))
```

Let us compare the above results to Neyman allocation.

```{r}
out2 = allocate_neyman(n0, N, S)
print(out2)
```

The number of decimal places in the output can be changed using the control object.

```{r}
print(out2, control = allocation_control(digits = 2))
```

Extract the allocation as a numeric vector using the `allocation` accessor function.

```{r}
allocation(out1)  ## allocate_fixn result
allocation(out2)  ## allocate_neyman result
```

## Allocation for a Desired Precision

Run Algorithm \ref{alg:prec} using an example from @Wright2017. Since our target variance `v0` is a very large number, we pass it as an `mpfr` object to avoid loss of precision.

```{r}
H = 10
v0 = mpfr(388910760, 256)^2
N = c(819, 672, 358, 196, 135, 83, 53, 40, 35, 13)
lo = c(3, 3, 3, 3, 3, 3, 3, 3, 3, 13)
S = c(330000, 518000, 488000, 634000, 1126000,
      2244000, 2468000, 5869000, 29334000, 1233311000)
print(data.frame(N, S, lo))
out1 = allocate_prec(v0, N, S, lo)
print(out1)
```

To see details justifying each selection, we can run `allocate_prec` with the `verbose` option enabled.

```{r}
#| eval: false
out1 = allocate_prec(v0, N, S, lo,
  control = allocation_control(verbose = TRUE))
```
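As a quick check (this directly evaluates ([-@eq-var-total]) and is not a function provided by the package), we can verify that the variance achieved by the returned allocation does not exceed the target `v0`.

```{r}
# Evaluate the variance formula at the returned allocation; mpfr arithmetic
# is used since the individual terms are very large.
n1 = allocation(out1)
V1 = sum(mpfr(N, 256) * (N - n1) * mpfr(S, 256)^2 / n1)
V1 <= v0
```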
Compare the above results to Neyman allocation. Here, we first need to compute a target sample size. This is done using a given coefficient of variation (`cv`) and total revenue; see @Wright2017 for details. We also exclude the 10th stratum from the allocation procedure, as it is a certainty stratum; its allocation is considered fixed at 13.

```{r}
cv = 0.042
rev = mpfr(9259780000, 256)
n = sum(N[-10] * S[-10])^2 / ((cv * rev)^2 + sum(N[-10] * S[-10]^2))
out2 = allocate_neyman(n, N[-10], S[-10])
print(out2)
```

Extract the final allocations.

```{r}
allocation(out1)  ## allocate_prec result
allocation(out2)  ## allocate_neyman result
```

# References {-}

::: {#refs}
:::