---
title: "Introduction to sdcLog"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 5
    number_sections: false
vignette: >
  %\VignetteIndexEntry{Introduction to sdcLog}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>" # ,
  # tidy = TRUE,
  # tidy.opts = list(
  #   indent = 2L,
  #   width.cutoff = 95L,
  #   wrap = TRUE
  # )
)

user_options <- options()
options(width = 93)
options(knitr.kable.NA = "")
options(sdc.info_level = 1L)
options(datatable.print.keys = FALSE)
options(datatable.print.class = FALSE)

library(sdcLog)
library(knitr)
library(skimr)
```

# Overview

This vignette introduces the sdcLog package and its main functions, illustrated
with various examples. sdcLog provides tools which simplify statistical
disclosure control in the context of research data centers (RDC).

The package includes four main functions:

-   `sdc_descriptives()`: This function is used for statistical disclosure
    control of descriptive statistics. It checks the data used for the
    calculation of descriptive statistics for compliance with the rules and
    regulations set by the RDC.

    The calculation of simple extreme values such as minimum and maximum is
    usually not allowed according to RDC rules, so `sdc_descriptives()` cannot
    be used to check them. Extreme values can only be used if they are
    calculated with `sdc_min_max()`.

-   `sdc_min_max()`: This function is used for the automatic calculation of
    extreme values according to the rules of the RDC (if possible). It uses the
    available data and calculates the values for desired variables and groupings
    in compliance with the rules. The values are calculated as averages of a
    sufficient number of distinct entities. This helps the researcher to easily
    follow the rules and simplifies the output control.

-   `sdc_model()`: This function is used for statistical disclosure control for
    various types of models such as `lm()` or `glm()`. It checks the calculated
    model and the underlying data for compliance with the rules and regulations
    of the RDC.

-   `sdc_log()`: This function is a simple wrapper around `source()` which makes
    it easy to run scripts and capture all output (especially output from the
    other `sdc_*` functions) in a log file. Usually, this should be used to
    source R scripts containing one or more of the functions above.

# sdc_descriptives()

This function performs statistical disclosure control according to two main
criteria: On the one hand, it checks for a sufficiently large number of
different statistical entities. On the other hand, it checks for dominance,
which means that two entities must not account for more than 85 percent of the
observed values. How to use `sdc_descriptives()` is shown below.

## Data

To introduce `sdc_descriptives()`, a simple toy dataset is used. There are 20
observations of 10 distinct entities from two different sectors and values in
the years 2019 and 2020 for the variables `val_1` and `val_2`.

```{r test_data_descriptives}
data("sdc_descriptives_DT")
sdc_descriptives_DT
```

## Simple cases

Consider the case that the mean for `val_1` <!-- and `val_2`  --> has been
calculated and is now to be output as a result:[^1]

[^1]: Since sdcLog heavily relies on `data.table`, all examples will use
    `data.table` syntax as well.

```{r descriptives_simple_case}
sdc_descriptives_DT[, .(mean = mean(val_1, na.rm = TRUE))]
```

Before this result can be released, it must be checked whether all RDC rules for
calculating this value have been followed. Thus, the underlying data is checked
for compliance with the RDC rules.
<!-- The following examples serve to illustrate the examination of these different cases. -->

This is the simplest case, the descriptive statistic (mean) was calculated for
the variable `val_1` without further specifications. Required arguments of
`sdc_descriptives()` are the data set (`data`), the ID variable (`id_var`) and
the variable for which the statistics were calculated (`val_var`):

```{r descriptives_simple}
sdc_descriptives(data = sdc_descriptives_DT, id_var = "id", val_var = "val_1")
```

Since there are no problems at this point, the function runs without warnings
and returns (invisibly) a list of information containing options, settings and
the checked criteria `distinct_ids` and `dominance`.

Options and settings are always printed to show that all specifications are set
according to RDC rules. From the output above follows that there are at least 5
distinct entities required (`sdc.n_ids: 5`) and that dominance is defined as 2
entities (`sdc.n_ids_dominance: 2`) with a value share of more than 85 percent
(`sdc.share_dominance: 0.85`). This reflects the standard values for the
options. For details on setting options see the [separate vignette on
options](options.html).

The settings show again which arguments were specified in the function call and
vary depending on the `sdc_function`. This is important if the result from
`sdc_descriptives()` is not printed right away.

<!-- Depending on the specification of option `sdc.info_level` variously detailed information are printed to the console. Here the default value (sdc.info_level = 1) is used, therefore only a message appears that the output complies to the rules. On the whole, there are no rule violations and the output of the results could be allowed in this case. -->

## Grouped descriptive statistics using by

In this and the following section some advanced cases are presented to introduce
more arguments and functionalities of `sdc_descriptives()`.

In this case the descriptive statistics for the variable `val_1` are grouped by
`sector`:

```{r descriptives_by_case}
sdc_descriptives_DT[, .(mean = mean(val_1, na.rm = TRUE)), by = "sector"]
```

<!-- To check this type of result we introduce the `by` argument in `sdc_descriptives()`.  -->

The mean is computed grouped by sector, so the grouping variable must be
specified in `by`. Checking the results leads to the following:

```{r, descriptives_by}
sdc_descriptives(data = sdc_descriptives_DT, id_var = "id", val_var = "val_1", by = "sector")
```

The grouped descriptive statistics by sector do not generate a warning and
therefore comply with RDC rules. Therefore, the results could be released in
this case.

In order to extend this case even further, it is now proposed to group the mean
of `val_1` not only by `sector`, but also by `year`:

```{r descriptives_byby_case}
sdc_descriptives_DT[, .(mean = mean(val_1, na.rm = TRUE)), by = c("sector", "year")]
```

To check this result for compliance with RDC rules, use:

```{r descriptives_byby}
sdc_descriptives(
  data = sdc_descriptives_DT,
  id_var = "id",
  val_var = "val_1",
  by = c("sector", "year")
)
```

Now several warnings appear, as both criteria are violated. For sector `S1`
there are not enough distinct ids in year 2019, as there is a missing value in
the data. The dominance criterion for year 2020 is violated in both sectors. As
can be seen in the table displayed, the value share of approximately 88 percent
for `S1` and 91 percent for `S2` are above the 85 percent limit. Therefore, the
descriptive statistics for `val_1`, grouped by `sector` and `year` cannot be
released.

## Handling zeros using zero_as_NA

Now, descriptive statistics are calculated for variable `val_2` and grouped by
sector and year:

```{r descriptives_zero_case}
sdc_descriptives_DT[, .(mean = mean(val_2, na.rm = TRUE)), by = c("sector", "year")]
```

The compliance with the rules can be checked just as in the previous case (only
replacing `val_1` by `val_2`):

```{r descriptives_zero}
sdc_descriptives(
  data = sdc_descriptives_DT,
  id_var = "id",
  val_var = "val_2",
  by = c("sector", "year")
)
```

The result indicates that problems exist and the output does not comply to the
rules. There are not enough distinct entities and the output cannot be released
like this.

An additional message indicates that the value `0` occurs rather frequently in
the data (20 percent of all cases). The message indicates that `0` is assumed to
represent missing values and will be treated as such. Please note that even if
`0`s are actual `0`s in the data, this assumption might be correct in the
context of statistical disclosure control. For example, if most of the cases are
`0`, it might be known publicly which entities do not have a value of `0` for
this specific variable. So treating those `0` as `NA` is correct in this
context. Since this is the more defensive interpretation of `0`s, it's the
default.

However, it might be the case that it is accurate according to the data basis to
treat values of `0` as zero (instead of `NA`). Then, specifying the argument
`zero_as_NA = FALSE` circumvents the default behavior and treats `0` like other
numeric values:

```{r descriptives_zerozero}
sdc_descriptives(
  data = sdc_descriptives_DT,
  id_var = "id",
  val_var = "val_2",
  by = c("sector", "year"),
  zero_as_NA = FALSE
)
```

Now `0` is not recognized as `NA` anymore. In this case the criterion of
distinct entities is not longer violated. Therefore, the output could be
released (assuming it is actually correct to treat `0`s as usual numeric
values).


# sdc_min_max()

This function automatically calculates extreme values that comply with the rules
of the RDC. It checks the criteria of distinct entities and dominance. The
values are calculated as averages of a sufficiently large number of
observations. It is based on an iterative procedure that aggregates data until
there are enough distinct entities to calculate the extreme values and no
problems with dominance occur.

The function always starts the iteration process with the lowest possible number
of observations for each extreme value (here `5`, since at least five distinct
statistical units must be included in the calculation according to the rules of
the RDC). Furthermore, the function checks that the subsets of data for minimum
and maximum do not overlap.

If there are no problems with the calculation, the function returns a
`data.table` with the extreme values. Maximum and minimum are always output
together, none of the two can be calculated separately. If it is not possible to
calculate extreme values under these criteria, a corresponding message is
printed and the result is filled with `NA`.

## Data

To introduce `sdc_min_max()`, another simple dataset is used. We have 20
observations of 10 different entities, for which the corresponding sector is
given and values for the variables `val_1`, `val_2`, `val_3` in the years 2019
and 2020, respectively.

```{r test_data_extreme}
data("sdc_min_max_DT")
sdc_min_max_DT
```

## Simple cases

In this simple case, extreme values should be calculated for variable `val_1`.
This can be done with `sdc_min_max()` by specifying the dataset (`data`), the id
variable (`id_var`) and the variable for which extreme values are to be
calculated (`val_var`).

```{r extreme_simple}
sdc_min_max(data = sdc_min_max_DT, id_var = "id", val_var = "val_1")
```

Since no problems occur, the function (invisibly) returns a list with the
options, settings and extreme values and prints the calculated extreme values.
As shown in the output, the extreme values could be calculated and 5 distinct
entities were used for each value. Thus, no additional entities had to be
included in the calculation.

<!-- ### Advanced cases -->

<!-- In this section some advanced cases are presented to introduce more arguments and functionalities of the function `sdc_min_max()`. -->

## Limiting the number of observations included in minimum and maximum using max_obs

In this case minimum and maximum values are to be calculated for variable
`val_2`:

```{r extreme_n1}
sdc_min_max(data = sdc_min_max_DT, id_var = "id", val_var = "val_2")
```

When we look at the output, we see that values from 5 distinct entities were
used to calculate the minimum and 7 distinct entities to calculate the maximum.
This is because the dominance criterion would be violated if only 5 distinct
entities were considered for the maximum. Thus, 7 distinct entities are
automatically taken into account.

If you specify `max_obs = 5`, there is no feasible solution:

```{r extreme_n2}
sdc_min_max(data = sdc_min_max_DT, id_var = "id", val_var = "val_2", max_obs = 5)
```

Note that `max_obs` controls the maximum number of observations, not distinct
entities.

## Minimum and maximum values by groups

It is also possible to calculate minimum and maximum values by groups. In the
following, these are calculated by `year` and `sector`, separately.

```{r exterme_by1}
sdc_min_max(data = sdc_min_max_DT, id_var = "id", val_var = "val_1", by = "year")

sdc_min_max(data = sdc_min_max_DT, id_var = "id", val_var = "val_1", by = "sector")
```

No problems occur, so minimum and maximum values are calculated and shown for
each group.

This can also be done for several grouping variables. In the following, extreme
values for variable `val_1` are to be calculated by `year` and `sector`.

```{r}
res <- sdc_min_max(
  data = sdc_min_max_DT,
  id_var = "id",
  val_var = "val_1",
  by = c("sector", "year")
)
```

Now a message occurs, explaining that RDC rules would be violated for the
calculation of these values. For programming purposes, please note that the
structure of the resulting `data.table` remains the same (but is filled with
`NA`:

```{r extreme_by3}
# extreme_vals
res
```


# sdc_model()

This function checks if your model complies to RDC rules. The criterion of
distinct entities is also checked here. In addition, it is checked whether there
are enough different entities for each attribute or value level. For continuous
variables, `sdc_model()` distinguishes between `<zero>` and `<non-zero>` values.
The function can be used to check a broad range of models like `lm`, `glm` and
various others. In fact, anything which can be handled by `broom::augment()` can
also be handled by `sdc_model()`. For a list of supported models see
`?generics::augment`.

## Data & models

To introduce `sdc_model()`, another dataset with different variables is used,
which includes dummy-variables.

We have 80 observations of 10 different entities for the variables `y`, `x_1`,
`x_2`, `x_3`, `x_4` and additional information on sector, year and country
(dummy variables). A summary of the data set is given below.

```{r model_data}
data("sdc_model_DT")
print(skim(sdc_model_DT))
```

Various simple linear models are specified from this dataset for illustration
purposes.

```{r model_models}
model_1 <- lm(y ~ x_1 + x_2, data = sdc_model_DT)
model_2 <- lm(y ~ x_1 + x_2 + x_3, data = sdc_model_DT)
model_3 <- lm(y ~ x_1 + x_2 + dummy_1 * dummy_2, data = sdc_model_DT)
model_4 <- lm(y ~ x_1 + x_2 + dummy_1 * dummy_3, data = sdc_model_DT)
```

These models are now checked for compliance with the rules of the RDC. It is
checked if there are enough distinct entities in the whole model and if every
level of each variable is checked for compliance with the rules.

A selection of problematic and unproblematic models has been made to better
explain the differences. To check for compliance, the model object (`model`),
the data used (`data`) and the ID variable (`id_var`) must be specified in
`sdc_model()`.

## Simple cases

A check of `model_1` and `model_3` is shown below.

```{r model_simple}
sdc_model(data = sdc_model_DT, model = model_1, id_var = "id")

sdc_model(data = sdc_model_DT, model = model_3, id_var = "id")
```

As we see, there are no problems and the models could be released as output.
Note that `sdc_log()` supports the interaction term in `model_3`.

## Problematic cases

Now we turn to the problematic cases. We are checking the models `model_2` and
`model_4`:

```{r model_prob1}
sdc_model(data = sdc_model_DT, model = model_2, id_var = "id")
```

Some difficulties occur with these models, but which?

`model_2` leads to problems with the number of distinct entities. This problem
arises with the inclusion of variable `x_3` due to a high number of `NA`s.

```{r model_prob2}
sdc_model(data = sdc_model_DT, model = model_4, id_var = "id")
```

For `model_4` the problem stems from a small number of distinct entities for the
value level `FR` of `dummy_3`. This also leads to a problem in the interaction
term. Therefore the respective coefficients cannot be released either. Please
note that this last case is probably the most common problem to occur when
checking models.

# sdc_log()

This function serves to create Stata-like log files from R Scripts. The function
is called to wrap an R script containing your analysis to write the
corresponding code and console output into a log file. It can handle single
files or a list of files at once.

A character vector containing the path(s) of the R script(s) which should be run
must be specified as well as a character vector containing the path(s) of the
text file(s) where the log(s) should be stored. To replace existing log files,
one can specify the argument `replace = TRUE`.

A simple call of this function could look as follows:

```{r eval = FALSE}
sdc_log(
  r_scripts = "/home/my_project/R/my_script.R",
  log_files = "/home/my_project/log/my_script.txt"
)
```

Even though this seems trivial, creating logs for scripts is essential because a
log file bundles all information needed by the RDC for output control.

```{r reset options, include=FALSE}
options(user_options)
```