---
title: "Design"
vignette: >
  %\VignetteIndexEntry{Design}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---

```{r setup}
#| include: false
library(dplyr)
library(osdc)
```

## Principles

These are the guiding principles for this package:

1.  Because of the amount of data in the registers and the extensive
    processing that osdc does to classify diabetes status, the data must be
    in the [DuckDB](https://duckdb.org/) format. DuckDB is a fast analytical
    database engine, which is essential for keeping osdc's performance high.
2.  Functions have consistent inputs and outputs (e.g. the input and output
    types are the same, regardless of specific conditions).
3.  Functions have predictable outputs based on inputs (e.g. if an input is
    a `data.frame`, the output is a `data.frame`).
4.  Functions have consistent naming based on their action.
5.  Functions have limited additional arguments.
6.  Functions are agnostic to the casing (upper or lower case) of input
    variables, but all internal and output variables are lower case.

## Use cases

Based on our experiences and expectations for use cases, we assume that the
package will be:

-   Used entirely within the servers of Statistics Denmark or the Danish
    Health Data Authority, since that is where their data are kept.
-   Used by researchers within or affiliated with Danish research
    institutions.
-   Used specifically within a Danish register-based context.

Below is a set of "narratives" or "personas" with associated needs that this
package aims to fulfill.

"As a researcher, ..."

-   "... I want to easily get an overview of which Danish registers and
    variables I need to request from Statistics Denmark and the Danish
    Health Data Authority, so that I am able to classify the diabetes status
    of individuals in the registers using the osdc package."
-   "...
    I want to easily and simply create a dataset that contains data on
    diabetes status in my population, so that I can begin conducting my
    research that involves persons with diabetes without having to tinker
    with coding the correct algorithm to classify them."
-   "... I want to be informed early and in a clear way whether my data fits
    with the required data types, so that I can fix and correct these issues
    without having to do extensive debugging of the code and/or data."

## Core functionality

This is the list of the core functionality of the osdc package:

1.  Classifies individuals' diabetes type (type 1 or 2).
2.  Outputs a single data frame-type object (as a DuckDB object) including
    individuals with diabetes, their type (type 1 or 2), and date of onset
    as classified by the algorithm.
3.  Internally checks individual registers for the variables required by the
    algorithm.
4.  Provides a list of the variables and registers required to calculate
    diabetes status.
5.  Provides internal checks of whether variables match the expected data
    types.
6.  Provides a common and easily accessible standard for determining
    diabetes status within the context of research using Danish registers.

## Function conventions

To effectively develop both the user-facing and internal functions, we follow
some conventions and design patterns for building these functions. There are
a few conventions we describe here: naming patterns for functions and
arguments, their argument input requirements, and their output data
structure.

The conventions below are *ideals* only, to be used as guidelines to help
with development and understanding of the code; they are not hard rules.

### Naming

-   First word is an action verb, later words are objects or conditions.
-   Functions that filter by dropping rows based on specific criteria are
    prefixed with `drop_`.
-   Functions that filter by keeping rows based on specific criteria are
    prefixed with `keep_`.
-   Helpers that add columns needed for classification are prefixed with
    `add_`.
-   Helpers that join the output of other functions are prefixed with
    `join_`.
-   Functions that prepare and process register data are prefixed with
    `prepare_`.

### Input

-   As few arguments as possible, with as few core required arguments as
    possible (ideally one or two).
-   `keep_` functions take a register as the first argument.
    -   One input register database at a time.
-   `drop_` functions can take a register as the first argument or take the
    output from a `keep_` function.
-   All functions take a DuckDB type object as input (e.g. `duckplyr_df`).
-   The first argument will always take a data frame type object.
-   The second argument could be an output data frame object from another
    function.

### Output

-   All functions output the same type of object as the input object (a
    `duckplyr_df` type object).

## Interface

The osdc package contains one main function that classifies individuals into
those with either type 1 or type 2 diabetes using the Danish registers:
`classify_diabetes()`. This function classifies those with diabetes (type 1
or 2) based on the Danish registers described in the `vignette("design")` and
`vignette("data-sources")`. All data sources needed by osdc are used as input
for this function. The specific details of the classification algorithm are
described in the `vignette("algorithm")`.

### Input

There is one argument in `classify_diabetes()` for each required register.
The names and descriptions of these arguments are as follows:

```{r}
#| output: asis
#| echo: false
registers() |>
  purrr::imap_chr(~ glue::glue("- `{.y}`: The register called '{.x$name}' in Danish.")) |>
  unname() |>
  cat(sep = "\n")
```

### Output

The output is a `data.frame` type object which includes four columns:

-   **pnr**: The pseudonymised social security number of individuals in the
    diabetes population (one row per individual).
-   **stable_inclusion_date**: The *stable* inclusion date (i.e., the raw
    date, kept only for individuals included in a time period where data
    coverage is sufficient to make incident cases reliable)[^1].
-   **raw_inclusion_date**: The *raw* inclusion date (i.e., the date of the
    second inclusion event as described in the `vignette("algorithm")`).
-   **diabetes_type**: The classified diabetes type.

[^1]: For more information on the "raw" versus "stable" inclusion date, see
    `vignette("algorithm")`.

For an example, see below.

| pnr | stable_inclusion_date | raw_inclusion_date | has_t1d | has_t2d |
|-----|-----------------------|--------------------|---------|---------|
| 1   | 2020-01-01            | 2020-01-01         | TRUE    | FALSE   |
| 4   | NA                    | 1995-04-19         | FALSE   | TRUE    |

: Example rows of the `data.frame` output of the osdc package.

The individuals `1` and `4` have been classified as having diabetes
(`has_t1d` and `has_t2d`, respectively). `1` is classified as having type 1
diabetes (T1D) with an inclusion date of `2020-01-01`. Since this date is
within a time period of sufficient data coverage, the column
`stable_inclusion_date` is populated with the same date as
`raw_inclusion_date`. The individual in the second row, `4`, is classified as
having type 2 diabetes (T2D) with an inclusion date of `1995-04-19`. Since
1995 is within a time period of insufficient data coverage, the validity of
this inclusion date is uncertain and `stable_inclusion_date` is `NA`.
However, `raw_inclusion_date` still contains the inclusion date of this
individual.

In the context of generating a diabetes population with valid inclusion dates
(e.g. true incident cases), three aspects of the register records were
considered when determining which periods of time had sufficient data
available:

-   **Sufficient data on inclusion events:** While HbA1c test results are the
    diagnostic standard, these records are the newest addition to the
    register data ecosystem and have limited historical coverage nationwide.
    According to supplementary analyses by Isaksen et al.[@Isaksen2023sup],
    these data have complete nationwide coverage from Q4 2015 onward
    ([direct link to supplementary file S9](https://doi.org/10.1371/journal.pgph.0001277.s009)).
    However, as the vast majority of diabetes patients are treated with
    glucose-lowering drugs at some point, we made the pragmatic assessment
    that prescription drug purchase data are sufficient to identify incident
    cases. These are available from 1995 onward.
-   **Sufficient data on exclusion events:** In order to correctly identify
    pregnancies and discard inclusion events that may occur due to
    gestational diabetes rather than T1D or T2D, register information on
    pregnancy occurrences is necessary. In the patient register, this
    information is available from 1994 onward, but coverage is insufficient
    until 1997, according to supplementary analyses by
    Isaksen[@isaksen2023thesis]
    ([direct link to analysis](https://aastedet.github.io/dissertation/5-discussion-methods.html#fig-births)).
-   **Sufficient wash-out period:** In order to "wash out" prevalent cases
    from true incident cases, a period of time with valid data is necessary
    to capture prevalent cases, before new inclusions can be considered true
    incident cases and the incidence stabilizes. We considered a full year to
    be enough.

Given the above requirements of complete nationwide data on inclusion and
exclusion events, as well as a sufficient wash-out period to establish valid
incident cases, the algorithm was designed to restrict valid inclusion dates
to periods where all criteria are met. With sufficient data on both inclusion
and exclusion events from 1997 and a one-year wash-out period, only inclusion
dates occurring from 1998 onward are considered true incident cases and
assigned a `stable_inclusion_date` value.
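
The restriction to inclusion dates from 1998 onward can be illustrated with a minimal base R sketch. It is an illustration only, not the package's implementation: the `example` data frame and the `stable_from` cut-off (1 January 1998, our reading of "from 1998 onward") are hypothetical names and values introduced here.

```r
# Hypothetical example data: two individuals with raw inclusion dates.
example <- data.frame(
  pnr = c("1", "4"),
  raw_inclusion_date = as.Date(c("2020-01-01", "1995-04-19"))
)

# Assumed cut-off: first day of the period with sufficient data coverage.
stable_from <- as.Date("1998-01-01")

# Copy the raw date, then blank out dates that fall before the cut-off.
example$stable_inclusion_date <- example$raw_inclusion_date
example$stable_inclusion_date[example$raw_inclusion_date < stable_from] <- NA

print(example)
```

The raw date is always kept, so users who need inclusion dates before 1998 can still work with `raw_inclusion_date`, while analyses of incident cases can filter on non-missing `stable_inclusion_date` values.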