Design

Principles

These are the guiding principles for this package:

  1. Because of the volume of data in the registers and the extensive processing that osdc performs to classify diabetes status, the data must be in the DuckDB format. DuckDB is a fast analytical database engine, so relying on it is essential for osdc to keep performance high.
  2. Functions have consistent inputs and outputs (e.g. inputs and outputs are the same, regardless of specific conditions).
  3. Functions have predictable outputs based on inputs (e.g. if an input is a data.frame, the output is a data.frame).
  4. Functions have consistent naming based on their action.
  5. Functions have limited additional arguments.
  6. Input variable names can be in any casing (upper or lower case), but all internal variables and all output variables are lower case.
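
As a minimal sketch of principles 3 and 6, the snippet below lower-cases the column names of an input while returning the same kind of object it was given. The helper name to_lower_names() and the example columns are hypothetical and only serve as an illustration; they are not taken from the osdc code base.

```r
# Hypothetical helper (not part of osdc): lower-case all column names so that
# internal code can rely on a single casing, whatever casing the input used.
to_lower_names <- function(data) {
  names(data) <- tolower(names(data))
  data
}

# Made-up input with mixed-case variable names.
register <- data.frame(
  PNR = c("1", "4"),
  Date = as.Date(c("2020-01-01", "1995-04-19")),
  stringsAsFactors = FALSE
)

lowered <- to_lower_names(register)
names(lowered)
```

The input is a data.frame and so is the output; only the variable names change.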

Use cases

We make these assumptions on how this package will be used, based on our experiences and expectations for use cases:

We expect the package will be:

Below is a set of “narratives” or “personas” with associated needs that this package aims to fulfill.

“As a researcher, …”

Core functionality

This is the list of the core functionality of the osdc package:

  1. Classifies individuals’ diabetes type (type 1 or 2)
  2. Outputs a single data frame-type object (as a DuckDB object) including individuals with diabetes, their type (type 1 or 2), and date of onset as classified by the algorithm.
  3. Internally checks individual registers for the variables required by the algorithm.
  4. Provides a list of required variables and registers in order to calculate diabetes status.
  5. Provides internal checks of whether variables match the expected data types.
  6. Provides a common and easily accessible standard for determining diabetes status within the context of research using Danish registers.
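
The internal checks in items 3 and 5 could look roughly like the sketch below. It assumes a hypothetical function check_required_variables() that takes a named character vector mapping each required variable to its expected class; neither the function name nor the variables used here come from osdc itself.

```r
# Hypothetical check (not the osdc implementation): verify that a register
# contains the required variables and that each has the expected class.
check_required_variables <- function(data, required) {
  # Input casing is ignored, so compare on lower-case names.
  names(data) <- tolower(names(data))

  missing_vars <- setdiff(names(required), names(data))
  if (length(missing_vars) > 0) {
    stop("Missing required variables: ", paste(missing_vars, collapse = ", "))
  }

  # Compare the first class of each required column with the expected class.
  actual <- vapply(data[names(required)], function(x) class(x)[1], character(1))
  wrong <- names(required)[actual != required]
  if (length(wrong) > 0) {
    stop("Unexpected data type for: ", paste(wrong, collapse = ", "))
  }

  invisible(TRUE)
}

# Made-up requirement list and register for illustration.
required <- c(pnr = "character", date = "Date")
register <- data.frame(
  PNR = c("1", "4"),
  date = as.Date(c("2020-01-01", "1995-04-19")),
  stringsAsFactors = FALSE
)

check_required_variables(register, required)
```

Failing early with a message that names the offending variables keeps the check useful when several registers are passed in at once.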

Function conventions

To effectively develop both the user-facing and internal functions, we follow some conventions and design patterns for building these functions. There are a few conventions we describe here: naming patterns for functions and arguments, their argument input requirements, and their output data structure.

The conventions below are ideals only, to be used as guidelines to help with development and understanding of the code; they are not hard rules.

Naming

Input

Output

Interface

The osdc package contains one main function, classify_diabetes(), which classifies individuals as having either type 1 or type 2 diabetes based on the Danish registers described in the vignette("design") and vignette("data-sources"). All data sources needed by osdc are used as input for this function. The specific details of the classification algorithm are described in the vignette("algorithm").

Input

There is one argument in classify_diabetes() for each required register. The names and descriptions of these arguments are as follows:

Output

The output is a data.frame type object which includes five columns:

For an example, see below.

Example rows of the data.frame output of the osdc package.

pnr  stable_inclusion_date  raw_inclusion_date  has_t1d  has_t2d
1    2020-01-01             2020-01-01          TRUE     FALSE
4    NA                     1995-04-19          FALSE    TRUE

Individuals 1 and 4 have both been classified as having diabetes (has_t1d and has_t2d are TRUE, respectively). Individual 1 is classified as having type 1 diabetes (T1D) with an inclusion date of 2020-01-01. Since this date falls within a period of sufficient data coverage, the column stable_inclusion_date is populated with the same date as raw_inclusion_date.

The individual in the second row, 4, is classified as having type 2 diabetes (T2D) with an inclusion date of 1995-04-19. Since 1995 falls within a period of insufficient data coverage, the validity of this inclusion date is uncertain and stable_inclusion_date is NA. However, raw_inclusion_date still contains the inclusion date of this individual.

In the context of generating a diabetes population with valid inclusion dates (e.g. true incident cases), three aspects of the register records were considered when determining which periods of time had sufficient data available:

Given the above requirements of complete nationwide data on inclusion and exclusion events, as well as a sufficient wash-out period to establish valid incident cases, the algorithm was designed to restrict valid inclusion dates to periods where all criteria are met. Consequently, only inclusion dates occurring from 1998 onward are considered true incident cases and assigned a stable_inclusion_date value.
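
The rule described above can be sketched as follows. This is an illustration only, not the osdc implementation: the function name add_stable_inclusion_date() is hypothetical, and the cut-off is assumed to be 1998-01-01 as a reading of "from 1998 onward". The input reproduces the two example rows shown earlier.

```r
# Hypothetical helper (not part of osdc): copy the raw inclusion date to
# stable_inclusion_date, then blank out dates before the assumed cut-off.
add_stable_inclusion_date <- function(data, cutoff = as.Date("1998-01-01")) {
  data$stable_inclusion_date <- data$raw_inclusion_date
  too_early <- which(data$raw_inclusion_date < cutoff)
  data$stable_inclusion_date[too_early] <- NA
  data
}

# The two example individuals from the output table.
classified <- data.frame(
  pnr = c("1", "4"),
  raw_inclusion_date = as.Date(c("2020-01-01", "1995-04-19")),
  has_t1d = c(TRUE, FALSE),
  has_t2d = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)

result <- add_stable_inclusion_date(classified)
result[, c("pnr", "stable_inclusion_date", "raw_inclusion_date")]
```

Keeping raw_inclusion_date untouched means that analyses that do not need true incident cases can still use the earlier dates.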


  1. For more information on the “raw” versus “stable” inclusion date, see vignette("algorithm").