---
title: "Data sources"
vignette: >
  %\VignetteIndexEntry{Data sources}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
execute: 
  echo: false
  output: "asis"
---

```{r setup}
#| include: false
library(osdc)
library(dplyr)

# TODO: Do we want this as an exported function?
#' Convert the register data sources into a Markdown table
#'
#' @param caption Caption to add to the table.
#'
#' @return A character vector as a Markdown table.
registers_as_md_table <- function(caption = NULL) {
  registers() |>
    purrr::map(~ purrr::discard(.x, tibble::is_tibble)) |>
    purrr::map(tibble::enframe) |>
    purrr::map(tidyr::pivot_wider) |>
    purrr::map(~ dplyr::mutate(.x, dplyr::across(tidyselect::everything(), unlist))) |>
    tibble::enframe(name = "register_abbrev") |>
    tidyr::unnest(cols = value) |>
    dplyr::mutate(
      end_year = dplyr::if_else(is.na(.data$end_year), "present", as.character(.data$end_year)),
      years = glue::glue("{start_year} - {end_year}"),
      register_abbrev = glue::glue("`{register_abbrev}`")
    ) |>
    dplyr::distinct() |>
    dplyr::select(
      "Register" = "name",
      "Abbreviation" = "register_abbrev",
      "Years" = "years"
    ) |>
    knitr::kable(caption = caption)
}

#' Convert the register list to a table showing registers and their variables.
#'
#' @returns A character vector as a Markdown table.
register_variables_as_md_table <- function() {
  register_names <- registers() |>
    purrr::map("name") |>
    tibble::enframe(name = "register_abbrev", value = "register_name") |>
    dplyr::mutate(register_name = unlist(register_name))

  registers() |>
    purrr::map(~ purrr::keep(.x, tibble::is_tibble)) |>
    tibble::enframe(name = "register_abbrev") |>
    tidyr::unnest(cols = value) |>
    dplyr::mutate(value = purrr::map(value, ~ dplyr::select(.x, variable_name = name))) |>
    tidyr::unnest(cols = value) |>
    dplyr::left_join(register_names, by = "register_abbrev") |>
    dplyr::mutate(Register = paste0(register_name, " (`", register_abbrev, "`)")) |>
    dplyr::select(Register, Variable = variable_name) |>
    knitr::kable()
}

#' Convert the register name into text to use in a Markdown header
#'
#' @param register The list object of the register to create a header for.
#' @param abbrev The abbreviation of the register.
#'
#' @return A character vector.
register_as_md_header <- function(register, abbrev) {
  glue::glue(
    "### `{abbrev}`: {register$name}"
  )
}

#' Converts the variables for a register into a Markdown table
#'
#' @param register The list object of the register to create a table for.
#' @param abbrev The abbreviation of the register.
#'
#' @return A character vector as a Markdown table.
#'
variables_as_md_table <- function(register, abbrev) {
  register$variables |>
    dplyr::select("name", "english_description") |>
    dplyr::mutate(english_description = stringr::str_to_sentence(.data$english_description)) |>
    knitr::kable(
      caption = glue::glue("Variables and their descriptions within the `{abbrev}` register. If you want to see what the data *should* look like, see `simulate_registers()`.")
    )
}
```

This document describes the data sources needed by the OSDC algorithm
and gives a brief overview of each source and what its data might look
like. The final section explains how to gain access to these data.

The algorithm uses these Danish registers as input data sources:

```{r}
registers_as_md_table("Danish registers used in the OSDC algorithm.")
```

In a future revision, the algorithm may also use the Danish Medical
Birth Register to extend the period of valid inclusions further back in
time than is possible using obstetric codes from the National Patient
Register.

## Data required from registers

The following table lists the variables required from each register in
order for the package to classify diabetes status:

```{r}
register_variables_as_md_table()
```

## Expected data structure

This section describes how the data sources listed in the table above
are expected to look when they are input into the OSDC algorithm. The
structure should mimic that of the raw data as it appears within the
virtual machines of Statistics Denmark. The package does not support
inputs in their raw `.sas7bdat` format; these must first be converted to
`.parquet` files, which can then be loaded as DuckDB objects.
The [`registers2parquet`
package](https://dp-next.github.io/registers2parquet/) provides
functions to conveniently handle this.
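
As a rough illustration of the last step only (loading an
already-converted `.parquet` file through DuckDB), here is a minimal
sketch that uses the DBI and duckdb packages directly. It does not use
`registers2parquet`, and the file name, table names, and the `pnr` and
`year` columns are made up for the example. The toy Parquet file is
written by DuckDB itself so that the sketch is self-contained.

```r
library(DBI)

# In-memory DuckDB connection.
con <- dbConnect(duckdb::duckdb())

# Hypothetical stand-in for a converted register file. The file name and
# both columns are invented for illustration only.
parquet_path <- file.path(tempdir(), "example_register.parquet")
dbWriteTable(
  con,
  "toy",
  data.frame(pnr = c("0001", "0002"), year = c(2018L, 2019L))
)
dbExecute(con, sprintf("COPY toy TO '%s' (FORMAT PARQUET)", parquet_path))

# Expose the Parquet file as a DuckDB view, so queries read from the file.
dbExecute(
  con,
  sprintf(
    "CREATE VIEW example_register AS SELECT * FROM read_parquet('%s')",
    parquet_path
  )
)

# Pull the rows into R to check that the view works.
example_register <- dbGetQuery(
  con,
  "SELECT * FROM example_register ORDER BY pnr"
)

dbDisconnect(con, shutdown = TRUE)
```

With dbplyr installed, a lazy table reference could instead be created
with `dplyr::tbl(con, "example_register")` before disconnecting, which
avoids pulling the full register into memory.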

In this document, we present the variable names in lower case, but case
may vary between data sources (and even between years in the same data
source) in real data. However, the package should be robust to
inconsistent casing, as we internally convert all variable names to
lower case.
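
To show why consistent casing matters when combining files from
different years, here is a small base R sketch. It is not the package's
internal implementation; the helper `to_lower_names()` and the toy
values are hypothetical.

```r
# Toy data from two years with inconsistent casing. The values are
# placeholders, not real register contents.
raw_2017 <- data.frame(PNR = "0001", C_SPEC = "01")
raw_2018 <- data.frame(pnr = "0002", c_spec = "02")

# Hypothetical helper: convert all variable names to lower case.
to_lower_names <- function(data) {
  names(data) <- tolower(names(data))
  data
}

# Without the conversion, rbind() would fail because the names differ.
combined <- rbind(to_lower_names(raw_2017), to_lower_names(raw_2018))
names(combined)
```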

A note about the National Patient Register: it contains several tables
and types of data. The algorithm uses only the hospital diagnosis data,
which are contained in four registers forming two pairs: one pair used
before 2019 (LPR2) and one used from 2019 onward (LPR3). The LPR2 to
LPR3 equivalents are `lpr_adm` to `kontakter` and `lpr_diag` to
`diagnoser`. Most variables have direct equivalents between LPR2 and
LPR3. Note, however, that while `c_spec` is the LPR2 equivalent of
`hovedspeciale_ans` in LPR3, the specialty values in
`hovedspeciale_ans` are coded as strings of the literal specialty names,
unlike the padded integer codes that `c_spec` contains (in our
experience).

At Statistics Denmark, these tables are provided as separate files for
each calendar year prior to 2019 (in LPR2 format) and a single file
containing all the data from 2019 onward (LPR3 format). Within each
format, the two tables can be joined using either the `recnum` variable
(LPR2 data) or the `dw_ek_kontakt` variable (LPR3 data).
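
The joins described above can be sketched with toy data and base R's
`merge()`. Only the table names and the `recnum`, `dw_ek_kontakt`,
`c_spec`, and `hovedspeciale_ans` variables come from this document; the
`diagnosis` column and all values are placeholders made up for the
example.

```r
# LPR2: contact-level table and diagnosis-level table share `recnum`.
lpr_adm <- data.frame(
  recnum = c("r1", "r2"),
  c_spec = c("01", "02") # placeholder padded codes
)
lpr_diag <- data.frame(
  recnum = c("r1", "r1", "r2"),
  diagnosis = c("x", "y", "z") # hypothetical column and values
)
lpr2 <- merge(lpr_adm, lpr_diag, by = "recnum")

# LPR3: the equivalent tables share `dw_ek_kontakt`.
kontakter <- data.frame(
  dw_ek_kontakt = "k1",
  hovedspeciale_ans = "example specialty name" # placeholder string value
)
diagnoser <- data.frame(
  dw_ek_kontakt = c("k1", "k1"),
  diagnosis = c("x", "y") # hypothetical column and values
)
lpr3 <- merge(kontakter, diagnoser, by = "dw_ek_kontakt")

# One row per diagnosis, with the contact-level variables repeated.
nrow(lpr2)
nrow(lpr3)
```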

```{r}
registers() |>
  # iwalk passes each element of the list (`.x`) together with that
  # element's name (`.y`) to the function.
  purrr::iwalk(~ {
    print(register_as_md_header(.x, .y))
    print(variables_as_md_table(.x, .y))
  })
```

## Getting access to data

The data above are available through Statistics Denmark and the Danish
Health Data Authority. Researchers must be affiliated with an approved
research institute in Denmark, and fees apply. Information on how to
gain access to the data can be found at
`www.dst.dk/en/TilSalg/data-til-forskning` (the URL changes often, so we
do not link to it directly).