--- title: "Data sources" vignette: > %\VignetteIndexEntry{Data sources} %\VignetteEngine{quarto::html} %\VignetteEncoding{UTF-8} execute: echo: false output: "asis" --- ```{r setup} #| include: false library(osdc) library(dplyr) # TODO: Do we want this as an exported function? #' Convert the register data sources #' #' @param caption Caption to add to the table. #' #' @return A character vector as a Markdown table. registers_as_md_table <- function(caption = NULL) { registers() |> purrr::map(~ purrr::discard(.x, tibble::is_tibble)) |> purrr::map(tibble::enframe) |> purrr::map(tidyr::pivot_wider) |> purrr::map(~ dplyr::mutate(.x, dplyr::across(tidyselect::everything(), unlist))) |> tibble::enframe(name = "register_abbrev") |> tidyr::unnest(cols = value) |> dplyr::mutate( end_year = dplyr::if_else(is.na(.data$end_year), "present", as.character(.data$end_year)), years = glue::glue("{start_year} - {end_year}"), register_abbrev = glue::glue("`{register_abbrev}`") ) |> dplyr::distinct() |> dplyr::select( "Register" = "name", "Abbreviation" = "register_abbrev", "Years" = "years" ) |> knitr::kable(caption = caption) } #' Convert the register list to a table showing registers and their variables. #' #' @returns A character vector as a Markdown table. register_variables_as_md_table <- function() { register_names <- registers() |> purrr::map("name") |> tibble::enframe(name = "register_abbrev", value = "register_name") |> dplyr::mutate(register_name = unlist(register_name)) registers() |> purrr::map(~ purrr::keep(.x, tibble::is_tibble)) |> tibble::enframe(name = "register_abbrev") |> tidyr::unnest(cols = value) |> dplyr::mutate(value = purrr::map(value, ~ dplyr::select(.x, variable_name = name))) |> tidyr::unnest(cols = value) |> dplyr::left_join(register_names, by = c("register_abbrev" = "register_abbrev")) |> dplyr::mutate(Register = paste0(register_name, " (`", register_abbrev, "`)")) |> dplyr::select(Register, Variable = variable_name) |> knitr::kable() } #' Convert the register name into text to use in a Markdown header #' #' @params register The list object of the register to create a table for. #' @params abbrev The abbreviation of the register. #' #' @return A character vector. register_as_md_header <- function(register, abbrev) { glue::glue( "### `{abbrev}`: {register$name}" ) } #' Converts the variables for a register into a Markdown table #' #' @params register The list object of the register to create a table for. #' @params abbrev The abbreviation of the register. #' #' @return A character vector as a Markdown table. #' variables_as_md_table <- function(register, abbrev) { register$variables |> dplyr::select("name", "english_description") |> dplyr::mutate(english_description = stringr::str_to_sentence(.data$english_description)) |> knitr::kable( caption = glue::glue("Variables and their descriptions within the `{abbrev}` register. If you want to see what the data *should* look like, see `simulate_registers()`.") ) } ``` This document describes the sources of data needed by the OSDC algorithm and gives a brief overview of each of these sources and how they might look like. In addition, the final section contains information on how to gain access to these data. The algorithm uses these Danish registers as input data sources: ```{r} registers_as_md_table("Danish registers used in the OSDC algorithm.") ``` In a future revision, the algorithm can also use the Danish Medical Birth Register to extend the period of time of valid inclusions further back in time compared to what is possible using obstetric codes from the National Patient Register. ## Data required from registers The following is a list of the variables required from specific registers in order for the package to classify diabetes status: ```{r} register_variables_as_md_table() ``` ## Expected data structure This section describes how the data sources listed from the above table are expected to look like when they are input into the OSDC algorithm. This should mimic the structure of the raw data as it appears within the virtual machines of Statistics Denmark. The package does not support inputs in their raw `.sas7bdat` format, and these must be converted to `.parquet` files first, which can then be loaded in as DuckDB objects. The [`registers2parquet` package](https://dp-next.github.io/registers2parquet/) provides functions to conveniently handle this. In this document, we present the variable names in lower case, but case may vary between data sources (and even between years in the same data source) in real data. However, the package should be robust to inconsistent casing, as we internally convert all variable names to lower case. A small note about the National Patient Register: It contains several tables and types of data. The algorithm uses only hospital diagnosis data contained in four registers, which are pairs of two related registers used before 2019 (LPR2) and from 2019 onward (LPR3). So the LPR2 to LPR3 equivalents are `lpr_adm` to `kontakter` and `lpr_diag` to `diagnoser`. Most variables have direct equivalents between LPR2 and LPR3, but it should be noted that while `c_spec` is the LPR2 equivalent of `hovedspeciale_ans` in LPR3, the specialty values in `hovedspeciale_ans` are coded as strings of the literal specialty names unlike the padded integer codes that `c_spec` contains (in our experience). On Statistics Denmark, these tables are provided as a mix of separate files for each calendar year prior to 2019 (in LPR2 format) and a single file containing all the data from 2019 onward (LPR3 format). The two tables can be joined with either the `recnum` variable (LPR2 data) or the `dw_ek_kontakt` variable (LPR3 data). ```{r} registers() |> # iwalk takes the name of the element in the list (`.y`) as well as the # list itself (`.x`) and passes them to the function. purrr::iwalk(~ { print(register_as_md_header(.x, .y)) print(variables_as_md_table(.x, .y)) }) ``` ## Getting access to data The above data is available through Statistics Denmark and the Danish Health Data Authority. Researchers must be affiliated with an approved research institute in Denmark and fees apply. Information on how to gain access to data can be found at `www.dst.dk/en/TilSalg/data-til-forskning` (URL often changes, so we can't link directly).