---
title: "2 - Post processing"
format:
  html:
    toc: true
vignette: >
  %\VignetteIndexEntry{2 - Post processing}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
knitr:
  opts_chunk:
    collapse: true
    comment: '#>'
---
## Introduction
::: callout-caution
Before reading this, you should read `vignette("reading")` to learn how to import your database.
:::
After importing your database with EDCimport, you end up with an `edc_database` object, which can be loaded to the global environment using `load_database()`.
However, EDCimport provides a few functions to improve the database before loading it.
## Harmonize Subject ID across the database
The Subject ID column, usually `SUBJID` for CDISC data, is the primary key, shared by almost all your datasets.
Using `edc_unify_subjid()`, you can harmonize this column across the whole database, so that it becomes a `factor` that is consistent across all datasets. The `preprocess` argument lets you transform the IDs first, for example by adding a prefix.
This is especially convenient for joining your data and checking for missing patients.
```{r}
#| warning: false
library(EDCimport)
# Without harmonization: subjid is left as imported
db1 = edc_example()
load_database(db1)
enrol$subjid %>% class()
enrol$subjid %>% head()

# With harmonization: subjid becomes a factor, prefixed with "#"
db2 = edc_example() %>%
  edc_unify_subjid(preprocess=~paste0("#", .x))
load_database(db2)
enrol$subjid %>% class()
enrol$subjid %>% head()

# Patients missing from table `ae` (factor levels with no rows)
ae$subjid %>%
  forcats::fct_count() %>%
  dplyr::filter(n==0)
```
::: callout-note
If your SUBJID column is numeric and `preprocess` is empty, SUBJID will be cast to numeric.
:::
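To see why a shared factor makes missing patients easy to spot, here is a minimal base R sketch. It does not use EDCimport and is not how `edc_unify_subjid()` is implemented; the vectors `enrol_id` and `ae_id` are made-up data for illustration only.

```r
# Hypothetical subject IDs: 4 patients enrolled, only 2 of them have adverse events
enrol_id = c(1, 2, 3, 4)
ae_id = c(1, 1, 3)

# Give both columns the same set of levels
all_levels = sort(unique(c(enrol_id, ae_id)))
enrol_f = factor(enrol_id, levels = all_levels)
ae_f = factor(ae_id, levels = all_levels)

# table() counts every level, including those that never occur
ae_counts = table(ae_f)
missing_in_ae = names(ae_counts)[as.vector(ae_counts == 0)]
missing_in_ae
#> [1] "2" "4"
```

With plain numeric columns, patients 2 and 4 would simply be absent from the `ae` counts. Because the levels are shared, they show up with a count of zero instead.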
## Clean dataset names
Does your database come from messy EDC software, with column names full of special characters or written in camelCase?
Fear not! With `edc_clean_names()` you can clean the names in all datasets at once.
By default, it converts names to lowercase letters, numbers, and underscores only. For this example, since `edc_example()` already provides clean column names, let's convert all columns to **uppercase**:
```{r}
#| warning: false
library(EDCimport)
db = edc_example() %>%
  edc_clean_names(toupper)
load_database(db)
names(enrol)
```
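To give an idea of what "lowercase letters, numbers, and underscores only" can look like in practice, here is a rough base R sketch. The helper `clean_names_sketch()` is hypothetical and is not the function EDCimport uses by default, so the actual default may handle edge cases differently.

```r
# Hypothetical helper, for illustration only (not part of EDCimport)
clean_names_sketch = function(x) {
  x = gsub("([a-z0-9])([A-Z])", "\\1_\\2", x)  # split camelCase: "SubjId" -> "Subj_Id"
  x = tolower(x)                                # lowercase everything
  x = gsub("[^a-z0-9]+", "_", x)                # any other run of characters -> "_"
  gsub("^_+|_+$", "", x)                        # trim leading/trailing underscores
}

messy = c("SubjId", "Date of Visit", "AE.grade (%)")
cleaned = clean_names_sketch(messy)
cleaned
#> [1] "subj_id"       "date_of_visit" "ae_grade"
```

Any function that takes a character vector of names and returns a vector of the same length has the same shape as `toupper` in the example above.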
## Split a mixed dataset into short + long
::: {.callout-note appearance="minimal"}
This one is a bit more complex, but bear with me, I'll try to make it understandable.
:::
When a CRF form contains both repeated and non-repeated measures, the export usually duplicates the non-repeated measures on every repeated row.
This results in a "mixed" data format, combining both "long" and "short" structures. (The latter is usually called "wide", but that term does not really apply here.)
For example, in the dataset `long_mixed` from `edc_example()`, you have two long-format variables (one value per observation) and one short-format variable (one value per subject).
```{r}
head(long_mixed)
```
With complex CRFs and lengthy forms, this mixed structure can complicate analysis, as repeated and non-repeated data may be unrelated.
With `edc_split_mixed()`, you can split this dataset into two, one `short` and one `long`:
```{r}
#| warning: false
db = edc_example() %>%
  edc_split_mixed(long_mixed)
load_database(db)
head(long_mixed_short) #one row per subject
head(long_mixed_long) #one row per observation
```
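The idea behind the split can be sketched in base R: a column that never varies within a subject belongs in the short table, and everything else stays in the long one. This is only an illustration under that assumption, not the actual implementation of `edc_split_mixed()`; the data frame `mixed` and the helper `is_constant()` are hypothetical.

```r
# Hypothetical mixed dataset: `visit` and `weight` are repeated, `sex` is not
mixed = data.frame(
  subjid = c(1, 1, 1, 2, 2),
  visit  = c(1, 2, 3, 1, 2),
  weight = c(70, 71, 69, 82, 83),
  sex    = c("F", "F", "F", "M", "M")
)

# TRUE if the column holds a single distinct value for every subject
is_constant = function(col, id) {
  all(tapply(col, id, function(v) length(unique(v)) == 1))
}

value_cols = setdiff(names(mixed), "subjid")
constant = vapply(mixed[value_cols], is_constant, logical(1), id = mixed$subjid)
short_cols = value_cols[constant]
long_cols = value_cols[!constant]

mixed_short = unique(mixed[c("subjid", short_cols)])  # one row per subject
mixed_long = mixed[c("subjid", long_cols)]            # one row per observation
```

Here `mixed_short` has 2 rows (one per subject) with the `sex` column, while `mixed_long` keeps all 5 rows with `visit` and `weight`.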
## You can combine!
Obviously, these functions can be piped to one another:
``` r
db = edc_example() %>%
  edc_split_mixed(long_mixed) %>%
  edc_unify_subjid(preprocess=~paste0("#", .x)) %>%
  edc_clean_names(toupper)
load_database(db)
```
Don't hesitate to [submit a feature request](https://github.com/DanChaltiel/EDCimport/issues/new?template=feature_request.md) if you think another function can be useful to others!