---
title: "3 - Data checking"
format: 
  html:
    toc: true
vignette: >
  %\VignetteIndexEntry{3 - Data checking}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
knitr:
  opts_chunk:
    collapse: true
    comment: '#>'
---

## Introduction

You imported your database, but now you need to check it for errors and inconsistencies.

There are many ways to do so, and EDCimport covers a few of them: data warnings, duplicate assertions, and a last-news table.

As in previous vignettes, we will be using `edc_example()`, but in the real world you should use EDC reading functions. See `vignette("reading")` for details.

```{r}
#| warning: false
#| message: true
library(EDCimport)
library(dplyr)
db = edc_example(N=200) %>% 
  edc_unify_subjid() %>% 
  edc_clean_names()
db
load_database(db)
```

## Data warning

### Data errors

The primary and most valuable data-checking tool in EDCimport is `edc_data_warn()`.

Simply use `dplyr::filter()` to identify problematic or inconsistent rows and pipe them into the function.

For example, let's say that in our study:

-   Patients should be older than 25

-   Adverse event grades should be between 1 and 5

-   Patients in the treatment arm should not have `data1$date1` before 2010-04-10

Here's how you check these conditions:

```{r}
enrol %>% 
  filter(age<=25) %>% 
  edc_data_warn("Patients should be >25yo", issue_n=1)

ae %>% 
  filter(aegr<1 | aegr>5) %>% 
  edc_data_warn("Incorrect adverse event grade", issue_n=2)

data1 %>% 
  edc_left_join(enrol) %>% 
  filter(arm=="Trt") %>% 
  filter(date1<"2010-04-10") %>% 
  edc_data_warn("Treated patients should have been seen later", issue_n=3)
```
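Conceptually, each check above is a filter that keeps only the offending rows: if nothing is left, the check passes. The following is a minimal sketch of that idea using plain dplyr and a base R `warning()`. It is not how `edc_data_warn()` is implemented, and `toy_ae` is a hypothetical dataset made up for illustration.

```r
library(dplyr)

# Hypothetical toy data, not the EDCimport example database
toy_ae = tibble(
  subjid = c(1, 1, 2, 3),
  aegr   = c(2, 6, 0, 3)
)

# A check is a filter that keeps only the offending rows
bad_grades = toy_ae %>% 
  filter(aegr<1 | aegr>5)

# No rows left means the check passes; otherwise, report the subjects involved
if (nrow(bad_grades) > 0) {
  warning(
    "Incorrect adverse event grade (", nrow(bad_grades), " rows): subjects ", 
    paste(unique(bad_grades$subjid), collapse=", ")
  )
}
```

Writing the filter so that it selects the *bad* rows, rather than the good ones, is what makes the result directly usable as a list of queries for the data manager.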

You can implement these checks according to your Data Validation Plan, ensuring they run after every export. Any failed check will produce a warning in your R console. Once the database is corrected, the warnings will no longer appear (e.g. `issue_n=2`).

After running all your checks, you can use `edc_data_warnings()` to get a summary of all detected issues.

```{r}
edc_data_warnings()
```

### Fatal error

If a check is so critical that you cannot work with a database that fails it, use `edc_data_stop()` instead.

For example, you can use it to check that some variable construction didn't go wrong:

``` r
df = mtcars %>% 
  mutate(
    type = case_when(
      cyl==4 ~ "4 cylinders", 
      cyl==6 ~ "6 cylinders", 
      cyl==8 ~ "8 cylinders", 
      .default="ERROR"
    ),
  )

df %>% 
  filter(type=="ERROR") %>% 
  edc_data_stop("Error on type construction")
```
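Since `mtcars` only contains cars with 4, 6, or 8 cylinders, the check above finds no rows. As a rough illustration of when such a blocking check would fire, here is a sketch that corrupts one value and uses a plain base R `stop()` as a stand-in for the blocking behaviour. The helper `add_type()` and the dataset `mtcars_bad` are hypothetical names introduced for this example, and the error is caught with `tryCatch()` only so that the sketch runs to the end.

```r
library(dplyr)

# Same classification as above, wrapped in a helper (hypothetical name)
add_type = function(df) {
  df %>% 
    mutate(
      type = case_when(
        cyl==4 ~ "4 cylinders", 
        cyl==6 ~ "6 cylinders", 
        cyl==8 ~ "8 cylinders", 
        .default="ERROR"
      )
    )
}

# mtcars only has 4, 6 and 8 cylinders, so nothing is flagged
n_flagged_ok = mtcars %>% add_type() %>% filter(type=="ERROR") %>% nrow()

# Hypothetical corrupted copy: the first car now has 5 cylinders
mtcars_bad = mtcars %>% mutate(cyl = replace(cyl, 1, 5))
n_flagged_bad = mtcars_bad %>% add_type() %>% filter(type=="ERROR") %>% nrow()

# A plain stop() standing in for a blocking check, caught so the sketch completes
msg = tryCatch(
  {
    if (n_flagged_bad > 0) stop("Error on type construction")
    "no error"
  },
  error = function(e) conditionMessage(e)
)
```

In a real script you would not catch the error: the point is precisely that nothing downstream runs until the problem is fixed.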

## Duplicate-free dataset assertion

If you work with multiple datasets, your code probably includes a lot of joins. As you may have painfully discovered, joining data carries a high risk of altering the data layout and resulting in multiple rows per patient.

This is why you should always include `assert_no_duplicate()` in your pipeline if you expect only one row per patient.

```{r}
#| error: true
enrol %>% 
  assert_no_duplicate() %>% 
  count(arm)

enrol %>% 
  edc_left_join(data1) %>% #oopsie
  assert_no_duplicate() %>% 
  count(arm)
```

::: callout-tip
This is the [Fail Fast principle](https://www.codereliant.io/fail-fast-pattern/): it is better to get an error in your R script than in your analysis report.
:::
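To see why an unchecked join is dangerous, here is a small dplyr-only sketch with two hypothetical toy tables (`toy_enrol` and `toy_visits`, not part of `edc_example()`). It shows how a one-to-many join silently inflates per-arm counts, and a manual way to spot the duplicated patients. It is only meant to illustrate the problem, not to describe how `assert_no_duplicate()` works internally.

```r
library(dplyr)

# Hypothetical toy tables: one row per patient vs. several visits per patient
toy_enrol  = tibble(subjid = c(1, 2, 3), arm = c("Trt", "Ctl", "Trt"))
toy_visits = tibble(subjid = c(1, 1, 2, 3, 3, 3), visit = c(1, 2, 1, 1, 2, 3))

# One-to-many join: the result has one row per visit, not per patient
joined = toy_enrol %>% 
  left_join(toy_visits, by="subjid")

# Counting arms now counts visits instead of patients
arm_counts = joined %>% count(arm)

# Manual duplicate check: patients appearing on more than one row
dup = joined %>% 
  count(subjid) %>% 
  filter(n>1)
```

Here `toy_enrol` has 3 rows but `joined` has 6, so any per-arm summary computed on `joined` would be wrong without raising any error.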

## Last-news table

If your analysis involves a survival endpoint, you likely have a follow-up dataset that includes the vital status as of the last visit date.

However, in real-world scenarios, this dataset might not be accurately filled in, and some patients may have other data recorded after the date of last visit.

The `lastnews_table()` function calculates the actual date of the last recorded information for each patient (`SUBJID`), based on all Date/Datetime columns across all datasets.

Currently, `edc_example()` does not include an explicit table for this scenario, so let's consider the following example:

-   `data3$date10` is the last visit date in your follow-up dataset. It is the origin you should `prefer` in case of a tie, and the reference against which any inconsistency is identified.

-   `data1` is a dataset containing scheduled protocol dates, such as planned medical visits. You should ignore all columns in this dataset, as they do not pertain to the patient's last known status.

-   `date8` and `date9` are dates when treatments were administered. They imply that the patient was alive at that time. If a date is after `date10`, it means that the survival time is underestimated.

Here is how to parameterize `lastnews_table()` to fit this scenario:

```{r}
lastnews_table(prefer="date10", except="data1", show_delta=TRUE) %>% 
  mutate(delta=round(delta)) %>% 
  arrange(desc(delta))
```

::: callout-warning
This table is useful for Overall Survival and data checking. However, you should be very careful when using it for Event-Free Survival: a patient can be alive at that point without having experienced an event.
:::
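The idea behind a last-news table can be sketched with plain dplyr: stack every date of interest in long format, keep the latest one per patient, and compare it with the declared last visit date. The tables `toy_followup` and `toy_treat` below are hypothetical, and this is only an illustration of the concept under those assumptions, not the actual implementation of `lastnews_table()`.

```r
library(dplyr)

# Hypothetical toy tables, not part of edc_example()
toy_followup = tibble(
  subjid = c(1, 2), 
  date10 = as.Date(c("2020-06-01", "2020-05-01"))
)
toy_treat = tibble(
  subjid = c(1, 1, 2), 
  date8  = as.Date(c("2020-03-01", "2020-07-15", "2020-04-01"))
)

# Long format: one row per patient and date, keeping track of where it comes from
all_dates = bind_rows(
  toy_followup %>% transmute(subjid, origin="toy_followup$date10", date=date10),
  toy_treat %>% transmute(subjid, origin="toy_treat$date8", date=date8)
)

# Latest recorded date for each patient
last_news = all_dates %>% 
  group_by(subjid) %>% 
  slice_max(date, n=1, with_ties=FALSE) %>% 
  ungroup()

# Difference between the actual last date and the declared last visit date
res = last_news %>% 
  left_join(toy_followup, by="subjid") %>% 
  mutate(delta = date - date10)
```

In this toy example, patient 1 received a treatment 44 days after the declared last visit, so using `date10` alone would underestimate that patient's survival time, while patient 2 is consistent.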