---
title: "Formatting Data Exported from REDCap"
output:
 rmarkdown::html_vignette:
   toc: true
   number_sections: false
vignette: >
 %\VignetteIndexEntry{Formatting Data Exported from REDCap}
 %\VignetteEngine{knitr::rmarkdown}
 %\VignetteEncoding{UTF-8}
---


```{r label = "setup", include = FALSE}
knitr::opts_chunk$set(collapse = TRUE)
```


This vignette covers different methods for formatting the records from REDCap
into an analysis ready data set.  It is assumed that the reader is familiar
with the process for exporting data from REDCap to R as described in
`r  qwraps2::backtick(vignette("api", package = "REDCapExporter"))  `

For the purposes of this vignette we will use the example data sets provided
in the package from the 2000-2001 National Hockey League Stanley Cup Champion
Colorado Avalanche.  The data was transcribed from [Hockey
Reference](https://www.hockey-reference.com/teams/COL/2001.html) into a
REDCap Project hosed at the University of Colorado Denver.

The data sets we will work with in this vignette are:

```{r label = "namespace", eval = TRUE, results = "hide"}
library(REDCapExporter)
avs_raw_core      # object returned from export_core(format = "csv")
avs_raw_metadata  # object returned from export_content(content = "metadata", format = "csv")
avs_raw_record    # object returned from export_content(content = "record", format = "csv")
```


There are two conceptual formatting tools provided by REDCapExporter:

1.
`r  qwraps2::backtick(as.data.frame)  `

2.
`r  qwraps2::backtick(format_record)  `

# Coercion to data.frame

The object returned from
`r  qwraps2::backtick(export_content)  `
is a string in either csv or json format.  To have that information as a
data.frame call
`r  paste0(qwraps2::backtick(as.data.frame), ".") `

This method works for the metadata and records directly.


```{r echo = TRUE, results = "markup"}
avs_metadata_DF <- as.data.frame(avs_raw_metadata)
avs_record_DF   <- as.data.frame(avs_raw_record)
```


For
`r  qwraps2::backtick(rcer_rccore)  `
objects returned by
`r  qwraps2::backtick(export_core)  `
all the elements can be coerced to data.frames via
`r  qwraps2::backtick(lapply)  `

```{r echo = TRUE, results = "markup"}
avs_core_DFs  <- lapply(avs_raw_core, as.data.frame)
```


The behavior of
`r  qwraps2::backtick(as.data.frame)  `
for these objects is to return a data.frame with all character columns.

```{r echo = TRUE, results = 'markup'}
avs_metadata_DF |> sapply(class) |> sapply(is.character) |> all()
avs_record_DF   |> sapply(class) |> sapply(is.character) |> all()
```


Obviously, this is not ideal for analysis.  It does give the user a known
starting point for formatting the records explicitly.  However,
REDCapExporter provides the
`r  qwraps2::backtick(format_record)  `
method to simplify this task by using the metadata from the REDCap project.

# format_record

`r  qwraps2::backtick(format_record)  `
uses the metadata to inform the storage mode of the elements of a data.frame.
For example, after exporting the core of a REDCap project we can build a
data.frame
`r  qwraps2::backtick(avsDF)  `
via

```{r echo = TRUE, results = "markup"}
avsDF <- format_record(avs_raw_core)
str(avsDF, max.level = 0)
```


Note: the above uses the core export from REDCap.  You can use just the
record and metadata to get the same result:

```{r echo = TRUE, result = "markup"}
identical(
  format_record(avs_raw_core),
  format_record(avs_raw_record, avs_raw_metadata)
)
```


Let's look at the
`r  qwraps2::backtick(avsDF)  `
object (presented as a nice human readable table)


```{r echo = FALSE, results = "asis"}
avsDF |>
  kableExtra::kbl(format = "html", row.names = FALSE) |>
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"),
                            fixed_thead = TRUE) |>
  kableExtra::scroll_box(height = "200px", width = "700px")
```


Now, consider the classes of the columns.  Start by looking at
a few columns which look like they are numeric values, record_id,
uniform_number, height, and points.

```{r echo = TRUE, results = "markup"}
cols <- c("record_id", "uniform_number", "height", "points")
head(avsDF[, cols], n = 3)
sapply(avsDF[, cols], class)
```


Why are record_id and uniform_number, stored as characters whereas height and
points (sum of goals scored and assists) integer and numeric values
respectively?  The answer is in the metadata.


```{r echo = TRUE, results = "hide"}
avs_metadata_DF[avs_metadata_DF$field_name %in% cols, ]
```



```{r echo = FALSE, results = "asis"}
avs_metadata_DF[avs_metadata_DF$field_name %in% cols, ] |>
  kableExtra::kbl(format = "html", row.names = FALSE) |>
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"),
                            fixed_thead = TRUE) |>
  kableExtra::scroll_box(height = "200px", width = "700px")
```


Notice that for the record_id and uniform_number the field_type is "text"
with no value for "select_choices_or_calculations" and no value for
"text_validation_type_or_show_slider_number".  This is interpreted, then, as
just a text field and should be character vector in the data.frame.
Obviously the user could coerce to integer of numeric is desired and if
appropriate.

For height, note that the field_type is "text" and the
"text_validation_type_or_show_slider_number" is "integer", hence the coercion
from the raw data to integer when building the data.frame.  Lastly, the
points are a calculated field and set to numeric.

REDCapExporter attempts to make reasonable assumptions for the data types
base on the metadata.  For example, dates in REDCap can by entered and
validated in Year-Month-Day, Month-Day-Year, and Day-Month-Year formats.  The
raw data is all in Year-Month-Day format.

```{r label = "date_metadata", echo = FALSE, results = "asis"}
avs_metadata_DF[
  avs_metadata_DF$field_name %in% c('birthdate', 'first_nhl_game', 'last_nhl_game'),
  c("field_name", "field_type", "field_label", "field_note", "text_validation_type_or_show_slider_number")
  ] |>
  kableExtra::kbl(format = "html", row.names = FALSE) |>
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"),
                            fixed_thead = TRUE) |>
  kableExtra::scroll_box(height = "200px", width = "700px")
```


The coercion that will be used when calling
`r  qwraps2::backtick(format_record)  `
is defined by an implicit call to
`r  qwraps2::backtick(col_type)  `
which uses the metadata, in raw or formatted form, to determine the
coercion.


```{r}
identical(col_type(avs_raw_metadata), col_type(avs_metadata_DF))
ct <- col_type(avs_metadata_DF)
```


Each of the elements of
`r  qwraps2::backtick(ct)  `
are applied to the column of the data frame with the same name.
Examples:  The record_id is to be a character string by default.

```{r}
ct[["record_id"]]
```


If the user would prefer the record_id to be an integer we can modify
`r  qwraps2::backtick(ct)  `
and apply it explicitly when calling
`r  paste0(qwraps2::backtick(format_record), ".")  `


```{r echo = TRUE, results = "markup"}
ct[["record_id"]] |> str()
ct[["record_id"]] <- expression(as.integer(record_id))
avsDF2 <- format_record(avs_raw_core, col_type = ct)
```


Two notes to make here, first, we can see that the storage mode
is different between
`r  qwraps2::backtick(avsDF$record_id)  `
and
`r  paste0(qwraps2::backtick(avsDF2$record_id), ".")  `

```{r echo = TRUE, results = "markup"}
class(avsDF$record_id)
class(avsDF2$record_id)
```


Second, there is a message (not a warning), that the metadata that is part of
the
`r  qwraps2::backtick(avs_raw_core)  `
object, is not being used to define the column types.

If you want to suppress that message you can use

```{r echo = TRUE, results = "hide"}
suppressMessages(format_record(avs_raw_core, col_type = ct))
```


or use the records as the object passed to
`r  qwraps2::backtick(format_record)  `

```{r echo = TRUE, results = "hide"}
format_record(avs_record_DF, col_type = ct)
```


By default, variables recorded in REDCap via radio buttons or dropdown lists
are formatted as factors.  For example, the position of the player is a
factor.

```{r}
class(avsDF$position)
summary(avsDF$position)
```


If you'd prefer to have all these variables stored as characters instead of
factors you can modify the call to
`r  qwraps2::backtick(col_type)  `

```{r echo = TRUE, results = "markup"}
ct <- col_type(avs_raw_metadata, factors = FALSE)
avsDF2 <- format_record(avs_raw_record, col_type = ct)
class(avsDF2$position)
summary(avsDF2$position)
table(avsDF2$position)
```


The default formatting is documented in the manual file
The implemented code is within the S3 method:

```{r echo = TRUE, eval = FALSE}
?col_type
```