---
title: "Specifying Augmented Directed Acyclic Graphs"
output: rmarkdown::html_vignette
bibliography: vignettes.bib
vignette: >
  %\VignetteIndexEntry{Specifying Augmented Directed Acyclic Graphs}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

In this vignette, we introduce a specification for **augmented** Directed Acyclic Graphs (aDAGs).
We extend regular DAGs as specified using the `dagitty` R-package.
While `dagitty` provides standard facilities for declaring nodes, edges, exposures, and outcomes in causal frameworks, we augment the DAG with additional **metadata fields** to make it more conducive to theory specification.
These metadata fields do not interfere with regular use of the DAG in `dagitty`.

The metadata fields in an aDAG include:

- **`tags`** for identifying nodes of interest in causal inference. This field can take on values like `exposure`, `outcome`, and `unobserved`.
- **`pos`** for nodes: The layout position in the X and Y dimensions, e.g., a node with `pos="0,1"` is positioned at coordinates X = 0 and Y = 1. This metadata field is used by `dagitty`.
- **`label`** for nodes or edges: A descriptive label used for visualization and reporting. This is a new metadata field.
- **`distribution`** for nodes: The function that generates the assumed distribution of the variable associated with a node. For exogenous nodes, this is the distribution of the variable itself; for endogenous nodes, this is the residual distribution of the variable. This is a new metadata field.
- **`form`** for edges: A function specification (in a form interpretable by `as.formula()`) that describes how the variable associated with a child node is calculated from its parents. This is a new metadata field.

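
As a minimal sketch of how these fields can sit together in one specification, the chunk below declares a two-node aDAG that uses every field from the list above. The object name `g_sketch`, the node names, labels, and values are illustrative assumptions, not part of any package. `exposures()` and `outcomes()` are standard `dagitty` accessors, used here only to suggest that the usual tags remain readable alongside the extra metadata.

```{r eval = FALSE}
library(dagitty)

# Illustrative aDAG: names, labels, and values are made up for this sketch
g_sketch <- dagitty('dag {
  X [exposure, pos="0,0", label="Predictor", distribution="rnorm()"]
  Y [outcome, pos="1,0", label="Outcome", distribution="rnorm()"]
  X -> Y [label="effect", form=".2*X"]
}')

# Standard dagitty queries still return the tagged nodes
exposures(g_sketch)
outcomes(g_sketch)
```
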
Throughout the vignette, we will illustrate how to write an augmented DAG, how to parse and inspect it with `dagitty`, `theorytools`, and `tidySEM` (for plotting), and how to leverage these additional properties for further modeling or simulation tasks.

```{r setup}
library(theorytools)
library(dagitty)
library(tidySEM)
```

## A Basic DAG

The usual syntax for specifying a DAG in the `dagitty` R-package is something like:

```{r, eval = FALSE}
library(dagitty)
dagitty("dag {
  X -> Y
  Z -> X
  Z -> Y
}")
```

There are several tags that can be used in `dagitty`.
Note that quotation marks used in tags **must be** double quotes `"`, so it makes sense to wrap the whole DAG syntax in single quotes `'`:

```{r eval = FALSE}
library(dagitty)
dagitty('dag {
  X [exposure, pos="0,1"]
  Y [outcome, pos="1,1"]
  Z [unobserved, pos="1,0"]
  X -> Y
  Z -> X
  Z -> Y
}')
```

## Augmented DAGs (aDAGs)

In our **augmented** specification, we add additional properties as metadata fields.
Below, we detail each new property:

### 1. `label` (Nodes/Edges)

- **Usage**: Provide a descriptive name for the node or edge.
- **Example**: `X [label="Study hours"]`

The label is used, for example, by `tidySEM` to label nodes and edges:

```{r eval = FALSE}
library(tidySEM)
g <- dagitty('dag {
  X [label="Predictor", pos="0,0"]
  Y [label="Outcome", pos="1,0"]
  X -> Y [label="effect"]
}')
graph_sem(g, text_size = 2)
```

```{r eval = knitr::is_html_output(), echo = FALSE, out.width="60%"}
library(tidySEM)
library(ggplot2)
g <- dagitty('dag {
  X [label="Predictor", pos="0,0"]
  Y [label="Outcome", pos="1,0"]
  X -> Y [label="effect"]
}')
p <- graph_sem(g, text_size = 4)
ggsave("dag_basic.png", p, device = "png", width = 4, height = 1)
knitr::include_graphics("dag_basic.png")
```

### 2. `distribution` (Nodes)

- **Usage**: References a function that generates data for exogenous variables, or that describes the residual distribution for endogenous variables. The function can reference the argument `n` to determine sample size.
For example, to specify a node comprising five groups with total sample size `n`, one could use `sample.int(n = 5, size = n, replace = TRUE)`. If the argument `n` is not explicitly provided, `theorytools` checks whether `n` is a formal argument of the function and, if so, assigns it.
- **Examples**:
    - `X [distribution="rnorm()"]`: Node `X` is an exogenous variable drawn from a normal distribution with default arguments.
    - `Y [distribution="rnorm()"]`: Node `Y` has residuals assumed to be normally distributed with default arguments.

```{r}
g <- dagitty('dag {
  X [distribution="rbinom(size = 2, prob = .5)"]
  Y [distribution="rnorm()"]
  X -> Y [form=".2*X"]
}')
```

### 3. `form` (Edges)

- **Usage**: Provides a formula-like specification for how a child node depends on its parent node(s). It should be something that `as.formula()` can parse.
- **Examples**:
    - `X -> Y [form=".2*X"]` indicates that `Y` is a linear function of `X` with coefficient `.2`.
    - `X -> Y [form="X:Z"]` indicates that `Y` depends on an interaction between `X` and `Z`.
    - `X -> Y [form="X^2"]` indicates that `Y` depends on a quadratic function of `X`.

```{r}
g <- dagitty('dag {
  X [distribution="rbinom(size = 2, prob = .5)"]
  Y [distribution="rnorm()"]
  X -> Y [form=".2*X"]
}')
```

## Example: Augmented DAG Specification

Below is a simple, hypothetical DAG showing how to combine these ideas.
This DAG posits:

- `X`: Number of study hours, an `exposure`. Values are randomly sampled from 1 to 20 hours.
- `Z`: Stress level, a covariate on the indirect path from `X` to `Y`, exponentially distributed (i.e., right-skewed; most people are not very stressed).
- `Y`: Exam performance, an outcome depending on `X` and `Z`, with normally distributed residuals.

```{r augmentedDAG, echo=TRUE, eval = FALSE}
g <- dagitty('dag {
  X [exposure, pos="0,0", label="Study Hours", distribution="sample.int(n = 20, size = n, replace = TRUE)"]
  Z [label="Stress Level", pos=".5,1", distribution="rexp()"]
  Y [outcome, pos="1,.2", label="Exam Performance", distribution="rnorm()"]
  X -> Y [label="direct", form="-X^2+4*X"]
  X -> Z
  Z -> Y [label="indirect", form="2*Z"]
}')
graph_sem(g, text_size = 3)
```

```{r eval = knitr::is_html_output(), echo = FALSE, out.width="60%"}
g <- dagitty('dag {
  X [exposure, pos="0,0", label="Study Hours", distribution="sample.int(n = 20, size = n, replace = TRUE)"]
  Z [label="Stress Level", pos=".5,1", distribution="rexp()"]
  Y [outcome, pos="1,.2", label="Exam Performance", distribution="rnorm()"]
  X -> Y [label="direct", form="-X^2+4*X"]
  X -> Z
  Z -> Y [label="indirect", form="2*Z"]
}')
p <- graph_sem(g, text_size = 4)
ggsave("dag_three.png", p, device = "png", width = 6, height = 3)
knitr::include_graphics("dag_three.png")
```

## Parsing DAG Properties

Augmented DAGs are interoperable with `dagitty`, but the `dagitty` package is not natively aware of the additional metadata fields used in `theorytools`, like `distribution` or `form`.
To access the augmented properties of aDAGs, the `theorytools` package uses `tidySEM`.
The purpose of the `tidySEM` package is to plot graphs (structural equation models and DAGs) as `ggplot` objects, which can be further customized using regular `ggplot2` code.
It contains parsing functions to extract nodes and edges from a variety of objects, including `dagitty` graphs.

The functions `get_nodes()` and `get_edges()` parse the nodes and edges of aDAGs, respectively:

```{r nodeProperties, echo=TRUE}
get_nodes(g)
get_edges(g)
```

## Interpreting `distribution` and `form` in Simulation

A primary motivation for these augmented properties is **simulation**.
For example, you might simulate data by:

1. Generating `X` from `sample.int(n)`.
2. Generating `Z` from `rexp(n)`.
3. Generating `Y` using a formula that includes `X` and `Z` plus a residual from `rnorm(n)`.

Code to simulate data in line with these metadata can be automatically generated:

```{r simulateData, echo=TRUE, eval=TRUE}
set.seed(1)
cat(simulate_data(g, run = FALSE), sep = "\n")
```

To illustrate, we show a scatter plot of data simulated using this code:

```{r}
library(ggplot2)
df <- simulate_data(g, run = TRUE)
ggplot(df, aes(x = X, y = Y, color = Z)) +
  geom_point()
```

You can use this script, for example, to generate synthetic data and build a reproducible analysis pipeline for a Preregistration-As-Code [@peikertReproducibleResearchTutorial2021; @vanlissaComplementingPreregisteredConfirmatory2022].

## Additional Notes

1. **Syntax Quoting**: In R, there are rules about using quotes within quotes. The `dagitty` package only recognizes double quotes (`" "`) inside graph specifications. This means you must **wrap the graph specification text in single quotes** (`' '`). Alternatively, you can escape every double quote inside the graph specification, which is not recommended because it is cumbersome.
2. **Multiple Parents**: If a node has multiple parents, you can either specify multiple edges with their own `form` properties or a single edge with a combined formula. The formulas are combined, and only unique terms are retained.
3. **Order of Declaration**: `dagitty` does not mind the order in which nodes are declared, but the graph must be acyclic so that a topological order (parents before children) exists for valid DAG generation and simulation.
4. **Integration with `dagitty` Functions**: The standard `dagitty` functions (e.g., `adjustmentSets()`) only look for recognized tags like `exposure` and `outcome`. They ignore custom properties like `distribution` and `form`, so these do not interfere with normal usage.

## References