library(ggplot2)
library(dplyr)
library(tidyr)
library(faux)

The rnorm_multi() function makes multiple normally distributed vectors with specified parameters and relationships.
For example, the following creates a sample that has 100 observations of 3 variables, drawn from a population where A has a mean of 0 and SD of 1, while B and C have means of 20 and SDs of 5. A correlates with B and C with r = 0.5, and B and C correlate with r = 0.25.
dat <- rnorm_multi(n = 100, 
                  mu = c(0, 20, 20),
                  sd = c(1, 5, 5),
                  r = c(0.5, 0.5, 0.25), 
                  varnames = c("A", "B", "C"),
                  empirical = FALSE)

| n | var | A | B | C | mean | sd | 
|---|---|---|---|---|---|---|
| 100 | A | 1.00 | 0.49 | 0.51 | -0.04 | 1.04 | 
| 100 | B | 0.49 | 1.00 | 0.19 | 19.95 | 4.91 | 
| 100 | C | 0.51 | 0.19 | 1.00 | 19.64 | 4.61 | 
Table: Sample stats
You can specify the correlations in one of four ways:
If you want all the pairs to have the same correlation, just specify a single number.
bvn <- rnorm_multi(100, 5, 0, 1, .3, varnames = letters[1:5])

| n | var | a | b | c | d | e | mean | sd | 
|---|---|---|---|---|---|---|---|---|
| 100 | a | 1.00 | 0.18 | 0.29 | 0.33 | 0.31 | 0.04 | 1.03 | 
| 100 | b | 0.18 | 1.00 | 0.18 | 0.33 | 0.30 | 0.13 | 1.06 | 
| 100 | c | 0.29 | 0.18 | 1.00 | 0.14 | 0.20 | 0.07 | 0.99 | 
| 100 | d | 0.33 | 0.33 | 0.14 | 1.00 | 0.28 | 0.15 | 1.06 | 
| 100 | e | 0.31 | 0.30 | 0.20 | 0.28 | 1.00 | 0.03 | 1.03 | 
Table: Sample stats from a single rho
If you already have a correlation matrix, such as the output of cor(), you can specify the simulated data with that.
cmat <- cor(iris[,1:4])
bvn <- rnorm_multi(100, 4, 0, 1, cmat, 
                  varnames = colnames(cmat))

| n | var | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | mean | sd | 
|---|---|---|---|---|---|---|---|
| 100 | Sepal.Length | 1.00 | -0.24 | 0.87 | 0.82 | 0.09 | 0.98 | 
| 100 | Sepal.Width | -0.24 | 1.00 | -0.58 | -0.52 | 0.07 | 1.08 | 
| 100 | Petal.Length | 0.87 | -0.58 | 1.00 | 0.96 | 0.04 | 1.03 | 
| 100 | Petal.Width | 0.82 | -0.52 | 0.96 | 1.00 | 0.05 | 1.04 | 
Table: Sample stats from a correlation matrix
You can specify your correlation matrix by hand as a vars*vars length vector, which will include the correlations of 1 down the diagonal.
cmat <- c(1, .3, .5,
          .3, 1, 0,
          .5, 0, 1)
bvn <- rnorm_multi(100, 3, 0, 1, cmat, 
                  varnames = c("first", "second", "third"))

| n | var | first | second | third | mean | sd | 
|---|---|---|---|---|---|---|
| 100 | first | 1.00 | 0.31 | 0.48 | 0.05 | 1.02 | 
| 100 | second | 0.31 | 1.00 | 0.01 | -0.14 | 0.86 | 
| 100 | third | 0.48 | 0.01 | 1.00 | 0.02 | 1.12 | 
Table: Sample stats from a vars*vars vector
You can specify your correlation matrix by hand as a vars*(vars-1)/2 length vector, skipping the diagonal and lower left duplicate values.
rho1_2 <- .3
rho1_3 <- .5
rho1_4 <- .5
rho2_3 <- .2
rho2_4 <- 0
rho3_4 <- -.3
cmat <- c(rho1_2, rho1_3, rho1_4, rho2_3, rho2_4, rho3_4)
bvn <- rnorm_multi(100, 4, 0, 1, cmat, 
                  varnames = letters[1:4])

| n | var | a | b | c | d | mean | sd | 
|---|---|---|---|---|---|---|---|
| 100 | a | 1.00 | 0.29 | 0.61 | 0.41 | -0.10 | 1.06 | 
| 100 | b | 0.29 | 1.00 | 0.23 | -0.03 | 0.09 | 1.14 | 
| 100 | c | 0.61 | 0.23 | 1.00 | -0.28 | 0.08 | 1.17 | 
| 100 | d | 0.41 | -0.03 | -0.28 | 1.00 | -0.12 | 0.97 | 
Table: Sample stats from a (vars*(vars-1)/2) vector
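
To see how the short vector maps onto a full matrix, the upper-triangle values can be expanded by hand in base R. This is only an illustrative sketch: `expand_upper()` is a hypothetical helper, not part of faux, and it assumes the values are listed row by row across the upper triangle, as in the example above.

```r
# Hypothetical helper (not a faux function): expand a vars*(vars-1)/2 vector
# of upper-triangle correlations, listed row by row, into a full matrix.
expand_upper <- function(r, vars) {
  stopifnot(length(r) == vars * (vars - 1) / 2)
  m <- matrix(0, nrow = vars, ncol = vars)
  # Filling the lower triangle column by column is the mirror image of
  # filling the upper triangle row by row.
  m[lower.tri(m)] <- r
  m <- m + t(m)   # copy the values across the diagonal
  diag(m) <- 1    # every variable correlates perfectly with itself
  m
}

full_cmat <- expand_upper(c(.3, .5, .5, .2, 0, -.3), vars = 4)
full_cmat
```

The resulting 4 x 4 matrix holds the same information as the six-value vector, so either form describes the same set of correlations.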
If you want your samples to have the exact correlations, means, and SDs you entered, set empirical to TRUE.
bvn <- rnorm_multi(100, 5, 0, 1, .3, 
                  varnames = letters[1:5], 
                  empirical = TRUE)

| n | var | a | b | c | d | e | mean | sd | 
|---|---|---|---|---|---|---|---|---|
| 100 | a | 1.0 | 0.3 | 0.3 | 0.3 | 0.3 | 0 | 1 | 
| 100 | b | 0.3 | 1.0 | 0.3 | 0.3 | 0.3 | 0 | 1 | 
| 100 | c | 0.3 | 0.3 | 1.0 | 0.3 | 0.3 | 0 | 1 | 
| 100 | d | 0.3 | 0.3 | 0.3 | 1.0 | 0.3 | 0 | 1 | 
| 100 | e | 0.3 | 0.3 | 0.3 | 0.3 | 1.0 | 0 | 1 | 
Table: Sample stats with empirical = TRUE
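
As a rough illustration of what "exact" sample statistics means, the base R sketch below forces a random sample to match a target mean vector and covariance matrix. It is one standard way to do this (centre, whiten, then recolour the sample) and is not claimed to be how faux implements `empirical = TRUE`; all variable names and the seed are assumptions made for the example.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

n   <- 100
mu  <- c(0, 0, 0)
sds <- c(1, 1, 1)
rho <- matrix(.3, nrow = 3, ncol = 3)
diag(rho) <- 1
sigma <- diag(sds) %*% rho %*% diag(sds)     # target covariance matrix

x <- matrix(rnorm(n * 3), nrow = n)
x <- scale(x, center = TRUE, scale = FALSE)  # sample means become exactly 0
x <- x %*% solve(chol(cov(x)))               # sample covariance becomes identity
x <- x %*% chol(sigma)                       # sample covariance becomes sigma
x <- sweep(x, 2, mu, "+")                    # shift to the target means

round(cor(x), 10)
```

Without the centring and whitening steps, the sample would only approximate the targets, as in the earlier tables.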
Use rnorm_pre() to create a vector with a specified correlation to one or more pre-existing variables. The following code creates a new column called B with a mean of 10, an SD of 2, and a correlation of r = 0.5 with the A column.
dat <- rnorm_multi(varnames = "A") %>%
  mutate(B = rnorm_pre(A, mu = 10, sd = 2, r = 0.5))

| n | var | A | B | mean | sd | 
|---|---|---|---|---|---|
| 100 | A | 1.00 | 0.37 | -0.03 | 1.10 | 
| 100 | B | 0.37 | 1.00 | 10.02 | 2.28 | 
Set empirical = TRUE to return a vector with the exact specified parameters.
dat$C <- rnorm_pre(dat$A, mu = 10, sd = 2, r = 0.5, empirical = TRUE)

| n | var | A | B | C | mean | sd | 
|---|---|---|---|---|---|---|
| 100 | A | 1.00 | 0.37 | 0.50 | -0.03 | 1.10 | 
| 100 | B | 0.37 | 1.00 | 0.15 | 10.02 | 2.28 | 
| 100 | C | 0.50 | 0.15 | 1.00 | 10.00 | 2.00 | 
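
The idea of building a vector with an exact correlation to an existing one can be sketched in base R. The helper below, `correlated_with()`, is hypothetical and is not faux's implementation of rnorm_pre(); it mixes the standardised existing vector with random noise that has been made uncorrelated with it.

```r
set.seed(2)  # arbitrary seed, only for reproducibility

# Hypothetical helper (not faux's rnorm_pre): build a vector whose sample
# correlation with x is exactly r, with sample mean mu and sample SD sd.
correlated_with <- function(x, mu = 0, sd = 1, r = 0) {
  stopifnot(abs(r) <= 1)
  noise <- rnorm(length(x))
  e  <- resid(lm(noise ~ x))          # part of the noise uncorrelated with x
  zx <- (x - mean(x)) / stats::sd(x)  # standardised x
  ze <- e / stats::sd(e)              # standardised residual (mean is already 0)
  y  <- r * zx + sqrt(1 - r^2) * ze   # unit variance, correlation r with x
  mu + sd * y
}

a <- rnorm(100)
b <- correlated_with(a, mu = 10, sd = 2, r = 0.5)
c(cor = cor(a, b), mean = mean(b), sd = sd(b))
```

Because the residual is exactly uncorrelated with `a` in the sample, the weights `r` and `sqrt(1 - r^2)` give the requested correlation exactly rather than approximately.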
You can also specify correlations to more than one vector by setting the first argument to a data frame containing only the continuous columns and r to the correlation with each column.
dat$D <- rnorm_pre(dat, r = c(.1, .2, .3), empirical = TRUE)

| n | var | A | B | C | D | mean | sd | 
|---|---|---|---|---|---|---|---|
| 100 | A | 1.00 | 0.37 | 0.50 | 0.1 | -0.03 | 1.10 | 
| 100 | B | 0.37 | 1.00 | 0.15 | 0.2 | 10.02 | 2.28 | 
| 100 | C | 0.50 | 0.15 | 1.00 | 0.3 | 10.00 | 2.00 | 
| 100 | D | 0.10 | 0.20 | 0.30 | 1.0 | 0.00 | 1.00 | 
Not all correlation patterns are possible, so you’ll get a warning if the correlations you ask for are impossible.
dat$E <- rnorm_pre(dat, r = .9)
#> Warning in rnorm_pre(dat, r = 0.9): Correlations are impossible.
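
A general reason a pattern can be impossible is that the full correlation matrix it implies is not positive semi-definite. The base R sketch below checks this with eigenvalues; `is_possible()` and its tolerance are assumptions for illustration, and faux's own check may work differently.

```r
# Hypothetical helper (not a faux function): a correlation matrix is only
# achievable if none of its eigenvalues is negative.
is_possible <- function(cmat, tol = 1e-8) {
  ev <- eigen(cmat, symmetric = TRUE, only.values = TRUE)$values
  all(ev >= -tol)
}

# The correlations from the first example: A-B = .5, A-C = .5, B-C = .25
ok  <- matrix(c( 1,  .5,  .5,
                .5,   1, .25,
                .5, .25,   1), nrow = 3)

# Two variables both correlate .9 with a third but -.9 with each other
bad <- matrix(c( 1,  .9,  .9,
                .9,   1, -.9,
                .9, -.9,   1), nrow = 3)

is_possible(ok)
is_possible(bad)
```

The second matrix fails the check because two variables that both correlate strongly and positively with a third cannot also correlate strongly and negatively with each other.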