Simulate Correlated Variables

Lisa DeBruine

2023-04-18

library(ggplot2)
library(dplyr)
library(tidyr)
library(faux)

The rnorm_multi() function makes multiple normally distributed vectors with specified parameters and relationships.

Quick example

For example, the following creates a sample that has 100 observations of 3 variables, drawn from a population where A has a mean of 0 and SD of 1, while B and C have means of 20 and SDs of 5. A correlates with B and C with r = 0.5, and B and C correlate with r = 0.25.


dat <- rnorm_multi(n = 100, 
                  mu = c(0, 20, 20),
                  sd = c(1, 5, 5),
                  r = c(0.5, 0.5, 0.25), 
                  varnames = c("A", "B", "C"),
                  empirical = FALSE)

  n  var     A     B     C   mean    sd
100  A    1.00  0.49  0.51  -0.04  1.04
100  B    0.49  1.00  0.19  19.95  4.91
100  C    0.51  0.19  1.00  19.64  4.61

Table: Sample stats
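The summary tables in this vignette list n, the pairwise correlations, and each variable's mean and SD. The code that produced them is not shown here, so the following is only a minimal base-R sketch of an equivalent check. The helper name sample_stats() is hypothetical and not part of faux. It is demonstrated on the built-in iris data so that it runs on its own, and it could be applied to dat in the same way.

```r
# Hypothetical helper (not part of faux): summarise the numeric columns
# of a data frame as n, pairwise correlations, mean, and SD.
sample_stats <- function(df, digits = 2) {
  num <- df[sapply(df, is.numeric)]        # keep numeric columns only
  out <- data.frame(
    n    = nrow(num),                      # recycled to one row per variable
    var  = names(num),
    round(cor(num), digits),               # one column per variable
    mean = round(colMeans(num), digits),
    sd   = round(sapply(num, sd), digits)
  )
  rownames(out) <- NULL                    # drop row names inherited from cor()
  out
}

sample_stats(iris)
```

Because the correlations are rounded to two decimal places, small differences between the requested and observed values are expected whenever the data are sampled rather than fixed.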

Specify correlations

You can specify the correlations in one of four ways:

One Number

If you want all the pairs to have the same correlation, just specify a single number.

bvn <- rnorm_multi(100, 5, 0, 1, .3, varnames = letters[1:5])

  n  var     a     b     c     d     e  mean    sd
100  a    1.00  0.18  0.29  0.33  0.31  0.04  1.03
100  b    0.18  1.00  0.18  0.33  0.30  0.13  1.06
100  c    0.29  0.18  1.00  0.14  0.20  0.07  0.99
100  d    0.33  0.33  0.14  1.00  0.28  0.15  1.06
100  e    0.31  0.30  0.20  0.28  1.00  0.03  1.03

Table: Sample stats from a single rho
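A single number can be read as shorthand for a correlation matrix in which every off-diagonal entry has that value. As a rough base-R sketch (this is an illustration of the idea, not necessarily how faux expands the value internally), the equivalent full matrix for the example above could be built like this:

```r
r    <- 0.3   # the single correlation shared by every pair
vars <- 5     # number of variables

# Fill the whole matrix with r, then put 1s on the diagonal
cmat <- matrix(r, nrow = vars, ncol = vars,
               dimnames = list(letters[1:vars], letters[1:vars]))
diag(cmat) <- 1

cmat
```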

Matrix

If you already have a correlation matrix, such as the output of cor(), you can specify the simulated data with that.

cmat <- cor(iris[,1:4])
bvn <- rnorm_multi(100, 4, 0, 1, cmat, 
                  varnames = colnames(cmat))

  n  var           Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  mean    sd
100  Sepal.Length          1.00        -0.24          0.87         0.82  0.09  0.98
100  Sepal.Width          -0.24         1.00         -0.58        -0.52  0.07  1.08
100  Petal.Length          0.87        -0.58          1.00         0.96  0.04  1.03
100  Petal.Width           0.82        -0.52          0.96         1.00  0.05  1.04

Table: Sample stats from a correlation matrix

Vector (vars*vars)

You can specify your correlation matrix by hand as a vars*vars length vector, which will include the correlations of 1 down the diagonal.

cmat <- c(1, .3, .5,
          .3, 1, 0,
          .5, 0, 1)
bvn <- rnorm_multi(100, 3, 0, 1, cmat, 
                  varnames = c("first", "second", "third"))

  n  var     first  second  third   mean    sd
100  first    1.00    0.31   0.48   0.05  1.02
100  second   0.31    1.00   0.01  -0.14  0.86
100  third    0.48    0.01   1.00   0.02  1.12

Table: Sample stats from a vars*vars vector
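When typing a vars*vars vector by hand it is easy to mistype one of the mirrored values. A quick base-R sanity check, sketched here with the variable names cvec and vars chosen only for illustration, is to reshape the vector into a matrix and confirm that it is symmetric with 1s on the diagonal:

```r
cvec <- c( 1, .3, .5,
          .3,  1,  0,
          .5,  0,  1)

vars <- sqrt(length(cvec))                 # 9 values -> 3 variables
cmat <- matrix(cvec, nrow = vars, byrow = TRUE)

# A valid correlation matrix is symmetric with a unit diagonal
stopifnot(isSymmetric(cmat), all(diag(cmat) == 1))
cmat
```

Because the matrix is symmetric, it makes no difference whether the vector is read by row or by column.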

Vector (vars*(vars-1)/2)

You can specify your correlation matrix by hand as a vars*(vars-1)/2 length vector, skipping the diagonal and lower left duplicate values.

rho1_2 <- .3
rho1_3 <- .5
rho1_4 <- .5
rho2_3 <- .2
rho2_4 <- 0
rho3_4 <- -.3
cmat <- c(rho1_2, rho1_3, rho1_4, rho2_3, rho2_4, rho3_4)
bvn <- rnorm_multi(100, 4, 0, 1, cmat, 
                  varnames = letters[1:4])

  n  var      a      b      c      d   mean    sd
100  a     1.00   0.29   0.61   0.41  -0.10  1.06
100  b     0.29   1.00   0.23  -0.03   0.09  1.14
100  c     0.61   0.23   1.00  -0.28   0.08  1.17
100  d     0.41  -0.03  -0.28   1.00  -0.12  0.97

Table: Sample stats from a (vars*(vars-1)/2) vector
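The order of the values in the short vector matters: in the example above they run along the upper-right triangle row by row (1-2, 1-3, 1-4, 2-3, 2-4, 3-4). The following base-R sketch shows how such a vector maps onto a full matrix. It is only an illustration of the ordering, not faux's own code, and the names tri and full are made up for this example.

```r
tri  <- c(.3, .5, .5, .2, 0, -.3)   # rho1_2, rho1_3, rho1_4, rho2_3, rho2_4, rho3_4
vars <- 4

full <- matrix(0, nrow = vars, ncol = vars)
# R fills the lower triangle column by column, which is the same order
# as reading the upper triangle row by row
full[lower.tri(full)] <- tri
full <- full + t(full)              # mirror into the upper triangle
diag(full) <- 1

full
```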

empirical

If you want your samples to have the exact correlations, means, and SDs you entered, set empirical to TRUE.

bvn <- rnorm_multi(100, 5, 0, 1, .3, 
                  varnames = letters[1:5], 
                  empirical = TRUE)

  n  var    a    b    c    d    e  mean  sd
100  a    1.0  0.3  0.3  0.3  0.3     0   1
100  b    0.3  1.0  0.3  0.3  0.3     0   1
100  c    0.3  0.3  1.0  0.3  0.3     0   1
100  d    0.3  0.3  0.3  1.0  0.3     0   1
100  e    0.3  0.3  0.3  0.3  1.0     0   1

Table: Sample stats with empirical = TRUE
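The distinction is between parameters that describe the population the sample is drawn from and parameters that describe the sample itself. MASS::mvrnorm() has an empirical argument with the same meaning, so it can be used for a self-contained comparison. This sketch uses MASS rather than faux, and the seed is arbitrary.

```r
sigma <- matrix(0.3, nrow = 5, ncol = 5)
diag(sigma) <- 1                     # SDs of 1, so covariance = correlation

set.seed(8675309)                    # arbitrary seed, for reproducibility only
x_pop <- MASS::mvrnorm(100, mu = rep(0, 5), Sigma = sigma, empirical = FALSE)
x_emp <- MASS::mvrnorm(100, mu = rep(0, 5), Sigma = sigma, empirical = TRUE)

# Largest deviation of the sample correlations from the requested 0.3
max(abs(cor(x_pop) - sigma))         # sampling error, noticeably above zero
max(abs(cor(x_emp) - sigma))         # zero up to floating-point error
```

Exact sample statistics are convenient for demonstrations, but they remove the sampling variability that a power analysis or simulation study usually needs, so the default of FALSE is typically the better choice there.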

Pre-existing variables

Use rnorm_pre() to create a vector with a specified correlation to one or more pre-existing variables. The following code creates a new column called B with a mean of 10, an SD of 2, and a correlation of r = 0.5 with the A column.

dat <- rnorm_multi(varnames = "A") %>%
  mutate(B = rnorm_pre(A, mu = 10, sd = 2, r = 0.5))

  n  var     A     B   mean    sd
100  A    1.00  0.37  -0.03  1.10
100  B    0.37  1.00  10.02  2.28

Set empirical = TRUE to return a vector with the exact specified parameters.

dat$C <- rnorm_pre(dat$A, mu = 10, sd = 2, r = 0.5, empirical = TRUE)

  n  var     A     B     C   mean    sd
100  A    1.00  0.37  0.50  -0.03  1.10
100  B    0.37  1.00  0.15  10.02  2.28
100  C    0.50  0.15  1.00  10.00  2.00
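To see why an exact sample correlation with an existing vector is achievable, it can help to build one by hand. The following base-R sketch uses a standard construction: mix the standardised existing vector with noise that has been made exactly uncorrelated with it. This is an illustration only and is not claimed to be how rnorm_pre() is implemented. All variable names are made up for the example.

```r
set.seed(1)                          # arbitrary seed
a <- rnorm(100)                      # the pre-existing variable
r <- 0.5                             # target sample correlation

noise <- rnorm(100)
e <- resid(lm(noise ~ a))            # noise with its linear relation to a removed

z_a <- as.vector(scale(a))           # standardise both parts (mean 0, SD 1)
z_e <- as.vector(scale(e))

# Weighted mix has SD 1 and correlation r with a; then rescale to mean 10, SD 2
b <- 10 + 2 * (r * z_a + sqrt(1 - r^2) * z_e)

c(cor = cor(a, b), mean = mean(b), sd = sd(b))
```

Skipping the residualising and standardising steps gives a vector whose correlation with a is only approximately r, which mirrors the behaviour shown for empirical = FALSE.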

You can also specify correlations to more than one vector by setting the first argument to a data frame containing only the continuous columns and r to the correlation with each column.

dat$D <- rnorm_pre(dat, r = c(.1, .2, .3), empirical = TRUE)

  n  var     A     B     C    D   mean    sd
100  A    1.00  0.37  0.50  0.1  -0.03  1.10
100  B    0.37  1.00  0.15  0.2  10.02  2.28
100  C    0.50  0.15  1.00  0.3  10.00  2.00
100  D    0.10  0.20  0.30  1.0   0.00  1.00

Not all correlation patterns are possible, so you’ll get a warning if the correlations you ask for are impossible.

dat$E <- rnorm_pre(dat, r = .9)
#> Warning in rnorm_pre(dat, r = 0.9): Correlations are impossible.
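A set of correlations is only achievable if the full correlation matrix has no negative eigenvalues. The base-R sketch below checks this for a small made-up case, separate from dat: two variables that are uncorrelated with each other cannot both correlate 0.9 with a third.

```r
# X and Y are uncorrelated, but Z is asked to correlate 0.9 with both
cmat <- matrix(c( 1,  0, .9,
                  0,  1, .9,
                 .9, .9,  1), nrow = 3, byrow = TRUE)

ev <- eigen(cmat, symmetric = TRUE, only.values = TRUE)$values
ev

# A negative eigenvalue means no data set can have these correlations
all(ev > 0)   # FALSE
```

Running a check like this before simulating can make it easier to see which requested value needs to be lowered.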
