The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
library(faux)
library(ggplot2)
library(ggExtra)
library(patchwork)
NORTA stands for Normal to Anything. This procedure simulates correlated data from a normal distribution and converts it to any other distribution using quantile functions. The challenge is determining what correlation between normally distributed variables is equivalent to a specified correlation between non-normally distributed variables.
The current implementation of rmulti()
is experimental.
It has six arguments:
n
the number of samples requireddist
A named vector of the distributions of each
variableparams
A list of lists of the arguments to pass to each
distribution functionr
the correlations among the variables (can be a single
number, vars*vars matrix, vars*vars vector, or a vars*(vars-1)/2
vector)empirical
logical. If true, params specify the sample
parameters, not the population parametersas.matrix
logical. If true, returns a matrixBy default, it returns a data frame with 100 rows of two normally distributed values with means of 0, standard deviations of 1, and a correlation of 0.
<- rmulti()
dat
get_params(dat)
#> n var A B mean sd
#> 1 100 A 1.00 -0.06 0.05 0.99
#> 2 100 B -0.06 1.00 0.05 0.93
The n
, r
, empirical
and
as.matrix
arguments work the same as the do for
rnorm_multi()
.
<- rmulti(n = 200, r = 0.5,
dat empirical = TRUE,
as.matrix = FALSE)
get_params(dat)
#> n var A B mean sd
#> 1 200 A 1.0 0.5 0 1
#> 2 200 B 0.5 1.0 0 1
You can set the distribution for each variable with the
dist
argument. The options are any distribution function in
the {stats} package, such as “norm”, “pois”, “binom”, and “unif”, plus
the “truncnorm” distribution from the {truncnorm} package and the
“likert” distribution from {faux}.
Set the params
argument to a named list with a vector of
named arguments for the random generation function for each
distribution. For example, use ?rpois
to find out what
arguments the rpois()
function needs to simulate values
from a poisson distribution. Don’t set n
in
params
; it will be set by the n
in the
rmulti()
function.
<- rmulti(n = 1000,
dat dist = c(uniform_var = "unif",
poisson_var = "pois"),
params = list(uniform_var = c(min = 0, max = 100),
poisson_var = c(lambda = 3)),
r = 0.5)
get_params(dat)
#> n var uniform_var poisson_var mean sd
#> 1 1000 uniform_var 1.00 0.49 50.62 28.67
#> 2 1000 poisson_var 0.49 1.00 2.98 1.75
You can also simulate more than two variables. Set the correlations using the upper right triangle or a correlation matrix (e.g., from a pilot dataset you’re trying to simulate).
<- rmulti(
dat n = 1000,
dist = c(N = "norm",
T = "truncnorm",
L = "likert"),
params = list(
N = list(mean = 100, sd = 10),
T = list(a = 1, b = 7, mean = 3.5, sd = 2.1),
L = list(prob = c(`much less` = .10,
`less` = .20,
`equal` = .35,
`more` = .25,
`much more` = .10))
),r = c(-0.5, -0.6, 0.7)
)
# convert likert-scale variable to integer
$L <- as.integer(dat$L)
datget_params(dat)
#> n var N T L mean sd
#> 1 1000 N 1.00 -0.47 -0.57 100.32 9.70
#> 2 1000 T -0.47 1.00 0.67 3.80 1.43
#> 3 1000 L -0.57 0.67 1.00 3.08 1.08
Not all correlations are possible for a given pair of distributions. At the extreme, it’s obvious that a normal distribution and a uniform distribution can’t be correlated 1.0, because they wouldn’t be different distributions then.
If you ask rmulti()
for a correlation that is too high
or low, you will get a message telling you the maximum and minimum
correlations that can be generated.
<- rmulti(
dat dist = c(A = "binom", B = "pois", C = "norm"),
params = list(A = list(size = 1, prob = 0.5),
B = list(lambda = 3),
C = list(mean = 100, sd = 10)),
r = c(0.8, 0.9, 0.5)
)#> Error in rmulti(dist = c(A = "binom", B = "pois", C = "norm"), params = list(A = list(size = 1, : Some of the correlations are not possible:
#> * A&B (-0.776 to 0.776)
#> * A&C (-0.798 to 0.798)
I made a few helper functions for rmulti()
. I’m not sure
if they’ll be useful to anyone else, but they’re available.
Fréchet-Hoefding bounds are the limits to a correlation for a pair of distributions.
fh_bounds(dist1 = "truncnorm",
dist2 = "beta",
params1 = list(a = 0, b = 10, mean = 5, sd = 3),
params2 = list(shape1 = 1, shape2 = 5))
#> $min
#> [1] -0.944
#>
#> $max
#> [1] 0.944
This gives the r-value you’d need to simulate for a pair of normally-distributed variables to achieve the target r-value after converting to the target distributions.
<- convert_r(
adjusted_r target_r = 0.75,
dist1 = "truncnorm",
dist2 = "binom",
params1 = list(a = 0, b = 10, mean = 5, sd = 3),
params2 = list(size = 1, prob = 0.5)
)
adjusted_r#> [1] 0.911
What the rmulti()
function does is use this adjusted r
to generate a multivariate normal distribution, then convert each
variable to the target distribution.
# simulate multivariate normal
<- rnorm_multi(n = 1000,
dat varnames = c("N1", "N2"),
r = adjusted_r,
empirical = TRUE)
# convert to target distributions
$T1 <- norm2trunc(dat$N1,
datmin = 0, max = 10,
mu = 5, sd = 3,
x_mu = 0, x_sd = 1)
$B2 = norm2binom(dat$N2,
datsize = 1, prob = 0.5,
mu = 0, sd = 1)
# check
get_params(dat)
#> n var N1 N2 T1 B2 mean sd
#> 1 1000 N1 1.00 0.91 0.99 0.73 0.00 1.00
#> 2 1000 N2 0.91 1.00 0.90 0.79 0.00 1.00
#> 3 1000 T1 0.99 0.90 1.00 0.76 5.02 2.38
#> 4 1000 B2 0.73 0.79 0.76 1.00 0.52 0.50
Note that the correlation between T1 and B2 is unlikely to be exactly 0.75 unless the n is very large, especially for distributions that have very few unique values.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.