simulateMvMatrix.Rd
Simulate a multivariate matrix of random numbers from specified theoretical probability distributions and/or empirical probability distributions based on a specified rank correlation matrix, using either Latin Hypercube sampling or simple random sampling.
simulateMvMatrix(n, distributions = c(Var.1 = "norm", Var.2 = "norm"),
param.list = list(Var.1 = list(mean = 0, sd = 1), Var.2 = list(mean = 0, sd = 1)),
cor.mat = diag(length(distributions)), sample.method = "SRS", seed = NULL,
left.tail.cutoff = ifelse(is.finite(supp.min), 0, .Machine$double.eps),
right.tail.cutoff = ifelse(is.finite(supp.max), 0, .Machine$double.eps),
tol.1 = .Machine$double.eps, tol.symmetry = .Machine$double.eps,
tol.recip.cond.num = .Machine$double.eps, max.iter = 10)
a positive integer indicating the number of random vectors (i.e., the number of rows of the matrix) to generate.
a character vector of length \(k\) denoting the distribution abbreviations for
each of the \(k\) distributions. If there is a names
attribute associated
with this character vector, these names will be the column names of the resulting
matrix. The default value of distributions
is
c(Var.1="norm", Var.2="norm")
, indicating that \(k=2\), both
distributions are the normal distribution, and the column names of the
resulting \(n \times k\) matrix will be "Var.1"
and "Var.2"
.
See the help file for Distribution.df
for a list of possible
distribution abbreviations.
Alternatively, the character string "emp"
may be used to denote sampling
from an empirical distribution based on a set of observations. The vector
containing the observations is specified in the argument param.list
.
a list containing \(k\) lists that specify the values for the parameters of
the \(k\) distributions. If param.list
has a names
attribute
(not necessary), the names
attribute should be exactly the same as the
names
attribute of the argument distributions
. The default value
of param.list
is list(Var.1=list(mean=0, sd=1), Var.2=list(mean=0, sd=1))
.
See the help file for Distribution.df
for the names and
possible values of the parameters associated with each distribution.
Alternatively, if you specify an empirical distribution for the \(j\)'th
distribution by setting the \(j\)'th element of distribution
to
"emp"
, then the \(j\)'th component of param.list
must be a
list of the form list(obs=
name)
, where name denotes the
name of the vector containing the observations to use for the empirical
distribution. In this case, you may also supply arguments to the
qemp
function through the \(j\)'th component of param.list
.
For example, you may set this component to
list(obs=
name, discrete=T)
to
specify an empirical distribution based on a discrete random variable.
a \(k \times k\) matrix specifying the rank correlations between the \(k\) distributions. This argument must be a positive definite symmetric matrix, with all 1's on the diagonal. All elements on the off-diagonal must be between -1 and 1. The default value is the \(k \times k\) identity matrix, specifying no rank correlation between any of the variables.
a character vector of length 1 or \(k\) indicating, for each distribution,
whether to use Latin Hypercube sampling or simple random sampling. If
sample.method
is of length 1, it is replicated to length \(k\).
Each element of sample.method
must be the character string
"LHS"
(Latin Hypercube sampling) or "SRS"
(simple random sampling),
or an abbreviation of one of these strings. The default value is "SRS"
,
indicating simple random sampling for each distribution. Note that by specifying
sample.method
as a vector of length \(k\), you may use different sampling
methods for different distributions.
integer to supply to the R function set.seed
.
The default value is seed=NULL
, in which case the random seed is
not set but instead based on the current value of .Random.seed
.
a numeric vector of length \(k\) indicating, for each distribution,
what proportion of the left-tail of the probability distribution to omit
for Latin Hypercube sampling. All elements of left.tail.cutoff
must be between 0 and 1.
For densities with a finite support minimum (e.g., Lognormal or
Empirical) the default value is left.tail.cutoff=0
;
for densities with a support minimum of \(-\infty\), the default value is
left.tail.cutoff=.Machine$double.eps
.
The \(j\)'th element of this argument is ignored if
the \(j\)'th element of sample.method
is equal to "SRS"
.
a numeric vector of length \(k\) indicating, for each distribution,
what proportion of the right-tail of the probability distribution to omit
for Latin Hypercube sampling. All elements of right.tail.cutoff
must be between 0 and 1.
For densities with a finite support maximum (e.g., Beta or
Empirical) the default value is right.tail.cutoff=0
;
for densities with a support maximum of \(\infty\), the default value
is right.tail.cutoff=.Machine$double.eps
.
The \(j\)'th element of this argument is ignored if
the \(j\)'th element of sample.method
is equal to "SRS"
.
a positive numeric scalar indicating the allowable absolute deviation
from 1 for the diagonal elements of cor.mat
. The default value
is .Machine$double.eps
.
a positive numeric scalar indicating the allowable absolute deviation from
0 for the difference between symmetric elements of cor.mat
(e.g., abs(cor.mat[3,2]-cor.mat[2,3])
. The default value is .Machine$double.eps
.
a positive numeric scalar indicating the allowable minimum value of the
reciprocal of the condition number for cor.mat
. The condition number
is defined to be the largest eigen value divided by the smallest eigen value.
The reciprocal of the condition number is some number between 0 and 1.
This value must be sufficiently large for cor.mat
to be of full rank
(i.e., to not be singular). The default value of tol.recip.cond.num
is .Machine$double.eps
.
a positive integer indicating the maximum number of iterations to use to
produce the \(R\) matrix in the algorithm to create the output matrix.
The sample correlation matrix of \(R\) must be positive definite.
The number of iterations will rarely be more than 2 for moderate to
large sample sizes (e.g., \(n > 2k\)). The default value is
max.iter=10
. See the DETAILS section below for more information on
the \(R\) matrix.
Motivation
In risk assessment and Monte Carlo simulation, the outcome variable of interest,
say \(Y\), is usually some function of one or more other random variables:
$$Y = h(\underline{X}) = h(X_1, X_2, \ldots, X_k) \;\;\;\;\;\; (1)$$
For example, \(Y\) may be the incremental lifetime cancer risk due to
ingestion of soil contaminated with benzene (Thompson et al., 1992;
Hamed and Bedient, 1997). In this case the random vector \(\underline{X}\)
may represent observations from several kinds of distributions that characterize
exposure and dose-response, such as benzene concentration in the soil,
soil ingestion rate, average body weight, the cancer potency factor for benzene,
etc. These distributions may or may not be assumed to be independent of one
another (Smith et al., 1992; Bukowski et al., 1995). Often, input variables in a
Monte Carlo simulation are in fact known to be correlated, such as body weight
and dermal area.
Characterizing the joint distribution of a random vector \(\underline{X}\),
where different elements of \(\underline{X}\) come from different distributions,
is usually mathematically complex or impossible unless the elements
(random variables) of \(\underline{X}\) are independent.
Iman and Conover (1982) present an algorithm for creating a set of \(n\)
multivariate observations with a rank correlation matrix that is approximately
equal to a specified rank correlation matrix. This method allows for different
probability distributions for each element of the multivariate vector. The
details of this algorithm are as follows.
Algorithm
Specify \(n\), the desired number of random vectors (i.e., number of
rows of the \(n \times k\) output matrix). This is specified by the
argument n
for the function simulateMvMatrix
.
Create \(C\), the desired \(k \times k\) correlation matrix. This is
specified by the argument cor.mat
.
Compute \(P\), where \(P\) is a lower triangular \(k \times k\)
matrix and
$$PP^{'} = C \;\;\;\;\;\; (2)$$
where \(P^{'}\) denotes the transpose of \(P\). The function
simulateMvMatrix
uses the Cholesky decomposition to compute P
(see the R help file for chol
).
Create \(R\), an \(n \times k\) matrix, whose columns represent \(k\) independent permutations of van der Waerden scores. That is, each column of \(R\) is a random permutation of the scores $$\Phi^{-1}(\frac{i}{n+1}), \; i = 1, 2, \ldots, n \;\;\;\;\;\; (3)$$ where \(\Phi\) denotes the cumulative distribution function of the standard normal distribution.
Compute \(T\), the \(k \times k\) Pearson sample correlation matrix of \(R\). Make sure \(T\) is positive definite; if it is not, then repeat step 4.
Compute \(Q\), where \(Q\) is a lower triangular \(k \times k\)
matrix and
$$QQ^{'} = T \;\;\;\;\;\; (4)$$
The function simulateMvMatrix
uses the Cholesky decomposition to compute
\(Q\) (see the R help file for chol
).
Compute the lower triangular \(k \times k\) matrix \(S\), where $$S = PQ^{-1} \;\;\;\;\;\; (5)$$
Compute the matrix \(R^{*}\), where $$R^{*} = RS^{'} \;\;\;\;\;\; (6)$$
Generate an \(n \times k\) matrix of random numbers \(\underline{X}\),
where each column of \(\underline{X}\) comes from the distribution specified
by the arguments distributions
and param.list
. Generate each
column of random numbers independently of the other columns. If the \(j\)'th
element of sample.method
equals "SRS"
, use simple random sampling
to generate the random numbers for the \(j\)'th column of \(\underline{X}\).
If the \(j\)'th element of sample.method
equals "LHS"
, use
Latin Hypercube sampling to generate the random numbers for the \(j\)'th column
of \(\underline{X}\). At this stage in the algorithm, the function
simulateMvMatrix
calls the function simulateVector
to
create each column of \(\underline{X}\).
Order the observations within each column of \(\underline{X}\) so that the order of the ranks within each column of \(\underline{X}\) matches the order of the ranks within each column of \(R^{*}\). This way, \(\underline{X}\) and \(R^{*}\) have exactly the same sample rank correlation matrix.
Explanation
Iman and Conover (1982) present two algorithms for computing an \(n \times k\)
output matrix with a specified rank correlation. The algorithm presented above is
the second, more complicated one. In order to explain the reasoning behind this
algorithm, we need to explain the simple algorithm first.
Simple Algorithm
Let \(R_i\) denote the \(i\)'th row vector of the matrix \(R\), the
matrix of scores. This row vector has a population correlation matrix of \(I\),
where \(I\) denotes the \(k \times k\) identity matrix. Thus, the
\(1 \times k\) vector \(R_i P^{'}\) has a population correlation matrix equal to
\(C\). Therefore, if we define \(R^{*}\) by
$$R^{*} = RP^{'} \;\;\;\;\;\; (7)$$
each row of \(R^{*}\) has the same multivariate distribution with population
correlation matrix \(C\). The rank correlation matrix of \(R^{*}\) should
therefore be close to \(C\). Ordering the columns of \(\underline{X}\) as
described in Step 10 above will yield a matrix of observations with the
specified distributions and the exact same rank correlation matrix as the
rank correlation matrix of \(R^{*}\).
Iman and Conover (1982) use van der Waerden scores instead of raw ranks to create \(R\) because van der Waerden scores yield more "natural-looking" pairwise scatterplots.
If the Pearson sample correlation matrix of \(R\), denoted \(T\) in Step 5
above, is exactly equal to the true population correlation matrix \(I\),
then the sample correlation matrix of \(R^{*}\) is exactly equal to \(C\),
and the rank correlation matrix of \(R^{*}\) is approximately equal to \(C\).
The Pearson sample correlation matrix of \(R\), however, is an estimate of the
true population correlation matrix \(I\), and is therefore
“bouncing around” \(I\). Likewise, the Pearson sample correlation matrix
of \(R^{*}\) is an estimate of the true population correlation matrix \(C\),
and is therefore bouncing around \(C\). Using this simple algorithm, the
Pearson sample correlation matrix of \(R^{*}\), as \(R^{*}\) is defined in
Equation (7) above, may not be “close” enough to the desired rank
correlation matrix \(C\), and thus the rank correlation of \(R^{*}\) will not
be close enough to \(C\). Iman and Conover (1982), therefore present a more
complicated algorithm.
More Complicated Algorithm
To get around the problem mentioned above, Iman and Conover (1982) find a
\(k \times k\) lower triangular matrix \(S\) such that the matrix \(R^{*}\)
as defined in Equation (6) above has a correlation matrix exactly equal to \(C\).
The formula for \(S\) is given in Steps 6 and 7 of the algorithm above.
Iman and Conover (1982, p.330) note that even if the desired rank correlation matrix \(C\) is in fact the identity matrix \(I\), this method of generating the matrix will produce a matrix with an associated rank correlation that more closely resembles \(I\) than you would get by simply generating random numbers within each column of \(\underline{X}\).
A numeric matrix of dimension \(n \times k\) of random numbers,
where the \(j\)'th column of numbers comes from the distribution
specified by the \(j\)'th elements of the arguments distributions
and param.list
, and the rank correlation of this matrix is
approximately equal to the argument cor.mat
. The value of \(n\)
is determined by the argument n
, and the value of \(k\) is
determined by the length of the argument distributions
.
Bukowski, J., L. Korn, and D. Wartenberg. (1995). Correlated Inputs in Quantitative Risk Assessment: The Effects of Distributional Shape. Risk Analysis 15(2), 215–219.
Hamed, M., and P.B. Bedient. (1997). On the Effect of Probability Distributions of Input Variables in Public Health Risk Assessment. Risk Analysis 17(1), 97–105.
Iman, R.L., and W.J. Conover. (1980). Small Sample Sensitivity Analysis Techniques for Computer Models, With an Application to Risk Assessment (with Comments). Communications in Statistics–Volume A, Theory and Methods, 9(17), 1749–1874.
Iman, R.L., and W.J. Conover. (1982). A Distribution-Free Approach to Inducing Rank Correlation Among Input Variables. Communications in Statistics–Volume B, Simulation and Computation, 11(3), 311–334.
Iman, R.L., and J.M. Davenport. (1982). Rank Correlation Plots For Use With Correlated Input Variables. Communications in Statistics–Volume B, Simulation and Computation, 11(3), 335–360.
Iman, R.L., and J.C. Helton. (1988). An Investigation of Uncertainty and Sensitivity Analysis Techniques for Computer Models. Risk Analysis 8(1), 71–90.
Iman, R.L. and J.C. Helton. (1991). The Repeatability of Uncertainty and Sensitivity Analyses for Complex Probabilistic Risk Assessments. Risk Analysis 11(4), 591–606.
McKay, M.D., R.J. Beckman., and W.J. Conover. (1979). A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output From a Computer Code. Technometrics 21(2), 239–245.
Millard, S.P. (2013). EnvStats: an R Package for Environmental Statistics. Springer, New York. https://link.springer.com/book/10.1007/978-1-4614-8456-1.
Smith, A.E., P.B. Ryan, and J.S. Evans. (1992). The Effect of Neglecting Correlations When Propagating Uncertainty and Estimating the Population Distribution of Risk. Risk Analysis 12(4), 467–474.
Thompson, K.M., D.E. Burmaster, and E.A.C. Crouch. (1992). Monte Carlo Techniques for Quantitative Uncertainty Analysis in Public Health Risk Assessments. Risk Analysis 12(1), 53–63.
Vose, D. (2008). Risk Analysis: A Quantitative Guide. Third Edition. John Wiley & Sons, West Sussex, UK, 752 pp.
Monte Carlo simulation and risk assessment often involve looking at the distribution or characteristics of the distribution of some outcome variable that depends upon several input variables (see Equation (1) above). Usually these input variables can be considered random variables. An important part of both sensitivity analysis and uncertainty analysis involves looking at how the distribution of the outcome variable changes with changing assumptions on the input variables. One important assumption is the correlation between the input random variables.
Often, the input random variables are assumed to be independent when in fact they are know to be correlated (Smith et al., 1992; Bukowski et al., 1995). It is therefore important to assess the effect of the assumption of independence on the distribution of the outcome variable. One way to assess the effect of this assumption is to run the Monte Carlo simulation assuming independence and then also run it assuming certain forms of correlations among the input variables.
Iman and Davenport (1982) present a series of scatterplots showing “typical”
scatterplots with various distributions on the \(x\)- and \(y\)-axes and
various assumed rank correlations. These plots are meant to aid in developing
reasonable estimates of rank correlation between input variables. These plots can
easily be produced using the simulateMvMatrix
and plot
functions.
# Generate 5 observations from a standard bivariate normal distribution
# with a rank correlation matrix (approximately) equal to the 2 x 2
# identity matrix, using simple random sampling for each
# marginal distribution.
simulateMvMatrix(5, seed = 47)
#> Var.1 Var.2
#> [1,] -0.98548216 -1.82822917
#> [2,] -1.46575030 -0.92245624
#> [3,] -1.08573747 0.49382018
#> [4,] -0.25204590 0.03960243
#> [5,] 0.01513086 0.09147291
# Var.1 Var.2
#[1,] 0.01513086 0.03960243
#[2,] -1.08573747 0.09147291
#[3,] -0.98548216 0.49382018
#[4,] -0.25204590 -0.92245624
#[5,] -1.46575030 -1.82822917
#==========
# Look at the observed rank correlation matrix for 100 observations
# from a standard bivariate normal distribution with a rank correlation matrix
# (approximately) equal to the 2 x 2 identity matrix. Compare this observed
# rank correlation matrix with the observed rank correlation matrix based on
# generating two independent sets of standard normal random numbers.
# Note that the cross-correlation is closer to 0 for the matrix created with
# simulateMvMatrix.
cor(simulateMvMatrix(100, seed = 47), method = "spearman")
#> Var.1 Var.2
#> Var.1 1.000000000 0.002472247
#> Var.2 0.002472247 1.000000000
# Var.1 Var.2
#Var.1 1.000000000 -0.005976598
#Var.2 -0.005976598 1.000000000
cor(matrix(simulateVector(200, seed = 47), 100 , 2), method = "spearman")
#> [,1] [,2]
#> [1,] 1.00000000 -0.05374137
#> [2,] -0.05374137 1.00000000
# [,1] [,2]
#[1,] 1.00000000 -0.05374137
#[2,] -0.05374137 1.00000000
#==========
# Generate 1000 observations from a bivariate distribution, where the first
# distribution is a normal distribution with parameters mean=10 and sd=2,
# the second distribution is a lognormal distribution with parameters
# mean=10 and cv=1, and the desired rank correlation between the two
# distributions is 0.8. Look at the observed rank correlation matrix, and
# plot the results.
mat <- simulateMvMatrix(1000,
distributions = c(N.10.2 = "norm", LN.10.1 = "lnormAlt"),
param.list = list(N.10.2 = list(mean=10, sd=2),
LN.10.1 = list(mean=10, cv=1)),
cor.mat = matrix(c(1, .8, .8, 1), 2, 2), seed = 47)
round(cor(mat, method = "spearman"), 2)
#> N.10.2 LN.10.1
#> N.10.2 1.00 0.79
#> LN.10.1 0.79 1.00
# N.10.2 LN.10.1
#N.10.2 1.00 0.78
#LN.10.1 0.78 1.00
dev.new()
plot(mat, xlab = "Observations from N(10, 2)",
ylab = "Observations from LN(mean=10, cv=1)",
main = "Lognormal vs. Normal Deviates with Rank Correlation 0.8")
#----------
# Repeat the last example, but use Latin Hypercube sampling for both
# distributions. Note the wider range on the y-axis.
mat.LHS <- simulateMvMatrix(1000,
distributions = c(N.10.2 = "norm", LN.10.1 = "lnormAlt"),
param.list = list(N.10.2 = list(mean=10, sd=2),
LN.10.1 = list(mean=10, cv=1)),
cor.mat = matrix(c(1, .8, .8, 1), 2, 2),
sample.method = "LHS", seed = 298)
round(cor(mat.LHS, method = "spearman"), 2)
#> N.10.2 LN.10.1
#> N.10.2 1.00 0.79
#> LN.10.1 0.79 1.00
# N.10.2 LN.10.1
#N.10.2 1.00 0.79
#LN.10.1 0.79 1.00
dev.new()
plot(mat.LHS, xlab = "Observations from N(10, 2)",
ylab = "Observations from LN(mean=10, cv=1)",
main = paste("Lognormal vs. Normal Deviates with Rank Correlation 0.8",
"(Latin Hypercube Sampling)", sep = "\n"))
#==========
# Generate 1000 observations from a multivariate distribution, where the
# first distribution is a normal distribution with parameters
# mean=10 and sd=2, the second distribution is a lognormal distribution
# with parameters mean=10 and cv=1, the third distribution is a beta
# distribution with parameters shape1=2 and shape2=3, and the fourth
# distribution is an empirical distribution of 100 observations that
# we'll generate from a Pareto distribution with parameters
# location=10 and shape=2. Set the desired rank correlation matrix to:
cor.mat <- matrix(c(1, .8, 0, .5, .8, 1, 0, .7,
0, 0, 1, .2, .5, .7, .2, 1), 4, 4)
cor.mat
#> [,1] [,2] [,3] [,4]
#> [1,] 1.0 0.8 0.0 0.5
#> [2,] 0.8 1.0 0.0 0.7
#> [3,] 0.0 0.0 1.0 0.2
#> [4,] 0.5 0.7 0.2 1.0
# [,1] [,2] [,3] [,4]
#[1,] 1.0 0.8 0.0 0.5
#[2,] 0.8 1.0 0.0 0.7
#[3,] 0.0 0.0 1.0 0.2
#[4,] 0.5 0.7 0.2 1.0
# Use Latin Hypercube sampling for each variable, look at the observed
# rank correlation matrix, and plot the results.
pareto.rns <- simulateVector(100, "pareto",
list(location = 10, shape = 2), sample.method = "LHS",
seed = 56)
mat <- simulateMvMatrix(1000,
distributions = c(Normal = "norm", Lognormal = "lnormAlt",
Beta = "beta", Empirical = "emp"),
param.list = list(Normal = list(mean=10, sd=2),
Lognormal = list(mean=10, cv=1),
Beta = list(shape1 = 2, shape2 = 3),
Empirical = list(obs = pareto.rns)),
cor.mat = cor.mat, seed = 47, sample.method = "LHS")
round(cor(mat, method = "spearman"), 2)
#> Normal Lognormal Beta Empirical
#> Normal 1.00 0.79 0.00 0.48
#> Lognormal 0.79 1.00 0.01 0.68
#> Beta 0.00 0.01 1.00 0.21
#> Empirical 0.48 0.68 0.21 1.00
# Normal Lognormal Beta Empirical
#Normal 1.00 0.78 -0.01 0.47
#Lognormal 0.78 1.00 -0.01 0.67
#Beta -0.01 -0.01 1.00 0.19
#Empirical 0.47 0.67 0.19 1.00
dev.new()
pairs(mat)
#==========
# Clean up
#---------
rm(mat, mat.LHS, pareto.rns)
graphics.off()