All variables are generated via the appropriate transformation of standard normal variables, as described below. See the documentation for corrvar and corrvar2 for more details about all inputs. The parameter inputs should be checked first using validpar. Summaries of simulated variables can be obtained using summary_var. Some code has been modified from the SimMultiCorrData package (Fialkowski 2018).

1) Continuous Variables: Continuous variables are simulated using either Fleishman’s (1978) third-order (method = “Fleishman”) or Headrick’s (2002) fifth-order (method = “Polynomial”) power method transformation (PMT). This is a computationally efficient algorithm that simulates continuous distributions through the method of moments. It works by matching standardized cumulants: the first four (mean, variance, skew, and standardized kurtosis) for Fleishman’s third-order method, or the first six (mean, variance, skew, standardized kurtosis, and standardized fifth and sixth cumulants) for Headrick’s fifth-order method. The transformation is expressed as follows:
\[\begin{equation} Y = c_{0} + c_{1} * Z + c_{2} * Z^2 + c_{3} * Z^3 + c_{4} * Z^4 + c_{5} * Z^5,\ Z \sim iid\ N(0,1), \tag{1} \end{equation}\]

where \(c_{4}\) and \(c_{5}\) both equal \(0\) for Fleishman’s method. The real constants are calculated by SimMultiCorrData::find_constants, which solves the system of equations given in SimMultiCorrData::fleish or SimMultiCorrData::poly. All variables are simulated with mean \(0\) and variance \(1\), and then transformed to the specified mean and variance at the end.
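
As a small illustration, the constants for a known distribution can be found and applied to standard normal variables. The following is a minimal sketch, assuming the SimMultiCorrData package is installed, that find_constants returns its solution in a constants component, and using the standardized cumulants of an exponential(1) distribution (skew \(2\), standardized kurtosis \(6\), fifth and sixth standardized cumulants \(24\) and \(120\)):

```r
library(SimMultiCorrData)

# Solve the fifth-order system for the power method constants c0, ..., c5.
con <- find_constants(method = "Polynomial", skews = 2, skurts = 6,
                      fifths = 24, sixths = 120)
cons <- con$constants

# Apply equation (1) to iid standard normals; the result has mean 0 and
# variance 1, and is then shifted/scaled to the target mean and standard
# deviation (both 1 for an exponential(1) distribution).
z <- rnorm(10000)
y <- cons[1] + cons[2] * z + cons[3] * z^2 + cons[4] * z^3 +
     cons[5] * z^4 + cons[6] * z^5
y <- 1 + 1 * y
```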

Continuous Mixture Variables: Mixture distributions describe random variables that are drawn from more than one component distribution. For a random variable \(Y\) from a finite continuous mixture distribution with \(k\) components, the probability density function (PDF) can be described by:
\[\begin{equation} h_{Y}(y) = \sum_{i=1}^{k} \pi_{i} f_{Y_{i}}(y),\ \ \sum_{i=1}^{k} \pi_{i} = 1. \tag{2} \end{equation}\]

The \(\pi_{i}\) are mixing parameters which determine the weight of each component distribution \(f_{Y_{i}}(y)\) in the overall probability distribution. As long as each component has a valid PDF, the overall distribution \(h_{Y}(y)\) has a valid PDF. The main assumption is statistical independence between the process of randomly selecting the component distribution and the distributions themselves: Assume there is a random selection process that first generates the numbers \(1,\ ...,\ k\) with probabilities \(\pi_{1},\ ...,\ \pi_{k}\). After selecting number \(i\), where \(1 \leq i \leq k\), a random variable \(y\) is drawn from component distribution \(f_{Y_{i}}(y)\).

Mixture distributions provide a useful way to describe heterogeneity in a population, especially when an outcome is a composite response from multiple sources. They may also be used to model distributions with outliers. For example, the contaminated normal outlier distribution arises when a proportion of measurement errors, which are usually modeled as normal variables with zero mean and common variance \(\sigma^2\), have larger variability. These distributions can be thought of as mixture distributions with contamination percentage \(\pi \in (0,\ 1)\), so that the PDF of \(Y\) may be given by:
\[\begin{equation} h_{Y}(y) = \pi \phi(0,\ 9\sigma^2)(y) + (1-\pi) \phi(0,\ \sigma^2)(y),\ \ -\infty < y < \infty. \tag{3} \end{equation}\]

Here, \(\phi(\mu,\ \sigma^2)(y)\) is the PDF of the normal distribution (Davenport, Bezder, and Hathaway 1988; Schork, Allison, and Thiel 1996; Everitt 1996; Pearson 2011).
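
For intuition, equation (3) can be simulated directly in base R by first selecting the component and then drawing from it; a minimal sketch with \(\pi = 0.1\) and \(\sigma = 1\):

```r
n <- 100000
pi_c <- 0.1      # contamination percentage
sigma <- 1

# Select the component, then draw from the selected normal distribution.
contaminated <- rbinom(n, 1, pi_c) == 1
y <- ifelse(contaminated,
            rnorm(n, 0, 3 * sigma),   # variance 9 * sigma^2
            rnorm(n, 0, sigma))       # variance sigma^2
var(y)   # approximately 0.1 * 9 + 0.9 * 1 = 1.8
```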

Continuous mixture variables are generated at the component level. Each component variable is created using the PMT, and the mixture variable is generated from these based on a random multinomial variable described by the mixing probabilities. Correlations are also controlled at the component level. Users specify the target pairwise correlations between the components across continuous mixture variables and between components and other types of variables.

The function rho_M1M2 approximates the expected correlation between two mixture variables \(M1\) and \(M2\) based on the component distributions and the correlations between components. The function rho_M1Y approximates the expected correlation between a mixture variable \(M1\) and another random variable \(Y\) that may have an ordinal, continuous, or count distribution. If the user desires a specific correlation between a mixture variable \(M1\) and another simulated variable \(M2\) or \(Y\), various component-level correlations can be tried in these two functions to see if the desired correlation can be achieved. See the Expected Cumulants and Correlations for Continuous Mixture Variables vignette for more details on continuous mixture variables.
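
For intuition, under the independence assumption above (and assuming the selection indicators of the two mixtures are independent of each other and of all component variables), the covariance between \(M1\) and \(M2\) reduces to a weighted sum of the component covariances. If \(M1\) has \(k_1\) components with mixing probabilities \(\pi_{1i}\) and standard deviations \(\sigma_{1i}\), \(M2\) has \(k_2\) components with \(\pi_{2j}\) and \(\sigma_{2j}\), and \(\rho_{ij}\) is the correlation between components \(i\) and \(j\), this gives the sketch:

\[\rho_{M1M2} \approx \frac{\sum_{i=1}^{k_1} \sum_{j=1}^{k_2} \pi_{1i}\ \pi_{2j}\ \rho_{ij}\ \sigma_{1i}\ \sigma_{2j}}{\sigma_{M1}\ \sigma_{M2}}.\]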

  1. The required parameters for simulating continuous variables include: skewness, standardized kurtosis (kurtosis - 3), and standardized fifth and sixth cumulants (for the fifth-order method). If the goal is to simulate a theoretical distribution (e.g., Gamma, Beta, or Logistic), these values can be obtained using SimMultiCorrData::calc_theory. If the goal is to mimic an empirical data set, these values can be found using SimMultiCorrData::calc_moments (using the method of moments) or SimMultiCorrData::calc_fisherk (using Fisher’s k-statistics). Due to the nature of the integration involved in calc_theory, the results are approximations, so the user may need to round the values before using them as inputs (e.g., skews = round(skews, 8)), for example, to ensure that the skew is exactly \(0\) for symmetric distributions. Greater accuracy can be achieved by increasing the number of subdivisions (sub) used in the integration process. A sketch combining this step with the sixth cumulant correction of item 4 follows this list.

  2. For mixture variables, the parameters are specified at the component level by the inputs mix_skews, mix_skurts, mix_fifths, mix_sixths, and mix_Six. The mixing probabilities, means and standard deviations of the component variables are given by mix_pis, mix_mus and mix_sigmas.

  3. The means and variances of all continuous variables are specified by means and vars. These are at the variable level, i.e., they refer to the continuous non-mixture and mixture variables themselves. The function calc_mixmoments calculates the expected mean, standard deviation, and standardized cumulants for mixture variables based on the component distributions.

  4. For some sets of cumulants, it is either not possible to find power method constants or the calculated constants do not generate valid power method PDFs. In these situations, adding a value to the sixth cumulant may provide solutions (see find_constants). When using Headrick’s fifth-order approximation, if simulation results indicate that a continuous variable does not generate a valid PDF, the user can try find_constants with various sixth cumulant correction vectors to determine if a valid PDF can be found. These sixth cumulant corrections are specified in the simulation functions using Six or mix_Six.

  5. Choice of Fleishman’s or Headrick’s Method: Using the fifth-order approximation allows additional control over the fifth and sixth moments of the generated distribution, improving accuracy. In addition, the range of feasible standardized kurtosis (\(\gamma_{2}\)) values, given skew (\(\gamma_{1}\)) and standardized fifth (\(\gamma_{3}\)) and sixth (\(\gamma_{4}\)) cumulants, is larger than with the third-order approximation. For example, Fleishman’s method cannot be used to generate a non-normal distribution with a ratio of \(\gamma_{1}^2/\gamma_{2} > 9/14\) (Headrick and Kowalchuk 2007). This eliminates the Chi-squared family of distributions, which has a constant ratio of \(\gamma_{1}^2/\gamma_{2} = 2/3\). The fifth-order method also generates more distributions with valid PDFs. However, if the fifth and sixth cumulants do not exist, the Fleishman approximation should be used.
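
Below is a sketch combining items 1 and 4, assuming the SimMultiCorrData package is installed, that calc_theory returns a named vector (mean, sd, skew, kurtosis, fifth, sixth), and that find_constants reports PDF validity in its valid component; the Six values are arbitrary trial corrections:

```r
library(SimMultiCorrData)

# Standardized cumulants for a Logistic(0, 1) distribution; rounding keeps
# the skew of this symmetric distribution exactly 0.
stcum <- calc_theory(Dist = "Logistic", params = c(0, 1))
skews <- round(stcum["skew"], 8)

# If no valid power method PDF is found, try sixth cumulant corrections.
con <- find_constants(method = "Polynomial", skews = skews,
                      skurts = stcum["kurtosis"], fifths = stcum["fifth"],
                      sixths = stcum["sixth"], Six = seq(0.1, 2, 0.1))
con$valid   # indicates whether the constants generate a valid PDF
```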

2) Ordinal Variables: Ordinal variables (\(r \ge 2\) categories) are generated by discretizing the standard normal variables at quantiles. These quantiles are determined by evaluating the inverse standard normal CDF at the cumulative probabilities defined by each variable’s marginal distribution. If the support for variable \({Y}_{i}\) is not provided, the default is to use \(\{1,\ 2,\ ...,\ r_{i}\}\).
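
For example, an ordinal variable with \(r = 3\) categories, cumulative probabilities \((0.3,\ 0.75)\), and default support \(\{1, 2, 3\}\) can be produced in base R as follows:

```r
z <- rnorm(10000)

# Quantiles of the inverse standard normal CDF at the cumulative probabilities.
breaks <- c(-Inf, qnorm(c(0.3, 0.75)), Inf)
y <- as.integer(cut(z, breaks))   # support values 1, 2, 3
prop.table(table(y))              # approximately 0.30, 0.45, 0.25
```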

  1. The required parameters for simulating ordinal variables include: the cumulative marginal probabilities and support values (if desired). The probabilities should be combined into a list of length equal to the number of ordinal variables. The \(i^{th}\) element is a vector of the cumulative probabilities defining the marginal distribution of the \(i^{th}\) variable. If the variable can take \(r\) values, the vector will contain \(r - 1\) probabilities (the \(r^{th}\) is assumed to be \(1\)).

  2. For binary variables, the user-supplied probability should be the probability of the \(1^{st}\) (lower) support value. This would ordinarily be considered the probability of failure (\(q\)), while the probability of the \(2^{nd}\) (upper) support value would be considered the probability of success (\(p = 1 - q\)). The support values should be combined into a separate list. The \(i^{th}\) element is a vector containing the \(r\) ordered support values. If not provided, the default is for the \(i^{th}\) element to be the vector \(1, ..., r\).

3) Poisson and Negative Binomial Variables: Count variables are generated using the inverse CDF method. Evaluating the standard normal CDF at a standard normal variable yields a Uniform(0, 1) variable, i.e., \(\Phi(Z) \sim U(0, 1)\). The appropriate quantile function \(F_{Y}^{-1}\) is applied to this uniform variable with the designated parameters to generate the count variable: \(Y = F_{Y}^{-1}(\Phi(Z))\).
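
A minimal base-R sketch of this method for Poisson and NB variables:

```r
z <- rnorm(10000)
u <- pnorm(z)                               # Phi(Z) ~ Uniform(0, 1)
y_pois <- qpois(u, lambda = 5)              # Poisson(5)
y_nb   <- qnbinom(u, size = 2, prob = 0.4)  # NB(size = 2, p = 0.4)
```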

Zero-inflated distributions: A zero-inflated (ZI) random variable \(Y_{ZI}\) is a mixture of a degenerate distribution having the point mass at \(0\) and another distribution \(Y\) that contributes both \(0\) and non-\(0\) values. If the mixing probability is \(\phi\), then:
\[\begin{equation} P(Y_{ZI} = 0) = \phi + (1 - \phi) * P(Y = 0),\ \ 0 < \phi < 1. \tag{4} \end{equation}\]

Therefore, \(\phi\) is the probability of a structural zero, and setting \(\phi = 0\) reduces \(Y_{ZI}\) to the variable \(Y\). In SimCorrMix, \(Y\) can have either a Poisson distribution (\(Y_P\)) or a NB distribution (\(Y_{NB}\)) (Ismail and Zamani 2013; Lambert 1992; Zhang, Mallick, and Yi 2016).

  1. Zero-inflated Poisson (ZIP): The model for \(Y_{ZIP} \sim ZIP(\lambda,\ \phi),\ \lambda>0,\ 0 < \phi < 1\) is:
\[\begin{equation} \begin{split} P(Y_{ZIP} = 0) &= \phi + (1 - \phi) * exp(-\lambda) \\ P(Y_{ZIP} = y) &= (1 - \phi) * exp(-\lambda) * \frac{\lambda^y}{y!},\ \ y = 1, 2, ... \end{split} \tag{5} \end{equation}\]

The mean of \(Y_{ZIP}\) is \((1 - \phi) * \lambda\), and the variance is \((1 - \phi) * \lambda * (1 + \phi * \lambda)\) (see VGAM::dzipois).

The zero-deflated Poisson distribution may be obtained by setting \(\phi \in (-1/(exp(\lambda) - 1),\ 0)\), so that the probability of a zero count is less than the nominal Poisson value. In this case, \(\phi\) no longer represents a probability.

When \(\phi = -1/(exp(\lambda) - 1)\), the random variable has a positive-Poisson distribution (see VGAM::dpospois). The probability of a zero response is \(0\), and the other probabilities are scaled to sum to \(1\).

The parameters \(\lambda\) and \(\phi\) are specified through the inputs lam and p_zip. Setting p_zip = 0 (default setting) generates a regular Poisson variable. Parameters for zero-inflated Poisson variables should follow those for regular Poisson variables in all function inputs. For Correlation Method 2, a vector of total cumulative probability truncation values should be given in pois_eps. These values represent the amount removed from the total cumulative probability when creating finite supports. The value may differ by variable, but the default value is \(0.0001\) (suggested by Barbiero and Ferrari (2015)). For example, pois_eps = 0.0001 means that the support values removed have a total probability of \(0.0001\) of occurring in the distribution of that variable. The effect is to remove improbable values, which may be of concern if the user wishes to replicate a distribution with outliers.
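
The same inverse CDF route applies to ZIP variables through VGAM’s quantile function; a sketch, assuming qzipois takes the structural zero probability as pstr0:

```r
library(VGAM)

z <- rnorm(100000)
y_zip <- qzipois(pnorm(z), lambda = 5, pstr0 = 0.1)
mean(y_zip)        # approximately (1 - 0.1) * 5 = 4.5
mean(y_zip == 0)   # approximately 0.1 + 0.9 * exp(-5) = 0.106
```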

  2. Zero-inflated Negative Binomial (ZINB): The model for \(Y_{ZINB} \sim ZINB(size,\ p,\ \phi),\ size>0,\ 0<p\leq1,\ \ 0 < \phi < 1\) is:
\[\begin{equation} \begin{split} P(Y_{ZINB} = 0) &= \phi + (1 - \phi) * p^{size} \\ P(Y_{ZINB} = y) &= (1 - \phi) * \frac{\Gamma(y + size)}{\Gamma(size) * y!} * p^{size} * (1 - p)^{y},\ \ y = 1, 2, ... \end{split} \tag{6} \end{equation}\]

In this formulation, the Negative Binomial component \(Y_{NB}\) represents the number of failures which occur in a sequence of independent Bernoulli trials before a target number of successes (\(size\)) is reached. The probability of success in each trial is \(p\). \(Y_{NB}\) may also be parametrized by the mean \(\mu\) and dispersion parameter \(size\) so that \(p = size/(size + \mu)\) or \(\mu = size * (1-p)/p\) (see stats::dnbinom).

The mean of \(Y_{ZINB}\) is \((1 - \phi) * \mu\), and the variance is \((1 - \phi) * \mu * (1 + \mu * (\phi + 1/size))\) (see VGAM::dzinegbin).

The zero-deflated NB distribution may be obtained by setting \(\phi \in (-p^{size}/(1 - p^{size}),\ 0)\), so that the probability of a zero count is less than the nominal NB value. Again, in this case, \(\phi\) no longer represents a probability.

The positive-NB distribution can be obtained with \(\phi = -p^{size}/(1 - p^{size})\) (see VGAM::dposnegbin). The probability of a zero response is \(0\), and the other probabilities are scaled to sum to \(1\).

The parameters \(size\), \(p\), \(\mu\), and \(\phi\) are specified through the inputs size, prob, mu, and p_zinb. Either prob or mu should be given for all NB and ZINB variables. Setting p_zinb = 0 (default setting) generates a regular NB variable. Parameters for zero-inflated NB variables should follow those for regular NB variables in all function inputs. For Correlation Method 2, a vector of total cumulative probability truncation values should be given in nb_eps. The default value in the simulation functions is \(0.0001\).
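
An analogous sketch for ZINB variables, assuming VGAM’s qzinegbin, whose mean parameter is named munb rather than mu:

```r
library(VGAM)

z <- rnorm(100000)
y_zinb <- qzinegbin(pnorm(z), size = 2, munb = 4, pstr0 = 0.1)
mean(y_zinb)   # approximately (1 - 0.1) * 4 = 3.6
```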

The distribution functions are taken from the VGAM package (Yee 2018).

Error Loop

The error loop may be used to correct the final pairwise correlation of simulated variables to be within a user-specified precision value (epsilon) of the target correlation. It updates the pairwise intermediate MVN correlation iteratively in a loop until either the maximum error is less than epsilon or the number of iterations exceeds the maximum number set by the user (maxit). Some code has been modified from the SimMultiCorrData package (Fialkowski 2018). Below is a description of the algorithm, which is executed for each variable pair across all k = k_cat + k_cont + k_comp + k_pois + k_nb variables (k_comp is the number of component distributions for the continuous mixture variables). rho_calc is the calculated final correlation matrix, updated in each iteration; rho is the target final correlation; Sigmaold is the intermediate correlation from the previous iteration; rho_old is the value of rho_calc from the previous iteration; it is the iteration number; q is the row number; r is the column number; and maxerr is the absolute pairwise correlation error between rho_calc[q, r] and rho[q, r].

  1. If rho[q, r] = 0, set Sigma[q, r] = 0.

  2. While maxerr is greater than epsilon and it is less than maxit:
      1. Calculate a new intermediate correlation:
          1. If rho[q, r] * (rho[q, r]/rho_calc[q, r]) <= -1, then: Sigma[q, r] = Sigmaold[q, r] * (1 + 0.1 * (1 - Sigmaold[q, r]) * -sign(rho[q, r] - rho_old[q, r])).
          2. If rho[q, r] * (rho[q, r]/rho_calc[q, r]) >= 1, then: Sigma[q, r] = Sigmaold[q, r] * (1 + 0.1 * (1 - Sigmaold[q, r]) * sign(rho[q, r] - rho_old[q, r])).
          3. Else, Sigma[q, r] = Sigmaold[q, r] * (rho[q, r]/rho_calc[q, r]).
      2. Eigenvalue decomposition is done on the k x k Sigma correlation matrix. If Sigma is not positive-definite, the negative eigenvalues are replaced with \(0\).
      3. Generate new normal variables \(X_{i}\), \(1 \leq i \leq\) k, with correlation matrix Sigma.
      4. Generate new \(Y_{i}\), \(1 \leq i \leq\) k, using the appropriate transformations on the \(X_{i}\).
      5. Calculate the correlation matrix rho_calc of all the \(Y\) variables.
      6. Set Sigmaold = Sigma and increase the iteration counter it by 1.

  3. Store the number of iterations in the matrix niter.
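
Below is a self-contained toy version of this idea for a single Poisson-Poisson pair, using only the multiplicative update of step 2.1.3 (a simple clamp stands in for the damped updates of steps 2.1.1 and 2.1.2 and the eigenvalue correction); it is a sketch, not SimCorrMix’s implementation:

```r
set.seed(1)
n <- 100000; lam <- c(3, 5)
rho_target <- 0.5; epsilon <- 0.001; maxit <- 10

# Impose an intermediate correlation on standard normals, transform to
# Poisson via the inverse CDF method, and return the final correlation.
simulate_pair <- function(sigma12) {
  z1 <- rnorm(n)
  z2 <- sigma12 * z1 + sqrt(1 - sigma12^2) * rnorm(n)
  cor(qpois(pnorm(z1), lam[1]), qpois(pnorm(z2), lam[2]))
}

Sigma12 <- rho_target                  # initial intermediate correlation
rho_calc <- simulate_pair(Sigma12)
it <- 0
while (abs(rho_calc - rho_target) > epsilon && it < maxit) {
  Sigma12 <- Sigma12 * (rho_target / rho_calc)   # multiplicative update
  Sigma12 <- max(min(Sigma12, 1), -1)            # keep within [-1, 1]
  rho_calc <- simulate_pair(Sigma12)
  it <- it + 1
}
c(Sigma12, rho_calc, it)
```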

The error loop does increase simulation time, but it can improve accuracy in most situations. It may be unsuccessful for correlation structures that are more difficult to achieve. Some cases utilizing negative correlations will have results similar to those without the error loop. Trying different values of epsilon (e.g., \(0.01\) instead of \(0.001\)) can help in these cases. For a given row (q = 1, …, nrow(Sigma)), the error loop progresses through the intermediate correlation matrix Sigma by increasing column index (r = 2, …, ncol(Sigma), r not equal to q). Each time a new pairwise correlation Sigma[q, r] is calculated, the new Sigma matrix is imposed on the intermediate normal variables X, the appropriate transformations are applied to get Y, and the final correlation matrix rho_calc is found. Even though the intermediate correlations from previous q, r combinations are not changed, the final correlations are affected. The fewer iterations used for a given q, r combination, the less rho_calc[q, r] changes. Since larger values of epsilon require fewer iterations, using epsilon = 0.01 may give better results than epsilon = 0.001.

Correlation Bounds

Each simulation pathway in SimCorrMix has its own function to calculate correlation boundaries, given specified distributional parameters, and to check if a target correlation matrix rho falls within these boundaries. Correlation method 1 uses validcorr and correlation method 2 uses validcorr2. The parameter inputs should be checked first using validpar. Some code has been modified from the SimMultiCorrData package (Fialkowski 2018). The distribution functions for Poisson and Negative Binomial variables are taken from the VGAM package (Yee 2018).

Some general methods for determining correlation boundaries:

The Generate, Sort, and Correlate (GSC) Algorithm:

The GSC algorithm is a flexible method for determining empirical correlation bounds when the theoretical bounds are unknown. The steps are as follows:

  1. Generate independent random samples from the desired distributions using a large number of observations (e.g., \(n = 100,000\)).

  2. Lower Bound: Sort the two variables in opposite directions (i.e., one increasing and one decreasing) and find the sample correlation.

  3. Upper Bound: Sort the two variables in the same direction and find the sample correlation.
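
A base-R sketch of the GSC algorithm for a Gamma - Poisson pair:

```r
n  <- 100000
y1 <- rgamma(n, shape = 4, rate = 1)
y2 <- rpois(n, lambda = 3)

lower <- cor(sort(y1), sort(y2, decreasing = TRUE))  # opposite directions
upper <- cor(sort(y1), sort(y2))                     # same direction
```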

Demirtas and Hedeker (2011) showed that the empirical bounds computed from the GSC method are similar to the theoretical bounds (when they are known).

The Frechet-Hoeffding Correlation Bounds:

Suppose two random variables \(Y_1\) and \(Y_2\) have cumulative distribution functions given by \(F_1\) and \(F_2\). Let \(U\) be a Uniform(0, 1) random variable (the distribution of \(\Phi(Z)\) for standard normal \(Z\)). Then Hoeffding (1994) and Fréchet (1951) showed that bounds for the correlation between \(Y_1\) and \(Y_2\) are given by:
\[\begin{equation} \Big\{cor\left(F_1^{-1}(U), F_2^{-1}(1 - U)\right),\ cor\left(F_1^{-1}(U), F_2^{-1}(U)\right)\Big\}. \tag{7} \end{equation}\]
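
These bounds can be approximated empirically with a single uniform sample; a sketch for a Poisson(3) - Poisson(5) pair:

```r
u <- runif(100000)
lower <- cor(qpois(u, 3), qpois(1 - u, 5))   # F1^{-1}(U) vs F2^{-1}(1 - U)
upper <- cor(qpois(u, 3), qpois(u, 5))       # F1^{-1}(U) vs F2^{-1}(U)
```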

Methods Used in Both Pathways:

First, the calculations which are equivalent in the two pathways will be discussed by variable type.

Ordinal Variables:

  1. Binary pairs: The correlation bounds are determined as in Demirtas, Hedeker, and Mermelstein (2012), who used the method of Emrich and Piedmonte (1991). The joint distribution is determined by “borrowing” the moments of a multivariate normal distribution. For two binary variables \(Y_1\) and \(Y_2\), with success probabilities \(p_1\) and \(p_2\), the boundaries are given by:
    \[\begin{equation} \Big\{max\left(-\sqrt{(p_1p_2)/(q_1q_2)},\ -\sqrt{(q_1q_2)/(p_1p_2)}\right),\ \ \ min\left(\sqrt{(p_1q_2)/(q_1p_2)},\ \sqrt{(q_1p_2)/(p_1q_2)}\right)\Big\}, \tag{8} \end{equation}\]

    where \(q_1 = 1 - p_1\) and \(q_2 = 1 - p_2\); a numerical check of these bounds appears after this list.

  2. Binary-Ordinal or Ordinal-Ordinal pairs: Randomly generated variables with the given marginal distributions are used in the GSC algorithm to find the correlation bounds.
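
A numerical check of equation (8), using \(p_1 = 0.3\) and \(p_2 = 0.6\):

```r
p1 <- 0.3; q1 <- 1 - p1
p2 <- 0.6; q2 <- 1 - p2

lower <- max(-sqrt((p1 * p2) / (q1 * q2)), -sqrt((q1 * q2) / (p1 * p2)))
upper <- min(sqrt((p1 * q2) / (q1 * p2)), sqrt((q1 * p2) / (p1 * q2)))
c(lower, upper)   # approximately -0.802 and 0.535
```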

Continuous Variables: Continuous variables are randomly generated using constants from SimMultiCorrData::find_constants and a vector of sixth cumulant correction values (if provided). The GSC algorithm is used to find the lower and upper bounds.

Continuous - Ordinal Pairs: Randomly generated ordinal variables with the given marginal distributions and the previously generated continuous variables are used in the GSC algorithm to find the correlation bounds.

Correlation Method 1:

The GSC bounds are used for all variable pair types except Poisson and Negative Binomial variables. The Frechet-Hoeffding bounds are used for Poisson-Poisson, NB-NB, and Poisson-NB variable pairs.

Correlation Method 2:

In correlation method 2, count variables (regular or zero-inflated) are treated as “ordinal” by truncating their infinite supports. The maximum support values for the Poisson variables, given the cumulative probability truncation values (pois_eps), means (lam), and probabilities of structural zeros (p_zip for zero-inflated Poisson), are calculated using maxcount_support. The finite supports are used to determine marginal distributions for each Poisson variable. The maximum support values for the Negative Binomial variables, given the cumulative probability truncation values (nb_eps), sizes (size), success probabilities (prob) or means (mu), and probabilities of structural zeros (p_zinb for zero-inflated NB), are calculated using maxcount_support. The finite supports are used to determine marginal distributions for each NB variable.
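
The truncation idea can be sketched in base R for a Poisson(5) variable with pois_eps = 0.0001 (maxcount_support itself handles the zero-inflated cases and the exact bookkeeping):

```r
pois_eps <- 0.0001
lam <- 5

max_support <- qpois(1 - pois_eps, lambda = lam)   # largest retained value
marginal <- dpois(0:max_support, lambda = lam)     # finite marginal
sum(marginal)   # >= 1 - pois_eps
```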

The GSC bounds are used for all variable pair types.

References

Barbiero, A, and P A Ferrari. 2015. “Simulation of Correlated Poisson Variables.” Applied Stochastic Models in Business and Industry 31: 669–80. https://doi.org/10.1002/asmb.2072.

Davenport, J W, J C Bezder, and R J Hathaway. 1988. “Parameter Estimation for Finite Mixture Distributions.” Computers & Mathematics with Applications 15 (10): 819–28.

Demirtas, H, and D Hedeker. 2011. “A Practical Way for Computing Approximate Lower and Upper Correlation Bounds.” The American Statistician 65 (2): 104–9. https://doi.org/10.1198/tast.2011.10090.

Demirtas, H, D Hedeker, and R J Mermelstein. 2012. “Simulation of Massive Public Health Data by Power Polynomials.” Statistics in Medicine 31 (27): 3337–46. https://doi.org/10.1002/sim.5362.

Emrich, L J, and M R Piedmonte. 1991. “A Method for Generating High-Dimensional Multivariate Binary Variates.” The American Statistician 45: 302–4. https://doi.org/10.1080/00031305.1991.10475828.

Everitt, B S. 1996. “An Introduction to Finite Mixture Distributions.” Statistical Methods in Medical Research 5 (2): 107–27. https://doi.org/10.1177/096228029600500202.

Fialkowski, A C. 2018. SimMultiCorrData: Simulation of Correlated Data with Multiple Variable Types. https://CRAN.R-project.org/package=SimMultiCorrData.

Fleishman, A I. 1978. “A Method for Simulating Non-Normal Distributions.” Psychometrika 43: 521–32. https://doi.org/10.1007/BF02293811.

Fréchet, M. 1951. “Sur Les Tableaux de Corrélation Dont Les Marges Sont Données.” Annales de L’Université de Lyon. Section A: Sciences Mathématiques et Astronomie 9: 53–77.

Headrick, T C. 2002. “Fast Fifth-Order Polynomial Transforms for Generating Univariate and Multivariate Non-Normal Distributions.” Computational Statistics and Data Analysis 40 (4): 685–711. https://doi.org/10.1016/S0167-9473(02)00072-5.

Headrick, T C, and R K Kowalchuk. 2007. “The Power Method Transformation: Its Probability Density Function, Distribution Function, and Its Further Use for Fitting Data.” Journal of Statistical Computation and Simulation 77: 229–49. https://doi.org/10.1080/10629360600605065.

Hoeffding, W. 1994. “Scale-Invariant Correlation Theory.” In The Collected Works of Wassily Hoeffding, edited by N I Fisher and P K Sen, 57–107. Springer Series in Statistics (Perspectives in Statistics). New York: Springer-Verlag. https://doi.org/10.1007/978-1-4612-0865-5_4.

Ismail, N, and H Zamani. 2013. “Estimation of Claim Count Data Using Negative Binomial, Generalized Poisson, Zero-Inflated Negative Binomial and Zero-Inflated Generalized Poisson Regression Models.” Casualty Actuarial Society E-Forum 41 (20): 1–28.

Lambert, D. 1992. “Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing.” Technometrics 34 (1): 1–14.

Pearson, R K. 2011. Exploring Data in Engineering, the Sciences, and Medicine. New York: Oxford University Press.

Schork, N J, D B Allison, and B Thiel. 1996. “Mixture Distributions in Human Genetics Research.” Statistical Methods in Medical Research 5: 155–78. https://doi.org/10.1177/096228029600500204.

Yee, Thomas W. 2018. VGAM: Vector Generalized Linear and Additive Models. https://CRAN.R-project.org/package=VGAM.

Zhang, X, H Mallick, and N Yi. 2016. “Zero-Inflated Negative Binomial Regression for Differential Abundance Testing in Microbiome Studies.” Journal of Bioinformatics and Genomics 2 (2): 1–9. https://doi.org/10.18454/jbg.2016.2.2.1.