This tutorial shows how to use the **MADPop** package to
test for genetic differences between two populations, of which the
individuals of a same species contain a variable number of alleles.

`## Loading required package: MADPop`

`## Loading required package: rstan`

`## Loading required package: StanHeaders`

```
##
## rstan version 2.32.5 (Stan version 2.32.2)
```

```
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
## change `threads_per_chain` option:
## rstan_options(threads_per_chain = 1)
```

Our data consists of \(N = 215\) recordings of Major Histocompatibility Complex (MHC) genotypes of lake trout from \(K = 11\) lakes in Ontario, Canada. For each of the fish, between 1-4 alleles in the MHC genotype are recorded. This is partially because duplicate genes are undetectable by current instrumentation, and possibly because the fish possess a variable number of alleles of a given MHC gene.

Our dataset `fish215`

is included with
**MADPop**. A random sample from it looks like this:

```
## Lake A1 A2 A3 A4
## 68 Michipicoten r.3 r.4
## 167 Kingscote r.1 r.8
## 129 Dickey v.3 r.4
## 162 Kingscote w.5
## 43 Crystal r.4 t.7
## 14 Hogan x.4 r.4
```

The first column is the lake name (or population ID) for each sample,
the remaining four columns are for potentially recorded allele codes
(`A1`

-`A4`

). Here the code to identify a unique
allele is a small letter followed by a number, but it could have been
the sequence of integers \(1, 2,\ldots,
A\), which for the `fish215`

data is \(A = 57\) unique alleles.

It is relatively straightforward to import a CSV file into the format
above. An example of this is given along with our raw data in the
**extdata** directory of the local copy of the
**MADPop** package.

Suppose that we wish to compare two lakes, say Dickey and Simcoe. The
allele counts in these lakes are in the table below. It is a subset of
the full contingency table on all \(K =
11\) lakes, which is produced by the **MADPop**
function `UM.suff()`

:

```
popId <- c("Dickey", "Simcoe") # lakes to compare
Xsuff <- UM.suff(fish215) # summary statistics for dataset
ctab <- Xsuff$tab[popId,] # contingency table
ctab <- ctab[,colSums(ctab) > 0] # remove alleles with no counts
#ctab
rbind(ctab, Total = colSums(ctab))
```

```
## 1.20 1.4.5 1.4.9.45 1.5 10 20 20.54 27.28.29 3 4 4.10 4.11 4.20.27
## Dickey 1 1 0 2 0 2 1 1 0 0 1 0 1
## Simcoe 0 0 1 1 2 0 0 0 1 1 2 1 0
## Total 1 1 1 3 2 2 1 1 1 1 3 1 1
## 4.20.31 4.20.45 4.27 4.33 4.5 4.53.55 4.56 4.7 4.7.10 5 5.20 5.30 5.32
## Dickey 1 1 1 0 1 1 1 0 0 0 1 1 1
## Simcoe 0 0 0 1 1 0 0 3 1 1 0 0 0
## Total 1 1 1 1 2 1 1 3 1 1 1 1 1
## 5.7 7 9.11
## Dickey 1 0 0
## Simcoe 1 2 1
## Total 2 2 1
```

The unique allele identifiers are encoded as integers between \(1\) and \(A\) and separated by dots. The original
allele names are stored in `Xsuff$A`

, such that the genotype
of the first column `1.20`

is

`## [1] 1 20`

```
## A1 A20
## "r.1" "u.4"
```

There are \(C = 29\) genotype combinations observed in these two lakes, corresponding to each column of the table.

In the two-population problem we have \(K = 2\) lakes with \(N_1\) and \(N_2\) fish sampled from each. Let \(\boldsymbol Y_k = (Y_{k1}, \ldots Y_{kC})\) denote the counts for each genotype observed in lake \(k\), such that \(\sum_{i=1}^C Y_{ki} = N_k\). The sampling model for these data is \[ \boldsymbol Y_k \stackrel{\textrm{ind}}{\sim}\textrm{Multinomial}(N_k, \boldsymbol \rho_k), \quad k = 1,2, \] where \(\boldsymbol \rho_k = (\rho_{k1},...,\rho_{kC})\) are the population proportions of each genotype, and \(\sum_{i=1}^C \rho_{ki} = 1\).

Our objective is to test \[ \begin{split} H_0 &: \textrm{The two populations have the same genotype proportions} \\ & \phantom{: } \iff \boldsymbol \rho_1 = \boldsymbol \rho_2. \end{split} \] The classical test statistics for assessing \(H_0\) are Pearson’s Chi-Square statistic \(\mathcal X\) and the Likelihood Ratio statistic \(\Lambda\), \[ \mathcal X = \sum_{k=1}^2 \sum_{i=1}^C \frac{(N_k\hat\rho_i - Y_{ki})^2}{N_k\hat\rho_i}, \qquad \Lambda = 2 \sum_{k=1}^2 \sum_{i=1}^C Y_{ki} \log\left(\frac{Y_{ki}}{N_k\hat\rho_i}\right), \qquad \hat \rho_i = \frac{Y_{1i} + Y_{2i}}{N_1 + N_2}. \] Under \(H_0\), the asymptotic distribution of either of these test statistics \(T = \mathcal X\) or \(\Lambda\) is \(\chi^2_{(C-1)}\), such that the \(p\)-value \[ \textrm{p}_\textrm{v}= \textrm{Pr}(T > T_\textrm{obs} \mid H_0) \] for an observed value of \(T_\textrm{obs}\) can be estimated as follows:

```
# observed values of the test statistics
chi2.obs <- chi2.stat(ctab) # Pearson's chi^2
LRT.obs <- LRT.stat(ctab) # LR test statistic
T.obs <- c(chi2 = chi2.obs, LRT = LRT.obs)
# p-value with asymptotic calculation
C <- ncol(ctab)
pv.asy <- pchisq(q = T.obs, df = C-1, lower.tail = FALSE)
signif(pv.asy, 2)
```

```
## chi2 LRT
## 0.330 0.041
```

The Chi-Square and LR tests are asymptotically equivalent and so
should give roughly the same \(p\)-values. The huge discrepancy observed
here indicates that the sample sizes are too small for asymptotics to
kick in. A more reliable \(p\)-value
estimate can be obtained by the Bootstrap method, which in this case
consists of simulating \(M\) contigency
tables with \(\boldsymbol Y_k^{(m)}
\stackrel{\textrm{ind}}{\sim}\textrm{Multinomial}(N_k, \hat{\boldsymbol
\rho})\), \(m = 1, \ldots, M\),
where \(\hat{\boldsymbol \rho}\) is the
estimate of the common probability vector \(\boldsymbol \rho_1 = \boldsymbol \rho_2\)
under \(H_0\). For each contingency
table \((\boldsymbol Y_1^{(m)}, \boldsymbol
Y_2^{(m)})\), we calculate the test statistic \(T^{(m)}\), and the bootstrapped p-value is
defined as \[
\textrm{p}_\textrm{v}^{\textrm{boot}}= \frac 1 M \sum_{m=1}^M
\delta(T^{(m)} \ge T_\textrm{obs}).
\] \(\textrm{p}_\textrm{v}^{\textrm{boot}}\) is
calculated with **MADPop** as follows:

```
N1 <- sum(ctab[1,]) # size of first sample
N2 <- sum(ctab[2,]) # size of second sample
rho.hat <- colSums(ctab)/(N1+N2) # common probability vector
# bootstrap distribution of the test statistics
# set verbose = TRUE for progress output
system.time({
T.boot <- UM.eqtest(N1 = N1, N2 = N2, p0 = rho.hat, nreps = 1e4,
verbose = FALSE)
})
```

```
## user system elapsed
## 0.224 0.014 0.240
```

```
## chi2 LRT
## 0.0068 0.0059
```

Note that the bootstrap \(p\)-values for both tests are roughly the same and decisively reject \(H_0\), whereas the less reliable asymptotic \(p\)-values both failed to reject (at quite different significance levels).

Bootstrapping overcomes many deficiencies of the asymptotic \(p\)-value calculation. However,
bootstrapping has a tendency to reject \(H_0\) when sample sizes are small. To see
why this is, consider all columns of `ctab`

which have only
one genotype count between the two lakes:

```
itab1 <- colSums(ctab) == 1 # single count genotypes
cbind(ctab[,itab1],
Other = rowSums(ctab[,!itab1]),
Total = rowSums(ctab))
```

```
## 1.20 1.4.5 1.4.9.45 20.54 27.28.29 3 4 4.11 4.20.27 4.20.31 4.20.45 4.27
## Dickey 1 1 0 1 1 0 0 0 1 1 1 1
## Simcoe 0 0 1 0 0 1 1 1 0 0 0 0
## 4.33 4.53.55 4.56 4.7.10 5 5.20 5.30 5.32 9.11 Other Total
## Dickey 0 1 1 0 0 1 1 1 0 7 20
## Simcoe 1 0 0 1 1 0 0 0 1 12 20
```

There are \(c_1 = 21\) such columns,
accounting for \(\hat p_1 = 0.525\) of
the common genotype distribution under \(H_0\), as estimated from the two-lake
sample. For each of these columns, observing counts in one lake but not
the other provides evidence against \(H_0\). Moreover, under the estimated common
distribution \(\hat {\boldsymbol
\rho}\), it is very unlikely to have counts in only one of the
lakes for each of these \(c_1 = 21\)
genotypes. Therefore, the data appear to provide very strong evidence
against \(H_0\). However, it is not so
unlikely to have \(c_1 = 21\) one-count
genotypes if the true number of unique genotypes in these two lakes is
much larger than the observed value of \(C =
29\). With \(C = 29\) unique
genotypes in only \(N = N_1 + N_2 =
40\) fish samples, it is quite plausible that a new sample of
fish would yield several genotypes which are not present in the original
two-lake sample `ctab`

.

One way to obtain information about the unobserved genotypes in lakes
Dickey and Simcoe is to consider the genotypes in *all* \(K = 11\) Ontario lakes for which we have
collected data. A natural way to do this is through a hierarchical
model: \[
\begin{split}
\boldsymbol Y_k \mid \boldsymbol \rho_k &
\stackrel{\textrm{ind}}{\sim}\textrm{Multinomial}(N_k, \boldsymbol
\rho_k), \quad k = 1,\ldots,K \\
\boldsymbol \rho_k &
\stackrel{\textrm{iid}}{\sim}\textrm{Dirichlet}(\boldsymbol \alpha),
\qquad \boldsymbol \alpha= (\alpha_1, \ldots, \alpha_C).
\end{split}
\] That is, each population is allowed to have its own
probability vector \(\boldsymbol
\rho_k\). However, these probability vectors are drawn from a
common Dirichlet distribution. The common distribution specifies that
before any data is drawn, we have \[
E[\boldsymbol \rho] = \bar{\boldsymbol \alpha}, \qquad
\textrm{var}(\rho_i) = \frac{(\bar \alpha_i)(1-\bar \alpha_i)}{\alpha_0
+ 1},
\] where \(\alpha_0 = \sum_{i=1}^C
\alpha_i\) and \(\bar{\boldsymbol
\alpha} = \boldsymbol \alpha/ \alpha_0\). Moreover, the posterior
distribution of \(\boldsymbol \rho_k\)
given \(\boldsymbol \alpha\) and the
data is also Dirichlet: \[
\boldsymbol \rho_k \mid \boldsymbol
Y\stackrel{\textrm{ind}}{\sim}\textrm{Dirichlet}(\boldsymbol \alpha+
\boldsymbol Y_k).
\] In this sense, the posterior estimate of the proportion of
genotype \(i\) in population \(k\), \(\rho_{ki}\), can be non-zero even if \(Y*{ki} = 0\). In practice, \(\boldsymbol \alpha\) is estimated from
\(\boldsymbol Y\), the data from all
\(K\) populations (details in the
following section). The extent to which the genotypes observed in one
lake affect inference in the other lakes is determined by \(\alpha*0\) (the larger this value, the
larger the effect).

The likelihood function for this model corresponds to a Dirichlet-Multinomial distribution, \[ \begin{split} \mathcal L(\boldsymbol \alpha\mid \boldsymbol Y) = \prod_{k=1}^K p(\boldsymbol Y_k \mid \boldsymbol \alpha) & = \prod_{k=1}^K \int p(\boldsymbol Y_k \mid \boldsymbol \rho_k) p(\boldsymbol \rho_k \mid \boldsymbol \alpha) \,\mathrm{d}\boldsymbol \rho_k \\ & = \prod_{k=1}^K \left[\frac{N_k!\cdot\Gamma(\alpha_0)}{\Gamma(N_k+\alpha_0)} \prod_{i=1}^C \frac{\Gamma(Y_{ki} + \alpha_i)}{Y_{ki}!\cdot\Gamma(\alpha_i)}\right], \end{split} \] from which \(\boldsymbol \alpha\) can be estimated by maximum likelihood (Minka 2000). Alternatively we estimate \(\boldsymbol \alpha\) using a Bayesian approach, which is more readily extensible to more complex genotype models, for which \(\mathcal L(\boldsymbol \alpha\mid \boldsymbol Y)\) has no closed form (see Future Work).

First, we specify a prior distribution \[
\pi(\alpha_0, \bar{\boldsymbol \alpha}) = \frac{1}{(1+ \alpha_0)^2},
\] which is a uniform distribution on the prior probability
vector \(\bar{\boldsymbol \alpha}\),
and a uniform distribution on the variance-like parameter \(A = (1+\alpha_0)^{-1}\), in the sense that
the prior mean and variance of a given genotype population probability
are \[
E[\rho_{kj}] = \tilde \alpha_j, \qquad \textrm{var}(\rho_{kj}) = A \cdot
\tilde \alpha_j(1-\tilde \alpha_j).
\] Then, we sample from the posterior distribution \(p(\boldsymbol \alpha\mid \boldsymbol Y) \propto
\mathcal L(\boldsymbol \alpha\mid \boldsymbol Y) \pi(\boldsymbol
\alpha)\) using a Markov chain Monte Carlo (MCMC) algorithm
provided by the **Stan** programming language (Stan Development Team
2016), as detailed below.

Under the hierarchical model, the null hypothesis of equal genetic proportions between two populations, say \(k = 1,2\), is \(H_0: \boldsymbol \rho_1 = \boldsymbol \rho_2 = \boldsymbol \rho_{12}\). Our Bayesian hypothesis-testing strategy is implemented in two steps:

- Sample from the posterior distribution \(p(\boldsymbol \alpha\mid \boldsymbol Y, H_0)\).
- Sample from the posterior distribution of either classical test
statistic, \(T = \mathcal X\) or \(\Lambda\), thus obtaining a
*posterior p-value*\[ \textrm{p}_\textrm{v}^{\textrm{post}}= \textrm{Pr}(T > T_\textrm{obs} \mid \boldsymbol Y, H_0). \]

In order to estimate \(\boldsymbol
\alpha\) under \(H_0\), we apply
the Dirichlet-Multinomial distribution to \(K-1\) lakes, with the first two being
collapsed into a common count vector \(\boldsymbol Y_{12} = \boldsymbol Y_1 + \boldsymbol
Y_2\) with sample size \(N_{12} = N_1 +
N_2\). MCMC sampling is implemented via a Hybrid Monte Carlo
(HMC) variant provided by the **R** package
**rstan**. In **MADPop**, sampling from \(p(\boldsymbol \alpha\mid \boldsymbol Y,
H_0)\) is accomplished with the function `hUM.post()`

,
where “hUM” stands for *hierarchical Unconstrained Multinomial*
model. We start by merging the two samples from equal distributions
under \(H_0\):

`## [1] "Dickey" "Simcoe"`

```
eqId0 <- paste0(popId, collapse = ".") # merged lake name
popId0 <- as.character(fish215$Lake) # all lake names
popId0[popId0 %in% popId] <- eqId0
table(popId0, dnn = NULL) # merged lake counts
```

```
## Crystal Dickey.Simcoe Hogan Kingscote Macdonald
## 20 40 21 18 19
## Manitou Michipicoten Opeongo Seneca Slate
## 20 20 20 18 19
```

Next, we sample from \(p(\boldsymbol
\alpha\mid \boldsymbol Y, H_0)\) using
**rstan**:

```
X0 <- cbind(Id = popId0, fish215[-1]) # allele data with merged lakes
nsamples <- 1e4
fit0 <- hUM.post(nsamples = nsamples, X = X0,
rhoId = eqId0, # output rho only for merged lakes
chains = 1, # next two arguments are passed to rstan
warmup = min(1e4, floor(nsamples/10)),
full.stan.out = FALSE)
```

```
##
## SAMPLING FOR MODEL 'DirichletMultinomial' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 5.9e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.59 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 10000 [ 0%] (Warmup)
## Chain 1: Iteration: 1000 / 10000 [ 10%] (Warmup)
## Chain 1: Iteration: 1001 / 10000 [ 10%] (Sampling)
## Chain 1: Iteration: 2000 / 10000 [ 20%] (Sampling)
## Chain 1: Iteration: 3000 / 10000 [ 30%] (Sampling)
## Chain 1: Iteration: 4000 / 10000 [ 40%] (Sampling)
## Chain 1: Iteration: 5000 / 10000 [ 50%] (Sampling)
## Chain 1: Iteration: 6000 / 10000 [ 60%] (Sampling)
## Chain 1: Iteration: 7000 / 10000 [ 70%] (Sampling)
## Chain 1: Iteration: 8000 / 10000 [ 80%] (Sampling)
## Chain 1: Iteration: 9000 / 10000 [ 90%] (Sampling)
## Chain 1: Iteration: 10000 / 10000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.83 seconds (Warm-up)
## Chain 1: 7.557 seconds (Sampling)
## Chain 1: 8.387 seconds (Total)
## Chain 1:
```

**rstan** typically outputs a number of messages and
warnings, many of which are harmless. The **MADPop**
package lists **rstan** as a dependency, such that its
sophisticated tuning mechanism is exposed. Optionally,
`hUM.post()`

returns the entire **rstan** output
(`full.stan.out = TRUE`

), which can be run through
**rstan**’s MCMC convergence diagnostics.

By setting the `rhoId`

argument of
`hUM.post()`

, the `fit0`

object contains samples
from \(p(\boldsymbol \alpha, \boldsymbol
\rho_{12} \mid \boldsymbol Y, H_0)\). Boxplots of the 50 highest
posterior probability genotypes in the common distribution \(\boldsymbol \rho_{12}\) are displayed
below:

```
rho.post <- fit0$rho[,1,] # p(rho12 | Y)
# sort genotype counts by decreasing order
rho.count <- colSums(Xsuff$tab[popId,])
rho.ord <- order(colMeans(rho.post), decreasing = TRUE)
# plot
nbxp <- 50 # number of genotypes for which to make boxplots
clrs <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2",
"#D55E00", "#CC79A7")
rho.ct <- rho.count[rho.ord[1:nbxp]]
rho.lgd <- unique(sort(rho.ct))
rho.box <- rho.post[,rho.ord[1:nbxp]]
par(mar = c(5,4.5,.5,.5)+.1, cex.axis = .7)
boxplot(x = rho.box,
las = 2, col = clrs[rho.ct+1], pch = 16, cex = .2)
title(xlab = "Genotype", line = 4)
title(ylab = expression(p(bold(rho)[(12)]*" | "*bold(Y))))
legend("topright", legend = rev(rho.lgd), fill = rev(clrs[rho.lgd+1]),
title = "Counts", ncol = 2, cex = .8)
```

This is calculated analogously to the bootstrapped p-value, by
generating \(M\) contingency tables
with \(\boldsymbol Y_k^{(m)}
\stackrel{\textrm{ind}}{\sim}\textrm{Multinomial}(N_k, \boldsymbol
\rho_{12}^{(m)}\), \(m = 1, \ldots,
M\). However, for each contingency table, the common probability
vector \(\boldsymbol \rho_{12}^{(m)}\)
is different, corresponding to a random draw from \(p(\boldsymbol \rho_{12} \mid \boldsymbol Y,
H_0)\). The test statistic \(T^{(m)}\) is then calculated and we have
\[
\textrm{p}_\textrm{v}^{\textrm{post}}= \frac 1 M \sum_{m=1}^M
\delta(T^{(m)} \ge T_{\textrm{obs}}).
\] \(\textrm{p}_\textrm{v}^{\textrm{post}}\) is
calculated with **MADPop** as follows:

```
## user system elapsed
## 0.387 0.032 0.422
```

We can now compare the three types of p-values (asymptotic, bootstrapped, posterior) for each test statistic (\(\mathcal X\) and \(\Lambda\)):

Asymptotic | Bootstrap | Posterior | |
---|---|---|---|

\(\mathcal X\) | 33.0 | 0.68 | 13 |

\(\Lambda\) | 4.1 | 0.59 | 12 |

We see that \(\textrm{p}_\textrm{v}^{\textrm{post}}\) is much more conservative than \(\textrm{p}_\textrm{v}^{\textrm{boot}}\).

Here we used a completely unconstrained Multinomial model for the
genotype counts, in the sense that the only restriction on \(\boldsymbol \rho_k\) is that \(\rho_{ki} \ge 0\) and \(\sum_{i=1}^C \rho_{ki} = 1\). However, it
is possible to impose genetic constraints such as Hardy-Weinberg
equilibrium, preferential mating, etc., which effectively reduce the
degrees of freedom of \(\boldsymbol
\rho_k\) below \(C-1\). In this
case, the closed-form likelihood \(\mathcal
L(\boldsymbol \alpha\mid \boldsymbol Y)\) is typically
unavailable, but the **Stan** code can easily be modified
to sample from \(p(\boldsymbol \alpha,
\boldsymbol{\mathcal R}\mid \boldsymbol Y)\), where \(\boldsymbol{\mathcal R}= (\boldsymbol \rho_1,
\ldots, \boldsymbol \rho_K)\). We hope to feature some of these
extensions to the basic Dirichlet-Multinomial model in the next version
of **MADPop**.

Minka, T.P. 2000. “Estimating a Dirichlet Distribution.”
*Technical Report*. https://tminka.github.io/papers/dirichlet/minka-dirichlet.pdf.

Stan Development Team. 2016. “Rstan: The R Interface to
Stan.” 2016. https://mc-stan.org/rstan/.