On the exactness of permutation tests

In this article, we discuss the exactness property of permutation tests, which is closely related to how \(p\)-values are computed from the permutations. This article is a summary of the paper by Phipson and Smyth (2010).

Traditional formulation of a statistical test

A statistical test aims at determining whether some observed data can be considered to be strong evidence in favor of a so-called alternative hypothesis \(H_a\) compared to a so-called null hypothesis \(H_0\). To do that, a test statistic \(T\) is defined such that:

Once such a test statistic is available and we observe some data, we can denote by \(t_\mathrm{obs}\) the value of the test statistic computed from the observed data and define the so-called p-value as the null hypothesis tail probability: \[ p_\infty = \mathbb{P}_{H_0} \left( T \ge t_\mathrm{obs} \right). \]The p-value \(p_\infty\) is by definition uniformly distributed on \((0,1)\) under the null hypothesis. Hence, we can define the so-called significance level \(\alpha \in (0,1)\) and decide to reject \(H_0\) in favor of \(H_1\) when \(p_\infty \le \alpha\). By doing this, the probability of wrongly rejecting \(H_0\), also known as the probability of type I errors, is simply: \[ \mathbb{P}_{H_0} \left( p_\infty \le \alpha \right) = \alpha. \]The significance level \(\alpha\) therefore matches by design the probability of type I errors, which means that choosing \(\alpha\) allows to control the probability of type I errors. We say that the test is exact.

When the null distribution of \(T\) is not known, you do not have access to \(p_\infty\). A possible solution is then to resort to resampling techniques to approach the null hypothesis. Once you get an approximate null distribution, the question is how do you compute a p-value that provides an exact statistical test. There are two approaches to this problem: the first estimates the true p-value \(p_\infty\) from the approximate null distribution while the second proposes an alternative definition of the p-value that can be straightforwardly computed. Let us expand on both approaches.

The two-sample problem in a permutational framework

We start with two samples \(X_1, \dots, X_{n_x} \stackrel{iid}{\sim} \mathcal{D}(\theta_x)\) and \(Y_1, \dots, Y_{n_y} \stackrel{iid}{\sim} \mathcal{D}(\theta_y)\). We want to know whether the two distributions are the same or not on the basis of the two samples we collected. In this parametric setting, it boils down to testing the following hypotheses: \[ H_0: \theta_x = \theta_y \quad \mbox{vs} \quad \theta_x \neq \theta_y. \]Let \(T\) be a statistic that depends on the two samples which is suited for elucidating this test, i.e.:

Now, for performing the test, one should also know (an approximation of) the distribution of \(T\) under the null hypothesis, also known as the null distribution, from which the p-value associated to this test can be computed for instance.

If one knows the exact null distribution, then there is no need to resort to permutations. However, if the null distribution is not known, permutations come in handy for approaching it.

The idea is that, under the null hypothesis, the two samples come from the same distribution. Hence, we can pull them together as one big sample of size \(n = n_x + n_y\) generated by that common distribution. At this point, we can split the pooled sample into two random subsets of size \(n_x\) and \(n_y\) respectively, and use them to compute a value of the statistic \(T\). If we repeat many times this splitting strategy, say \(m\) times, we end up with \(m\) values of the statistic from which we can compute the empirical distribution, known as the permutation distribution, which approaches the null distribution.

Permutation p-value as an unbiased estimator of \(p_\infty\)

Let \(t_\mathrm{obs}\) be the value of the statistic computed from the original two samples, \(m\) be the number of permutations used to approach the null distribution and \(B\) be a random variable that counts the number of test statistic values greater than or equal to \(t_\mathrm{obs}\).

By definition, the random variable \(B\) follows a binomial distribution of size \(m\) and rate of success \(p_\infty\). Hence, we can define the following unbiased estimator of \(p_\infty\): \[ \widehat{p_\infty} = \frac{B}{m}. \]

However, when one uses this estimator of the p-value for the purpose of hypothesis testing, the resulting test is not exact. Let us investigate why.

First, remember that the true p-value \(p_\infty\) is a random variable itself, in the sense that its value changes as soon as \(t_\mathrm{obs}\) changes i.e. each time the whole experiment is reconducted. Hence, the probability of wrongly rejecting the null hypothesis using \(\widehat{p_\infty}\) reads: \[ \mathbb{P} \left( \widehat{p_\infty} \le \alpha \right) = \int_\mathbb{R} \mathbb{P} \left( \widehat{p_\infty} \le \alpha | p \right) f_{p_\infty}(p) dp = \int_0^1 \mathbb{P} \left( \widehat{p_\infty} \le \alpha | p \right) dp, \]because \(p_\infty\) is uniformly distributed on \((0,1)\) under the null hypothesis.

Next, notice that \(\widehat{p_\infty}\) can only take on a finite set of values \(\left\{ 0, \frac{1}{m}, \frac{2}{m}, \dots, \frac{m-1}{m}, 1 \right\}\). Hence, we have for any \(b \in 0, 1, \dots, m\): \[ \mathbb{P} \left( \widehat{p_\infty} = \frac{b}{m} \right) = \int_0^1 \mathbb{P} \left( \widehat{p_\infty} = \left. \frac{b}{m} \right| p \right) dp = \int_0^1 \binom{m}{b} p^b (1-p)^{m-b} dp = \frac{1}{m + 1}. \]

We can therefore deduce that: \[ \mathbb{P} \left( \widehat{p_\infty} \le \alpha \right) = \frac{\lfloor m \alpha \rfloor + 1}{m + 1} \neq \alpha. \]

The following R code shows graphically that using \(\widehat{p_\infty}\) as p-value does not provide an exact test:

alpha <- seq(0.01, 0.1, by = 0.01)
m <- c(10, 100, 1000)
p1 <- crossing(alpha, m) %>% 
    p = (floor(m * alpha) + 1) / (m + 1), 
    mf = paste("m =", m)
  ) %>% 
  ggplot(aes(alpha, p, color = mf)) + 
  geom_point() + 
  geom_abline(aes(intercept = 0, slope = 1)) + 
    x = "Significance level",
    y = "Probability of wrongly rejecting H0"
  ) + 
  facet_wrap(vars(mf)) + 
  scale_color_viridis_d() + 
  scale_y_continuous(limits = c(0, 0.1)) + 
  coord_equal() + 
fig <- p1 %>% 
  plotly::ggplotly() %>% 
  widget = fig, 
  file = "exactness-fig1.html", 
  selfcontained = rmarkdown::pandoc_available("1.12.3")
  src = "exactness-fig1.html",
  scrolling = "no", 
  seamless = "seamless",
  frameBorder = "0",
  width = "100%", 
  height = 400