Why the signed rank test is a waste of time

The signed rank test is only rarely useful, and, as we will see, even more rarely useful in the kind of situation where we might think of using it.

Ken Butler http://ritsokiguess.site/blog
2022-05-04

packages
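
All the code in this post uses the tidyverse:

library(tidyverse)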

the one-sample t-test

Suppose we have one sample of independent, identically distributed observations from some population, and we want to see whether we believe the population mean or median is some value, like 15. The standard test is the one-sample \(t\)-test; here we pretend that we have reason to believe that if the population mean is not 15, it is greater than 15:

x <- rnorm(n = 10, mean = 16, sd = 3)
x
 [1] 20.86560 13.76096 15.19321 13.90139 16.63971 18.12691 12.76501
 [8] 18.37393 16.01214 19.28764
t.test(x, mu = 15, alternative = "greater")

    One Sample t-test

data:  x
t = 1.7818, df = 9, p-value = 0.05423
alternative hypothesis: true mean is greater than 15
95 percent confidence interval:
 14.95704      Inf
sample estimates:
mean of x 
 16.49265 

In this case, the P-value is 0.0542, so we do not quite reject the null hypothesis that the population mean is 15: there is no evidence (at \(\alpha = 0.05\)) that the population mean is greater than 15. In this case, we have made a type II error, because the population mean is actually 16, and so the null hypothesis is actually wrong but we failed to reject it.

We use these data again later.

The theory behind the \(t\)-test is that the population from which the sample is taken has a normal distribution. This is assessed in practice by looking at a histogram or a normal quantile plot of the data:

ggplot(tibble(x), aes(sample = x)) + stat_qq() + stat_qq_line()

With a small sample, it is hard to detect whether a deviation from normality like this indicates a non-normal population or is just randomness.1 In this case, I actually generated my sample from a normal distribution, so I know the answer here is randomness.
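
A minimal sketch of footnote 1's line-up idea, assuming we hide the real sample among eight sets of random normal data with matching mean and SD:

# put the real sample in a randomly chosen panel, fakes elsewhere
which_panel <- sample(1:9, 1)
tibble(panel = 1:9) %>% 
  rowwise() %>% 
  mutate(y = list(if (panel == which_panel) x
                  else rnorm(length(x), mean(x), sd(x)))) %>% 
  ungroup() %>% 
  unnest(y) %>% 
  ggplot(aes(sample = y)) + stat_qq() + stat_qq_line() +
  facet_wrap(~ panel)

If the panel containing the real data stands out from the other eight, the data are genuinely different from what random normals would produce.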

There is another issue here, the Central Limit Theorem. This says, in words, that the sampling distribution of the sample mean from a large sample will be approximately normal, no matter what the population distribution is. How close the approximation is will depend on how non-normal the population is; if the population is very non-normal (for example, very skewed or has extreme outliers), it might take a very large sample for the approximation to be of any use.

Example: the chi-squared distribution is right-skewed, with one parameter, the degrees of freedom. As the degrees of freedom increases, the distribution becomes less skewed and more normal in shape.2

Consider the chi-squared distribution with 12 df:

tibble(x = seq(0, 30, 0.1)) %>% 
  mutate(y = dchisq(x, df = 12)) %>% 
  ggplot(aes(x = x, y = y)) + geom_line()

This is mildly skewed. Is a sample of size 20 from this distribution large enough to use the \(t\)-test? We can simulate the sampling distribution of the sample mean, since we know what the population is, by drawing many (in this case 1000) samples from it and seeing how normal the simulated sample means look:

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rchisq(n = 20, df = 12))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()

This is a tiny bit skewed right (the very largest values are slightly too large and the very smallest ones not quite small enough, though the rest of the values hug the line), but I would consider this close enough to trust the \(t\)-test.

Now consider the chi-squared distribution with 3 df, which is more skewed:

tibble(x = seq(0, 10, 0.1)) %>% 
  mutate(y = dchisq(x, df = 3)) %>% 
  ggplot(aes(x = x, y = y)) + geom_line()

How normal is the sampling distribution of the sample mean now, again with a sample of size 20?

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rchisq(n = 20, df = 3))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()

This time, the normal quantile plot definitely strays from the line in a way that indicates a right-skewed non-normal sampling distribution of the sample mean. With this sample size, if the population is as skewed as a chi-squared distribution with 12 degrees of freedom, the \(t\)-test is fine, but if it is as skewed as a chi-squared distribution with 3 degrees of freedom, the \(t\)-test is at best questionable.

So, consideration of whether to use a \(t\)-test has two parts: how normal the population is (answered by asking how normal your sample is), and how large the sample is. The larger the sample size is, the less the normality matters, but it is an awkward judgement call to assess whether the non-normality in the data distribution matters enough given the sample size.3
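
A sketch of footnote 3's bootstrap idea, reusing the simulation idiom from above on the sample x from earlier:

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  # resample the data with replacement, then take the mean of each resample
  mutate(my_sample = list(sample(x, size = length(x), replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()

If this plot stays close to the line, the sampling distribution of the sample mean is close enough to normal to trust the \(t\)-test, and there is only one judgement call to make.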

the sign test

If you have decided that your sample does not have close enough to a normal distribution (given the sample size), and therefore that you should not be using the \(t\)-test, what do you do? Two standard options are the sign test and the signed-rank test; the signed-rank test is often recommended over the sign test because the sign test lacks power. These tests are both non-parametric, in that they do not depend on the data having (at least approximately) any specific distribution.

For the sign test, you count how many of your observations are above and below the null median. Here we use the same data as we used for the \(t\)-test:

tibble(x) %>% 
  count(x > 15)
# A tibble: 2 × 2
  `x > 15`     n
  <lgl>    <int>
1 FALSE        3
2 TRUE         7

The number of values above (say) the null median is the test statistic. If the null hypothesis is true, each value is independently either above or below the null median with probability 0.5, and thus the test statistic has a binomial distribution with \(n\) equal to the sample size and \(p = 0.5\). Hence the P-value for an upper-tailed test is

sum(dbinom(7:10, size = 10, prob = 0.5))
[1] 0.171875
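
Base R's binom.test does the same calculation directly:

binom.test(7, 10, p = 0.5, alternative = "greater")  # p-value = 0.1719, as above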

The split of 7 values above 15 and 3 below is still fairly close4 to 50–50, and so the P-value is large, much larger than for the \(t\)-test.

The sign test does not use the data very efficiently: it only counts whether each data value is above or below the hypothesized median. Thus, if you are in a position to use the \(t\)-test that uses the actual data values, you should do so. However, it is completely assumption-free: as long as the observations really are independent, it does not matter at all what the population distribution looks like.

the signed rank test

The signed-rank test occupies a kind of middle ground between the sign test and the \(t\)-test.

Here’s how it works for our data, testing for a median of 15, against an upper-tailed alternative:

tibble(x) %>% 
  mutate(diff = x - 15) %>% 
  mutate(abs_diff = abs(diff)) %>% 
  mutate(rk = rank(abs_diff)) -> d
d
# A tibble: 10 × 4
       x   diff abs_diff    rk
   <dbl>  <dbl>    <dbl> <dbl>
 1  20.9  5.87     5.87     10
 2  13.8 -1.24     1.24      4
 3  15.2  0.193    0.193     1
 4  13.9 -1.10     1.10      3
 5  16.6  1.64     1.64      5
 6  18.1  3.13     3.13      7
 7  12.8 -2.23     2.23      6
 8  18.4  3.37     3.37      8
 9  16.0  1.01     1.01      2
10  19.3  4.29     4.29      9

Subtract the hypothesized median from each data value, and then rank the differences from smallest to largest in terms of absolute value. The smallest difference in size is \(0.193\), which gets rank 1, and the largest in size is \(5.87\), which gets rank 10. One of the negative differences is \(-2.23\), which is the fifth largest in size (so it has rank 6).

The next stage is to sum up the ranks separately for the positive and negative differences:

d %>% group_by(diff > 0) %>% 
  summarize(sum = sum(rk))
# A tibble: 2 × 2
  `diff > 0`   sum
  <lgl>      <dbl>
1 FALSE         13
2 TRUE          42

There are only three negative differences, so their ranks add up to only 13, compared to a sum of 42 for the positive differences.5 (As a check, the two sums must total \(1 + 2 + \cdots + 10 = 55\).) For an upper-tailed test, the test statistic is the sum of the ranks of the positive differences, which, if large enough, will lead to rejection of the null hypothesis.

Is 42 large enough to reject the null hypothesis?

wilcox.test(x, mu = 15, alternative = "greater")

    Wilcoxon signed rank exact test

data:  x
V = 42, p-value = 0.08008
alternative hypothesis: true location is greater than 15

The P-value is 0.0801, not small enough to reject the null median of 15 in favour of a larger value. It is a little bigger than for the \(t\)-test, but smaller than for the sign test.
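
The null distribution of the signed rank statistic is built into R as psignrank, so the exact P-value can be checked directly as an upper-tail probability:

psignrank(41, n = 10, lower.tail = FALSE)  # P(V >= 42): the same 0.08008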

A historical note: the name usually attached to the signed-rank test is Frank Wilcoxon. He worked out the null distribution of the signed rank statistic (an exercise in combinatorics).

The name of the R function, wilcox.test, is a possible source of confusion, because there was also a statistician called Walter Francis Willcox, who had nothing to do with this test.

assessing the signed rank test

The signed-rank test seemed to behave well enough in our example, with actually normal data. But the point of mentioning the test is as something to use when the data are not normal.

So let’s take some samples from our skewed chi-squared distribution with 3 df.

This distribution has this median:

med <- qchisq(0.5, 3)
med
[1] 2.365974

and away we go. I’ll do 10,000 simulations this time:

tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rchisq(10, 3))) %>% 
  mutate(my_test = list(wilcox.test(my_sample, mu = med, alternative = "greater"))) %>% 
  mutate(my_p = my_test$p.value) -> dd
dd
# A tibble: 10,000 × 4
# Rowwise: 
     sim my_sample  my_test   my_p
   <int> <list>     <list>   <dbl>
 1     1 <dbl [10]> <htest> 0.862 
 2     2 <dbl [10]> <htest> 0.722 
 3     3 <dbl [10]> <htest> 0.5   
 4     4 <dbl [10]> <htest> 0.615 
 5     5 <dbl [10]> <htest> 0.0322
 6     6 <dbl [10]> <htest> 0.947 
 7     7 <dbl [10]> <htest> 0.0322
 8     8 <dbl [10]> <htest> 0.0244
 9     9 <dbl [10]> <htest> 0.0967
10    10 <dbl [10]> <htest> 0.539 
# … with 9,990 more rows

Since the null hypothesis is true, the P-values should have a uniform distribution:

ggplot(dd, aes(x = my_p)) + geom_histogram(bins = 12)

That doesn’t look very uniform, but rather skewed to the right, with too many low values, so that the test rejects more often than it should:6

dd %>% count(my_p <= 0.05)
# A tibble: 2 × 2
# Rowwise: 
  `my_p <= 0.05`     n
  <lgl>          <int>
1 FALSE           9194
2 TRUE             806

A supposed test at \(\alpha = 0.05\) actually has a probability near 0.08 of making a type I error. (That’s why I did 10,000 simulations, in the hopes of eliminating sampling variability as a reason for it being different from 0.05.)
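
Footnote 6 mentions prop.test as a way to get a confidence interval for the true type I error probability; fed the 806 rejections out of 10,000 simulations, it reproduces the interval quoted there:

prop.test(806, 10000)$conf.int  # about 0.075 to 0.086, clearly above 0.05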

To investigate what happened, let’s look at one random sample and see whether we can reason it out:7

set.seed(457297)
x <- rchisq(10, 3)
x
 [1] 5.6369920 7.8136036 2.8290842 8.2463800 0.6228536 0.8004611
 [7] 2.2627925 6.3275717 1.2881783 0.6634575

and go through the calculations for the signed rank statistic again:

tibble(x) %>% 
  mutate(diff = x - med) %>% 
  mutate(abs_diff = abs(diff)) %>% 
  mutate(rk = rank(abs_diff)) -> d
d
# A tibble: 10 × 4
       x   diff abs_diff    rk
   <dbl>  <dbl>    <dbl> <dbl>
 1 5.64   3.27     3.27      7
 2 7.81   5.45     5.45      9
 3 2.83   0.463    0.463     2
 4 8.25   5.88     5.88     10
 5 0.623 -1.74     1.74      6
 6 0.800 -1.57     1.57      4
 7 2.26  -0.103    0.103     1
 8 6.33   3.96     3.96      8
 9 1.29  -1.08     1.08      3
10 0.663 -1.70     1.70      5

There are five positive and five negative differences, exactly as we would expect. But the four largest differences in size are all positive, so that the sum of the ranks for the positive differences is quite a bit larger than the sum of the ranks for the negative differences: 36 as against 19:

d %>% 
  group_by(diff > 0) %>% 
  summarize(sum = sum(rk))
# A tibble: 2 × 2
  `diff > 0`   sum
  <lgl>      <dbl>
1 FALSE         19
2 TRUE          36

This seems like a bit of a difference in rank sums, given that the null hypothesis is actually true.

Why did this happen, and why might it happen again? The population distribution is skewed to the right, so there will occasionally be sample values much larger than the null median (even if that median is correct). There cannot be sample values much smaller than the null median, because the distribution is bunched up at the bottom. That means that the positive differences will tend to be the largest ones in size, and hence the test statistic will tend to be bigger, and the P-value smaller, than ought to be the case.
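
The sign test, by contrast, is immune to this problem: when the population is continuous and the null median is correct, the number of values above the median is exactly binomial, no matter how skewed the population is. A sketch of the analogous simulation (output not shown) would bear this out; indeed, because the binomial is discrete, the upper-tailed test at nominal \(\alpha = 0.05\) here rejects only when 9 or more of the 10 values land above the median, a true type I error probability of \(11/1024 \approx 0.011\):

tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rchisq(10, 3))) %>% 
  # the sign test statistic: how many values exceed the true median
  mutate(n_above = sum(my_sample > med)) %>% 
  mutate(my_p = sum(dbinom(n_above:10, size = 10, prob = 0.5))) %>% 
  ungroup() %>% 
  count(my_p <= 0.05)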

conclusions

The usual get-out for the above is to say that the signed-rank test only applies to symmetric distributions. Except that one of the principal ways the \(t\)-test can fail is that the population distribution is skewed, and what we are then saying is that in exactly that situation we cannot use the signed-rank test either. Really, the only situation in which the signed-rank test has any value is when the population is symmetric with long tails or outliers, which seems to me a small fraction of the times when you would not want to use a \(t\)-test.

So, the official recommendation is:

  1. if the population is close enough to normal (given the sample size), use the \(t\)-test;

  2. if the population is not close enough to normal but you are confident that it is symmetric, use the signed-rank test;

  3. otherwise, use the sign test.

The second of those seems a bit unlikely (or unlikely to be sure enough about in practice), so that when I teach this stuff, it’s the \(t\)-test or the sign test. As I have explained, the signed-rank test is only very rarely useful, and therefore, I contend, it is a waste of time to learn about it.


  1. This is a good place to use Di Cook’s idea of a “line-up”, where you generate eight normal quantile plots of actual normal data, add the normal quantile plot of your own data, shuffle the nine plots around, and see whether you can pick out which one is yours. If you can, your data are different from what random normals would produce.↩︎

  2. In the limit as degrees of freedom increases, the distribution is normal.↩︎

  3. When you have actual data from some unknown distribution, one way to get a sense of the sampling distribution of the sample mean is to use the bootstrap: generate a large number of samples from the sample with replacement, work out the mean of each one, and then see whether that distribution of sample means is close to normal. This is still a subjective call, but at least it is only one thing to assess, rather than having to combine an assessment of normality with an assessment of sample size.↩︎

  4. In the sense that if you tossed a fair coin 10 times, you would not be terribly surprised to see 7 heads and 3 tails.↩︎

  5. The three negative differences average to a rank of about 4.3, while the seven positive differences average to a rank of 6. Thus the positive differences are typically bigger in size than the negative ones are.↩︎

  6. A 95% confidence interval for the true type I error probability is from 0.075 to 0.086, so it is definitely higher than 0.05. prop.test is a nice way to get this interval.↩︎

  7. I am cheating and using one that I think makes the point clear.↩︎

Citation

For attribution, please cite this work as

Butler (2022, May 4). Ken's Blog: Why the signed rank test is a waste of time. Retrieved from http://ritsokiguess.site/blogg/posts/2022-05-04-why-the-signed-rank-test-is-a-waste-of-time/

BibTeX citation

@misc{butler2022why,
  author = {Butler, Ken},
  title = {Ken's Blog: Why the signed rank test is a waste of time},
  url = {http://ritsokiguess.site/blogg/posts/2022-05-04-why-the-signed-rank-test-is-a-waste-of-time/},
  year = {2022}
}