Welch analysis of variance


The standard analysis of variance based on the \(F\)-test has two major assumptions:

  1. Normally distributed data
  2. Equal variance within each group.

The analysis can handle a certain amount of non-normality, but the equal-variance assumption is important because it is required for the idea of “an” error variance to make sense (that is what the error mean square is estimating).

Is it possible to make an ANOVA that can allow for the groups to have different variances? The answer is yes, and it’s the same mechanism as is behind the Welch-Satterthwaite \(t\)-test, which we’ll review first.

Pooled and Welch-Satterthwaite \(t\)-tests

If you have done statistical theory, you probably derived the pooled two-sample \(t\)-test. This is for comparing the means of two populations based on samples from each. The assumptions are the same as for the ANOVA: normally-distributed data, with equal variances in the two groups. Since the two within-group variances are the same, there is only one “error variance” \(\sigma^2\) to estimate, and this is estimated by \(s_p^2\), which is a weighted average of the sample variances from the two groups. The test statistic is then

\[ t = { \bar{x}_1 - \bar{x}_2 \over s_p \sqrt{{1\over n_1}+ {1\over n_2}}} \]

and since there is one population variance \(\sigma^2\) estimated by a sample variance \(s_p^2\), the test statistic \(t\) does indeed have a exact \(t\) distribution with as many degrees of freedom as \(s_p^2\) does, namely \(n_1+n_2-2\).

What if the two groups have unequal variances? Welch and Satterthwaite (independently, as far as I know) had the idea of estimating the two separate \(\sigma^2\)’s by the two sample variances and calculating this \(t\)-statistic instead:

\[ t = { \bar{x}_1 - \bar{x}_2 \over \sqrt{{s_1^2\over n_1}+ {s_2^2\over n_2}}} \] which is the obvious thing. Unfortunately, however, this does not have a \(t\) distribution, because two population variances have been estimated by sample variances.

Welch and Satterthwaite noticed that this \(t\) does have a distribution that looks very like a \(t\), so they hit upon the idea of approximating it by a \(t\). They worked out the variance of this \(t\) under the null hypothesis of equal means, and compared this to the variance of a \(t\) distribution, which depends on its degrees of freedom. This allows a “method of moments” approximation of the degrees of freedom, which is the very complicated formula in a footnote in DeVeaux et al. The complication doesn’t matter in practice, though, since you let your software work out the (fractional) degrees of freedom and get a P-value. If you’re stuck doing this by hand, a “conservative” approximation to the df is the smaller sample size minus 1.

In practice, it is probably good to use the Welch-Satterthwaite \(t\)-test as a default without thinking too hard about it, because the two group variances are not usually “approximately equal”, and it doesn’t make too much sense to act as if they are. That said, the Welch-Satterthwaite and pooled \(t\)-tests often give similar results, as in the example below.

Some researchers are investigating a new method for teaching children to read. There are 44 children in their study; they randomly assign about half of the children to the new reading method (“treatment”) and about half to the standard reading method (“control”). The response variable was the score on a reading test, administered at the end of the study. The groups came out slightly unequal in size. Evidently, once a child has learned to read, they cannot learn to read again, so the design for this study had to be two independent samples rather than something like matched pairs.

Let’s load the Tidyverse:

## ── Attaching packages ─────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2     ✔ purrr   0.2.4     
## ✔ tibble  1.4.2          ✔ dplyr   0.7.4     
## ✔ tidyr   0.8.0          ✔ stringr 1.3.0     
## ✔ readr   1.1.1          ✔ forcats 0.3.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

and then read in the data, values separated by spaces:

drp=read_delim("drp.txt"," ")
## Parsed with column specification:
## cols(
##   group = col_character(),
##   score = col_integer()
## )

and summarize the values:

drp %>% group_by(group) %>% 
## # A tibble: 2 x 3
##   group  mean    sd
##   <chr> <dbl> <dbl>
## 1 c      41.5  17.1
## 2 t      51.5  11.0

The means are 10 points different, and the immediate question is whether these are far enough apart to be significantly different. Note also, though, that the sample standard deviations are not that similar either. R’s t.test does the Welch-Satterthwaite test by default:

##  Welch Two Sample t-test
## data:  score by group
## t = -2.3109, df = 37.855, p-value = 0.01319
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -2.691293
## sample estimates:
## mean in group c mean in group t 
##        41.52174        51.47619

I did a one-sided test, since I wanted to see whether the new reading method was an improvement. At \(\alpha=0.05\), it is; the P-value is 0.0132.

What happens if we (evidently mistakenly) run the pooled \(t\)-test here? That is done this way:

##  Two Sample t-test
## data:  score by group
## t = -2.2666, df = 42, p-value = 0.01431
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -2.567497
## sample estimates:
## mean in group c mean in group t 
##        41.52174        51.47619

The P-value, 0.0143, is very similar, and the conclusion at \(\alpha=0.05\) is the same. Maybe the sample variances were “not too unequal”.

The Welch ANOVA

Welch went on and applied his idea to the ANOVA \(F\)-test. His idea was to leave the numerator df in the \(F\) the same, but to tweak the denominator df to account for the groups having unequal variances.

Let’s illustrate using some data on jumping rats. The idea of the experiment was to see whether jumping (as a proxy for physical exercise) makes bones stronger. The rats were randomly divided into three groups: a control group, that did no extra jumping, and two treatment groups, one that jumped high (highjump) and one not so high (lowjump). Both treatment groups spent the same amount of time jumping. The idea with having two treatment groups was to see whether any exercise made a difference, or whether more vigorous exercise was what mattered. The response variable was bone density, a higher value meaning stronger bones.

rats=read_delim("jumping.txt"," ")
## Parsed with column specification:
## cols(
##   group = col_character(),
##   density = col_integer()
## )

Let’s look at boxplots by group:


The control group has outliers, and the low-jumping group has larger spread than the high-jumping group (whose distribution looks skewed). Ignoring the apparent non-normality, what happens when we account for the differences in spread?

First, the regular ANOVA, which we are not at all sure we should trust:

##             Df Sum Sq Mean Sq F value Pr(>F)   
## group        2   7434    3717   7.978 0.0019 **
## Residuals   27  12579     466                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

and then the Welch ANOVA, which is done with oneway.test:

##  One-way analysis of means (not assuming equal variances)
## data:  density and group
## F = 8.8164, num df = 2.000, denom df = 17.405, p-value = 0.002268

The P-values are again not that different: 0.0019 vs. 0.0022. You see that the Welch ANOVA has a slightly different \(F\) statistic (in the same way that the Welch-Satterthwaite \(t\)-test has a slightly different \(t\)-statistic), the numerator df is the same, but the denominator df is different, and fractional.

In any case, there are differences among the mean bone densities in the different groups.

The next question is to understand where the differences lie. With regular ANOVA, a standard method is Tukey’s:

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## Fit: aov(formula = density ~ group, data = rats)
## $group
##                   diff       lwr       upr     p adj
## Highjump-Control  37.6  13.66604 61.533957 0.0016388
## Lowjump-Control   11.4 -12.53396 35.333957 0.4744032
## Lowjump-Highjump -26.2 -50.13396 -2.266043 0.0297843

This says the that mean bone density for the high-jumping group is significantly higher than for the other two groups, which are not significantly different. This says, therefore, that it is the high jumping that makes a difference.

The standard follow-up for Welch’s ANOVA appears to be “Games-Howell”. This is part of the function oneway in the userfriendlyscience package:

## Warning in oneway(x = group, y = density, posthoc = "games-howell"): ###
## Warning: the x variable (group) is not a factor! Converting it myself - but
## note that variables in R have data types, and it's advisable to set these
## adequately (use for example 'as.factor'; see '?as.factor' for help)!
## ### Oneway Anova for y=density and x=group (groups: Control, Highjump, Lowjump)
## Omega squared: 95% CI = [.07; .55], point estimate = .32
## Eta Squared: 95% CI = [.11; .52], point estimate = .37
##                                      SS Df      MS    F    p
## Between groups (error + effect) 7433.87  2 3716.93 7.98 .002
## Within groups (error only)      12579.5 27  465.91          
## ### Post hoc test: games-howell
##                   diff  ci.lo ci.hi    t    df    p
## Highjump-Control  37.6  11.28 63.92 3.72 14.83 .006
## Lowjump-Control   11.4 -15.90 38.70 1.08 16.19 .542
## Lowjump-Highjump -26.2 -46.80 -5.60 3.25 17.60 .012

Note that the ANOVA at the top is the standard \(F\)-test, not the Welch ANOVA, but the Games-Howell below is correct. The Games-Howell adjusted P-values differ slightly from Tukey, but tell exactly the same story: no difference between low jump and control, but high jump is better than both. We can therefore be more confident that the conclusions we drew before are not dependent on the unequal spreads.

The warning, by the way, is that the variable group is character rather than a factor (since read_delim does not turn strings into factors when reading from a file). Since the group column does successfully distinguish the groups, the warning can be ignored.


I would recommend doing the Welch ANOVA using oneway.test (worrying only about the approximate normality of the data, and not about equal variances), and follow up with Games-Howell (ignoring the first part of the oneway output). This would be consistent with routinely using the Welch-Satterthwaite \(t\)-test over the pooled test, on the grounds that you never know whether the group variances are “sufficiently equal”. It is always possible to run the regular ANOVA and Tukey as well; if the results are similar, evidently the groups have similar variances and it doesn’t really matter what you do.

I think the Welch ANOVA deserves to be far better known.