library(tidyverse)
Worksheet 7
Questions are below. My solutions will be available after the tutorials are all finished. The whole point of these worksheets is for you to use your lecture notes to figure out what to do. In tutorial, the TAs are available to guide you if you get stuck. Once you have figured out how to do this worksheet, you will be prepared to tackle the assignment that depends on it.
If you are not able to finish in an hour, I encourage you to continue later with what you were unable to finish in tutorial.
Note: questions 11 through 25 are about analysis of variance. Some of these questions may make more sense after Thursday’s lecture. You might like to leave those until then. I will publish all my solutions on Wednesday evening as usual. Be guided by Thursday’s lecture as to how much of this you will need to know for the midterm. This makes for a rather heavyweight worksheet, but I want to make sure you get some practice at everything you might need for the midterm. (This same material is on Assignment 6, which will not be due until after the midterm.)
Fuel efficiency comparison
Some cars have on-board computers that calculate quantities related to the car’s performance. One of the things measured is the fuel efficiency, that is, how much gasoline the car uses. On an American car, this is measured in miles per (US) gallon. On one type of vehicle equipped with such a computer, the fuel efficiency was measured each time the gas tank was filled up, and the computer was then reset. Twenty observations were made, and are in http://ritsokiguess.site/datafiles/mpgcomparison.txt. The computer’s values are in the column `Computer`. The driver also calculated the fuel efficiency by hand, by noting the number of miles driven between fill-ups, and the number of gallons of gas required to fill the tank each time. The driver’s values are in `Driver`. The final column `Diff` is the difference between the computer’s value and the driver’s value for each fill-up. The data values are separated by tabs.
- Read in and display (some of) the data.
Solution
This is like the athletes data, so `read_tsv`:
my_url <- "http://ritsokiguess.site/datafiles/mpgcomparison.txt"
fuel <- read_tsv(my_url)
Rows: 20 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
dbl (4): Fill-up, Computer, Driver, Diff
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fuel
20 observations, with columns as promised. The first column `Fill-up` labels the fill-ups from 1 to 20, in case we needed to identify them. (This is an additional clue that we have paired data, or repeated measures in general.[1])
Extra: another way to read the data, and why it works: since the “tab” is a delimiter, distinguishing one data value from the next, you might be wondering whether you could use `read_delim` somehow. It turns out that you can, but you have to figure out how to represent “tab” as something that `read_delim` can understand. One way is to search for it, of course, but another way is to see what happens when you use `read_delim` the way you already know (assuming the data values are separated by spaces):
<- "http://ritsokiguess.site/datafiles/mpgcomparison.txt"
my_url <- read_delim(my_url, " ") fuel0
Rows: 20 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
chr (1): Fill-up Computer Driver Diff
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fuel0
This fails, but how it fails is instructive.[2] There is now one column, amusingly called `Fill-up Computer Driver Diff`, whose values are text: the four numbers glued together with `\t` in between. This is how R recognizes a tab character.[3] So you can try `read_delim` using this as the second input rather than a space.
You could get to this point with less insight by saying to yourself “this thing `\t`, whatever it is, is what separates the data values, so how about I try this as the second input to `read_delim`?”
Here’s how it goes:
<- "http://ritsokiguess.site/datafiles/mpgcomparison.txt"
my_url <- read_delim(my_url, "\t") fuel0
Rows: 20 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
dbl (4): Fill-up, Computer, Driver, Diff
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fuel0
and it works, with four numeric columns as promised.
Another option is to use `read_delim` without a second input:
my_url <- "http://ritsokiguess.site/datafiles/mpgcomparison.txt"
fuel0 <- read_delim(my_url)
Rows: 20 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
dbl (4): Fill-up, Computer, Driver, Diff
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fuel0
This works, but you need to understand what is going on, something beyond “well, it seems to work”. One way is to look at the message from `read_delim` that says what got read in:
── Column specification ────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
dbl (4): Fill-up, Computer, Driver, Diff
The four quantitative columns, but look above that, where it says `Delimiter: "\t"`. The delimiter is what separates each data value from the next one on the same line. The symbol `\t` means “tab”, which you could simply assert, but it is better to offer some sort of evidence that this is correct. To do that, search for something like “R \t”.
The first hit for me was actually not R-related, but was a Stack Overflow question (scroll down to the two answers) that has the answer we need. The symbol `\t` comes from Unix (and is therefore also in Linux and the C programming language; because R’s heritage is C, it is in R as well).
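If you want a quick piece of evidence from R itself, here is a one-line check (a sketch; any string containing `\t` would do):
writeLines("left\tright")  # prints "left" and "right" separated by an actual tab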
\(\blacksquare\)
- What is it that makes this paired data? Explain briefly.
Solution
There are two measurements for each fill-up: the computer’s calculation of gas mileage, and the driver’s. These go together because they are two calculations of what is supposed to be the same thing. (Another hint is that we have differences, with the veiled suggestion that they might be useful for something.)
If you want to have a general way of tackling this kind of problem, ask yourself “is there more than one observation of the same thing per individual?” Then go on and talk about what the individuals and observations are for the data you’re looking at. In this case, the individuals are fill-ups, and there are two observations per fill-up: one by the driver and one by the computer. Compare this with the children learning to read (from lecture): there were \(23+21 = 44\) children altogether (individuals), and each child had one reading score, based on the reading method they happened to be assigned to. So those are 44 independent observations, and a two-sample test is the way to go for them.
Extra 1: To go back to our gas mileages, here is another way you might have received the data:
This looks more like a two-sample layout, with one column of `gas_mileage` values and a second column indicating whether the Driver or Computer observed[4] them. But the first two rows were observed at the same fill-up, so we know that these are actually paired (by fill-up) and the layout of the data for our analysis is wrong. (Later in the course, we learn how to convert between the previous “wider” layout and this “longer” one. I didn’t show you the code here, because I didn’t want to confuse you at this point.)
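If you are curious anyway, here is a minimal sketch of the conversion using tidyr’s `pivot_longer`. The name `gas_mileage` matches the column mentioned above; `recorded_by` is borrowed from footnote 4, rather than being whatever the original display used:
fuel %>%
  pivot_longer(c(Computer, Driver),
               names_to = "recorded_by", values_to = "gas_mileage")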
Extra 2: if your individuals have more than two measurements each, you can’t shoehorn it into a one-sample model any more, because there are too many differences among[5] measurements now. These are repeated measures data, and the modern way to analyze these is using mixed models. The student GPA example in the link shows how it goes; the one in the link is repeated measures because we have six observations (of GPA) for each individual (student).
\(\blacksquare\)
- Draw a suitable graph of these data, bearing in mind what you might want to learn from your graph.
Solution
In a matched pairs situation, what matters is whether the differences have enough of a normal distribution. The separate distributions of the computer’s and driver’s results are of no importance. So make a graph of the differences. We are specifically interested in normality, so a normal quantile plot is best:
ggplot(fuel, aes(sample = Diff)) + stat_qq() + stat_qq_line()
The next best plot is a histogram of the differences:
ggplot(fuel, aes(x = Diff)) + geom_histogram(bins = 6)
This one actually looks left-skewed, or has an outlier at the bottom. The appearance may very well depend on the number of bins you choose.
Extra: the distributions of the computer’s and driver’s gas mileage measurements themselves are totally irrelevant to whether you would run a matched pairs test of any kind; the only thing that matters is the distribution of the differences.
The only other graph you could justify (and even then it is a bit of a stretch) would be a scatterplot of the two measurements at each fill-up:
ggplot(fuel, aes(x = Driver, y = Computer)) + geom_point() +
geom_abline(intercept = 0, slope = 1)
By adding the line \(y = x\) (which has intercept 0 and slope 1) to the graph, you can see how the pairs of measurements compare to each other; for most of them, the computer’s measurement is larger than the corresponding driver’s measurement (the point is above the line), and for only a few is it smaller. If you had decided to do a matched-pairs sign test, this graph would be interesting (with the line), because the sign test is based on how many points are above the line vs. how many are below. But what this graph does not do is tell you whether you need to do a matched-pairs sign test in the first place, because it does not tell you about the distribution of the differences (in particular, whether that is normal enough). Thus, the scatterplot, even with the line, is inferior to even the histogram of differences.
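If you wanted the counts behind that impression, a quick sketch:
fuel %>% count(Computer > Driver)   # how many points are above vs. below the line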
\(\blacksquare\)
- Is there any difference between the average results of the driver and the computer? (Average could be mean or median, whichever you think is best). Do an appropriate test.
Solution
The choices are a matched-pairs \(t\)-test, or a matched-pairs sign test (on the differences). To choose between those, look back at your graph. My take is that the only possible problem is the smallest (most negative) difference, but that is not very much smaller than expected. This is especially so given the sample size (20), which means that the Central Limit Theorem will help us enough to take care of the small outlier.
I think, therefore, that a paired \(t\)-test is the way to go, to test the null that the mean difference is zero (against the two-sided alternative that it is not zero, since we were looking for any difference). There are two ways you could do this: as literal matched pairs:
with(fuel, t.test(Computer, Driver, paired = TRUE))
Paired t-test
data: Computer and Driver
t = 4.358, df = 19, p-value = 0.0003386
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
1.418847 4.041153
sample estimates:
mean difference
2.73
or, as a one-sample test on the differences:
with(fuel, t.test(Diff, mu = 0))
One Sample t-test
data: Diff
t = 4.358, df = 19, p-value = 0.0003386
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
1.418847 4.041153
sample estimates:
mean of x
2.73
with the same result.
If you thought that low value was too much of an outlier, the right thing would have been a sign test that the median difference is zero (vs. not zero), thus (using the `smmr` package):
sign_test(fuel, Diff, 0)
$above_below
below above
3 17
$p_values
alternative p_value
1 lower 0.999798775
2 upper 0.001288414
3 two-sided 0.002576828
In any of those cases, we conclude that the average difference is not zero, since the P-values are less than 0.05. (The right one for the sign test is 0.0026.)
I don’t mind which test you do, as long as it is one of the ways of doing matched pairs, and you justify your choice by going back to your graph. Doing one of the tests without saying why is being a STAB22 or STAB57-level statistician (in fact, it may not even be that), not a professional-level applied statistician.
Extra 1: you could find out how far the average difference is from zero by looking at a confidence interval; the one for the mean difference is from 1.4 to 4.0. If you did a sign test, you would do this to get a confidence interval:
ci_median(fuel, Diff)
[1] 1.102148 4.500000
1.1 to 4.5, a little wider,[6] but otherwise very similar.
So we think the computer was estimating the gas mileage to be that much higher than the driver on average.
Extra 2: The P-value for the \(t\)-test was almost 10 times smaller than for the sign test. This is because the \(t\)-test was using the actual differences, not just whether they were greater or less than zero as the sign test did (17 out of the 20 differences were positive). If you thought, as I did, that it was all right to use the \(t\)-test, then the smaller P-value is trustworthy. (The usual thing, if two tests give similar results, as they do here, is that the one with the smaller P-value is better, but that is not always something to rely on. You still need to make an assessment.)
Extra 3: A bootstrapped sampling distribution of the sample mean would be a better way to decide how happy you are with the \(t\)-test. This will incorporate the sample size into your consideration. It is really the same thing as the one for the one-sample \(t\), bearing in mind that it’s the differences you need to work with:
tibble(sim = 1:10000) %>%
rowwise() %>%
mutate(my_sample = list(sample(fuel$Diff, replace = TRUE))) %>%
mutate(my_mean = mean(my_sample)) %>%
ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()
This is a tiny bit left-skewed still (the lowest values are very slightly too low, and the highest ones are slightly bunched up). The apparent low-end outlier in the distribution of differences barely even shows up in the sampling distribution. (If we had a smaller sample size, like 10 or especially 5, we would see it here more clearly, and then we would know that a sample size of 5 or 10 is not big enough.)
In other words, despite what you might think from your earlier plot, the matched pairs \(t\)-test would be very much defensible here.
\(\blacksquare\)
- The people who designed the car’s computer are interested in whether the values calculated by the computer and by the driver on the same fill-up are usually close together. Explain briefly why it is that looking at the average (mean or median) difference is not enough. Describe what you would look at in addition, and how that would help.
Solution
The mean (or median) difference can be close to zero without the individual differences being close to zero. The driver could be sometimes way higher and sometimes way lower, and the average difference could come out close to zero, even though each individual pair of measurements is not that close together. Thus looking only at the average difference is not enough.
There are (at least) two (related) things you might choose from to look at, in addition to the mean or median:
- the spread of the differences (the SD or inter-quartile range). If the average difference and the spread of differences are both close to zero, that would mean that the driver’s and computer’s values are consistently similar.
- a suitable confidence interval here (for the mean or median difference, as appropriate for what you did) would also get at this point.
To think about the spread: use the SD if you did the \(t\)-test for the mean, and use the IQR if you did the sign test for the median, that is, one of these:
fuel %>% summarize(diff_sd = sd(Diff))
or
fuel %>% summarize(diff_iqr = IQR(Diff))
as appropriate. Make a call about whether you think your measure of spread is small or large. If you think it’s small, you can say that the differences are consistent with each other, and therefore that the computer’s and driver’s values are typically different by about the same amount (we concluded that they are different). If you think your measure of spread is large, then the driver’s and computer’s values are inconsistently different from each other (which would be bad news for the people who designed the car’s computer, as long as you think the driver was being reasonably careful in their record-keeping).
If you said that the confidence interval for the mean/median was the thing to look at, then pull the interval out of the \(t\)-test output or run `ci_median`, as appropriate. The computer’s miles per gallon is consistently between 1.4 and 4.0 miles per gallon higher (mean) or between 1.1 and 4.5 miles per gallon higher (median). Make a call about whether you consider this interval short or long, bearing in mind it’s based on the difference between two numbers that are about 40, and then say that the computer’s measurement is larger than the driver’s by a consistent amount (if you think the interval is short) or by an amount that varies quite a bit (if you think the interval is long).
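A sketch of pulling those intervals out with code; `conf.int` is the component of R’s `t.test` output that holds the interval, and `ci_median` is from smmr:
with(fuel, t.test(Diff))$conf.int   # CI for the mean difference
ci_median(fuel, Diff)               # CI for the median difference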
Extra: in a situation like the one here, where there is a significant difference between the means, knowing also that the standard deviation is small would tell us that the computer is consistently giving a value that is bigger than the driver’s by about the same amount every time. Another way to think about this is to plot the driver’s and computer’s values against each other with a (solid) reference line[7] like this:
ggplot(fuel, aes(x = Driver, y = Computer)) + geom_point() +
geom_abline(intercept = 0, slope = 1) +
geom_abline(intercept = 3, slope = 1, linetype = "dashed")
The solid reference line is where a point would be if the driver and computer had the same value. Here, there seems to be some consistency in the points being up and to the left of, but parallel to, the reference line. This would be saying that the computer’s value would be fairly consistently more than the driver’s by about 3 miles per gallon (eyeballing the plot; the dashed line), with some exceptions.
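The intercept of 3 for the dashed line was eyeballed; if you wanted something less ad hoc, you could use the estimated mean difference from the \(t\)-test instead:
fuel %>% summarize(mean_diff = mean(Diff))   # 2.73, close to the eyeballed 3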
\(\blacksquare\)
Neuropathy and pain relief
The data in http://ritsokiguess.site/datafiles/exactRankTests_neuropathy.csv are the results of a study of pain relief in diabetic patients. The patients were randomly assigned to a standard treatment (“control”) or to a new treatment (“treat”), and for each patient a pain score was recorded, with a higher score indicating better pain relief. There were 58 patients altogether, 30 in the control group and 28 in the treatment group.
- Read the data and display at least some of it.
As ever:
my_url <- "http://ritsokiguess.site/datafiles/exactRankTests_neuropathy.csv"
neuropathy <- read_csv(my_url)
Rows: 58 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): group
dbl (1): pain
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
neuropathy
There are indeed 58 patients, in two groups `control` and `treat`.[8]
Extra: Precisely, a pain score was measured twice for each patient, once before the study started and once after four weeks, and the variable `pain` you see here is the (natural) log of the baseline score divided by the final score.[9] This is the researchers’ way of summarizing pain relief by one number, instead of requiring us to deal with matched pairs (in a two-sample context, which would be rather confusing).
- Draw a suitable graph of the two variables.
The obvious thing, given that we are not explicitly looking for normality (yet), but just getting a general sense of centre and spread and shape, is a boxplot:
ggplot(neuropathy, aes(x = group, y = pain)) + geom_boxplot()
Interpretation coming up.
- Does your plot suggest that there will be a significant treatment effect, or not? Explain briefly.
The treatment median is a bit higher than the control median, but the treatment scores (especially) are very variable, so I would guess that the smallish difference in medians is not going to be significant: the difference in medians is not very big compared to how much variability there is.
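To put numbers on that impression, a quick sketch of the group medians and IQRs:
neuropathy %>%
  group_by(group) %>%
  summarize(med_pain = median(pain), iqr_pain = IQR(pain))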
- The researchers decided to run a Mood’s median test to compare the pain scores for the treatment and control groups. Why do you think they decided to do that, on the evidence you have so far?
It looks as if neither distribution is very close to normal, because both groups have an extreme low outlier, and it also looks as if both groups have an upper outlier as well. (I would call these outliers rather than long tails because their values seem to be a long way away from the other observations, quite a bit lower or higher).
The other thing to think about is sample size. We have 30 observations in the control group and 28 in the treatment group; these are not small samples, but the outliers, especially the low one in each group, seem more extreme than you would expect samples of this size to be able to handle.
Extra: that was as far as I was expecting you to go, but I seem to have the bootstrap sampling distribution of the sample mean here, at least for the control group:
%>% filter(group == "control") -> controls
neuropathy controls
tibble(sim = 1:10000) %>%
rowwise() %>%
mutate(my_sample = list(sample(controls$pain, replace = TRUE))) %>%
mutate(my_mean = mean(my_sample)) %>%
ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()
This, for me, is surprisingly good. I guess a little copying and pasting will reveal whether the treatment group is as good:
%>% filter(group == "treat") -> treats
neuropathy treats
tibble(sim = 1:10000) %>%
rowwise() %>%
mutate(my_sample = list(sample(treats$pain, replace = TRUE))) %>%
mutate(my_mean = mean(my_sample)) %>%
ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()
Pretty much, just a bit of funny stuff at the upper end. So maybe a \(t\)-test would have been all right in actual fact.
- Run Mood’s median test on these data. What do you conclude?
Having run `library(smmr)` somewhere:
median_test(neuropathy, pain, group)
$grand_median
[1] 0.1635
$table
above
group above below
control 13 17
treat 16 12
$test
what value
1 statistic 1.1047619
2 df 1.0000000
3 P-value 0.2932234
With a (two-sided) P-value of 0.293, there is no evidence of any difference in median pain scores between the treatment and control groups. That is, there is no evidence of any treatment effect on the pain score (positive or negative).
I didn’t say anything about one-sided or two-sided, which means that you can make your own decision about that. You can take the attitude that nowhere here does it say anything about looking for evidence in a particular direction, which favours a two-sided test. Or you can say that a higher score indicates better pain relief, and a new treatment is supposed to be better (as with the kids learning to read from lecture), so we should do a one-sided test. As ever, have a reason for doing what you did.
To do this test one-sided, there is a bit more work to do:
- are we on the correct side? Yes, because a small majority of the control patients were below the grand median and a small majority of the treatment patients were above, as you would expect from a treatment that is better than control (if only a little). This comes from the table of aboves and belows in the `median_test` output.
- we can therefore halve the two-sided P-value to get a one-sided one:
0.2932 / 2
[1] 0.1466
This is still not small enough to be significant, so there is not strong enough evidence that the new treatment offers better pain relief than the standard one (control).
Extra 1: if you look at the table of aboves and belows, you see that they are close to 50-50 above and below the grand median, certainly not as unbalanced as you would expect to see if there really was a treatment effect. (A Welch two-sample \(t\)-test has a smaller but still non-significant P-value, and is not significant one-sided either.)
Extra 2: in my source for these data, it was believed that only some of the patients would respond positively to the treatment. This would be consistent with not very compelling evidence for a treatment effect (for the treatment group as a whole), and also with the treatment group having a larger spread. According to the boxplot, it does seem to be true that some of the patients in the treatment group had a better pain score than those in the control group, but not by any means everybody.
This would be a tough thing to test, because we don’t know which patients in the treatment group are the ones that responded positively to the treatment. We might guess that they are the ones with the highest `pain` scores (best pain relief), but there is no certainty of this, because a patient might happen to do very well with the new treatment just by chance. A second issue here is that if the new treatment works for only some patients, how do you tell whether a particular patient is going to respond well to it before they start on it?
Cuckoo eggs
The cuckoo is a European bird that lays its eggs in the nests of other birds (rather than building nests itself). The other bird, known as a “host”, raises and cares for the newly hatched cuckoo chick as if it was its own. Each cuckoo returns to the same territory year after year and lays its eggs in a nest of the same host species. Thus, cuckoos are actually several sub-species, each with a different host bird that it lays its eggs in the nests of. In a study, 120 cuckoo eggs were found in the nests of six other bird species: hedge sparrow, meadow pipit, pied wagtail, robin, tree pipit, and wren. These are birds of different sizes, so researchers were interested in whether the cuckoo eggs laid in the nests of different host birds differed in size as well. (For example, wrens are small birds, so you might be interested in whether cuckoo eggs laid in wren nests are smaller than cuckoo eggs laid in the nests of other birds. If this is the case, the cuckoo eggs will look less different from the wren eggs in the same nest.)
The data are in http://ritsokiguess.site/datafiles/cuckoo.txt.
- Read in and display (some of) the data. Note that some of the host bird names are misspelled. (You do not need to correct the misspellings.)
Solution
The values are separated by single spaces,[10] so:
my_url <- "http://ritsokiguess.site/datafiles/cuckoo.txt"
eggs <- read_delim(my_url, " ")
Rows: 120 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
chr (1): bird_species
dbl (1): egg_length
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
eggs
There are indeed 120 eggs, with lengths (evidently measured in millimetres), and with the two “pipit” species misspelled as “Pipet”. The column called `bird_species` is the host bird; the eggs themselves are all cuckoo eggs.
\(\blacksquare\)
- Bearing in mind that we will be interested in running some kind of ANOVA shortly, explain briefly why a normal quantile plot, for each host species separately, will be useful.
Solution
The major assumption of ANOVA is that the observations within each group are (approximately) normal in shape. Since we are specifically interested in normality, rather than shape generally, a normal quantile plot will be better than a boxplot.
\(\blacksquare\)
- Draw a suitable normal quantile plot. Based on what you see, what would you recommend as a suitable test to compare the egg lengths in the nests of the different host species? Explain briefly.
Solution
Do the normal quantile plot facetted by groups (`bird_species`), since we are interested in normality within each group, not of all the observations taken together:
ggplot(eggs, aes(sample = egg_length)) + stat_qq() +
stat_qq_line() + facet_wrap(~ bird_species)
We are looking for all the distributions to be roughly normal. Commentary (which you don’t need to write but you certainly need to think):
- the Hedge Sparrow distribution has two or three mild outliers at the bottom. (This, to me, is better than calling it “long tails” because the high observation is (i) only one and (ii) not really too high.)
- the Meadow Pipit has long tails (I say this because the pattern seems to be a feature of the whole distribution, rather than of a few observations. Having said that, those four lowest values do seem to be noticeably lower than the rest, so you can sell the idea of those four lowest values being outliers).
- the Pied Wagtail is if anything short-tailed, so this is not a problem. You could also call this acceptably normal.
- I cannot see any problems at all with the Robin or the Wren.
- The Tree Pipit distribution is, as you see it, either acceptably normal, or very slightly skewed to the left (is that a curved shape?)
The second part of the thought process is sample size, because the larger the sample size within a group, the less picky you have to be about normality. You can guess this by looking at how many dots there appear to be on the normal quantile plots, or you can get the exact sample sizes (better):
eggs %>% count(bird_species)
So I would actually say that the Hedge Sparrow is the one that might be a problem. The (much) larger sample size of the Meadow Pipits ought to be large enough to take care of the long tails, but it is not clear whether a sample size of 14 Hedge Sparrows is large enough to take care of the low-end outliers.
So, in terms of what you actually need to write:
- find one or more distributions whose normality you are concerned about (e.g., Hedge Sparrow and Meadow Pipit)
- think about whether the sample sizes for the ones you are concerned about are small enough that the non-normality you found is still a problem. I would say that the large sample of Meadow Pipits (45) is large enough that I don’t need to worry: the long tails there are not extreme. But I am less sure about the Hedge Sparrows: the small sample size (14) might not be enough to accommodate the low-end outliers.
- if there are any distributions that you are concerned about, express that you don’t think an ANOVA will be appropriate (and that a Mood’s median test will be better).
If you were happy enough with the normality, the best thing is then to think about whether the spreads are equal. You can do this by calculating SDs:
eggs %>%
  group_by(bird_species) %>%
  summarize(length_sd = sd(egg_length))
The SDs for the robins and wrens are the smallest of these, but it’s up to you whether you think they are enough smaller than the others to be worth worrying about. I’m inclined to think not, but if you think they are, then you would recommend a Welch ANOVA. A couple of other options for assessing spread are:
- to draw boxplots, purely for the purpose of assessing spread (because we used the normal quantile plots for the stuff about normality)
- to use the normal quantile plots to assess spread.
Boxplots look like this:
ggplot(eggs, aes(x = bird_species, y = egg_length)) + geom_boxplot()
The pied wagtail box looks noticeably taller than the others, so the spreads do not all appear to be the same.
To use the normal quantile plot, assess the slope of each line in each plot. Do the slopes look about the same? I think they more or less do, though you could say that the line on the Pied Wagtail plot is steeper than the others.
In terms of a recommendation:
- if you thought the distributions were not normal enough, recommend Mood’s median test.
- if you thought that normality was OK, but equal spreads was not, recommend the Welch ANOVA (sketched just below this list).
- if you thought that both normality and equal spreads were good enough, recommend a regular ANOVA.
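If you ended up recommending the Welch ANOVA, here is a sketch of how it would go (base R’s `oneway.test` does the Welch version by default, without assuming equal spreads):
oneway.test(egg_length ~ bird_species, data = eggs)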
\(\blacksquare\)
- Run an (ordinary) analysis of variance, including any follow-up if warranted. What do you conclude, in the context of the data? (Run this analysis even if you don’t think it’s the best thing to do.)
Solution
I wanted to get you some practice at doing this, hence my last sentence:
eggs.1 <- aov(egg_length ~ bird_species, data = eggs)
summary(eggs.1)
Df Sum Sq Mean Sq F value Pr(>F)
bird_species 5 42.94 8.588 10.39 3.15e-08 ***
Residuals 114 94.25 0.827
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This P-value of \(3 \times 10^{-8}\) is much smaller than 0.05, so the mean cuckoo egg lengths are definitely not all the same for each host species (or, if you like, cuckoo egg length depends somehow on host species).
This is enough to say for now. To find out which host species differ from which in terms of mean cuckoo egg length, we need to fire up Tukey:
TukeyHSD(eggs.1)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = egg_length ~ bird_species, data = eggs)
$bird_species
diff lwr upr p adj
MeadowPipet-HedgeSparrow -0.82253968 -1.629133605 -0.01594576 0.0428621
PiedWagtail-HedgeSparrow -0.21809524 -1.197559436 0.76136896 0.9872190
Robin-HedgeSparrow -0.54642857 -1.511003196 0.41814605 0.5726153
TreePipet-HedgeSparrow -0.03142857 -1.010892769 0.94803563 0.9999990
Wren-HedgeSparrow -1.99142857 -2.970892769 -1.01196437 0.0000006
PiedWagtail-MeadowPipet 0.60444444 -0.181375330 1.39026422 0.2324603
Robin-MeadowPipet 0.27611111 -0.491069969 1.04329219 0.9021876
TreePipet-MeadowPipet 0.79111111 0.005291337 1.57693089 0.0474619
Wren-MeadowPipet -1.16888889 -1.954708663 -0.38306911 0.0004861
Robin-PiedWagtail -0.32833333 -1.275604766 0.61893810 0.9155004
TreePipet-PiedWagtail 0.18666667 -0.775762072 1.14909541 0.9932186
Wren-PiedWagtail -1.77333333 -2.735762072 -0.81090459 0.0000070
TreePipet-Robin 0.51500000 -0.432271433 1.46227143 0.6159630
Wren-Robin -1.44500000 -2.392271433 -0.49772857 0.0003183
Wren-TreePipet -1.96000000 -2.922428738 -0.99757126 0.0000006
This is hard to make sense of because there are 6 groups and 15 comparisons. Seven of the comparisons are significant. The five very small P-values are all Wren and something else; if you look at those carefully, the eggs in the Wren nests are all smaller on average. The other two P-values are only just less than 0.05; in these two cases, the eggs in the Meadow Pipit nests are less long on average than those in the Hedge Sparrow or Tree Pipit nests.
Try to find a way to summarize the seven significant differences in a way that is easy to read and understand (thinking of your reader, again). Simply listing the significant ones doesn’t offer any insight about what they have in common.
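One possible starting point, as a sketch: turn the Tukey output into a dataframe and keep only the significant comparisons, sorted by P-value (`rownames_to_column` is from the tibble package, part of the tidyverse):
TukeyHSD(eggs.1)$bird_species %>%
  as.data.frame() %>%
  rownames_to_column("comparison") %>%   # the comparison names are row names
  filter(`p adj` < 0.05) %>%
  arrange(`p adj`)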
\(\blacksquare\)
- Run a Mood’s median test, and, if appropriate, follow-up tests. What do you now conclude, in the context of the data?
Solution
Make sure you have this somewhere:
library(smmr)
and then:
median_test(eggs, egg_length, bird_species)
$grand_median
[1] 22.35
$table
above
group above below
HedgeSparrow 11 3
MeadowPipet 17 28
PiedWagtail 10 5
Robin 10 6
TreePipet 12 3
Wren 0 15
$test
what value
1 statistic 3.032698e+01
2 df 5.000000e+00
3 P-value 1.271619e-05
This is also strongly significant, and indicates that the median cuckoo egg lengths in the nests of the six different species are not all the same. (Or that one or more of the median egg lengths is different, etc etc.) So, to find out which ones are different according to this procedure, fire up pairwise median tests:
pairwise_median_test(eggs, egg_length, bird_species)
There are fifteen of these (page down to see the other five). This actually is a dataframe, so you can sort the (adjusted) P-values into order without too much trouble. Or you can use `filter` to show only the ones that are less than 0.05. I like the sorting idea better, because then you can see whether there are any others whose P-value is close to 0.05, or, as in this case, confirm that there are not:
pairwise_median_test(eggs, egg_length, bird_species) %>%
arrange(adj_p_value)
There are only four significant differences here: Wren with everything except Pied Wagtail. If you look at the table of aboves and belows in the output from Mood’s median test, this is evidently because the Wren eggs are shorter than the others. (Presumably Wren vs. Pied Wagtail is not unbalanced enough to be significant.)
\(\blacksquare\)
- Compare all your significant results from the previous two parts. Are the results substantially different? Explain briefly.
Solution
For the ANOVA and the Mood’s median test themselves, both P-values are very small (the one for the ANOVA is smaller), so there is no substantial difference there.
For the followup comparisons, the only ones that differ in significance between the two tests are Wren vs Pied Wagtail, and Meadow Pipit vs Hedge Sparrow and Tree Pipit. These are not significant in the pairwise median tests, but were in Tukey. (In addition, the last two comparisons were only just significant in the Tukey.) Given this, I would say that there is not a substantial difference between the results from the two procedures.
\(\blacksquare\)
Cars in 1993
The dataset in http://ritsokiguess.site/datafiles/Cars93.csv contains a lot of information about 93 different vehicles that were available in 1993. Each vehicle is classified as one of six `Type`s, and for each vehicle, the gas mileage in miles per US gallon (of gas consumed) in city driving is recorded in `MPG.city`. Our aim in this question is to compare the city gas mileage of vehicles of different types, assuming that the ones in our dataset are (something like) a random sample of all vehicles available in North America in 1993.
- Read in and display (some of) the data.
To exactly no-one’s surprise:
<- "http://ritsokiguess.site/datafiles/Cars93.csv"
my_url <- read_csv(my_url) Cars93
Rows: 93 Columns: 27
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): Manufacturer, Model, Type, AirBags, DriveTrain, Cylinders, Man.tra...
dbl (18): Min.Price, Price, Max.Price, MPG.city, MPG.highway, EngineSize, Ho...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Cars93
- Make a graph that will enable you to compare the distributions of city gas mileage for vehicles of different types.
Gas consumption is quantitative and vehicle type is categorical, so a boxplot:
ggplot(Cars93, aes(x = Type, y = MPG.city)) + geom_boxplot()
- What can you say about the spread of the distribution of gas mileage for vehicle types where the median is large (compared to where it is small)?
The gas mileages for the types of vehicle where gas mileage is large are more spread out (have a larger IQR) than the ones where the gas mileage is small. For example, small cars have both the biggest median and the most spread-out distribution (tallest box), while the three groups of vehicles with the smallest medians (large and midsize cars and vans) also have the least spread-out distributions.
Extra: we can work out SD and mean for each vehicle type and then plot those against each other. The first step is a group-by and summarize as we learned it early on in the course:
Cars93 %>%
  group_by(Type) %>%
  summarize(mean_mpg = mean(MPG.city), sd_mpg = sd(MPG.city))
and then the second step is to realize that this is itself a dataframe, so we can make a plot from it: in particular, we can plot the group means and SDs against each other. This is done by piping the above dataframe into `ggplot`, so that in practice you would do it in one pipeline, as shown below. The standard way to assess for any relationship between group means and SDs is to see how the SD depends on the mean (if at all), so the group SDs will be the \(y\) coordinate of the plot. I have added one extra refinement: labels to show which vehicle type is which on the plot:
Cars93 %>%
  group_by(Type) %>%
  summarize(mean_mpg = mean(MPG.city), sd_mpg = sd(MPG.city)) %>%
  ggplot(aes(x = mean_mpg, y = sd_mpg, label = Type)) +
  geom_point() + geom_text_repel()
This shows clearly that the vehicle types with larger mean gas mileages also have larger SDs of mileages (and indeed that the SD goes up with the mean in a more or less linear way).
I got the labels by using `geom_text_repel` from the package ggrepel (which you will have to install first if you want to try this). This puts the label next to the point it belongs to, so that you can see which point is which.[11]
Extra 2: This is a problem for the ANOVA that we want to do to compare the gas mileages, because that would assume that the groups all have the same spread. One way around that would be to use a Welch ANOVA, but the normality is also at least somewhat questionable. You could use Mood’s median test, but I wanted this to be a problem that used `aov`, and the point of this Extra is to show you how I got it to this point.
When there is a relationship between group centre (mean, median) and spread (SD, IQR), particularly if one being larger goes with the other one being larger as well, things can often be helped by a transformation of the response variable. That is, instead of using gas mileage itself, we use some function of gas mileage like its square root or its logarithm. This has the effect of making larger values closer together (and smaller ones more spread out), and, with luck, will produce groups with more equal spread.
In the next part, I have you work with the reciprocal of gas mileage (which, as you will see, has a practical meaning for this problem), but I present it there as a fait accompli, and I wanted you to see that the idea actually came from somewhere. The technique I’m using is called the “Box-Cox transformation”, which we’ll see for real when we study regression, but the idea also applies in ANOVA. The `boxcox` function lives in a package called MASS (which you will also have to install and load first).[12]
To run `boxcox`, you feed it a model formula and a dataframe, just as you would do with `aov`:
boxcox(MPG.city ~ Type, data = Cars93)
The output from `boxcox` is a graph. The symbol below the \(x\)-axis is the Greek letter “lambda”, and the \(x\)-axis tells you how to transform the response variable. The value of \(\lambda\) tells you what power to raise gas mileage to, in order to best even out the spread across the groups. A value of 2 would mean to square it, a value of 0.5 would mean to square-root it, and so on. Raising a number to the power zero always gives you 1 (not very helpful), so \(\lambda = 0\) has the special meaning of “take logs”.
What you do with the plot is to look where it reaches its maximum, in this case around \(-1.3\). The middle vertical dotted line guides you as to what value of \(\lambda\) gives the maximum value for the curve. The two outer vertical dotted lines show a 95% confidence interval for \(\lambda\); anything inside the interval is supported by the data. In this case, \(\lambda\) could be anything between about \(-2\) and \(-0.6\). When choosing a transformation, you want a “round number” like \(2, 1, 0.5, 0, -0.5, -1, -2\) that is inside the interval. In this case, that would be \(-1\). Raising to the power \(-1\) means taking the reciprocal, that is, “one over” the gas mileage. We will see in the next part that this is in fact a perfectly sensible measure of gas consumption, and so we are in the happy position of being able to use a transformation that makes sense.
If \(\lambda = 1\) were in your confidence interval (it is not here), this would be telling you that raising to the power 1, i.e., doing nothing, would be supported by the data. In that case, it would be saying that doing any kind of transformation is not helpful.
On our graph, the peak and the confidence interval are scrunched up over on the left. `boxcox` tries to make a sensible decision about what values of \(\lambda\) to use, but if you don’t like what it gives you, you can provide your own, like this:
boxcox(MPG.city ~ Type, data = Cars93, lambda = seq(-4, 2, 0.1))
`seq` means “go from the first number to the second number in steps of the third number”. It’s like the full version of Python’s `range` with three inputs, or the “fill column” idea in Excel. Usually this idea is used to zoom the graph out, so that you can see more of it.
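A tiny illustration of `seq`, unrelated to the data:
seq(0, 1, 0.25)   # 0.00 0.25 0.50 0.75 1.00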
- It is suggested that we should use a “reciprocal transformation” of gas mileage: that is, instead of using gas mileage itself in our analysis, we should use one divided by gas mileage instead. Explain briefly how this is also a sensible measure of gas consumption. (Hint: think about the units.)
The units of `MPG.city` in the data are miles per US gallon. To get the units of the reciprocal of this, we flip the units upside down: that is, the units of the reciprocal of `MPG.city` are US gallons per mile. This is also a sensible way of measuring how much gas a vehicle uses; indeed, you might even say it is a more sensible measure, because it is literally how much gas the vehicle consumes to travel a fixed distance, and a lower number means (sensibly) that the vehicle consumes less gas.
This is in fact equivalent to the Canadian measure of litres per 100 km, and differs from it by only a fixed multiple (because the number of miles in a kilometre and the number of gallons in a litre are both constants). The choice of 100 km is only to produce a human-sized number (litres per km would be a number less than 1).
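To see the connection, a sketch of the conversion (the function name `gpm_to_l100km` is mine; the constants are roughly 3.785 litres per US gallon and 1.609 km per mile):
gpm_to_l100km <- function(gpm) gpm * 3.785 / 1.609 * 100
gpm_to_l100km(1 / 25)   # a 25-MPG car uses about 9.4 litres per 100 km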
- Create and save a new column in your dataframe that contains the reciprocal of the gas mileage. Re-draw your plot from earlier to use your new column instead of the original `MPG.city`.
I’m going to call my new column gpm
because it literally is gallons per mile:
%>%
Cars93 mutate(gpm = 1/MPG.city) -> Cars93
I saved it back into the original dataframe.
Then:
ggplot(Cars93, aes(x = Type, y = gpm)) + geom_boxplot()
- Explain why it is now at least somewhat defensible to run an ordinary ANOVA.
The distributions of `gpm` for each type are at least approximately symmetric in shape (better than they were before), and the heights of the boxes are also more or less equal (again, better than they were before).
There will inevitably be some hand-waving here. Things are not perfect, but there are 93 observations altogether, with this many in each group:
Cars93 %>% count(Type)
so the Central Limit Theorem will help at least a little bit.
Extra: our problem before was that group SD went up with group mean. Is that still the case?
Cars93 %>%
  group_by(Type) %>%
  summarize(mean_gpm = mean(gpm), sd_gpm = sd(gpm)) %>%
  ggplot(aes(x = mean_gpm, y = sd_gpm, label = Type)) +
  geom_point() + geom_text_repel()
Not so much of a trend this time: knowing the group mean tells you very little about the group SD.
- Run a suitable analysis of variance, along with any appropriate followup. (You don’t need to interpret the followup yet.)
`aov` first:
Cars93.1 <- aov(gpm ~ Type, data = Cars93)
summary(Cars93.1)
Df Sum Sq Mean Sq F value Pr(>F)
Type 5 0.005767 0.0011535 35.99 <2e-16 ***
Residuals 87 0.002788 0.0000321
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There are definitely some differences among the groups, so run Tukey to find out where they are:
TukeyHSD(Cars93.1)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = gpm ~ Type, data = Cars93)
$Type
diff lwr upr p adj
Large-Compact 0.010424951 0.0039630197 0.0168868833 0.0001373
Midsize-Compact 0.007251611 0.0018308870 0.0126723349 0.0025418
Small-Compact -0.009711289 -0.0151860865 -0.0042364919 0.0000215
Sporty-Compact 0.002876609 -0.0031611172 0.0089143359 0.7338016
Van-Compact 0.014736297 0.0078620427 0.0216105509 0.0000002
Midsize-Large -0.003173341 -0.0092657082 0.0029190272 0.6537776
Small-Large -0.020136241 -0.0262767702 -0.0139957112 0.0000000
Sporty-Large -0.007548342 -0.0141956602 -0.0009010239 0.0166082
Van-Large 0.004311345 -0.0031040485 0.0117267392 0.5392304
Small-Midsize -0.016962900 -0.0219961683 -0.0119296321 0.0000000
Sporty-Midsize -0.004375002 -0.0100154381 0.0012654349 0.2216964
Van-Midsize 0.007484686 0.0009566117 0.0140127599 0.0150996
Sporty-Small 0.012587899 0.0068954754 0.0182803218 0.0000001
Van-Small 0.024447586 0.0178745419 0.0310206301 0.0000000
Van-Sporty 0.011859687 0.0048108829 0.0189084919 0.0000625
This is rather a lot to interpret, so I’m not asking you to do so.
- Is there any type of vehicle that has significantly better or significantly worse fuel consumption than all of the other vehicle types?
To give yourself a starting point: this would have to be the very best or the very worst `Type`. To find out which that is, look again at the boxplot you just drew, with what I called `gpm`:
ggplot(Cars93, aes(x = Type, y = gpm)) + geom_boxplot()
Small cars are the best (they consume the least gas), and vans are the worst (they consume the most). There seems to be something of a gap between the small cars and everything else, so scan through all the Small entries in your Tukey output and see that they are all significant. Hence the small cars have significantly better gas consumption than all the other types of vehicle. (This, you will realize, is one of the selling points of a small car.) For completeness, check the Vans and find that `Van` and `Large` do not differ significantly, so vans are not significantly worse than all the other vehicle types.
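A sketch of doing that scan in code rather than by eye, in the same spirit as the earlier cuckoo sketch: keep only the Tukey comparisons that mention Small (the same idea works for Van); `str_detect` is from stringr:
TukeyHSD(Cars93.1)$Type %>%
  as.data.frame() %>%
  rownames_to_column("comparison") %>%
  filter(str_detect(comparison, "Small"))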
Extra 1: when I was your age, we used to interpret Tukey (calculated, of course, by hand) as below. First, list the means in order:[13]
Cars93 %>%
  group_by(Type) %>%
  summarize(mpg_mean = mean(MPG.city)) %>%
  arrange(mpg_mean)
The old-fashioned way is to write out the group names and means in order, and then to draw a line under any groups that are not significantly different. Start with the worst one, Van, and see what it is not significantly different from (just Large). Then start with the next one, Large, and see that it is not significantly different from Midsize. And so on. The kind of “ladder” thing with the lines is quite common (and a pain to interpret). But there are no lines at all under Small, so Small is significantly different from all the others.
Van Large Midsize Sporty Compact Small
17.00 18.36 19.55 21.79 22.69 29.86
-------------
--------------
-----------------
----------------
- Run Mood’s median test on the original data (no assumption of normality or equal spreads needed), along with any appropriate followup. Compare the results with your ANOVA in the previous two questions.
median_test(Cars93, MPG.city, Type)
$grand_median
[1] 21
$table
above
group above below
Compact 11 3
Large 0 11
Midsize 4 14
Small 21 0
Sporty 8 6
Van 0 9
$test
what value
1 statistic 5.140800e+01
2 df 5.000000e+00
3 P-value 7.134267e-10
There are definitely some differences in median gas mileage between the types of cars (P-value \(7.1 \times 10^{-10}\)). To see what they are, the right followup is pairwise median tests:
pairwise_median_test(Cars93, MPG.city, Type) %>%
mutate(sig = adj_p_value <= 0.05)
Remember to use the adjusted P-value to assess significance here (because we are doing multiple tests all at once).
I created an extra column to indicate whether each comparison was significant or not (you don’t need to do this). This is because I wanted to do the lines thing again:
Van Large Midsize Sporty Compact Small
17.00 18.36 19.55 21.79 22.69 29.86
-------------
--------------------------
----------------
Almost the same but not quite: Sporty and Large are no longer significantly different, but all the other significant differences are the same.
Extra: the usual take in this sort of situation is that finding a good transformation (our gallons per mile) and then doing `aov` is a better idea than going straight to Mood’s median test, because `aov` will be more powerful[14] than Mood’s median test, if you can find an appropriate way to do it. This one lent itself to a transformation because the `Type`s with bigger means also had bigger spreads. If things are less clear than this, you might not be able to find a good transformation, and in that case, Mood’s median test is the way to go.
Footnotes
[1] If we had 40 independent miles-per-gallon measurements, there would be no reason to have the `Fill-up` column, because there would be no basis to link a Computer value with a particular Driver one.
[2] Actually, it works, but not the way we wanted it to: there is no error here, just not the kind of data we wanted.
[3] It may look like two characters, but the backslash is called an “escape character” and doesn’t count.
[4] I seem to be attributing sentience to the Computer here. Maybe `recorded_by` would have been a better name.
[5] To be precise linguistically, there are “differences between” two things, as in matched pairs, and “differences among” three or more things, as in ANOVA.
[6] This is very similar to the thing with the P-values in Extra 1; the sign test doesn’t use the data as efficiently, so its interval is a little wider.
[7] This is the line \(y = x\), which has intercept 0 and slope 1.
[8] Studies of this kind are usually designed to have the same number of patients in each group. The patients in the study will have had to sign a consent form that explains any risks involved in taking part in the study, along with a statement that they can withdraw from the study at any time. What can happen is that the groups start off the same size, but a few people drop out, and so the group sizes end up slightly different from each other. My guess is that this is what happened here.
[9] The log of a number less than 1 is negative, which means that a `pain` value that is negative comes from a baseline score that is less than the final score. The idea is supposed to be less pain after the treatment than before, so a negative `pain` value indicates that the (standard or new) treatment is actually making the pain worse! Fortunately, there are not very many negative `pain` values.
[10] The `.txt` file extension clues you in to the idea that this is something that is not a `.csv` and thus will need some thought.
[11] There is also a `geom_text` that you can use without installing any extra packages, but that puts the text on top of the point, which is not the effect we are after here.
[12] There is an annoyance here in that MASS, like the tidyverse, also has a function called `select`, but the MASS `select` does something different, and things get less confused if you load MASS first, before the tidyverse. If you do that, you can continue to use `select` and have it do what you would expect.
[13] These, I realize, are the mean MPG, compared using the analysis for `gpm`. Oops.
[14] As ever, making better use of the data.