library(tidyverse)
Worksheet 7
Questions are below. My solutions will be available after the tutorials are all finished. The whole point of these worksheets is for you to use your lecture notes to figure out what to do. In tutorial, the TAs are available to guide you if you get stuck. Once you have figured out how to do this worksheet, you will be prepared to tackle the assignment that depends on it.
If you are not able to finish in an hour, I encourage you to continue later with what you were unable to finish in tutorial.
Note: questions 11 through 25 are about analysis of variance. Some of these questions may make more sense after Thursday’s lecture. You might like to leave those until then. I will publish all my solutions on Wednesday evening as usual. Be guided by Thursday’s lecture as to how much of this you will need to know for the midterm. This makes for a rather heavyweight worksheet, but I want to make sure you get some practice at everything you might need for the midterm. (This same material is on Assignment 6, which will not be due until after the midterm.)
Fuel efficiency comparison
Some cars have on-board computers that calculate quantities related to the car’s performance. One of the things measured is the fuel efficiency, that is, how much gasoline the car uses. On an American car, this is measured in miles per (US) gallon. On one type of vehicle equipped with such a computer, the fuel efficiency was measured each time the gas tank was filled up, and the computer was then reset. Twenty observations were made, and are in http://ritsokiguess.site/datafiles/mpgcomparison.txt. The computer’s values are in the column `Computer`. The driver also calculated the fuel efficiency by hand, by noting the number of miles driven between fill-ups, and the number of gallons of gas required to fill the tank each time. The driver’s values are in `Driver`. The final column `Diff` is the difference between the computer’s value and the driver’s value for each fill-up. The data values are separated by tabs.
- Read in and display (some of) the data.
Solution
This is like the athletes data, so `read_tsv`:
my_url <- "http://ritsokiguess.site/datafiles/mpgcomparison.txt"
fuel <- read_tsv(my_url)
Rows: 20 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
dbl (4): Fill-up, Computer, Driver, Diff
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fuel
20 observations, with columns as promised. The first column `Fill-up` labels the fill-ups from 1 to 20, in case we needed to identify them. (This is an additional clue that we have paired data, or repeated measures in general.[1])
Extra: another way to read the data, and why it works: since the “tab” is a delimiter, distinguishing one data value from the next, you might be wondering whether you could use `read_delim` somehow. It turns out that you can, but you have to figure out how to represent “tab” as something that `read_delim` can understand. One way is to search for it, of course, but another way is to see what happens when you use `read_delim` the way you already know (assuming the data values are separated by spaces):
<- "http://ritsokiguess.site/datafiles/mpgcomparison.txt"
my_url <- read_delim(my_url, " ") fuel0
Rows: 20 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
chr (1): Fill-up Computer Driver Diff
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fuel0
This fails, but how it fails is instructive.[2] There is now one column, amusingly called `Fill-up Computer Driver Diff`, whose values are text: the four numbers glued together with `\t` in between. This is how R recognizes a tab character.[3] So you can try `read_delim` using this as the second input rather than a space.
You could get to this point with less insight by saying to yourself “this thing `\t`, whatever it is, is what separates the data values, so how about I try this as the second input to `read_delim`?”
Here’s how it goes:
<- "http://ritsokiguess.site/datafiles/mpgcomparison.txt"
my_url <- read_delim(my_url, "\t") fuel0
Rows: 20 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
dbl (4): Fill-up, Computer, Driver, Diff
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fuel0
and it works, with four numeric columns as promised.
Another option is to use `read_delim` without a second input:
my_url <- "http://ritsokiguess.site/datafiles/mpgcomparison.txt"
fuel0 <- read_delim(my_url)
Rows: 20 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
dbl (4): Fill-up, Computer, Driver, Diff
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fuel0
This works, but you need to understand what is going on, something beyond “well, it seems to work”. One way is to look at the message from `read_delim` that says what got read in:
── Column specification ────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
dbl (4): Fill-up, Computer, Driver, Diff
The four quantitative columns, but look above that, where it says `Delimiter: "\t"`. The delimiter is what separates each data value from the next one on the same line. The symbol `\t` means “tab”, which you could simply assert, but it is better to offer some sort of evidence that this is correct. To do that, search for something like “R \t”.
The first hit for me was actually not R-related, but was a Stack Overflow question (scroll down to the two answers) that has the answer we need. The symbol `\t` comes from Unix (and is therefore also in Linux and the C programming language; because R’s heritage is C, it is in R as well).
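If you want a quick piece of evidence from R itself, here is a one-line check (a sketch; any string containing `\t` would do):
writeLines("left\tright")  # prints "left" and "right" separated by an actual tab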
\(\blacksquare\)
- What is it that makes this paired data? Explain briefly.
Solution
There are two measurements for each fill-up: the computer’s calculation of gas mileage, and the driver’s. These go together because they are two calculations of what is supposed to be the same thing. (Another hint is that we have differences, with the veiled suggestion that they might be useful for something.)
If you want to have a general way of tackling this kind of problem, ask yourself “is there more than one observation of the same thing per individual?” Then go on and talk about what the individuals and observations are for the data you’re looking at. In this case, the individuals are fill-ups, and there are two observations per fill-up: one by the driver and one by the computer. Compare this with the children learning to read (from lecture): there were \(23+21 = 44\) children altogether (individuals), and each child had one reading score, based on the reading method they happened to be assigned to. So those are 44 independent observations, and a two-sample test is the way to go for them.
Extra 1: To go back to our gas mileages, here is another way you might have received the data:
This looks more like a two-sample layout, with one column of `gas_mileage` values and a second column indicating whether the Driver or Computer observed[4] them. But the first two rows were observed at the same fill-up, so we know that these are actually paired (by fill-up) and the layout of the data for our analysis is wrong. (Later in the course, we learn how to convert between the previous “wider” layout and this “longer” one. I didn’t show you the code here, because I didn’t want to confuse you at this point.)
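If you are curious anyway, here is a minimal sketch of the conversion using tidyr’s `pivot_longer`. The name `gas_mileage` matches the column mentioned above; `recorded_by` is borrowed from footnote 4, rather than being whatever the original display used:
fuel %>%
  pivot_longer(c(Computer, Driver),
               names_to = "recorded_by", values_to = "gas_mileage")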
Extra 2: if your individuals have more than two measurements each, you can’t shoehorn it into a one-sample model any more, because there are too many differences among[5] measurements now. These are repeated measures data, and the modern way to analyze these is using mixed models. The student GPA example in the link shows how it goes; the one in the link is repeated measures because we have six observations (of GPA) for each individual (student).
\(\blacksquare\)
- Draw a suitable graph of these data, bearing in mind what you might want to learn from your graph.
Solution
In a matched pairs situation, what matters is whether the differences have enough of a normal distribution. The separate distributions of the computer’s and driver’s results are of no importance. So make a graph of the differences. We are specifically interested in normality, so a normal quantile plot is best:
ggplot(fuel, aes(sample = Diff)) + stat_qq() + stat_qq_line()
The next best plot is a histogram of the differences:
ggplot(fuel, aes(x = Diff)) + geom_histogram(bins = 6)
This one actually looks left-skewed, or has an outlier at the bottom. The appearance may very well depend on the number of bins you choose.
Extra: the distributions of the computer’s and driver’s gas mileage measurements themselves are totally irrelevant to whether you would run a matched pairs test of any kind; the only thing that matters is the distribution of the differences.
The only other graph you could justify (and even then it is a bit of a stretch) would be a scatterplot of the two measurements at each fill-up:
ggplot(fuel, aes(x = Driver, y = Computer)) + geom_point() +
geom_abline(intercept = 0, slope = 1)
By adding the line \(y = x\) (which has intercept 0 and slope 1) to the graph, you can see how the pairs of measurements compare to each other; for most of them, the computer’s measurement is larger than the corresponding driver’s measurement (the point is above the line), and for only a few is it smaller. If you had decided to do a matched-pairs sign test, this graph would be interesting (with the line), because the sign test is based on how many points are above the line vs. how many are below. But what this graph does not do is tell you whether you need to do a matched-pairs sign test in the first place, because it does not tell you about the distribution of the differences (in particular, whether that is normal enough). Thus, the scatterplot, even with the line, is inferior to even the histogram of differences.
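If you wanted the counts behind that impression, a quick sketch:
fuel %>% count(Computer > Driver)   # how many points are above vs. below the line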
\(\blacksquare\)
- Is there any difference between the average results of the driver and the computer? (Average could be mean or median, whichever you think is best). Do an appropriate test.
Solution
The choices are a matched-pairs \(t\)-test, or a matched-pairs sign test (on the differences). To choose between those, look back at your graph. My take is that the only possible problem is the smallest (most negative) difference, but that is not very much smaller than expected. This is especially so given the sample size (20), which means that the Central Limit Theorem will help us enough to take care of the small outlier.
I think, therefore, that a paired \(t\)-test is the way to go, to test the null that the mean difference is zero (against the two-sided alternative that it is not zero, since we were looking for any difference). There are two ways you could do this: as literal matched pairs:
with(fuel, t.test(Computer, Driver, paired = TRUE))
Paired t-test
data: Computer and Driver
t = 4.358, df = 19, p-value = 0.0003386
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
1.418847 4.041153
sample estimates:
mean difference
2.73
or, as a one-sample test on the differences:
with(fuel, t.test(Diff, mu = 0))
One Sample t-test
data: Diff
t = 4.358, df = 19, p-value = 0.0003386
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
1.418847 4.041153
sample estimates:
mean of x
2.73
with the same result.
If you thought that low value was too much of an outlier, the right thing would have been a sign test that the median difference is zero (vs. not zero), thus (using the `smmr` package):
sign_test(fuel, Diff, 0)
$above_below
below above
3 17
$p_values
alternative p_value
1 lower 0.999798775
2 upper 0.001288414
3 two-sided 0.002576828
In any of those cases, we conclude that the average difference is not zero, since the P-values are less than 0.05. (The right one for the sign test is 0.0026.)
I don’t mind which test you do, as long as it is one of the ways of doing matched pairs, and you justify your choice by going back to your graph. Doing one of the tests without saying why is being a STAB22 or STAB57-level statistician (in fact, it may not even be that), not a professional-level applied statistician.
Extra 1: you could find out how far the average difference is from zero by looking at a confidence interval; the one for the mean difference is from 1.4 to 4.0. If you did a sign test, you would do this to get a confidence interval:
ci_median(fuel, Diff)
[1] 1.102148 4.500000
1.1 to 4.5, a little wider,[6] but otherwise very similar.
So we think the computer was estimating the gas mileage to be that much higher than the driver on average.
Extra 2: The P-value for the \(t\)-test was almost 10 times smaller than for the sign test. This is because the \(t\)-test was using the actual differences, not just whether they were greater or less than zero as the sign test did (17 out of the 20 differences were positive). If you thought, as I did, that it was all right to use the \(t\)-test, then the smaller P-value is trustworthy. (The usual thing, if two tests give similar results, as they do here, is that the one with the smaller P-value is better, but that is not always something to rely on. You still need to make an assessment.)
Extra 3: A bootstrapped sampling distribution of the sample mean would be a better way to decide how happy you are with the \(t\)-test. This will incorporate the sample size into your consideration. It is really the same thing as the one for the one-sample \(t\), bearing in mind that it’s the differences you need to work with:
tibble(sim = 1:10000) %>%
rowwise() %>%
mutate(my_sample = list(sample(fuel$Diff, replace = TRUE))) %>%
mutate(my_mean = mean(my_sample)) %>%
ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()
This is a tiny bit left-skewed still (the lowest values are very slightly too low, and the highest ones are slightly bunched up). The apparent low-end outlier in the distribution of differences barely even shows up in the sampling distribution. (If we had a smaller sample size, like 10 or especially 5, we would see it here more clearly, and then we would know that a sample size of 5 or 10 is not big enough.)
In other words, despite what you might think from your earlier plot, the matched pairs \(t\)-test would be very much defensible here.
\(\blacksquare\)
- The people who designed the car’s computer are interested in whether the values calculated by the computer and by the driver on the same fill-up are usually close together. Explain briefly why it is that looking at the average (mean or median) difference is not enough. Describe what you would look at in addition, and how that would help.
Solution
The mean (or median) difference can be close to zero without the individual differences being close to zero. The driver could be sometimes way higher and sometimes way lower, and the average difference could come out close to zero, even though each individual pair of measurements is not that close together. Thus looking only at the average difference is not enough.
There are (at least) two (related) things you might choose from to look at, in addition to the mean or median:
- the spread of the differences (the SD or inter-quartile range). If the average difference and the spread of differences are both close to zero, that would mean that the driver’s and computer’s values are consistently similar.
- a suitable confidence interval here (for the mean or median difference, as appropriate for what you did) would also get at this point.
To think about the spread: use the SD if you did the \(t\)-test for the mean, and use the IQR if you did the sign test for the median, that is, one of these:
fuel %>% summarize(diff_sd = sd(Diff))
or
fuel %>% summarize(diff_iqr = IQR(Diff))
as appropriate. Make a call about whether you think your measure of spread is small or large. If you think it’s small, you can say that the differences are consistent with each other, and therefore that the computer’s and driver’s values are typically different by about the same amount (we concluded that they are different). If you think your measure of spread is large, then the driver’s and computer’s values are inconsistently different from each other (which would be bad news for the people who designed the car’s computer, as long as you think the driver was being reasonably careful in their record-keeping).
If you said that the confidence interval for the mean/median was the thing to look at, then pull the interval out of the \(t\)-test output or run `ci_median`, as appropriate. The computer’s miles per gallon is consistently between 1.4 and 4.0 miles per gallon higher (mean) or between 1.1 and 4.5 miles per gallon higher (median). Make a call about whether you consider this interval short or long, bearing in mind it’s based on the difference between two numbers that are about 40, and then say that the computer’s measurement is larger than the driver’s by a consistent amount (if you think the interval is short) or by an amount that varies quite a bit (if you think the interval is long).
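A sketch of pulling those intervals out with code; `conf.int` is the component of R’s `t.test` output that holds the interval, and `ci_median` is from smmr:
with(fuel, t.test(Diff))$conf.int   # CI for the mean difference
ci_median(fuel, Diff)               # CI for the median difference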
Extra: in a situation like the one here, where there is a significant difference between the means, knowing also that the standard deviation is small would tell us that the computer is consistently giving a value that is bigger than the driver’s by about the same amount every time. Another way to think about this is to plot the driver’s and computer’s values against each other with a (solid) reference line[7] like this:
ggplot(fuel, aes(x = Driver, y = Computer)) + geom_point() +
geom_abline(intercept = 0, slope = 1) +
geom_abline(intercept = 3, slope = 1, linetype = "dashed")
The solid reference line is where a point would be if the driver and computer had the same value. Here, there seems to be some consistency in the points being up and to the left of, but parallel to, the reference line. This would be saying that the computer’s value would be fairly consistently more than the driver’s by about 3 miles per gallon (eyeballing the plot; the dashed line), with some exceptions.
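The intercept of 3 for the dashed line was eyeballed; if you wanted something less ad hoc, you could use the estimated mean difference from the \(t\)-test instead:
fuel %>% summarize(mean_diff = mean(Diff))   # 2.73, close to the eyeballed 3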
\(\blacksquare\)
Neuropathy and pain relief
The data in http://ritsokiguess.site/datafiles/exactRankTests_neuropathy.csv are the results of a study of pain relief in diabetic patients. The patients were randomly assigned to a standard treatment (“control”) or to a new treatment (“treat”), and for each patient a pain score was recorded, with a higher score indicating better pain relief. There were 58 patients altogether, 30 in the control group and 28 in the treatment group.
- Read the data and display at least some of it.
As ever:
my_url <- "http://ritsokiguess.site/datafiles/exactRankTests_neuropathy.csv"
neuropathy <- read_csv(my_url)
Rows: 58 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): group
dbl (1): pain
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
neuropathy
There are indeed 58 patients, in two groups `control` and `treat`.[8]
Extra: Precisely, a pain score was measured twice for each patient, once before the study started and once after four weeks, and the variable `pain` you see here is the (natural) log of the baseline score divided by the final score.[9] This is the researchers’ way of summarizing pain relief by one number, instead of requiring us to deal with matched pairs (in a two-sample context, which would be rather confusing).
- Draw a suitable graph of the two variables.
The obvious thing, given that we are not explicitly looking for normality (yet), but just getting a general sense of centre and spread and shape, is a boxplot:
ggplot(neuropathy, aes(x = group, y = pain)) + geom_boxplot()
Interpretation coming up.
- Does your plot suggest that there will be a significant treatment effect, or not? Explain briefly.
The treatment median is a bit higher than the control median, but the treatment scores (especially) are very variable, so I would guess that the smallish difference in medians is not going to be significant: the difference in medians is not very big compared to how much variability there is.
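To put numbers on that impression, a quick sketch of the group medians and IQRs:
neuropathy %>%
  group_by(group) %>%
  summarize(med_pain = median(pain), iqr_pain = IQR(pain))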
- The researchers decided to run a Mood’s median test to compare the pain scores for the treatment and control groups. Why do you think they decided to do that, on the evidence you have so far?
It looks as if neither distribution is very close to normal, because both groups have an extreme low outlier, and it also looks as if both groups have an upper outlier as well. (I would call these outliers rather than long tails because their values seem to be a long way away from the other observations, quite a bit lower or higher).
The other thing to think about is sample size. We have 30 observations in the control group and 28 in the treatment group; these are not small samples, but the outliers, especially the low one in each group, seem more extreme than you would expect samples of this size to be able to handle.
Extra: that was as far as I was expecting you to go, but I seem to have the bootstrap sampling distribution of the sample mean here, at least for the control group:
%>% filter(group == "control") -> controls
neuropathy controls
tibble(sim = 1:10000) %>%
rowwise() %>%
mutate(my_sample = list(sample(controls$pain, replace = TRUE))) %>%
mutate(my_mean = mean(my_sample)) %>%
ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()
This, for me, is surprisingly good. I guess a little copying and pasting will reveal whether the treatment group is as good:
%>% filter(group == "treat") -> treats
neuropathy treats
tibble(sim = 1:10000) %>%
rowwise() %>%
mutate(my_sample = list(sample(treats$pain, replace = TRUE))) %>%
mutate(my_mean = mean(my_sample)) %>%
ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()
Pretty much, just a bit of funny stuff at the upper end. So maybe a \(t\)-test would have been all right in actual fact.
- Run Mood’s median test on these data. What do you conclude?
Having run `library(smmr)` somewhere:
median_test(neuropathy, pain, group)
$grand_median
[1] 0.1635
$table
above
group above below
control 13 17
treat 16 12
$test
what value
1 statistic 1.1047619
2 df 1.0000000
3 P-value 0.2932234
With a (two-sided) P-value of 0.293, there is no evidence of any difference in median pain scores between the treatment and control groups. That is, there is no evidence of any treatment effect on the pain score (positive or negative).
I didn’t say anything about one-sided or two-sided, which means that you can make your own decision about that. You can take the attitude that nowhere here does it say anything about looking for evidence in a particular direction, which favours a two-sided test. Or you can say that a higher score indicates better pain relief, and a new treatment is supposed to be better (as with the kids learning to read from lecture), so we should do a one-sided test. As ever, have a reason for doing what you did.
To do this test one-sided, there is a bit more work to do:
- are we on the correct side? Yes, because a small majority of the control patients were below the grand median and a small majority of the treatment patients were above, as you would expect from a treatment that is better than control (if only a little). This comes from the table of aboves and belows in the `median_test` output.
- we can therefore halve the two-sided P-value to get a one-sided one:
0.2932 / 2
[1] 0.1466
This is still not small enough to be significant, so there is not strong enough evidence that the new treatment offers better pain relief than the standard one (control).
Extra 1: if you look at the table of aboves and belows, you see that they are close to 50-50 above and below the grand median, certainly not as unbalanced as you would expect to see if there really was a treatment effect. (A Welch two-sample \(t\)-test has a smaller but still non-significant P-value, and is not significant one-sided either.)
Extra 2: in my source for these data, it was believed that only some of the patients would respond positively to the treatment. This would be consistent with not very compelling evidence for a treatment effect (for the treatment group as a whole), and also with the treatment group having a larger spread. According to the boxplot, it does seem to be true that some of the patients in the treatment group had a better pain score than those in the control group, but not by any means everybody.
This would be a tough thing to test, because we don’t know which patients in the treatment group are the ones that responded positively to the treatment. We might guess that they are the ones with the highest `pain` scores (best pain relief), but there is no certainty of this, because a patient might happen to do very well with the new treatment just by chance. A second issue here is that if the new treatment works for only some patients, how do you tell whether a particular patient is going to respond well to it before they start on it?
Cuckoo eggs
The cuckoo is a European bird that lays its eggs in the nests of other birds (rather than building nests itself). The other bird, known as a “host”, raises and cares for the newly hatched cuckoo chick as if it was its own. Each cuckoo returns to the same territory year after year and lays its eggs in a nest of the same host species. Thus, cuckoos are actually several sub-species, each with a different host bird that it lays its eggs in the nests of. In a study, 120 cuckoo eggs were found in the nests of six other bird species: hedge sparrow, meadow pipit, pied wagtail, robin, tree pipit, and wren. These are birds of different sizes, so researchers were interested in whether the cuckoo eggs laid in the nests of different host birds differed in size as well. (For example, wrens are small birds, so you might be interested in whether cuckoo eggs laid in wren nests are smaller than cuckoo eggs laid in the nests of other birds. If this is the case, the cuckoo eggs will look less different from the wren eggs in the same nest.)
The data are in http://ritsokiguess.site/datafiles/cuckoo.txt.
- Read in and display (some of) the data. Note that some of the host bird names are misspelled. (You do not need to correct the misspellings.)
Solution
The values are separated by single spaces,[10] so:
my_url <- "http://ritsokiguess.site/datafiles/cuckoo.txt"
eggs <- read_delim(my_url, " ")
Rows: 120 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
chr (1): bird_species
dbl (1): egg_length
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
eggs
There are indeed 120 eggs, with lengths (evidently measured in millimetres), and with the two “pipit” species misspelled as “Pipet”. The column called `bird_species` is the host bird; the eggs themselves are all cuckoo eggs.
\(\blacksquare\)
- Bearing in mind that we will be interested in running some kind of ANOVA shortly, explain briefly why a normal quantile plot, for each host species separately, will be useful.
Solution
The major assumption of ANOVA is that the observations within each group are (approximately) normal in shape. Since we are specifically interested in normality, rather than shape generally, a normal quantile plot will be better than a boxplot.
\(\blacksquare\)
- Draw a suitable normal quantile plot. Based on what you see, what would you recommend as a suitable test to compare the egg lengths in the nests of the different host species? Explain briefly.
Solution
Do the normal quantile plot facetted by groups (`bird_species`), since we are interested in normality within each group, not of all the observations taken together:
ggplot(eggs, aes(sample = egg_length)) + stat_qq() +
stat_qq_line() + facet_wrap(~ bird_species)
We are looking for all the distributions to be roughly normal. Commentary (which you don’t need to write but you certainly need to think):
- the Hedge Sparrow distribution has two or three mild outliers at the bottom. (This, to me, is better than calling it “long tails” because the high observation is (i) only one and (ii) not really too high.)
- the Meadow Pipit has long tails (I say this because the pattern seems to be a feature of the whole distribution, rather than of a few observations. Having said that, those four lowest values do seem to be noticeably lower than the rest, so you can sell the idea of those four lowest values being outliers).
- the Pied Wagtail is if anything short-tailed, so this is not a problem. You could also call this acceptably normal.
- I cannot see any problems at all with the Robin or the Wren.
- The Tree Pipit distribution is, as you see it, either acceptably normal, or very slightly skewed to the left (is that a curved shape?)
The second part of the thought process is sample size, because the larger the sample size within a group, the less picky you have to be about normality. You can guess this by looking at how many dots there appear to be on the normal quantile plots, or you can get the exact sample sizes (better):
eggs %>% count(bird_species)
So I would actually say that the Hedge Sparrow is the one that might be a problem. The (much) larger sample size of the Meadow Pipits ought to be large enough to take care of the long tails, but it is not clear whether a sample size of 14 Hedge Sparrows is large enough to take care of the low-end outliers.
So, in terms of what you actually need to write:
- find one or more distributions whose normality you are concerned about (e.g., Hedge Sparrow and Meadow Pipit)
- think about whether the sample sizes for the ones you are concerned about are small enough that the non-normality you found is still a problem. I would say that the large sample of Meadow Pipits (45) is large enough that I don’t need to worry: the long tails there are not extreme. But I am less sure about the Hedge Sparrows: the small sample size (14) might not be enough to accommodate the low-end outliers.
- if there are any distributions that you are concerned about, express that you don’t think an ANOVA will be appropriate (and that a Mood’s median test will be better).
If you were happy enough with the normality, the best thing is then to think about whether the spreads are equal. You can do this by calculating SDs:
eggs %>%
  group_by(bird_species) %>%
  summarize(length_sd = sd(egg_length))
The SDs for the robins and wrens are the smallest of these, but it’s up to you whether you think they are enough smaller than the others to be worth worrying about. I’m inclined to think not, but if you think they are, then you would recommend a Welch ANOVA. A couple of other options for assessing spread are:
- to draw boxplots, purely for the purpose of assessing spread (because we used the normal quantile plots for the stuff about normality)
- to use the normal quantile plots to assess spread.
Boxplots look like this:
ggplot(eggs, aes(x = bird_species, y = egg_length)) + geom_boxplot()
The pied wagtail box looks noticeably taller than the others, so the spreads do not all appear to be the same.
To use the normal quantile plot, assess the slope of each line in each plot. Do the slopes look about the same? I think they more or less do, though you could say that the line on the Pied Wagtail plot is steeper than the others.
In terms of a recommendation:
- if you thought the distributions were not normal enough, recommend Mood’s median test.
- if you thought that normality was OK, but equal spreads was not, recommend the Welch ANOVA (sketched just below this list).
- if you thought that both normality and equal spreads were good enough, recommend a regular ANOVA.
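If you ended up recommending the Welch ANOVA, here is a sketch of how it would go (base R’s `oneway.test` does the Welch version by default, without assuming equal spreads):
oneway.test(egg_length ~ bird_species, data = eggs)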
\(\blacksquare\)
- Run an (ordinary) analysis of variance, including any follow-up if warranted. What do you conclude, in the context of the data? (Run this analysis even if you don’t think it’s the best thing to do.)
Solution
I wanted to get you some practice at doing this, hence my last sentence:
eggs.1 <- aov(egg_length ~ bird_species, data = eggs)
summary(eggs.1)
Df Sum Sq Mean Sq F value Pr(>F)
bird_species 5 42.94 8.588 10.39 3.15e-08 ***
Residuals 114 94.25 0.827
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This P-value of \(3 \times 10^{-8}\) is much smaller than 0.05, so the mean cuckoo egg lengths are definitely not all the same for each host species (or, if you like, cuckoo egg length depends somehow on host species).
This is enough to say for now. To find out which host species differ from which in terms of mean cuckoo egg length, we need to fire up Tukey:
TukeyHSD(eggs.1)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = egg_length ~ bird_species, data = eggs)
$bird_species
diff lwr upr p adj
MeadowPipet-HedgeSparrow -0.82253968 -1.629133605 -0.01594576 0.0428621
PiedWagtail-HedgeSparrow -0.21809524 -1.197559436 0.76136896 0.9872190
Robin-HedgeSparrow -0.54642857 -1.511003196 0.41814605 0.5726153
TreePipet-HedgeSparrow -0.03142857 -1.010892769 0.94803563 0.9999990
Wren-HedgeSparrow -1.99142857 -2.970892769 -1.01196437 0.0000006
PiedWagtail-MeadowPipet 0.60444444 -0.181375330 1.39026422 0.2324603
Robin-MeadowPipet 0.27611111 -0.491069969 1.04329219 0.9021876
TreePipet-MeadowPipet 0.79111111 0.005291337 1.57693089 0.0474619
Wren-MeadowPipet -1.16888889 -1.954708663 -0.38306911 0.0004861
Robin-PiedWagtail -0.32833333 -1.275604766 0.61893810 0.9155004
TreePipet-PiedWagtail 0.18666667 -0.775762072 1.14909541 0.9932186
Wren-PiedWagtail -1.77333333 -2.735762072 -0.81090459 0.0000070
TreePipet-Robin 0.51500000 -0.432271433 1.46227143 0.6159630
Wren-Robin -1.44500000 -2.392271433 -0.49772857 0.0003183
Wren-TreePipet -1.96000000 -2.922428738 -0.99757126 0.0000006
This is hard to make sense of because there are 6 groups and 15 comparisons. Seven of the comparisons are significant. The five very small P-values are all Wren and something else; if you look at those carefully, the eggs in the Wren nests are all smaller on average. The other two P-values are only just less than 0.05; in these two cases, the eggs in the Meadow Pipit nests are less long on average than those in the Hedge Sparrow or Tree Pipit nests.
Try to find a way to summarize the seven significant differences in a way that is easy to read and understand (thinking of your reader, again). Simply listing the significant ones doesn’t offer any insight about what they have in common.
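One possible starting point, as a sketch: turn the Tukey output into a dataframe and keep only the significant comparisons, sorted by P-value (`rownames_to_column` is from the tibble package, part of the tidyverse):
TukeyHSD(eggs.1)$bird_species %>%
  as.data.frame() %>%
  rownames_to_column("comparison") %>%   # the comparison names are row names
  filter(`p adj` < 0.05) %>%
  arrange(`p adj`)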
\(\blacksquare\)
- Run a Mood’s median test, and, if appropriate, follow-up tests. What do you now conclude, in the context of the data?
Solution
Make sure you have this somewhere:
library(smmr)
and then:
median_test(eggs, egg_length, bird_species)
$grand_median
[1] 22.35
$table
above
group above below
HedgeSparrow 11 3
MeadowPipet 17 28
PiedWagtail 10 5
Robin 10 6
TreePipet 12 3
Wren 0 15
$test
what value
1 statistic 3.032698e+01
2 df 5.000000e+00
3 P-value 1.271619e-05
This is also strongly significant, and indicates that the median cuckoo egg lengths in the nests of the six different species are not all the same. (Or that one or more of the median egg lengths is different, etc etc.) So, to find out which ones are different according to this procedure, fire up pairwise median tests:
pairwise_median_test(eggs, egg_length, bird_species)
There are fifteen of these (page down to see the other five). This actually is a dataframe, so you can sort the (adjusted) P-values into order without too much trouble. Or you can use `filter` to show only the ones that are less than 0.05. I like the sorting idea better, because then you can see whether there are any others whose P-value is close to 0.05, or, as in this case, confirm that there are not:
pairwise_median_test(eggs, egg_length, bird_species) %>%
arrange(adj_p_value)
There are only four significant differences here: Wren with everything except Pied Wagtail. If you look at the table of aboves and belows in the output from Mood’s median test, this is evidently because the Wren eggs are shorter than the others. (Presumably Wren vs. Pied Wagtail is not unbalanced enough to be significant.)
\(\blacksquare\)
- Compare all your significant results from the previous two parts. Are the results substantially different? Explain briefly.
Solution
For the ANOVA and the Mood’s median test themselves, both P-values are very small (the one for the ANOVA is smaller), so there is no substantial difference there.
For the followup comparisons, the only ones that differ in significance between the two tests are Wren vs Pied Wagtail, and Meadow Pipit vs Hedge Sparrow and Tree Pipit. These are not significant in the pairwise median tests, but were in Tukey. (In addition, the last two comparisons were only just significant in the Tukey.) Given this, I would say that there is not a substantial difference between the results from the two procedures.
\(\blacksquare\)
Cars in 1993
The dataset in http://ritsokiguess.site/datafiles/Cars93.csv contains a lot of information about 93 different vehicles that were available in 1993. Each vehicle is classified as one of six `Type`s, and for each vehicle, the gas mileage in miles per US gallon (of gas consumed) in city driving is recorded in `MPG.city`. Our aim in this question is to compare the city gas mileage of vehicles of different types, assuming that the ones in our dataset are (something like) a random sample of all vehicles available in North America in 1993.
- Read in and display (some of) the data.
To exactly no-one’s surprise:
<- "http://ritsokiguess.site/datafiles/Cars93.csv"
my_url <- read_csv(my_url) Cars93
Rows: 93 Columns: 27
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): Manufacturer, Model, Type, AirBags, DriveTrain, Cylinders, Man.tra...
dbl (18): Min.Price, Price, Max.Price, MPG.city, MPG.highway, EngineSize, Ho...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Cars93
- Make a graph that will enable you to compare the distributions of city gas mileage for vehicles of different types.
Gas consumption is quantitative and vehicle type is categorical, so a boxplot:
ggplot(Cars93, aes(x = Type, y = MPG.city)) + geom_boxplot()
- What can you say about the spread of the distribution of gas mileage for vehicle types where the median is large (compared to where it is small)?
The gas mileages for the types of vehicle where gas mileage is large are more spread out (have a larger IQR) than the ones where the gas mileage is small. For example, small cars have both the biggest median and the most spread-out distribution (tallest box), while the three groups of vehicles with the smallest medians (large and midsize cars and vans) also have the least spread-out distributions.
Extra: we can work out SD and mean for each vehicle type and then plot those against each other. The first step is a group-by and summarize as we learned it early on in the course:
Cars93 %>%
  group_by(Type) %>%
  summarize(mean_mpg = mean(MPG.city), sd_mpg = sd(MPG.city))
and then the second step is to realize that this is itself a dataframe, so we can make a plot from it: in particular, we can plot the group means and SDs against each other. This is done by piping the above dataframe into `ggplot`, so that in practice you would do it in one pipeline, as shown below. The standard way to assess for any relationship between group means and SDs is to see how the SD depends on the mean (if at all), so the group SDs will be the \(y\) coordinate of the plot. I have added one extra refinement: labels to show which vehicle type is which on the plot:
Cars93 %>%
  group_by(Type) %>%
  summarize(mean_mpg = mean(MPG.city), sd_mpg = sd(MPG.city)) %>%
  ggplot(aes(x = mean_mpg, y = sd_mpg, label = Type)) +
  geom_point() + geom_text_repel()
This shows clearly that the vehicle types with larger mean gas mileages also have larger SDs of mileages (and indeed that the SD goes up with the mean in a more or less linear way).
I got the labels by using `geom_text_repel` from the package ggrepel (which you will have to install first if you want to try this). This puts the label next to the point it belongs to, so that you can see which point is which.[11]
Extra 2: This is a problem for the ANOVA that we want to do to compare the gas mileages, because that would assume that the groups all have the same spread. One way around that would be to use a Welch ANOVA, but the normality is also at least somewhat questionable. You could use Mood’s median test, but I wanted this to be a problem that used `aov`, and the point of this Extra is to show you how I got it to this point.
When there is a relationship between group centre (mean, median) and spread (SD, IQR), particularly if one being larger goes with the other one being larger as well, things can often be helped by a transformation of the response variable. That is, instead of using gas mileage itself, we use some function of gas mileage like its square root or its logarithm. This has the effect of making larger values closer together (and smaller ones more spread out), and, with luck, will produce groups with more equal spread.
In the next part, I have you work with the reciprocal of gas mileage (which, as you will see, has a practical meaning for this problem), but I present it there as a fait accompli, and I wanted you to see that the idea actually came from somewhere. The technique I’m using is called the “Box-Cox transformation”, which we’ll see for real when we study regression, but the idea also applies in ANOVA. The `boxcox` function lives in a package called MASS (which you will also have to install and load first).[12]
To run `boxcox`, you feed it a model formula and a dataframe, just as you would do with `aov`:
boxcox(MPG.city ~ Type, data = Cars93)
The output from `boxcox` is a graph. The symbol below the \(x\)-axis is the Greek letter “lambda”, and the \(x\)-axis tells you how to transform the response variable. The value of \(\lambda\) tells you what power to raise gas mileage to, in order to best even out the spread across the groups. A value of 2 would mean to square it, a value of 0.5 would mean to square-root it, and so on. Raising a number to the power zero always gives you 1 (not very helpful), so \(\lambda = 0\) has the special meaning of “take logs”.
What you do with the plot is to look where it reaches its maximum, in this case around \(-1.3\). The middle vertical dotted line guides you as to what value of \(\lambda\) gives the maximum value for the curve. The two outer vertical dotted lines show a 95% confidence interval for \(\lambda\); anything inside the interval is supported by the data. In this case, \(\lambda\) could be anything between about \(-2\) and \(-0.6\). When choosing a transformation, you want a “round number” like \(2, 1, 0.5, 0, -0.5, -1, -2\) that is inside the interval. In this case, that would be \(-1\). Raising to the power \(-1\) means taking the reciprocal, that is, “one over” the gas mileage. We will see in the next part that this is in fact a perfectly sensible measure of gas consumption, and so we are in the happy position of being able to use a transformation that makes sense.
If \(\lambda = 1\) were in your confidence interval (it is not here), this would be telling you that raising to the power 1, i.e., doing nothing, would be supported by the data. In that case, it would be saying that doing any kind of transformation is not helpful.
On our graph, the peak and the confidence interval are scrunched up over on the left. `boxcox` tries to make a sensible decision about what values of \(\lambda\) to use, but if you don’t like what it gives you, you can provide your own, like this:
boxcox(MPG.city ~ Type, data = Cars93, lambda = seq(-4, 2, 0.1))
`seq` means “go from the first number to the second number in steps of the third number”. It’s like the full version of Python’s `range` with three inputs, or the “fill column” idea in Excel. Usually this idea is used to zoom the graph out, so that you can see more of it.
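A tiny illustration of `seq`, unrelated to the data:
seq(0, 1, 0.25)   # 0.00 0.25 0.50 0.75 1.00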
- It is suggested that we should use a “reciprocal transformation” of gas mileage: that is, instead of using gas mileage itself in our analysis, we should use one divided by gas mileage instead. Explain briefly how this is also a sensible measure of gas consumption. (Hint: think about the units.)
The units of `MPG.city` in the data are miles per US gallon. To get the units of the reciprocal of this, we flip the units upside down: that is, the units of the reciprocal of `MPG.city` are US gallons per mile. This is also a sensible way of measuring how much gas a vehicle uses; indeed, you might even say it is a more sensible measure, because it is literally how much gas the vehicle consumes to travel a fixed distance, and a lower number means (sensibly) that the vehicle consumes less gas.
This is in fact equivalent to the Canadian measure of litres per 100 km, and differs from it by only a fixed multiple (because the number of miles in a kilometre and the number of gallons in a litre are both constants). The choice of 100 km is only to produce a human-sized number (litres per km would be a number less than 1).
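To see the connection, a sketch of the conversion (the function name `gpm_to_l100km` is mine; the constants are roughly 3.785 litres per US gallon and 1.609 km per mile):
gpm_to_l100km <- function(gpm) gpm * 3.785 / 1.609 * 100
gpm_to_l100km(1 / 25)   # a 25-MPG car uses about 9.4 litres per 100 km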
- Create and save a new column in your dataframe that contains the reciprocal of the gas mileage. Re-draw your plot from earlier to use your new column instead of the original `MPG.city`.
I’m going to call my new column gpm
because it literally is gallons per mile:
%>%
Cars93 mutate(gpm = 1/MPG.city) -> Cars93
I saved it back into the original dataframe.
Then:
ggplot(Cars93, aes(x = Type, y = gpm)) + geom_boxplot()
- Explain why it is now at least somewhat defensible to run an ordinary ANOVA.
The distributions of `gpm` for each type are at least approximately symmetric in shape (better than they were before), and the heights of the boxes are also more or less equal (again, better than they were before).
There will inevitably be some hand-waving here. Things are not perfect, but there are 93 observations altogether, with this many in each group:
Cars93 %>% count(Type)
so the Central Limit Theorem will help at least a little bit.
Extra: our problem before was that group SD went up with group mean. Is that still the case?
Cars93 %>%
  group_by(Type) %>%
  summarize(mean_gpm = mean(gpm), sd_gpm = sd(gpm)) %>%
  ggplot(aes(x = mean_gpm, y = sd_gpm, label = Type)) +
  geom_point() + geom_text_repel()
Not so much of a trend this time: knowing the group mean tells you very little about the group SD.
- Run a suitable analysis of variance, along with any appropriate followup. (You don’t need to interpret the followup yet.)
`aov` first:
Cars93.1 <- aov(gpm ~ Type, data = Cars93)
summary(Cars93.1)
Df Sum Sq Mean Sq F value Pr(>F)
Type 5 0.005767 0.0011535 35.99 <2e-16 ***
Residuals 87 0.002788 0.0000321
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There are definitely some differences among the groups, so run Tukey to find out where they are:
TukeyHSD(Cars93.1)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = gpm ~ Type, data = Cars93)
$Type
diff lwr upr p adj
Large-Compact 0.010424951 0.0039630197 0.0168868833 0.0001373
Midsize-Compact 0.007251611 0.0018308870 0.0126723349 0.0025418
Small-Compact -0.009711289 -0.0151860865 -0.0042364919 0.0000215
Sporty-Compact 0.002876609 -0.0031611172 0.0089143359 0.7338016
Van-Compact 0.014736297 0.0078620427 0.0216105509 0.0000002
Midsize-Large -0.003173341 -0.0092657082 0.0029190272 0.6537776
Small-Large -0.020136241 -0.0262767702 -0.0139957112 0.0000000
Sporty-Large -0.007548342 -0.0141956602 -0.0009010239 0.0166082
Van-Large 0.004311345 -0.0031040485 0.0117267392 0.5392304
Small-Midsize -0.016962900 -0.0219961683 -0.0119296321 0.0000000
Sporty-Midsize -0.004375002 -0.0100154381 0.0012654349 0.2216964
Van-Midsize 0.007484686 0.0009566117 0.0140127599 0.0150996
Sporty-Small 0.012587899 0.0068954754 0.0182803218 0.0000001
Van-Small 0.024447586 0.0178745419 0.0310206301 0.0000000
Van-Sporty 0.011859687 0.0048108829 0.0189084919 0.0000625
This is rather a lot to interpret, so I’m not asking you to do so.
- Is there any type of vehicle that has significantly better or significantly worse fuel consumption than all of the other vehicle types?
To give yourself a starting point: this would have to be the very best or the very worst `Type`. To find out which that is, look again at the boxplot you just drew, with what I called `gpm`:
ggplot(Cars93, aes(x = Type, y = gpm)) + geom_boxplot()
Small cars are the best (they consume the least gas), and vans are the worst (they consume the most). There seems to be something of a gap between the small cars and everything else, so scan through all the Small entries in your Tukey output and see that they are all significant. Hence the small cars have significantly better gas consumption than all the other types of vehicle. (This, you will realize, is one of the selling points of a small car.) For completeness, check the Vans and find that `Van` and `Large` do not differ significantly, so vans are not significantly worse than all the other vehicle types.
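A sketch of doing that scan in code rather than by eye, in the same spirit as the earlier cuckoo sketch: keep only the Tukey comparisons that mention Small (the same idea works for Van); `str_detect` is from stringr:
TukeyHSD(Cars93.1)$Type %>%
  as.data.frame() %>%
  rownames_to_column("comparison") %>%
  filter(str_detect(comparison, "Small"))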
Extra 1: when I was your age, we used to interpret Tukey (calculated, of course, by hand) as below. First, list the means in order:[13]
Cars93 %>%
  group_by(Type) %>%
  summarize(mpg_mean = mean(MPG.city)) %>%
  arrange(mpg_mean)
The old-fashioned way is to write out the group names and means in order, and then to draw a line under any groups that are not significantly different. Start with the worst one, Van, and see what it is not significantly different from (just Large). Then start with the next one, Large, and see that it is not significantly different from Midsize. And so on. The kind of “ladder” thing with the lines is quite common (and a pain to interpret). But there are no lines at all under Small, so Small is significantly different from all the others.
Van Large Midsize Sporty Compact Small
17.00 18.36 19.55 21.79 22.69 29.86
-------------
--------------
-----------------
----------------
- Run Mood’s median test on the original data (no assumption of normality or equal spreads needed), along with any appropriate followup. Compare the results with your ANOVA in the previous two questions.
median_test(Cars93, MPG.city, Type)
$grand_median
[1] 21
$table
above
group above below
Compact 11 3
Large 0 11
Midsize 4 14
Small 21 0
Sporty 8 6
Van 0 9
$test
what value
1 statistic 5.140800e+01
2 df 5.000000e+00
3 P-value 7.134267e-10
There are definitely some differences in median gas mileage between the types of cars (P-value \(7.1 \times 10^{-10}\)). To see what they are, the right followup is pairwise median tests:
pairwise_median_test(Cars93, MPG.city, Type) %>%
mutate(sig = adj_p_value <= 0.05)
Remember to use the adjusted P-value to assess significance here (because we are doing multiple tests all at once).
I created an extra column to indicate whether each comparison was significant or not (you don’t need to do this). This is because I wanted to do the lines thing again:
Van Large Midsize Sporty Compact Small
17.00 18.36 19.55 21.79 22.69 29.86
-------------
--------------------------
----------------
Almost the same but not quite: Sporty and Large are no longer significantly different, but all the other significant differences are the same.
Extra: the usual take in this sort of situation is that finding a good transformation (our gallons per mile) and then doing `aov` is a better idea than going straight to Mood’s median test, because `aov` will be more powerful[14] than Mood’s median test, if you can find an appropriate way to do it. This one lent itself to a transformation because the `Type`s with bigger means also had bigger spreads. If things are less clear than this, you might not be able to find a good transformation, and in that case, Mood’s median test is the way to go.
Footnotes
[1] If we had 40 independent miles-per-gallon measurements, there would be no reason to have the `Fill-up` column, because there would be no basis to link a Computer value with a particular Driver one.
[2] Actually, it works, but not the way we wanted it to: there is no error here, just not the kind of data we wanted.
[3] It may look like two characters, but the backslash is called an “escape character” and doesn’t count.
[4] I seem to be attributing sentience to the Computer here. Maybe `recorded_by` would have been a better name.
[5] To be precise linguistically, there are “differences between” two things, as in matched pairs, and “differences among” three or more things, as in ANOVA.
[6] This is very similar to the thing with the P-values in Extra 1; the sign test doesn’t use the data as efficiently, so its interval is a little wider.
[7] This is the line \(y = x\), which has intercept 0 and slope 1.
[8] Studies of this kind are usually designed to have the same number of patients in each group. The patients in the study will have had to sign a consent form that explains any risks involved in taking part in the study, along with a statement that they can withdraw from the study at any time. What can happen is that the groups start off the same size, but a few people drop out, and so the group sizes end up slightly different from each other. My guess is that this is what happened here.
[9] The log of a number less than 1 is negative, which means that a `pain` value that is negative comes from a baseline score that is less than the final score. The idea is supposed to be less pain after the treatment than before, so a negative `pain` value indicates that the (standard or new) treatment is actually making the pain worse! Fortunately, there are not very many negative `pain` values.
[10] The `.txt` file extension clues you in to the idea that this is something that is not a `.csv` and thus will need some thought.
[11] There is also a `geom_text` that you can use without installing any extra packages, but that puts the text on top of the point, which is not the effect we are after here.
[12] There is an annoyance here in that MASS, like the tidyverse, also has a function called `select`, but the MASS `select` does something different, and things get less confused if you load MASS first, before the tidyverse. If you do that, you can continue to use `select` and have it do what you would expect.
[13] These, I realize, are the mean MPG, compared using the analysis for `gpm`. Oops.
[14] As ever, making better use of the data.