Worksheet 7

Published

February 26, 2024

Questions are below. My solutions are below all the question parts for a question; scroll down if you get stuck. There might be extra discussion below that for some of the questions; you might find that interesting to read, maybe after tutorial.

For these worksheets, you will learn the most by spending a few minutes thinking about how you would answer each question before you look at my solution. There are no grades attached to these worksheets, so feel free to guess: it makes no difference at all how wrong your initial guess is!

This is long, because there is two weeks’ worth of work here.

1 Fuel efficiency comparison

Some cars have on-board computers that calculate quantities related to the car’s performance. One of the things measured is the fuel efficiency, that is, how much gasoline the car uses. On an American car, this is measured in miles per (US) gallon. On one type of vehicle equipped with such a computer, the fuel efficiency was measured each time the gas tank was filled up, and the computer was then reset. Twenty observations were made, and are in http://ritsokiguess.site/datafiles/mpgcomparison.txt. The computer’s values are in the column Computer. The driver also calculated the fuel efficiency by hand, by noting the number of miles driven between fill-ups, and the number of gallons of gas required to fill the tank each time. The driver’s values are in Driver. The final column Diff is the difference between the computer’s value and the driver’s value for each fill-up. The data values are separated by tabs.

You’ve seen these data before.

Read in and display (some of) the data.
Draw a suitable graph of these data, bearing in mind what you might want to learn from your graph. (This is a repeat of previous work with these data.)
Is there any difference between the median results of the driver and the computer? Do an appropriate test.

Fuel efficiency comparison, my solutions

You’ve seen these data before.

Read in and display (some of) the data.

Solution

library(tidyverse)

This is like the athletes data, so read_tsv:

my_url <- "http://ritsokiguess.site/datafiles/mpgcomparison.txt"
fuel <- read_tsv(my_url)

Rows: 20 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
dbl (4): Fill-up, Computer, Driver, Diff

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

fuel

\(\blacksquare\)

Draw a suitable graph of these data, bearing in mind what you might want to learn from your graph. (This is a repeat of previous work with these data.)

Solution

In a matched pairs situation, what matters is whether the differences have enough of a normal distribution. The separate distributions of the computer’s and driver’s results are of no importance. So make a graph of the differences. We are specifically interested in normality, so a normal quantile plot is best:

ggplot(fuel, aes(sample = Diff)) + stat_qq() + stat_qq_line()

comment xxx

\(\blacksquare\)

Is there any difference between the median results of the driver and the computer? Do an appropriate test.

Solution

To compare the medians, a sign test on the differences (because they are paired):

sign_test(fuel, Diff, 0)

$above_below
below above 
    3    17 

$p_values
  alternative     p_value
1       lower 0.999798775
2       upper 0.001288414
3   two-sided 0.002576828

The right P-value for the sign test is 0.0026. Hence, there is a significant difference in medians between the driver and the computer (or, the median difference between them is not zero).

\(\blacksquare\)

2 Canned tuna

Light tuna is sold in 170-gram cans. The tuna can be canned in either water or oil. Is there a difference in the typical selling price of tuna canned in water or in oil? To find out, 25 supermarkets were sampled. In 14 of them (randomly chosen), the price of a (randomly chosen) brand of tuna in water was recorded, and in the other 11 supermarkets, the price of a (randomly chosen) brand of tuna in oil was recorded. The data are in http://ritsokiguess.site/datafiles/tuna.csv, with the column canned_in saying whether the tuna was canned in oil or water, and the price being in cents.

You used these data previously to draw normal quantile plots.

Read in and display (some of) the data.
Make a suitable graph of these data.
We decided previously that running a \(t\)-test was a bad idea here. Run Mood’s median test for these data to determine whether the median selling price differs between tuna canned in water and in oil. What do you conclude, in the context of the data?
Explain briefly why the counts of values above and below the overall median (in the previous part) is entirely consistent with the P-value that you found.

Canned tuna: My solutions

Read in and display (some of) the data.

Solution

Making it easy for you:

my_url <- "http://ritsokiguess.site/datafiles/tuna.csv"
tuna <- read_csv(my_url)

Rows: 25 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): canned_in
dbl (1): price

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

tuna

Scrolling down reveals that the cans of tuna canned in oil are at the bottom.

\(\blacksquare\)

Make a suitable graph of these data.

Solution

You are probably expecting a boxplot, and with one quantitative variable (price) and one categorical one (canned_in) that is exactly the thing:

ggplot(tuna, aes(x = canned_in, y = price)) + geom_boxplot()

Another graph you might consider is a normal quantile plot. This is particularly true if you have been looking ahead, and realize that normality of each of the two samples is something we should be concerned about. There are two samples, so a facetted normal quantile plot is the thing:

ggplot(tuna, aes(sample = price)) + stat_qq() + stat_qq_line() +
  facet_wrap(~ canned_in)

\(\blacksquare\)

We decided previously that running a \(t\)-test was a bad idea here. Run Mood’s median test for these data to determine whether the median selling price differs between tuna canned in water and in oil. What do you conclude, in the context of the data?

Solution

These distributions are far from normal (skewed to the right), and (as we determined before) the sample sizes are small (14 and 11), so it doesn’t look as if the Central Limit Theorem will be much help.

For Mood’s median test, use smmr (easier). I didn’t say anything about building it yourself:

median_test(tuna, price, canned_in)

$grand_median
[1] 67

$table
       above
group   above below
  oil       5     5
  water     7     6

$test
       what      value
1 statistic 0.03350816
2        df 1.00000000
3   P-value 0.85475695

Our null hypothesis is that the median price of tuna canned in water and tuna canned in oil is the same. Our alternative is two-sided, since we are looking for any difference: that is, that the two median prices are different. With a P-value of 0.855, much larger than 0.05, we cannot reject the null hypothesis (we “retain” the null, if you like that word better). There is no evidence of a difference in the median price of tuna canned in water and tuna canned in oil.

\(\blacksquare\)

Explain briefly why the counts of values above and below the overall median (in the previous part) is entirely consistent with the P-value that you found.

Solution

We failed to reject that the two group medians are equal, so we would expect to see a more or less even split of values above and below the overall median in both the groups. Specifically, that means we would expect close to half of the values in each group to be below the overall median and close to half of them to be above. (It has to be “half” because the overall median has to have half of all the data values below and half above).

To see how close that was to happening, look at the other part of the output in the previous part. You might have to click on R Console to see it. It turns out that for the tuna canned in oil, 5 prices were above the overall median and 5 were below, an exactly even split, and for the tuna canned in water, 7 were above and 6 below, very close to an even split. (This is, in fact, exactly the sort of thing we would have expected to see if the null hypothesis were actually true.) These are extremely close to being 50-50 splits, and so not only would we not expect to reject the null hypothesis, but we would also expect to get the very large P-value that we did.

(In case you were wondering where the other two values went, since there were 11 oil prices and 14 water ones, they were exactly equal to the overall median and so got counted as neither above nor below.)

\(\blacksquare\)

3 Cuckoo eggs

The cuckoo is a European bird that lays its eggs in the nests of other birds (rather than building nests itself). The other bird, known as a “host”, raises and cares for the newly hatched cuckoo chick as if it was its own. Each cuckoo returns to the same territory year after year and lays its eggs in a nest of the same host species. Thus, cuckoos are actually several sub-species, each with a different host bird that it lays its eggs in the nests of. In a study, 120 cuckoo eggs were found in the nests of six other bird species: hedge sparrow, meadow pipit, pied wagtail, robin, tree pipit, and wren. These are birds of different sizes, so researchers were interested in whether the cuckoo eggs laid in the nests of different host birds differed in size as well. (For example, wrens are small birds, so you might be interested in whether cuckoo eggs laid in wren nests are smaller than cuckoo eggs laid in the nests of other birds. If this is the case, the cuckoo eggs will look less different from the wren eggs in the same nest.)

The data are in http://ritsokiguess.site/datafiles/cuckoo.txt.

Read in and display (some of) the data. Note that some of the host bird names are misspelled. (You do not need to correct the misspellings.)
Bearing in mind that we will be interested in running some kind of ANOVA shortly, explain briefly why a normal quantile plot, for each host species separately, will be useful.
Draw a suitable normal quantile plot. Based on what you see, what would you recommend as a suitable test to compare the egg lengths in the nests of the different host species? Explain briefly.
Run an (ordinary) analysis of variance, including any follow-up if warranted. What do you conclude, in the context of the data? (Run this analysis even if you don’t think it’s the best thing to do.)
Run a Mood’s median test, and, if appropriate, follow-up tests. What do you now conclude, in the context of the data?
Compare all your significant results from the previous two parts. Are the results substantially different? Explain briefly.

Cuckoo eggs: my solutions

The data are in http://ritsokiguess.site/datafiles/cuckoo.txt.

Read in and display (some of) the data. Note that some of the host bird names are misspelled. (You do not need to correct the misspellings.)

Solution

The values are separated by single spaces, so:

my_url <- "http://ritsokiguess.site/datafiles/cuckoo.txt"
eggs <- read_delim(my_url, " ")

Rows: 120 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
chr (1): bird_species
dbl (1): egg_length

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

eggs

There are indeed 120 eggs, with lengths (evidently measured in millimetres) and with the two “pipets” misspelled. The column called bird_species is the host bird; the eggs themselves are all cuckoo eggs.

\(\blacksquare\)

Bearing in mind that we will be interested in running some kind of ANOVA shortly, explain briefly why a normal quantile plot, for each host species separately, will be useful.

Solution

The major assumption of ANOVA is that the observations within each group are (approximately) normal in shape. Since we are specifically interested in normality, rather than shape generally, a normal quantile plot will be better than a boxplot.

\(\blacksquare\)

Draw a suitable normal quantile plot. Based on what you see, what would you recommend as a suitable test to compare the egg lengths in the nests of the different host species? Explain briefly.

Solution

Do the normal quantile plot facetted by groups (bird_species), since we are interested in normality within each group, not of all the observations taken together:

ggplot(eggs, aes(sample = egg_length)) + stat_qq() +
  stat_qq_line() + facet_wrap(~ bird_species)

We are looking for all the distributions to be roughly normal. Commentary (which you don’t need to write but you certainly need to think):

the Hedge Sparrow distribution has two or three mild outliers at the bottom. (This, to me, is better than calling it “long tails” because the high observation is (i) only one and (ii) not really too high.)
the Meadow Pipit has long tails (I say this because the pattern seems to be a feature of the whole distribution, rather than of a few observations. Having said that, those four lowest values do seem to be noticeably lower than the rest, so you can sell the idea of those four lowest values being outliers).
the Pied Wagtail is if anything short-tailed, so this is not a problem. You could also call this acceptably normal.
I cannot see any problems at all with the Robin or the Wren.
The Tree Pipit distribution is, as you see it, either acceptably normal, or very slightly skewed to the left (is that a curved shape?)

The second part of the thought process is sample size, because the larger the sample size within a group, the less picky you have to be about normality. You can guess this by looking at how many dots there appear to be on the normal quantile plots, or you can get the exact sample sizes (better):

eggs %>% count(bird_species)

So I would actually say that the Hedge Sparrow is the one that might be a problem. The (much) larger sample size of the Meadow Pipits ought to be large enough to take care of the long tails, but it is not clear whether a sample size of 14 Hedge Sparrows is large enough to take care of the low-end outliers.

So, in terms of what you actually need to write:

find one or more distributions whose normality you are concerned about (eg Hedge Sparrow and Meadow Pipit)
think about whether the sample sizes for the ones you are concerned about are small enough that the non-normality you found is still a problem. I would say that the large sample of Meadow Pipits (45) is large enough that I don’t need to worry: the long tails there are not extreme. But I am less sure about the Hedge Sparrows: the small sample size (14) might not be enough to accommodate the low-end outliers.
if there are any distributions that you are concerned about, express that you don’t think an ANOVA will be appropriate (and that a Mood’s median test will be better).

If you were happy enough with the normality, the best thing is then to think about whether the spreads are equal. You can do this by calculating SDs:

eggs %>% 
  group_by(bird_species) %>% 
  summarize(length_sd = sd(egg_length))

The SDs for the robins and wrens are the smallest of these, but it’s up to you whether you think they are enough smaller than the others to be worth worrying about. I’m inclined to think not, but if you think they are, then you would recommend a Welch ANOVA. A couple of other options for assessing spread are:

to draw boxplots, purely for the purpose of assessing spread (because we used the normal quantile plots for the stuff about normality)
to use the normal quantile plots to assess spread.

Boxplots look like this:

ggplot(eggs, aes(x = bird_species, y = egg_length)) + geom_boxplot()

The pied wagtail box looks noticeably taller than the others, so the spreads do not all appear to be the same.

To use the normal quantile plot, assess the slope of each line in each plot. Do the slopes look about the same? I think they more or less do, though you could say that the line on the Pied Wagtail plot is steeper than the others.

In terms of a recommendation:

if you thought the distributions were not normal enough, recommend Mood’s median test.
if you thought that normality was OK, but equal spreads was not, recommend the Welch ANOVA.
if you thought that both normality and equal spreads were good enough, recommend a regular ANOVA.

\(\blacksquare\)

Run an (ordinary) analysis of variance, including any follow-up if warranted. What do you conclude, in the context of the data? (Run this analysis even if you don’t think it’s the best thing to do.)

Solution

I wanted to get you some practice at doing this, hence my last sentence:

eggs.1 <- aov(egg_length ~ bird_species, data = eggs)
summary(eggs.1)

              Df Sum Sq Mean Sq F value   Pr(>F)    
bird_species   5  42.94   8.588   10.39 3.15e-08 ***
Residuals    114  94.25   0.827                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This P-value of \(3 \times 10^{-8}\) is much smaller than 0.05, so the mean cuckoo egg lengths are definitely not all the same for each host species (or, if you like, cuckoo egg length depends somehow on host species).

This is enough to say for now. To find out which host species differ from which in terms of mean cuckoo egg length, we need to fire up Tukey:

TukeyHSD(eggs.1)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = egg_length ~ bird_species, data = eggs)

$bird_species
                                diff          lwr         upr     p adj
MeadowPipet-HedgeSparrow -0.82253968 -1.629133605 -0.01594576 0.0428621
PiedWagtail-HedgeSparrow -0.21809524 -1.197559436  0.76136896 0.9872190
Robin-HedgeSparrow       -0.54642857 -1.511003196  0.41814605 0.5726153
TreePipet-HedgeSparrow   -0.03142857 -1.010892769  0.94803563 0.9999990
Wren-HedgeSparrow        -1.99142857 -2.970892769 -1.01196437 0.0000006
PiedWagtail-MeadowPipet   0.60444444 -0.181375330  1.39026422 0.2324603
Robin-MeadowPipet         0.27611111 -0.491069969  1.04329219 0.9021876
TreePipet-MeadowPipet     0.79111111  0.005291337  1.57693089 0.0474619
Wren-MeadowPipet         -1.16888889 -1.954708663 -0.38306911 0.0004861
Robin-PiedWagtail        -0.32833333 -1.275604766  0.61893810 0.9155004
TreePipet-PiedWagtail     0.18666667 -0.775762072  1.14909541 0.9932186
Wren-PiedWagtail         -1.77333333 -2.735762072 -0.81090459 0.0000070
TreePipet-Robin           0.51500000 -0.432271433  1.46227143 0.6159630
Wren-Robin               -1.44500000 -2.392271433 -0.49772857 0.0003183
Wren-TreePipet           -1.96000000 -2.922428738 -0.99757126 0.0000006

This is hard to make sense of because there are 6 groups and 15 comparisons. Seven of the comparisons are significant. The five very small P-values are all Wren and something else; if you look at those carefully, the eggs in the Wren nests are all smaller on average. The other two P-values are only just less than 0.05; in these two cases, the eggs in the Meadow Pipit nests are less long on average than those in the Hedge Sparrow or Tree Pipit nests.

Try to find a way to summarize the seven significant differences in a way that is easy to read and understand (thinking of your reader, again). Simply listing the significant ones doesn’t offer any insight about what they have in common.

\(\blacksquare\)

Run a Mood’s median test, and, if appropriate, follow-up tests. What do you now conclude, in the context of the data?

Solution

Make sure you have this somewhere:

library(smmr)

and then:

median_test(eggs, egg_length, bird_species)

$grand_median
[1] 22.35

$table
              above
group          above below
  HedgeSparrow    11     3
  MeadowPipet     17    28
  PiedWagtail     10     5
  Robin           10     6
  TreePipet       12     3
  Wren             0    15

$test
       what        value
1 statistic 3.032698e+01
2        df 5.000000e+00
3   P-value 1.271619e-05

This is also strongly significant, and indicates that the median cuckoo egg lengths in the nests of the six different species are not all the same. (Or that one or more of the median egg lengths is different, etc etc.) So, to find out which ones are different according to this procedure, fire up pairwise median tests:

pairwise_median_test(eggs, egg_length, bird_species)

There are fifteen of these (page down to see the other five). This actually is a dataframe, so you can sort the (adjusted) P-values into order without too much trouble. Or you can use filter to show only the ones that are less than 0.05. I like the sorting idea better, because then you can see whether there are any others whose P-value is close to 0.05, or, as in this case, confirm that there are not:

pairwise_median_test(eggs, egg_length, bird_species) %>% 
  arrange(adj_p_value)

There are only four significant differences here: Wren with everything except Pied Wagtail. If you look at the table of aboves and belows in the output from Mood’s median test, this is evidently because the Wren eggs are shorter than the others. (Presumably Wren vs. Pied Wagtail is not unbalanced enough to be significant.)

\(\blacksquare\)

Compare all your significant results from the previous two parts. Are the results substantially different? Explain briefly.

Solution

For the ANOVA and the Mood’s median test themselves, both P-values are very small (the one for the ANOVA is smaller), so there is no substantial difference there.

For the followup comparisons, the only ones that differ in significance between the two tests are Wren vs Pied Wagtail, and Meadow Pipit vs Hedge Sparrow and Tree Pipit. These are not significant in the pairwise median tests, but were in Tukey. (In addition, the last two comparisons were only just significant in the Tukey.) Given this, I would say that there is not a substantial difference between the results from the two procedures.

\(\blacksquare\)

4 Consistency of lab tests

Consistency between laboratory tests is important and yet the results may depend on who did the test and where the test was performed. In an experiment to test levels of consistency, a large jar of dried egg powder was divided up into a number of samples. Because the powder was homogenized (all the same), the fat content of the samples is actually the same, but this fact is hidden from the laboratories. Thus any observed differences between the results are differences among the labs. In each of the six labs, the fat content of the egg powder was measured eight times.

The data are in http://ritsokiguess.site/datafiles/eggs.csv.

Read in and display (some of) the data. (There are some other columns that we will not use in this question.)
Make a suitable plot of the variables of interest.
Describe briefly how any of the labs deviate from the general pattern.
Run an (ordinary) analysis of variance, and conduct any appropriate follow-ups. What do you conclude, in the context of the data? (Hint: using \(\alpha = 0.10\) will make your conclusions clearer.)
Run a suitable Mood’s median test on the same data, including any followups if warranted.

Consistency of lab tests: my solutions

The data are in http://ritsokiguess.site/datafiles/eggs.csv.

Read in and display (some of) the data. (There are some other columns that we will not use in this question.)

Solution

As ever. I chose not to call the dataframe fat because that is one of the columns in it, and I didn’t want to get confused. fat_measurements would have been fine:

my_url <- "http://ritsokiguess.site/datafiles/eggs.csv"
egg_powder <- read_csv(my_url)

Rows: 48 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Lab, Technician, Sample
dbl (1): Fat

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

egg_powder

The measured fat content is in the first column Fat. The labs, in the second column, are distinguished by Roman numerals (as text). There are two other columns Technician and Sample that we will ignore.

Extra: the original description of the data had some more detail:

Four samples were sent to each of six laboratories. Two of the samples were labeled as G and two as H, although in fact they were identical. The laboratories were instructed to give two samples to two different technicians. The technicians were then instructed to divide their samples into two parts and measure the fat content of each. So each laboratory reported eight measures, each technician four measures, that is, two replicated measures on each of two samples.

I’m guessing that may have gone “whoosh” over your head, so let’s try to break it down. The original egg powder was divided into 24 parts. 4 of these parts went to each of the 6 labs. Two of those four parts were labelled G and two were labelled H (to disguise the fact that they were actually all the same). This is in the column labelled Sample. Each lab was told that they needed to pick two technicians to measure the fat content. In the Technician column, these are labelled one and two within each lab. Each technician got one of the parts labelled G and one labelled H. They were instructed to split each part into two and measure the fat content of each one separately. (You’ll remember from your science classes that averaging up several independent measurements is more accurate than just measuring once, so the technicians would have been used to instructions like these.)

As a result, there are actually 48 observations altogether (24 parts each divided into two). We will ignore the Technician and the Sample entirely in our analysis, but I will come back to them later.

\(\blacksquare\)

Make a suitable plot of the variables of interest.

Solution

The two variables of interest to us are Fat (quantitative) and Lab (categorical). I am not (yet) talking about normality or assessing assumptions, so a normal quantile plot is not needed; a boxplot is just fine:

ggplot(egg_powder, aes(x = Lab, y = Fat)) + geom_boxplot()

I just noticed something: the first six Roman numerals are actually in alphabetical order, so they plot properly on the boxplot. A happy coincidence.

Or you can do a normal quantile plot, facetted (which I think is about equally good), but you might have greater difficulty interpreting it for the next part:

egg_powder %>% ggplot(aes(sample = Fat)) +
  stat_qq() + stat_qq_line() +
  facet_wrap(~ Lab)

\(\blacksquare\)

Describe briefly how any of the labs deviate from the general pattern.

Solution

I notice these things, most easily from the boxplots:

Lab I has a noticeably higher median than the other labs, and also a greater spread (IQR).
Lab V has a very small spread (IQR) but also a low outlier that appears to be a long way below the other values from that lab. (The effects of these on the SD, which is really what an ANOVA depends on, counterbalance each other: a small IQR will make the SD smaller, but an outlier will make it bigger.)
Lab VI has (maybe) a lower median than the other labs and a larger spread than the other labs except Lab I.

Say at least two things altogether out of these. To my mind the observations about labs I and V are the most important ones.

Aside: the observations from Lab V are really very consistent, except for one by the second technician on sample H, which really is much lower than the others. In lab I, the measurements by the first technician are all lower than those by the second technician, and two of them by the first technician are very low, hence the long lower tail. End of aside.

If you have normal quantile plots, you may see something like these:

Lab I’s results are higher on average than the others (the points are further up the page, or this is the only lab with any results above 0.6)
the results for Lab I and maybe Lab VI are more spread out than the others (steeper lines, or they fill more of the vertical space)
Labs I and V may have low outliers, but there are no points grossly off the line from the other labs. (Note the difference in interpretation from the boxplots: Lab I has greater spread overall, so the two lowest values appear on the end of a long whisker, and Lab V has much less spread overall. On a boxplot, outlierness is relative to the IQR, so Lab V looks much worse, but on a normal quantile plot, outlierness is just how far the outlying points are different from the others, in this case about 0.2 lower than the others from their lab.)

Here, say at least two things that are not only about outliers: say something about average and/or spread as well. This will be a little more difficult than if you had drawn boxplots, because normal quantile plots are specifically about assessing normality.

\(\blacksquare\)

Run an (ordinary) analysis of variance, and conduct any appropriate follow-ups. What do you conclude, in the context of the data? (Hint: using \(\alpha = 0.10\) will make your conclusions clearer.)

Solution

Set aside, for the moment, any misgivings you have about the appropriateness of this analysis (about outliers or unequal spreads) and run the aov first. I give the aov output a name with a number; depending on how I feel, the name might be of the dataframe, or, as here, the name of the response variable. You can use any name you like:

fat.1 <- aov(Fat ~ Lab, data = egg_powder)
summary(fat.1)

            Df Sum Sq Mean Sq F value   Pr(>F)    
Lab          5 0.4430 0.08861   6.415 0.000164 ***
Residuals   42 0.5801 0.01381                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With a P-value of 0.00016, there are definitely some differences between the labs (or, perhaps more accurately, between the means of all possible measurements from these labs). To find out which labs differ from which, run Tukey on the output from aov:

TukeyHSD(fat.1)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Fat ~ Lab, data = egg_powder)

$Lab
           diff        lwr          upr     p adj
II-I   -0.24000 -0.4154155 -0.064584498 0.0025126
III-I  -0.17250 -0.3479155  0.002915502 0.0563043
IV-I   -0.20375 -0.3791655 -0.028334498 0.0145263
V-I    -0.22625 -0.4016655 -0.050834498 0.0049820
VI-I   -0.31250 -0.4879155 -0.137084498 0.0000530
III-II  0.06750 -0.1079155  0.242915502 0.8579428
IV-II   0.03625 -0.1391655  0.211665502 0.9891804
V-II    0.01375 -0.1616655  0.189165502 0.9998965
VI-II  -0.07250 -0.2479155  0.102915502 0.8178081
IV-III -0.03125 -0.2066655  0.144165502 0.9945375
V-III  -0.05375 -0.2291655  0.121665502 0.9405528
VI-III -0.14000 -0.3154155  0.035415502 0.1857223
V-IV   -0.02250 -0.1979155  0.152915502 0.9988519
VI-IV  -0.10875 -0.2841655  0.066665502 0.4457371
VI-V   -0.08625 -0.2616655  0.089165502 0.6859837

There are 15 comparisons, which is rather a lot to summarize, so scan down them and see what the significant ones have in common. (This is like what you did on the last Worksheet.) There are only five significant comparisons at \(\alpha = 0.10\), looking down the p adj column, and they all involve lab I. So the mean fat content is significantly different (higher) when measured at Lab I, compared to at the other labs.

If you used 0.05, the difference between labs I and III narrowly fails to be significant, which spoils the consistency, but nothing else is close to significance either way.

Extra: as I write this, the next part will have you do a Mood’s median test, for comparison, so I thought I’d do the Welch ANOVA here so that you don’t have to:

oneway.test(Fat ~ Lab, data = egg_powder)


    One-way analysis of means (not assuming equal variances)

data:  Fat and Lab
F = 3.1829, num df = 5.000, denom df = 19.381, p-value = 0.02914

Once again, this is significant (though not as strongly so as for aov), so there are some differences to find. The right followup here is the Games-Howell procedure, which lives in the package PMCMRplus. So install (with install.packages) and load (with library) that first, and then you can say this:

library(PMCMRplus)
gamesHowellTest(Fat ~ factor(Lab), data = egg_powder)


    Pairwise comparisons using Games-Howell test

data: Fat by factor(Lab)

    I     II    III   IV    V    
II  0.082 -     -     -     -    
III 0.284 0.575 -     -     -    
IV  0.187 0.971 0.984 -     -    
V   0.103 0.999 0.705 0.995 -    
VI  0.024 0.722 0.135 0.440 0.524


P value adjustment method: none

alternative hypothesis: two.sided

(you will mess up the uppercase and lowercase in the package name and the name of the function. I do all the time. Check this carefully if you get an error the first time you run it.)

At the same \(\alpha = 0.10\) that we used before, there are now only two significant differences, both involving Lab I: vs. lab VI and vs. lab II. The comparison with lab V is very close to significance at our \(\alpha\), but the other two are not significant, and nor are any of the comparisons involving the other labs.

A couple of observations with the Welch test:

the Welch ANOVA and the Games-Howell test are two separate procedures, so we don’t save the result of the first one to use in the second one.
this implementation of the Games-Howell test requires a genuine factor categorical variable, rather than a text one as we have used everywhere else. So we use the same factor trick to create one as we have done before (eg. when a categorical variable is numbers, which will get treated as quantitative if we are not careful).
in my experience, the P-values from the Games-Howell test tend to be bigger than for Tukey.
The difference here between the aov and the Welch P-values is more substantial than I was expecting, which suggests that maybe the labs really did have different spreads, and therefore it was important to do Welch rather than aov. (This is the same issue as using the Welch vs. pooled \(t\)-tests.)

I guess four can be “a couple”.

\(\blacksquare\)

Run a suitable Mood’s median test on the same data, including any followups if warranted.

Solution

This works exactly the same way as the Mood’s median test you did in the other question. It makes no difference to the test itself whether you have two groups or six:

median_test(egg_powder, Fat, Lab)

$grand_median
[1] 0.37

$table
     above
group above below
  I       6     2
  II      4     4
  III     5     1
  IV      4     3
  V       3     4
  VI      1     6

$test
       what     value
1 statistic 8.3551760
2        df 5.0000000
3   P-value 0.1377171

In this case, Mood’s median test is not significant, so that from this point of view there are no significant differences between the labs, and this is where you stop.

This was originally worth three points, which was rather generous for this, but you might have expected that you would have had to go on with the pairwise median tests. One of the points here is for saying that you need to stop, and doing so.

Extra 1: my guess is that Mood’s median test didn’t came out significant is the small sample sizes, only eight observations from each lab. With so few observations and so many labs, the aboves and belows would have needed to be even more unbalanced than this, and the fat measurements were too spread out for this to happen.

Extra 2: So which test should we use? It’s difficult to be sure, with the small sample sizes in each lab. You might have looked at the normal quantile plots and said that the normality doesn’t look very good, and therefore suggested doing Mood’s median test, but the results from that didn’t indicate that lab I was any different from the other labs.

We can get bootstrapped sampling distributions of the sample means for several groups at once. The code below shows how it goes. If you want to see how it works, run it one line at a time. If you do this, you can temporarily change the third line to, say, sim = list(1:4) to more easily see what is happening. I am doing more simulations to get a better sense of what the normal quantile plots are saying:

egg_powder %>% 
  nest_by(Lab) %>% 
  mutate(sim = list(1:10000)) %>% 
  unnest(sim) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(data$Fat, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line() +
  facet_wrap(~Lab, scales = "free")

I gave each facet its own scale, so we could get the best picture of the normality.

Really, the only one of those that is problematic is lab V, which is a little skewed to the left. I really don’t see any problems for us with any of the others. The issues with them are short tails, which won’t impact the means.

My original boxplots (and the normal quantile plots of the data) indicated that if we are all right with the normality, which it seems we mainly are, we should worry about unequal spreads. With that in mind, I’m thinking that the Welch ANOVA is the right way to go here, and the much stronger significance of the regular ANOVA was actually an illusion.

Extra 3: We haven’t actually used the technician or the sample in our analysis. I’m not going to use the sample, but I do want to think about the technicians. The immediate problem is that the technician labelled one is not the same person in each lab: the technicians used here were chosen from the ones that worked in that lab. If the same two technicians were used in all the labs, we would have a much easier time; it would be a “crossed” design, and we could do a straightforward two-way ANOVA like the ones you may have run into before. But this is a “nested” design, with the technicians “nested within” lab. To model this, we use this new notation:

fat.2 <- aov(Fat ~ Lab / Technician, data = egg_powder)
summary(fat.2)

               Df Sum Sq Mean Sq F value   Pr(>F)    
Lab             5 0.4430 0.08861   9.590 6.99e-06 ***
Lab:Technician  6 0.2475 0.04125   4.464  0.00179 ** 
Residuals      36 0.3326 0.00924                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(we are beyond the capabilities of Welch ANOVA, so we have to go back to aov.)

The notation means “model Fat in terms of Lab and Technician within Lab”. In the ANOVA table below, we get a test for Lab that is more strongly significant than it was before (in the aov). This is because we are testing for differences among Labs after allowing for differences between technicians within a lab.¹ For example, one of the technicians in a lab might tend to give high measurements and the other one low. In the first analysis, this would suggest a large variability in measurements within that lab (which would make it hard to tell the labs apart), but in this analysis, this variability would be taken up by the technician-within-lab line, which is also significant.

If you have run into a randomized blocks analysis, you may be familiar with the term “block” in experimental design. This means, loosely, something that may make a difference but that you don’t care about. Technicians are like that here: we don’t care much about the particular technicians that were used, but we want to get rid of any differences between them, to get a better comparison of the labs.

The Tukey here is rather unwieldy, because it has all sorts of Technician stuff that we don’t care about, but we can look at just the Lab stuff:

TukeyHSD(fat.2)$Lab

           diff         lwr          upr        p adj
II-I   -0.24000 -0.38459081 -0.095409195 2.087576e-04
III-I  -0.17250 -0.31709081 -0.027909195 1.163557e-02
IV-I   -0.20375 -0.34834081 -0.059159195 1.922463e-03
V-I    -0.22625 -0.37084081 -0.081659195 4.901648e-04
VI-I   -0.31250 -0.45709081 -0.167909195 2.133899e-06
III-II  0.06750 -0.07709081  0.212090805 7.240821e-01
IV-II   0.03625 -0.10834081  0.180840805 9.733269e-01
V-II    0.01375 -0.13084081  0.158340805 9.997181e-01
VI-II  -0.07250 -0.21709081  0.072090805 6.611505e-01
IV-III -0.03125 -0.17584081  0.113340805 9.861403e-01
V-III  -0.05375 -0.19834081  0.090840805 8.705387e-01
VI-III -0.14000 -0.28459081  0.004590805 6.240247e-02
V-IV   -0.02250 -0.16709081  0.122090805 9.969635e-01
VI-IV  -0.10875 -0.25334081  0.035840805 2.356038e-01
VI-V   -0.08625 -0.23084081  0.058340805 4.817411e-01

This is rather clearer than it was before: all the comparisons involving Lab I have a P-value less than about 0.01 (the largest one is 0.0116), and none of the others are significant (the smallest of the others is 0.062, labs III vs VI).

I’m not quite sure how to handle Sample, since that may also have a different effect in each lab (and possibly also for each technician).

We should probably not get too excited about these results, because we were worried about differences in spread among the labs. The way to think about that is to treat this ANOVA like a regression and look at its residuals, but that’s taking us too far afield for now.

\(\blacksquare\)

5 Home prices

A realtor kept track of the asking prices of 37 homes for sale in West Lafayette, Indiana, in a particular year. Unfortunately for us, the data are as shown in http://ritsokiguess.site/datafiles/homes.txt. The small numbers, 3 and 4, on lines by themselves, are the number of bedrooms of the homes whose asking prices follow that number. The actual asking prices, in dollars, are aligned in columns, up to six of them in a line.

We need to get these data into a tidy format, with one column being the number of bedrooms and the other being the asking price of the home.

The question has three parts, but you will spend most of your time and thinking on part (b). Part (c) is straightforward once you have done part (b).

Read in the (untidy) data from the file, in a way that you get all the data values. To make sure that R reads all six columns, set up a vector of six column names (it does not matter what they are) and tell the appropriate read_ function to use these as the column names. (You will have to look this last thing up. Say where you found the answer.)
Obtain a tidy dataframe, according to the specifications in the question. Describe your process clearly enough that anyone reading your answer would be able to do it themselves and see why it works.

Hints: you might find it useful to find out about the following, if you don’t already know about them:

drop_na
ifelse
fill

Also think about how you know whether a number in the data file is a selling price or a number of bedrooms.

Use your tidy dataframe to make a boxplot of these data.

Home prices: my solutions

We need to get these data into a tidy format, with one column being the number of bedrooms and the other being the asking price of the home.

The question has three parts, but you will spend most of your time and thinking on part (b). Part (c) is straightforward once you have done part (b).

Read in the (untidy) data from the file, in a way that you get all the data values. To make sure that R reads all six columns, set up a vector of six column names (it does not matter what they are) and tell the appropriate read_ function to use these as the column names. (You will have to look this last thing up. Say where you found the answer.)

Solution

These are aligned columns,² so read_table is what you need. Set up the column names first, any six distinct pieces of text. The way you tell read_table what to use as the column names (when you don’t want to use the first line of the data file) is col_names. If you read the help for read_table (by typing ?read_table in the console, or the same thing online), you’ll see that there is an optional input col_names whose default value is TRUE,³ but which can be changed to FALSE (don’t read any column names at all, using X1, X2 etc.) or a collection of column names that you specify, which is what we want here. I thought this was in my lecture notes, but it appears not to be, not at least until much later.

my_cols <- c("a", "b", "c", "d", "e", "f")
my_url <- "http://ritsokiguess.site/datafiles/homes.txt"
asking0 <- read_table(my_url, col_names = my_cols)


── Column specification ────────────────────────────────────────────────────────
cols(
  a = col_double(),
  b = col_double(),
  c = col_double(),
  d = col_double(),
  e = col_double(),
  f = col_double()
)

Warning: 4 parsing failures.
row col  expected    actual                                           file
  1  -- 6 columns 1 columns 'http://ritsokiguess.site/datafiles/homes.txt'
  4  -- 6 columns 2 columns 'http://ritsokiguess.site/datafiles/homes.txt'
  5  -- 6 columns 1 columns 'http://ritsokiguess.site/datafiles/homes.txt'
  9  -- 6 columns 5 columns 'http://ritsokiguess.site/datafiles/homes.txt'

asking0

It looks as if we have everything that was in the data file, along with a whole bunch of missings (that we will get rid of later). I have given this a temporary name, saving the good name for the eventual tidy dataframe.

If you don’t set the column names first, you’ll get this:

read_table(my_url)


── Column specification ────────────────────────────────────────────────────────
cols(
  `4` = col_double()
)

Warning: 7 parsing failures.
row col  expected    actual                                           file
  1  -- 1 columns 6 columns 'http://ritsokiguess.site/datafiles/homes.txt'
  2  -- 1 columns 6 columns 'http://ritsokiguess.site/datafiles/homes.txt'
  3  -- 1 columns 2 columns 'http://ritsokiguess.site/datafiles/homes.txt'
  5  -- 1 columns 6 columns 'http://ritsokiguess.site/datafiles/homes.txt'
  6  -- 1 columns 6 columns 'http://ritsokiguess.site/datafiles/homes.txt'
... ... ......... ......... ..............................................
See problems(...) for more details.

The problem is that the first row only has one data value on it (the 4 of 4 bedrooms), and so read_table will assume that there is only one data value in each row all the way through, which is not true. This way is actually wrong two ways: the column name 4, that was read from the first line of the data file, is actually data and not a column name.

Setting col_names = FALSE also gets you only one column instead of six, for the same reason.

The only way to tell read_table that there are actually six columns is to specify six column names. Well, not quite the only way. One of the other options in read_table is skip; if you skip the first line of the data file you will at least get six columns:

read_table(my_url, skip = 1, col_names = FALSE)


── Column specification ────────────────────────────────────────────────────────
cols(
  X1 = col_double(),
  X2 = col_double(),
  X3 = col_double(),
  X4 = col_double(),
  X5 = col_double(),
  X6 = col_double()
)

Warning: 3 parsing failures.
row col  expected    actual                                           file
  3  -- 6 columns 2 columns 'http://ritsokiguess.site/datafiles/homes.txt'
  4  -- 6 columns 1 columns 'http://ritsokiguess.site/datafiles/homes.txt'
  8  -- 6 columns 5 columns 'http://ritsokiguess.site/datafiles/homes.txt'

This at least gets all the house prices, but it does not get any indication that the first three rows belong to houses with four bedrooms.

\(\blacksquare\)

Obtain a tidy dataframe, according to the specifications in the question. Describe your process clearly enough that anyone reading your answer would be able to do it themselves and see why it works.

Hints: you might find it useful to find out about the following, if you don’t already know about them:

drop_na
ifelse
fill

Also think about how you know whether a number in the data file is a selling price or a number of bedrooms.

Solution

For the first step, we want to get all the prices in one column. This will unfortunately put the numbers of bedrooms in the same column, but we will separate those out shortly. Another way to say that is that the current six columns have no bearing on how the data should eventually end up: it makes no difference whether a data value is currently in column 1 (a) or 2 (b), for example. So, as is often the case, the first step is a pivot_longer:

asking0 %>% pivot_longer(everything(), 
                        names_to = "old_column", values_to = "price")

All of the prices, along with the numbers of bedrooms and a substantial number of missings, are in the column I called price.⁴

I decided to get rid of the missing values next; you could do it now or later:

asking0 %>% pivot_longer(everything(), 
                        names_to = "old_column", values_to = "price") %>% 
  drop_na(price)

We are down to 39 rows, which is actually right because two of those supposed prices are actually numbers of bedrooms.

You can also do these two steps in one, by adding an option to pivot_longer:

asking0 %>% pivot_longer(everything(), 
                        names_to = "old_column", values_to = "price", values_drop_na = TRUE)

This is perhaps better. I used this idea in lecture (in the tuberculosis example).

So now we need to make a column that is going to contain the number of bedrooms. How do we tell that we are looking at a number of bedrooms? If it’s a small number, say less than 10, it’s a number of bedrooms, and if it’s not, it’s an asking price, and we don’t know (yet) how many bedrooms it goes with.

So my idea is to create a new column called, say, bedrooms, that is the number of bedrooms if that’s what we’re looking at, and missing otherwise.⁵ This is what ifelse⁶ does:

asking0 %>% pivot_longer(everything(), 
                        names_to = "old_column", values_to = "price") %>% 
  drop_na(price) %>% 
  mutate(bedrooms = ifelse(price < 10, price, NA))

In the description of the data, the asking prices that came after 4 were for houses with 4 bedrooms (149900, 169900, …, 269900). The other ones, after the 3, were for houses with 3 bedrooms. That means that the bunch of missing bedrooms after the 4 should be filled in with 4, and the ones after 3 should be filled in with 3. This is exactly how fill works. (I should probably admit that I suspected that this was coming, so I set it up this way with how I did the ifelse. If you found out about fill first, you might also have had the idea that having the other values in the bedrooms column be missing would be a good idea, with your thought process being something like “how can I set this up so that fill will work?”.) Thus:

asking0 %>% pivot_longer(everything(), 
                        names_to = "old_column", values_to = "price") %>% 
  drop_na(price) %>% 
  mutate(bedrooms = ifelse(price < 10, price, NA)) %>% 
  fill(bedrooms)

fill is a very helpful tool. Here, you really need something that looks upwards to see what the previous number of bedrooms was, which is not a tidyverse-friendly kind of thing to do. The only other way I can think of is something like this, which happens to work here:

asking0 %>% pivot_longer(everything(), 
                        names_to = "old_column", values_to = "price") %>% 
  drop_na(price) %>% 
  mutate(junk = ifelse(price < 10, price, 10)) %>%
  mutate(bedrooms = cummin(junk))

cummin is “cumulative minimum”, that is, the minimum seen so far, counting down from the top. This worked here because the four-bedroom houses were listed before the three-bedroom ones, and in my ifelse the value used in junk if the thing in price was not a number of bedrooms was something bigger than either 4 or 3. If the three-bedroom house prices had been listed first, instead of using 10 in junk I would have had to use something smaller like 0, and then use cummax instead of cummin. You can see that the fill idea is a lot more elegant and involves a lot less checking of things.

Nearly there now. Two more steps:

get rid of the rows that have a number of bedrooms rather than an asking price in them (which you can detect because they have a number less than 10 in them for price).
get rid of the column that has the old column names in it (the ones we supplied when we read in the file). It turns out that the only thing we needed them for was to read in the file properly; they served no other purpose:

asking0 %>% pivot_longer(everything(), 
                        names_to = "old_column", values_to = "price") %>% 
  drop_na(price) %>% 
  mutate(bedrooms = ifelse(price < 10, price, NA)) %>% 
  fill(bedrooms) %>% 
  filter(price>10) %>%
  select(-old_column) -> asking
asking

Success! 37 rows for the 37 homes, and the right two columns.

This is basically what I did to create the data file of these data for you (the one you used when you encountered this data earlier, to do some kind of two-sample test).

For you, at a minimum you need the code in my final pipeline just above, plus an explanation of what each line in it does and the logic behind your process. It’s up to you whether you do an explanation of the whole thing, or whether you run it up to a certain point and explain what it has done so far (which is the way I did it).

I’m expecting that this part will take you quite a bit of time, and may involve several false trails (ideas you thought might work but turned out not to). The process of tidying a new data set is more like solving a puzzle than anything else. (At least in this case you know that there is an answer in the end, so this one really is like a puzzle.)

\(\blacksquare\)

Use your tidy dataframe to make a boxplot of these data.

Solution

Now you get to reap the rewards of your hard work: something that’s now easy, or at least easy-ish. The number of bedrooms is the nearest thing we have to a categorical variable, except that it’s actually a number, so we have to make it categorical by passing it into factor:

ggplot(asking, aes(x = factor(bedrooms), y = price)) +
  geom_boxplot()

The asking prices for 4-bedroom homes are on average clearly bigger than for 3-bedroom homes.

as.character also works instead of factor, either above or below. If you are bothered about the funny label on the \(x\)-axis, redefine bedrooms from a number to text/factor before you draw the boxplot:

asking %>% 
  mutate(bedrooms = as.character(bedrooms)) %>% 
  ggplot(aes(x = bedrooms, y = price)) +
    geom_boxplot()

and now the \(x\)-axis has a nice label.

Extra: I think this was the one where, before, I had you take logs of the asking prices first, which made the boxes look more symmetric so that a \(t\)-test was then reasonable to do. These distributions are very much skewed to the right:

asking %>% 
  mutate(bedrooms = as.character(bedrooms)) %>% 
  ggplot(aes(sample = price)) +
    stat_qq() + stat_qq_line() + facet_wrap(~bedrooms)

The distribution of asking prices for 3-bedroom houses is definitely skewed to the right (lower values too bunched up, higher values too spread out). The distribution of asking prices for 4-bedroom houses is surprisingly close to normal; all that is non-normal about it is that the lower tail is too short. The boxplot for these houses suggested a more skewed distribution than this; in fact, another explanation for the boxplot you got is that the lower tail is short compared to the normal. This is a possible explanation for this type of distribution; not a common one, but a possible one. It goes to show that sometimes you need to look at the normal quantile plot to see what is really going on.

\(\blacksquare\)

Footnotes

This is the same principle as including extra explanatory variables in a regression: by including them in the model, you are allowing for their effects in order to get a better assessment of the effects of the things you really care about.↩︎
At least the dataset is not that untidy!↩︎
Meaning, if you don’t specify it, the values from the first line of the file are used as column names.↩︎
There were actually 37 homes altogether, which gives you an idea of just how many missings we have. These are not actual missing data; they came from the blanks in the datafile, which indicate where there might have been data but there was actually none.↩︎
I am setting up for something later; you might have to do a bit of trial and error here.↩︎
Or, should you prefer, case_when.↩︎