Worksheet 8

Published

October 17, 2023

Questions are below. My solutions are below all the question parts for a question; scroll down if you get stuck. There is extra discussion below that for some of the questions; you might find that interesting to read, maybe after tutorial.

For these worksheets, you will learn the most by spending a few minutes thinking about how you would answer each question before you look at my solution. There are no grades attached to these worksheets, so feel free to guess: it makes no difference at all how wrong your initial guess is!

1 Fuel efficiency comparison

Some cars have on-board computers that calculate quantities related to the car’s performance. One of the things measured is the fuel efficiency, that is, how much gasoline the car uses. On an American car, this is measured in miles per (US) gallon. On one type of vehicle equipped with such a computer, the fuel efficiency was measured each time the gas tank was filled up, and the computer was then reset. Twenty observations were made, and are in http://ritsokiguess.site/datafiles/mpgcomparison.txt. The computer’s values are in the column Computer. The driver also calculated the fuel efficiency by hand, by noting the number of miles driven between fill-ups, and the number of gallons of gas required to fill the tank each time. The driver’s values are in Driver. The final column Diff is the difference between the computer’s value and the driver’s value for each fill-up. The data values are separated by tabs.

Read in and display (some of) the data.
What is it that makes this paired data? Explain briefly.
Draw a suitable graph of these data, bearing in mind what you might want to learn from your graph.
Is there any difference between the average results of the driver and the computer? (Average could be mean or median, whichever you think is best). Do an appropriate test.
The people who designed the car’s computer are interested in whether the values calculated by the computer and by the driver on the same fill-up are usually close together. Explain briefly why it is that looking at the average (mean or median) difference is not enough. Describe what you would look at in addition, and how that would help.

My solutions

Read in and display (some of) the data.

Solution

library(tidyverse)

This is like the athletes data, so read_tsv:

my_url <- "http://ritsokiguess.site/datafiles/mpgcomparison.txt"
fuel <- read_tsv(my_url)

Rows: 20 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
dbl (4): Fill-up, Computer, Driver, Diff

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

fuel

20 observations, with columns as promised. The first column Fill-up labels the fill-ups from 1 to 20, in case we needed to identify them. (This is an additional clue that we have paired data, or repeated measures in general.¹)

See the Extra for another way to read in these data (and why it works).

\(\blacksquare\)

What is it that makes this paired data? Explain briefly.

Solution

There are two measurements for each fill-up: the computer’s calculation of gas mileage, and the driver’s. These go together because they are two calculations of what is supposed to be the same thing. (Another hint is that we have differences, with the veiled suggestion that they might be useful for something.)

If you want to have a general way of tackling this kind of problem, ask yourself “is there more than one observation of the same thing per individual?” Then go on and talk about what the individuals and observations are for the data you’re looking at. In this case, the individuals are fill-ups, and there are two observations per fill-up: one by the driver and one by the computer. Compare this with the children learning to read (from lecture): there were \(23+21 = 44\) children altogether (individuals), and each child had one reading score, based on the reading method they happened to be assigned to. So those are 44 independent observations, and a two-sample test is the way to go for them.

\(\blacksquare\)

Draw a suitable graph of these data, bearing in mind what you might want to learn from your graph.

Solution

In a matched pairs situation, what matters is whether the differences have enough of a normal distribution. The separate distributions of the computer’s and driver’s results are of no importance. So make a graph of the differences. We are specifically interested in normality, so a normal quantile plot is best:

ggplot(fuel, aes(sample = Diff)) + stat_qq() + stat_qq_line()

The next best plot is a histogram of the differences:

ggplot(fuel, aes(x = Diff)) + geom_histogram(bins = 6)

This one actually looks left-skewed, or has an outlier at the bottom. The appearance may very well depend on the number of bins you choose.

\(\blacksquare\)

Is there any difference between the average results of the driver and the computer? (Average could be mean or median, whichever you think is best). Do an appropriate test.

Solution

The choices are a matched-pairs \(t\)-test, or a matched-pairs sign test (on the differences). To choose between those, look back at your graph. My take is that the only possible problem is the smallest (most negative) difference, but that is not very much smaller than expected. This is especially so given the sample size (20), which means that the Central Limit Theorem will help us enough to take care of the small outlier.

I think, therefore, that a paired \(t\)-test is the way to go, to test the null that the mean difference is zero (against the two-sided alternative that it is not zero, since we were looking for any difference). There are two ways you could do this: as literal matched pairs:

with(fuel, t.test(Computer, Driver, paired = TRUE))


    Paired t-test

data:  Computer and Driver
t = 4.358, df = 19, p-value = 0.0003386
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 1.418847 4.041153
sample estimates:
mean difference 
           2.73

or, as a one-sample test on the differences:

with(fuel, t.test(Diff, mu = 0))


    One Sample t-test

data:  Diff
t = 4.358, df = 19, p-value = 0.0003386
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 1.418847 4.041153
sample estimates:
mean of x 
     2.73

with the same result.

If you thought that low value was too much of an outlier, the right thing would have been a sign test that the median difference is zero (vs. not zero), thus (using the smmr package):

sign_test(fuel, Diff, 0)

$above_below
below above 
    3    17 

$p_values
  alternative     p_value
1       lower 0.999798775
2       upper 0.001288414
3   two-sided 0.002576828

In any of those cases, we conclude that the average difference is not zero, since the P-values are less than 0.05. (The right one for the sign test is 0.0026.)

I don’t mind which test you do, as long as it is one of the ways of doing matched pairs, and you justify your choice by going back to your graph. Doing one of the tests without saying why is being a STAB22-level statistician (in fact, it may not even be that), not a professional-level applied statistician.

\(\blacksquare\)

The people who designed the car’s computer are interested in whether the values calculated by the computer and by the driver on the same fill-up are usually close together. Explain briefly why it is that looking at the average (mean or median) difference is not enough. Describe what you would look at in addition, and how that would help.

Solution

The mean (or median) difference can be close to zero without the individual differences being close to zero. The driver could be sometimes way higher and sometimes way lower, and the average difference could come out close to zero, even though each individual pair of measurements is not that close together. Thus looking only at the average difference is not enough.

There are (at least) two (related) things you might choose from to look at, in addition to the mean or median:

the spread of the differences (the SD or inter-quartile range). If the average difference and the spread of differences are both close to zero, that would mean that the driver’s and computer’s values are consistently similar.
a suitable confidence interval here (for the mean or median difference, as appropriate for what you did) would also get at this point.

To think about the spread: use the SD if you did the \(t\)-test for the mean, and use the IQR if you did the sign test for the median, that is, one of these:

fuel %>% summarize(diff_sd = sd(Diff))

fuel %>% summarize(diff_iqr = IQR(Diff))

as appropriate. Make a call about whether you think your measure of spread is small or large. If you think it’s small, you can say that the differences are consistent with each other, and therefore that the computer’s and driver’s values are typically different by about the same amount (we concluded that they are different). If you think your measure of spread is large, then the driver’s and computer’s values are inconsistently different from each other (which would be bad news for the people who designed the car’s computer, as long as you think the driver was being reasonably careful in their record-keeping).

If you said that the confidence interval for the mean/median was the thing to look at, then pull the interval out of the \(t\)-test output or run ci_median, as appropriate. The computer’s miles per gallon is consistently between 1.4 and 4.0 miles per gallon higher (mean) or 1.1 and 4.5 miles per gallon higher (median). Make a call about whether you consider this interval short or long, bearing in mind it’s based on the difference between two numbers that are about 40, and then say that the computer’s measurement is larger than the driver’s by a consistent amount (if you think the interval is short) or by an amount that varies quite a bit (if you think the interval is long).

\(\blacksquare\)

2 Canned tuna

Light tuna is sold in 170-gram cans. The tuna can be canned in either water or oil. Is there a difference in the typical selling price of tuna canned in water or in oil? To find out, 25 supermarkets were sampled. In 14 of them (randomly chosen), the price of a (randomly chosen) brand of tuna in water was recorded, and in the other 11 supermarkets, the price of a (randomly chosen) brand of tuna in oil was recorded. The data are in http://ritsokiguess.site/datafiles/tuna.csv, with the column canned_in saying whether the tuna was canned in oil or water, and the price being in cents.

You used these data previously to draw normal quantile plots.

Read in and display (some of) the data.
Make a suitable graph of these data.
We decided previously that running a \(t\)-test was a bad idea here. Run Mood’s median test for these data to determine whether the median selling price differs between tuna canned in water and in oil. What do you conclude, in the context of the data?
Explain briefly why the counts of values above and below the overall median (in the previous part) is entirely consistent with the P-value that you found.

Canned tuna: My solutions

Read in and display (some of) the data.

Solution

Making it easy for you:

my_url <- "http://ritsokiguess.site/datafiles/tuna.csv"
tuna <- read_csv(my_url)

Rows: 25 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): canned_in
dbl (1): price

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

tuna

Scrolling down reveals that the cans of tuna canned in oil are at the bottom.

\(\blacksquare\)

Make a suitable graph of these data.

Solution

You are probably expecting a boxplot, and with one quantitative variable (price) and one categorical one (canned_in) that is exactly the thing:

ggplot(tuna, aes(x = canned_in, y = price)) + geom_boxplot()

Another graph you might consider is a normal quantile plot. This is particularly true if you have been looking ahead, and realize that normality of each of the two samples is something we should be concerned about. There are two samples, so a facetted normal quantile plot is the thing:

ggplot(tuna, aes(sample = price)) + stat_qq() + stat_qq_line() +
  facet_wrap(~ canned_in)

\(\blacksquare\)

We decided previously that running a \(t\)-test was a bad idea here. Run Mood’s median test for these data to determine whether the median selling price differs between tuna canned in water and in oil. What do you conclude, in the context of the data?

Solution

These distributions are far from normal (skewed to the right), and (as we determined before) the sample sizes are small (14 and 11), so it doesn’t look as if the Central Limit Theorem will be much help.

For Mood’s median test, use smmr (easier). I didn’t say anything about building it yourself:

median_test(tuna, price, canned_in)

$grand_median
[1] 67

$table
       above
group   above below
  oil       5     5
  water     7     6

$test
       what      value
1 statistic 0.03350816
2        df 1.00000000
3   P-value 0.85475695

Our null hypothesis is that the median price of tuna canned in water and tuna canned in oil is the same. Our alternative is two-sided, since we are looking for any difference: that is, that the two median prices are different. With a P-value of 0.855, much larger than 0.05, we cannot reject the null hypothesis (we “retain” the null, if you like that word better). There is no evidence of a difference in the median price of tuna canned in water and tuna canned in oil.

\(\blacksquare\)

Explain briefly why the counts of values above and below the overall median (in the previous part) is entirely consistent with the P-value that you found.

Solution

We failed to reject that the two group medians are equal, so we would expect to see a more or less even split of values above and below the overall median in both the groups. Specifically, that means we would expect close to half of the values in each group to be below the overall median and close to half of them to be above. (It has to be “half” because the overall median has to have half of all the data values below and half above).

To see how close that was to happening, look at the other part of the output in the previous part. You might have to click on R Console to see it. It turns out that for the tuna canned in oil, 5 prices were above the overall median and 5 were below, an exactly even split, and for the tuna canned in water, 7 were above and 6 below, very close to an even split. (This is, in fact, exactly the sort of thing we would have expected to see if the null hypothesis were actually true.) These are extremely close to being 50-50 splits, and so not only would we not expect to reject the null hypothesis, but we would also expect to get the very large P-value that we did.

(In case you were wondering where the other two values went, since there were 11 oil prices and 14 water ones, they were exactly equal to the overall median and so got counted as neither above nor below.)

\(\blacksquare\)

3 Fuel efficiency comparison: extras

(a):

Extra: since the “tab” is a delimiter, distinguishing one data value from the next, you might be wondering whether you could use read_delim somehow. It turns out that you can, but you have to figure out how to represent “tab” as something that read_delim can understand. One way is to search for it, of course, but another way is to see what happens when you use read_delim the way you already know (assuming the data values are separated by spaces):

my_url <- "http://ritsokiguess.site/datafiles/mpgcomparison.txt"
fuel0 <- read_delim(my_url, " ")

Rows: 20 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
chr (1): Fill-up    Computer    Driver  Diff

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

fuel0

This fails, but how it fails is instructive.² There is now one column, amusingly called Fill-up Computer Driver Diff, whose values are text, the four numbers glued together with \t in between. This is how R recognizes a tab character.³ So you can try read_delim by using this as the second input rather than a space.

You could get to this point with less insight by saying to yourself “this thing \t, whatever it is, is what separates the data values, so how about I try this as the second input to read_delim?”

Here’s how it goes:

my_url <- "http://ritsokiguess.site/datafiles/mpgcomparison.txt"
fuel0 <- read_delim(my_url, "\t")

Rows: 20 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
dbl (4): Fill-up, Computer, Driver, Diff

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

fuel0

and it works, with four numeric columns as promised.

Another option is to use read_delim without a second input:

my_url <- "http://ritsokiguess.site/datafiles/mpgcomparison.txt"
fuel0 <- read_delim(my_url)

Rows: 20 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
dbl (4): Fill-up, Computer, Driver, Diff

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

fuel0

This works, but you need to show that you know what is going on. One way is to look at the message from read_delim that says what got read in:

── Column specification ────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
dbl (4): Fill-up, Computer, Driver, Diff

The four quantitative columns, but look above that, where it says Delimiter: \t. The delimiter is what separates each data value from the next one on the same line. The symbol \t means “tab”, which you could assert, but it is better to offer some sort of evidence that this is correct. To do that, search for something like R \t.

The first hit for me was actually not R-related, but was a Stack Overflow question (scroll down to the two answers) that has the answer we need. The symbol \t comes from Unix (and is therefore also in Linux and the C programming language, and because R’s heritage is C, R also).

(b):

Extra 1: To go back to our gas mileages, here is another way you might have received the data:

This looks more like a two-sample layout, with one column of gas_mileage values and a second column indicating whether the Driver or Computer observed⁴ them. But the first two rows were observed at the same fill-up, so we know that these are actually paired (by fill-up) and the layout of the data for our analysis is wrong. (Later in the course, we learn how to convert between the previous “wider” layout and this “longer” one. I didn’t show you the code here, because I didn’t want to confuse you at this point.)

Extra 2: if your individuals have more than two measurements each, you can’t shoehorn it into a one-sample model any more, because there are too many differences among⁵ measurements now. These are repeated measures data, and the modern way to analyze these is using mixed models. The student GPA example in the link shows how it goes; this is repeated measures because we have six observations (of GPA) for each individual (student).

(c):

Extra: the distributions of the computer’s and driver’s gas mileage measurements themselves are totally irrelevant to whether you would run a matched pairs test of any kind; the only thing that matters is the distribution of the differences.

The only other graph you could justify (and even then it is a bit of a stretch) would be a scatterplot of the two measurements at each fill-up:

ggplot(fuel, aes(x = Driver, y = Computer)) + geom_point() + 
  geom_abline(intercept = 0, slope = 1)

By adding the line \(y = x\) (which has intercept 0 and slope 1) to the graph, you can see how the pairs of measurements compare to each other; for most of them, the computer’s measurement is larger than the corresponding driver’s measurement (the point is above the line), and for only a few is it smaller. If you had decided to do a matched-pairs sign test, this graph would be interesting (with the line), because the sign test is based on how many points are above the line vs. how many are below. But what this graph does not do is tell you whether you need to do a matched-pairs sign test in the first place, because it does not tell you about the distribution of the differences (in particular, whether that is normal enough). Thus, the scatterplot, even with the line, is inferior to even the histogram of differences.

(d):

Extra 1: you could find out how the average difference is not zero by looking at a confidence interval; the one for the mean difference is from 1.4 to 4.0. If you did a sign test, you would do this to get a confidence interval:

ci_median(fuel, Diff)

[1] 1.102148 4.500000

1.1 to 4.5, a little wider,⁶ but otherwise very similar.

So we think the computer was estimating the gas mileage to be that much higher than the driver.

Extra 2: The P-value for the \(t\)-test was almost 10 times smaller than for the sign test. This is because the \(t\)-test was using the actual differences, not just whether they were greater or less than zero as the sign test did (17 out of the 20 differences were positive). If you thought, as I did, that it was all right to use the \(t\)-test, then the smaller P-value is trustworthy. (The usual thing, if two tests give similar results, as they do here, is that the one with the smaller P-value is better, but that is not always something to rely on. You still need to make an assessment.)

Extra 3: A bootstrapped sampling distribution of the sample mean would be a better way to decide how happy you are with the \(t\)-test. This will incorporate the sample size into your consideration. It is really the same thing as the one for the one-sample \(t\), bearing in mind that it’s the differences you need to work with:

tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(fuel$Diff, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()

This is a tiny bit left-skewed still (the lowest values are very slightly too low, and the highest ones are slightly bunched up). The apparent low-end outlier in the distribution of differences barely even shows up in the sampling distribution. (If we had a smaller sample size, like 10 or especially 5, we would see it here more clearly, and then we would know that a sample size of 5 or 10 is not big enough.)

In other words, despite what you might think from your earlier plot, the matched pairs \(t\)-test would be very much defensible here.

(e):

Extra: in a situation like the one here, where there is a significant difference between the means, knowing also that the standard deviation is small would tell us that the computer is consistently giving a value that is bigger than the driver’s by about the same amount every time. Another way to think about this is to plot the driver’s and computer’s values against each other with a (solid) reference line⁷ like this:

ggplot(fuel, aes(x = Driver, y = Computer)) + geom_point() +
  geom_abline(intercept = 0, slope = 1) +
  geom_abline(intercept = 3, slope = 1, linetype = "dashed")

The solid reference line is where a point would be if the driver and computer had the same value. Here, there seems to be some consistency in the points being up and to the left of, but parallel to, the reference line. This would be saying that the computer’s value would be fairly consistently more than the driver’s by about 3 miles per gallon (eyeballing the plot; the dashed line), with some exceptions.

Footnotes

If we had 40 independent miles-per-gallon measurements, there would be no reason to have the Fill-up column, because there would be no basis to link a Computer value with a particular Driver one.↩︎
Actually, it works, but not the way we wanted it to: there is no error here, just not the kind of data we wanted.↩︎
It may look like two characters, but the backslash is called an “escape character” and doesn’t count.↩︎
I seem to be attributing sentience to the Computer here. Maybe recorded_by would have been a better name.↩︎
To be precise linguistically, there are “differences between” two things, as in matched pairs, and “differences among” three or more things, as in ANOVA.↩︎
This is very similar to the thing with the P-values in Extra 1; the sign test doesn’t use the data as efficiently, so its interval is a little wider.↩︎
This is the line \(y = x\), which has intercept 0 and slope 1.↩︎