STAC32 Assignment 6

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

1 Digestion in horses

Horses eat straw, and horse breeders are interested in what will make straw easier for the horses to eat. In an experiment with six horses the digestibility coefficient (in suitable units, with a higher value meaning that the horse finds the straw easier to digest) was measured twice for each horse: once after the horse had been fed straw treated with NaOH (sodium hydroxide) and once after the horse had been fed ordinary straw. The results are in http://ritsokiguess.site/datafiles/digestcoefs.csv. The dataset contains columns identifying the horses, the digestibility coefficient when that horse ate ordinary straw, when it ate the sodium-hydroxide straw, and the difference between the two digestibility coefficients (naoh minus ordinary). The aim of the experiment was to identify any differences between the two treatments.

(a) (1 point) Read in and display the data.

There are only six rows, so you can display all of it this time:

my_url <- "http://ritsokiguess.site/datafiles/digestcoefs.csv"
horses <- read_csv(my_url)
Rows: 6 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (4): horse, ordinary, naoh, diff

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
horses

You can check that the columns referred to in the question are respectively: horse, ordinary, naoh, and diff.

(b) (2 points) Why is this a matched pairs experiment? Explain briefly.

Each horse was fed both of the two types of straw, so that there are two measurements on each individual horse. (The best answer gets at “there are two observations for each individual” by making it clear that you know which are the individuals (horses) and which are the observations (the two digestibility coefficients for the two types of straw).)

(c) (3 points) Make a graph that can be used to assess the key assumption of a matched pairs t-test. Explain briefly how your graph shows no obvious problems.

The best graph is a normal quantile plot of the differences, since (sufficient) normality of the differences is what matters here. The column diff contains the differences, so you can use them directly (no need to calculate differences):

ggplot(horses, aes(sample = diff)) + stat_qq() + stat_qq_line()

There is no real deviation from the line; the points are all close to it, and thus the differences are acceptably normal in distribution. This is especially important here since the sample size (6) is so small. The second-best graph is something like a histogram of the differences, but it’s hard to judge normality with so few observations:

ggplot(horses, aes(x = diff)) + geom_histogram(bins = 3)

The best you can say here is “it’s not obviously non-normal”.

A one-sample boxplot is likely to convey the same sort of message (or non-message):

ggplot(horses, aes(x = 1, y = diff)) + geom_boxplot()

There is nothing obviously wrong here.

Extra: with only six observations, there is a lot that can happen by chance. If you had more observations than this, the bendiness in the normal quantile plot might be interesting: from the bottom left, a point on the line, one below, one on, two above, and one below. With more observations, this sort of thing might be indicative of short tails or of a second peak. But with so few, as we actually have, we cannot really say much about what the population looks like, and the best we can do is to say that it is “not obviously non-normal” or something like that; the double negative expresses that it might be normal and it might not be, and this is about as much as we can say.

(d) (3 points) Run a suitable matched pairs \(t\)-test. What do you conclude, in the context of the data?

This is the standard way to do it, two-sided since we are interested in any difference:

with(horses, t.test(ordinary, naoh, paired = TRUE))

    Paired t-test

data:  ordinary and naoh
t = -8.4769, df = 5, p-value = 0.0003754
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -16.877049  -9.022951
sample estimates:
mean difference 
         -12.95 

Or, since you have the differences, use them in a one-sample test:

with(horses, t.test(diff, mu = 0))

    One Sample t-test

data:  diff
t = 8.4769, df = 5, p-value = 0.0003754
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
  9.022951 16.877049
sample estimates:
mean of x 
    12.95 

or run either of these using the dollar sign rather than with.
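For example, the dollar-sign version of the first test looks like this (the output is identical to the `with` version above):

```r
# The same paired test, pulling out the columns with dollar signs
t.test(horses$ordinary, horses$naoh, paired = TRUE)
```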

The P-value of 0.0004 is (much) smaller than 0.05, so we can reject the null hypothesis, and therefore conclude that there is a difference in the mean digestibility of ordinary straw and straw treated with sodium hydroxide.

You actually don’t need the mu = 0 in the second variation (it is the default), though I think it is clearer with it in.

(e) (2 points) Obtain a 95% confidence interval for the mean difference, and interpret it in the context of the data.

We did the test two-sided above, so you just read the interval from the output: -16.9 to -9.0 in the first version of the test. To interpret this, note that the differences were ordinary minus NaOH here, so with 95% confidence, the digestibility of the sodium-hydroxide-treated straw was between 9.0 and 16.9 units higher on average. (If you used the diff column, the confidence interval comes out positive because of the way around the differences were.)

We don’t have the units, and, since the data were given to one decimal, you can give two decimals (but not more) in your interval. Make sure you get across the idea of which digestibility coefficient is higher, and by how much.
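If you want the rounded interval directly, you can save the test output and extract the interval from it, since the `conf.int` piece of R’s `t.test` output holds the two ends (a sketch; `tt` is just a name I made up):

```r
# Save the paired test, then pull out the confidence interval and
# round it to two decimals
tt <- with(horses, t.test(ordinary, naoh, paired = TRUE))
round(tt$conf.int, 2)
```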

The inference in this question needs to be two-sided, because without looking at the data, the researchers had no idea which way around the difference would be (if it was significant). It’s pretty clear which digestibility is higher once you look at the data, but you cannot do that when making your hypotheses.

2 Energy expenditure

Do lean people expend more energy than obese people? To find out, researchers took samples of lean and obese people and measured the 24-hour energy expenditure of each one (in megajoules). The data are in http://ritsokiguess.site/datafiles/energy.csv.

(a) (1 point) Read in and display (some of) the data.

As usual:

my_url <- "http://ritsokiguess.site/datafiles/energy.csv"
energy <- read_csv(my_url)
Rows: 22 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): stature
dbl (1): expend

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
energy

There are two columns: expend is the amount of energy expended, and stature is whether the person in question is lean or obese.

(b) (2 points) How many observations are there in each sample?

This is as simple as count:

energy %>% count(stature)

There are 13 lean and 9 obese people, which you should say.

You could, I suppose, count them by hand, but why expend the energy1 on that when you have R to do it for you?

(c) (3 points) Make a suitable plot of the data, and comment briefly on how the lean and obese people compare on average in terms of energy expenditure.

A boxplot, with one quantitative and one categorical variable:

ggplot(energy, aes(x = stature, y = expend)) + geom_boxplot()

The energy expenditure is on average (median) much higher for obese people.

This plot really needs to be a boxplot, so that you can compare averages (medians).

Extra: Also, the spread is larger for them, and there are outliers in the lean group.

(d) (2 points) Explain briefly, on the basis of what you have seen so far, why a two-sample t-test will not be appropriate for comparing the energy usage of lean and obese people.

We need both groups to be sufficiently normal, given their sample sizes. Looking at the boxplot, the lean group has severe outliers, both lower and upper, so sufficient normality fails there, even with a sample size of 13. (The obese group also appears to have a right-skewed distribution, but you don’t need to consider this once you’ve found one group that fails for sufficient normality. In any case, the sample size of 9 might be big enough for the obese people.)

At this point, you could if you wanted draw a normal quantile plot (facetted for the two groups), though, since the boxplot tells us such a clear story here, it’s not clear that we will learn much more from it:

ggplot(energy, aes(sample = expend)) + stat_qq() + stat_qq_line() +
  facet_wrap(~ stature, scales = "free")

I put each plot on its own scale, since we have already seen that the typical values differ between the two groups.

The outliers in the lean group (one at the bottom and two at the top) show up very clearly. There is really no need to look at the obese group at all (since we already know that the \(t\)-test will fail), but you might see a sort-of right-skewed shape: there is a curve, even though the three highest values are right where you’d expect them to be on a normal distribution. See below to see whether the sample size of 9 is actually large enough to overcome that.

Extra: the inevitable bootstrapped sampling distributions of the sample means:

energy %>% nest_by(stature)

The two mini-data-frames in the list-column data have all the other columns that were in the original dataframe energy, that is to say, just expend, just for each of the people that were lean, and just for the ones that were obese. The [,1] under data is shorthand for “this has some number of rows (different for lean and obese), but only one column”.

Now that we have the data we want, let’s set up simulations:

energy %>% nest_by(stature) %>% 
  mutate(sim = list(1:10000)) %>% 
  unnest(sim)

Despite what you might see here, there are actually 20,000 rows, 10,000 each for lean and for obese.2

Now, we want a sample from each (from the expend in the appropriate data) and the mean of each of those samples, in familiar fashion:

energy %>% nest_by(stature) %>% 
  mutate(sim = list(1:10000)) %>% 
  unnest(sim) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(data$expend, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line() +
    facet_wrap(~stature, scales = "free")

The last two lines (or, one line broken into two parts) make normal quantile plots, separately for each group because each group is supposed to have a sampling distribution of the sample mean that is close to normal. I used scales = "free" because the typical energy usage of the two groups seems different. Both sets of points are pretty close to their lines, but:

  • the lean group is definitely skewed to the right. This is because of those outliers, and to the right because there were two extreme outliers at the top and only one at the bottom.3 This may not be as bad as you were expecting, with a sample size of only 13.

  • the obese group is actually not skewed to the right, so that the sample size, even though only 9, was large enough to overcome that. What happened here is that the lower tail came out to be short, because the lower tail of the original distribution was also short. This is not in itself a problem for the t-test, but you only need one group to go wrong, and the lean group did.

(e) (4 points) Run a suitable test to see whether obese people use more energy on average than lean people, on the evidence in this dataset. What do you conclude, in the context of the data?

Having ruled out a t-test, we are left with Mood’s median test, which compares the median of the two groups:

median_test(energy, expend, stature)
$grand_median
[1] 8.595

$table
       above
group   above below
  lean      2    11
  obese     9     0

$test
       what        value
1 statistic 1.523077e+01
2        df 1.000000e+00
3   P-value 9.514059e-05

Two points for doing this.

The P-value is 0.000095, very small. Remember that this test is two-sided, so this enables us to conclude only that the median energy consumption of the two groups is different. The energy consumption of the obese people might be higher or lower on this evidence.

One more point for making a two-sided conclusion from the evidence so far.

To make a one-sided conclusion, you need to argue that we are on the “correct side”: that is, if there is any difference, it is the obese people that are higher on average. I can see two ways you might do that here (either one of these, or something similar, is good):

  • from the boxplot, the obese people have a higher median energy expenditure

  • from the table of above and below, most of the lean people expended less than the grand median of 8.595, and all of the obese people expended more.
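The second of those bullets can be checked by hand, if you like. This sketch reproduces the above-and-below table from the median_test output; it works here because no observation is exactly equal to the grand median:

```r
# Count, for each group, how many observations fall above or below
# the grand median of all the energy expenditures
energy %>% 
  mutate(side = ifelse(expend > median(expend), "above", "below")) %>% 
  count(stature, side)
```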

Thus we are on the correct side, and we can halve the P-value to get the even smaller

9.514e-5 / 2
[1] 4.757e-05

and we can conclude that obese people do indeed consume more energy than lean people. The last point for getting this far, or the last two points if you successfully went straight to a one-sided test.

Extra: if you were wondering about how the t-test comes out, well, lean is before obese alphabetically, so:

t.test(expend ~ stature, data = energy, alternative = "less")

    Welch Two Sample t-test

data:  expend by stature
t = -3.8555, df = 15.919, p-value = 0.0007053
alternative hypothesis: true difference in means between group lean and group obese is less than 0
95 percent confidence interval:
      -Inf -1.220763
sample estimates:
 mean in group lean mean in group obese 
           8.066154           10.297778 

and the conclusion is actually about the same.

You are probably expecting to find now a confidence interval for the difference in medians, but there is no obvious way to do that. Mood’s median test as we have done it tests only that the medians are the same (by computing the grand median and counting observations above and below that). To get a confidence interval, we would have to test whether the medians differed by a certain amount (and then include it in the confidence interval if that difference was not rejected, as for using the sign test to get a CI for a single median), which we do not have a good way to do. This question on Stack Overflow suggests a couple of ideas. One idea is based on the Hodges-Lehmann estimator, which is well-known but not quite the right thing for us.4 Another is based on the bootstrap, which ought to be understandable to us: draw bootstrap samples from each group, take the difference in medians, and take the middle 95% of that distribution. The starting point is the same as I did before to assess the normality of the sampling distributions of the sample means:

energy %>% nest_by(stature)

with the list-column data containing two mini data frames, one for lean and one for obese, each with one column called expend. Then we set up 4 bootstrap samples5 from each:

energy %>% nest_by(stature) %>% 
  mutate(sim = list(1:4)) %>% 
  unnest(sim) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(data$expend, replace = TRUE)))

Now we need the medians of each of those:

energy %>% nest_by(stature) %>% 
  mutate(sim = list(1:4)) %>% 
  unnest(sim) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(data$expend, replace = TRUE))) %>% 
  mutate(my_median = median(my_sample))

Here is where things turn different: we want to put the medians for the same simulation next to each other and subtract them, to get a bootstrap version of what the difference in medians might look like. This is an unexpected use of group_by and summarize:

energy %>% nest_by(stature) %>% 
  mutate(sim = list(1:4)) %>% 
  unnest(sim) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(data$expend, replace = TRUE))) %>% 
  mutate(my_median = median(my_sample)) %>% 
  group_by(sim) %>% 
  summarize(med1 = my_median[1], med2 = my_median[2])

We group by simulation (to keep things from the same simulation together) and then pull out the first median that belongs to each simulation (the one from the lean people) and then the second one (from the obese people). Square brackets are used to grab a particular median within a group.
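If the square brackets seem too mysterious, an alternative (a sketch, again with only 4 simulations) is to spread the two medians for each simulation into their own columns with pivot_wider, after which the subtraction is a plain mutate:

```r
energy %>% nest_by(stature) %>% 
  mutate(sim = list(1:4)) %>% 
  unnest(sim) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(data$expend, replace = TRUE))) %>% 
  mutate(my_median = median(my_sample)) %>% 
  ungroup() %>% 
  select(stature, sim, my_median) %>% 
  # one column of medians per group, one row per simulation
  pivot_wider(names_from = stature, values_from = my_median) %>% 
  mutate(diff = lean - obese)
```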

Then we take the difference of the two medians within each sim, and pull out the 2.5 and 97.5 percentiles of the whole distribution of the differences (so that 95% of the bootstrap distribution is between them):

energy %>% nest_by(stature) %>% 
  mutate(sim = list(1:4)) %>% 
  unnest(sim) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(data$expend, replace = TRUE))) %>% 
  mutate(my_median = median(my_sample)) %>% 
  group_by(sim) %>% 
  summarize(med1 = my_median[1], med2 = my_median[2]) %>% 
  mutate(diff = med1 - med2) %>% 
  summarize(lwr = quantile(diff, 0.025), upr = quantile(diff, 0.975))

There is our proof of concept, even though it is kind of dopey with only four simulations. So let’s up the number of simulations to 10,000 (we are using the extremes of the distribution, so we need more simulations to estimate them accurately):

energy %>% nest_by(stature) %>% 
  mutate(sim = list(1:10000)) %>% 
  unnest(sim) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(data$expend, replace = TRUE))) %>% 
  mutate(my_median = median(my_sample)) %>% 
  group_by(sim) %>% 
  summarize(med1 = my_median[1], med2 = my_median[2]) %>% 
  mutate(diff = med1 - med2) %>% 
  summarize(lwr = quantile(diff, 0.025), upr = quantile(diff, 0.975))

and that, after some considerable effort, is a 95% CI for the difference in medians.

There is usually something funny about bootstrapping medians. Let’s take a look at the distribution of the column of differences, via a histogram:

energy %>% nest_by(stature) %>% 
  mutate(sim = list(1:10000)) %>% 
  unnest(sim) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(data$expend, replace = TRUE))) %>% 
  mutate(my_median = median(my_sample)) %>% 
  group_by(sim) %>% 
  summarize(med1 = my_median[1], med2 = my_median[2]) %>% 
  mutate(diff = med1 - med2) %>% 
  ggplot(aes(x = diff)) + geom_histogram(bins = 40)

I’ve used more bins than I usually would to show you that this is not close to a normal distribution, or even any kind of regular unimodal one.6 The problem is that the numerical values in a bootstrap sample are the same as the ones in the original sample (with possibly more or less of each), and the median must be one of those values (or possibly halfway between two of them if the sample size is even). So there are not many possible values for the median of a bootstrap sample, and thus not very many possible values for the difference between the sample medians, and therefore you get this kind of “clumpy” bootstrap distribution. As far as getting a confidence interval is concerned, this is not really a problem (you just take the appropriate percentiles, or, in more advanced work with the bootstrap, adjust things a bit, such as by using a “bias-corrected and accelerated” interval.7).
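If you want to pursue the bias-corrected-and-accelerated idea, the boot package will compute such an interval for you. This is a sketch, assuming boot is installed; med_diff is a helper I made up that computes the difference in group medians from a resampled dataframe, and strata makes the resampling happen separately within each group:

```r
library(boot)
# difference in medians, lean minus obese, for one bootstrap resample;
# i is the vector of resampled row numbers supplied by boot()
med_diff <- function(d, i) {
  dd <- d[i, ]
  with(dd, median(expend[stature == "lean"]) - median(expend[stature == "obese"]))
}
b <- boot(energy, med_diff, R = 10000, strata = factor(energy$stature))
boot.ci(b, type = c("perc", "bca"))
```

The "perc" interval should come out close to the one we built by hand above; the "bca" one adjusts it in the way footnote 7 alludes to.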

Footnotes

  1. Joke.↩︎

  2. This is because the pager for displaying the dataframe has an upper limit of 10,000 rows or 1000 pages. After all, are you really going to click over 1000 times to get down to the obese people?↩︎

  3. I suspect those three or so points at the bottom were when the bootstrap sample happened to contain three or so copies of the low outlier.↩︎

  4. It uses the same idea of “what values would not be rejected by a test”, but from a different test, the rank sum test, which I think is inferior because it assumes the two distributions you are comparing have equal spread, in the same way that the pooled t-test does.↩︎

  5. To see whether it works. Once it does, we can scale it up.↩︎

  6. If we had done this for the mean, the distribution would probably have been a bit skewed, but would have had one clear peak and a nice smooth shape.↩︎

  7. The wisdom is that the percentile-based interval we got will generally be too short.↩︎