STAC32 Assignment 4

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

1 Prison stress

Being in prison is stressful, for anybody. 26 prisoners took part in a study where their stress was measured at the start and end of the study. Some of the prisoners, chosen at random, completed a physical training program (for these prisoners, the Group column is Sport) and some did not (Group is Control). The researchers’ main aim was to see whether the physical training program reduced stress on average in the population of prisoners. The data are in http://www.ritsokiguess.site/datafiles/PrisonStress.csv, in four columns, respectively an identifier for the prisoner, whether or not they did physical training, their stress score at the start of the study, and their stress score at the end.

(a) (2 points) Read in and display (some of) the data.

Very much the usual. Give the dataframe a name of your choosing; stress is good as a name because none of the columns are actually called stress:

my_url <- "http://www.ritsokiguess.site/datafiles/PrisonStress.csv"
stress <- read_csv(my_url)
Rows: 26 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Subject, Group
dbl (2): PSSbefore, PSSafter

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
stress

As you see, the columns of stress scores are called PSSbefore and PSSafter, which is because PSS was the name of the stress scale the researchers used.

(b) (2 points) Make a suitable graph of the stress scores at the end of the study and whether or not each prisoner was in the Sport group.

These are the columns PSSafter (quantitative) and Group (categorical), so as is the way with these things, you need a boxplot:

ggplot(stress, aes(x = Group, y = PSSafter)) + geom_boxplot()

Unlike the one on the midterm, you don’t need to make the boxes go left and right (although no harm if you do). A vertical boxplot is fine here.

An alternative is to reason that you have one quantitative variable (PSSafter) and “too many” categorical variables (Group), so you do histograms facetted by Group:

ggplot(stress, aes(x = PSSafter)) + geom_histogram(bins = 5) +
  facet_wrap(~ Group, ncol = 1)

choosing a suitable number of bins (I think about 6 bins is as high as you want to go).

(c) (3 points) Run the most appropriate \(t\)-test to compare the stress scores at the end of the study for the two groups of prisoners. Bear in mind what the researchers are trying to show. What do you conclude from your test, in the context of the data?

This means comparing PSSafter between the two groups defined in Group. The researchers wanted to show that the average (mean) stress score is lower for the prisoners who did the physical training, so we need a one-sided test. The two groups in Group are Control and Sport; Control comes first alphabetically, so to get the right test we need to say how Control compares to Sport, in that (alphabetical) order: the alternative is that the Control mean is greater.

The other thing to consider is whether we should be doing a Welch or a pooled test. I’m prepared to entertain either of these, as long as you state a reason for doing the one you do. My take is that the two groups differ slightly in spread (the boxes on the boxplots differ slightly in height), so I would do the Welch test:

t.test(PSSafter ~ Group, data = stress, 
       alternative = "greater")

    Welch Two Sample t-test

data:  PSSafter by Group
t = 1.3361, df = 21.325, p-value = 0.09781
alternative hypothesis: true difference in means between group Control and group Sport is greater than 0
95 percent confidence interval:
 -1.069768       Inf
sample estimates:
mean in group Control   mean in group Sport 
             23.72727              20.00000 

This gives a P-value of 0.09781, which is not smaller than 0.05, so I cannot reject the null hypothesis, and so there is no evidence here that the physical training reduces stress on average in prisoners.

I think you could also reasonably say that the two groups do not differ substantially in spread, on the basis that the boxes on the boxplot are not very different in height, and therefore that the pooled test would also work:

t.test(PSSafter ~ Group, data = stress, 
      var.equal = TRUE, alternative = "greater")

    Two Sample t-test

data:  PSSafter by Group
t = 1.3424, df = 24, p-value = 0.09601
alternative hypothesis: true difference in means between group Control and group Sport is greater than 0
95 percent confidence interval:
 -1.023091       Inf
sample estimates:
mean in group Control   mean in group Sport 
             23.72727              20.00000 

The P-value is almost identical, and the conclusion is the same, that there is no evidence that the physical training is effective in reducing stress.

The question to ask yourself, when looking over this afterwards, is therefore not “did I do the right test?”, but instead “did I do the test I did for a good reason?”.
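
Extra: if you would rather back up your call about the spreads with numbers instead of eyeballing the boxplot, a quick summary of the group SDs will do it. This is a minimal sketch, assuming your dataframe is called stress (sd_after is just a name I made up):

stress %>% 
  group_by(Group) %>% 
  summarize(n = n(), sd_after = sd(PSSafter))

Then base your Welch-vs-pooled decision on how different the two sample SDs look.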

(d) (3 points) Make a suitable plot of the stress measurements before the study for each group of prisoners. How, if at all, does that impact the conclusion you drew in the previous part? Explain briefly.

This is another boxplot, which is not in itself very exciting, but what is interesting is the conclusion it helps us to draw. (Therefore, one point only for the boxplot, and two for the conclusion this time):

ggplot(stress, aes(x = Group, y = PSSbefore)) +
  geom_boxplot()

This shows that the prisoners who did the physical training had quite a bit more stress at the start of the study than the control group. Compare that with your first boxplot, which said that after the study, the physical-training group had a slightly lower stress score than the control group.

Putting those two things together, the effect of the physical training was actually larger than our comparison of the after scores led us to believe. For example, you could look at the medians before and after (reading them off the boxplots): the Control group went up from 15 to 26, while the Sport group went down from 23 to 21. (Looking at these numbers, the implication seems to be that stress will increase over time, but putting the prisoners through a physical training program will at least keep it about the same.)
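
If you want to check those medians rather than read them off the boxplots, a quick summary will do it. This is a minimal sketch, assuming your dataframe is called stress (med_before and med_after are just names I made up):

stress %>% 
  group_by(Group) %>% 
  summarize(med_before = median(PSSbefore),
            med_after = median(PSSafter))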

Extra: that’s as far as you needed to go, but it’s worth thinking about how you might allow for the before scores more formally. One approach is to account for the before measurements somehow, possibly via a regression (which leads on to the technique known as analysis of covariance). A simplified version of that is to look at the difference between after and before for each subject, which is asking “did the Sport subjects improve by more than the Control ones did?”. The idea of taking two measurements on the same subject and working with their difference might remind you of matched pairs.¹ This is an odd experimental design, though, because it has elements of both two independent samples (the subjects doing Sport vs. the Control ones) and matched pairs (two observations on each subject).

Anyway, let’s work out the differences as after minus before, and then worry about what kind of difference we would see if there was a training effect. We’ll start with a boxplot of the differences:

stress %>% 
  mutate(diff = PSSafter - PSSbefore) -> PrisonStress
ggplot(PrisonStress, aes(x = Group, y = diff)) + geom_boxplot()

This is starting to look like a significant difference, and in the right direction: the Control group’s stress has gone up a little between before and after, and the Sport group’s stress has come down, by something like 5 points in both cases.

So let’s compare the differences for the two groups, once again with a two-sample t-test. I would favour Welch here (those two spreads do look different), and I have no particular concerns with normality, given the sample sizes:

t.test(diff ~ Group, data = PrisonStress, alternative = "greater")

    Welch Two Sample t-test

data:  diff by Group
t = 3.5908, df = 15.461, p-value = 0.001282
alternative hypothesis: true difference in means between group Control and group Sport is greater than 0
95 percent confidence interval:
 5.792541      Inf
sample estimates:
mean in group Control   mean in group Sport 
             7.363636             -3.933333 

With a P-value of 0.0013, there is certainly a significant difference in the differences² between the Sport and Control groups; that is to say, the effect of doing Sport is to reduce the stress from what it was before, in comparison to the Control group, where stress actually went up. In other words, the before-to-after changes in stress differ on average between the two groups.

So that’s what the researchers were looking for.

Another graph you might draw is based on thinking of the after stress value as a response and the before one as explanatory (both quantitative), with the treatment group as categorical. This suggests drawing a scatterplot with the points labelled by treatment group:

ggplot(PrisonStress, aes(x = PSSbefore, y = PSSafter, colour = Group)) +
  geom_point() + geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'

There’s still a good bit of scatter, but the red (Control) points tend to sit above the blue (Sport) points, particularly if you compare subjects with similar before scores. The lines help us judge that: the red line is above the blue one, meaning that for any given before score, the predicted after score is higher for someone in the Control group than for someone in the Sport group. That is, once you allow for how much stress a subject had before, they will have less stress after doing the physical training than they would if they were in the control group.

Analysis of covariance is based on fitting regression lines to the two separate groups (which is what the red and blue lines actually are). We don’t talk about regression until much later in the course, but the ideas are these:

stress.1 <- lm(PSSafter ~ PSSbefore + Group, data = PrisonStress)
summary(stress.1)

Call:
lm(formula = PSSafter ~ PSSbefore + Group, data = PrisonStress)

Residuals:
    Min      1Q  Median      3Q     Max 
-11.564  -2.759   0.139   3.569   9.309 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  16.0909     2.7358   5.882 5.39e-06 ***
PSSbefore     0.4667     0.1298   3.595  0.00153 ** 
GroupSport   -7.2598     2.4731  -2.935  0.00743 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.717 on 23 degrees of freedom
Multiple R-squared:  0.4044,    Adjusted R-squared:  0.3526 
F-statistic: 7.809 on 2 and 23 DF,  p-value: 0.00258

The significantly positive number beside PSSbefore is an ordinary regression slope.³ It says that if your stress score before is higher by 1, your stress score after is expected to be higher by 0.47 units on this scale. This is not terribly surprising, and also not really what we care about.

The significantly negative number next to GroupSport says that if you were in the physical training group, your stress score afterwards is on average lower by over 7 points than if you were in the “baseline” Control group, even if your stress score before was the same. This is really quantifying the effect of the treatment while allowing (or adjusting) for other reasons the groups might have been different, such as the Sport group having had higher stress before the study than the Control group did.

The two-sample t-test you did on the after stress scores did not account for how the groups might have been different before (it assumed they were the same up to randomization, which it seems they were not), while this last analysis (and the one based on the differences) both account for everything that is going on. I hope you are feeling that these more sophisticated analyses are rather more satisfactory than the two-sample t-test you did.
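
To see what that adjustment means in practice, here is a minimal sketch using predict: the predicted after scores for two hypothetical prisoners with the same before score, one in each group (the before score of 20 is an arbitrary value I chose for illustration). Because the model is additive, the two predictions differ by the GroupSport coefficient, about 7.26 points:

new <- tibble(PSSbefore = c(20, 20), Group = c("Control", "Sport"))
predict(stress.1, new)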

(e) (2 points) Going back to your plot of (b), why might you be concerned about the Control group of prisoners for your \(t\)-test? Explain briefly (two reasons).

I can’t remember all the way back to (b), so I’ll draw the boxplot again:

ggplot(stress, aes(x = Group, y = PSSafter)) + geom_boxplot()

The control group distribution appears to be skewed to the left (long lower whisker). One point. The second point comes from looking at the sample size, as usual. You can either go back to your listing of the data (in (a)) and physically count how many observations are in the control group, or (better) you can use the fact that you know how to count things in R:

stress %>% count(Group)

There are only 11 observations in the Control group, so the Central Limit Theorem will help us some, but maybe not very much. So we should have some concern about the validity of our two-sample \(t\)-test, on the basis that one of the groups doesn’t look normal enough given the smallness of the sample.

If you drew histograms, you ought to get to about the same place: make some comment about the non-normality of the Control group (on mine it looks like a low outlier, but depending on your number of bins, it might look like a long left tail) together with a comment on the sample size.

Extras:

  • We’ll see in a minute whether we should have been concerned.

  • The Sport group is, to my mind, close enough to normal given its slightly larger sample size and only slightly longer upper tail. My histogram looks bimodal, but only slightly so, and roughly symmetric in shape, so I would guess that the sample size of 15 will take care of that.

(f) (4 points) Obtain a bootstrap sampling distribution of the sample mean for the PSSafter values in the Control group. From this distribution, do you think your \(t\)-test is reasonable? Explain briefly. (You may assume that we are happy with the distribution of PSSafter values in the Sport group.)

This is a lot like the two-sample one in the lecture notes: grab the values whose distribution you want to assess, and save them somewhere:

PrisonStress %>% filter(Group == "Control") -> controls

One point. Then repeatedly take bootstrap samples from the PSSafter values in the dataframe you just made (the next two points):

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(controls$PSSafter, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(x = my_mean)) + geom_histogram(bins = 10)

Then make a call about whether you think this looks normal (bell-shaped). I don’t mind about your call; I care about your reason for making your call, so one of these:

  • This looks a little skewed to the left still, so we should not have done a \(t\)-test here.

  • This looks very close to normal, so the \(t\)-test is fine.

The call with a good reason is the last point.

There is, as you see, still a judgement call to make, but this one is easier than thinking about a boxplot and a sample size (there is only one thing to think about here).

Extras:

  • This might be one of those cases where doing 10,000 simulations would help us to be more confident in our decision.

  • Later, we learn about the “normal quantile plot” which is easier to read if the normality of a distribution is particularly of interest to us, as it is here.
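
As a preview, here is a minimal sketch of what that would look like for this bootstrap distribution, saving the bootstrapped means in a dataframe first (boot_means is just a name I made up). Points close to the line suggest a distribution close to normal:

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(controls$PSSafter, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ungroup() -> boot_means
ggplot(boot_means, aes(sample = my_mean)) + stat_qq() + stat_qq_line()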

2 Power and the exponential distribution

The exponential distribution is often used to model lifetimes of things like light bulbs or electronic components. It is not normal (bell-curved) in shape. In R, the function rexp will draw a random sample from an exponential distribution. It has two inputs, the sample size, and the “rate” (which is one over the mean).

(a) (3 points) Draw a random sample of 100 observations from an exponential distribution with mean 5, saved as one column of a dataframe, and make a histogram of your sample. In what way is the distribution not normal in shape? Hint 1: tibble with the same sort of code as in a mutate will create a dataframe from scratch. Hint 2: use a set.seed first so that your random sample won’t change from one run to the next. You can use any number as input to the set.seed.

This is my set.seed:

set.seed(457299)

You should probably get into the habit of using your own random number seed.

Use tibble and the definition to create a dataframe. The second input is \(1/5 = 0.2\); it is easiest to let R do the calculation for you:

expo <- tibble(x = rexp(100, 1/5))

Two points for making that happen. We don’t know what the application is, so you can use “indefinite” names like mine.

Just one for making a histogram of it, since you know how to do that:

ggplot(expo, aes(x = x)) + geom_histogram(bins = 6)

This number of bins shows the shape well for me. Yours will be different, so you might need a different number of bins.

This fails to be normally distributed because it is skewed to the right. (Yours should have the same shape even if it looks a bit different. The sample size is big, so your sample should be a reasonable representation of the actual distribution, which has this shape.)

(b) (3 points) For your sample, obtain a bootstrap sampling distribution of the sample mean. Comment briefly on its shape.

This is the same procedure as in lecture. Make a new dataframe to hold the simulations, and then sample from your one column in your dataframe (whatever you called the column and the dataframe). Mine, therefore, goes like this:

tibble(sim = 1:1000) %>%
  rowwise() %>% 
  mutate(my_sample = list(sample(expo$x, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(x = my_mean)) + geom_histogram(bins = 10)

Mine looks very much bell-shaped. If yours is right-skewed (which it might be), say that, but add an adjective like “slightly” to emphasize that it is not far from normal in shape.

(c) (2 points) Explain briefly why it makes sense to use a (one-sample) \(t\)-test for the population mean, even though the population does not appear to have a normal distribution.

The sampling distribution of the sample mean is normal (or close to normal) even though the distribution of the sample is very much not: the sample size is large enough to overcome the non-normality in the sample. (“The sample is large” is a one-point answer, because it doesn’t say how the large sample size helps.)

If your simulated sampling distribution is skewed right too much for your tastes, you’ll have to use words like “close to normal” to justify using a \(t\)-test here (which would imply that the P-value is close to correct even if it is not bang on).
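
For reference, a single such test on the sample from (a), using the null mean of 6 that appears in the next part, would look like this (a minimal sketch, assuming your dataframe and column are called expo and x as mine are):

t.test(expo$x, mu = 6)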

(d) (4 points) Estimate by simulation the power of a one-sample \(t\)-test to reject the null hypothesis that the population mean is 6 (against a two-sided alternative), when sampling from an exponential distribution with mean 5 and using a sample of size 100.

This one uses the same ingredients as the example in lecture, but is not exactly the same, because we are not sampling from a normal distribution here:

tibble(sim = 1:1000) %>% 
  rowwise() %>%
  mutate(my_sample = list(rexp(100, 1/5))) %>% 
  mutate(t_test = list(t.test(my_sample, mu = 6))) %>% 
  mutate(p_value = t_test$p.value) %>% 
  count(p_value <= 0.05)

The steps:

  • set up a dataframe with places for 1000 simulations (or more)
  • work rowwise
  • generate random samples from the true distribution, exponential with the true mean (of 5). This is the same code (inside the list) as you wrote back in part (a).
  • run a \(t\)-test on each of your simulated samples, testing that the mean is 6 (the null mean, which we know to be wrong). The output from t.test is a collection of things (the P-value, the sample mean, a confidence interval, etc.), so this needs to be wrapped in list (and will make a list-column); see the small sketch after this list.
  • pull out the P-value from each \(t\)-test, which is a single number each time, so no list needed
  • count how many of those P-values would lead to (correct) rejection of the (incorrect) null.
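
To see why the list is needed, here is a minimal sketch of what one t.test result contains, run outside the simulation (tt is just a name I made up):

tt <- t.test(rexp(100, 1/5), mu = 6)
names(tt)     # the pieces stored inside the test result, including p.value
tt$p.value    # the single number we pull out for each simulated sample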

My estimated power is 0.533. I would expect yours to be somewhere within about 0.03 of that, with 1000 simulations. (If it is not, expect the grader to give your code a good look over, but if your code is correct, full credit.)

Extra: Where did the 6 come from? In this case, out of my head,⁴ but in general, this is the “subject matter knowledge” thing: a subject matter expert tells us what null hypothesis would usually be tested in this situation (the 6) and what sort of difference from that would be considered a scientifically interesting departure (the 5, 1 away).

I wanted to lead you through the process, so I started you off by generating a random sample from what was going to be the truth and introduced the null mean of 6 later, which is perhaps backwards. The logical way to think about power is that the null hypothesis represents the “standard theory” (6 here) and we are trying to see how likely we are to reject that null hypothesis when the truth is actually something else far enough from that null hypothesis to be of interest (5 here).

(e) (2 points) Suppose our aim is to estimate (by simulation) the sample size needed to get a power of 0.75 in this same situation. By copying and pasting your code and making a small edit to it, run another simulation that will go some way towards meeting that aim.

In my case (and probably yours), the power came out less than 0.75, so to achieve this much power, we will need a bigger sample size. The problem with the simulation approach is that we have no idea how much bigger the sample size needs to be, so to answer this question, copy and paste your code and change the sample size in the rexp line to any number at all bigger than 100, for example:

tibble(sim = 1:1000) %>% 
  rowwise() %>%
  mutate(my_sample = list(rexp(200, 1/5))) %>% 
  mutate(t_test = list(t.test(my_sample, mu = 6))) %>% 
  mutate(p_value = t_test$p.value) %>% 
  count(p_value <= 0.05)

This will get your power closer to 0.75. It doesn’t have to be much closer; any closer at all will do. I actually overshot some, so my next guess (if I were to make one) would be a sample size less than 200, but closer to 200 than 100 (say, something like 180). There is randomness here, so I can only hope to get so close. I actually was lucky enough to do pretty well.

You only need one more simulation of this kind.

If your first simulation for some reason gave you a power greater than 0.75, your second guess at the sample size should be something less than 100 (again, it doesn’t matter exactly what).

Extra: I said we “can only get so close”. To think about how close we might be to the “true power” (whatever it is), note that each simulated sample either leads to us correctly rejecting the null (success) or failing to reject the null and making a type II error (failure). Then we have 1000 of those repeated trials, and we count how many times we succeeded. My language here (“trial” with “success” and “failure”) is intended to remind you of a binomial distribution. Here we have \(n = 1000\) trials (simulated samples), and a probability \(p\) of correctly rejecting the null (that is the power we are trying to estimate). You might have run into a confidence interval formula for \(p\), based on the number of trials and successes observed; this is encoded in the R function prop.test, which takes (for a confidence interval) two inputs, the number of successes and the number of trials.
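
(In case you have seen the formula: the usual large-sample interval is \(\hat{p} \pm z^* \sqrt{\hat{p}(1 - \hat{p})/n}\), with \(z^* = 1.96\) for 95% confidence. prop.test actually uses a slightly more refined score-based interval with a continuity correction, so its numbers will be close to, but not exactly the same as, what that formula gives.)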

For my first simulation, I got 533 successful rejections, so a 95% confidence interval for the true power is

prop.test(533, 1000)

    1-sample proportions test with continuity correction

data:  533 out of 1000, null probability 0.5
X-squared = 4.225, df = 1, p-value = 0.03983
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5015104 0.5642330
sample estimates:
    p 
0.533 

From this simulation, the power is most likely between 0.502 and 0.564. (This is where I got my “within 0.03” above from.)

For my second simulation, I got 796 successful rejections out of 1000:

prop.test(796, 1000)

    1-sample proportions test with continuity correction

data:  796 out of 1000, null probability 0.5
X-squared = 349.28, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.7693929 0.8202964
sample estimates:
    p 
0.796 

The confidence interval is between 0.769 and 0.820. We were aiming for a power of 0.75, so now we know that a sample size of 200 is indeed too big, but it may not be much too big.

To get this confidence interval shorter, we need to do more simulations, like 10,000. Let me have a third go, using a sample size of 180 and 10,000 simulations:

tibble(sim = 1:10000) %>% 
  rowwise() %>%
  mutate(my_sample = list(rexp(180, 1/5))) %>% 
  mutate(t_test = list(t.test(my_sample, mu = 6))) %>% 
  mutate(p_value = t_test$p.value) %>% 
  count(p_value <= 0.05)

That, as you can see, is really close (0.7456) to 0.75. We would expect the confidence interval for the true power to (i) be shorter (more simulations), and (ii) contain 0.75:

prop.test(7456, 10000)

    1-sample proportions test with continuity correction

data:  7456 out of 10000, null probability 0.5
X-squared = 2411.8, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.7369202 0.7540901
sample estimates:
     p 
0.7456 

Between 0.737 and 0.754. We have now nailed down the power rather precisely, and we are not likely to get any closer than \(n = 180\) as a sample size to achieve this power.

Footnotes

  1. which we come back to later in the course.

  2. Yes, I know, but that’s what it is.

  3. The P-values are in the last column.

  4. I wanted to give you one that would produce a suitable power for a sample size of 100.