STAD29 Assignment 6

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

Packages

library(tidyverse)

Toothbrushes

A study was carried out to assess the effect of two different types of toothbrushes (in Toothbrush) at removing plaque. The researchers were also concerned that there might be a difference between males and females, and so they recorded the Sex that each subject identified as. Two outcome values were recorded for each subject: the dental plaque index Before and After brushing. A dental plaque index of zero is the best possible score. Brushing cannot make the score worse;Before minus After is positive or zero. Assume that each subject was randomly assigned to one toothbrush.

The data are in http://ritsokiguess.site/datafiles/GLMsData_toothbrush.csv. There is also a Subject column that we will ignore for the purposes of our analysis.

  1. (1 point) Read in and display (some of) the data.
my_url <- 'http://ritsokiguess.site/datafiles/GLMsData_toothbrush.csv'
toothbrush <- read_csv(my_url)  
Rows: 52 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Sex, Toothbrush
dbl (3): Subject, Before, After

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
toothbrush

I got away with my dataframe name because toothbrush (the dataframe name) and Toothbrush (one of the columns in it) are in R terms different (uppercase and lowercase are different). You might (very reasonably) feel that your dataframe should have a more different name like brushing that is not as close to the name of one of the columns.

Extra: the analysis we are about to do assumes that each subject used only one type of toothbrush (randomly assigned), but in actual fact each subject used both types of toothbrushes, as you can tell:

toothbrush %>% 
  count(Subject, Toothbrush)

so that the analysis we are about to do is in fact wrong, and should be treated as a matched pairs (or repeated measures). It is, however, confusing that there are then two different types of matching (before and after, and the two types of toothbrushes that each subject actually used), and I didn’t want to confuse you more than necessary.

  1. (2 points) Create, save, and display some of a dataframe with a column diff that is the difference Before minus After for each row.

No more difficult than this:

toothbrush %>% mutate(diff = Before - After) -> toothbrush
toothbrush

You can eyeball the first few differences to convince yourself that your calculation is correct.

Extra: you actually have some freedom in an analysis like this to define the before-after difference in a way that makes sense to you. For example, since the lower limit of Before and After is zero, you might decide that something like a percent reduction is better:

toothbrush %>% 
  mutate(diff2 = (Before - After) / Before * 100)

or, similarly, you might feel that the ratio of Before and After is the relevant thing, except that then one of the subjects had an After value of zero, and then you would be dividing by zero and you would have to decide what to do about that.

  1. (2 points) Make a graph that shows how diff varies according to Toothbrush and Sex.

At this point, you are probably expecting a grouped boxplot, and you can convince yourself that this is the thing by noting that diff is quantitative and the other two variables are categorical.

You will need to pick one of your categorical variables to be x. There is not a strong reason to choose one over the other; they both have two levels. The tiniest indication is that you would imagine the researchers to be more interested in differences in Toothbrush, thinking of Sex as being a nuisance or blocking factor. That would say to put Sex as x and Toothbrush as fill:

ggplot(toothbrush, aes(fill = Toothbrush, y = diff, x = Sex)) + geom_boxplot()

but I have no objection if you switch them around:

ggplot(toothbrush, aes(x = Toothbrush, y = diff, fill = Sex)) + geom_boxplot()

  1. (2 points) Fit a suitable two-way ANOVA for predicting diff from toothbrush, sex, and their combination. Display the results.

Include an interaction, therefore, as you should always do at first with a two-way ANOVA:

toothbrush.1 <- aov(diff ~ Toothbrush * Sex, data = toothbrush)
summary(toothbrush.1)
               Df Sum Sq Mean Sq F value   Pr(>F)    
Toothbrush      1  0.751   0.751   1.804 0.185589    
Sex             1  6.446   6.446  15.477 0.000268 ***
Toothbrush:Sex  1  1.126   1.126   2.704 0.106662    
Residuals      48 19.992   0.416                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

That’s as far as you need to go here.

  1. (3 points) What do you conclude from your analysis? Hence, improve your model if it is possible to do so, explaining briefly.

The interaction is not significant (and therefore, whatever difference there might be between sexes is the same for each toothbrush).

This is where you stop by way of conclusion. To go further, remove the interaction and see what that does. The easiest way to do this is to copy-paste your model formula and replace the * with a +:

toothbrush.2 <- aov(diff ~ Toothbrush + Sex, data = toothbrush)
summary(toothbrush.2)
            Df Sum Sq Mean Sq F value   Pr(>F)    
Toothbrush   1  0.751   0.751   1.743 0.192887    
Sex          1  6.446   6.446  14.957 0.000325 ***
Residuals   49 21.118   0.431                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

You can also use update, but for that you need to remember that the interaction term itself has a colon in it, so it doesn’t save you much typing:

toothbrush.2a <- update(toothbrush.1, . ~ . - Toothbrush:Sex)
summary(toothbrush.2a)
            Df Sum Sq Mean Sq F value   Pr(>F)    
Toothbrush   1  0.751   0.751   1.743 0.192887    
Sex          1  6.446   6.446  14.957 0.000325 ***
Residuals   49 21.118   0.431                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Whichever of these two ways you go, you need to say that the interaction can be removed because it is not significant.

I am happy if you stop here (I think that’s what I did in lecture), but you can reasonably continue by removing the now non-significant Toothbrush also:

toothbrush.3 <- aov(diff ~ Sex, data = toothbrush)
summary(toothbrush.3)
            Df Sum Sq Mean Sq F value   Pr(>F)    
Sex          1  6.446   6.446   14.74 0.000348 ***
Residuals   50 21.869   0.437                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  1. (2 points) Using your best analysis and your graph from earlier, what do you conclude?

There is an effect of Sex on diff, but there is not an effect of Toothbrush.

To see what that effect is, go back to your graph. Whichever one of the graphs you drew, the two Male boxes are higher than the corresponding Female boxes (for the same toothbrush), so the value of diff is on average higher for males than it is for females.

As to what that means: well, the males removed more plaque from their teeth in the study than the females did. That might mean that the males were better at brushing their teeth than the females, but it might also mean that the males had more plaque on their teeth to begin with, so that a regular brushing got more of it off; maybe the females tended to keep their teeth cleaner in the first place, and so they had less opportunity to remove a lot of it because there wasn’t much to remove.

A note that there is no need for Tukey here because there are two sexes and two toothbrushes; finding an effect of either means that the two levels are different, and there is no value in looking further.

Extra 1: to assess that last thing, it might be better to look at Before and After separately, rather than looking at diff. This means we now have two quantitative variables, suggesting a scatterplot, as well as two categorical variables. The one that seems to be of interest is Sex, so we’ll use colour for that, and we’ll use facets for Toothbrush in case there is anything there we missed before:

ggplot(toothbrush, aes(x = Before, y = After, colour = Sex)) +
  geom_point() + facet_wrap( ~ Toothbrush)

The most striking thing is that there is a difference in pattern between the toothbrushes. There were some people using the conventional toothbrush that had a very high plaque index after brushing with it (suggesting that the Hugger was actually more beneficial). As for Sex, most (but not all) of the people with high plaque index on the Conventional toothbrush (either before or after) were males and most of those with low plaque index were females. With the Hugger toothbrush, things were more mixed up.

Perhaps just looking at the difference Before minus After was too simple-minded.

Extra 2: we pretended all the way through that there were 52 different subjects who each used one (randomly-chosen) toothbrush, but there is actually no difficulty in having each subject use both toothbrushes (choosing one at random to use first, then waiting maybe a few days, then using the other). This means that we actually have a sort of doubly matched pairs: each subject has a before and after on both toothbrushes. If you think about how we did matched pairs, you’ll recall that we need all the measurements for one subject on one row of the dataframe, not two as we have here. This one seems tricky to reshape, though:

toothbrush %>% 
  pivot_wider(names_from = Toothbrush, 
              values_from = c(Before, After, diff)) -> toothbrush_wide
toothbrush_wide

That was my first guess, and it worked! I wasn’t sure what would happen with multiple columns in the values_from, but it did exactly what I was hoping for: it made new column names by combining the values_from column name with the names_from column value. We have 26 rows, one per subject, because the columns not mentioned in the pivot_wider were Subject and Sex; each subject only has one sex (!), so there are 26 subject-sex combinations.

If you scroll across to the right, you’ll see that we now have diff_Hugger and diff_Conventional columns that we could use for a matched pairs analysis (that unfortunately ignores Sex):

with(toothbrush_wide, t.test(diff_Hugger, diff_Conventional, paired = TRUE))

    Paired t-test

data:  diff_Hugger and diff_Conventional
t = 1.555, df = 25, p-value = 0.1325
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -0.07800101  0.55877024
sample estimates:
mean difference 
      0.2403846 

and there is still no significant difference between toothbrushes. You might concoct a sort of simple-effects analysis by looking at males and females separately:

toothbrush_wide %>% filter(Sex == "F") -> females_wide
toothbrush_wide %>% filter(Sex == "M") -> males_wide
with(females_wide, t.test(diff_Hugger, diff_Conventional, paired = TRUE))

    Paired t-test

data:  diff_Hugger and diff_Conventional
t = 2.2498, df = 13, p-value = 0.04242
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 0.02038166 1.00533262
sample estimates:
mean difference 
      0.5128571 
with(males_wide, t.test(diff_Hugger, diff_Conventional, paired = TRUE))

    Paired t-test

data:  diff_Hugger and diff_Conventional
t = -0.45532, df = 11, p-value = 0.6577
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -0.4521294  0.2971294
sample estimates:
mean difference 
        -0.0775 

and now we have found a significant difference between toothbrushes for the females, albeit by cheating a bit (if we had started out with an intention to look at males and females separately, it would have been better). The proper analysis in this context (which is a repeated measures) finds a significant interaction between Toothbrush and Sex, which then justifies doing simple effects (correctly also done as repeated measures), and that comes to the same conclusion as our simpler analysis did.

Fat rats

60 baby rats were fed different diets. Half of the rats were fed low-protein diets and half were fed high-protein diets. The source of protein also varied: beef, cereal, or pork, so that ten rats received each combination of protein amount and source. The weight gain for each of the rats was measured, and the experimenters wanted to see how the weight gain depended on the amount and source of protein. The data are in http://ritsokiguess.site/datafiles/fatrats.csv, with these columns:

  • Gain: Weight gain (in grams per week)
  • Protein: Level of protein (Hi or Lo)
  • Source: Source of protein (Beef, Cereal, or Pork).
  1. (1 point) Read in and display (some of) the data.

Zero surprises:

my_url <- "http://ritsokiguess.site/datafiles/fatrats.csv"
fatrats <- read_csv(my_url)
Rows: 60 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Protein, Source
dbl (1): Gain

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fatrats

rats is also a perfectly good name.

Extra: my source for these data was the FatRats dataset in the Stat2Data package (from a textbook called Stat2). With the original data, conclusions were less clear than I would have liked, because there were some outliers and long tails in the data distributions, so I cheated a bit by “Winsorizing” the data, moving the extreme values towards the centre of the data so that the distributions had fewer outliers and more equal tail length. I don’t like to cheat like this, but this way the conclusions come out a lot clearer, so I figured for educational purposes this was forgiveable. (For the original data, I would have had to make \(\alpha = 0.10\) throughout to get sensible conclusions, and having you negotiate that was not the point of this problem.)

  1. (2 points) Draw a suitable graph of these data, putting Source on the \(x\)-axis.

This, with one quantitative and two categorical variables, calls for a grouped boxplot. You could reasonably have either categorical variable on the \(x\) axis, but I am trying to make life easier for the grader here:

ggplot(fatrats, aes(x = Source, y = Gain, fill = Protein)) + geom_boxplot()

The response goes on the \(y\)-axis, and you can use either colour or fill for the other categorical variable. The former outlines the boxes, and the latter fills them in with colours. (This is the usual principle of colour colouring the outside and fill the inside of whatever you are drawing. Lines and points don’t have an inside, as far as ggplot is concerned, as you’ll see shortly.)

  1. (3 points) Draw an interaction plot, again putting Source on the \(x\)-axis.

There are two steps:

  • work out the mean response (Gain) for each combination of Source and Protein
  • plot those means against Source, colouring by Protein, and joining those points by lines. Remember the group thing to make sure you join the right things by lines (on an interaction plot, group and colour should point to the same thing):
fatrats %>% 
  group_by(Source, Protein) %>% 
  summarize(mean_gain = mean(Gain)) %>% 
  ggplot(aes(x = Source, y = mean_gain, 
             colour = Protein, group = Protein)) + 
  geom_point() + geom_line()
`summarise()` has grouped output by 'Source'. You can override using the
`.groups` argument.

Optionally, work out the means and save them first, then use that saved dataframe as input to ggplot. The result should be the same.

The ggplot stuff is rather long, so splitting the lines somewhere is a good idea, so that it is easier to read.

  1. (2 points) What do you conclude from your interaction plot? Explain briefly.

The two lines are very far from being parallel, so we would expect to see an interaction between Source and Protein. (The mean weight gains are very similar for the two groups of rats on Cereal, but much higher for high-protein diets when the protein source was one of the meats.)

Extra: I had you do it this way around so that the grader would have a uniform thing to check, but there’s no problem (in general) with switching the two explanatory variables around:

fatrats %>% 
  group_by(Source, Protein) %>% 
  summarize(mean_gain = mean(Gain)) %>% 
  ggplot(aes(x = Protein, y = mean_gain, 
             colour = Source, group = Source)) + 
  geom_point() + geom_line()
`summarise()` has grouped output by 'Source'. You can override using the
`.groups` argument.

The look is very different, but the conclusion is the same: the three lines are not (all) close to parallel (the Cereal line has a definitely different slope to the other two), so we expect to see an interaction.

It is clearer on this last plot that beef and pork have a very similar profile in terms of weight gain for each amount of protein, and that cereal is very different from them.

Extra extra: remember that an interaction plot doesn’t show variability. If you look back at your boxplot, you’ll note that there is a fairly large amount of variability (look at the heights of the boxes), so the evidence in favour of an interaction might not be as strong as the interaction plot suggests.

  1. (3 points) Fit a suitable two-way analysis of variance, and display the results. What do you conclude so far?

Include an interaction, because our first goal is to see whether it is significant:

fatrats.1 <- aov(Gain ~ Source * Protein, data = fatrats)
summary(fatrats.1)
               Df Sum Sq Mean Sq F value   Pr(>F)    
Source          2    251     126   0.774   0.4663    
Protein         1   3227    3227  19.862 4.24e-05 ***
Source:Protein  2   1147     574   3.530   0.0362 *  
Residuals      54   8773     162                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The P-value for the interaction is 0.0362, which is significant at \(\alpha = 0.05\), so there is indeed a significant interaction.

This is where you stop looking: even though there is a significant Protein effect, we will be able to understand why that is by investigating the interaction with simple effects. The interaction being significant likely means that the effect of Protein is different for the different Sources (or vice versa).

  1. (1 point) Why is it appropriate to look at simple effects here?

Because the interaction is significant, or, more precisely, because we hope to understand why there is a significant interaction.

  1. (4 points) Test the simple effects of Source at each level of Protein. What do you conclude?

There are two levels of Protein, Hi and Lo. For each of these:

  • grab only the data for that level of protein
  • test whether there is a Source effect at that level of protein using a one-way ANOVA
  • if there is a significant Source effect, use Tukey to discover what it is, because there are three protein sources, and the simple-effects \(F\) test only tells you that they are not all the same in terms of weight gain.

Thus, starting with low protein:

fatrats %>% 
  filter(Protein == "Lo") -> lows
lows.1 <- aov(Gain ~ Source, data = lows)
summary(lows.1)
            Df Sum Sq Mean Sq F value Pr(>F)
Source       2    165    82.3   0.418  0.663
Residuals   27   5321   197.1               

There is no significant difference in weight gain between the sources of protein on a low-protein diet. (That blue line on your interaction plot is not significantly different from going straight across, a conclusion that is perhaps made more convincing by comparing the blue boxes on your grouped boxplot.)

Then, repeat with high protein:

fatrats %>% 
  filter(Protein == "Hi") -> highs
highs.1 <- aov(Gain ~ Source, data = highs)
summary(highs.1)
            Df Sum Sq Mean Sq F value Pr(>F)  
Source       2   1234   616.9   4.825 0.0162 *
Residuals   27   3452   127.9                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This time, there is a significant difference in weight gain between the protein sources when the diets are high in protein. There are three protein sources, so to see why this \(F\)-test came out significant, we need to run Tukey:

TukeyHSD(highs.1)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Gain ~ Source, data = highs)

$Source
             diff         lwr      upr     p adj
Cereal-Beef -13.8 -26.3377095 -1.26229 0.0287979
Pork-Beef    -0.4 -12.9377095 12.13771 0.9965566
Pork-Cereal  13.4   0.8622905 25.93771 0.0344068

The Cereal high-protein diet has a significantly smaller average weight gain than either of the two meat high-protein diets, but Beef and Pork are not significantly different in terms of mean weight gain. (On your interaction plot, the red curve really does dip in the middle.)

Extra: it’s not clear to me whether the people who collected these data were more interested in Source or Protein. If they were more interested in the effect of Protein, they might have chosen to do the simple effects the other way around by conditioning on Source. I think there is no one right way around; I chose to have you do it this way around to give the grader something consistent to look at. It also makes it a bit more interesting for you, with the need to do Tukey for one of the simple effects analyses. If you had conditioned on Source, you would have had to do three one-way ANOVAs, but if any of them had come out significant, you would have known right away that there was a difference between Hi and Lo protein, and you wouldn’t have needed to do any Tukeys.