STAD29 Assignment 7

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

Packages

library(tidyverse)

P-values

  1. (2 points) There are a lot of P-values in scientific notation on this assignment. A P-value that R writes as 4e-03 is actually \(4 \times 10^{-3} = 0.004\). Sometimes it is not convenient to write out all the zeros after the decimal point and you want to show scientific notation in your writeup. To do that, first find out which editing mode you are in. Look top left of the window you are editing in, below the tabs for files you have open, and see which one of Source or Visual is highlighted. Then, in your Quarto text, use “inline math mode”. If you are using the visual mode editor, type a slash /, then scroll down to Inline Math and hit Enter. This will display two dollar signs with the cursor between them. If you are using the source mode editor, type two dollar signs yourself and move the cursor between them. When you want to display a P-value as \(x \times 10^{-y}\), write $x \times 10^{-y}$, using the dollar signs you just obtained. How would you write the P-value of 2e-07 properly, in a sentence saying “The P-value is”, followed by the P-value in scientific notation?

As The P-value is $2 \times 10^{-7}$., obtaining a new pair of dollar signs as above:

The P-value is \(2 \times 10^{-7}\).

If you are using the visual mode editor, you see what you are typing and how the “inline math” will look, one above the other. If you move the cursor away, you just see what the rendered math will look like, and if you move your cursor back to where the math is, you see your typing again and you can edit it. (If you are using the source mode editor, clicking on your inline math code will create a popup showing how your math will look when rendered.)

Extra: you can actually use any LaTeX code inside the dollar signs. LaTeX is the approved way of writing math. For example, if you write $$ y = {x^2 + 1 \over x + 2}$$, you get an equation on a line by itself:

\[ y = {x^2 + 1 \over x + 2}\]

which is what the visual mode editor calls “display math”.

Turkeys and diets

In a study, 30 turkeys were randomly assigned to one of five different diets: a Control diet, a diet with amounts 1 or 2 of additive A (diets A1 and A2), or a diet with amount 1 or 2 of additive B (diets B1 and B2). The researchers were not interested in all comparisons of the five diets, but in these particular comparisons:

  • how the control diet compared to the average of the other 4 diets
  • how the (average of) diet A compared to the (average of) diet B
  • how amounts 1 and 2 of additive A compared
  • how amounts 1 and 2 of additive B compared

The weight gain of each turkey was measured at the end of the study (that is to say, the turkeys were weighed at the beginning and at the end of the study, and the numbers in the column wt.gain are the difference between the weight at the end and the weight at the beginning). The data are in http://ritsokiguess.site/datafiles/turkey.csv.

  1. (1 point) Read in and display (some of) the data.

As usual:

my_url <- "http://ritsokiguess.site/datafiles/turkey.csv"
turkey <- read_csv(my_url)
Rows: 30 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): diet
dbl (1): wt.gain

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
turkey

Note that the diet column is actually text (chr), which we will have to have to deal with shortly.

  1. (3 points) Draw a suitable graph. What does your graph suggest about the comparisons of interest?

The graph is a boxplot (as you would expect), and this one is an ordinary (not grouped) boxplot because there is only one categorical variable:

ggplot(turkey, aes(x = diet, y = wt.gain)) + geom_boxplot()

The graph suggests that we will be seeing a lot of significance:

  • the control diet seems worse (lower weight gain) than all the others
  • diet A overall looks worse than diet B
  • amount 2 appears better than amount 1 within both diets
  1. (1 point) Create, save, and display a dataframe in which the categorical column is a factor rather than text. (Hint: this is the same idea as on the worksheet.)

As on the worksheet, I’m going to overwrite the dataframe I read in, but you can certainly use a different name for your new dataframe:

turkey %>% 
  mutate(diet = factor(diet)) -> turkey
turkey

As you see, diet is now a factor, which will help us out later when we analyze the contrasts we are about to make.

  1. (2 points) Define four contrasts that express the comparisons that the researchers wish to make.

I’m going to do these in the order they were listed, but actually the first one is the most difficult, so you can think about the others first. Here are the comparisons we are making:

  • how the control diet compared to the average of the other 4 diets
  • how the (average of) diet A compared to the (average of) diet B
  • how amounts 1 and 2 of additive A compared
  • how amounts 1 and 2 of additive B compared

and your boxplot tells you that the two amounts of additive A are first, the two of additive B are next, and the Control diet is last, so we get:

vs_control <- c(1/4, 1/4, 1/4, 1/4, -1)
a_vs_b <- c(1/2, 1/2, -1/2, -1/2, 0)
a12 <- c(1, -1, 0, 0, 0)
b12 <- c(0, 0, 1, -1, 0)

Comments:

  • a12 compares amounts 1 and 2 within additive A (the first two diets on your boxplot). b12 does the same for additive B. These are the easiest contrasts to start with.
  • a_vs_b compares the average of the two treatments involving additive A (amounts 1 and 2) with the same for additive B. These, from the boxplot, are the first two diets vs. the 3rd and 4th. I like to put the 2s in here because there are two different amounts of each additive. It actually works here without the 2s, but if there had been three different amounts of additive B, it would have been important to get the 2s and 3s correct.
  • vs_control compares the other diets (the first four on your boxplot) vs the control diet. There are four “real” diets, so their coefficients get divided by 4 (to make the coefficients within the contrast add up to zero). As you’ll recall from the worksheet, either of these versions of my first contrast will also do the job, but I think mine is easiest to think about:
c(1, 1, 1, 1, -4)
c(-1, -1, -1, -1. 4)
c(-1/4, -1/4, -1/4, -1/4, 1)
  • use names for your contrasts that say something about what they are comparing, so that you can pick them out of the table later when you do the tests. (The names are your choice, and the contrasts can be in any order.)
  1. (2 points) Run a suitable analysis to test your contrasts. Hint: there is some setup work first.

The setup work is to collect the contrasts together into a matrix using cbind and set that as the contrasts for your categorical-variable-as-factor. After you have done that, you run the ANOVA as a regression and display the summary:

m <- cbind(vs_control, a_vs_b, a12, b12)
contrasts(turkey$diet) <- m
turkey.1 <- lm(wt.gain ~ diet, data = turkey)
summary(turkey.1)

Call:
lm(formula = wt.gain ~ diet, data = turkey)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0000 -0.2958 -0.0500  0.3917  1.3000 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      6.5300     0.1025  63.686  < 2e-16 ***
dietvs_control   2.7467     0.2051  13.394 6.58e-13 ***
dieta_vs_b      -1.9500     0.2293  -8.505 7.56e-09 ***
dieta12         -0.7417     0.1621  -4.575 0.000112 ***
dietb12         -1.1917     0.1621  -7.350 1.06e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5616 on 25 degrees of freedom
Multiple R-squared:  0.9289,    Adjusted R-squared:  0.9175 
F-statistic: 81.67 on 4 and 25 DF,  p-value: 5.598e-14
  1. (2 points) What do you conclude about the comparisons of interest to the researchers?

In the summary output, ignore the intercept line and look for your contrast names below that with diet on the front (because they are comparisons of diets). The analysis allows you to conclude differences when the P-values are small enough; go back and look at your boxplot to conclude what is bigger than what:

  • There is a significant difference in weight gain between the real diets and the control one (P-value \(6.6 \times 10^{-13}\)); this is because the real diets have a larger mean weight gain than the control diet.
  • There is a significant difference in weight gain between diets with additive A and those with additive B (P-value \(7.6 \times 10^{-9}\)); this is because additive B produces larger mean weight gain than additive A.
  • There is a significant difference in weight gain between amounts 1 and 2 of additive A (P-value \(0.00011\)); this is because there is a larger weight gain from amount 2 compared to amount 1.
  • There is a significant difference in weight gain between amounts 1 and 2 of additive B (P-value \(1.1 \times 10^{-7}\)); this is because there is again a larger weight gain from amount 2 compared to amount 1.

There is a lot of boilerplate text in there; I copied and pasted each conclusion from the previous one, and edited the words and numbers that needed editing. The first question on the assignment, about displaying things in scientific notation, was inspired by all those very small P-values on this question; you could and should use those ideas again here, which is what I did.

This was a lot of work, but the point was that the researchers got assessments of the actual comparisons they were interested in. Doing Tukey on the turkeys (!) would not have compared the right things.

Jellyfish

Jellyfish were collected from two different locations on the Hawkesbury River in New South Wales, Australia. The length and width of each jellyfish was measured. Do the jellyfish at the two locations differ in size? The data are in http://ritsokiguess.site/datafiles/jellyfish.csv.

  1. (1 point) Read in and display (some of) the data.

Do you really need to make me tell you how to do this?

my_url <- "http://ritsokiguess.site/datafiles/jellyfish.csv"
jellyfish <- read_csv(my_url)
Rows: 46 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Location
dbl (2): Width, Length

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jellyfish
  1. (2 points) Make a suitable plot of the data, taking advantage of the fact that you have exactly two quantitative variables.

Having exactly two quantitative variables (as for the seed yield and weight in lecture) means that we can make a scatterplot out of them, labelling the points according to the categorical one (by colour):

ggplot(jellyfish, aes(x = Width, y = Length, colour = Location)) + geom_point()

The story from this graph (which you don’t need to say now, but we come back to later) is that Salamander jellyfish are both longer and wider than the Dangar ones.

It doesn’t matter whether you put Length or Width on the \(x\)-axis (neither is explanatory); the impression should be the same either way.

Extra 1: The jellyfish were measured only to the nearest half-millimetre, and there are some jellyfish that have the same length and width to this accuracy (for example, the eighth and ninth jellyfish in the dataframe). This means that there is some overplotting on the scatterplot, with some of the points actually representing more than one jellyfish (even though it looks as if there is only one). The solution in this case is to replace geom_point with geom_jitter, which moves the points around slightly so that you can see them all. The default movement is a maximum of halfway to the next value in each direction:

ggplot(jellyfish, aes(x = Width, y = Length, colour = Location)) + geom_jitter()

There are indeed more points now. I guess I can find the duplicates. This is a rather unconventional use of count on a quantitative variable. The filter is to show only the combinations of length and width that I have more than one of:

jellyfish %>% count(Width, Length) %>% filter(n > 1)

There are actually three jellyfish of width 15 and length 16. On the first graph, it looks as if there is only one, but on the second graph, you can see the three points jittered slightly away from \((15, 16)\).

Extra 2: if we had had three or more quantitative variables, we would have needed three or more dimensions for our scatterplot! In that case, the best we can do is a boxplot for each quantitative variable. You can do these one at a time, but you can also do them in one shot. The mechanism is actually the same as for plotting residuals against a bunch of \(x\)-variables in regression: pivot all the quantitative variables longer, then draw the boxplots and facet by the names of the variables:

jellyfish %>% 
  pivot_longer(c(Width, Length), names_to = "yname", values_to = "y") %>% 
  ggplot(aes(x = Location, y = y)) + geom_boxplot() + 
  facet_wrap(~ yname)

I didn’t need to worry about scales = "free" here because the length and width are on similar scales.

Once again, the Salamander jellyfish are bigger than the Dangar ones, both in terms of length and width.

  1. (2 points) Set up and display some of a suitable response variable for a MANOVA. Hint: head.

I like staying in the tidyverse and turning to a matrix at the end:

jellyfish %>% select(-Location) %>% 
  as.matrix() -> y

but you can also do all-base-R:

yy <- with(jellyfish, cbind(Width, Length))

This is a matrix, so if you display it, it will display all of itself (and it is long). To avoid this, wrap it in head, which will display (by default) only the first six rows, which is enough to see that you have the right thing:

head(y)
     Width Length
[1,]   6.0      9
[2,]   6.5      8
[3,]   6.5      9
[4,]   7.0      9
[5,]   7.0     10
[6,]   7.0     11

namely, the first six lengths and widths.

  1. (2 points) Run a suitable MANOVA, displaying the results.

The easiest way is to use small-m manova, which is a part of base R:

jellyfish.1 <- manova(y ~ Location, data = jellyfish)
summary(jellyfish.1)
          Df  Pillai approx F num Df den Df    Pr(>F)    
Location   1 0.65107   40.118      2     43 1.475e-10 ***
Residuals 44                                             
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now that I think about it, you can also put the construction of y inside the manova, as we did with Surv in survival analysis:

jellyfish.1a <- manova(cbind(Width, Length) ~ Location, data = jellyfish)
summary(jellyfish.1a)
          Df  Pillai approx F num Df den Df    Pr(>F)    
Location   1 0.65107   40.118      2     43 1.475e-10 ***
Residuals 44                                             
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

manova is perfectly good here, but if you want to live on the edge you can also use big-M Manova from the car package:

library(car)
Loading required package: carData

Attaching package: 'car'
The following object is masked from 'package:dplyr':

    recode
The following object is masked from 'package:purrr':

    some
jellyfish.2a <- lm(y ~ Location, data = jellyfish)
jellyfish.2 <- Manova(jellyfish.2a)
summary(jellyfish.2)

Type II MANOVA Tests:

Sum of squares and products for error:
          Width   Length
Width  336.7623 270.2509
Length 270.2509 275.9328

------------------------------------------
 
Term: Location 

Sum of squares and products for the hypothesis:
          Width   Length
Width  419.7649 460.7817
Length 460.7817 505.8064

Multivariate Tests: Location
                 Df test stat approx F num Df den Df     Pr(>F)    
Pillai            1 0.6510736 40.11759      2     43 1.4749e-10 ***
Wilks             1 0.3489264 40.11759      2     43 1.4749e-10 ***
Hotelling-Lawley  1 1.8659343 40.11759      2     43 1.4749e-10 ***
Roy               1 1.8659343 40.11759      2     43 1.4749e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This way gives you more P-values, but for data of this structure1 they are all the same.

  1. (2 points) What do you conclude from your MANOVA?

The P-value of \(1.5 \times 10^{-10}\) is much smaller than 0.05. The jellyfish at the two locations differ in some way, either length or width or both. That’s all you can say, so stop there.

  1. (2 points) Looking at your graph, why do you think your MANOVA gave the significant or non-significant result that it did?

The result of the MANOVA was significant because the Salamander jellyfish are actually on average larger in both length and width compared to the Dangar jellyfish.

Extra: if your fingers are like mine, they are more used to typing “Danger” than “Dangar”, so that’s what will probably come out the first time.

Total points: \(2 + (1 +3 + 1 + 2 + 2 + 2) + (1 + 2 + 2 + 2 + 2 + 2) = 2 + 11 + 11 = 24\)

Footnotes

  1. A single categorical variable, a so-called one-way MANOVA.↩︎