STAD29 Assignment 7

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

Topics:

  • ancova
  • manova (short)
  • repeated measures

1 Cats

144 cats, 47 female and 97 male, had been used in an experiment with the heart drug digitalis. For each cat, its Sex was recorded, along with its body weight (Bwt) in kilograms and the weight of its heart (Hwt) in grams. We are interested in the relationship between the weight of a cat’s heart (response) and its body weight (explanatory), and whether that relationship is different for male and female cats. The data are in the file http://ritsokiguess.site/datafiles/cats.csv.

(a) (1 point) Read in and display (some of) the data.

As ever:

library(tidyverse)
my_url <- "http://ritsokiguess.site/datafiles/cats.csv"
cats <- read_csv(my_url)
Rows: 144 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Sex
dbl (2): Bwt, Hwt

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cats

144 rows, with the columns promised. The bit about digitalis is not actually at all relevant to what we are doing.

(b) (2 points) Make a suitable graph of the three variables. Add appropriate regression lines to your graph.

Two quantitative and one categorical means a scatterplot of the two quantitative variables, bearing in mind which one of them is the response, with the categorical variable’s levels being indicated by colour. Here, we are predicting heart weight from body weight:

ggplot(cats, aes(x = Bwt, y = Hwt, colour = Sex)) + geom_point() + 
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

Some things to note for yourself from the graph:

  • the male cats tended to be bigger than the females (the heaviest female cat had a body weight of 3 kg, but there were many male cats heavier than that); a quick numerical check of this appears after the list
  • cats with a bigger body weight tended to have a bigger heart, whether male or female
  • is there a difference between male and female cats? Not clear.
  • the slopes of the two regression lines look a bit different, but whether that difference is significant remains to be seen.
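A minimal check of the first point (this is only a summary of the data we already read in; the counts should match the 47 females and 97 males mentioned in the question):

# count the cats of each sex, and find the heaviest body weight in each
cats %>% 
  group_by(Sex) %>% 
  summarize(n = n(), max_bwt = max(Bwt))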

(c) (2 points) Fit a suitable analysis of covariance model, and display its output.

Include an interaction (and check whether it is necessary, which is coming up):

cats.1 <- lm(Hwt ~ Bwt * Sex, data = cats)
summary(cats.1)

Call:
lm(formula = Hwt ~ Bwt * Sex, data = cats)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.7728 -1.0118 -0.1196  0.9272  4.8646 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.9813     1.8428   1.618 0.107960    
Bwt           2.6364     0.7759   3.398 0.000885 ***
SexM         -4.1654     2.0618  -2.020 0.045258 *  
Bwt:SexM      1.6763     0.8373   2.002 0.047225 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.442 on 140 degrees of freedom
Multiple R-squared:  0.6566,    Adjusted R-squared:  0.6493 
F-statistic: 89.24 on 3 and 140 DF,  p-value: < 2.2e-16

Extra:

  • The interaction is significant (just), so we need to keep it. That means that the two lines on your graph do in fact have significantly different slopes (which may or may not be what you guessed)
  • The Estimate for SexM says how the intercept for males compares to the intercept for the baseline Sex, females. If you imagine a cat with body weight 0 (which is of course impossible) by extending the lines on your graph to the left, this says that the Male line will be considerably below the Female line by the time you get back to a Bwt of zero. Unlike the example in lecture, the Male line ends up being above the Female one eventually (once Bwt gets above about 2.5; see the check after this list), because it has a larger slope.
  • The Estimate for Bwt is an ordinary regression slope, but since the slopes of the two lines are different, we have to say that it is the slope for the baseline Sex, females. That is, a female cat that is 1 kg heavier will have a heart that is 2.64 grams heavier, on average.
  • The Intercept is likewise the intercept for females.
  • The interaction term represents the difference in slopes for male and female cats: the slope for male cats is 1.68 bigger than the slope for female cats.
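To verify the crossover point mentioned above: the male-minus-female difference in predicted heart weight is the SexM estimate plus the interaction estimate times Bwt, so the two lines cross where that difference is zero. A quick sketch, pulling the estimates out of the fitted model:

# body weight at which the male and female fitted lines cross:
# solve SexM + (Bwt:SexM) * Bwt = 0 for Bwt
-coef(cats.1)["SexM"] / coef(cats.1)["Bwt:SexM"]

This comes out to about 2.5 kg, in agreement with the graph.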

(d) (2 points) Is the interaction term significant? What does your answer mean in the context of the data?

The interaction term has a P-value of 0.047, which is just less than 0.05, so it is significant (at that \(\alpha\)). This means that the two lines predicting heart weight from body weight for males and females have different slopes.

This is unlike the example we had in class, where the interaction term and hence the two slopes were not significantly different.
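Extra: if you would rather see an F-test for the interaction than read the right line of the summary table, drop1 tests removing each droppable term (here only the interaction, since the main effects cannot be dropped while the interaction is still present). With one degree of freedom, its P-value will be the same as the one from summary:

# test the effect of dropping the interaction term
drop1(cats.1, test = "F")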

(e) (3 points) For male and female cats of body weights 2.5 and 3.5 kg (all four combinations), obtain predicted heart weights.

This uses marginaleffects in the customary way. Make a new dataframe of values to predict for first, and then do your predictions, displaying only the columns of interest.

Hence, first this:

library(marginaleffects)
new <- datagrid(model = cats.1, Bwt = c(2.5, 3.5), Sex = c("F", "M"))
new

and then this:

cbind(predictions(cats.1, new)) %>% 
  select(estimate, Bwt, Sex)

Add things like confidence limits for the predictions if you like (though we’re not using them).
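For example, the predictions output contains conf.low and conf.high columns, so something like this should show the 95% limits as well:

# the same predictions, with confidence limits this time
cbind(predictions(cats.1, new)) %>% 
  select(estimate, conf.low, conf.high, Bwt, Sex)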

(f) (2 points) Using your predictions, verify that the slopes for males and females are different.

Take a trip back to high school and recall that slope is rise over run (or search for it and tell me where you found out what to do). In this case, the “run” part is 1, because the two body weights are 1 kg apart, so the slope in each case is just the “rise” part:

Slope for (baseline) females is

12.21 - 9.57
[1] 2.64

and the slope for males is

13.91 - 9.60
[1] 4.31

These are indeed different, as required. (If you forgot to add an interaction term in your ANCOVA model, they will be the same.) You don’t need to be any more accurate than this.
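Extra: recent versions of marginaleffects can also give you the two slopes directly, without any hand subtraction. A sketch, assuming your version has avg_slopes:

# slope of Hwt with respect to Bwt, separately for each Sex
avg_slopes(cats.1, variables = "Bwt", by = "Sex")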

Extra: going back to our output from earlier:

summary(cats.1)

Call:
lm(formula = Hwt ~ Bwt * Sex, data = cats)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.7728 -1.0118 -0.1196  0.9272  4.8646 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.9813     1.8428   1.618 0.107960    
Bwt           2.6364     0.7759   3.398 0.000885 ***
SexM         -4.1654     2.0618  -2.020 0.045258 *  
Bwt:SexM      1.6763     0.8373   2.002 0.047225 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.442 on 140 degrees of freedom
Multiple R-squared:  0.6566,    Adjusted R-squared:  0.6493 
F-statistic: 89.24 on 3 and 140 DF,  p-value: < 2.2e-16

the slope for the baseline females is indeed the same as the Estimate for Bwt. The slope for males is the baseline slope, plus the change in slope between females and males, which is the interaction term, hence

2.64 + 1.68
[1] 4.32

which matches what we found from the predictions, up to rounding.
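If you want to dodge the rounding altogether, pull the two estimates out of the fitted model and add them:

# male slope = baseline (female) slope plus the interaction estimate
coef(cats.1)["Bwt"] + coef(cats.1)["Bwt:SexM"]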

2 Sociology grades

Forty students took an introductory university sociology course. Some of them had previously taken a high school sociology course, and some of them had not (respectively yes and no in column hssoc). Did this have an impact on the student’s subsequent performance? The university course had two midterms and a final exam (denoted midterm1, midterm2, and final in our data). The data are in http://ritsokiguess.site/datafiles/socgrades.csv. There are several other columns that we will not use here.

(a) (1 point) Read in and display (some of) the data.

As ever:

my_url <- "http://ritsokiguess.site/datafiles/socgrades.csv"
grades <- read_csv(my_url)
Rows: 40 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): sex, hssoc
dbl (8): class, gpa, boards, pretest, midterm1, midterm2, final, eval

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
grades

40 students (rows), the column hssoc, the three exam scores, and some other things.

(b) (2 points) Why would a MANOVA analysis be suitable here?

The key is that we have three quantitative responses (the three exam scores), so that we have a multivariate response (the “multivariate” part of MANOVA). To go with that, we have one categorical explanatory variable hssoc, whether or not the student took a high school sociology course.

Extra: there are some other explanatory variables as well. For example, pretest is the score on a test taken at the beginning of the sociology course. This is a quantitative explanatory variable. You might imagine doing a multivariate version of an ANCOVA by predicting the three exam scores from pretest and hssoc; MANCOVA is actually a thing, but we don’t do it in this course.
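If you are curious, a MANCOVA here would be only a small change to the MANOVA coming up: add the quantitative explanatory variable to the model formula. A sketch only, using the response matrix built in part (c) below (the name grades.3 is just something I made up), and not something we pursue in this course:

# sketch: MANOVA plus a quantitative explanatory variable = MANCOVA
grades.3 <- manova(response ~ pretest + hssoc, data = grades)
summary(grades.3)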

(c) (2 points) Create and display a suitable response variable for your MANOVA.

Glue the three columns containing exam scores together into a matrix. This is the way I did it in lecture:

response <- with(grades, cbind(midterm1, midterm2, final))
response
      midterm1 midterm2 final
 [1,]       43       61   129
 [2,]       50       47    60
 [3,]       47       79   119
 [4,]       24       40   100
 [5,]       47       60    79
 [6,]       57       59    99
 [7,]       42       61    92
 [8,]       42       79   107
 [9,]       69       83   156
[10,]       48       67   110
[11,]       59       74   116
[12,]       21       40    49
[13,]       52       71   107
[14,]       35       40   125
[15,]       35       57    64
[16,]       59       58   100
[17,]       68       66   138
[18,]       38       58    63
[19,]       45       24    82
[20,]       37       48    73
[21,]       54      100   132
[22,]       45       83    87
[23,]       31       70    89
[24,]       39       48    99
[25,]       67       85   119
[26,]       30       14   100
[27,]       19       55    84
[28,]       71      100   166
[29,]       80       94   111
[30,]       47       45   110
[31,]       46       58    93
[32,]       59       90   141
[33,]       48       84    99
[34,]       68       81   114
[35,]       43       49    96
[36,]       31       54    39
[37,]       64       87   149
[38,]       19       36    53
[39,]       43       51    39
[40,]       20       59    91

You can also do it tidyverse-style by keeping everything in dataframes until the end (if you go this way for real, save the result in response as before):

grades %>% 
  select(midterm1:final) %>% 
  as.matrix()
      midterm1 midterm2 final
 [1,]       43       61   129
 [2,]       50       47    60
 [3,]       47       79   119
 [4,]       24       40   100
 [5,]       47       60    79
 [6,]       57       59    99
 [7,]       42       61    92
 [8,]       42       79   107
 [9,]       69       83   156
[10,]       48       67   110
[11,]       59       74   116
[12,]       21       40    49
[13,]       52       71   107
[14,]       35       40   125
[15,]       35       57    64
[16,]       59       58   100
[17,]       68       66   138
[18,]       38       58    63
[19,]       45       24    82
[20,]       37       48    73
[21,]       54      100   132
[22,]       45       83    87
[23,]       31       70    89
[24,]       39       48    99
[25,]       67       85   119
[26,]       30       14   100
[27,]       19       55    84
[28,]       71      100   166
[29,]       80       94   111
[30,]       47       45   110
[31,]       46       58    93
[32,]       59       90   141
[33,]       48       84    99
[34,]       68       81   114
[35,]       43       49    96
[36,]       31       54    39
[37,]       64       87   149
[38,]       19       36    53
[39,]       43       51    39
[40,]       20       59    91

The second way looks a lot different, but gets to the same place via a different route.

Aside: the matrix response is really a bit big to be displaying all of, but on this occasion I thought it was useful for you to see it (and to have it for the grader to check, should anything go wrong later).
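If you wanted to show only some of it, head works on a matrix just as it does on a dataframe:

# display only the first six rows of the response matrix
head(response)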

(d) (3 points) Fit a MANOVA and display the output (without using anything from the car package).

This means the easier way using manova:

grades.1 <- manova(response ~ hssoc, data = grades)
summary(grades.1)
          Df Pillai approx F num Df den Df  Pr(>F)  
hssoc      1 0.2163   3.3119      3     36 0.03076 *
Residuals 38                                        
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
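Extra: summary for a manova fit has a test argument, so you can display one of the other test statistics instead of Pillai’s trace if you wish, for example:

# the same MANOVA, displayed using Wilks' lambda instead of Pillai's trace
summary(grades.1, test = "Wilks")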

Extra: here is how you would do the same analysis using Manova from the car package:

library(car)
grades.2a <- lm(response ~ hssoc, data = grades)
summary(Manova(grades.2a))

Type II MANOVA Tests:

Sum of squares and products for error:
         midterm1 midterm2    final
midterm1  7395.75  6319.25  8590.75
midterm2  6319.25 14029.71 10873.71
final     8590.75 10873.71 30520.71

------------------------------------------
 
Term: hssoc 

Sum of squares and products for the hypothesis:
         midterm1 midterm2    final
midterm1  1848.15 1776.000 3141.300
midterm2  1776.00 1706.667 3018.667
final     3141.30 3018.667 5339.267

Multivariate Tests: hssoc
                 Df test stat approx F num Df den Df   Pr(>F)  
Pillai            1 0.2162956 3.311896      3     36 0.030761 *
Wilks             1 0.7837044 3.311896      3     36 0.030761 *
Hotelling-Lawley  1 0.2759913 3.311896      3     36 0.030761 *
Roy               1 0.2759913 3.311896      3     36 0.030761 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As you see, the P-values (all four of them) are the same. The output from manova only gives the first test in the Manova output, known as “Pillai’s trace”. The four tests are identical because hssoc only had one degree of freedom (there were only two possible answers to whether or not a student had taken a sociology course in high school).

(e) (2 points) What do you conclude from your analysis, in the context of the data?

The conclusion from a (significant) MANOVA is always vague. Here, all you can say is that whether or not a student took a high school sociology course had some effect on one or more of the exam scores. What kind of effect, we have no idea.

(f) (4 points) Make a graph that shows the effect of hssoc on each of the three exam scores. What does your graph say about the effect of hssoc on exam scores? (Hint: rearrange the data so that all the exam scores are in one column, and then use facets.)

If you only had one exam score, you would draw a boxplot, so our aim now is to make facetted boxplots, with the facets being the different exams. Hence, pivot the three exam scores longer, then make a boxplot of exam score by hssoc, and facet by which exam you are looking at:

grades %>% 
  pivot_longer(midterm1:final, names_to = "exam_name", values_to = "exam_score") %>% 
  ggplot(aes(x = hssoc, y = exam_score)) + geom_boxplot() +
    facet_wrap(~exam_name, scales = "free")

The effect of hssoc is that students who have taken a high school sociology course do better on average on all three of the exams in the university course, and that is why the MANOVA came out significant. (This is a rather less interesting conclusion than the seed yield and weight one from lecture.)

Extra: this is really three univariate conclusions rather than one multivariate one. You could do three \(t\)-tests to compare the exam scores by hssoc:

t.test(final ~ hssoc, data = grades)

    Welch Two Sample t-test

data:  final by hssoc
t = -2.4915, df = 28.474, p-value = 0.01881
alternative hypothesis: true difference in means between group no and group yes is not equal to 0
95 percent confidence interval:
 -42.958373  -4.208294
sample estimates:
 mean in group no mean in group yes 
         90.04167         113.62500 
t.test(midterm1 ~ hssoc, data = grades)

    Welch Two Sample t-test

data:  midterm1 by hssoc
t = -2.9222, df = 26.447, p-value = 0.007031
alternative hypothesis: true difference in means between group no and group yes is not equal to 0
95 percent confidence interval:
 -23.626864  -4.123136
sample estimates:
 mean in group no mean in group yes 
           40.500            54.375 
t.test(midterm2 ~ hssoc, data = grades)

    Welch Two Sample t-test

data:  midterm2 by hssoc
t = -2.002, df = 24.603, p-value = 0.05642
alternative hypothesis: true difference in means between group no and group yes is not equal to 0
95 percent confidence interval:
 -27.0610130   0.3943463
sample estimates:
 mean in group no mean in group yes 
         57.54167          70.87500 

It turns out that there isn’t quite a significant difference on midterm 2, perhaps because the scores in both groups are very spread out (the outliers mean that you might not even want to run a \(t\)-test here). The other two exam scores definitely differ significantly, though.
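Extra extra: having done three t-tests, you might want to guard against the multiple testing, for example with a Holm adjustment of the three P-values (a sketch; the numbers are copied from the output above):

# Holm adjustment of the three t-test P-values (final, midterm1, midterm2)
p.adjust(c(0.01881, 0.007031, 0.05642), method = "holm")

The conclusions survive the adjustment: the final and midterm 1 differences remain significant, and midterm 2 still does not.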