STAD29 Assignment 8
You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.
If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)
You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.
Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.
Packages
library(tidyverse)
library(car)
library(lme4)
library(MVTests)
library(MASS, exclude = "select")
Paw inflammation
In animals like mice, swelling of the paw is an indication of inflammation, in the third sense here. Two actual drugs, Dexamethasone and Curcuma longa, along with a placebo drug Control, were randomly allocated to 15 mice (each mouse received one drug, shown in Treatment). The amount of inflammation was measured at four different times: 30 minutes, 1 hour, 2 hours, and 3 hours after the drug (or placebo) was administered. The columns whose names begin with X are the amount of inflammation at the different times. The first digit after the X is the number of hours and the remaining two digits are the number of minutes. In addition, each mouse had an ID number, shown in Animal.number. The data are in http://ritsokiguess.site/datafiles/paw_inflammation.csv.
Hint: the names of the real treatments are long, and you can refer to them as C and D in this question, but make sure you write out Control in full so as not to confuse it with Curcuma longa.
- (1 point) Read in and display (some of) the data.
No great surprises, which is to say no surprises at all:
<- "http://ritsokiguess.site/datafiles/paw_inflammation.csv"
my_url <- read_csv(my_url) paw_inflammation
Rows: 15 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Treatment
dbl (5): Animal.number, X030, X100, X200, X300
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
paw_inflammation
This is, you will note, wide-format data, with 15 rows, one per mouse, and four columns of inflammation values, one per time point.
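Extra (my addition, not required): if you want the shorter C and D labels from the hint to show up on your plots as well, one way is to recode the treatment names with fct_recode from forcats (part of the tidyverse). This is only a sketch; the new column name Treatment_short and the dataframe name paw_short are my own choices:
# a sketch: shorter treatment labels, keeping the original Treatment column
paw_inflammation %>%
  mutate(Treatment_short = fct_recode(Treatment,
                                      C = "Curcuma longa",
                                      D = "Dexamethasone")) -> paw_short
paw_short %>% count(Treatment, Treatment_short)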
- (2 points) Create, save, and display (some of) a longer dataframe that has all the inflammation values in one column (and thus several rows per mouse). Here, and elsewhere for these data, treat time as a categorical variable (the categories will display in a sensible order).
A pivot-longer. Save yourself some work by noting that all the columns with inflammation values in them have names beginning with X:
paw_inflammation %>%
  pivot_longer(starts_with("X"), names_to = "time",
               values_to = "inflammation") -> paw_long
paw_long
You see (this is a check on your work) that mouse number 1 now appears in four rows, one for each of the four time points (that are listed in the column I called time). You can choose your own names for the new columns, but each name should say something about what the column contains, hence my choices of time and inflammation.
The last sentence of this question means that there is no need for you to turn what I called time into a numerical value (eg. with parse_number as I did in lecture). In any case, that does not make sense here unless you treat the hours and minutes properly. It is fine, if unnecessary, if you somehow convert all the different times into minutes (or all of them into hours), such as 30, 60, 120, 180 (minutes), but it is an error to just blindly copy what I did in lecture because 30, 100, 200, 300 are not meaningful as numbers. You need to show understanding.
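Extra (my addition, a sketch): if you did want a numeric time, here is one way to treat the hours and minutes properly by pulling them out of the column name separately (the new column names hours, minutes, and time_min are my own choices):
# a sketch: X030 means 0 hours 30 minutes, X100 means 1 hour 00 minutes, and so on
paw_long %>%
  mutate(hours = as.numeric(str_sub(time, 2, 2)),
         minutes = as.numeric(str_sub(time, 3, 4)),
         time_min = 60 * hours + minutes) %>%
  select(time, time_min) %>%
  distinct()
This gives 30, 60, 120, 180 minutes, which, unlike 30, 100, 200, 300, are meaningful as numbers.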
- (3 points) Make an interaction plot to assess a possible interaction between treatment and time. What does your plot tell you?
This uses the “longer” dataframe you just made.
Two steps to make the graph: work out the mean inflammation for each combination of Treatment and whatever you called time, then plot the mean inflammation against time with Treatment as colour and group. The times will display in a sensible order, since the person who organized the data was thinking ahead:
paw_long %>%
  group_by(Treatment, time) %>%
  summarize(mean_inflammation = mean(inflammation)) %>%
  ggplot(aes(x = time, y = mean_inflammation,
             colour = Treatment, group = Treatment)) +
  geom_point() + geom_line()
`summarise()` has grouped output by 'Treatment'. You can override using the
`.groups` argument.
Come to a conclusion about whether you think those lines are parallel or not, and say what that implies. So your options are:
- the lines are close to parallel so I don’t expect to see an interaction between treatment and time
- the lines are not parallel, so I expect to see an interaction between treatment and time.
I don’t mind what you decide about the lines being parallel or not, but you need to use that decision to say what you think about whether there’s an interaction between treatment and time.
Extra: if you look ahead to possible simple effects (fixing time), you might imagine that at time 30 minutes, all three treatments have a different mean, but after that the two real treatments have the same mean (the placebo is associated with more inflammation at all times). This change in potential significance over time would lead you to infer an interaction and hence a conclusion that the lines are really not parallel.
Having said that, the interaction plot doesn’t tell you anything about variability. To get a handle on that, you would draw a spaghetti plot, which is not too bad here because there are only 15 mice, and hence only 15 spaghetti strands:
ggplot(paw_long, aes(x = time, y = inflammation,
colour = Treatment, group = Animal.number)) +
geom_point() + geom_line()
It looks to me as if maybe the green traces for Curcuma longa (C) start out higher than the blue traces for Dexamethasone (D) and end up the same, and, maybe, there is little enough variability for that pattern to be consistent enough to be significant. But you can imagine that the decision about interaction when it comes to significance is going to be a close call.
- (3 points) Run a suitable mixed model repeated measures analysis. What is your conclusion from it, in the context of the data?
I’m having you do this one first for a change. Remember to use the longer data that you made for the plot.
Predict inflammation from treatment and time and their interaction, and include a random effect for the different mice, here in Animal.number. Finally, run drop1 to test that interaction term:
paw.1 <- lmer(inflammation ~ Treatment * time + (1 | Animal.number),
              data = paw_long)
drop1(paw.1, test = "Chisq")
The interaction is significant, with a P-value of 0.012. Hence the pattern of inflammation over time does differ for the different treatments.
(That is to say, the decision from this analysis is that those lines on the interaction plot were not parallel. But see later.)
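Extra (my addition, not required): an equivalent way to test that interaction is to fit the model without it and compare the two models, for example like this (paw.1a is my own name for the smaller model):
# a sketch: fit the no-interaction model and compare by likelihood ratio test
paw.1a <- lmer(inflammation ~ Treatment + time + (1 | Animal.number),
               data = paw_long)
anova(paw.1a, paw.1)
anova refits both models by maximum likelihood and tests the interaction term; the P-value should be essentially the same as the one from drop1.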
- (4 points) Explore any possible interaction between treatment and time by finding simple effects of treatment (on inflammation) at each time point. Bear in mind that there are here three treatments. What does your simple effects analysis tell you about the reason for the significance (or not) of the interaction?
This is the same idea as on the worksheet, but with an extra wrinkle that we get to shortly. The bit that's the same is that by fixing time, we only have one observation from each mouse, and so we look for a dependence of inflammation on treatment at that time using an ordinary aov. As on the worksheet, this is most easily done using the wider dataframe that you read in from the file, that we haven't otherwise used yet:
p.1 <- aov(X030 ~ Treatment, data = paw_inflammation)
summary(p.1)
Df Sum Sq Mean Sq F value Pr(>F)
Treatment 2 1565 782.5 83.84 8.88e-08 ***
Residuals 12 112 9.3
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If you prefer, you can also use the longer dataframe that you made. This requires you to filter the observations you want first, and then run the aov on the right columns:
paw_long %>%
  filter(time == "X030") -> pp
pp.1 <- aov(inflammation ~ Treatment, data = pp)
summary(pp.1)
Df Sum Sq Mean Sq F value Pr(>F)
Treatment 2 1565 782.5 83.84 8.88e-08 ***
Residuals 12 112 9.3
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
and you see that this is a bit more work to get to the same place.
However, all this tells you so far is that the treatments are not all the same at 30 minutes. Here comes the wrinkle: to find out which ones differ from which, we need to run Tukey. I’m going to rerun my code and glue the Tukey on the end, for ease of copying and pasting later:
p.1 <- aov(X030 ~ Treatment, data = paw_inflammation)
summary(p.1)
Df Sum Sq Mean Sq F value Pr(>F)
Treatment 2 1565 782.5 83.84 8.88e-08 ***
Residuals 12 112 9.3
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(p.1)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = X030 ~ Treatment, data = paw_inflammation)
$Treatment
diff lwr upr p adj
Curcuma longa-Control -17.6 -22.7548 -12.445198 0.0000027
Dexamethasone-Control -24.2 -29.3548 -19.045198 0.0000001
Dexamethasone-Curcuma longa -6.6 -11.7548 -1.445198 0.0131404
Now we have a conclusion: all three treatments differ in the amount of inflammation after 30 minutes (that difference you might have seen at 30 minutes between D and C is indeed significant).
Moving on to 1 hour: copy, paste, and change some numbers (the time, and maybe the name of your fitted model object):
p.2 <- aov(X100 ~ Treatment, data = paw_inflammation)
summary(p.2)
Df Sum Sq Mean Sq F value Pr(>F)
Treatment 2 1574.8 787.4 229.3 2.75e-10 ***
Residuals 12 41.2 3.4
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(p.2)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = X100 ~ Treatment, data = paw_inflammation)
$Treatment
diff lwr upr p adj
Curcuma longa-Control -20.2 -23.326451 -17.0735491 0.0000000
Dexamethasone-Control -23.0 -26.126451 -19.8735491 0.0000000
Dexamethasone-Curcuma longa -2.8 -5.926451 0.3264509 0.0810808
The ANOVA doesn’t tell us much (we already kind of suspect Control is different from the others, which would by itself make the ANOVA significant), but the Tukey tells us that C and D are no longer significantly different (at \(\alpha = 0.05\)).
Repeat for time 2 hours and 3 hours:
p.3 <- aov(X200 ~ Treatment, data = paw_inflammation)
summary(p.3)
Df Sum Sq Mean Sq F value Pr(>F)
Treatment 2 1690.0 845.0 145.7 3.83e-09 ***
Residuals 12 69.6 5.8
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(p.3)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = X200 ~ Treatment, data = paw_inflammation)
$Treatment
diff lwr upr p adj
Curcuma longa-Control -22 -26.063569 -17.936431 0.0000000
Dexamethasone-Control -23 -27.063569 -18.936431 0.0000000
Dexamethasone-Curcuma longa -1 -5.063569 3.063569 0.7922928
p.4 <- aov(X300 ~ Treatment, data = paw_inflammation)
summary(p.4)
Df Sum Sq Mean Sq F value Pr(>F)
Treatment 2 1572.1 786.1 136.3 5.62e-09 ***
Residuals 12 69.2 5.8
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(p.4)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = X300 ~ Treatment, data = paw_inflammation)
$Treatment
diff lwr upr p adj
Curcuma longa-Control -21.2 -25.251875 -17.148125 0.0000000
Dexamethasone-Control -22.2 -26.251875 -18.148125 0.0000000
Dexamethasone-Curcuma longa -1.0 -5.051875 3.051875 0.7912488
and the conclusion is the same as for 1 hour: Control is significantly different from the other two, but C and D are not significantly different from each other.
To summarize, the reason the interaction came out significant is that the pattern of significant differences between treatments at 30 minutes (all three treatments differ) is not the same as the pattern at the later times (only Control differs from the other two). That changing pattern of treatment differences over time is exactly what a significant interaction means.
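Extra (my addition, a sketch, not something you needed to do): the copying and pasting above could be avoided by writing a small helper function and running it for each of the four time columns. The function name simple_effect is my own choice:
# a sketch: simple effect of Treatment at one time, with the Tukey on the end
simple_effect <- function(colname, d = paw_inflammation) {
  fit <- aov(reformulate("Treatment", response = colname), data = d)
  TukeyHSD(fit)
}
simple_effect("X030")
# or all four at once: lapply(c("X030", "X100", "X200", "X300"), simple_effect)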
- (4 points) Rerun your repeated measures analysis using Manova. Comment briefly on any differences you see. (You do not need to do any simple effects here, and you can stop with the interpretation as soon as you find a difference, if you find one. The output from Manova is long, but for this question, you can hand it all in.)
This analysis uses the wide data that you read in from the file.
Create the response matrix, which this time I called y:
paw_inflammation %>%
  select(starts_with("X")) %>%
  as.matrix() -> y
y
X030 X100 X200 X300
[1,] 32 29 27 27
[2,] 33 31 29 28
[3,] 32 30 29 26
[4,] 39 35 32 31
[5,] 30 32 31 27
[6,] 12 10 8 7
[7,] 8 7 6 6
[8,] 10 10 6 5
[9,] 6 6 6 6
[10,] 9 9 7 4
[11,] 11 11 5 3
[12,] 20 12 3 3
[13,] 14 9 9 8
[14,] 17 12 9 8
[15,] 16 12 12 11
y is only 15 rows, so you can certainly display all of it (which is a good idea for you when you're working on this). Including it in the work you hand in is a good way to convince the grader that you are doing the right thing so far.
Use the with way if you like it better.
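That would look something like this (a sketch; it should produce the same matrix y as the select way above):
# the "with" way of building the response matrix
y <- with(paw_inflammation, cbind(X030, X100, X200, X300))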
Next, set up the time stuff:
times <- colnames(y)
times.df <- data.frame(times = factor(times))
times
[1] "X030" "X100" "X200" "X300"
times.df
This, as is often the case for these, is a straightforward act of theft from the lecture notes.
Then, fit an lm using your new response variable (as it depends on Treatment, with the time lurking in the background), and then feed the output from lm into Manova with the right things to get the time-dependency right (again, copy the lecture notes):
paw.2a <- lm(y ~ Treatment, data = paw_inflammation)
paw.2 <- Manova(paw.2a, idata = times.df, idesign = ~times)
summary(paw.2)
Type II Repeated Measures MANOVA Tests:
------------------------------------------
Term: (Intercept)
Response transformation matrix:
(Intercept)
X030 1
X100 1
X200 1
X300 1
Sum of squares and products for the hypothesis:
(Intercept)
(Intercept) 61824.6
Multivariate Tests: (Intercept)
Df test stat approx F num Df den Df Pr(>F)
Pillai 1 0.98951 1132.319 1 12 3.0136e-13 ***
Wilks 1 0.01049 1132.319 1 12 3.0136e-13 ***
Hotelling-Lawley 1 94.35989 1132.319 1 12 3.0136e-13 ***
Roy 1 94.35989 1132.319 1 12 3.0136e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------
Term: Treatment
Response transformation matrix:
(Intercept)
X030 1
X100 1
X200 1
X300 1
Sum of squares and products for the hypothesis:
(Intercept)
(Intercept) 25381.2
Multivariate Tests: Treatment
Df test stat approx F num Df den Df Pr(>F)
Pillai 2 0.97484 232.4286 2 12 2.5396e-10 ***
Wilks 2 0.02516 232.4286 2 12 2.5396e-10 ***
Hotelling-Lawley 2 38.73810 232.4286 2 12 2.5396e-10 ***
Roy 2 38.73810 232.4286 2 12 2.5396e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------
Term: times
Response transformation matrix:
times1 times2 times3
X030 1 0 0
X100 0 1 0
X200 0 0 1
X300 -1 -1 -1
Sum of squares and products for the hypothesis:
times1 times2 times3
times1 528.0667 326.33333 112.73333
times2 326.3333 201.66667 69.66667
times3 112.7333 69.66667 24.06667
Multivariate Tests: times
Df test stat approx F num Df den Df Pr(>F)
Pillai 1 0.889936 26.95215 3 10 4.1669e-05 ***
Wilks 1 0.110064 26.95215 3 10 4.1669e-05 ***
Hotelling-Lawley 1 8.085645 26.95215 3 10 4.1669e-05 ***
Roy 1 8.085645 26.95215 3 10 4.1669e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------
Term: Treatment:times
Response transformation matrix:
times1 times2 times3
X030 1 0 0
X100 0 1 0
X200 0 0 1
X300 -1 -1 -1
Sum of squares and products for the hypothesis:
times1 times2 times3
times1 80.533333 25.4666667 -2.1333333
times2 25.466667 8.1333333 -0.2666667
times3 -2.133333 -0.2666667 2.1333333
Multivariate Tests: Treatment:times
Df test stat approx F num Df den Df Pr(>F)
Pillai 2 0.6524919 1.775477 6 22 0.150827
Wilks 2 0.4139368 1.847644 6 20 0.140497
Hotelling-Lawley 2 1.2553476 1.883021 6 18 0.139205
Roy 2 1.1108861 4.073249 3 11 0.035828 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Univariate Type II Repeated-Measures ANOVA Assuming Sphericity
Sum Sq num Df Error SS den Df F value Pr(>F)
(Intercept) 15456.1 1 163.8 12 1132.3187 3.014e-13 ***
Treatment 6345.3 2 163.8 12 232.4286 2.540e-10 ***
times 311.0 3 128.2 36 29.1092 9.841e-10 ***
Treatment:times 56.6 6 128.2 36 2.6474 0.03131 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Mauchly Tests for Sphericity
Test statistic p-value
times 0.13978 0.00082319
Treatment:times 0.13978 0.00082319
Greenhouse-Geisser and Huynh-Feldt Corrections
for Departure from Sphericity
GG eps Pr(>F[GG])
times 0.53346 4.227e-06 ***
Treatment:times 0.53346 0.07517 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
HF eps Pr(>F[HF])
times 0.6027375 1.209896e-06
Treatment:times 0.6027375 6.575214e-02
Then, peruse the long output (you can hand it all in):
- start with Mauchly’s test for sphericity (near the bottom). Sphericity is rejected (P-value 0.0008 for both the interaction and the time effect).
- So, look at the adjusted P-value for the interaction (I suggest Huynh-Feldt in the lecture notes), which is 0.066.
- This is not significant at \(\alpha = 0.05\).
This is a difference from the mixed model analysis, where the interaction was significant, so you can stop here and say that.
A note here that if sphericity had been OK, you would have inferred a significant interaction again (P-value 0.031), but the implication is that the interaction only appears to be significant because actually sphericity is not OK and the adjustment (Huynh-Feldt) moves the P-value for the interaction to just the other side of significance.
Extra 1: according to this analysis, those traces on the interaction plot are not different from parallel. Thus there is a time effect, and, independently, a treatment effect, to assess.
To do that, in this kind of analysis, we cannot “take out” the interaction because time is intertwined with the whole analysis (from that point of view, mixed models is much clearer). So we go back through this analysis and pull out the P-values we are looking for.
- Since sphericity failed, in testing for time, we look again at the Huynh-Feldt table to get a P-value of \(1.2 \times 10^{-6}\), a strongly significant time effect.
- Treatment is the same over all times, so sphericity is not an issue here. Go back and look at Univariate Type II Repeated-Measures ANOVA Assuming Sphericity (even though we are no longer assuming sphericity). The P-value for treatment is \(2.54 \times 10^{-10}\), again strongly significant.
Having seen this much, you are probably itching to do Tukey to find out what the differences are. There isn’t Tukey or similar here, so what you do is look at your interaction plot and/or spaghetti plot and try to see what made the tests significant:
- time: inflammation is going down over time for all treatments
- treatment: inflammation is less for the two real treatments than for Control, over all times.
According to this analysis (no significant interaction), you don’t get to say that 30 minutes is different from the other times: no interaction means that the treatment effect is the same for all times.
You might be wondering whether you should use Manova or mixed models for this kind of analysis. What you should never do is choose between them on the basis of P-values. That is P-hacking. What you do is to pick one, and then follow the analysis with that method and accept the results you get with it. (I showed you both because you can expect to see both in your reading. Mixed models is newer, but not every author will know about it.)
Extra 2: a history lesson. (I might have told you this story in lecture.) When I first started teaching this course, we used a program called SAS. This is big and expensive and (these days) oriented towards business analytics, but it has the historical advantage of having been the first actual statistical software. Back in the late 1960s, when SAS was new, its competition was coding everything yourself in Fortran, and so it made a lot of statistical analyses possible for many people when they were not before.1 On the computer systems of the time, you had punched cards with your code (or a terminal connected to the network into which you typed your code) and then you “submitted your job” to the mainframe computer on which the actual computation was done. If you were lucky, your code had no errors and you got some output. If you were unlucky, you had to fix your errors, hope you got them all, and try again. The output would be printed on paper and delivered to your mailbox in the computer centre, from where you would collect it.
As you see, this was a very different process than running R, where you type some code and see the output right away. One of the early selling points of R was this interactiveness (and to show off the new graphics terminals of the time). Anyway, going back to SAS, because it was run in a non-interactive way, it would give you a ton of output, everything it thought you might need, to save you having to run things again. You would scour through the output, pull out what you needed, and ignore everything else. (SAS still works this way: you put together some code, “submit” it, and get a pile of output to read through.)
All of this is to say that the output you get from Manova is very much inspired by how SAS does things, right down to the "Type II Sums of Squares", which is actually terminology invented by SAS, and the listing of one table after another in a typewriter font. This is very much output for reading through carefully, rather than output for doing things with (like the output from tidy and glance in broom is).
So I have a throwback moment every time I see the Manova output.
Hemophilia
Hemophilia is a rare disorder in which the blood doesn’t clot in the typical way because it doesn’t have enough blood-clotting proteins (clotting factors). If you have hemophilia, you might bleed for a longer time after an injury than you would if your blood clotted properly. (This from the Mayo Clinic website, which is a very good and reliable source of information about medical matters.)
Researchers wanted to find out how to distinguish women with hemophilia (carrier in our dataset) from women without (normal). They considered two candidate variables, called AHFactivity and AHFantigen in our dataset. The data are in http://ritsokiguess.site/datafiles/hemophilia_rrcov.csv.
- (1 point) Read in and display (some of) the data.
As ever:
<- "http://ritsokiguess.site/datafiles/hemophilia_rrcov.csv"
my_url <- read_csv(my_url) hemophilia
Rows: 75 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): gr
dbl (2): AHFactivity, AHFantigen
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hemophilia
The column gr says whether each woman had hemophilia or not.
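Extra (my addition): it is worth counting how many women are in each group, since the group sizes come back later as the prior probabilities in the discriminant analysis (45 and 30 out of 75 would correspond to the 0.6 and 0.4 you will see there):
# how many women in each group?
hemophilia %>% count(gr)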
- (2 points) Make a suitable plot of the three variables in the dataset.
Two quantitative variables and one categorical variable suggests a scatterplot with points coloured by group:
ggplot(hemophilia, aes(x = AHFactivity, y = AHFantigen, colour = gr)) +
geom_point()
There’s no particular reason for the \(x\) and \(y\) variables to be this way around, so you could also do
ggplot(hemophilia, aes(y = AHFactivity, x = AHFantigen, colour = gr)) +
geom_point()
The impression is the same either way: one corner has mostly red points, one has mostly blue, and there is a dividing line going diagonally between them.
Note: we are lucky here to be able to draw a graph, because we have two quantitative (response) variables rather than more (which would require a plot in 3D or more). This is the same issue that we had with MANOVA before.
Extra: talking of MANOVA, I was going to have you run one here, to convince you that the normal and carrier women did indeed differ on those two variables. I think, though, you are probably convinced enough by looking at your graph. In any case, including it made the assignment too long. Here’s how it goes:
This is an "ordinary" MANOVA, not repeated measures, so there is none of that stuff with times. Make a response variable out of the quantitative variable columns, for example with
hemophilia %>% select(-gr) %>% as.matrix() -> y
(or the cbind way), and then most easily
hemophilia.1 <- manova(y ~ gr, data = hemophilia)
summary(hemophilia.1)
Df Pillai approx F num Df den Df Pr(>F)
gr 1 0.53006 40.605 2 72 1.562e-12 ***
Residuals 73
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
or, if you want to live a little:
hemophilia.2a <- lm(y ~ gr, data = hemophilia)
hemophilia.2 <- Manova(hemophilia.2a)
summary(hemophilia.2)
Type II MANOVA Tests:
Sum of squares and products for error:
AHFactivity AHFantigen
AHFactivity 1.652508 1.126485
AHFantigen 1.126485 1.577224
------------------------------------------
Term: gr
Sum of squares and products for the hypothesis:
AHFactivity AHFantigen
AHFactivity 0.5391996 -0.22388851
AHFantigen -0.2238885 0.09296385
Multivariate Tests: gr
Df test stat approx F num Df den Df Pr(>F)
Pillai 1 0.5300558 40.60484 2 72 1.562e-12 ***
Wilks 1 0.4699442 40.60484 2 72 1.562e-12 ***
Hotelling-Lawley 1 1.1279121 40.60484 2 72 1.562e-12 ***
Roy 1 1.1279121 40.60484 2 72 1.562e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
with the same result, albeit a bit more verbosely. You can (if you are doing this) use either method. If you think one method is better (e.g. the manova, because it is a simpler way to get to the same answer), you can say that.
The P-value is \(1.6 \times 10^{-12}\), extremely small, so we reject the null hypothesis and conclude that women with hemophilia differ on one or both variables (or a combination of the two) from the women that don’t.
This is where you would stop. The MANOVA, by itself, tells you nothing about how the groups differ. That we explore shortly with a discriminant analysis.
In case you are wondering, the null hypothesis in this kind of test is that AHFactivity has the same mean for women with hemophilia and women that don't have it, and also AHFantigen has the same mean for women with hemophilia and women that don't have it.2 So rejecting this null says that the means of the two variables are not both the same across both groups.
I can assess the MANOVA versions of the normality and equal spreads assumptions. Normality first. We begin by putting all the quantitative variables in one column, and then facet by them and by the groups at the end:
hemophilia %>%
  pivot_longer(-gr, names_to = "xname", values_to = "x") %>%
  ggplot(aes(sample = x)) + stat_qq() + stat_qq_line() +
  facet_grid(xname ~ gr, scales = "free_y")
I used x in here, even though the quantitative variables are both responses, because I already used y as the name of the response. Not thinking ahead.
In selecting the columns to pivot, it is often useful with this kind of thing to say "everything except the categorical column", but you could also use the starts_with() idea because all the quantitative columns here have names starting with the same thing, AHF.
In the facets, I gave the y-axes separate scales (and not the x) because the quantitative response variables might have been measured on very different scales (though here the scales were almost the same). This helps the normal quantile plots to fill their facets and gives you a better picture of the normality.
To the graphs: the ones for normal are (ironically) not so close to normally distributed. There is, I think, a version of the Central Limit Theorem that applies here; the implication is the same as in C32 in that if you have a larger sample size, as we do here, you don't have to worry so much about normality. So I'm going to call these good.
To assess “equal spreads”, meaning that each variable has the same SD across the two groups of women, and they have the same correlation across those two groups, Box’s M test:
library(MVTests)
summary(BoxM(y, hemophilia$gr))
Box's M Test
Chi-Squared Value = 5.338277 , df = 3 and p-value: 0.149
Box’s M test does not reject here, despite the large sample sizes, so we should have no problem trusting the MANOVA.
I didn’t have you do all this, because the question is really about discriminant analysis, and there is plenty of work for you to do there. The point of doing a MANOVA first is to convince yourself that there is something for the discriminant analysis to find, and therefore that it is worth doing. (If you do a discriminant analysis without doing a MANOVA first, what you find might just be chance, though in this case it’ll be pretty clear because the results from the discriminant analysis will be terrible: there will basically be nothing that distinguishes the groups.)
- (2 points) Run a discriminant analysis and display the results.
Remember that for this, the quantitative response variables go on the right and the categorical explanatory variable goes on the left. The logic is that discriminant analysis asks how group membership is influenced by values on the other variables, even though they are really outcomes:
hemophilia.3 <- lda(gr ~ AHFactivity + AHFantigen, data = hemophilia)
hemophilia.3
Call:
lda(gr ~ AHFactivity + AHFantigen, data = hemophilia)
Prior probabilities of groups:
carrier normal
0.6 0.4
Group means:
AHFactivity AHFantigen
carrier -0.3079467 -0.005991111
normal -0.1348700 -0.077856667
Coefficients of linear discriminants:
LD1
AHFactivity 9.032787
AHFantigen -8.006605
This output is pretty short. It is simplified because there are only two groups, and therefore there is only \(2-1=1\) linear discriminant (regardless of how many quantitative variables there are). As a result, there is no “percent of trace” because the one linear discriminant takes up all 100% and there is no issue about whether it is important or not: it is the most important out of the 1 linear discriminant there is!
- (2 points) What would make a woman have a high (positive) LD1 score?
Look at the Coefficients of Linear Discriminants at the bottom of the output, and look out for any that are far from zero compared to the others. In this case, they are both about the same size, so we should consider both, but the one for AHFactivity is positive and the one for AHFantigen is negative. That means that a large (positive) LD1 score would come from a large value on AHFactivity and a small value on AHFantigen.
- (3 points) Considering the goal of discriminant analysis, and looking at the plot you drew earlier, do the results of your discriminant analysis make sense? Explain briefly.
From the previous question, a large LD1 score goes with a large value of AHFactivity and a small value of AHFantigen. Looking back at my (first) graph, these values would put you in the bottom right corner of the graph, which is firmly in the blue (normal) zone.
Flipping this around, a small (negative) LD1 score would go with a small value of AHFactivity and a large value of AHFantigen, which is top left of the graph, in with the red (carrier) women.
Hence, LD1 could be expected to do a good job of distinguishing normal from carrier women, and does so in a way that is consistent with the graph.
(If your graph was my second one, you’ll have to change some things around, but you should come to the same place in the end: large LD1 is now top left, but that still corresponds to “normal” and bottom right corresponds to small (negative) LD1 and to “carrier”).
I occasionally find that some students end up with all the signs in a discriminant analysis reversed (which I think is something to do with system architecture, like being on a Mac), but even if that happens to you, you should end up with one end of LD1 scores being “normal” and the other end being “carrier”.
- (2 points) Obtain the predicted group membership for all the women, and make a dataframe containing the original data, predicted group memberships, LD scores and posterior probabilities of being in each group.
This is almost more to say than it is to do:
p <- predict(hemophilia.3)
hemophilia_pred <- cbind(hemophilia, p)
The two steps are to run predict on the output from lda, and then to glue the output from that onto the original data. It might be smart to check that the result contains everything the question asked for, by a means like this:
glimpse(hemophilia_pred)
Rows: 75
Columns: 7
$ AHFactivity <dbl> -0.0056, -0.1698, -0.3469, -0.0894, -0.1679, -0.0836…
$ AHFantigen <dbl> -0.1657, -0.1585, -0.1879, 0.0064, 0.0713, 0.0106, -…
$ gr <chr> "normal", "normal", "normal", "normal", "normal", "n…
$ class <fct> normal, normal, carrier, normal, carrier, normal, ca…
$ posterior.carrier <dbl> 0.002777821, 0.069925591, 0.581785885, 0.211268880, …
$ posterior.normal <dbl> 0.9972222, 0.9300744, 0.4182141, 0.7887311, 0.212384…
$ LD1 <dbl> 3.15425344, 1.61342227, 0.24910989, 1.01936925, -0.2…
You see the original data, a column called class with the predicted hemophilia categories, two columns of posterior probabilities (one for each group, that add up to 1), and a column of LD1 scores.
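If you want to look at just the pieces the question asks about, something like this works (a sketch; the choice of columns and of showing only the first six rows is mine):
# a sketch: observed group, predicted group, posterior probabilities, LD1 score
hemophilia_pred %>%
  select(gr, class, starts_with("posterior"), LD1) %>%
  slice(1:6)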
- (2 points) Make a graph that will tell you whether your linear discriminant(s) distinguish women with hemophilia from women without. What do you conclude? Explain briefly.
The usual graph for this kind of thing is of the first two linear discriminants on a scatterplot, with the groups (levels of the categorical variable) distinguished by colour. Here, though, we only have one linear discriminant LD1 to go with the one categorical variable gr, so the right graph is a boxplot:
ggplot(hemophilia_pred, aes(x = gr, y = LD1)) + geom_boxplot()
The boxes overlap, but only a little, suggesting that the discriminant analysis does a reasonably good job of distinguishing the two groups of women. (Use your own adjective that summarizes how good you think the separation is.)
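An aside (my addition, a sketch): another way to display the same thing is a histogram of the LD1 scores for each group:
# LD1 scores separately for each group
ggplot(hemophilia_pred, aes(x = LD1)) + geom_histogram(bins = 10) +
  facet_grid(gr ~ .)
The same story should show up: the two groups mostly occupy different ranges of LD1, with a little overlap in the middle.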
Extra: The answer here should make sense in the light of your graph in Question 8. On my graph, the reds (carrier) are mostly top left and the blues (normal) are mostly bottom right, but there is some overlap across the middle with reds and blues mixed up, so the separation is not going to be perfect.
- (3 points) Make a table showing how many women are correctly and incorrectly classified. Does this make sense given your graph in the previous question? Explain briefly.
The standard way to do it is this:
with(hemophilia_pred, table(obs = gr, pred = class))
pred
obs carrier normal
carrier 38 7
normal 4 26
Alternatively, you can stay within the tidyverse, like this:
hemophilia_pred %>%
  count(gr, class)
although it would be better to make a cross-tabulation like table does:
hemophilia_pred %>%
  count(gr, class) %>%
  pivot_wider(names_from = class, values_from = n)
Most of the women are on the diagonal of the table (top-left to bottom-right), indicating that they were correctly classified, with only a few classified as being in the wrong group (7 carriers that were classified as normal, 4 normals that were classified as carriers). This seems to be consistent with the boxplot having some overlap but not much.
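Extra (my addition, a sketch): you can also turn that table into an overall misclassification rate:
# proportion of women put in the wrong group (11 out of 75 here)
hemophilia_pred %>%
  summarize(misclassified = mean(gr != class))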
- (3 points) Find a woman in the dataset that was difficult for the discriminant analysis to classify. (Hint: posterior probabilities.) Where on your graph of Question 8 does that woman fall? Explain briefly why that makes sense.
Being difficult to classify means that the posterior probabilities are almost the same, and since there are only two groups, that means that they are both close to 0.5. You can scroll through what I called hemophilia_pred and find one, or you can be lazy like me and let R do the heavy lifting:
hemophilia_pred %>% filter(between(posterior.carrier, 0.4, 0.6))
Woman number 20 has posterior probabilities of 0.47 and 0.53, which are close, so she is difficult to classify. This woman has values on the original variables close to \(-0.20\) and \(-0.05\), so is a little above average on both variables, but (the key point) in an area where there are both reds and blues, so it is difficult to guess whether that woman has hemophilia or not (is a carrier or is normal). This is very much reflected in the posterior probabilities.
It doesn’t matter who you pick as long as you are transparent in your process:
- say what the posterior probabilities are for the woman you chose (that are reasonably close to 50-50)
- pull out the AHFactivity and AHFantigen values for that woman
- find them on your graph
- say something about the woman being in the area where there are both reds and blues and it's hard to tell which one they belong to.
Conversely, if I had asked you to find a woman with very different posterior probabilities, that woman would have been either very clearly in the red zone or very clearly in the blue zone.
Extra: I was playing with this idea a bit:
hemophilia_pred %>%
  mutate(unclear = between(posterior.carrier, 0.4, 0.6)) -> hemophilia_pred
glimpse(hemophilia_pred)
Rows: 75
Columns: 8
$ AHFactivity <dbl> -0.0056, -0.1698, -0.3469, -0.0894, -0.1679, -0.0836…
$ AHFantigen <dbl> -0.1657, -0.1585, -0.1879, 0.0064, 0.0713, 0.0106, -…
$ gr <chr> "normal", "normal", "normal", "normal", "normal", "n…
$ class <fct> normal, normal, carrier, normal, carrier, normal, ca…
$ posterior.carrier <dbl> 0.002777821, 0.069925591, 0.581785885, 0.211268880, …
$ posterior.normal <dbl> 0.9972222, 0.9300744, 0.4182141, 0.7887311, 0.212384…
$ LD1 <dbl> 3.15425344, 1.61342227, 0.24910989, 1.01936925, -0.2…
$ unclear <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALS…
This creates a new column unclear, which is TRUE if the posterior probability shown (and therefore the other one as well) is between 0.4 and 0.6. Then:
ggplot(hemophilia_pred, aes(x = AHFactivity, y = AHFantigen, colour = gr,
shape = unclear, size = unclear)) + geom_point()
Warning: Using size for a discrete variable is not advised.
I repeated the plot of question 8 with those "unclear" points exaggerated with a large different plotting symbol. You can see whereabouts on the plot they are: along the diagonal where there are both reds and blues nearby, so that it is indeed not clear whether any of these women are normal or carrier.3
Footnotes
If you are in the social sciences, you might have run into software called SPSS, which is of a similar vintage and likewise was part of what you might call a “quantitative revolution” in the social sciences. I think it is point-and-click now, but in the old days, you wrote a program and submitted it as a whole, as with SAS.↩︎
Generous use of copy and paste there.↩︎
I am ignoring the warning because I (claim to) know what I’m doing.↩︎