STAD29 Assignment 8

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example if a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

1 You have intelligent rats

Each of 12 students trained rats to run a maze, and the number of successful runs (out of 50 attempts) was recorded on each of 5 days. Some of the students were told that the rats they were training were “bright” (intelligent) and some of them were told that their rats were “dull” (not intelligent). In actual fact, all the rats were from the same source, and none of them were any more intelligent than the others. Did what the students were told about their rats make a difference to the number of mazes the rats successfully ran, and, if so, was the effect consistent over time? The same group of rats was trained by the same student throughout the experiment, so it makes sense to treat the data as repeated measures.

The data are in http://ritsokiguess.site/datafiles/intelligent-rats.csv. The columns of interest to us are:

  • Student: a numerical identifier for each student
  • Treatment: what the student was told about their rats
  • Day1 through Day5: the number of successful runs on each day.

There are some other variables that will not concern us here.

(a) (1 point) Read in and display (some of) the data.

Nothing unexpected here:

my_url <- "http://ritsokiguess.site/datafiles/intelligent-rats.csv"
maze <- read_csv(my_url)
Rows: 12 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): Treatment
dbl (11): Student, PriorExp, Block, Day1, Day2, Day3, Day4, Day5, Relax, Suc...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
maze

(b) (5 points) Set up and run an appropriate repeated-measures ANOVA. What do you conclude from it? (Hint: set up the response variable, and go through the steps required to run the Manova. Obtain the appropriate P-values, describing your process.)

There’s actually a fair bit to do here. The first step is to grab the five Day columns and make a response matrix out of them, either this way:

maze %>% select(starts_with("Day")) %>% 
  as.matrix() -> response
response
      Day1 Day2 Day3 Day4 Day5
 [1,]   10   13   12   12    9
 [2,]   18   20   21   23   30
 [3,]   16   14   20   20   20
 [4,]   11   20   27   17   16
 [5,]   27   24   26   28   27
 [6,]   16   14   14   13   15
 [7,]    8   10   21   10   20
 [8,]   13   20   18   20   22
 [9,]   24   23   27   29   27
[10,]   13   12   23   17   16
[11,]    9   15   24   12   11
[12,]   19   17   23   33   29

or this way:

response1 <- with(maze, cbind(Day1, Day2, Day3, Day4, Day5))
response1
      Day1 Day2 Day3 Day4 Day5
 [1,]   10   13   12   12    9
 [2,]   18   20   21   23   30
 [3,]   16   14   20   20   20
 [4,]   11   20   27   17   16
 [5,]   27   24   26   28   27
 [6,]   16   14   14   13   15
 [7,]    8   10   21   10   20
 [8,]   13   20   18   20   22
 [9,]   24   23   27   29   27
[10,]   13   12   23   17   16
[11,]    9   15   24   12   11
[12,]   19   17   23   33   29

The second way is more like what I did in class, but also more typing. The as.matrix turns a dataframe into a matrix, which is what the response needs to be in the end.
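If you want to convince yourself that the two ways really do agree, here is a toy check (made-up numbers, with base-R column subsetting standing in for select, so it needs nothing beyond base R):

```r
# Toy check that the two constructions give the same matrix
toy <- data.frame(Student = 1:2, Day1 = c(10, 18), Day2 = c(13, 20))
r1 <- as.matrix(toy[, c("Day1", "Day2")])   # dataframe-then-as.matrix route
r2 <- with(toy, cbind(Day1, Day2))          # with-and-cbind route
identical(unname(r1), unname(r2))
# [1] TRUE
```

(The unname strips the row and column names, which is the only respect in which the two matrices differ.)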

Next, run lm, set up the time-dependence stuff, run Manova (which lives in the car package, so make sure that is loaded), and look at the results. This is basically cookie-cutter code: take what appears in the lecture notes and copy and paste it, changing a few names:

maze.1 <- lm(response ~ Treatment, data = maze)
times <- colnames(response)
times.df <- data.frame(times = factor(times))
maze.2 <- Manova(maze.1, idata = times.df, idesign = ~ times)
summary(maze.2)
Warning in summary.Anova.mlm(maze.2): HF eps > 1 treated as 1

Type II Repeated Measures MANOVA Tests:

------------------------------------------
 
Term: (Intercept) 

 Response transformation matrix:
     (Intercept)
Day1           1
Day2           1
Day3           1
Day4           1
Day5           1

Sum of squares and products for the hypothesis:
            (Intercept)
(Intercept)    104160.3

Multivariate Tests: (Intercept)
                 Df test stat approx F num Df den Df     Pr(>F)    
Pillai            1   0.97802 444.8761      1     10 1.2754e-09 ***
Wilks             1   0.02198 444.8761      1     10 1.2754e-09 ***
Hotelling-Lawley  1  44.48761 444.8761      1     10 1.2754e-09 ***
Roy               1  44.48761 444.8761      1     10 1.2754e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------
 
Term: Treatment 

 Response transformation matrix:
     (Intercept)
Day1           1
Day2           1
Day3           1
Day4           1
Day5           1

Sum of squares and products for the hypothesis:
            (Intercept)
(Intercept)    4720.333

Multivariate Tests: Treatment
                 Df test stat approx F num Df den Df    Pr(>F)   
Pillai            1 0.6684447 20.16088      1     10 0.0011608 **
Wilks             1 0.3315553 20.16088      1     10 0.0011608 **
Hotelling-Lawley  1 2.0160877 20.16088      1     10 0.0011608 **
Roy               1 2.0160877 20.16088      1     10 0.0011608 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------
 
Term: times 

 Response transformation matrix:
     times1 times2 times3 times4
Day1      1      0      0      0
Day2      0      1      0      0
Day3      0      0      1      0
Day4      0      0      0      1
Day5     -1     -1     -1     -1

Sum of squares and products for the hypothesis:
          times1    times2     times3    times4
times1 280.33333 193.33333 -67.666667 38.666667
times2 193.33333 133.33333 -46.666667 26.666667
times3 -67.66667 -46.66667  16.333333 -9.333333
times4  38.66667  26.66667  -9.333333  5.333333

Multivariate Tests: times
                 Df test stat approx F num Df den Df   Pr(>F)  
Pillai            1 0.6827637 3.766392      4      7 0.060953 .
Wilks             1 0.3172363 3.766392      4      7 0.060953 .
Hotelling-Lawley  1 2.1522241 3.766392      4      7 0.060953 .
Roy               1 2.1522241 3.766392      4      7 0.060953 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------
 
Term: Treatment:times 

 Response transformation matrix:
     times1 times2 times3 times4
Day1      1      0      0      0
Day2      0      1      0      0
Day3      0      0      1      0
Day4      0      0      0      1
Day5     -1     -1     -1     -1

Sum of squares and products for the hypothesis:
       times1    times2 times3     times4
times1     27  51.00000     81  -6.000000
times2     51  96.33333    153 -11.333333
times3     81 153.00000    243 -18.000000
times4     -6 -11.33333    -18   1.333333

Multivariate Tests: Treatment:times
                 Df test stat approx F num Df den Df   Pr(>F)  
Pillai            1 0.6538624 3.305793      4      7 0.080236 .
Wilks             1 0.3461376 3.305793      4      7 0.080236 .
Hotelling-Lawley  1 1.8890244 3.305793      4      7 0.080236 .
Roy               1 1.8890244 3.305793      4      7 0.080236 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Univariate Type II Repeated-Measures ANOVA Assuming Sphericity

                 Sum Sq num Df Error SS den Df  F value    Pr(>F)    
(Intercept)     20832.1      1   468.27     10 444.8761 1.275e-09 ***
Treatment         944.1      1   468.27     10  20.1609 0.0011608 ** 
times             294.3      4   411.07     40   7.1586 0.0001909 ***
Treatment:times   194.3      4   411.07     40   4.7259 0.0032260 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Mauchly Tests for Sphericity

                Test statistic p-value
times                  0.69814 0.96424
Treatment:times        0.69814 0.96424


Greenhouse-Geisser and Huynh-Feldt Corrections
 for Departure from Sphericity

                 GG eps Pr(>F[GG])    
times           0.86937  0.0004321 ***
Treatment:times 0.86937  0.0052136 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                  HF eps   Pr(>F[HF])
times           1.389507 0.0001909378
Treatment:times 1.389507 0.0032260394

Finally, go down to near the bottom and check the sphericity tests. Sphericity is by no means rejected (P-value 0.964), so look at the “Univariate Type II Repeated-Measures ANOVA Assuming Sphericity” just above that. The interaction is significant, with a P-value of 0.0032, so that is our finding. In the context of the data, that means that there is an effect of treatment on the number of successful runs, but that effect depends on which day you are looking at. (This is as far as we can go for now.)
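Extra: the Greenhouse-Geisser correction itself is nothing mysterious: it multiplies both degrees of freedom of the F-test by the estimated epsilon and recomputes the P-value from the same F statistic. A quick sketch, using numbers copied from the output above (so any small discrepancy is only rounding):

```r
# Greenhouse-Geisser correction by hand: shrink both df by epsilon,
# keep the same F statistic. Numbers copied from the Manova output above.
F_int <- 4.7259   # F for Treatment:times, assuming sphericity
eps <- 0.86937    # Greenhouse-Geisser epsilon
pf(F_int, eps * 4, eps * 40, lower.tail = FALSE)
```

This should reproduce (to rounding) the 0.0052 shown for the interaction under Pr(>F[GG]).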

(c) (3 points) Draw a suitable interaction plot (for treatment and time). How does this clarify your conclusions of the previous part? (Hint: you’ll need longer data.)

The hint says that the first thing we need to do is to make “long” data:

maze %>% 
  pivot_longer(starts_with("Day"), names_to = "Day", values_to = "runs") -> maze_long
maze_long

Then work out the mean number of runs for each combination of Treatment and Day:

maze_long %>% 
  group_by(Treatment, Day) %>% 
  summarize(mean_runs = mean(runs)) -> means 
`summarise()` has grouped output by 'Treatment'. You can override using the
`.groups` argument.
means

And then make the interaction plot, bearing in mind that you need a colour and a group and they need to be the same:

ggplot(means, aes(x = Day, y = mean_runs, colour = Treatment, group = Treatment)) + geom_point() + geom_line()

(You don’t need to display the intervening dataframes, though you probably should for yourself while you’re working on this.)

When the students were told that their rats were “bright”, the number of successful runs on average increased over time. But for the students told that their rats were “dull”, the mean number of successful runs increased up to day 3, peaked there, and decreased afterwards. You might also say that the traces over time are not parallel, and talk about how they are not.

This difference in pattern over time was the reason for the significant interaction.

Extra: I also drew a spaghetti plot, which uses the same long data that we just made, but with a different group aesthetic: each Student gets their own trace, still coloured by Treatment:

maze_long
ggplot(maze_long, aes(x = Day, y = runs, colour = Treatment, group = Student)) + geom_point() + geom_line()

The pattern is not so clear here (which is why I had you draw the interaction plot instead), but if you look carefully, you can see that several of the students with the supposedly “dull” rats had the number of successful runs peak at day 3, while the students with “bright” rats saw the number of runs on a more or less upward trajectory. There is some variability here, but the different patterns over time were consistent enough to make the interaction strongly significant.

The red traces are mostly above the blue ones (except perhaps at day 3), which is probably the reason there is a significant treatment main effect, but we would do better to look at simple effects over time to further understand the significant interaction (the treatment effect will probably not be consistent over time, as a consequence of the significant interaction).

I think I airily talked about how you can use simple effects to understand an interaction in this kind of analysis, but never actually showed you an example. There are two ways you might go:

  • fix time and look at the effect of treatment at each time
  • fix treatment and look at the effect of time within that treatment

In this example, and perhaps more generally, the first option makes more sense to me: we care more about the treatment effect, while the time is a sort of “block” within which to observe the treatment effect. The first option is also easier to do, because by focusing on only one time point at a time, there is no time dependence left and each student gives you only one observation at that time (the number of runs their rats made at that time). Hence, the analysis at each time point is only an ordinary aov.¹

The simple effects analyses are based on the “long” data that we made for the interaction plot. For each day, grab the observations for that day, then assess the effect of treatment on the number of runs. There are only two treatments here, so aov² will do it:

maze_long %>% 
  filter(Day == "Day1") %>% 
  aov(runs ~ Treatment, data = .) %>% 
  summary()
            Df Sum Sq Mean Sq F value  Pr(>F)   
Treatment    1  208.3  208.33   11.81 0.00636 **
Residuals   10  176.3   17.63                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There is a treatment effect (P-value 0.006) on Day 1, which we know (from the interaction plot) is because the supposedly bright rats performed better than the dull ones. We would expect something similar to happen on Day 2. Copy and paste:

maze_long %>% 
  filter(Day == "Day2") %>% 
  aov(runs ~ Treatment, data = .) %>% 
  summary()
            Df Sum Sq Mean Sq F value Pr(>F)  
Treatment    1  96.33   96.33   7.565 0.0205 *
Residuals   10 127.33   12.73                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Still significant (P-value 0.020), though not as strongly so. The bright rats (so-called) again did better than the dull ones.

Day 3 is where the significance of the treatment effect might go away, since this is where the dull rats peaked. Paste again (it is probably still on your clipboard):

maze_long %>% 
  filter(Day == "Day3") %>% 
  aov(runs ~ Treatment, data = .) %>% 
  summary()
            Df Sum Sq Mean Sq F value Pr(>F)
Treatment    1  16.33   16.33   0.691  0.425
Residuals   10 236.33   23.63               

The P-value is 0.425, which is nowhere near significance, so there is no treatment effect on Day 3.

After that, you would expect bright to be better again, since the dull rats peaked and the bright ones continued to improve after Day 3:

maze_long %>% 
  filter(Day == "Day4") %>% 
  aov(runs ~ Treatment, data = .) %>% 
  summary()
            Df Sum Sq Mean Sq F value   Pr(>F)    
Treatment    1    432   432.0   23.61 0.000663 ***
Residuals   10    183    18.3                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
maze_long %>% 
  filter(Day == "Day5") %>% 
  aov(runs ~ Treatment, data = .) %>% 
  summary()
            Df Sum Sq Mean Sq F value   Pr(>F)    
Treatment    1  385.3   385.3   24.65 0.000566 ***
Residuals   10  156.3    15.6                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

And so it proves, with P-values 0.00066 and 0.00057. Thus, there is a significant treatment effect on all the days except Day 3. This (small) inconsistency of pattern is what drives the significant interaction; if the interaction had not been significant, there would have been a consistent treatment effect over all five days.³

2 Diabetes

According to the Mayo Clinic,

Diabetes mellitus refers to a group of diseases that affect how the body uses blood sugar (glucose). Glucose is an important source of energy for the cells that make up the muscles and tissues. It’s also the brain’s main source of fuel. The main cause of diabetes varies by type. But no matter what type of diabetes you have, it can lead to excess sugar in the blood. Too much sugar in the blood can lead to serious health problems.

The data in http://ritsokiguess.site/datafiles/diabetes1.csv are from 145 non-obese adult patients classified into three groups (types of diabetes): “normal”, “overt”, and “chemical”. For each patient, five other variables were also recorded:

  • rw: relative weight, the ratio of actual weight to ideal weight for the person’s height.
  • fpg: fasting plasma glucose
  • glucose: area under plasma glucose curve after 3-hour glucose tolerance test
  • insulin: area under plasma insulin curve after 3-hour glucose tolerance test
  • sspg: steady-state plasma glucose

These variables are recorded here as \(z\)-scores (they were originally measured on vastly different scales).

Our aim is to investigate any association between the five measured variables and the diabetes type (in group).

(a) (1 point) Read in and display (some of) the data.

As ever:

my_url <- "http://ritsokiguess.site/datafiles/diabetes1.csv"
diabetes <- read_csv(my_url)
Rows: 145 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): group
dbl (5): rw, fpg, glucose, insulin, sspg

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
diabetes

145 rows, the variables promised, and values for those variables that do indeed look like \(z\)-scores.

Extra: discriminant analysis works best if all the quantitative variables are on about the same scale, so that the sizes of the “coefficients of linear discriminants” are roughly comparable (since we are going to look at them to decide what LD1, LD2, etc., contain). For the original data, this was very much not the case:

library(rrcov)
Loading required package: robustbase

Attaching package: 'robustbase'
The following object is masked from 'package:survival':

    heart
The following object is masked from 'package:lmerTest':

    carrots
Scalable Robust Estimators with High Breakdown Point (version 1.7-4)
data("diabetes")
diabetes

You can see that a one-unit change in rw is a big change relative to the kinds of values rw takes, but a one-unit change in glucose is a very small one. So I decided to re-express the quantitative variables as \(z\)-scores. The function is called scale, but if you feed it a column, it will give you back a matrix, for reasons that go back into the dusty early days of R. So I wrote a little function that takes as input a column, and returns the standardized version of that column to two decimal places. First I run scale, then I turn it from a matrix back into a vector, and then I round the vector to 2 decimals. The bottom line is a test:

my_scale <- function(x) {
  y <- scale(x)
  z <- as.vector(y)
  round(z, 2)
}
zz <- c(0, 2, 4, 6)
my_scale(zz)
[1] -1.16 -0.39  0.39  1.16

This seems to have worked. My zz has mean 3, and the values above the mean are as far above 3 as the ones below are below, so I should get two pairs of \(z\)-scores differing only in sign.
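If you want to double-check that result by hand: a \(z\)-score is just the value minus the mean, divided by the standard deviation, so the test case can be verified directly from the definition:

```r
# Verify my_scale's test case directly from the definition of a z-score
x <- c(0, 2, 4, 6)
z <- (x - mean(x)) / sd(x)
round(z, 2)
# [1] -1.16 -0.39  0.39  1.16
```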

The idea is that the thing I’m doing on a bunch of columns is kind of messy, so I write a function to abstract away the messiness, and then my across (below) is much simpler.

Now that I have that set up, I run it on all the columns except for the categorical group. The mutate line below reads, in English, as “take all the columns except for group, and replace them by the standardized versions of themselves, to 2 decimals”:

diabetes %>% 
  mutate(across(-group, \(x) my_scale(x))) -> diabetes
diabetes

This is the data I saved for you.

(b) (3 points) Using manova, demonstrate that the group has some kind of effect on the other variables.

Create a response matrix first, out of the quantitative columns. You have two options here, the base-R way that you will probably think of first:

response <- with(diabetes, cbind(rw, fpg, glucose, insulin, sspg))

Or the tidyverse-till-the-end way, where everything stays as a dataframe until the end:

diabetes %>% select(-group) %>% 
  as.matrix() -> response1

I’m not displaying either response matrix, because they have 145 rows (though I did for myself earlier to make sure they looked right).

Then run the MANOVA:

diabetes.1 <- manova(response ~ group, data = diabetes)
summary(diabetes.1)
           Df Pillai approx F num Df den Df    Pr(>F)    
group       2 1.2113   53.756      8    280 < 2.2e-16 ***
Residuals 142                                            
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The P-value is very small, so one or more of the groups has some effect on one or more of the quantitative variables. As you recall, at this point, we have no idea what kind of effect, or on what, but we do know that there is something to find. If this were ANOVA rather than MANOVA, we would at this point run Tukey to see what kind of differences there were.

(c) (2 points) Run a discriminant analysis and display the results.

Two points suggests not very much work, and the not very much work is this, making sure you have MASS loaded:

diabetes.2 <- lda(group ~ rw + fpg + glucose + insulin + sspg, data = diabetes)
diabetes.2
Call:
lda(group ~ rw + fpg + glucose + insulin + sspg, data = diabetes)

Prior probabilities of groups:
 chemical    normal     overt 
0.2482759 0.5241379 0.2275862 

Group means:
                  rw        fpg    glucose    insulin       sspg
chemical  0.60694444 -0.3550000 -0.1569444  0.8419444  0.2336111
normal   -0.31000000 -0.4815789 -0.6110526 -0.1111842 -0.6622368
overt     0.05060606  1.4957576  1.5787879 -0.6627273  1.2700000

Coefficients of linear discriminants:
                LD1        LD2
rw       0.17691142  0.4883370
fpg     -2.15320564 -2.3530795
glucose  3.98629010  2.2550693
insulin -0.01269006  0.7451996
sspg     0.44658743 -0.1192412

Proportion of trace:
  LD1   LD2 
0.881 0.119 

In a discriminant analysis, the categorical variable (or combination of the categorical variables) goes on the left, and the quantitative variables go on the right, so that we are “predicting category”, even if logically the category is really influencing the quantitative variables rather than the other way around.

Extra 1: this assignment is likely to get long, so I won’t ask it here, but I often ask whether the number of linear discriminants you got is what you were expecting. Here, we had 3 groups and 5 variables, so we should expect the smaller of \(3-1\) and 5, that is 2, LDs, and that’s what we got.

Extra 2: depending specifically on your R installation, you might get output with negative and positive signs switched around compared to mine. This may make your graphs (later) come out upside down or mirror-image compared to mine, and may have other effects, but as long as you appear to have done the right thing, you are good (that is, if the grader cannot find any errors in your code, it is no problem if your output differs from mine in these kinds of ways.)

(d) (2 points) Comment briefly on the relative importance of the linear discriminants.

Look at the “proportion of trace” at the bottom of the output. The value for LD1, 0.881, is much larger than the value for LD2, 0.119, so the first linear discriminant is much more important than the second, and does most of the work in separating the three groups.

Your choice of adjective; LD2 is not quite worthless, but we will see on a graph later that most of the separation is happening along the LD1 dimension.
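Extra: if you are curious where “proportion of trace” comes from, it is each squared singular value from the lda fit, divided by their total. A quick sketch on the built-in iris data (three groups, four quantitative variables), so that it runs without our data file; it assumes only that MASS is available:

```r
library(MASS)
# "Proportion of trace" is each squared singular value as a fraction
# of their sum; illustrated on the built-in iris data
iris.lda <- lda(Species ~ ., data = iris)
round(iris.lda$svd^2 / sum(iris.lda$svd^2), 4)
# [1] 0.9912 0.0088
```

These match the “Proportion of trace” line that printing iris.lda would give you.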

(e) (2 points) Which two of the original quantitative variables play the largest role in LD1? What kind of values on those variables would make the LD1 score large (very positive)?

This is, for me, fpg negatively and glucose positively; thus, the LD1 score will be largest if glucose is large and/or fpg is small. (I am happy if you say “and” here, though things are actually a bit more complicated than that; see the Extra below.)

Extra: We will see later that this is characteristic of an individual with overt diabetes. Having said that, such individuals tend to have large glucose and large fpg. The rationalization of this is that the coefficient on LD1 for glucose is almost twice as big in size as the one for fpg, so that a large value of glucose “wins” over a large value of fpg. I drew some graphs for myself at the end to understand this better.

You might have the signs switched around, so the answer here would then be “small glucose and large fpg”.
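To see the “winning” arithmetic concretely, take the two LD1 coefficients from the output above. If glucose and fpg both go up together by one (standardized) unit, as their high correlation makes likely, the net effect on the LD1 score is still positive:

```r
# LD1 coefficients for fpg and glucose, copied from the lda output above
coef_fpg <- -2.15320564
coef_glucose <- 3.98629010
# net change in LD1 when glucose and fpg both rise by one unit
coef_glucose + coef_fpg
# [1] 1.833084
```

So an individual high on both variables still ends up with a large LD1 score, which is what the graphs at the end of this question confirm.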

(f) (2 points) Obtain and save a dataframe containing the predicted group memberships, posterior probabilities, and discriminant scores for each individual, along with the original data. Display (some of) your dataframe.

This is obtained with the old-fashioned predict. predict only produces the predictions, so we need to glue this onto the original data:

p <- predict(diabetes.2)
d <- cbind(diabetes, p) 
d

Most of the columns I see here came from the original data, but class is the predicted group (from the discriminant analysis, based only on the values of the five quantitative variables). If you scroll right, you’ll see the posterior probabilities of each individual being in each group, and at the end the LD scores (on LD1 and LD2).

(g) (3 points) Obtain a table counting the number of individuals who actually had each type of diabetes, cross-classified by the type of diabetes they were predicted to have. Does the classification appear to be good or bad? Explain briefly.

This is the standard way to do it:

with(d, table(obs = group, pred = class))
          pred
obs        chemical normal overt
  chemical       30      6     0
  normal          3     73     0
  overt           5      1    27

You also have a tidyverse option, with count:

d %>% count(group, class)

but this, I think, is best made to look like the previous output:

d %>% count(group, class) %>% 
  pivot_wider(names_from = class, values_from = n, values_fill = 0)

The pivoting-wider introduces some cells that didn’t appear in the original count output (which had only seven rows, because two of the nine group-class combinations never happened); the values_fill replaces those missing cells with the zeros that they actually are.

However you get the table, look at the values off the top-left to bottom-right diagonal; these are the ones that the classification got wrong. There are not many of these here, only \(6+3+5+1 = 15\) out of 145. So the classification appears to be good (which implies that the three types of diabetes can be distinguished well by the five quantitative variables, in particular by glucose and fpg that were the most important part of LD1).

If you have the table that comes directly out of count(group, class), you’ll need to say something about how you know the classification was a good one, for example words like “the rows on which the actual group and the predicted group (in class) are different are the ones for which the classification was wrong. The frequencies on these rows are small, so not many of the individuals were wrongly classified”.
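Extra: if you want the misclassification rate as an actual number, you can get it straight from the table. A sketch with the counts typed in by hand from the table above:

```r
# Confusion matrix copied from the table above
# (rows = observed group, columns = predicted group)
tab <- matrix(c(30,  6,  0,
                 3, 73,  0,
                 5,  1, 27),
              nrow = 3, byrow = TRUE,
              dimnames = list(obs  = c("chemical", "normal", "overt"),
                              pred = c("chemical", "normal", "overt")))
wrong <- sum(tab) - sum(diag(tab))   # off-diagonal total: 15
wrong / sum(tab)                     # 15 out of 145, about 10% wrong
```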

(h) (3 points) Find an individual that was misclassified (it doesn’t matter which one). For your chosen individual, was the misclassification clear-cut or a close thing? Explain briefly.

The true type of diabetes is in group and the predicted type is in class, so you can first find the individuals for which group and class are different:

d %>% filter(group != class)

Then, whichever individual you pick, scroll across and find their posterior probabilities of being in the true and predicted types. For example, the individual in row 26 was actually normal but was predicted to be chemical. This individual has a posterior probability 0.82 of being chemical, only 0.17 of being normal, and almost 0 of being overt. So this decision was actually quite clearly wrong.

Go through this procedure with your chosen individual. The answer you get doesn’t matter as long as it follows logically, and will depend on your chosen individual. The one in row 59, for example, is a close call (0.47 and 0.53), and the one in the row labelled 66 (on mine) is an even closer call. (The row numbers come from the original dataframe d before the filter was run on it.)

(i) (2 points) Make a plot of LD1 and LD2 scores for each individual, distinguished by the group they belong to. There are too many points on this plot to label individually.

Just this:

ggplot(d, aes(x = x.LD1, y = x.LD2, colour = group)) + geom_point()

If you try to label the points (like I did on the peanuts example in class), the plot will be mostly an effort to find places to put labels, so don’t do that.

Look in your dataframe of data plus predictions (the one I called d) to see what the discriminant scores are called. It depends on precisely how you made your dataframe. On mine, they are called x.LD1 and x.LD2, but on yours they might be called LD1 and LD2. Take a look.

Extra: to understand why that happened for me, let’s look at the “structure” of my p:

str(p)
List of 3
 $ class    : Factor w/ 3 levels "chemical","normal",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ posterior: num [1:145, 1:3] 0.007919 0.000234 0.000362 0.0546 0.019599 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:145] "1" "2" "3" "4" ...
  .. ..$ : chr [1:3] "chemical" "normal" "overt"
 $ x        : num [1:145, 1:2] -1.7 -2.84 -2.63 -1.51 -1.86 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:145] "1" "2" "3" "4" ...
  .. ..$ : chr [1:2] "LD1" "LD2"

This is actually a list, with things inside other things. For example, the posterior probabilities of each group are all in one object called posterior, with three things called chemical, normal, and overt inside that. The cbind I did earlier squashes this list down into a dataframe, and the nested structure is lost: chemical inside posterior becomes a column called posterior.chemical, and LD1 within x becomes x.LD1. Check what I called d to see that this is exactly what happened.
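You can watch the same flattening happen in miniature on a toy list (made-up names, not the real data):

```r
# A predict()-style list: a matrix named x with columns LD1 and LD2.
# cbind-ing it onto a dataframe flattens it into dotted column names.
df <- data.frame(a = 1:2)
p_toy <- list(x = matrix(1:4, nrow = 2,
                         dimnames = list(NULL, c("LD1", "LD2"))))
names(cbind(df, p_toy))
```

The names come out as a, x.LD1, and x.LD2, matching the pattern we saw in d.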

(j) (2 points) Which group is on the right on your plot? What does that say about this group’s values on the original quantitative variables?

On my graph, the overt group is on the right. These have high LD1 scores, and as we said back in (e), these individuals have large glucose and/or small fpg. Your graph might have come out as the mirror image, in which case you might have the normal group on the right, and they have small glucose and (possibly) large fpg.

Extra: these questions spiral rapidly out of hand if I ask you to do everything. The one thing I didn’t ask you to do was to make a biplot, which is actually easier to make than the LD scores plot:

ggbiplot(diabetes.2, group = diabetes$group)

The value this has is that we see the original variables as well, so we get confirmation that the overt group is high on glucose and (apparently) low on fpg, and the other three variables don’t do much to distinguish the groups.

Only two of our original quantitative variables seem to have any difference between groups, so we are in the fortunate position of being able to plot those two against each other, with groups labelled by colour:

ggplot(diabetes, aes(x = glucose, y = fpg, colour = group)) + geom_point()

This is where I saw that glucose and fpg actually have a high positive correlation, so that a high value on one almost always goes with a high value on the other. This seems to contradict what the biplot and the answer to (e) were saying, where a high LD1 score seems to go with high glucose and low fpg. I saw this last night, when I was very tired, and figured I’d better come back to it this morning with coffee in me and explain it properly for you. So I went back today and qualified my answer to (e) (the “and/or”, and the Extra there), so that what is actually happening is that the high glucose is “winning” over the high fpg because its coefficient in LD1 is bigger in size.

I have another graph to help make that point. I wanted to plot glucose, fpg and LD1 score, but these are all quantitative, and I don’t have three dimensions. So I tried making LD1 score as colour. This is a cheat because normally colour is something categorical, but it does actually work:

ggplot(d, aes(x = glucose, y = fpg, colour = x.LD1)) + geom_point()

When you feed colour something quantitative, ggplot uses different shades of the same colour (by default blue). According to the legend, a darker blue goes with a smaller LD1 score, and a lighter blue with a larger one. This shows that the larger LD1 scores (top right) go with a larger value of glucose and a larger value of fpg, despite what we said earlier. This illustrates that the variable with the larger coefficient in LD1 “wins” in deciding whether a larger or smaller value goes with a larger LD1 score.

Looking back at the last-but-one graph (with the groups on it), you can see that glucose by itself does a pretty good job of distinguishing the groups, and you might be wondering why fpg is part of LD1 at all. But if you look carefully, you’ll see that having a value of fpg above zero identifies the overt group and distinguishes it from the chemical group individuals whose glucose is above 0. So it is worth looking at fpg as well as glucose.

Footnotes

  1. The second option is still repeated measures because you are still looking at all five time points.↩︎

  2. Or even a two-sample \(t\)-test.↩︎

  3. The significant treatment and time effects that showed up in part (b), which we are not really supposed to interpret, say that on average there is a treatment effect over all days (that is, when you average up over all five days, the bright rats do better), and there is an effect over time on average regardless of treatment (probably driven by rats on both treatments improving over the first three days, ignoring what happened after that).↩︎