Worksheet 9

Published

March 12, 2025

Packages

library(tidyverse)
library(MASS, exclude = "select")

Diabetes

According to the Mayo Clinic,

Diabetes mellitus refers to a group of diseases that affect how the body uses blood sugar (glucose). Glucose is an important source of energy for the cells that make up the muscles and tissues. It’s also the brain’s main source of fuel. The main cause of diabetes varies by type. But no matter what type of diabetes you have, it can lead to excess sugar in the blood. Too much sugar in the blood can lead to serious health problems.

The data in http://ritsokiguess.site/datafiles/diabetes1.csv are from 145 non-obese adult patients classified into three groups (types of diabetes): “normal”, “overt”, and “chemical”. For each patient, five other variables were also recorded:

  • rw: relative weight, the ratio of actual weight to ideal weight for the person’s height.
  • fpg: fasting plasma glucose
  • glucose: area under plasma glucose curve after 3-hour glucose tolerance test
  • insulin: area under plasma insulin curve after 3-hour glucose tolerance test
  • sspg: steady-state plasma glucose

These variables are recorded here as \(z\)-scores (they were originally measured on vastly different scales).

Our aim is to investigate any association between the five measured variables and the diabetes type (in group).

  1. Read in and display (some of) the data.

As ever:

my_url <- "http://ritsokiguess.site/datafiles/diabetes1.csv"
diabetes <- read_csv(my_url)
Rows: 145 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): group
dbl (5): rw, fpg, glucose, insulin, sspg

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
diabetes

145 rows, the variables promised, and values for those variables that do indeed look like \(z\)-scores.

Extra: discriminant analysis works best if all the quantitative variables are on about the same scale, so that the sizes of the “coefficients of linear discriminants” are roughly comparable (since we are going to look at them to decide what LD1, LD2, etc., contain). For the original data, this was very much not the case:

library(rrcov)
Loading required package: robustbase
Scalable Robust Estimators with High Breakdown Point (version 1.7-6)
data("diabetes")
diabetes

You can see that a one-unit change in rw is a big change relative to the kinds of values rw takes, but a one-unit change in glucose is a very small one. So I decided to re-express the quantitative variables as \(z\)-scores. The function is called scale, but if you feed it a column, it will give you back a matrix, for reasons that go back into the dusty early days of R. So I wrote a little function that takes as input a column, and returns the standardized version of that column to two decimal places. First I run scale, then I turn it from a matrix back into a vector, and then I round the vector to 2 decimals. The bottom line is a test:

my_scale <- function(x) {
  y <- scale(x)
  z <- as.vector(y)
  round(z, 2)
}
zz <- c(0, 2, 4, 6)
my_scale(zz)
[1] -1.16 -0.39  0.39  1.16

This seems to have worked. My zz has mean 3, and the values above the mean are as far above 3 as the ones below are below, so I should get two pairs of \(z\)-scores differing only in sign.

The idea is that the thing I’m doing on a bunch of columns is kind of messy, so I write a function to abstract away the messiness, and then my across (below) is much simpler.

Now that I have that set up, I run it on all the columns except for the categorical group. The mutate line below reads, in English, as “take all the columns except for group, and replace them by the standardized versions of themselves, to 2 decimals”:

diabetes %>% 
  mutate(across(-group, \(x) my_scale(x))) -> diabetes
diabetes

This is the data I saved for you.

  1. Using manova, demonstrate that the group has some kind of effect on the other variables.

Create a response matrix first, out of the quantitative columns. You have two options here, the base-R way that you will probably think of first:

response <- with(diabetes, cbind(rw, fpg, glucose, insulin))

Or the tidyverse-till-the-end way, where everything stays as a dataframe until the end:

diabetes %>% select(-group) %>% 
  as.matrix() -> response1

I’m not displaying either response matrix, because they have 145 rows (though I did for myself earlier to make sure they looked right).

Then run the MANOVA:

diabetes.1 <- manova(response ~ group, data = diabetes)
summary(diabetes.1)
           Df Pillai approx F num Df den Df    Pr(>F)    
group       2 1.2113   53.756      8    280 < 2.2e-16 ***
Residuals 142                                            
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The P-value is very small, so one or more of the groups has some effect on one or more of the quantitative variables. As you recall, at this point, we have no idea what kind of effect, or on what, but we do know that there is something to find. If this were ANOVA rather than MANOVA, we would at this point run Tukey to see what kind of differences there were.

  1. Run a discriminant analysis and display the results.

This part is not much work, having made sure you have MASS loaded:

diabetes.2 <- lda(group ~ rw + fpg + glucose + insulin + sspg, data = diabetes)
diabetes.2
Call:
lda(group ~ rw + fpg + glucose + insulin + sspg, data = diabetes)

Prior probabilities of groups:
 chemical    normal     overt 
0.2482759 0.5241379 0.2275862 

Group means:
                  rw        fpg    glucose    insulin       sspg
chemical  0.60694444 -0.3550000 -0.1569444  0.8419444  0.2336111
normal   -0.31000000 -0.4815789 -0.6110526 -0.1111842 -0.6622368
overt     0.05060606  1.4957576  1.5787879 -0.6627273  1.2700000

Coefficients of linear discriminants:
                LD1        LD2
rw       0.17691142  0.4883370
fpg     -2.15320564 -2.3530795
glucose  3.98629010  2.2550693
insulin -0.01269006  0.7451996
sspg     0.44658743 -0.1192412

Proportion of trace:
  LD1   LD2 
0.881 0.119 

In a discriminant analysis, the categorical variable (or combination of the categorical variables) goes on the left, and the quantitative variables go on the right, so that we are “predicting category”, even if logically the category is really influencing the quantitative variables rather than the other way around.

Extra 1: this worksheet is likely to get long, so I won’t ask it here, but I often ask whether the number of linear discriminants you got is what you were expecting. Here, we had 3 groups and 5 variables, so we should expect the smaller of \(3-1\) and 5, that is 2, LDs, and that’s what we got.

Extra 2: depending specifically on your R installation, you might get output with negative and positive signs switched around compared to mine. This may make your graphs (later) come out upside down or mirror-image compared to mine, and may have other effects, but as long as you appear to have done the right thing, you are good (that is, if the grader on an assignment cannot find any errors in your code, it is no problem if your output differs from mine in these kinds of ways.)

  1. Comment briefly on the relative importance of the linear discriminants.

Look at the “proportion of trace” at the bottom of the output. The value for LD1, 0.881, is much larger than the value for LD2, 0.119, so the first linear discriminant is much more important than the second, and does most, but not all, of the work in separating the three groups.

Your choice of adjective; LD2 is not quite worthless, but we will see on a graph later that most of the separation is happening along the LD1 dimension.

  1. Which two of the original quantitative variables play the largest role in LD1? What kind of values on those variables would make the LD1 score large (very positive)?

This is, for me, fpg negatively and glucose positively; thus, the LD1 score will be largest if glucose is large and/or fpg is small. (I am happy if you say “and” here, though things are actually a bit more complicated than that; see the Extra below.)

Extra: We will see later that this is characteristic of an individual with overt diabetes. Having said that, such individuals tend to have large glucose and large fpg. The rationalization of this is that the coefficient on LD1 for glucose is almost twice as big in size as the one for fpg, so that a large value of glucose “wins” over a large value of fpg. I drew some graphs for myself at the end to understand this better.

You might have the signs switched around, so the answer here would then be “small glucose and large fpg”.

  1. Obtain and save a dataframe containing the predicted group memberships, posterior probabilities, and discriminant scores for each individual, along with the original data. Display (some of) your dataframe.

This is obtained with the old-fashioned predict. predict only produces the predictions, so we need to glue this onto the original data:

p <- predict(diabetes.2)
d <- cbind(diabetes, p) 
d

Most of the columns I see here came from the original data, but class is the predicted group (from the discriminant analysis, based only on the values of the five quantitative variables). If you scroll right, you’ll see the posterior probabilities of each individual being in each group, and at the end the LD scores (on LD1 and LD2).

  1. Obtain a table counting the number of individuals who actually had each type of diabetes, cross-classified by the type of diabetes they were predicted to have. Does the classification appear to be good or bad? Explain briefly.

This is the standard way to do it:

with(d, table(obs = group, pred = class))
          pred
obs        chemical normal overt
  chemical       30      6     0
  normal          3     73     0
  overt           5      1    27

You also have a tidyverse option, with count:

d %>% count(group, class)

but this, I think, is best made to look like the previous output:

d %>% count(group, class) %>% 
  pivot_wider(names_from = class, values_from = n, values_fill = 0)

The pivoting-wider introduces some cells that didn’t appear in the original count output (which had only seven rows); the values_fill replaces those missings with the zero that they actually are. You won’t always need this; try it without first, and if you get some missings that should be zeros, put it in.

However you get the table, look at the values off the top-left to bottom-right diagonal; these are the ones that the classification got wrong. There are not many of these here, only \(6+3+5+1 = 15\) out of 145. So the classification appears to be good (which implies that the three types of diabetes can be distinguished well by the five quantitative variables, in particular by glucose and fpg that were the most important part of LD1).

If you have the table that comes directly out of count(group, class), you’ll need to say something about how you know the classification was a good one, for example words like “the rows on which the actual group and the predicted group (in class) are different are the ones for which the classification was wrong. The frequencies on these rows are small, so not many of the individuals were wrongly classified”.

  1. Find an individual that was misclassified (it doesn’t matter which one). For your chosen individual, was the misclassification clear-cut or a close thing? Explain briefly.

The true type of diabetes is in group and the predicted type is in class, so you can first find the individuals for which group and class are different:

d %>% filter(group != class)

Then, whichever individual you pick, scroll across and find their posterior probabilities of being in the true and predicted types. For example, the individual in row 26 was actually normal but was predicted to be chemical. This individual has a posterior probability 0.82 of being chemical, only 0.17 of being normal, and almost 0 of being overt. So this decision was actually quite clearly wrong.

Go through this procedure with your chosen individual. The answer you get doesn’t matter as long as it follows logically, and will depend on your chosen individual. The one in row 59, for example, is a close call (0.47 and 0.53), and the one in the row labelled 66 (on mine) is an even closer call. (The row numbers come from the original dataframe d before the filter was run on it.)

  1. Make a plot of LD1 and LD2 scores for each individual, distinguished by the group they belong to. There are too many points on this plot to label individually.

Just this:

ggplot(d, aes(x = x.LD1, y = x.LD2, colour = group)) + geom_point()

If you try to label the points (like I did on the peanuts example in class), the plot will be mostly an effort to find places to put labels, so don’t do that.

Look in your dataframe of data plus predictions (the one I called d) to see what the discriminant scores are called. It depends on precisely how you made your dataframe. On mine, they are called x.LD1 and x.LD2, but on yours they might be called LD1 and LD2. Take a look.

Extra: to understand why that happened for me, let’s look at the “structure” of my p:

str(p)
List of 3
 $ class    : Factor w/ 3 levels "chemical","normal",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ posterior: num [1:145, 1:3] 0.007919 0.000234 0.000362 0.0546 0.019599 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:145] "1" "2" "3" "4" ...
  .. ..$ : chr [1:3] "chemical" "normal" "overt"
 $ x        : num [1:145, 1:2] -1.7 -2.84 -2.63 -1.51 -1.86 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:145] "1" "2" "3" "4" ...
  .. ..$ : chr [1:2] "LD1" "LD2"

This is actually a list, with things inside other things. For example, the posterior probabilities of each group are all in one object called posterior, with three things called chemical, normal, and overt inside that. The cbind I did earlier squashes this list down into a dataframe, and the nested structure is lost: chemical inside posterior becomes a column called posterior.chemical, and LD1 within x becomes x.LD1. Check what I called d to see that this is exactly what happened.

  1. Which group is on the right on your plot? What does that say about this group’s values on the original quantitative variables?

On my graph, the overt group is on the right. These have high LD1 scores, and as we said earlier, these individuals have large glucose and/or small fpg. Your graph might have come out as the mirror image, in which case you might have the normal group on the right, and they have small glucose and (possibly) large fpg.

Extra: these questions spiral rapidly out of hand if I ask you to do everything. The one thing I didn’t ask you to do was to make a biplot, which is actually easier to make than the LD scores plot, once you have everything set up:

library(ggbiplot)
Loading required package: plyr
------------------------------------------------------------------------------
You have loaded plyr after dplyr - this is likely to cause problems.
If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
library(plyr); library(dplyr)
------------------------------------------------------------------------------

Attaching package: 'plyr'
The following objects are masked from 'package:dplyr':

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize
The following object is masked from 'package:purrr':

    compact
Loading required package: scales

Attaching package: 'scales'
The following object is masked from 'package:purrr':

    discard
The following object is masked from 'package:readr':

    col_factor
Loading required package: grid
library(conflicted)
conflicts_prefer(dplyr::mutate)
[conflicted] Will prefer dplyr::mutate over any other package.
conflicts_prefer(dplyr::count)
[conflicted] Will prefer dplyr::count over any other package.
conflicts_prefer(dplyr::filter)
[conflicted] Will prefer dplyr::filter over any other package.

The ggbiplot package uses a package called plyr that has a lot of functions with the same names as ones in dplyr (the database-type functions in the tidyverse). It is a good idea to load the conflicted package here and explicitly choose the dplyr functions when conflicted offers you a choice, using conflicts_prefer. These are the ones I needed; you might need others.

ggbiplot(diabetes.2, group = diabetes$group)

The value this has is that we see the original variables as well, so we get confirmation that the overt group is high on glucose and (apparently) low on fpg, and the other three variables don’t do much to distinguish the groups.

Only two of our original quantitative variables seem to have any difference between groups, so we are in the fortunate position of being able to plot those two against each other, with groups labelled by colour:

ggplot(diabetes, aes(x = glucose, y = fpg, colour = group)) + geom_point()

This is where I saw that glucose and fpg actually have a high positive correlation, so that a high value on one almost always goes with a high value on the other. This seems to contradict what the biplot and question 5 were saying, where a high LD1 score seems to go with high glucose and low fpg. I saw this last night, when I was very tired, and figured I’d better come back to it this morning with coffee in me and explain it properly for you. So I went back today and qualified my answer to question 5 (the “and/or”, and the Extra there), so that what is actually happening is that the high glucose is “winning” over the high fpg because its coefficient in LD1 is bigger in size.

You can see that glucose by itself does a pretty good job of distinguishing the groups, and you might be wondering why fpg is part of LD1 at all. But if you look carefully, you’ll see that having a value of fpg above zero identifies the overt group and distinguishes it from the chemical group individuals whose glucose is above 0. So it is worth looking at fpg as well as glucose.

In an analysis like this, if you can get it down to individual important quantitative variables, you can see how well they distinguish the groups by drawing boxplots:

ggplot(diabetes, aes(x = group, y = glucose)) + geom_boxplot()

and

ggplot(diabetes, aes(x = group, y = fpg)) + geom_boxplot()

and you can even make a little table that shows which variable does the better job of distinguishing each pair of groups:

Group 1 Group 2 better variable
chemical normal glucose
chemical overt fpg
normal overt both

and you see that both variables matter for distinguishing the groups. In this case, the use of glucose and fpg to distinguish the groups is simple enough that looking at the variables one at a time tells you how the groups are different. Under other circumstances it might have been that a combination of values of the two variables was what was important, and then the scatterplot of the two variables with coloured groups would have been the thing to look at.

This last piece of the analysis seems very simple, scatterplots and boxplots. But it took the discriminant analysis to tell us which variables to look at.