Worksheet 9
Packages
library(tidyverse)
library(MASS, exclude = "select")
Diabetes
According to the Mayo Clinic,
Diabetes mellitus refers to a group of diseases that affect how the body uses blood sugar (glucose). Glucose is an important source of energy for the cells that make up the muscles and tissues. It’s also the brain’s main source of fuel. The main cause of diabetes varies by type. But no matter what type of diabetes you have, it can lead to excess sugar in the blood. Too much sugar in the blood can lead to serious health problems.
The data in http://ritsokiguess.site/datafiles/diabetes1.csv are from 145 non-obese adult patients classified into three groups (types of diabetes): “normal”, “overt”, and “chemical”. For each patient, five other variables were also recorded:
rw: relative weight, the ratio of actual weight to ideal weight for the person’s height
fpg: fasting plasma glucose
glucose: area under plasma glucose curve after 3-hour glucose tolerance test
insulin: area under plasma insulin curve after 3-hour glucose tolerance test
sspg: steady-state plasma glucose
These variables are recorded here as \(z\)-scores (they were originally measured on vastly different scales).
Our aim is to investigate any association between the five measured variables and the diabetes type (in group).
- Read in and display (some of) the data.
As ever:
<- "http://ritsokiguess.site/datafiles/diabetes1.csv"
my_url <- read_csv(my_url) diabetes
Rows: 145 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): group
dbl (5): rw, fpg, glucose, insulin, sspg
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
diabetes
145 rows, the variables promised, and values for those variables that do indeed look like \(z\)-scores.
Extra: discriminant analysis works best if all the quantitative variables are on about the same scale, so that the sizes of the “coefficients of linear discriminants” are roughly comparable (since we are going to look at them to decide what LD1, LD2, etc., contain). For the original data, this was very much not the case:
library(rrcov)
Loading required package: robustbase
Scalable Robust Estimators with High Breakdown Point (version 1.7-6)
data("diabetes")
diabetes
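If you want the scale differences as numbers rather than eyeballing the display, a quick sketch (my addition, not part of the original worksheet, using the rrcov version of diabetes just loaded):
# standard deviations of the quantitative columns on their original scales
diabetes %>% summarise(across(where(is.numeric), sd))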
You can see that a one-unit change in rw is a big change relative to the kinds of values rw takes, but a one-unit change in glucose is a very small one. So I decided to re-express the quantitative variables as \(z\)-scores. The function is called scale, but if you feed it a column, it will give you back a matrix, for reasons that go back into the dusty early days of R. So I wrote a little function that takes as input a column, and returns the standardized version of that column to two decimal places. First I run scale, then I turn it from a matrix back into a vector, and then I round the vector to 2 decimals. The bottom line is a test:
my_scale <- function(x) {
  y <- scale(x)
  z <- as.vector(y)
  round(z, 2)
}
zz <- c(0, 2, 4, 6)
my_scale(zz)
[1] -1.16 -0.39 0.39 1.16
This seems to have worked. My zz has mean 3, and the values above the mean are as far above 3 as the ones below are below, so I should get two pairs of \(z\)-scores differing only in sign.
The idea is that the thing I’m doing on a bunch of columns is kind of messy, so I write a function to abstract away the messiness, and then my across (below) is much simpler.
Now that I have that set up, I run it on all the columns except for the categorical group. The mutate line below reads, in English, as “take all the columns except for group, and replace them by the standardized versions of themselves, to 2 decimals”:
diabetes %>%
  mutate(across(-group, \(x) my_scale(x))) -> diabetes
diabetes
This is the data I saved for you.
- Using manova, demonstrate that the group has some kind of effect on the other variables.
Create a response matrix first, out of the quantitative columns. You have two options here, the base-R way that you will probably think of first:
response <- with(diabetes, cbind(rw, fpg, glucose, insulin))
Or the tidyverse-till-the-end way, where everything stays as a dataframe until the end:
diabetes %>% select(-group) %>%
  as.matrix() -> response1
I’m not displaying either response matrix, because they have 145 rows (though I did for myself earlier to make sure they looked right).
Then run the MANOVA:
diabetes.1 <- manova(response ~ group, data = diabetes)
summary(diabetes.1)
Df Pillai approx F num Df den Df Pr(>F)
group 2 1.2113 53.756 8 280 < 2.2e-16 ***
Residuals 142
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The P-value is very small, so one or more of the groups has some effect on one or more of the quantitative variables. As you recall, at this point, we have no idea what kind of effect, or on what, but we do know that there is something to find. If this were ANOVA rather than MANOVA, we would at this point run Tukey to see what kind of differences there were.
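As a sketch of a first look (not something this worksheet asks for), base R will split a MANOVA fit into one univariate ANOVA per response with summary.aov, though this ignores the correlations between the responses:
# sketch: one ANOVA table per response column, from the same manova fit
summary.aov(diabetes.1)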
- Run a discriminant analysis and display the results.
This part is not much work, having made sure you have MASS loaded:
diabetes.2 <- lda(group ~ rw + fpg + glucose + insulin + sspg, data = diabetes)
diabetes.2
Call:
lda(group ~ rw + fpg + glucose + insulin + sspg, data = diabetes)
Prior probabilities of groups:
chemical normal overt
0.2482759 0.5241379 0.2275862
Group means:
rw fpg glucose insulin sspg
chemical 0.60694444 -0.3550000 -0.1569444 0.8419444 0.2336111
normal -0.31000000 -0.4815789 -0.6110526 -0.1111842 -0.6622368
overt 0.05060606 1.4957576 1.5787879 -0.6627273 1.2700000
Coefficients of linear discriminants:
LD1 LD2
rw 0.17691142 0.4883370
fpg -2.15320564 -2.3530795
glucose 3.98629010 2.2550693
insulin -0.01269006 0.7451996
sspg 0.44658743 -0.1192412
Proportion of trace:
LD1 LD2
0.881 0.119
In a discriminant analysis, the categorical variable (or combination of the categorical variables) goes on the left, and the quantitative variables go on the right, so that we are “predicting category”, even if logically the category is really influencing the quantitative variables rather than the other way around.
Extra 1: this worksheet is likely to get long, so I won’t ask it here, but I often ask whether the number of linear discriminants you got is what you were expecting. Here, we had 3 groups and 5 variables, so we should expect the smaller of \(3-1\) and 5, that is 2, LDs, and that’s what we got.
Extra 2: depending specifically on your R installation, you might get output with negative and positive signs switched around compared to mine. This may make your graphs (later) come out upside down or mirror-image compared to mine, and may have other effects, but as long as you appear to have done the right thing, you are good (that is, if the grader on an assignment cannot find any errors in your code, it is no problem if your output differs from mine in these kinds of ways).
- Comment briefly on the relative importance of the linear discriminants.
Look at the “proportion of trace” at the bottom of the output. The value for LD1, 0.881, is much larger than the value for LD2, 0.119, so the first linear discriminant is much more important than the second, and does most, but not all, of the work in separating the three groups.
Your choice of adjective; LD2 is not quite worthless, but we will see on a graph later that most of the separation is happening along the LD1 dimension.
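If you are curious where the proportions of trace come from, they can be recomputed from the singular values that lda stores in its svd component (a sketch, not required here):
# sketch: proportion of trace, recomputed from the lda object's singular values
diabetes.2$svd^2 / sum(diabetes.2$svd^2)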
- Which two of the original quantitative variables play the largest role in LD1? What kind of values on those variables would make the LD1 score large (very positive)?
This is, for me, fpg negatively and glucose positively; thus, the LD1 score will be largest if glucose is large and/or fpg is small. (I am happy if you say “and” here, though things are actually a bit more complicated than that; see the Extra below.)
Extra: We will see later that this is characteristic of an individual with overt diabetes. Having said that, such individuals tend to have large glucose and large fpg. The rationalization of this is that the coefficient on LD1 for glucose is almost twice as big in size as the one for fpg, so that a large value of glucose “wins” over a large value of fpg. I drew some graphs for myself at the end to understand this better.
You might have the signs switched around, so the answer here would then be “small glucose and large fpg”.
- Obtain and save a dataframe containing the predicted group memberships, posterior probabilities, and discriminant scores for each individual, along with the original data. Display (some of) your dataframe.
This is obtained with the old-fashioned predict. predict only produces the predictions, so we need to glue this onto the original data:
p <- predict(diabetes.2)
d <- cbind(diabetes, p)
d
Most of the columns I see here came from the original data, but class is the predicted group (from the discriminant analysis, based only on the values of the five quantitative variables). If you scroll right, you’ll see the posterior probabilities of each individual being in each group, and at the end the LD scores (on LD1 and LD2).
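If you would rather not scroll, a sketch of a quicker look is glimpse, which lists every column (with its first few values) down the page:
# sketch: all the columns of d at once, without scrolling right
glimpse(d)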
- Obtain a table counting the number of individuals who actually had each type of diabetes, cross-classified by the type of diabetes they were predicted to have. Does the classification appear to be good or bad? Explain briefly.
This is the standard way to do it:
with(d, table(obs = group, pred = class))
pred
obs chemical normal overt
chemical 30 6 0
normal 3 73 0
overt 5 1 27
You also have a tidyverse option, with count:
d %>% count(group, class)
but this, I think, is best made to look like the previous output:
d %>% count(group, class) %>%
  pivot_wider(names_from = class, values_from = n, values_fill = 0)
The pivoting-wider introduces some cells that didn’t appear in the original count output (which had only seven rows); the values_fill replaces those missings with the zeros that they actually are. You won’t always need this; try it without first, and if you get some missings that should be zeros, put it in.
However you get the table, look at the values off the top-left to bottom-right diagonal; these are the ones that the classification got wrong. There are not many of these here, only \(6+3+5+1 = 15\) out of 145. So the classification appears to be good (which implies that the three types of diabetes can be distinguished well by the five quantitative variables, in particular by glucose and fpg that were the most important part of LD1).
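A sketch, if you want the error rate as a single number (it should come out to 15/145, about 10%):
# sketch: overall misclassification rate
d %>% summarise(misclass_rate = mean(group != class))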
If you have the table that comes directly out of count(group, class), you’ll need to say something about how you know the classification was a good one, for example words like “the rows on which the actual group and the predicted group (in class) are different are the ones for which the classification was wrong. The frequencies on these rows are small, so not many of the individuals were wrongly classified”.
- Find an individual that was misclassified (it doesn’t matter which one). For your chosen individual, was the misclassification clear-cut or a close thing? Explain briefly.
The true type of diabetes is in group and the predicted type is in class, so you can first find the individuals for which group and class are different:
d %>% filter(group != class)
Then, whichever individual you pick, scroll across and find their posterior probabilities of being in the true and predicted types. For example, the individual in row 26 was actually normal but was predicted to be chemical. This individual has a posterior probability 0.82 of being chemical, only 0.17 of being normal, and almost 0 of being overt. So this decision was actually quite clearly wrong.
Go through this procedure with your chosen individual. The answer you get doesn’t matter as long as it follows logically, and will depend on your chosen individual. The one in row 59, for example, is a close call (0.47 and 0.53), and the one in the row labelled 66 (on mine) is an even closer call. (The row numbers come from the original dataframe d before the filter was run on it.)
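A sketch that pulls out just the relevant columns for the misclassified individuals, so there is less scrolling (the posterior-prefixed column names are the ones from my d; yours may be named differently):
# sketch: actual group, predicted class, and posterior probabilities
# for the misclassified individuals only
d %>% filter(group != class) %>%
  select(group, class, starts_with("posterior"))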
- Make a plot of LD1 and LD2 scores for each individual, distinguished by the group they belong to. There are too many points on this plot to label individually.
Just this:
ggplot(d, aes(x = x.LD1, y = x.LD2, colour = group)) + geom_point()
If you try to label the points (like I did on the peanuts example in class), the plot will be mostly an effort to find places to put labels, so don’t do that.
Look in your dataframe of data plus predictions (the one I called d) to see what the discriminant scores are called. It depends on precisely how you made your dataframe. On mine, they are called x.LD1 and x.LD2, but on yours they might be called LD1 and LD2. Take a look.
Extra: to understand why that happened for me, let’s look at the “structure” of my p:
str(p)
List of 3
$ class : Factor w/ 3 levels "chemical","normal",..: 2 2 2 2 2 2 2 2 2 2 ...
$ posterior: num [1:145, 1:3] 0.007919 0.000234 0.000362 0.0546 0.019599 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:145] "1" "2" "3" "4" ...
.. ..$ : chr [1:3] "chemical" "normal" "overt"
$ x : num [1:145, 1:2] -1.7 -2.84 -2.63 -1.51 -1.86 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:145] "1" "2" "3" "4" ...
.. ..$ : chr [1:2] "LD1" "LD2"
This is actually a list, with things inside other things. For example, the posterior probabilities of each group are all in one object called posterior, with three things called chemical, normal, and overt inside that. The cbind I did earlier squashes this list down into a dataframe, and the nested structure is lost: chemical inside posterior becomes a column called posterior.chemical, and LD1 within x becomes x.LD1. Check what I called d to see that this is exactly what happened.
- Which group is on the right on your plot? What does that say about this group’s values on the original quantitative variables?
On my graph, the overt group is on the right. These have high LD1 scores, and as we said earlier, these individuals have large glucose and/or small fpg. Your graph might have come out as the mirror image, in which case you might have the normal group on the right, and they have small glucose and (possibly) large fpg.
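A sketch that confirms which group is where without squinting at the plot: average the LD1 scores by group (using my column name x.LD1; adjust if yours is LD1):
# sketch: mean LD1 score per group; the most positive mean is the rightmost group
d %>% group_by(group) %>%
  summarise(mean_LD1 = mean(x.LD1))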
Extra: these questions spiral rapidly out of hand if I ask you to do everything. The one thing I didn’t ask you to do was to make a biplot, which is actually easier to make than the LD scores plot, once you have everything set up:
library(ggbiplot)
Loading required package: plyr
------------------------------------------------------------------------------
You have loaded plyr after dplyr - this is likely to cause problems.
If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
library(plyr); library(dplyr)
------------------------------------------------------------------------------
Attaching package: 'plyr'
The following objects are masked from 'package:dplyr':
arrange, count, desc, failwith, id, mutate, rename, summarise,
summarize
The following object is masked from 'package:purrr':
compact
Loading required package: scales
Attaching package: 'scales'
The following object is masked from 'package:purrr':
discard
The following object is masked from 'package:readr':
col_factor
Loading required package: grid
library(conflicted)
conflicts_prefer(dplyr::mutate)
[conflicted] Will prefer dplyr::mutate over any other package.
conflicts_prefer(dplyr::count)
[conflicted] Will prefer dplyr::count over any other package.
conflicts_prefer(dplyr::filter)
[conflicted] Will prefer dplyr::filter over any other package.
The ggbiplot package uses a package called plyr that has a lot of functions with the same names as ones in dplyr (the database-type functions in the tidyverse). It is a good idea to load the conflicted package here and explicitly choose the dplyr functions when conflicted offers you a choice, using conflicts_prefer. These are the ones I needed; you might need others.
ggbiplot(diabetes.2, group = diabetes$group)
The value this has is that we see the original variables as well, so we get confirmation that the overt group is high on glucose and (apparently) low on fpg, and the other three variables don’t do much to distinguish the groups.
Only two of our original quantitative variables seem to have any difference between groups, so we are in the fortunate position of being able to plot those two against each other, with groups labelled by colour:
ggplot(diabetes, aes(x = glucose, y = fpg, colour = group)) + geom_point()
This is where I saw that glucose and fpg actually have a high positive correlation, so that a high value on one almost always goes with a high value on the other. This seems to contradict what the biplot and question 5 were saying, where a high LD1 score seems to go with high glucose and low fpg. I saw this last night, when I was very tired, and figured I’d better come back to it this morning with coffee in me and explain it properly for you. So I went back today and qualified my answer to question 5 (the “and/or”, and the Extra there), so that what is actually happening is that the high glucose is “winning” over the high fpg because its coefficient in LD1 is bigger in size.
You can see that glucose by itself does a pretty good job of distinguishing the groups, and you might be wondering why fpg is part of LD1 at all. But if you look carefully, you’ll see that having a value of fpg above zero identifies the overt group and distinguishes it from the chemical group individuals whose glucose is above 0. So it is worth looking at fpg as well as glucose.
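A sketch that checks this numerically (my addition, not part of the original worksheet): count how often each group’s fpg is above zero:
# sketch: whether fpg > 0 mostly picks out the overt group
diabetes %>% count(group, fpg_positive = fpg > 0)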
In an analysis like this, if you can get it down to individual important quantitative variables, you can see how well they distinguish the groups by drawing boxplots:
ggplot(diabetes, aes(x = group, y = glucose)) + geom_boxplot()
and
ggplot(diabetes, aes(x = group, y = fpg)) + geom_boxplot()
and you can even make a little table that shows which variable does the better job of distinguishing each pair of groups:
| Group 1  | Group 2 | Better variable |
|----------|---------|-----------------|
| chemical | normal  | glucose         |
| chemical | overt   | fpg             |
| normal   | overt   | both            |
and you see that both variables matter for distinguishing the groups. In this case, the use of glucose and fpg to distinguish the groups is simple enough that looking at the variables one at a time tells you how the groups are different. Under other circumstances it might have been that a combination of values of the two variables was what was important, and then the scatterplot of the two variables with coloured groups would have been the thing to look at.
This last piece of the analysis seems very simple, scatterplots and boxplots. But it took the discriminant analysis to tell us which variables to look at.