STAD29 Assignment 3

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example if a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

Packages

library(tidyverse)
library(marginaleffects)
library(MASS, exclude = "select")   # keep MASS from masking dplyr's select

Coronary heart disease and age

100 subjects selected for a study were assessed for “evidence of significant coronary heart disease”, recorded in column chd as Yes or No. This is believed to be associated with age, so each subject’s age was also recorded. The data are in http://ritsokiguess.site/datafiles/chdage.csv. There are two other columns: an id for each individual, and a categorized age group. We will not use either of these.

  1. (1 point) Read in and display (some of) the data.

It’s getting almost embarrassing to give you one point for these now:

my_url <- "http://ritsokiguess.site/datafiles/chdage.csv"
chdage <- read_csv(my_url)
Rows: 100 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): agegrp, chd
dbl (2): id, age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
chdage

chdage was the name of the dataset in my source, so I stuck with the name for my dataframe. Call yours whatever you like: heart_disease or similar would be fine, but steer clear of chd as the name of your dataframe because it is also the name of one of the columns in it.

  2. (2 points) Fit a logistic regression model to predict coronary heart disease as it depends on age, and display the results.

The one non-obvious thing is the treatment of chd. This is a categorical variable as far as we are concerned, but for glm it needs to be a factor. You can redefine it as such in the dataframe, or, perhaps easier, do it right inside the glm:

chd.1 <- glm(factor(chd) ~ age, family = "binomial", data = chdage)
summary(chd.1)

Call:
glm(formula = factor(chd) ~ age, family = "binomial", data = chdage)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.30945    1.13365  -4.683 2.82e-06 ***
age          0.11092    0.02406   4.610 4.02e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 136.66  on 99  degrees of freedom
Residual deviance: 107.35  on 98  degrees of freedom
AIC: 111.35

Number of Fisher Scoring iterations: 4

If you turn chd into a factor in the dataframe, your glm line will be simpler, but the results will be the same, so either way is good.
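
If you want to see what that looks like, here is a minimal sketch (assuming you are happy to overwrite the chd column):

chdage %>% 
  mutate(chd = factor(chd)) -> chdage   # chd is now a factor
chd.1 <- glm(chd ~ age, family = "binomial", data = chdage)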

  3. (2 points) What precisely does the number in the Estimate column of your output next to age tell you? Explain briefly.

If age increases by 1, the log-odds of having coronary heart disease increases by 0.11.

The “explain briefly” part means that you need to say how you know that it’s the probability of having coronary heart disease (as opposed to the probability of not having it). To justify this: note that the values in chd are No and Yes; No is the first alphabetically, so it is the baseline, and we are modelling the probability of the other value Yes: that is, the probability that an individual of that age does have coronary heart disease.
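
If you want to double-check which category is the baseline, a quick sketch is to look at the factor levels, since the first level listed is the baseline:

levels(factor(chdage$chd))   # "No" comes first, so glm models the probability of "Yes"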

  4. (2 points) Use plot_predictions to make a graph showing the predicted probability of coronary heart disease as it depends on age, with a confidence interval for the probability. Why is the trend on your graph not linear?

The inputs to plot_predictions are the model and the (one) explanatory variable, input as condition:

plot_predictions(chd.1, condition = "age")

This should be visually consistent with what you said in the previous part: as age increases, the probability of coronary heart disease also increases.

This graph is not linear because it shows predicted probabilities. The linearity is in the relationship between age and log-odds: the slope of age shows how much the log-odds changes if age changes by 1 year, and the relationship between log-odds and probability is not linear.
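
To see this numerically, here is a minimal sketch using the (rounded) fitted coefficients and plogis, which converts log-odds back to probabilities:

log_odds <- -5.309 + 0.111 * c(30, 50, 70)   # equal 20-year steps in age
plogis(log_odds)                             # unequal steps in probability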

Extras:

Extra 1: you can tweak plot_predictions to plot the log-odds on the vertical scale, rather than the probability, and this is linear, with the slope that you got earlier:

plot_predictions(chd.1, condition = "age", type = "link")

As age increases by 10, the log-odds increases by about \(1.1 = 10 \times 0.11\), confirming what you said before.

In fact, you see the same narrowness in the middle that we saw on the regression one (of the sleep time of children of various ages). This is for the same reason as there: an age nearer the mean has a lot of nearby ages to base the prediction on. However, when you turn the log-odds into probabilities, the ones near the centre get wider and the ones near the extremes get narrower. This is perhaps easier to understand for an extreme probability, such as one near 1: the log-odds can change a lot without changing the probability very much, if the probability is near 1.
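
For instance, log-odds of 3 and 5 are far apart, but the corresponding probabilities are both close to 1:

plogis(c(3, 5))   # about 0.95 and 0.99: a big log-odds change, a small probability change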

Extra 2: the confidence “band” for the probabilities is fairly narrow, because we have a decent amount of data (100 individuals, bearing in mind that one individual doesn’t carry all that much information; they either have CHD or they don’t.)

Extra 3: the probability of CHD at a high age is definitely bigger than at a lower age: not only is the predicted probability bigger, but also the whole interval contains (much) bigger values.

Amyloid-beta

Amyloid-beta (Abeta) is a protein fragment that has been linked to Alzheimer’s disease. Autopsies from a sample of Catholic priests included measurements of Abeta (pmol/g tissue from the posterior cingulate cortex in the brain) from three groups: subjects who had exhibited no cognitive impairment before death (NCI in the data), subjects who had exhibited mild cognitive impairment (MCI), and subjects who had mild to moderate Alzheimer’s disease (substantial cognitive impairment; mAD). The data are in http://ritsokiguess.site/datafiles/Stat2Data_Amyloid.csv.

  1. (1 point) Read in and display (some of) the data.

You know the story by now:

my_url <- 'http://ritsokiguess.site/datafiles/Stat2Data_Amyloid.csv'
amyloid <- read_csv(my_url)  
Rows: 57 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Group
dbl (1): Abeta

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
amyloid

Why the data are from Catholic priests, I have no idea. Perhaps priests are routinely autopsied.

  2. (2 points) Thinking of amyloid-beta as something that may influence cognitive impairment, why would ordinal logistic regression be a reasonable technique to use here?

The question says that the (quantitative) amyloid-beta is explanatory, and therefore the degree of impairment in the (categorical) Group is the response. The key thing here is that there are three (more than two) response categories (three levels of impairment), and that they can be ordered, such as from least impaired (NCI) to most impaired (mAD).

  3. (2 points) Do the levels of impairment appear in the original data in a sensible order? How do you know? Hence create a response variable in your dataframe with the impairment categories in a sensible order.

In the dataframe that you read in, the least impaired individuals (NCI) are first, and the most impaired (mAD) are at the end. Hence the categories do appear in the data in a sensible order.
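
If you want a check that doesn’t rely on scrolling through the data, distinct lists the categories in order of first appearance (a quick sketch):

amyloid %>% distinct(Group)   # should show NCI, then MCI, then mAD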

To take advantage of that, recall fct_inorder from C32, which creates a factor (categorical variable) with the categories in the order in which they appear in the data. You can create a new column (probably sensible) or overwrite the old one once you think you know what you are doing:

amyloid %>% 
  mutate(group_ord = fct_inorder(Group)) -> amyloid

Don’t forget to save your modified dataframe.

Extra: if you want to check that the categories are now in a sensible order, you can do something like counting them:

amyloid %>% count(group_ord)

and you see that they are listed in the order you want.

Having done that, you see that there isn’t very much data, so we might have trouble with significance later, but if you make a plot (a boxplot):

ggplot(amyloid, aes(x = group_ord, y = Abeta)) + geom_boxplot()

you see that individuals with the highest amyloid-beta do tend to be more cognitively impaired, at least if you discount the outliers.

  4. (2 points) Fit a suitable model to predict level of impairment from amyloid-beta. (You do not need to display any output.)

Having decided earlier that ordinal logistic regression was the thing, we follow through with polr from MASS:

amyl.1 <- polr(group_ord ~ Abeta, data = amyloid)

Give your model a reasonable name. I used the dataframe (abbreviated) plus a number.

Extra: if you do decide to look at the output, you’ll find that it’s not apparently very helpful:

summary(amyl.1)

Re-fitting to get Hessian
Call:
polr(formula = group_ord ~ Abeta, data = amyloid)

Coefficients:
         Value Std. Error t value
Abeta 0.001671  0.0006333   2.639

Intercepts:
        Value   Std. Error t value
NCI|MCI -0.0729  0.3618    -0.2014
MCI|mAD  1.6688  0.4323     3.8602

Residual Deviance: 116.6483 
AIC: 122.6483 

There are no tests here (we will get hold of one in a minute). The coefficients here come with \(t\)-values, which you can interpret like \(z\)’s (anything beyond \(\pm 2\) is likely significant). In this case, it looks as if there is a significant (“positive”) effect of amyloid-beta, because the \(t\)-value is bigger than 2. The implication is that with a larger value of amyloid-beta, the individual is more likely to be in one of the later impairment categories.¹
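
If you want a rough P-value to go with that \(t\)-value, treat it as a \(z\) (a sketch; this is a Wald test, so it won’t exactly match the likelihood-ratio test from drop1 coming up):

2 * pnorm(-abs(2.639))   # two-sided; about 0.008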

The bottom table of Intercepts gives us a bit more detail. The first two categories are very similar, but the last category, Alzheimer’s, is very different from the second one (and hence from the first one as well). This says that amyloid-beta is very good at distinguishing Alzheimer’s from the other categories, but not at distinguishing the no-impairment and moderate-impairment categories. This is the same kind of story that was told by my boxplot (in an earlier Extra).
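
As a further illustration (my own addition, not needed for the answer): the intercepts are cutpoints on the log-odds scale, so when Abeta is 0 they convert directly into cumulative probabilities:

plogis(c(-0.0729, 1.6688))   # P(NCI) and P(NCI or MCI) at Abeta = 0: about 0.48 and 0.84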

This is all rather anticipating something you haven’t done yet (but will, shortly).

  5. (2 points) Test whether amyloid-beta does in fact have a significant effect on impairment level. What do you conclude?

In most models, you would look at the summary output, but in this one (see the Extra to the previous question) there are no P-values to be had directly. So we fall back on drop1, with the right test being Chisq because this is a generalized linear model:

drop1(amyl.1, test = "Chisq")

The P-value, 0.0043, is much smaller than 0.05, so there is a significant effect of amyloid-beta on impairment category.

Extra: this test says that there is an effect of amyloid-beta, but it does not say what kind of effect it is. For that, we need to do some predictions (coming up), or look at the indications we have seen in previous Extras.
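
An equivalent way to get the same likelihood-ratio test is to fit the null (intercept-only) model yourself and compare (a sketch; amyl.0 is a name I made up):

amyl.0 <- polr(group_ord ~ 1, data = amyloid)
anova(amyl.0, amyl.1)   # same test as drop1, via MASS's anova method for polr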

  6. (3 points) For amyloid-beta values of 0, 500, 1000, 1500 (that is, from 0 to 1500 in steps of 500), predict the probabilities that individuals with those amyloid-beta values will fall into each impairment category. Display your predictions in such a way that it is easy to see how they change as amyloid-beta changes. (It is not necessary to display confidence intervals for these predictions.)

There are three steps, and the guideline is one point each:

  • make a dataframe of values to predict for (the column having the same name as in the dataframe you read in from the file).
  • do the predictions.
  • rearrange the predictions so that they display nicely.

For the first step, you can do it like this:

new <- datagrid(model = amyl.1, Abeta = seq(0, 1500, 500))
new

(using my standard name new.) Any other way you have of getting a dataframe like this with a column called Abeta (to match the original dataframe) is good, for example using tibble or tribble rather than datagrid (a tibble sketch is below).

Depending on the version of marginaleffects you have, you might get a second column with a different name than rowid. That doesn’t matter: as long as you have a column called exactly Abeta with those values in it, it doesn’t matter what else you have.
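
If you went the tibble route instead, a minimal version would be:

new <- tibble(Abeta = seq(0, 1500, 500))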

The second step is to get the predictions (and only keep the columns you want):

cbind(predictions(amyl.1, newdata = new)) %>% 
  select(group, Abeta, estimate)

Re-fitting to get Hessian

For this kind of model, you also need to keep the column group that says which impairment category this is a prediction for.

The last step is to display all the predictions for each value of amyloid-beta in one row so that you can compare the rows as a whole. This means pivoting-wider the group column, carrying along the values in estimate:

cbind(predictions(amyl.1, newdata = new)) %>% 
  select(group, Abeta, estimate) %>% 
  pivot_wider(names_from = group, values_from = estimate)

Re-fitting to get Hessian

With only three response categories, it is easy enough to see what is going on (unlike the carrots one in the worksheet), so I am happy if you stop here. You could, however, round the predicted probabilities to, say, three decimals, like this, with the last line of code saying “for each of the columns except for Abeta, round it to 3 decimals”:

cbind(predictions(amyl.1, newdata = new)) %>% 
  select(group, Abeta, estimate) %>% 
  pivot_wider(names_from = group, values_from = estimate) %>% 
  mutate(across(-Abeta, \(x) round(x, 3)))

Re-fitting to get Hessian

and you could very reasonably argue that this is now easier to read.

  7. (2 points) How does the overall level of impairment change as amyloid-beta changes? Explain briefly.

A higher value of amyloid-beta goes with a higher level of impairment. When amyloid-beta is low, an individual is likely to have no impairment and less likely to have Alzheimer’s (the highest level of impairment). But when amyloid-beta is high, there is a low probability of no impairment and a high probability of Alzheimer’s.

Alternatively, read down the columns: the probability of no impairment goes down as amyloid-beta increases, and the probability of Alzheimer’s goes up as amyloid-beta increases, therefore the overall picture is that impairment gets worse as amyloid-beta increases.

When there is an ordered categorical response, the probabilities generally move from one end to the other as the explanatory variable(s) change, and so (as here) it makes sense to talk about an overall low or high level of impairment by considering whether there is a low or high probability of being towards the low or high end of the scale.

  8. (2 points) Draw a graph that shows the predicted probability of each level of impairment as it depends on the value of amyloid-beta, along with confidence bands for each prediction.

This is exactly what plot_predictions does, so it is no more than this:

plot_predictions(amyl.1, condition = c("Abeta", "group"))

Re-fitting to get Hessian

The only slightly weird thing is where you put group: it is a response category, but you want to have it displayed using colour as if it were a categorical explanatory variable, so you pretend that’s what it is and plot_predictions is smart enough to infer what you actually meant.

I didn’t ask for comment, because what you would say is about what you said in the previous question: the probability of no impairment goes down as amyloid-beta increases, the probability of Alzheimer’s increases, and the probability of moderate impairment is caught in the middle, lowest when one of the other probabilities is highest. Hence the overall level of impairment gets higher as you move from left to right. (This is actually the same sort of thing as with the coal-miners example in lecture, where the overall level of lung disease got higher as exposure got higher.)

Extra: the confidence intervals are widish. This is where the small sample size has finally caught up with us: the pattern of high amyloid-beta going with Alzheimer’s is clear enough, but just how big the effect is, well, that is not clear (and would be clearer with a larger sample). On this graph, the blue “envelope” of Alzheimer’s starts out lower than the others and ends up higher, but the red and green envelopes for the other two categories overlap substantially: there is not much to choose between the probabilities of these categories, all the way along.²

Footnotes

  1. This is what I meant by a “positive” effect; it’s up to you whether you really think such an effect is “positive”.

  2. The brownish area is where the red and green envelopes overlap. You can see this if you look carefully: the top of the red envelope goes down all the way, and the bottom of the green envelope starts at just below 0.25, increases until an Abeta of about 500, and then decreases. The bottom of the green envelope is kind of hard to see because it overlaps with either the blue or the red envelope all the way across.