You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.
If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example if a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)
You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.
Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.
1 Log odds and poisoning rats
In one of the examples from lecture, we learned about modelling the probability that a rat would live as it depended on the dose of a poison. Some of the output from the logistic regression is as shown below:
summary(rat2.1)
Call:
glm(formula = response ~ dose, family = "binomial", data = rat2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.3619 0.6719 3.515 0.000439 ***
dose -0.9448 0.2351 -4.018 5.87e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 27.530 on 5 degrees of freedom
Residual deviance: 2.474 on 4 degrees of freedom
AIC: 18.94
Number of Fisher Scoring iterations: 4
For the calculations below, I suggest you use R as a calculator. If you prefer, use your actual calculator, but then your numerical answers will need to be sufficiently close to being correct in order to get any credit.
(a) (2 points) Using the summary output, obtain a prediction for a dose of 3.2 units. What precisely is this a prediction of?
Do it as if you are doing a regression prediction, B22-style, by substituting 3.2 into the (logistic) regression equation:
log_odds <- 2.3619 - 0.9448 * 3.2
log_odds
[1] -0.66146
This, precisely, is the predicted log-odds that a rat given this dose will live.
I don’t know about you, but I have no intuition for how big a probability this is, yet. All I can really say is that it is negative, so it goes with a probability less than 0.5. (It is a high dose by the standards of the data, so it is more likely than not that a rat given this dose will die.)
(I saved my log-odds to use in the next step, but for the purposes of this problem, I have no objection if you copy and paste the answer.)
(b) (3 points) Convert your prediction into a predicted probability that a rat given this dose will live. Hint: if probability is denoted \(p\) and odds \(d\), we saw in class that \(d = p/(1-p)\). It follows (by algebra that I am doing for you) that \(p = d/(1+d)\).
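(For the record, the algebra is short: starting from \(d = p/(1-p)\), multiply both sides by \(1-p\) to get \(d - dp = p\), so \(d = p + dp = p(1+d)\), and therefore \(p = d/(1+d)\).)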
Two steps: convert the log-odds into odds by exp-ing it:
odds <- exp(log_odds)
odds
[1] 0.5160973
and then use the formula in the hint to convert this to a probability:
odds / (1 + odds)
[1] 0.3404117
The probability that a rat given this dose will live is only 0.34. The grader can decide whether they want to split these three points as 2 and 1 or 1.5 and 1.5, but it’s definitely more work than a two-point question.
Extra: to see whether this is right, run the code in the lecture notes (up to the model I called rat2.1) and then use predictions:
new <- tibble(dose = 3.2)
cbind(predictions(rat2.1, newdata = new)) %>% select(dose, estimate)
(c) (2 points) In the output given at the top of this question, there is a number \(-0.9448\). What is the interpretation of this number? (If you prefer, you can instead interpret the exp of this number.)
the original slope -0.9448 says that if you increase dose by 1, you decrease the log-odds of living by 0.9448.
the exp of the slope (see below) is about 0.39, so if you increase dose by 1, you multiply the odds of living by 0.39.
exp(-0.9448)
[1] 0.3887573
Increasing the dose makes the odds, and therefore the probability, of living smaller, which makes sense given that we are talking about the dose of a poison.
Extra: Interpreting the slope means talking about changes, and changes do not translate naturally into changes in probability. I wanted to do a numerical example to show you what I mean, since I don’t know how well this came across in lecture. Let’s pick several doses, 1 unit apart:
new <- tibble(dose = 1:6)
cbind(predictions(rat2.1, newdata = new)) %>% select(estimate, dose)
The point is that the change in probability as you increase the dose by 1 is not constant: sometimes it’s as big as 0.23 and sometimes as small as 0.05.
Compare that with this:
cbind(predictions(rat2.1, newdata = new, type = "link")) %>% select(estimate, dose)
Using type = "link" produces predictions on the log-odds scale.[3] This time, you see that the estimate values decrease by about 0.95 every time, no matter what dose you are looking at. This “decrease of about 0.95” is actually the \(-0.9448\) that is the slope. Thus, increasing the dose by 1 decreases the log-odds of living by 0.9448, no matter what dose you’re looking at.
As I said, I don’t have much feeling for log-odds, but maybe I do have some intuition for odds themselves. Let’s turn those last predictions into odds by exp-ing them:
cbind(predictions(rat2.1, newdata = new, type = "link")) %>%
  select(estimate, dose) %>%
  mutate(estimate = exp(estimate))
At dose 1, a rat is odds-on to live (more likely than not), but as the dose increases, a rat definitely becomes less likely to live.
As you look down the numbers in the estimate column now, you might be able to see that each one is a bit more than a third of the previous one. The multiplier is actually this:
exp(-0.9448)
[1] 0.3887573
so maybe “a multiplier of 0.4” would be more accurate. This says that if you increase the dose by 1, you multiply the odds of living by about 0.4, no matter what dose you were looking at.
Thus:
the slope tells you how much the log-odds changes as the \(x\) increases by 1
the exp of the slope tells you by what multiple the odds change as the \(x\) changes by 1 [4]
neither of those tells you how the probability itself changes, because the probability does not change in a predictable way.
2 Carrots
In a consumer study, 103 consumers scored their preference of 12 Danish carrot types on a scale from 1 to 7, where 1 represents “strongly dislike” and 7 represents “strongly like”. The consumers also rated each carrot type on some other features, and some demographic information was collected. The data are in http://ritsokiguess.site/datafiles/carrots_pref.csv. We will be predicting preference score from the type of carrot and how often the consumer eats carrots (the latter treated as quantitative):
eat_carrots: how often the consumer eats carrots: 1: once a week or more, 2: once every two weeks, 3: once every three weeks, 4: at least once a month, 5: less than once a month. (We will treat this as quantitative.)
Preference: consumer score on a seven-point scale, 7 being best
Product: type of carrot (there are 12 different named types).
(a) (1 point) Read in and display (some of) the data.
Before I do that, the packages you’ll need this time are probably these:
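library(tidyverse)        # as usual
library(MASS)             # for model-fitting: polr lives here
library(marginaleffects)  # for predictions (assuming that is where your predictions comes from)
library(conflicted)
conflicts_prefer(dplyr::select)
conflicts_prefer(dplyr::filter)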
You’ll certainly need the top three (as usual, for model-fitting, and for predictions, respectively). Because we are using MASS and there is the select issue, I also recommend using conflicted (which you may have to install first). When you start using conflicted, you will get what seems like a ton of errors, because R will throw an error every time you use a function like select or filter that exists in more than one of the packages you have loaded. The error message will tell you how to choose the one you want; that’s what those last two lines are for. The select and filter we will be using are the tidyverse ones, which live in a part of the tidyverse called dplyr, so copy the code from the error messages with conflicts_prefer and dplyr in it, and paste it in somewhere so that it will run next time (my preferred place is after all the library lines where I load my packages). The next time R runs into a select, it will know which one you want, and won’t throw an error about it.
If you don’t want to use conflicted, you will need to make sure that you load MASS first, before tidyverse. The idea here is that the select you load last, that is, the one in the tidyverse, is the one that will run. If you go this way, though, you might from time to time need to restart your R session (Session, Restart R) to keep things in line.
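Then read in the data from the URL given in the question, something like this:

my_url <- "http://ritsokiguess.site/datafiles/carrots_pref.csv"
carrots <- read_csv(my_url)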
Rows: 1236 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Product
dbl (2): eat_carrots, Preference
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
carrots
You might be wondering why there are so many rows. This is because there were 103 consumers, and each one rated 12 different carrot types (the ones named in Product), so there should be this many rows in the “longer” layout that we have:
103*12
[1] 1236
and so there are. (This would admittedly have been clearer if we had had an ID for each consumer; then, it would have been easier to see where the preferences for one consumer ended and the ones for the next one began.)
(b) (2 points) Why would ordinal logistic regression be a sensible method of analysis here?
The response variable Preference is categorical (even though written as a number), and the categories are in order, with 1 being worst and 7 being best.
(c) (3 points) Fit an ordinal logistic regression to this dataset. You do not need to display any output from this model yet. Hint: Preference is actually categorical, even though it looks like a number, so you should make sure that R treats it as categorical.
polr from package MASS. Also, Preference, which looks like a number, should be a genuine factor, so make it one when you fit the model (or, create a column using mutate that is the factor version of Preference).
carrots.1 <- polr(factor(Preference) ~ eat_carrots + Product, data = carrots)
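If you prefer the mutate route, a sketch of how that would go (making the factor column first and then using it in the model):

carrots %>% mutate(Preference = factor(Preference)) -> carrots
carrots.1 <- polr(Preference ~ eat_carrots + Product, data = carrots)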
(d) (2 points) Can any explanatory variables be removed? Explain briefly.
drop1 works just fine for these models, so is the right tool for finding this out:
drop1(carrots.1, test = "Chisq")
The P-value for eat_carrots is 0.095 (watch the scientific notation!), which is greater than 0.05, so eat_carrots can be removed. This means that people’s rating of a carrot does not depend on how often they eat carrots.
Note that the categorical Product is treated as a whole (which is a reason to use drop1): we know that Product has to stay, that is, that there are some differences in preference among the carrot types.
In the previous part, you didn’t know how you were going to be displaying the model, so it was better to wait and figure out what you needed to look at first.
(e) (2 points) If necessary, fit an improved model. (If not, explain briefly why not.)
Copy, paste, and edit, or use update:
carrots.2 <- update(carrots.1, . ~ . - eat_carrots)
If you didn’t think there was anything to remove, say why you thought that (for example, that you thought both explanatory variables were significant and so should not be removed). I am likely to disagree with your reasoning here, but as long as you are consistent the rest of the way, this part should be the only place you lose points.
(f) (2 points) We will be predicting probabilities of each rating category for each of the explanatory variables remaining in the best model. Make a dataframe that includes all the different types of carrot, and the values 1 and 5 for eat_carrots if that is in your best model. Hint: you can use count to get all the levels of a categorical variable.
My best model does not include eat_carrots, so I only need the twelve types of carrot: that is, all the levels of the categorical variable Product.
carrots %>% count(Product) -> new
new
If you have eat_carrots in your model as well, you can still use count to get all the combinations of values, and then you need to get rid of the ones you don’t want:
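A sketch of how that might go, keeping only the eat_carrots values 1 and 5 that the question asks for:

carrots %>% count(Product, eat_carrots) %>%
  filter(eat_carrots %in% c(1, 5)) -> new
new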
The extra column n can be safely ignored, because if predictions finds an extra column in newdata it will ignore it.
There are 12 varieties of carrot, so if that is all that is left in your model, your new should have 12 rows. If you have two values of eat_carrots going with each of those, your new will have 24 rows.
(g) (3 points) Predict the probability of a customer giving each carrot type each preference score. Display your results in such a way that you can easily compare the probability of each score for different types of carrot.
This is predictions, with the usual request to keep only the relevant columns:
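Something like this, using my best model carrots.2 and the dataframe new from the previous part:

cbind(predictions(carrots.2, newdata = new)) %>%
  select(Product, group, estimate)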
These are the predictions all right, but 84 rows is a lot to assess. Two points for getting this far.
group is the score category, and I think it would be easier to have the predicted probabilities for the different groups going across the page. (Feel free to disagree, but you need to make the case for leaving it like this, if that’s what you choose to do.)
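pivot_wider will do that, something like:

cbind(predictions(carrots.2, newdata = new)) %>%
  select(Product, group, estimate) %>%
  pivot_wider(names_from = group, values_from = estimate)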
Depending on the width of your page, you might find that the predicted probabilities now go off the right side. This is acceptable for this assignment, but the very best answer would round the probabilities off to, say, 2 or 3 decimals, so that you can see all seven of them even on a narrow screen. You can do that all at once like this:
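# a sketch: round every column except Product (the only non-probability column) to 3 decimals
cbind(predictions(carrots.2, newdata = new)) %>%
  select(Product, group, estimate) %>%
  pivot_wider(names_from = group, values_from = estimate) %>%
  mutate(across(-Product, \(x) round(x, 3)))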
mutate with across redefines all the columns “in place” (so that the unrounded values are lost unless you calculate them again, but we only wanted to see the rounded ones).
The third point is for making a decent attempt to show the probabilities of each preference value for each carrot in a readable way.
If your model had eat_carrots in it, your answers will look different, but the same considerations otherwise apply:
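For example, a sketch, assuming your new has the 24 rows including eat_carrots values 1 and 5, and your model is the original carrots.1:

cbind(predictions(carrots.1, newdata = new)) %>%
  select(Product, eat_carrots, group, estimate) %>%
  pivot_wider(names_from = group, values_from = estimate)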
(h) (2 points) There was a significant difference in preference scores among the different types of carrot. What do your predictions tell you about why that is? Explain briefly.
See whether there are any carrots that are generally liked more or less than the others.
Casting my eye down the predictions, it seems that most Products are most likely to be rated 5. Exceptions are Major_E, Turbo_E, and Yukon_E, which are more likely to be rated 4, and are thus liked less than the others. You might also notice that Bolero_L and Yukon_L are the most likely to be rated 6 or 7, so people like them the most.
Pick out at least one product that is liked more or less than the others, to justify the significance of Product. Also, say something about whether the products you named are liked more or less overall than the others. (Because the response category is ordered, having a higher probability of being a 6 or 7 means that people like it more, and having a higher probability of a 4 or even a 3 means that people like it less. Very few people gave a 1 or a 2 rating to anything.)
If you had eat_carrots in your predictions, focus on just one of its values. The overall pattern should otherwise be the same as above.
Footnotes
[2] predictions is probably using more accurate numbers than I took from the summary output.
[3] In general, for a generalized linear model, predictions gives predictions on the scale on which the model was fitted, which for a logistic regression is the log-odds scale: the scale on which the original predictions are done.
[4] Adding on a log scale is the same thing as multiplying on the original scale. This is why you can multiply numbers by adding their logs.