Worksheet 3

Packages

library(tidyverse)
library(marginaleffects)
library(MASS, exclude = "select")

Log odds and poisoning rats
In one of the examples from lecture, we learned about modelling the probability that a rat would live as it depended on the dose of a poison. Some of the output from the logistic regression is shown below:
summary(rat2.1)
Call:
glm(formula = response ~ dose, family = "binomial", data = rat2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.3619 0.6719 3.515 0.000439 ***
dose -0.9448 0.2351 -4.018 5.87e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 27.530 on 5 degrees of freedom
Residual deviance: 2.474 on 4 degrees of freedom
AIC: 18.94
Number of Fisher Scoring iterations: 4
For the calculations below, I suggest you use R as a calculator. If you prefer, use your actual calculator, but then your numerical answers will need to be sufficiently close to being correct in order to have gotten any credit (when this was an assignment problem).
- Using the summary output, obtain a prediction for a dose of 3.2 units. What precisely is this a prediction of?
Do it as if you are doing a regression prediction, B22-style, by substituting 3.2 into the (logistic) regression equation:
log_odds <- 2.3619 - 0.9448 * 3.2
log_odds
[1] -0.66146
This, precisely, is the predicted log-odds that a rat given this dose will live.
I don’t know about you, but I have no intuition for how big a probability this is, yet. All I can really say is that it is negative, so it goes with a probability less than 0.5. (It is a high dose by the standards of the data, so it is more likely than not that a rat given this dose will die.)
(I saved my log-odds to use in the next step, but for the purposes of this problem, I have no objection if you copy and paste the answer.)
- Convert your prediction into a predicted probability that a rat given this dose will live. Hint: if probability is denoted \(p\) and odds \(d\), we saw in class that \(d = p/(1-p)\). It follows (by algebra that I am doing for you) that \(p = d/(1+d)\).
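The algebra in the hint is easy to check numerically before using it: pick any probability, convert it to odds, and the formula recovers the probability you started with. (A base-R sketch; the value 0.34 is just a convenient choice.)

```r
p <- 0.34
d <- p / (1 - p)  # odds: d = p / (1 - p)
d / (1 + d)       # back to the probability: 0.34 again
```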
Two steps: convert the log-odds into odds by exp-ing it:
odds <- exp(log_odds)
odds
[1] 0.5160973
and then use the formula in the hint to convert this to a probability:
odds / (1 + odds)
[1] 0.3404117
The probability that a rat given this dose will live is only 0.34.
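As a sanity check on the arithmetic (not required for the question): base R’s plogis() is the inverse logit, so it turns a log-odds directly into a probability in one step, reproducing the two-step calculation.

```r
# inverse logit: plogis(x) = exp(x) / (1 + exp(x))
log_odds <- 2.3619 - 0.9448 * 3.2
plogis(log_odds)  # 0.3404, matching the two-step answer
```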
Extra: to see whether this is right, run the code in the lecture notes (up to the model I called rat2.1) and then use predictions:

new <- tibble(dose = 3.2)
cbind(predictions(rat2.1, newdata = new)) %>%
  select(dose, estimate)
Check, to within rounding.[1]
- In the output given at the top of this question, there is a number \(-0.9448\). What is the interpretation of this number? (If you prefer, you can instead interpret the exp of this number.)
This is the slope[2] of the logistic regression.
Two choices, therefore:
- the original slope -0.9448 says that if you increase dose by 1, you decrease the log-odds of living by 0.9448.
- the exp of the slope (see below) is about 0.39, so if you increase dose by 1, you multiply the odds of living by 0.39.
exp(-0.9448)
[1] 0.3887573
Increasing the dose makes the odds, and therefore the probability, of living smaller, which makes sense given that we are talking about the dose of a poison.
Extra: interpreting the slope is talking about changes, and this doesn’t fit naturally with changes in probabilities. I wanted to do a numerical example to show you what I mean, since I don’t know how well this came across in lecture. Let’s pick several doses, 1 unit apart:
new <- tibble(dose = 1:6)
cbind(predictions(rat2.1, newdata = new)) %>%
  select(estimate, dose)
The thing about this is that the change in probability as you increase dose by 1 is not constant: sometimes it’s as big as 0.23 and sometimes as small as 0.05.
Compare that with this:
cbind(predictions(rat2.1, newdata = new, type = "link")) %>%
  select(estimate, dose)

Using type = "link" produces predictions on the log-odds scale.[3] This time, you see that the estimate values decrease by about 0.95 every time, no matter what dose you are looking at. This “decrease of about 0.95” is actually the \(-0.9448\) that is the slope. Thus, increasing the dose by 1 decreases the log-odds of living by 0.9448 no matter what dose you’re looking at.
As I said, I don’t have much feeling for log-odds, but maybe I do have some intuition for odds themselves. Let’s turn those last predictions into odds by exp-ing them:
cbind(predictions(rat2.1, newdata = new, type = "link")) %>%
  select(estimate, dose) %>%
  mutate(estimate = exp(estimate))
At dose 1, a rat is odds-on to live (more likely than not), but as the dose increases, a rat definitely becomes less likely to live.
As you look down the numbers in the estimate column now, you might be able to see that each one is a bit more than a third of the previous one. The multiplier is actually this:
exp(-0.9448)
[1] 0.3887573
so maybe “a multiplier of 0.4” would be more accurate. This says that if you increase the dose by 1, you multiply the odds of living by about 0.4, no matter what dose you were looking at.
Thus:
- the slope tells you how much the log-odds changes as the \(x\) increases by 1
- the exp of the slope tells you by what multiple the odds change as the \(x\) changes by 1 [4]
- neither of those tells you anything about how the probability itself changes in a predictable way (because it doesn’t).
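All three bullets can be seen directly from the fitted equation, using only the intercept 2.3619 and slope \(-0.9448\) from the summary output (base R, no model object needed):

```r
dose <- 1:6
log_odds <- 2.3619 - 0.9448 * dose
p <- plogis(log_odds)  # probabilities of living, via the inverse logit

diff(log_odds)         # constant at -0.9448: the slope
exp(diff(log_odds))    # constant odds multiplier, about 0.389
diff(p)                # not constant: between about -0.05 and -0.23
```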
Carrots
In a consumer study, 103 consumers scored their preference of 12 Danish carrot types on a scale from 1 to 7, where 1 represents “strongly dislike” and 7 represents “strongly like”. The consumers also rated each carrot type on some other features, and some demographic information was collected. The data are in http://ritsokiguess.site/datafiles/carrots_pref.csv. We will be predicting preference score from the type of carrot and how often the consumer eats carrots (the latter treated as quantitative):

Frequency: how often the consumer eats carrots: 1: once a week or more, 2: once every two weeks, 3: once every three weeks, 4: at least once a month, 5: less than once a month. (We will treat this as quantitative.)
Preference: consumer score on a seven-point scale, 7 being best
Product: type of carrot (there are 12 different named types).
- Read in and display (some of) the data.
Before I do that, the packages you’ll need this time are probably these:
library(tidyverse)
library(MASS, exclude = "select")
library(marginaleffects)
# library(conflicted)
# conflicts_prefer(dplyr::select)
# conflicts_prefer(dplyr::filter)
You’ll certainly need the top three (as usual, for model-fitting, and for predictions, respectively). Because we are using MASS and there is the select issue, you need to do one of two things: either explicitly not import the select that lives in MASS (the exclude thing), or, if you forget that, load the conflicted package (the commented-out lines at the bottom).

If you’re using conflicted (which you may have to install first), you will at first get what seem like a ton of errors, because R will throw an error every time you use a function like select or filter that exists in more than one of the packages you have loaded. The error message will tell you how to choose the one you want; that’s what those last two lines are for. The select and filter we will be using are the tidyverse ones, which live in a part of the tidyverse called dplyr, so copy the code from the error messages with conflicts_prefer and dplyr in it, and paste it in somewhere so it will run next time (my preferred place is after all the library lines where I load my packages). The next time R runs into a select, it will know which one you want, and won’t throw an error about it.

If you nonetheless get an error message about select but your use of select is correct (in tidyverse terms), that means that R is getting confused about which select you mean, and you might need to restart your R session (Session, Restart R) to keep things in line.

The coder in you might be happier (in general) to use the conflicted package, even though it takes some more work to get things set up, because then it’s absolutely clear which package each function you are running is coming from. (The business with filter is nothing to do with MASS: it so happens that dplyr (in the tidyverse) has the filter that we learned about in C32, but there is also a filter in the stats package that is loaded every time you run R,[5] and so conflicted wants to know which of those two versions of filter is the one you mean: for us, the one in dplyr.)
Reading the data, though, is as usual:
my_url <- "http://ritsokiguess.site/datafiles/carrots_pref.csv"
carrots <- read_csv(my_url)
Rows: 1236 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Product
dbl (2): eat_carrots, Preference
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
carrots
You might be wondering why there are so many rows. This is because there were 103 consumers, and each one rated 12 different carrot types (the ones named in Product), so there should be this many rows in the “longer” layout that we have:
103 * 12
[1] 1236
and so there are. (This would admittedly have been clearer if we had had an ID for each consumer; then, it would have been easier to see where the preferences for one consumer ended and the ones for the next one began.)
- Why would ordinal logistic regression be a sensible method of analysis here?
The response variable Preference is categorical (even though written as a number), and the categories are in order, with 1 being worst and 7 being best.
- Fit an ordinal logistic regression to this dataset. You do not need to display any output from this model yet. Hint: use polr from package MASS. Also, Preference is actually categorical, even though it looks like a number, so it should be a genuine factor: make it one when you fit the model (or create a column using mutate that is the factor version of Preference).
carrots.1 <- polr(factor(Preference) ~ eat_carrots + Product, data = carrots)
- Can any explanatory variables be removed? Explain briefly.
drop1 works just fine for these models, so is the right tool for finding this out:
drop1(carrots.1, test = "Chisq")
The P-value for eat_carrots is 0.095 (watch the scientific notation!), so this can be removed. This means that people’s rating for a carrot is not affected by how much they tend to eat carrots.

Note that the categorical Product is treated as a whole (which is a reason to use drop1): we know that Product has to stay, that is, that there are some differences in preference among the carrot types.
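The “treated as a whole” behaviour is not special to polr: you can see the same thing with an ordinary lm on a built-in dataset (a sketch that has nothing to do with the carrots; PlantGrowth comes with R):

```r
# group is a factor with 3 levels, but drop1 gives one test for it as a whole,
# not one test per level the way summary would
m <- lm(weight ~ group, data = PlantGrowth)
drop1(m, test = "F")
```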
In the previous part, you didn’t know how you were going to be displaying the model, so it was better to wait and figure out what you needed to look at first.
- If necessary, fit an improved model. (If not, explain briefly why not.)
Copy, paste, and edit, or use update:

carrots.2 <- update(carrots.1, . ~ . - eat_carrots)
If you didn’t think there was anything to remove, say why you thought that (for example, that you thought both explanatory variables were significant and so should not be removed). I am likely to disagree with your reasoning here, but as long as you are consistent the rest of the way, this part would have been the only place you lost points.
- We will be predicting probabilities of each rating category for each of the explanatory variables remaining in the best model. Make a dataframe that includes all the different types of carrot, and the values 1 and 5 for eat_carrots if that is in your best model. Hint: you can use count to get all the levels of a categorical variable.
My best model does not include eat_carrots, so I only need the twelve types of carrot, which is a categorical variable:

carrots %>% count(Product) -> new
new
If you have eat_carrots in your model as well, you can still use count to get all the combinations of values, and then you need to get rid of the ones you don’t want:

carrots %>% count(eat_carrots, Product)

carrots %>% count(eat_carrots, Product) %>%
  filter(eat_carrots == 1 | eat_carrots == 5) -> new2
new2
The extra column n can be safely ignored, because if predictions finds an extra column in newdata it will ignore it.

There are 12 varieties of carrot, so if that is all that is left in your model, your new should have 12 rows. If you have two values of eat_carrots going with each of those, your dataframe (new2 above) will have 24 rows.
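If you would rather build the combinations directly than count and then filter, base R’s expand.grid makes every combination of the values you give it (tidyr’s crossing does the same job in tidyverse style). A sketch with just two of the twelve product names, so it runs on its own; in practice you would supply unique(carrots$Product):

```r
# every combination of the eat_carrots values and the product names given
new2 <- expand.grid(eat_carrots = c(1, 5),
                    Product = c("Bolero_L", "Yukon_L"),
                    stringsAsFactors = FALSE)
new2  # 2 x 2 = 4 rows, one per combination
```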
- Predict the probability of a customer giving each carrot type each preference score. Display your results in such a way that you can easily compare the probability of each score for different types of carrot.
This is predictions, with the usual request to keep only the relevant columns:

cbind(predictions(model = carrots.2, newdata = new)) %>%
  select(Product, estimate, group)
Re-fitting to get Hessian
These are the predictions all right, but 84 rows is a lot to assess. Two points out of three (when this was graded) for getting this far.
group is the score category, and I think it would be easier to have the predicted probabilities for the different groups going across the page. (Feel free to disagree, but you need to make the case for leaving it like this, if that’s what you choose to do.)

cbind(predictions(model = carrots.2, newdata = new)) %>%
  select(Product, estimate, group) %>%
  pivot_wider(names_from = group, values_from = estimate)
Re-fitting to get Hessian
Depending on the width of your page, you might find that the predicted probabilities now go off the right side. This was acceptable when this was an assignment, but the very best answer would round the probabilities off to, say, 2 or 3 decimals so that you can see all seven of them [6] even on a narrow screen. You can do that all at once like this:

cbind(predictions(model = carrots.2, newdata = new)) %>%
  select(Product, estimate, group) %>%
  pivot_wider(names_from = group, values_from = estimate) %>%
  mutate(across(where(is.numeric), \(x) round(x, 2)))
Re-fitting to get Hessian
mutate with across redefines all the columns “in place” (so that the unrounded values are lost unless you calculate them again, but we only wanted to see the rounded ones).
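To see what the across line is doing, here is the same rounding idea on a tiny made-up dataframe, written in base R so it runs without any packages (the column names and values are invented):

```r
d <- data.frame(Product = c("A", "B"),
                p4 = c(0.123456, 0.567890),
                p5 = c(0.98765, 0.04321))
# round every numeric column to 2 decimals, leaving character columns alone
d[] <- lapply(d, \(x) if (is.numeric(x)) round(x, 2) else x)
d
```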
The last point was for making a decent attempt to show the probabilities of each preference value for each carrot in a readable way.
If your model had eat_carrots in it, your answers will look different, but the same considerations otherwise apply:

cbind(predictions(model = carrots.1, newdata = new2)) %>%
  select(eat_carrots, Product, estimate, group) %>%
  pivot_wider(names_from = group, values_from = estimate) %>%
  mutate(across(where(is.numeric), \(x) round(x, 2)))
Re-fitting to get Hessian
- There was a significant difference in preference scores among the different types of carrot. What do your predictions tell you about why that is? Explain briefly.
See whether there are any carrots that are generally liked more or less than the others.
Casting my eye down the predictions, it seems that most Products are most likely to be rated 5. Exceptions are Major_E, Turbo_E, and Yukon_E, which are more likely to be rated 4, and are thus liked less than the others. You might also notice that Bolero_L and Yukon_L are the most likely to be rated 6 or 7, so people like them the most.
Pick out at least one product that is liked more or less than the others, to justify the significance of Product. Also, say something about whether the products you named are liked more or less overall than the others. (Because the response categories are ordered, having a higher probability of being a 6 or 7 means that people like a carrot more, and having a higher probability of a 4 or even a 3 means that people like it less. Very few people gave a 1 or a 2 rating to anything.)

If you had eat_carrots in your predictions, focus on just one of its values. The overall pattern should otherwise be the same as above.
Footnotes

1. predictions is probably using more accurate numbers than I took from the summary output.
2. It’s actually a slope in terms of log-odds.
3. In general, for a generalized linear model, it gives predictions on the scale the model was fitted on, which is log-odds for a logistic regression: the scale on which the original predictions are done.
4. Adding on a log scale is the same thing as multiplying on the original scale. This is why you can multiply numbers by adding their logs.
5. That filter has to do with time series, where “filter” has its own meaning.
6. If you are lucky.