---
title: "Logistic Regression"
editor:
markdown:
wrap: 72
---
## Logistic regression
- When response variable is measured/counted, regression can work
well.
- But what if response is yes/no, lived/died, success/failure?
- Model *probability* of success.
- Probability must be between 0 and 1; need method that ensures this.
- *Logistic regression* does this. In R, is a *generalized linear
model* with binomial "family":
```{r bLogistic-1, eval=FALSE}
glm(y ~ x, family="binomial")
```
- Begin with simplest case.
## Packages
```{r bLogistic-2, message = FALSE}
library(MASS)
library(tidyverse)
library(marginaleffects)
library(broom)
library(nnet)
library(conflicted)
conflict_prefer("select", "dplyr")
conflict_prefer("filter", "dplyr")
conflict_prefer("rename", "dplyr")
conflict_prefer("summarize", "dplyr")
```
## The rats, part 1
- Rats given dose of some poison; either live or die:
```
dose status
0 lived
1 died
2 lived
3 lived
4 died
5 died
```
## Read in:
```{r bLogistic-3, message=FALSE}
my_url <- "http://ritsokiguess.site/datafiles/rat.txt"
rats <- read_delim(my_url, " ")
rats
```
## Basic logistic regression
- Make response into a factor first:
```{r bLogistic-4}
rats2 <- rats %>% mutate(status = factor(status))
rats2
```
- then fit model:
```{r bLogistic-5, error=T}
status.1 <- glm(status ~ dose, family = "binomial", data = rats2)
```
## Output
```{r bLogistic-6}
summary(status.1)
```
## Interpreting the output
- Like (multiple) regression, get tests of significance of individual
$x$'s
- Here not significant (only 6 observations).
- "Slope" for dose is negative, meaning that as dose increases,
probability of event modelled (survival) decreases.
## Output part 2: predicted survival probs
```{r bLogistic-7 }
cbind(predictions(status.1)) %>%
select(dose, estimate, conf.low, conf.high)
```
## On a graph
```{r, fig.height=4}
plot_predictions(status.1, condition = "dose")
```
## The rats, more
- More realistic: more rats at each dose (say 10).
- Listing each rat on one line makes a big data file.
- Use format below: dose, number of survivals, number of deaths.
```
dose lived died
0 10 0
1 7 3
2 6 4
3 4 6
4 2 8
5 1 9
```
- 6 lines of data correspond to 60 actual rats.
- Saved in `rat2.txt`.
## These data
```{r bLogistic-8}
my_url <- "http://ritsokiguess.site/datafiles/rat2.txt"
rat2 <- read_delim(my_url, " ")
rat2
```
## Response matrix:
- Each row contains *multiple* observations.
- Create *two-column* response with `cbind`:
- #survivals in first column,
- #deaths in second.
## Fit logistic regression
- constructing the response in the `glm`:
```{r bLogistic-11}
rat2.1 <- glm(cbind(lived, died) ~ dose, family = "binomial", data = rat2)
```
## Output
Significant effect of dose now:
```{r bLogistic-12}
summary(rat2.1)
```
## Predicted survival probs
```{r bLogistic-13 }
#| warning = FALSE
new <- datagrid(model = rat2.1, dose = 0:5)
cbind(predictions(rat2.1, newdata = new)) %>%
select(estimate, dose, conf.low, conf.high)
```
## On a picture
```{r}
#| error: true
plot_predictions(rat2.1, condition = "dose")
```
## Comments
- Significant effect of dose.
- Effect of larger dose is to *decrease* survival probability ("slope"
negative; also see in decreasing predictions.)
- Confidence intervals around prediction narrower (more data).
## Multiple logistic regression
- With more than one $x$, works much like multiple regression.
- Example: study of patients with blood poisoning severe enough to
warrant surgery. Relate survival to other potential risk factors.
- Variables, 1=present, 0=absent:
- survival (death from sepsis=1), response
- shock
- malnutrition
- alcoholism
- age (as numerical variable)
- bowel infarction
- See what relates to death.
## Read in data
```{r bLogistic-14}
my_url <-
"http://ritsokiguess.site/datafiles/sepsis.txt"
sepsis <- read_delim(my_url, " ")
sepsis
```
## Make sure categoricals really are
```{r}
sepsis %>%
mutate(across(-age, \(x) factor(x))) -> sepsis
```
## The data (some)
```{r bLogistic-15}
sepsis
```
## Fit model
```{r bLogistic-16 }
sepsis.1 <- glm(death ~ shock + malnut + alcohol + age +
bowelinf,
family = "binomial",
data = sepsis
)
```
## Output part 1
```{r bLogistic-17}
summary(sepsis.1)
tidy(sepsis.1)
```
- All P-values fairly small
- but `malnut` not significant: remove.
## Removing `malnut`
```{r bLogistic-18}
sepsis.2 <- update(sepsis.1, . ~ . - malnut)
tidy(sepsis.2)
```
- Everything significant now.
## Comments
- Most of the original $x$'s helped predict death. Only `malnut`
seemed not to add anything.
- Removed `malnut` and tried again.
- Everything remaining is significant (though `bowelinf` actually
became *less* significant).
- All coefficients are *positive*, so having any of the risk factors
(or being older) *increases* risk of death.
## Predictions from model without "malnut"
- A few (rows of original dataframe) chosen "at random":
```{r bLogistic-19}
sepsis %>% slice(c(4, 1, 2, 11, 32)) -> new
new
cbind(predictions(sepsis.2, newdata = new)) %>%
select(estimate, conf.low, conf.high, shock:bowelinf)
```
## Comments
- Survival chances pretty good if no risk factors, though decreasing
with age.
- Having more than one risk factor reduces survival chances
dramatically.
- Usually good job of predicting survival; sometimes death predicted
to survive.
## Another way to assess effects
of `age`:
```{r}
new <- datagrid(model = sepsis.2, age = seq(30, 70, 10))
new
```
## Assessing age effect
```{r}
cbind(predictions(sepsis.2, newdata = new)) %>%
select(estimate, shock:age)
```
## Assessing shock effect
```{r}
new <- datagrid(shock = c(0, 1), model = sepsis.2)
new
cbind(predictions(sepsis.2, newdata = new)) %>%
select(estimate, death:shock)
```
## Assessing proportionality of odds for age
- An assumption we made is that log-odds of survival depends linearly
on age.
- Hard to get your head around, but basic idea is that survival
chances go continuously up (or down) with age, instead of (for
example) going up and then down.
- In this case, seems reasonable, but should check:
## Residuals vs. age
```{r virtusentella,fig.height=3.4, warning=F}
sepsis.2 %>% augment(sepsis) %>%
ggplot(aes(x = age, y = .resid, colour = death)) +
geom_point()
```
## Comments
- No apparent problems overall.
- Confusing "line" across: no risk factors, survived.
## Probability and odds
For probability $p$, odds is $p/(1-p)$:
| Prob | Odds | Log-odds | Words |
|------|------------------|----------|------------|
| 0.5 | 0.5 / 0.5 = 1.00 | 0.00 | even money |
| 0.1 | 0.1 / 0.9 = 0.11 | -2.20 | 9 to 1 |
| 0.4 | 0.4 / 0.6 = 0.67 | -0.41 | 1.5 to 1 |
| 0.8 | 0.8 / 0.2 = 4.00 | 1.39 | 4 to 1 on |
- Gamblers use odds: if you win at 9 to 1 odds, get original stake
back plus 9 times the stake.
- Probability has to be between 0 and 1
- Odds between 0 and infinity
- *Log*-odds can be anything: any log-odds corresponds to valid
probability.
## Odds ratio
- Suppose 90 of 100 men drank wine last week, but only 20 of 100
women.
- Prob of man drinking wine $90/100=0.9$, woman $20/100=0.2$.
- Odds of man drinking wine $0.9/0.1=9$, woman $0.2/0.8=0.25$.
- Ratio of odds is $9/0.25=36$.
- Way of quantifying difference between men and women: \`\`odds of
drinking wine 36 times larger for males than females''.
## Sepsis data again
- Recall prediction of probability of death from risk factors:
```{r bLogistic-20}
sepsis
summary(sepsis.2)
sepsis.2.tidy <- tidy(sepsis.2)
sepsis.2.tidy
```
- Slopes in column `estimate`.
## Multiplying the odds
- Can interpret slopes by taking "exp" of them. We ignore intercept.
```{r expo}
sepsis.2.tidy %>%
mutate(exp_coeff=exp(estimate)) %>%
select(term, exp_coeff)
```
## Interpretation
```{r bLogistic-21, ref.label="expo", echo=F}
```
- These say "how much do you *multiply* odds of death by for increase
of 1 in corresponding risk factor?" Or, what is odds ratio for that
factor being 1 (present) vs. 0 (absent)?
- Eg. being alcoholic vs. not increases odds of death by 24 times
- One year older multiplies odds by about 1.1 times. Over 40 years,
about $1.09^{40}=31$ times.
## Odds ratio and relative risk
- **Relative risk** is ratio of probabilities.
- Above: 90 of 100 men (0.9) drank wine, 20 of 100 women (0.2).
- Relative risk 0.9/0.2=4.5. (odds ratio was 36).
- When probabilities small, relative risk and odds ratio similar.
- Eg. prob of man having disease 0.02, woman 0.01.
- Relative risk $0.02/0.01=2$.
## Odds ratio vs. relative risk
- Odds for men and for women:
```{r bLogistic-22 }
(od1 <- 0.02 / 0.98) # men
(od2 <- 0.01 / 0.99) # women
```
- Odds ratio
```{r bLogistic-23 }
od1 / od2
```
- Very close to relative risk of 2.
## More than 2 response categories
- With 2 response categories, model the probability of one, and prob
of other is one minus that. So doesn't matter which category you
model.
- With more than 2 categories, have to think more carefully about the
categories: are they
- *ordered*: you can put them in a natural order (like low, medium,
high)
- *nominal*: ordering the categories doesn't make sense (like red,
green, blue).
- R handles both kinds of response; learn how.
## Ordinal response: the miners
- Model probability of being in given category *or lower*.
- Example: coal-miners often suffer disease pneumoconiosis. Likelihood
of disease believed to be greater among miners who have worked
longer.
- Severity of disease measured on categorical scale: none, moderate,
severe.
## Miners data
- Data are frequencies:
```
Exposure None Moderate Severe
5.8 98 0 0
15.0 51 2 1
21.5 34 6 3
27.5 35 5 8
33.5 32 10 9
39.5 23 7 8
46.0 12 6 10
51.5 4 2 5
```
## Reading the data
Data in aligned columns with more than one space between, so:
```{r bLogistic-24 }
my_url <- "http://ritsokiguess.site/datafiles/miners-tab.txt"
freqs <- read_table(my_url)
```
## The data
```{r bLogistic-25 }
freqs
```
## Tidying
```{r bLogistic-26 }
freqs %>%
pivot_longer(-Exposure, names_to = "Severity", values_to = "Freq") %>%
mutate(Severity = fct_inorder(Severity)) -> miners
```
## Result
```{r bLogistic-27 }
miners
```
## Plot proportions against exposure
```{r bLogistic-28, fig.height=3.5, message=F}
miners %>%
group_by(Exposure) %>%
mutate(proportion = Freq / sum(Freq)) -> prop
prop
ggplot(prop, aes(x = Exposure, y = proportion,
colour = Severity)) +
geom_point() + geom_smooth(se = F)
```
## Reminder of data setup
```{r bLogistic-29 }
miners
```
## Fitting ordered logistic model
Use function `polr` from package `MASS`. Like `glm`.
```{r bLogistic-34 }
sev.1 <- polr(Severity ~ Exposure,
weights = Freq,
data = miners
)
```
## Output: not very illuminating
```{r}
sev.1 <- polr(Severity ~ Exposure,
weights = Freq,
data = miners,
Hess = TRUE
)
```
```{r bLogistic-35 }
summary(sev.1)
```
## Does exposure have an effect?
Fit model without `Exposure`, and compare using `anova`. Note `1` for
model with just intercept:
```{r bLogistic-36, echo=F}
w <- getOption("width")
options(width = w - 20)
```
```{r bLogistic-37}
sev.0 <- polr(Severity ~ 1, weights = Freq, data = miners)
anova(sev.0, sev.1)
```
Exposure definitely has effect on severity of disease.
## Another way
- What (if anything) can we drop from model with `exposure`?
```{r bLogistic-38 }
drop1(sev.1, test = "Chisq")
```
- Nothing. Exposure definitely has effect.
## Predicted probabilities 1/2
```{r}
freqs %>% select(Exposure) -> new
new
```
## Predicted probabilities 2/2
```{r}
cbind(predictions(sev.1, newdata = new)) %>%
select(group, estimate, Exposure) %>%
pivot_wider(names_from = group, values_from = estimate)
```
## Plot of predicted probabilities
```{r}
plot_predictions(model = sev.1, condition = c("Exposure", "group"), type = "probs") +
geom_point(data = prop, aes(x = Exposure, y = proportion, colour = Severity)) -> ggg
```
## The graph
```{r, fig.height=4.5}
ggg
```
## Comments
- Model appears to match data well enough.
- As exposure goes up, prob of None goes down, Severe goes up (sharply
for high exposure).
- So more exposure means worse disease.
## Unordered responses
- With unordered (nominal) responses, can use *generalized logit*.
- Example: 735 people, record age and sex (male 0, female 1), which of
3 brands of some product preferred.
- Data in `mlogit.csv` separated by commas (so `read_csv` will work):
```{r bLogistic-45 }
my_url <- "http://ritsokiguess.site/datafiles/mlogit.csv"
brandpref <- read_csv(my_url)
```
## The data (some)
```{r bLogistic-46 }
brandpref
```
## Bashing into shape
- `sex` and `brand` not meaningful as numbers, so turn into factors:
```{r bLogistic-47 }
brandpref %>%
mutate(sex = ifelse(sex == 1, "female", "male"),
sex = factor(sex),
brand = factor(brand)
) -> brandpref
```
```{r}
brandpref %>% count(sex)
```
## Fitting model
- We use `multinom` from package `nnet`. Works like `polr`.
```{r bLogistic-48 }
library(nnet)
levels(brandpref$sex)
brands.1 <- multinom(brand ~ age + sex, data = brandpref)
```
## Can we drop anything?
- Unfortunately `drop1` seems not to work:
```{r bLogistic-49, error=TRUE}
drop1(brands.1, test = "Chisq", trace = 0)
```
- So, fall back on fitting model without what you want to test, and
comparing using `anova`.
## Do age/sex help predict brand? 1/3
Fit models without each of age and sex:
```{r bLogistic-50 }
brands.2 <- multinom(brand ~ age, data = brandpref)
brands.3 <- multinom(brand ~ sex, data = brandpref)
```
## Do age/sex help predict brand? 2/3
```{r bLogistic-51}
anova(brands.2, brands.1)
anova(brands.3, brands.1)
```
## Do age/sex help predict brand? 3/3
- `age` definitely significant (second `anova`)
- `sex` significant also (first `anova`), though P-value less dramatic
- Keep both.
- Expect to see a large effect of `age`, and a smaller one of `sex`.
## Another way to build model
- Start from model with everything and run `step`:
```{r}
#| echo = FALSE,
#| message = FALSE,
#| results = "hide"
brands.1 <- with(brandpref, multinom(brand ~ age + sex))
```
```{r bLogistic-52 }
step(brands.1, trace = 0)
```
- Final model contains both `age` and `sex` so neither could be
removed.
## Making predictions
Find age 5-number summary, and the two sexes:
```{r}
summary(brandpref)
```
Space the ages out a bit for prediction (see over).
## Combinations
```{r}
new <- datagrid(age = seq(24, 30, 2),
sex = c("female", "male"), model = brands.1)
new
```
## The predictions
```{r bLogistic-54 }
cbind(predictions(brands.1, newdata = new)) %>%
select(group, estimate, age, sex) %>%
pivot_wider(names_from = group, values_from = estimate)
```
## Comments
- Young males prefer brand 1, but older males prefer brand 3.
- Females similar, but like brand 1 less and brand 2 more.
- A clear `brand` effect, but the `sex` effect is less clear.
## Making a plot
- I thought `plot_predictions` doesn't work as we want, but I was
(sort of) wrong about that:
```{r}
plot_predictions(brands.1, condition = c("age", "brand", "sex"),
type = "probs")
```
## Making it go
- We have to include `group` in the `condition`:
```{r}
plot_predictions(brands.1, condition = c("age", "group"))
```
- This picks the most common `sex` in the data (females).
- See younger females prefer brand 1, older ones preferring brand 3.
## For each `sex`
If we add the
other variable to the *end*, we get facets for `sex`:
```{r}
plot_predictions(brands.1, condition = c("age", "group", "sex"))
```
Not actually much difference between males and
females.
## A better graph
- but the male-female difference *was* significant. How?
- *don't* actually plot the graph, then plot the right things:
```{r}
plot_predictions(brands.1, condition = c("age", "brand", "sex"),
type = "probs", draw = FALSE) %>%
ggplot(aes(x = age, y = estimate, colour = group,
linetype = sex)) +
geom_line() -> g
```
## The graph
```{r, fig.height=4.5}
g
```
## Digesting the plot
- Brand vs. age: younger people (of both genders) prefer brand 1, but
older people (of both genders) prefer brand 3. (Explains significant
age effect.)
- Brand vs. sex: females (solid) like brand 1 less than males
(dashed), like brand 2 more (for all ages).
- Not much brand difference between genders (solid and dashed lines of
same colours close), but enough to be significant.
- Model didn't include interaction, so modelled effect of gender on
brand same for each age, modelled effect of age same for each
gender. (See also later.)
## Alternative data format
Summarize all people of same brand preference, same sex, same age on one
line of data file with frequency on end:
```{r}
brandpref
```
```
1 0 24 1
1 0 26 2
1 0 27 4
1 0 28 4
1 0 29 7
1 0 30 3
...
```
Whole data set in 65 lines not 735! But how?
## Getting alternative data format
```{r bLogistic-60, warning=FALSE, message=FALSE}
brandpref %>%
group_by(age, sex, brand) %>%
summarize(Freq = n()) %>%
ungroup() -> b
b
```
## Fitting models, almost the same
- Just have to remember `weights` to incorporate frequencies.
- Otherwise `multinom` assumes you have just 1 obs on each line!
- Again turn (numerical) `sex` and `brand` into factors:
```{r bLogistic-61, results="hide"}
b %>%
mutate(sex = factor(sex)) %>%
mutate(brand = factor(brand)) -> bf
b.1 <- multinom(brand ~ age + sex, data = bf, weights = Freq)
b.2 <- multinom(brand ~ age, data = bf, weights = Freq)
```
## P-value for `sex` identical
```{r bLogistic-62}
anova(b.2, b.1)
```
Same P-value as before, so we haven't changed anything important.
## Trying interaction between age and gender
```{r bLogistic-69, echo=F}
options(width = 60)
```
```{r bLogistic-70 }
brands.4 <- update(brands.1, . ~ . + age:sex)
anova(brands.1, brands.4)
```
- No evidence that effect of age on brand preference differs for the
two genders.
## Make graph again
```{r}
plot_predictions(brands.4, condition = c("age", "brand", "sex"),
type = "probs", draw = FALSE) %>%
ggplot(aes(x = age, y = estimate, colour = group,
linetype = sex)) +
geom_line() -> g4
```
## Not much difference in the graph
```{r, fig.height=4.5}
g4
```
## Compare model without interaction
```{r, fig.height=4.5}
g
```