When the response variable is measured or counted, ordinary regression can work well.
But what if the response is yes/no, lived/died, success/failure?
Then we model the probability of success.
A probability must be between 0 and 1, so we need a method that guarantees this.
Logistic regression does. In R, it is a generalized linear model with the binomial “family”:
dose status
0 lived
1 died
2 lived
3 lived
4 died
5 died
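The output below can be reproduced like this (a sketch: the six rats are typed in directly, the data frame name `rats2` comes from the output, and the fitted-model name `status.1` is mine):

```r
# the six rats typed in directly (data frame name taken from the output)
rats2 <- data.frame(dose = 0:5,
                    status = factor(c("lived", "died", "lived", "lived", "died", "died")))
# glm models the probability of the *second* factor level
# ("lived" here, since "died" sorts first alphabetically)
status.1 <- glm(status ~ dose, family = "binomial", data = rats2)
summary(status.1)
```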
Call:
glm(formula = status ~ dose, family = "binomial", data = rats2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.6841 1.7979 0.937 0.349
dose -0.6736 0.6140 -1.097 0.273
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 8.3178 on 5 degrees of freedom
Residual deviance: 6.7728 on 4 degrees of freedom
AIC: 10.773
Number of Fisher Scoring iterations: 4
As with (multiple) regression, we get tests of significance of the individual \(x\)’s.
Here dose is not significant (only 6 observations).
The “slope” for dose is negative, meaning that as dose increases, the probability of the event modelled (survival) decreases.
More realistic: more rats at each dose (say 10).
Listing each rat on its own line would make a big data file.
Instead use the format below: for each dose, the number of rats that lived and the number that died.
dose lived died
0 10 0
1 7 3
2 6 4
3 4 6
4 2 8
5 1 9
These 6 lines of data correspond to 60 actual rats.
Saved in rat2.txt.
The two-column response (lived, died) is built with cbind inside the glm model formula.
Now there is a significant effect of dose:
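A sketch of how the fit below can be produced, with the frequency-format data typed in directly (the data frame name `rat2` comes from the output; the fitted-model name is mine):

```r
# the frequency-format data from rat2.txt, typed in directly
rat2 <- data.frame(dose = 0:5,
                   lived = c(10, 7, 6, 4, 2, 1),
                   died  = c(0, 3, 4, 6, 8, 9))
# two-column response: successes (lived) and failures (died), bound together with cbind
rat2.1 <- glm(cbind(lived, died) ~ dose, family = "binomial", data = rat2)
summary(rat2.1)
```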
Call:
glm(formula = cbind(lived, died) ~ dose, family = "binomial",
data = rat2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.3619 0.6719 3.515 0.000439 ***
dose -0.9448 0.2351 -4.018 5.87e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 27.530 on 5 degrees of freedom
Residual deviance: 2.474 on 4 degrees of freedom
AIC: 18.94
Number of Fisher Scoring iterations: 4
With more than one \(x\), this works much like multiple regression.
Example: a study of patients with blood poisoning severe enough to warrant surgery. Relate survival to other potential risk factors.
Variables, with 1 = present and 0 = absent:
See what relates to death.
Call:
glm(formula = death ~ shock + malnut + alcohol + age + bowelinf,
family = "binomial", data = sepsis)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.75391 2.54170 -3.838 0.000124 ***
shock1 3.67387 1.16481 3.154 0.001610 **
malnut1 1.21658 0.72822 1.671 0.094798 .
alcohol1 3.35488 0.98210 3.416 0.000635 ***
age 0.09215 0.03032 3.039 0.002374 **
bowelinf1 2.79759 1.16397 2.403 0.016240 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 105.528 on 105 degrees of freedom
Residual deviance: 53.122 on 100 degrees of freedom
AIC: 65.122
Number of Fisher Scoring iterations: 7
All P-values fairly small, but malnut not significant: remove it.
Most of the original \(x\)’s helped predict death; only malnut seemed not to add anything.
Removed malnut and tried again.
Everything remaining is significant (though bowelinf actually became less significant).
All coefficients are positive, so having any of the risk factors (or being older) increases risk of death.
Survival chances pretty good if no risk factors, though decreasing with age.
Having more than one risk factor reduces survival chances dramatically.
The model usually does a good job of predicting survival; sometimes, though, someone who died was predicted to survive.
Linearity of age:
An assumption we made is that log-odds of survival depends linearly on age.
Hard to get your head around, but basic idea is that survival chances go continuously up (or down) with age, instead of (for example) going up and then down.
In this case, seems reasonable, but should check:
No apparent problems overall.
The confusing “line” across the plot: patients with no risk factors who survived.
For probability \(p\), the odds is \(p/(1-p)\):

| Prob | Odds | Log-odds | In words |
|---|---|---|---|
| 0.5 | 0.5 / 0.5 = 1.00 | 0.00 | even money |
| 0.1 | 0.1 / 0.9 = 0.11 | -2.20 | 9 to 1 against |
| 0.4 | 0.4 / 0.6 = 0.67 | -0.41 | 1.5 to 1 against |
| 0.8 | 0.8 / 0.2 = 4.00 | 1.39 | 4 to 1 on |
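The table can be checked directly in R:

```r
p <- c(0.5, 0.1, 0.4, 0.8)
odds <- p / (1 - p)
round(odds, 2)       # 1.00 0.11 0.67 4.00
round(log(odds), 2)  # 0.00 -2.20 -0.41 1.39
```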
Gamblers use odds: if you win at 9 to 1 odds, get original stake back plus 9 times the stake.
Probability has to be between 0 and 1
Odds between 0 and infinity
Log-odds can be anything: any log-odds corresponds to valid probability.
Suppose 90 of 100 men drank wine last week, but only 20 of 100 women.
Prob of man drinking wine \(90/100=0.9\), woman \(20/100=0.2\).
Odds of man drinking wine \(0.9/0.1=9\), woman \(0.2/0.8=0.25\).
Ratio of odds is \(9/0.25=36\).
A way of quantifying the difference between men and women: “odds of drinking wine 36 times larger for males than for females”.
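The arithmetic, step by step:

```r
odds_m <- 0.9 / 0.1  # odds that a man drank wine
odds_f <- 0.2 / 0.8  # odds that a woman drank wine
odds_m / odds_f      # odds ratio: 36
```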
Call:
glm(formula = death ~ shock + alcohol + age + bowelinf, family = "binomial",
data = sepsis)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.89459 2.31689 -3.839 0.000124 ***
shock1 3.70119 1.10353 3.354 0.000797 ***
alcohol1 3.18590 0.91725 3.473 0.000514 ***
age 0.08983 0.02922 3.075 0.002106 **
bowelinf1 2.38647 1.07227 2.226 0.026039 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 105.528 on 105 degrees of freedom
Residual deviance: 56.073 on 101 degrees of freedom
AIC: 66.073
Number of Fisher Scoring iterations: 7
Exponentiate the estimates. These say “by how much do you multiply the odds of death for an increase of 1 in the corresponding risk factor?” Or: what is the odds ratio for that factor being 1 (present) vs. 0 (absent)?
E.g. being alcoholic vs. not multiplies the odds of death by about 24.
Being one year older multiplies the odds by about 1.09; over 40 years, that is about \(1.09^{40} \approx 31\) times.
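The odds ratios come from exponentiating the estimates in the output above:

```r
exp(3.18590)  # alcohol: multiplies odds of death by about 24
exp(0.08983)  # one year older: multiplies odds by about 1.09
1.09^40       # 40 years older: about 31 times
```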
Relative risk is the ratio of probabilities.
Above: 90 of 100 men (0.9) drank wine, 20 of 100 women (0.2).
Relative risk \(0.9/0.2 = 4.5\) (the odds ratio was 36).
When the probabilities are small, the relative risk and the odds ratio are similar.
E.g. prob of a man having a disease 0.02, of a woman 0.01.
Relative risk \(0.02/0.01 = 2\); odds ratio \((0.02/0.98)/(0.01/0.99) \approx 2.02\), almost the same.
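The small-probability case in R, showing how close the two measures are:

```r
p_m <- 0.02  # probability a man has the disease
p_f <- 0.01  # probability a woman does
p_m / p_f                              # relative risk: 2
(p_m / (1 - p_m)) / (p_f / (1 - p_f)) # odds ratio: about 2.02
```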
With 2 response categories, model the probability of one, and prob of other is one minus that. So doesn’t matter which category you model.
With more than 2 categories, have to think more carefully about the categories: are they
ordered: you can put them in a natural order (like low, medium, high)
nominal: ordering the categories doesn’t make sense (like red, green, blue).
R handles both kinds of response; learn how.
With an ordered response, model the probability of being in a given category or lower.
Example: coal-miners often suffer disease pneumoconiosis. Likelihood of disease believed to be greater among miners who have worked longer.
Severity of disease measured on categorical scale: none, moderate, severe.
Exposure None Moderate Severe
5.8 98 0 0
15.0 51 2 1
21.5 34 6 3
27.5 35 5 8
33.5 32 10 9
39.5 23 7 8
46.0 12 6 10
51.5 4 2 5
The data are in aligned columns with more than one space between, so read_table will read them.
Use the function polr from package MASS. It works like glm.
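A sketch reconstructing the fit below, with the table typed in in “long” format (one row per Exposure/Severity combination, frequencies in Freq; the object name sev.1 is mine):

```r
library(MASS)  # for polr

exposure <- c(5.8, 15.0, 21.5, 27.5, 33.5, 39.5, 46.0, 51.5)
miners <- data.frame(
  Exposure = rep(exposure, 3),
  Severity = factor(rep(c("None", "Moderate", "Severe"), each = 8),
                    levels = c("None", "Moderate", "Severe"), ordered = TRUE),
  Freq = c(98, 51, 34, 35, 32, 23, 12, 4,   # None
            0,  2,  6,  5, 10,  7,  6, 2,   # Moderate
            0,  1,  3,  8,  9,  8, 10, 5))  # Severe
miners <- subset(miners, Freq > 0)  # drop empty cells
sev.1 <- polr(Severity ~ Exposure, data = miners, weights = Freq, Hess = TRUE)
summary(sev.1)
```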
Call:
polr(formula = Severity ~ Exposure, data = miners, weights = Freq,
Hess = TRUE)
Coefficients:
Value Std. Error t value
Exposure 0.0959 0.01194 8.034
Intercepts:
Value Std. Error t value
None|Moderate 3.9558 0.4097 9.6558
Moderate|Severe 4.8690 0.4411 11.0383
Residual Deviance: 416.9188
AIC: 422.9188
Fit a model without Exposure, and compare the two fits using anova. Note the 1 in the formula for the model with just an intercept:
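In code, the comparison might look like this (a sketch: miners is the data frame used above, and the object names sev.1 and sev.0 are ones I've chosen):

```r
library(MASS)

# full model, and intercept-only model (note the 1 in the formula)
sev.1 <- polr(Severity ~ Exposure, data = miners, weights = Freq, Hess = TRUE)
sev.0 <- polr(Severity ~ 1, data = miners, weights = Freq)
anova(sev.0, sev.1)  # likelihood-ratio test for Exposure
```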
Exposure definitely has effect on severity of disease.
Do the predictions match the data as exposure changes? The model appears to match the data well enough.
As exposure goes up, prob of None goes down, Severe goes up (sharply for high exposure).
So more exposure means worse disease.
With unordered (nominal) responses, can use generalized logit.
Example: 735 people; record age, sex (male 0, female 1), and which of 3 brands of some product each preferred.
Data in mlogit.csv, separated by commas (so read_csv will work).
sex and brand are not meaningful as numbers, so turn them into factors.
Use multinom from package nnet. It works like polr.
drop1 seems not to work here:
trying - age
Error in if (trace) {: argument is not interpretable as logical
Use anova instead. Fit models without each of age and sex:
# weights: 9 (4 variable)
initial value 807.480032
iter 10 value 706.796323
iter 10 value 706.796322
final value 706.796322
converged
# weights: 9 (4 variable)
initial value 807.480032
final value 791.861266
converged
age is definitely significant (second anova).
sex is significant also (first anova), though its P-value is less dramatic.
Keep both.
Expect to see a large effect of age, and a smaller one of sex.
Or use step:
trying - age
trying - sex
Call:
multinom(formula = brand ~ age + sex)
Coefficients:
(Intercept) age sexmale
2 -11.25127 0.3682202 -0.5237736
3 -22.25571 0.6859149 -0.4658215
Residual Deviance: 1405.941
AIC: 1417.941
step retained both age and sex, so neither could be removed.
Find the five-number summary of age, and the counts of the two sexes:
brand sex age
1:207 female:466 Min. :24.0
2:307 male :269 1st Qu.:32.0
3:221 Median :32.0
Mean :32.9
3rd Qu.:34.0
Max. :38.0
Space the ages out a bit for prediction (see over).
Young males prefer brand 1, but older males prefer brand 3.
Females similar, but like brand 1 less and brand 2 more.
A clear brand
effect, but the sex
effect is less clear.
plot_predictions doesn’t work as we want, but I was (sort of) wrong about that: put group in the condition.
By default this uses only the first sex in the data (females).
If we add the other variable to the end of condition, we get facets for sex:
Not actually much difference between males and females.
Brand vs. age: younger people (of both genders) prefer brand 1, but older people (of both genders) prefer brand 3. (Explains significant age effect.)
Brand vs. sex: females (solid) like brand 1 less than males (dashed), like brand 2 more (for all ages).
Not much brand difference between genders (solid and dashed lines of same colours close), but enough to be significant.
Model didn’t include interaction, so modelled effect of gender on brand same for each age, modelled effect of age same for each gender. (See also later.)
Summarize all people of same brand preference, same sex, same age on one line of data file with frequency on end:
1 0 24 1
1 0 26 2
1 0 27 4
1 0 28 4
1 0 29 7
1 0 30 3
...
The whole data set is 65 lines, not 735! But how?
Just remember to use weights to incorporate the frequencies; otherwise multinom assumes you have just 1 observation on each line!
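A minimal sketch of the idea, with made-up data (not the brand data): a frequency-format fit with weights gives the same answer as the fit on the fully expanded data.

```r
library(nnet)

# made-up illustration: 16 observations summarized in 4 frequency lines
freqs <- data.frame(y = factor(c("a", "b", "a", "b")),
                    x = c(0, 0, 1, 1),
                    n = c(5, 3, 2, 6))
# the same data expanded to one line per observation
long <- freqs[rep(seq_len(nrow(freqs)), freqs$n), c("y", "x")]

fit.freq <- multinom(y ~ x, data = freqs, weights = n, trace = FALSE)
fit.long <- multinom(y ~ x, data = long, trace = FALSE)
all.equal(coef(fit.freq), coef(fit.long), tolerance = 1e-3)  # essentially identical
```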
Again turn the (numerical) sex and brand into factors.
The results are identical: the same P-value for sex as before, so we haven’t changed anything important.
# weights: 15 (8 variable)
initial value 807.480032
iter 10 value 703.191146
iter 20 value 702.572260
iter 30 value 702.570900
iter 30 value 702.570893
iter 30 value 702.570893
final value 702.570893
converged
Comments
Significant effect of dose.
The effect of a larger dose is to decrease the survival probability (the “slope” is negative; also seen in the decreasing predictions).
Confidence intervals around the predictions are narrower (more data).