So far, have seen:
response variable counted or measured (regression)
response variable categorized (logistic regression)
But what if response is time until event (eg. time of survival after surgery)?
Additional complication: event might not have happened at end of study (eg. patient still alive). But knowing that patient has “not died yet” presumably informative. Such data called censored.
Enter survival analysis, in particular the “Cox proportional hazards model”.
Explanatory variables in this context often called covariates.
survival
if not done. Also use broom
and marginaleffects
from earlier.12 women who have just started taking dancing lessons are followed for up to a year, to see whether they are still taking dancing lessons, or have quit. The “event” here is “quit”.
This might depend on:
a treatment (visit to a dance competition)
woman’s age (at start of study).
Months Quit Treatment Age
1 1 0 16
2 1 0 24
2 1 0 18
3 0 0 27
4 1 0 25
7 1 1 26
8 1 1 36
10 1 1 38
10 0 1 45
12 1 1 47
months
and quit
are kind of combined response:
Months
is number of months a woman was actually observed dancing
quit
is 1 if woman quit, 0 if still dancing at end of study.
Treatment is 1 if woman went to dance competition, 0 otherwise.
Fit model and see whether Age
or Treatment
have effect on survival.
Want to do predictions for probabilities of still dancing as they depend on whatever is significant, and draw plot.
Months
) and whether or not the event, quitting, happened (that is, if Quit
is 1).Surv
from survival
package, with two inputs:
TRUE
or 1 if the event happened.Call:
coxph(formula = Surv(Months, Quit) ~ Treatment + Age, data = dance)
n= 12, number of events= 10
coef exp(coef) se(coef) z Pr(>|z|)
Treatment -4.44915 0.01169 2.60929 -1.705 0.0882 .
Age -0.36619 0.69337 0.15381 -2.381 0.0173 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
Treatment 0.01169 85.554 7.026e-05 1.9444
Age 0.69337 1.442 5.129e-01 0.9373
Concordance= 0.964 (se = 0.039 )
Likelihood ratio test= 21.68 on 2 df, p=2e-05
Wald test = 5.67 on 2 df, p=0.06
Score (logrank) test = 14.75 on 2 df, p=6e-04
Use \(\alpha=0.10\) here since not much data.
Three tests at bottom like global F-test. Consensus that something predicts survival time (whether or not dancer quit and/or how long it took).
Age
(definitely), Treatment
(marginally) both predict survival time.
All depends on hazard rate, which is based on probability that event happens in the next short time period, given that event has not happened yet:
\(X\) denotes time to event, \(\delta\) is small time interval:
\(h(t) = P(X \le t + \delta | X \ge t) / \delta\)
if \(h(t)\) large, event likely to happen soon (lifetime short)
if \(h(t)\) small, event unlikely to happen soon (lifetime long).
want to model hazard rate
but hazard rate always positive, so actually model log of hazard rate
modelling how (log-)hazard rate depends on other things eg \(X_1 =\) age, \(X_2 =\) treatment, with the \(\beta\) being regression coefficients:
Cox model \(h(t)=h_0(t)\exp(\beta_0+\beta_1X_1+\beta_2 X_2+\cdots)\), or:
\(\log(h(t)) = \log(h_0(t)) + \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots\)
like a generalized linear model with log link.
marginaleffects
marginaleffects
knows about survival models (with sufficient care)Months
.These are actually for women who did not go to the dance competition.
cbind(predictions(dance.1, newdata = new, type = "survival")) %>%
select(Age, Treatment, Months, estimate)
The estimated survival probabilities go down over time. For example a 20-year-old woman here has estimated probability 0.0293 of still dancing after 5 months.
We can plot the predictions over time for an experimental condition such as age. The key for plot_predictions
is to put time first in the condition
:
The same procedure will get predictions for women who did or did not go to the dance competition, at various times:
The age used for predictions is the mean of all ages.
cbind(predictions(dance.1, newdata = new, type = "survival")) %>%
select(Age, Treatment, Months, estimate)
Women of this age have a high (0.879) chance of still dancing after 7 months if they went to the dance competition, but much lower (0.165) if they did not.
In condition
, put the time variable first, and then the effect of interest:
Call:
coxph(formula = Surv(Months, Quit) ~ Treatment + Age, data = dance)
n= 12, number of events= 10
coef exp(coef) se(coef) z Pr(>|z|)
Treatment -4.44915 0.01169 2.60929 -1.705 0.0882 .
Age -0.36619 0.69337 0.15381 -2.381 0.0173 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
Treatment 0.01169 85.554 7.026e-05 1.9444
Age 0.69337 1.442 5.129e-01 0.9373
Concordance= 0.964 (se = 0.039 )
Likelihood ratio test= 21.68 on 2 df, p=2e-05
Wald test = 5.67 on 2 df, p=0.06
Score (logrank) test = 14.75 on 2 df, p=6e-04
coef
column describe effect of that variable on log-hazard of quitting.Treatment = 1
) is less likely to quit soon vs. a woman who didn’t (more likely to be still dancing).With regression, usually plot residuals against fitted values.
Not quite same here (nonlinear model), but “martingale residuals” should have no pattern vs. “linear predictor”.
Use broom
ideas to get them, in .resid
and .fitted
as below.
Martingale residuals can go very negative, so won’t always look normal.
When you load in an R package, get data sets to illustrate functions in the package.
One such is lung
. Data set measuring survival in patients with advanced lung cancer.
Along with survival time, number of “performance scores” included, measuring how well patients can perform daily activities.
Sometimes high good, but sometimes bad!
Variables below, from the data set help file (?lung
).
inst time status age
Min. : 1.00 Min. : 5.0 Min. :1.000 Min. :39.00
1st Qu.: 3.00 1st Qu.: 166.8 1st Qu.:1.000 1st Qu.:56.00
Median :11.00 Median : 255.5 Median :2.000 Median :63.00
Mean :11.09 Mean : 305.2 Mean :1.724 Mean :62.45
3rd Qu.:16.00 3rd Qu.: 396.5 3rd Qu.:2.000 3rd Qu.:69.00
Max. :33.00 Max. :1022.0 Max. :2.000 Max. :82.00
NA's :1
sex ph.ecog ph.karno pat.karno
Min. :1.000 Min. :0.0000 Min. : 50.00 Min. : 30.00
1st Qu.:1.000 1st Qu.:0.0000 1st Qu.: 75.00 1st Qu.: 70.00
Median :1.000 Median :1.0000 Median : 80.00 Median : 80.00
Mean :1.395 Mean :0.9515 Mean : 81.94 Mean : 79.96
3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.: 90.00 3rd Qu.: 90.00
Max. :2.000 Max. :3.0000 Max. :100.00 Max. :100.00
NA's :1 NA's :1 NA's :3
meal.cal wt.loss
Min. : 96.0 Min. :-24.000
1st Qu.: 635.0 1st Qu.: 0.000
Median : 975.0 Median : 7.000
Mean : 928.8 Mean : 9.832
3rd Qu.:1150.0 3rd Qu.: 15.750
Max. :2600.0 Max. : 68.000
NA's :47 NA's :14
Missing values seem to be gone.
inst time status age
Min. : 1.00 Min. : 5.0 Min. :1.000 Min. :39.00
1st Qu.: 3.00 1st Qu.: 174.5 1st Qu.:1.000 1st Qu.:57.00
Median :11.00 Median : 268.0 Median :2.000 Median :64.00
Mean :10.71 Mean : 309.9 Mean :1.719 Mean :62.57
3rd Qu.:15.00 3rd Qu.: 419.5 3rd Qu.:2.000 3rd Qu.:70.00
Max. :32.00 Max. :1022.0 Max. :2.000 Max. :82.00
sex ph.ecog ph.karno pat.karno
Min. :1.000 Min. :0.0000 Min. : 50.00 Min. : 30.00
1st Qu.:1.000 1st Qu.:0.0000 1st Qu.: 70.00 1st Qu.: 70.00
Median :1.000 Median :1.0000 Median : 80.00 Median : 80.00
Mean :1.383 Mean :0.9581 Mean : 82.04 Mean : 79.58
3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.: 90.00 3rd Qu.: 90.00
Max. :2.000 Max. :3.0000 Max. :100.00 Max. :100.00
meal.cal wt.loss
Min. : 96.0 Min. :-24.000
1st Qu.: 619.0 1st Qu.: 0.000
Median : 975.0 Median : 7.000
Mean : 929.1 Mean : 9.719
3rd Qu.:1162.5 3rd Qu.: 15.000
Max. :2600.0 Max. : 68.000
No missing values left.
inst
[1] "inst" "time" "status" "age" "sex"
[6] "ph.ecog" "ph.karno" "pat.karno" "meal.cal" "wt.loss"
status
of 2:“Dot” means “all the other variables”.
summary
of model 1Call:
coxph(formula = Surv(time, status == 2) ~ . - inst - time - status,
data = lung.complete)
n= 167, number of events= 120
coef exp(coef) se(coef) z Pr(>|z|)
age 1.080e-02 1.011e+00 1.160e-02 0.931 0.35168
sex -5.536e-01 5.749e-01 2.016e-01 -2.746 0.00603 **
ph.ecog 7.395e-01 2.095e+00 2.250e-01 3.287 0.00101 **
ph.karno 2.244e-02 1.023e+00 1.123e-02 1.998 0.04575 *
pat.karno -1.207e-02 9.880e-01 8.116e-03 -1.488 0.13685
meal.cal 2.835e-05 1.000e+00 2.594e-04 0.109 0.91298
wt.loss -1.420e-02 9.859e-01 7.766e-03 -1.828 0.06748 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.0109 0.9893 0.9881 1.0341
sex 0.5749 1.7395 0.3872 0.8534
ph.ecog 2.0950 0.4773 1.3479 3.2560
ph.karno 1.0227 0.9778 1.0004 1.0455
pat.karno 0.9880 1.0121 0.9724 1.0038
meal.cal 1.0000 1.0000 0.9995 1.0005
wt.loss 0.9859 1.0143 0.9710 1.0010
Concordance= 0.653 (se = 0.029 )
Likelihood ratio test= 28.16 on 7 df, p=2e-04
Wald test = 27.5 on 7 df, p=3e-04
Score (logrank) test = 28.31 on 7 df, p=2e-04
The three tests of overall significance:
All strongly significant. Something predicts survival.
sex
and ph.ecog
definitely significant here
age
, pat.karno
and meal.cal
definitely not
Take out definitely non-sig variables, and try again.
Take out ph.karno
and wt.loss
as well.
Call:
coxph(formula = Surv(time, status == 2) ~ sex + ph.ecog, data = lung.complete)
n= 167, number of events= 120
coef exp(coef) se(coef) z Pr(>|z|)
sex -0.5101 0.6004 0.1969 -2.591 0.009579 **
ph.ecog 0.4825 1.6201 0.1323 3.647 0.000266 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
sex 0.6004 1.6655 0.4082 0.8832
ph.ecog 1.6201 0.6172 1.2501 2.0998
Concordance= 0.641 (se = 0.031 )
Likelihood ratio test= 19.48 on 2 df, p=6e-05
Wald test = 19.35 on 2 df, p=6e-05
Score (logrank) test = 19.62 on 2 df, p=5e-05
Just OK.
OK (just) to take out those two covariates.
Both remaining variables strongly significant.
Nature of effect on survival time? Consider later.
Picture?
sex
and ph.ecog
score using plot_predictions
time
) to the condition
.sex
:sex = 2
) have better survival than males.ph.ecog
score, but the male-female comparison is the same for any score.ph.ecog
score:ph.ecog
score is better.ph.ecog
scoreph.ecog
score (above and below on each facet), and the effect of males (left) vs. females (right).ph.ecog
scale (compare the red curve on the left facet with the green curve on the right facet).Call:
coxph(formula = Surv(time, status == 2) ~ sex + ph.ecog, data = lung.complete)
n= 167, number of events= 120
coef exp(coef) se(coef) z Pr(>|z|)
sex -0.5101 0.6004 0.1969 -2.591 0.009579 **
ph.ecog 0.4825 1.6201 0.1323 3.647 0.000266 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
sex 0.6004 1.6655 0.4082 0.8832
ph.ecog 1.6201 0.6172 1.2501 2.0998
Concordance= 0.641 (se = 0.031 )
Likelihood ratio test= 19.48 on 2 df, p=6e-05
Wald test = 19.35 on 2 df, p=6e-05
Score (logrank) test = 19.62 on 2 df, p=5e-05
ph.ecog
score goes with a higher hazard of death (positive coef). So patients with a lower score are more likely to survive longer.No problems here:
Call:
coxph(formula = y ~ age, data = d)
n= 9, number of events= 8
coef exp(coef) se(coef) z Pr(>|z|)
age 0.01984 1.02003 0.03446 0.576 0.565
exp(coef) exp(-coef) lower .95 upper .95
age 1.02 0.9804 0.9534 1.091
Concordance= 0.545 (se = 0.105 )
Likelihood ratio test= 0.33 on 1 df, p=0.6
Wald test = 0.33 on 1 df, p=0.6
Score (logrank) test = 0.33 on 1 df, p=0.6
Down-and-up indicates incorrect relationship between age and survival:
Add squared term in age:
Not great, but less problematic than before:
Comments