---
title: "Survival Analysis"
editor:
markdown:
wrap: 72
---
## Survival analysis
- So far, have seen:
- response variable counted or measured (regression)
- response variable categorized (logistic regression)
- But what if response is time until event (eg. time of survival after
surgery)?
- Additional complication: event might not have happened at end of
study (eg. patient still alive). But knowing that patient has "not
died yet" presumably informative. Such data called *censored*.
- Enter *survival analysis*, in particular the "Cox proportional
hazards model".
- Explanatory variables in this context often called *covariates*.
## Packages
- Install package `survival` if not done. Also use `broom` and
`marginaleffects` from earlier.
```{r}
library(tidyverse)
library(survival)
library(broom)
library(marginaleffects)
```
## Example: still dancing?
- 12 women who have just started taking dancing lessons are followed
for up to a year, to see whether they are still taking dancing
lessons, or have quit. The "event" here is "quit".
- This might depend on:
- a treatment (visit to a dance competition)
- woman's age (at start of study).
## Data
\normalsize
```
Months Quit Treatment Age
1 1 0 16
2 1 0 24
2 1 0 18
3 0 0 27
4 1 0 25
7 1 1 26
8 1 1 36
10 1 1 38
10 0 1 45
12 1 1 47
```
\normalsize
## About the data
- `months` and `quit` are kind of combined response:
- `Months` is number of months a woman was actually observed
dancing
- `quit` is 1 if woman quit, 0 if still dancing at end of study.
- Treatment is 1 if woman went to dance competition, 0 otherwise.
- Fit model and see whether `Age` or `Treatment` have effect on
survival.
- Want to do predictions for probabilities of still dancing as they
depend on whatever is significant, and draw plot.
## Read data
- Column-aligned:
\normalsize
```{r bSurvival-2}
url <- "http://ritsokiguess.site/datafiles/dancing.txt"
dance <- read_table(url)
```
\normalsize
## The data
\small
```{r bSurvival-3}
dance
```
\normalsize
## Fit model
- Response variable has to incorporate both the survival time
(`Months`) and whether or not the event, quitting, happened (that
is, if `Quit` is 1).
- This is made using `Surv` from `survival` package, with two inputs:
- the column that has the survival times
- something that is `TRUE` or 1 if the event happened.
- Easiest for us to create this when we fit the model, predicting
response from explanatories:
```{r bSurvival-5 }
dance.1 <- coxph(Surv(Months, Quit) ~ Treatment + Age, data = dance)
```
## Output looks a lot like regression
\scriptsize
```{r bSurvival-6}
summary(dance.1)
```
\normalsize
## Conclusions
- Use $\alpha=0.10$ here since not much data.
- Three tests at bottom like global F-test. Consensus that something
predicts survival time (whether or not dancer quit and/or how long
it took).
- `Age` (definitely), `Treatment` (marginally) both predict survival
time.
## Behind the scenes
- All depends on *hazard rate*, which is based on probability that
event happens in the next short time period, given that event has
not happened yet:
- $X$ denotes time to event, $\delta$ is small time interval:
- $h(t) = P(X \le t + \delta | X \ge t) / \delta$
- if $h(t)$ large, event likely to happen soon (lifetime short)
- if $h(t)$ small, event unlikely to happen soon (lifetime long).
## Modelling lifetime
- want to model hazard rate
- but hazard rate always positive, so actually model *log* of hazard
rate
- modelling how (log-)hazard rate depends on other things eg $X_1 =$
age, $X_2 =$ treatment, with the $\beta$ being regression
coefficients:
- Cox model $h(t)=h_0(t)\exp(\beta_0+\beta_1X_1+\beta_2 X_2+\cdots)$,
or:
- $\log(h(t)) = \log(h_0(t)) + \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots$
- like a generalized linear model with log link.
## Predictions with `marginaleffects`
- `marginaleffects` knows about survival models (with sufficient care)
- Predicted survival probabilities depend on:
- the combination of explanatory variables you are looking at
- the time at which you are looking at them (when more time has
passed, it is more likely that the event has happened, so the
"survival probability" should be lower).
- look at effect of age by comparing ages 20 and 40, and later look at
the effect of treatment (values 1 and 0).
- Also have to provide some times to predict for, in `Months`.
## Effect of age
```{r}
new <- datagrid(model = dance.1, Age = c(20, 40), Months = c(3, 5, 7))
new
```
These are actually for women who *did not* go to the dance competition.
## The predictions
```{r}
cbind(predictions(dance.1, newdata = new, type = "survival")) %>%
select(Age, Treatment, Months, estimate)
```
The estimated survival probabilities go down over time. For example a
20-year-old woman here has estimated probability 0.0293 of still dancing
after 5 months.
## A graph
We can plot the predictions over time for an experimental condition such
as age. The key for `plot_predictions` is to put time *first* in the
`condition`:
```{r}
plot_predictions(dance.1, condition = c("Months", "Age"),
type = "survival")
```
## Comments
- The plot picks some representative ages.
- It is (usually) best to be up and to the right (has the highest
chance of surviving longest).
- Hence the oldest women have the best chance to still be dancing
longest (the youngest women are most likely to quit soonest).
## The effect of treatment
The same procedure will get predictions for women who did or did not go
to the dance competition, at various times:
```{r}
new <- datagrid(model = dance.1, Treatment = c(0, 1), Months = c(3, 5, 7))
new
```
The age used for predictions is the mean of all ages.
## The predictions
```{r}
cbind(predictions(dance.1, newdata = new, type = "survival")) %>%
select(Age, Treatment, Months, estimate)
```
Women of this age have a high (0.879) chance of still dancing after 7
months if they went to the dance competition, but much lower (0.165) if
they did not.
## A graph
In `condition`, put the time variable first, and then the effect of
interest:
```{r}
plot_predictions(dance.1, condition = c("Months", "Treatment"),
type = "survival")
```
## Comments
- The survival curve for Treatment 1 is higher all the way along
- Hence at any time, the women who went to the dance competition have
a higher chance of still dancing than those who did not.
## The model summary again
```{r}
summary(dance.1)
```
## Comments
- The numbers in the `coef` column describe effect of that variable on
log-hazard of quitting.
- Both numbers are negative, so a higher value on both variables goes
with a lower hazard of quitting:
- an older woman is less likely to quit soon (more likely to be
still dancing)
- a woman who went to the dance competition (`Treatment = 1`) is
less likely to quit soon vs. a woman who didn't (more likely to
be still dancing).
## Model checking
- With regression, usually plot residuals against fitted values.
- Not quite same here (nonlinear model), but "martingale residuals"
should have no pattern vs. "linear predictor".
- Use `broom` ideas to get them, in `.resid` and `.fitted` as below.
- Martingale residuals can go very negative, so won't always look
normal.
## Martingale residuals
```{r}
dance.1 %>% augment(dance) %>%
ggplot(aes(x = .fitted, y = .resid)) + geom_point() + geom_smooth()
```
## A more realistic example: lung cancer
- When you load in an R package, get data sets to illustrate functions
in the package.
- One such is `lung`. Data set measuring survival in patients with
advanced lung cancer.
- Along with survival time, number of "performance scores" included,
measuring how well patients can perform daily activities.
- Sometimes high good, but sometimes bad!
- Variables below, from the data set help file (`?lung`).
## The variables
![](lung-cancer-data.png)
## Uh oh, missing values
\scriptsize
```{r bSurvival-13}
lung
```
\normalsize
## A closer look
```{r bSurvival-14, echo=F}
options(width = 70)
```
\tiny
```{r bSurvival-15}
summary(lung)
```
\normalsize
## Remove obs with *any* missing values
```{r bSurvival-16}
lung %>% drop_na() -> lung.complete
lung.complete %>%
select(meal.cal:wt.loss) %>%
slice(1:10)
```
Missing values seem to be gone.
## Check!
```{r bSurvival-17}
summary(lung.complete)
```
No missing values left.
## Model 1: use everything except `inst`
\footnotesize
```{r bSurvival-18}
names(lung.complete)
```
\normalsize
- Event was death, goes with `status` of 2:
```{r bSurvival-19 }
lung.1 <- coxph(Surv(time, status == 2) ~ . - inst - time - status,
data = lung.complete
)
```
"Dot" means "all the other variables".
## `summary` of model 1
\tiny
```{r bSurvival-20}
summary(lung.1)
```
\normalsize
## Overall significance
The three tests of overall significance: \small
```{r bSurvival-21}
glance(lung.1) %>% select(starts_with("p.value"))
```
\normalsize
All strongly significant. *Something* predicts survival.
## Coefficients for model 1
\small
```{r bSurvival-22 }
tidy(lung.1) %>% select(term, p.value) %>% arrange(p.value)
```
\normalsize
- `sex` and `ph.ecog` definitely significant here
- `age`, `pat.karno` and `meal.cal` definitely not
- Take out definitely non-sig variables, and try again.
## Model 2
\normalsize
```{r bSurvival-23}
lung.2 <- update(lung.1, . ~ . - age - pat.karno - meal.cal)
tidy(lung.2) %>% select(term, p.value)
```
\normalsize
## Compare with first model:
\normalsize
```{r bSurvival-24}
anova(lung.2, lung.1)
```
\normalsize
- No harm in taking out those variables.
## Model 3
Take out `ph.karno` and `wt.loss` as well.
```{r bSurvival-25}
lung.3 <- update(lung.2, . ~ . - ph.karno - wt.loss)
```
```{r tidy-lung-3}
tidy(lung.3) %>% select(term, estimate, p.value)
summary(lung.3)
```
## Check whether that was OK
```{r bSurvival-26}
anova(lung.3, lung.2)
```
*Just* OK.
## Commentary
- OK (just) to take out those two covariates.
- Both remaining variables strongly significant.
- Nature of effect on survival time? Consider later.
- Picture?
## Plotting survival probabilities
- Assess (separately) the effect of `sex` and `ph.ecog` score using
`plot_predictions`
- Don't forget to add time (here actually called `time`) to the
`condition`.
## Effect of `sex`:
```{r}
plot_predictions(lung.3, condition = c("time", "sex"), type = "survival")
```
- Females (`sex = 2`) have better survival than males.
- This graph from a mean `ph.ecog` score, but the male-female
comparison is the same for any score.
## Effect of `ph.ecog` score:
```{r}
plot_predictions(lung.3, condition = c("time", "ph.ecog"), type = "survival")
```
## Comments
- A lower `ph.ecog` score is better.
- For example, a patient with a score of 0 has almost a 50-50 chance
of living 500 days, but a patient with a score of 3 has almost no
chance to survive that long.
- Is this for males or females? See over. (The comparison of scores is
the same for both.)
## Sex and `ph.ecog` score
```{r}
plot_predictions(lung.3, condition = c("time", "ph.ecog", "sex"), type = "survival")
```
## Comments
- I think the previous graph was males.
- This pair of graphs shows the effect of `ph.ecog` score (above and
below on each facet), and the effect of males (left) vs. females
(right).
- The difference between males and females is about the same as 1
point on the `ph.ecog` scale (compare the red curve on the left
facet with the green curve on the right facet).
## The summary again
```{r}
summary(lung.3)
```
## Comments
- A higher-numbered sex (female) has a lower hazard of death (negative
coef). That is, females are more likely to survive longer than
males.
- A higher `ph.ecog` score goes with a *higher* hazard of death
(positive coef). So patients with a *lower* score are more likely to
survive longer.
- These are consistent with the graphs we drew.
## Martingale residuals for this model
No problems here:
```{r bSurvival-32, fig.height=3}
lung.3 %>% augment(lung.complete) %>%
ggplot(aes(x = .fitted, y = .resid)) + geom_point() + geom_smooth()
```
## When the Cox model fails (optional)
- Invent some data where survival is best at middling age, and worse
at high *and* low age:
```{r bSurvival-33 }
age <- seq(20, 60, 5)
survtime <- c(10, 12, 11, 21, 15, 20, 8, 9, 11)
stat <- c(1, 1, 1, 1, 0, 1, 1, 1, 1)
d <- tibble(age, survtime, stat)
d %>% mutate(y = Surv(survtime, stat)) -> d
```
- Small survival time 15 in middle was actually censored, so would
have been longer if observed.
## Fit Cox model
\footnotesize
```{r bSurvival-34 }
y.1 <- coxph(y ~ age, data = d)
summary(y.1)
```
\normalsize
## Martingale residuals
Down-and-up indicates incorrect relationship between age and survival:
```{r bSurvival-35, fig.height=3.4, message=F}
#| warning = FALSE
y.1 %>% augment(d) %>%
ggplot(aes(x = .fitted, y = .resid)) + geom_point() + geom_smooth()
```
## Attempt 2
Add squared term in age:
```{r bSurvival-36}
y.2 <- coxph(y ~ age + I(age^2), data = d)
tidy(y.2) %>% select(term, estimate, p.value)
```
- (Marginally) helpful.
## Martingale residuals this time
Not great, but less problematic than before:
```{r bSurvival-37, fig.height=3.2, message=F}
#| warning = FALSE
y.2 %>% augment(d) %>%
ggplot(aes(x = .fitted, y = .resid)) + geom_point() + geom_smooth()
```