Chapter 29 Survival analysis
library(survival)
library(tidyverse)
library(survminer)
29.1 The Worcester survey
The Worcester survey was a long-term study of all myocardial-infarction36 victims admitted to hospitals in the Worcester, Massachusetts area.37 The data have been well studied, and can be found in the file link.
- Read the data and display the first few rows of the data frame. You might get an extra column, which you can ignore. For your information, the variables are:
patient ID code
admission date
date of last followup (this is the date of death if the patient died)
length of hospital stay (days)
followup time (days) (time between admission and last followup)
followup status: 1=dead, 0=alive
Age in years (at admission)
gender (0=male, 1=female)
body mass index (kg/m\(^2\))
Create a suitable response variable for a Cox proportional hazards model for time of survival, using the followup time and followup status.
Fit a Cox proportional hazards model predicting survival time from age, gender and BMI. Obtain the
summary
(but you don’t need to comment on it yet).Test the overall fit of the model. What does the result mean?
Can any of your explanatory variables be removed from the model? Explain briefly.
Remove your most non-significant explanatory variable from the model and fit again. Take a look at the results. Are all your remaining explanatory variables significant? (If all your explanatory variables were previously significant, you can skip this part.)
Calculate the 1st quartile, median, and 3rd quartiles of age and BMI. (
quantile
.) Round these off to the nearest whole number. (Do the rounding off yourself, though R has a functionround
that does this, which you can investigate if you want.) As an alternative, you can get these by passing the whole data frame, or the columns of it you want, intosummary
.Make a data frame out of all the combinations of age and BMI values (that you obtained in the previous part) suitable for predicting with.
Obtain predicted survival probabilities for each of the values in your new data frame. Use your best model. (You don’t need to look at the results, though you can if you want to.)
Make a graph depicting the survival curves from
survfit
with different colours distinguishing the different survival curves.What is the effect of age on survival? What is the effect of BMI on survival? Explain briefly. (You will have to disentangle the meaning of the different coloured lines on the plot to do this.)
29.2 Drug treatment programs
One of the goals of drug treatment programs is to lengthen the time until the patient returns to using drugs. (It is not generally possible to prevent patients from ever using drugs again.) In one study, over 600 former drug users took part. Two different programs, a short program and a long program, were offered at two different sites, labelled A and B. The data can be found in link. The variables are these:
ID
: patient ID numberage
: patient age at enrollment into the studyndrugtx
: number of previous drug treatmentstreat
: 0 for short treatment program, 1 for long programsite
: 0 for site A, 1 for site Btime
: time until return to drug usecensor
: whether the subject returned to drug use (1) or not (0) during the follow-up periodherco
: whether subject used heroine or cocaine in the last 3 months: 1 is both, 2 is one (either heroine or cocaine), 3 is neither.
Read in the data and check in one way or another that you have what was promised above.
There are some missing values in the dataframe. Demonstrate this using
summary
. Pipe the dataframe intodrop_na
and show that they have gone. (drop_na
removes all rows that have missing values in them.)Some of these variables are recorded as numbers but are actually categorical. Which ones? Re-define these variables in your data frame so that they have sensible (text) values.
Create a suitable reponse variable for a Cox proportional hazards regression that predicts time until return to drug use from the other variables. This requires some care, because you need to be sure about what the censoring variable actually represents and what you need it to represent.
Look at the first few values of your response variable. Why is the fifth one marked with a
+
? Explain briefly.Fit a Cox proportional hazards model, predicting from all the other variables (except for
row
andID
) that you haven’t used yet. Display the results.Find which explanatory variables can be removed at \(\alpha=0.05\) (there should be two of them). Bear in mind that we have categorical variables, so that looking at the output from
summary
is not enough.Remove all the non-significant explanatory variables and re-fit your model. By carrying out a suitable test demonstrate that your smaller model is the better one.
* Display your better model. Are all of the explanatory variables significant? Do their slope coefficients have sensible signs (plus or minus), based on what you know or can guess about drug treatments? Explain briefly.
We have three variables left in our model,
age
,ndrugtx
andtreat
. The quartiles of age are 27 and 37, the quartiles ofndrugtx
are 1 and 6, and the two possible values oftreat
areshort
andlong
. Create a data frame with variables of these names and all possible combinations of their values (so there should be 8 rows in the resulting data frame). Display the resulting data frame.Obtain predicted survival probabilities for each of the values of
age
,ndrugtx
andtreat
used in the previous part. You don’t need to display it (we are going to plot it shortly).Plot your predicted survival curves.
Which of your combinations of values is predicted to take the longest to return to drug use? Which is predicted to take the shortest time? Explain briefly.
Are your survival curve plot and your conclusions from part (here) consistent, or not? Explain briefly.
29.3 Multiple myeloma
Multiple myeloma is a kind of cancer. It forms in a plasma cell (which is a type of white blood cell). It causes cancer cells to accumulate in the bone marrow, where they crowd out healthy blood cells. Plasma cells make antibodies (to help fight infections), while the cancer cells don’t: they produce abnormal proteins that can cause kidney problems. (This adapted from link.) The variables are:
time
: survival time from diagnosis (months)vstatus
: 0=alive, 1=dead at end of studylogbun
: log of BUN test score (BUN test is a test of kidney function, not to be confused with cha siu bao38).hgb
: hemoglobin (at diagnosis).platelet
: platelets: 1=normal, 0=abnormal (at diagnosis).age
at diagnosis, in yearslogwbc
: log of WBC (white blood cell count, at diagnosis)frac
: fractures at diagnosis (0=absent, 1=present)logpbm
: log of percent of plasma cells in bone marrowprotein
: proteinuria (protein in urine) at diagnosis. Most people have very little, so a larger than normal amount indicates illness of some kind.scalc
: serum calcium at diagnosis.
The data, on 65 patients with multiple myeloma, are in link. Some of the variables are logs because they could take very large values.
There are a lot of parts here, but each part is supposed to be short.
Read in the data and display (some of) the values. Confirm that you have the right number of observations and the right variables.
Create a suitable response variable for a Cox proportional-hazards survival model, bearing in mind that the “event” here is death. Display your response variable, and explain briefly what the
+
signs attached to some of the values mean, without using a technical term.What is the technical term for those patients that have a
+
by their values for the response variable?Fit a Cox proportional-hazards survival model predicting your response variable from all the other variables (except for the ones that you used to make the response variable). Display the
summary
of your model.In your model, which explanatory variables have a P-value less than 0.10? Fit a model containing only those and display the results.
Do a test to compare the two models that you fit. Why do you prefer the second model? Explain briefly.
There should be two explanatory variables left in your model. These are both numerical variables. Find their first and third quartiles, any way you like.
Create a data frame containing all possible combinations of the two quartiles for each of the two variables, and display the result.
Obtain predicted survival probabilities for each of the combinations of variables you created above. You don’t need to look at the results (they are rather long).
Obtain a graph of the predicted survival curves for each combination of your variables.
Is it better to have high or low values for each of the variables in your prediction? Explain briefly.
29.4 Ovarian cancer
R’s survival
package contains several data
sets. One of these is called ovarian
; it comes from a study
of 26 ovarian cancer patients. The major purpose of this study was to
compare the effects of two treatments on survival time.
Obtain and display (all of) the data set. This is as simple as loading the package and typing the data set’s name.
The columns of interest to us are:
futime
: the time for which a patient was followed-up: the number of days until either they died or the study ended (or they withdrew from the study for some other reason).fustat
: follow-up status: 1 if the patient died of ovarian cancer, 0 if they were still alive when the study ended.age
: of patient, at diagnosis, in yearsrx
: treatment, numbered 1 or 2, but really labels for the two treatments.
Create and display a suitable response variable y
for a Cox
proportional-hazards model.
In the display of your response variable, some values are marked with a
+
. Why is that? Explain briefly. (If you use a technical term, you should explain what it means.)Fit a Cox proportional-hazards model for predicting survival time from age and treatment. Note that the numeric values for treatment make sense only as labels for the two treatments, so in your model formula make treatment into a factor. Display the results.
Is there a significant difference between the treatments in terms of their effects on survival (from ovarian cancer)?
Is there a significant effect of age? If there is, describe the effect that age has on survival.
Make a martingale residual plot for this model. Do you see any problems? Explain briefly.
Find the quartiles of
age
, and make a data frame containing all combinations of those two ages and the two treatments. Display what you have. (Feel free to copy the values by hand, rather than trying to save them and use them.)Obtain predicted survival probabilities for each of your age-treatment combinations, for each of a variety of survival times. (This is only one thing, despite it sounding like a lot.)
Draw a plot that compares the survival probabilities at the different times.
According to your plot, how would you describe the effects of treatment and of age?
My solutions follow:
29.5 The Worcester survey
The Worcester survey was a long-term study of all myocardial-infarction39 victims admitted to hospitals in the Worcester, Massachusetts area.40 The data have been well studied, and can be found in the file link.
- Read the data and display the first few rows of the data frame. You might get an extra column, which you can ignore. For your information, the variables are:
patient ID code
admission date
date of last followup (this is the date of death if the patient died)
length of hospital stay (days)
followup time (days) (time between admission and last followup)
followup status: 1=dead, 0=alive
Age in years (at admission)
gender (0=male, 1=female)
body mass index (kg/m\(^2\))
Solution
<- "http://ritsokiguess.site/datafiles/whas100.csv"
my_url <- read_csv(my_url) whas100
## New names:
## Rows: 100 Columns: 10
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (2): admitdate, foldate dbl (8): ...1, id, los, lenfol, fstat, age, gender, bmi
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
whas100
## # A tibble: 100 × 10
## ...1 id admitdate foldate los lenfol fstat age gender bmi
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 3/13/1995 3/19/1995 4 6 1 65 0 31.4
## 2 2 2 1/14/1995 1/23/1996 5 374 1 88 1 22.7
## 3 3 3 2/17/1995 10/4/2001 5 2421 1 77 0 27.9
## 4 4 4 4/7/1995 7/14/1995 9 98 1 81 1 21.5
## 5 5 5 2/9/1995 5/29/1998 4 1205 1 78 0 30.7
## 6 6 6 1/16/1995 9/11/2000 7 2065 1 82 1 26.5
## 7 7 7 1/17/1995 10/15/1997 3 1002 1 66 1 35.7
## 8 8 8 11/15/1994 11/24/2000 56 2201 1 81 1 28.3
## 9 9 9 8/18/1995 2/23/1996 5 189 1 76 0 27.1
## 10 10 10 7/22/1995 12/31/2002 9 2719 0 40 0 21.8
## # … with 90 more rows
I seem to have an extra column called X1
. This is because I
saved my version of the data using the old write.csv
, which
comes with row names, and I forgot to
get rid of them. These came back as an extra unnamed variable to which
read_delim
gave the name X1
.
\(\blacksquare\)
- Create a suitable response variable for a Cox proportional hazards model for time of survival, using the followup time and followup status.
Solution
Surv
. The event here is death, so the two parts of the
response variable are followup time lenfol
and followup
status, 1 being “dead”, fstat
:
<- with(whas100, Surv(lenfol, fstat == 1))
y y
## [1] 6 374 2421 98 1205 2065 1002 2201 189 2719+ 2638+ 492
## [13] 302 2574+ 2610+ 2641+ 1669 2624 2578+ 2595+ 123 2613+ 774 2012
## [25] 2573+ 1874 2631+ 1907 538 104 6 1401 2710 841 148 2137+
## [37] 2190+ 2173+ 461 2114+ 2157+ 2054+ 2124+ 2137+ 2031 2003+ 2074+ 274
## [49] 1984+ 1993+ 1939+ 1172 89 128 1939+ 14 1011 1497 1929+ 2084+
## [61] 107 451 2183+ 1876+ 936 363 1048 1889+ 2072+ 1879+ 1870+ 1859+
## [73] 2052+ 1846+ 2061+ 1912+ 1836+ 114 1557 1278 1836+ 1916+ 1934+ 1923+
## [85] 44 1922+ 274 1860+ 1806 2145+ 182 2013+ 2174+ 1624 187 1883+
## [97] 1577 62 1969+ 1054
Just using fstat
alone as the second thing in Surv
also works, because anything that gives TRUE
or 1 when the
event (death) occurs is equally good. (In R, TRUE
as a number
is 1 and FALSE
as a number is 0.)
I listed the values by way of checking. The ones with a +
are
censored: that is, the patient was still alive the last time the
doctor saw them. Most of the censored values are longer times. Usually
this happens because the patient was still alive at the end of the study.
\(\blacksquare\)
- Fit a Cox proportional hazards model predicting survival time
from age, gender and BMI. Obtain the
summary
(but you don’t need to comment on it yet).
Solution
This, using the response variable that we just created:
.1 <- coxph(y ~ age + gender + bmi, data = whas100)
whas100summary(whas100.1)
## Call:
## coxph(formula = y ~ age + gender + bmi, data = whas100)
##
## n= 100, number of events= 51
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.03713 1.03783 0.01272 2.918 0.00352 **
## gender 0.14325 1.15402 0.30604 0.468 0.63973
## bmi -0.07083 0.93162 0.03607 -1.964 0.04956 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.0378 0.9636 1.0123 1.0640
## gender 1.1540 0.8665 0.6334 2.1024
## bmi 0.9316 1.0734 0.8680 0.9999
##
## Concordance= 0.683 (se = 0.037 )
## Likelihood ratio test= 21.54 on 3 df, p=8e-05
## Wald test = 19.46 on 3 df, p=2e-04
## Score (logrank) test = 20.82 on 3 df, p=1e-04
\(\blacksquare\)
- Test the overall fit of the model. What does the result mean?
Solution
Look at those three P-values at the bottom. They are all small, so something in the model is helping to predict survival. As to what? Well, that’s the next part.
\(\blacksquare\)
- Can any of your explanatory variables be removed from the model? Explain briefly.
Solution
gender
has a (very) large P-value, so that can be taken
out of the model. The other two variables have small P-values
(bmi
only just under 0.05), so they need to stay.
The other way to think about this is step
, or drop1
:
drop1(whas100.1, test = "Chisq")
## Single term deletions
##
## Model:
## y ~ age + gender + bmi
## Df AIC LRT Pr(>Chi)
## <none> 402.66
## age 1 410.25 9.5972 0.001949 **
## gender 1 400.87 0.2185 0.640218
## bmi 1 404.60 3.9389 0.047182 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This is here equivalent to41 the output
from summary
, but where it
scores is if you have a categorical explanatory variable like
“treatment” with more than two levels: drop1
will tell you
about keeping or dropping it as a whole.42
If you prefer:
step(whas100.1, trace = 0, test = "Chisq")
## Call:
## coxph(formula = y ~ age + bmi, data = whas100)
##
## coef exp(coef) se(coef) z p
## age 0.03927 1.04005 0.01187 3.309 0.000938
## bmi -0.07116 0.93131 0.03614 -1.969 0.048952
##
## Likelihood ratio test=21.32 on 2 df, p=2.346e-05
## n= 100, number of events= 51
gender
comes out, but the others stay. As usual, put
trace=1
or trace=2
to get more output, which will
look like a sequence of drop1
’s one after the other.
\(\blacksquare\)
- Remove your most non-significant explanatory variable from the model and fit again. Take a look at the results. Are all your remaining explanatory variables significant? (If all your explanatory variables were previously significant, you can skip this part.)
Solution
So, take out gender
:
.2 <- update(whas100.1, . ~ . - gender)
whas100summary(whas100.2)
## Call:
## coxph(formula = y ~ age + bmi, data = whas100)
##
## n= 100, number of events= 51
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.03927 1.04005 0.01187 3.309 0.000938 ***
## bmi -0.07116 0.93131 0.03614 -1.969 0.048952 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.0401 0.9615 1.0161 1.0645
## bmi 0.9313 1.0738 0.8676 0.9997
##
## Concordance= 0.681 (se = 0.037 )
## Likelihood ratio test= 21.32 on 2 df, p=2e-05
## Wald test = 19 on 2 df, p=7e-05
## Score (logrank) test = 19.99 on 2 df, p=5e-05
Both explanatory variables are significant: age
definitely,
bmi
only just. This is the same model as step
gave me.
\(\blacksquare\)
- Calculate the 1st quartile, median, and 3rd quartiles of
age and BMI. (
quantile
.) Round these off to the nearest whole number. (Do the rounding off yourself, though R has a functionround
that does this, which you can investigate if you want.) As an alternative, you can get these by passing the whole data frame, or the columns of it you want, intosummary
.
Solution
quantile(whas100$age)
## 0% 25% 50% 75% 100%
## 32.00 59.75 71.00 80.25 92.00
quantile(whas100$bmi)
## 0% 25% 50% 75% 100%
## 14.91878 23.53717 27.19158 30.34770 39.93835
or
%>%
whas100 select(age, bmi) %>%
summary()
## age bmi
## Min. :32.00 Min. :14.92
## 1st Qu.:59.75 1st Qu.:23.54
## Median :71.00 Median :27.19
## Mean :68.25 Mean :27.04
## 3rd Qu.:80.25 3rd Qu.:30.35
## Max. :92.00 Max. :39.94
Or, pure tidyverse: summarize all the columns (after you’ve done the select
).
%>%
whas100 select(age, bmi) %>%
summarize(across(everything(), \(x) list(quantile(x)))) %>%
unnest()
## Warning: `cols` is now required when using unnest().
## Please use `cols = c(age, bmi)`
## # A tibble: 5 × 2
## age bmi
## <dbl> <dbl>
## 1 32 14.9
## 2 59.8 23.5
## 3 71 27.2
## 4 80.2 30.3
## 5 92 39.9
The reason for the list()
and the unnest()
is that
quantile
returns five numbers rather than just one, so it has
to go into the output data frame as a list-column, which then gets
broken apart with unnest
so that you can see it. Try it
without the unnest()
line to see what happens if you don’t do
that.
Using whichever of this multitude of ways appeals to you:
60, 71 and 80 for age, 24, 27 and 30 for BMI.
\(\blacksquare\)
- Make a data frame out of all the combinations of age and BMI values (that you obtained in the previous part) suitable for predicting with.
Solution
The inevitable crossing
. This is probably quickest:
<- crossing(age = c(60, 71, 80), bmi = c(24, 27, 30))
whas100.new whas100.new
## # A tibble: 9 × 2
## age bmi
## <dbl> <dbl>
## 1 60 24
## 2 60 27
## 3 60 30
## 4 71 24
## 5 71 27
## 6 71 30
## 7 80 24
## 8 80 27
## 9 80 30
Or, with some setup beforehand to make the crossing
clearer. This is my no-think approach, which is what I recommend;
vectors with plural names containing the values, and inside the
crossing
, a singular name equals a plural one:
<- c(60, 71, 80)
ages <- c(24, 27, 30)
bmis <- crossing(age = ages, bmi = bmis)
whas100.new whas100.new
## # A tibble: 9 × 2
## age bmi
## <dbl> <dbl>
## 1 60 24
## 2 60 27
## 3 60 30
## 4 71 24
## 5 71 27
## 6 71 30
## 7 80 24
## 8 80 27
## 9 80 30
\(\blacksquare\)
- Obtain predicted survival probabilities for each of the values in your new data frame. Use your best model. (You don’t need to look at the results, though you can if you want to.)
Solution
The magic word is survfit
(which plays the role of
predict
here). The best model is whas100.2
,
with the non-significant gender
removed:
<- survfit(whas100.2, whas100.new, data = whas100) pp2
This doesn’t need the data=
at the end (it works perfectly
well without), but the plot (later) seems to need it to be there. I
think the plot needs the information from the original data to be in
the predictions somewhere.
This is kind of long to look at (summary(pp2)
would be the
thing), so we will need to make a graph of it. I gave it a name, since
I want to use it again later.
\(\blacksquare\)
- Make a graph depicting the survival curves from
survfit
with different colours distinguishing the different survival curves.
Solution
This is actually easy once you work out what to do:
ggsurvplot(pp2, conf.int = FALSE)
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## ℹ Please use `gather()` instead.
## ℹ The deprecated feature was likely used in the survminer package.
## Please report the issue at <]8;;https://github.com/kassambara/survminer/issueshttps://github.com/kassambara/survminer/issues]8;;>.
Without the conf.int
thing, you get confidence intervals for
each survival curve, which overlap each other and generally make the
plot look messy.
The “strata” are the different age-BMI combinations that you predicted for, so it’s usually a good idea to list the “new” prediction data frame, either here or when you assess the effects of the variables (next part) so that you can see which is which:
whas100.new
## # A tibble: 9 × 2
## age bmi
## <dbl> <dbl>
## 1 60 24
## 2 60 27
## 3 60 30
## 4 71 24
## 5 71 27
## 6 71 30
## 7 80 24
## 8 80 27
## 9 80 30
\(\blacksquare\)
- What is the effect of age on survival? What is the effect of BMI on survival? Explain briefly. (You will have to disentangle the meaning of the different coloured lines on the plot to do this.)
Solution
Bear in mind that up-and-to-the-right is best for a survival curve, since that means that people in the upper-right group have a higher chance of surviving for longer.
The best survival curve is therefore the olive-green one. According
to the legend, this goes with stratum 3, which is (according to
the listing of whas100.new
) age 60 (the youngest) and BMI
30 (the highest). So it looks as if the best survival goes with a
lower age (not surprising) and a higher BMI (surprising; see
discussion about BMI below).
You can also leave one variable constant and see the effects of changing the other one. Let’s pick the oldest age 80: the BMI values are 24 (stratum 7, blue), 27 (stratum 8, purple), 30 (stratum 9, pink). These survival curves are the bottom one, the second bottom one, and the fourth bottom one. At this age, survival chances are not great, but having a higher BMI goes with a greater chance of surviving longer.
Or pick a BMI, say 30. These are strata 3 (olive green), 6 (light blue) and 9 (pink) respectively for ages 60, 71 and 80. These are the best, 3rd best and 5th best survival curves; that is, as age increases, the chance of surviving a long time decreases.
The effect of BMI, though, seems backwards: a higher BMI is associated with a higher chance of survival.
That’s the end of what I wanted you to do, but:
A higher BMI is usually associated with being obese (and therefore unhealthy), so you’d expect the effect of BMI to be the other way around. According to Wikipedia (link), the BMI values here are “overweight” or close to it. Maybe being heavier helps the body recover from a heart attack.
Let’s start with the martingale residual plot:
ggcoxdiagnostics(whas100.2) + geom_smooth()
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
There is a suspicion of bendiness here, though the left side of the curve is entirely because of that one positive residual on the left. In any case, this suggests that nonlinearity (evidently in terms of BMI, since that’s the relationship that currently makes no sense) would be worth exploring.
Thus:
.3 <- update(whas100.2, . ~ . + I(bmi^2))
whas100summary(whas100.3)
## Call:
## coxph(formula = y ~ age + bmi + I(bmi^2), data = whas100)
##
## n= 100, number of events= 51
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.040542 1.041375 0.012035 3.369 0.000755 ***
## bmi -0.848949 0.427864 0.231562 -3.666 0.000246 ***
## I(bmi^2) 0.014500 1.014606 0.004227 3.430 0.000603 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.0414 0.9603 1.0171 1.0662
## bmi 0.4279 2.3372 0.2718 0.6736
## I(bmi^2) 1.0146 0.9856 1.0062 1.0230
##
## Concordance= 0.693 (se = 0.04 )
## Likelihood ratio test= 30.71 on 3 df, p=1e-06
## Wald test = 32.56 on 3 df, p=4e-07
## Score (logrank) test = 36.57 on 3 df, p=6e-08
Ah, that seems to be it. The significant positive coefficient on
bmi
-squared
means that the “hazard of dying” increases faster with increasing
bmi
, so there ought to be an optimal BMI beyond which
survival chances decrease again.
Have we improved the residuals by adding the squared term?
ggcoxdiagnostics(whas100.3) + geom_smooth()
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
I call those “inconsequential wiggles” now, so I think we are good. Let’s explore the quadratic relationship on a graph.
I’m going to focus on a close-to-median age of 70, since, in this model, the effect of BMI is the same for all ages (to make it different, we would need an interaction term, ANOVA-style).
First we create a data frame with a bunch of different BMIs in, and one age 70:
<- seq(20, 36, 4)
bmis <- 70
ages .2 <- crossing(bmi = bmis, age = ages)
whas100.new.2 whas100.new
## # A tibble: 5 × 2
## bmi age
## <dbl> <dbl>
## 1 20 70
## 2 24 70
## 3 28 70
## 4 32 70
## 5 36 70
It is rather absurd to have a plural ages
with only one age
in it, but that’s the way it goes, if you’re me and trying to avoid thinking.
Predictions, using the model with the squared term in it:
<- survfit(whas100.3, whas100.new.2, data = whas100) pp3
And then the plot:
ggsurvplot(pp3, conf.int = F)
and the customary reminder of which stratum is which, with its rather ungainly name:
.2 whas100.new
## # A tibble: 5 × 2
## bmi age
## <dbl> <dbl>
## 1 20 70
## 2 24 70
## 3 28 70
## 4 32 70
## 5 36 70
This time, the green survival curve is best, stratum 3, which means that survival is best at BMI 28, and worse for both higher BMIs and lower BMIs. You can follow the sequence of colours: red, olive-green, green, blue, pink, that goes up and then down again. But it’s still true that having a very low BMI is worst, which is why our (linear) model said that having a higher BMI was better.
It would have been better to have you put a squared term in the model, but the question was already long and complicated enough, and I didn’t want to make your lives more of a nightmare than they are already becoming!
\(\blacksquare\)
29.6 Drug treatment programs
One of the goals of drug treatment programs is to lengthen the time until the patient returns to using drugs. (It is not generally possible to prevent patients from ever using drugs again.) In one study, over 600 former drug users took part. Two different programs, a short program and a long program, were offered at two different sites, labelled A and B. The data can be found in link. The variables are these:
ID
: patient ID numberage
: patient age at enrollment into the studyndrugtx
: number of previous drug treatmentstreat
: 0 for short treatment program, 1 for long programsite
: 0 for site A, 1 for site Btime
: time until return to drug usecensor
: whether the subject returned to drug use (1) or not (0) during the follow-up periodherco
: whether subject used heroine or cocaine in the last 3 months: 1 is both, 2 is one (either heroine or cocaine), 3 is neither.
- Read in the data and check in one way or another that you have what was promised above.
Solution
This:
<- "http://ritsokiguess.site/datafiles/drugusers.txt"
my_url <- read_delim(my_url, " ") drugusers
## Rows: 628 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: " "
## dbl (9): row, ID, age, ndrugtx, treat, site, time, censor, herco
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
drugusers
## # A tibble: 628 × 9
## row ID age ndrugtx treat site time censor herco
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 39 1 1 0 188 1 3
## 2 2 2 33 8 1 0 26 1 3
## 3 3 3 33 3 1 0 207 1 2
## 4 4 4 32 1 0 0 144 1 3
## 5 5 5 24 5 1 0 551 0 2
## 6 6 6 30 1 1 0 32 1 1
## 7 7 7 39 34 1 0 459 1 3
## 8 8 8 27 2 1 0 22 1 3
## 9 9 9 40 3 1 0 210 1 2
## 10 10 10 36 7 1 0 184 1 2
## # … with 618 more rows
This shows that you have over 600 rows and the variables described.
\(\blacksquare\)
- There are some missing values in the dataframe. Demonstrate
this using
summary
. Pipe the dataframe intodrop_na
and show that they have gone. (drop_na
removes all rows that have missing values in them.)
Solution
First off, summary
is a quick way to show how many missing
values there are:43
summary(drugusers)
## row ID age ndrugtx
## Min. : 1.0 Min. : 1.0 Min. :20.00 Min. : 0.000
## 1st Qu.:157.8 1st Qu.:157.8 1st Qu.:27.00 1st Qu.: 1.000
## Median :314.5 Median :314.5 Median :32.00 Median : 3.000
## Mean :314.5 Mean :314.5 Mean :32.37 Mean : 4.574
## 3rd Qu.:471.2 3rd Qu.:471.2 3rd Qu.:37.00 3rd Qu.: 6.000
## Max. :628.0 Max. :628.0 Max. :56.00 Max. :40.000
## NA's :5 NA's :17
## treat site time censor
## Min. :0.0000 Min. :0.000 Min. : 2.0 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.: 79.0 1st Qu.:1.0000
## Median :0.0000 Median :0.000 Median : 166.0 Median :1.0000
## Mean :0.4904 Mean :0.293 Mean : 234.7 Mean :0.8089
## 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.: 365.2 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.000 Max. :1172.0 Max. :1.0000
##
## herco
## Min. :1.000
## 1st Qu.:1.000
## Median :2.000
## Mean :1.898
## 3rd Qu.:3.000
## Max. :3.000
##
Age has five missing values and “number of previous drug treatments” has seventeen.
Following the instructions, and saving back into the original dataframe:
%>% drop_na() -> drugusers drugusers
and then
summary(drugusers)
## row ID age ndrugtx
## Min. : 1.0 Min. : 1.0 Min. :20.00 Min. : 0.000
## 1st Qu.:155.2 1st Qu.:155.2 1st Qu.:27.00 1st Qu.: 1.000
## Median :312.5 Median :312.5 Median :32.00 Median : 3.000
## Mean :313.8 Mean :313.8 Mean :32.39 Mean : 4.579
## 3rd Qu.:473.8 3rd Qu.:473.8 3rd Qu.:37.00 3rd Qu.: 6.000
## Max. :628.0 Max. :628.0 Max. :56.00 Max. :40.000
## treat site time censor
## Min. :0.0000 Min. :0.0000 Min. : 2.0 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 79.0 1st Qu.:1.0000
## Median :0.0000 Median :0.0000 Median : 166.0 Median :1.0000
## Mean :0.4918 Mean :0.2984 Mean : 234.4 Mean :0.8115
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 361.8 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1172.0 Max. :1.0000
## herco
## Min. :1.00
## 1st Qu.:1.00
## Median :2.00
## Mean :1.89
## 3rd Qu.:3.00
## Max. :3.00
No NA left. Gosh, as they say, that was easy. Extra: how many rows did we lose?
nrow(drugusers)
## [1] 610
There were 628 rows before, so we lost 18. (There were 22 missing values, but some of them were two on one row, so we only lost 18 rows.)
This is a very unsophisticated way of dealing with missing values. Another way is to “impute” them, that is, to guess what they would have been, and then fill in the guessed values and use them as if they were the truth, for example by regressing the columns with missing values on all the others, and using the regression predictions in place of the missing values.
\(\blacksquare\)
- Some of these variables are recorded as numbers but are actually categorical. Which ones? Re-define these variables in your data frame so that they have sensible (text) values.
Solution
These variables are actually categorical rather than quantitative:
treat
site
censor
herco
Most of them have only two levels, so it doesn’t matter whether
we make them categorical or leave them as numbers, but for
herco
it matters. Let’s give them all sensible values,
mostly with ifelse
,44 thus:
%>% mutate(
drugusers treat = ifelse(treat == 0, "short", "long"),
site = ifelse(site == 0, "A", "B"),
censor = ifelse(censor == 1, "returned", "no-return"),
herco = case_when(
== 1 ~ "both",
herco == 2 ~ "one",
herco == 3 ~ "neither"
herco
)-> drugusers )
I’m living on the edge and overwriting everything:
drugusers
## # A tibble: 610 × 9
## row ID age ndrugtx treat site time censor herco
## <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr> <chr>
## 1 1 1 39 1 long A 188 returned neither
## 2 2 2 33 8 long A 26 returned neither
## 3 3 3 33 3 long A 207 returned one
## 4 4 4 32 1 short A 144 returned neither
## 5 5 5 24 5 long A 551 no-return one
## 6 6 6 30 1 long A 32 returned both
## 7 7 7 39 34 long A 459 returned neither
## 8 8 8 27 2 long A 22 returned neither
## 9 9 9 40 3 long A 210 returned one
## 10 10 10 36 7 long A 184 returned one
## # … with 600 more rows
\(\blacksquare\)
- Create a suitable reponse variable for a Cox proportional hazards regression that predicts time until return to drug use from the other variables. This requires some care, because you need to be sure about what the censoring variable actually represents and what you need it to represent.
Solution
This is Surv
in package survival
. The response
variable needs to encode two things: the time until the event of
interest (return to drug use) and whether or not that event
happened for each patient.45
In this case, that is
censor="returned"
.
<- with(drugusers, Surv(time, censor == "returned")) y
Use whatever name you gave to the level of censor
that means “returned to drug use”.
\(\blacksquare\)
- Look at the first few values of your response variable. Why
is the fifth one marked with a
+
? Explain briefly.
Solution
head
works as well with a vector (displaying the first
six values) as it does with a data frame:
head(y)
## [1] 188 26 207 144 551+ 32
The fifth value is marked with a +
because it is a censored
value: this is a patient who was never observed to go back to drug
use. You can tell this by looking at the head
of the entire
data frame:
head(drugusers)
## # A tibble: 6 × 9
## row ID age ndrugtx treat site time censor herco
## <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr> <chr>
## 1 1 1 39 1 long A 188 returned neither
## 2 2 2 33 8 long A 26 returned neither
## 3 3 3 33 3 long A 207 returned one
## 4 4 4 32 1 short A 144 returned neither
## 5 5 5 24 5 long A 551 no-return one
## 6 6 6 30 1 long A 32 returned both
since this patient has censor="no-return"
. The other ones have
censor="returned"
; these are all “uncensored” in the jargon.
Typically, censored values will be bigger than uncensored ones, because (in general) the individual will be observed until the study ends, and studies of this kind carry on for years:
ggplot(drugusers, aes(x = censor, y = time)) + geom_boxplot()
Yep. The smallest time for a censored observation would be an upper outlier if it were observed for an uncensored observation.
One nice side-effect of turning censor
into a categorical variable is that it can now
distinguish groups as a boxplot requires.
I discovered something rather amusing when I originally wrote this (a year ago). Suppose you want to compare times for the two treatment groups, and you also want to distinguish censored from non-censored observations. Then, this works:
ggplot(drugusers, aes(x = treat, y = time, colour = censor)) +
geom_boxplot()
For each treatment, you get side-by-side boxplots of the times for censored (red) and uncensored (blue) observations, and so you see for both treatments (short and long) the censored times are typically longer than the uncensored ones.
(This you may recognize as a “grouped boxplot”, for when we have two categorical variables and one quantitative one.)
I borrow this idea for two-way ANOVA (coming up later).
\(\blacksquare\)
- Fit a Cox proportional hazards model, predicting from all
the other variables (except for
row
andID
) that you haven’t used yet. Display the results.
Solution
.1 <- coxph(y ~ age + ndrugtx + treat + site + herco, data = drugusers)
druguserssummary(drugusers.1)
## Call:
## coxph(formula = y ~ age + ndrugtx + treat + site + herco, data = drugusers)
##
## n= 610, number of events= 495
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age -0.023798 0.976483 0.007561 -3.148 0.00165 **
## ndrugtx 0.034815 1.035429 0.007755 4.490 7.14e-06 ***
## treatshort 0.254606 1.289953 0.091006 2.798 0.00515 **
## siteB -0.173021 0.841120 0.102105 -1.695 0.09016 .
## herconeither 0.125779 1.134032 0.103075 1.220 0.22236
## hercoone 0.247318 1.280586 0.122759 2.015 0.04394 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 0.9765 1.0241 0.9621 0.9911
## ndrugtx 1.0354 0.9658 1.0198 1.0513
## treatshort 1.2900 0.7752 1.0792 1.5418
## siteB 0.8411 1.1889 0.6886 1.0275
## herconeither 1.1340 0.8818 0.9266 1.3879
## hercoone 1.2806 0.7809 1.0067 1.6289
##
## Concordance= 0.581 (se = 0.014 )
## Likelihood ratio test= 35.08 on 6 df, p=4e-06
## Wald test = 36.96 on 6 df, p=2e-06
## Score (logrank) test = 37.36 on 6 df, p=1e-06
Another way to handle
“all the other \(x\)’s except row
, ID
, time
and censor
” is this:
.1a <- coxph(y ~ . - row - ID - time - censor, data = drugusers)
drugusers::tidy(drugusers.1a) broom
## # A tibble: 6 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 age -0.0238 0.00756 -3.15 0.00165
## 2 ndrugtx 0.0348 0.00775 4.49 0.00000714
## 3 treatshort 0.255 0.0910 2.80 0.00515
## 4 siteB -0.173 0.102 -1.69 0.0902
## 5 herconeither 0.126 0.103 1.22 0.222
## 6 hercoone 0.247 0.123 2.01 0.0439
Same. I used tidy
from broom
to shorten the output a bit.
\(\blacksquare\)
- Find which explanatory variables can be removed at \(\alpha=0.05\)
(there should be two of them). Bear in mind that we have categorical variables, so that looking at the output from
summary
is not enough.
Solution
The hint is meant to suggest to you that looking at drop1
is the right way to go:
drop1(drugusers.1, test = "Chisq")
## Single term deletions
##
## Model:
## y ~ age + ndrugtx + treat + site + herco
## Df AIC LRT Pr(>Chi)
## <none> 5712.2
## age 1 5720.3 10.0972 0.001485 **
## ndrugtx 1 5727.3 17.1446 3.464e-05 ***
## treat 1 5718.0 7.8352 0.005124 **
## site 1 5713.1 2.9335 0.086760 .
## herco 2 5712.5 4.3182 0.115427
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note that herco
, a categorical variable with three levels,
has 2 degrees of freedom here, since a test of
“no effect of herco
” is testing that survival is the same at
all three levels of herco
.
\(\blacksquare\)
- Remove all the non-significant explanatory variables and re-fit your model. By carrying out a suitable test demonstrate that your smaller model is the better one.
Solution
site
and herco
are the two variables to come
out.46 I like
update
, but there is no
problem about copying-pasting your coxph
and taking out
what you no longer need.
.2 <- update(drugusers.1, . ~ . - site - herco) drugusers
Having fit two models, we can use anova
to compare them. The
right test gets done, so no need for test=
:
anova(drugusers.2, drugusers.1)
## Analysis of Deviance Table
## Cox model: response is y
## Model 1: ~ age + ndrugtx + treat
## Model 2: ~ age + ndrugtx + treat + site + herco
## loglik Chisq Df Pr(>|Chi|)
## 1 -2853.7
## 2 -2850.1 7.2117 3 0.06545 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There is no significant difference between these two
models,47 so we can go with the
smaller, simpler one
(with just age
, ndrugtx
and treat
).
\(\blacksquare\)
- * Display your better model. Are all of the explanatory variables significant? Do their slope coefficients have sensible signs (plus or minus), based on what you know or can guess about drug treatments? Explain briefly.
Solution
summary(drugusers.2)
## Call:
## coxph(formula = y ~ age + ndrugtx + treat, data = drugusers)
##
## n= 610, number of events= 495
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age -0.020801 0.979414 0.007419 -2.804 0.00505 **
## ndrugtx 0.035567 1.036207 0.007621 4.667 3.05e-06 ***
## treatshort 0.231055 1.259929 0.090175 2.562 0.01040 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 0.9794 1.0210 0.9653 0.9938
## ndrugtx 1.0362 0.9651 1.0208 1.0518
## treatshort 1.2599 0.7937 1.0558 1.5035
##
## Concordance= 0.572 (se = 0.014 )
## Likelihood ratio test= 27.87 on 3 df, p=4e-06
## Wald test = 30.5 on 3 df, p=1e-06
## Score (logrank) test = 30.62 on 3 df, p=1e-06
The three remaining explanatory variables are all clearly significant:
the patient’s age, the number of previous drug treatments, and whether
the treatment was short or long. This is legit (we don’t need to run
drop1
again) because the remaining explanatory variables are
all quantitative or have only two levels, so that the single-df tests
in summary
are what we need.
Do their slope coefficients have sensible signs? Well, this requires careful thought. A positive coefficient means that increasing that variable increases the hazard of the event: ie., it makes the event likelier to happen sooner. Here, the “event” is “return to drug use”:
Age has a negative coefficient, so an older person is likely to take longer to return to drug use, other things being equal. This makes some kind of sense, if you imagine drug use as being related to maturity, or an older drug user as being more strongly committed to “turning their life around”, so that a drug treatment of any kind is going to be more effective on an older patient.
The number of previous treatments has a positive coefficient, so that a patient who has had a lot of previous treatments is likely to go back to drugs sooner. Such a person might be an “addict” for whom treatments really do not work, or might not be committed to giving up drugs.
treatshort
has a positive coefficient. This says that if you give a patient a short treatment, they are more likely (other things being equal) to go back to drugs sooner, as compared to the baseline long treatment. That is, a longer treatment is more effective than a shorter one. Given a significant effect of treatment length, this is the way around you would expect it to be.
\(\blacksquare\)
- We have three variables left in our model,
age
,ndrugtx
andtreat
. The quartiles of age are 27 and 37, the quartiles ofndrugtx
are 1 and 6, and the two possible values oftreat
areshort
andlong
. Create a data frame with variables of these names and all possible combinations of their values (so there should be 8 rows in the resulting data frame). Display the resulting data frame.
Solution
This data frame is going to be used for prediction, so I will
call it new
and construct it in pieces as I did before
(thus meaning that I don’t have to think too hard about what I’m doing):
<- c(27, 37)
ages <- c(1, 6)
ndrugtxs <- c("short", "long")
treats <- crossing(age = ages, ndrugtx = ndrugtxs, treat = treats)
new new
## # A tibble: 8 × 3
## age ndrugtx treat
## <dbl> <dbl> <chr>
## 1 27 1 long
## 2 27 1 short
## 3 27 6 long
## 4 27 6 short
## 5 37 1 long
## 6 37 1 short
## 7 37 6 long
## 8 37 6 short
8 rows as promised.
\(\blacksquare\)
- Obtain predicted survival probabilities for each of the
values of
age
,ndrugtx
andtreat
used in the previous part. You don’t need to display it (we are going to plot it shortly).
Solution
survfit
is the survival analysis version of
predict
and works the same way, so this is all you need:
<- survfit(drugusers.2, new, data = drugusers) pp
Make sure that you use your best model, ie. the second one. The
data=
is needed for the plot below, not in itself for the
prediction.
\(\blacksquare\)
- Plot your predicted survival curves.
Solution
This:
ggsurvplot(pp, conf.int = F)
The only thing to remember is that you plot your predictions, not the model from which they came.
If your plot didn’t come out, you may have to go back and re-do the
survfit
with the data=
at the end.
For reference in a minute:
new
## # A tibble: 8 × 3
## age ndrugtx treat
## <dbl> <dbl> <chr>
## 1 27 1 long
## 2 27 1 short
## 3 27 6 long
## 4 27 6 short
## 5 37 1 long
## 6 37 1 short
## 7 37 6 long
## 8 37 6 short
\(\blacksquare\)
- Which of your combinations of values is predicted to take the longest to return to drug use? Which is predicted to take the shortest time? Explain briefly.
Solution
Remember that “up and to the right” is the best survival curve: that is, the people on this survival curve are predicted to take the longest to return to drug use. On my plot, this is the pale blue survival curve, stratum 5. Going back to my combinations data frame, this is 37-year-olds with only one previous drug treatment and the longer drug treatment this time.
The worst is my green survival curve, stratum 4, which is the exact opposite of this: 27-year-olds, 6 previous drug treatments, shorter treatment this time.
“Returning to drug use” is like “death” in that you want it to be a long time before it happens, so “best” is top right on the plot of survival curves. In other curcumstances, you might want the event to happen sooner, in which case the lower-left survival curve would be the “best” one.
\(\blacksquare\)
- Are your survival curve plot and your conclusions from part (here) consistent, or not? Explain briefly.
Solution
The survival curves say that being older, having fewer previous
treatments and being on the long treatment are better in terms
of taking longer to return to drug use. Our analysis of whether
the slope coefficients in drugusers.2
were positive or
negative came to exactly the same conclusion. So the survival
curves and part (here) are entirely consistent.
On my plot with the legend, you can assess the effects of the individual variables: for example, to assess the effect of age, find two combos that differ only in age, say strata 1 and 5, the red and light blue ones. Of these, the light blue survival curve is higher, so age 37 is better in terms of survival than age 27. This will work whichever such pair you pick: for example, strata 3 and 7, the olive green and purple curves, compare the same way.
Extra: more comparisons for you to do: to assess the effect of number of previous treatments, compare eg. strata 1 and 3, red and olive green, and to assess the effect of treatment length, compare eg. strata 5 and 6, light blue and darker blue.
All this struggling to identify colours makes me think of link, in which the guy behind the webcomic XKCD did a survey where he showed people a whole bunch of different colours and asked the people to name the colours.
\(\blacksquare\)
29.7 Multiple myeloma
Multiple myeloma is a kind of cancer. It forms in a plasma cell (which is a type of white blood cell). It causes cancer cells to accumulate in the bone marrow, where they crowd out healthy blood cells. Plasma cells make antibodies (to help fight infections), while the cancer cells don’t: they produce abnormal proteins that can cause kidney problems. (This adapted from link.) The variables are:
time
: survival time from diagnosis (months)vstatus
: 0=alive, 1=dead at end of studylogbun
: log of BUN test score (BUN test is a test of kidney function, not to be confused with cha siu bao48).hgb
: hemoglobin (at diagnosis).platelet
: platelets: 1=normal, 0=abnormal (at diagnosis).age
at diagnosis, in yearslogwbc
: log of WBC (white blood cell count, at diagnosis)frac
: fractures at diagnosis (0=absent, 1=present)logpbm
: log of percent of plasma cells in bone marrowprotein
: proteinuria (protein in urine) at diagnosis. Most people have very little, so a larger than normal amount indicates illness of some kind.scalc
: serum calcium at diagnosis.
The data, on 65 patients with multiple myeloma, are in link. Some of the variables are logs because they could take very large values.
There are a lot of parts here, but each part is supposed to be short.
- Read in the data and display (some of) the values. Confirm that you have the right number of observations and the right variables.
Solution
The usual:
<- "http://ritsokiguess.site/datafiles/myeloma.csv"
my_url <- read_csv(my_url) myeloma
## Rows: 65 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (11): time, vstatus, logbun, hgb, platelet, age, logwbc, frac, logpbm, p...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
myeloma
## # A tibble: 65 × 11
## time vstatus logbun hgb platelet age logwbc frac logpbm protein scalc
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.25 1 2.22 9.4 1 67 3.66 1 1.95 12 10
## 2 1.25 1 1.94 12 1 38 3.99 1 1.95 20 18
## 3 2 1 1.52 9.8 1 81 3.88 1 2 2 15
## 4 2 1 1.75 11.3 0 75 3.81 1 1.26 0 12
## 5 2 1 1.30 5.1 0 57 3.72 1 2 3 9
## 6 3 1 1.54 6.7 1 46 4.48 0 1.93 12 10
## 7 5 1 2.24 10.1 1 50 4.95 1 1.66 4 9
## 8 5 1 1.68 6.5 1 74 3.73 0 1.73 5 9
## 9 6 1 1.36 9 1 77 3.54 0 1.46 1 8
## 10 6 1 2.11 10.2 0 70 3.54 1 1.36 1 8
## # … with 55 more rows
65 observations, and all the variables listed. If you want to go
further (not necessary here), you can check that the variables
vstatus
, platelet
and frac
that should be
zero and one actually are zero and one, at least for the values
shown (they are), and the ages look like ages (they do).
The tidyverse also offers:
glimpse(myeloma)
## Rows: 65
## Columns: 11
## $ time <dbl> 1.25, 1.25, 2.00, 2.00, 2.00, 3.00, 5.00, 5.00, 6.00, 6.00, 6…
## $ vstatus <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ logbun <dbl> 2.2175, 1.9395, 1.5185, 1.7482, 1.3010, 1.5441, 2.2355, 1.681…
## $ hgb <dbl> 9.4, 12.0, 9.8, 11.3, 5.1, 6.7, 10.1, 6.5, 9.0, 10.2, 9.7, 10…
## $ platelet <dbl> 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1…
## $ age <dbl> 67, 38, 81, 75, 57, 46, 50, 74, 77, 70, 60, 67, 48, 61, 53, 5…
## $ logwbc <dbl> 3.6628, 3.9868, 3.8751, 3.8062, 3.7243, 4.4757, 4.9542, 3.732…
## $ frac <dbl> 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1…
## $ logpbm <dbl> 1.9542, 1.9542, 2.0000, 1.2553, 2.0000, 1.9345, 1.6628, 1.732…
## $ protein <dbl> 12, 20, 2, 0, 3, 12, 4, 5, 1, 1, 0, 0, 5, 1, 1, 0, 0, 1, 1, 0…
## $ scalc <dbl> 10, 18, 15, 12, 9, 10, 9, 9, 8, 8, 10, 8, 10, 10, 13, 12, 10,…
which gives a bit more of a picture of the values.49 Or if you were serious about checking, you could do
summary(myeloma)
## time vstatus logbun hgb
## Min. : 1.25 Min. :0.0000 Min. :0.7782 Min. : 4.9
## 1st Qu.: 7.00 1st Qu.:0.0000 1st Qu.:1.1461 1st Qu.: 8.8
## Median :15.00 Median :1.0000 Median :1.3222 Median :10.2
## Mean :24.01 Mean :0.7385 Mean :1.3929 Mean :10.2
## 3rd Qu.:35.00 3rd Qu.:1.0000 3rd Qu.:1.5682 3rd Qu.:12.0
## Max. :92.00 Max. :1.0000 Max. :2.2355 Max. :14.6
## platelet age logwbc frac
## Min. :0.0000 Min. :38.00 Min. :3.362 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:51.00 1st Qu.:3.643 1st Qu.:1.0000
## Median :1.0000 Median :60.00 Median :3.732 Median :1.0000
## Mean :0.8615 Mean :60.15 Mean :3.769 Mean :0.7538
## 3rd Qu.:1.0000 3rd Qu.:67.00 3rd Qu.:3.875 3rd Qu.:1.0000
## Max. :1.0000 Max. :82.00 Max. :4.954 Max. :1.0000
## logpbm protein scalc
## Min. :0.4771 Min. : 0.000 Min. : 7.00
## 1st Qu.:1.3617 1st Qu.: 0.000 1st Qu.: 9.00
## Median :1.6232 Median : 1.000 Median :10.00
## Mean :1.5497 Mean : 3.615 Mean :10.12
## 3rd Qu.:1.8451 3rd Qu.: 4.000 3rd Qu.:10.00
## Max. :2.0000 Max. :27.000 Max. :18.00
which gives means and five-number summaries for each of the variables (the numeric ones, but they all are here, even the ones coded as 0 or 1 that are really categorical).
\(\blacksquare\)
- Create a suitable response variable for a Cox
proportional-hazards survival model, bearing in mind that the
“event” here is death. Display your response variable, and explain
briefly what the
+
signs attached to some of the values mean, without using a technical term.
Solution
I seem to call my response variables y
, but you can call
yours whatever you like. Two things: the survival times, here
time
, and the indicator of the event, here
vstatus
being 1:
<- with(myeloma, Surv(time, vstatus == 1))
y y
## [1] 1.25 1.25 2.00 2.00 2.00 3.00 5.00 5.00 6.00 6.00
## [11] 6.00 6.00 7.00 7.00 7.00 9.00 11.00 11.00 11.00 11.00
## [21] 11.00 13.00 14.00 15.00 16.00 16.00 17.00 17.00 18.00 19.00
## [31] 19.00 24.00 25.00 26.00 32.00 35.00 37.00 41.00 41.00 51.00
## [41] 52.00 54.00 58.00 66.00 67.00 88.00 89.00 92.00 4.00+ 4.00+
## [51] 7.00+ 7.00+ 8.00+ 12.00+ 11.00+ 12.00+ 13.00+ 16.00+ 19.00+ 19.00+
## [61] 28.00+ 41.00+ 53.00+ 57.00+ 77.00+
Or use
myeloma$
(twice) before the variable names.
The values of y
that have a +
by them go with
patients who were never observed to die (or were still alive at the
end of the study). There were 17 of these, listed at the end of the
data frame. Usually, these values of the response will be higher than
the others, but they weren’t here. (Maybe some of these patients were
withdrawn from the study, or they joined it late.)
\(\blacksquare\)
- What is the technical term for those patients that have a
+
by their values for the response variable?
Solution
Censored. A quick one. I was trying to dissuade you from using the word “censored” in the previous part, since I wanted you to demonstrate that you understood what it meant. But you should know the technical term as well, which is why I asked you for it here. Grading note: if this part and the previous one contain, somewhere, the word “censored” and a clear explanation of what “censored” means, then I don’t mind what is where.
\(\blacksquare\)
- Fit a Cox proportional-hazards survival model predicting
your response variable from all the other variables (except for the
ones that you used to make the response variable). Display the
summary
of your model.
Solution
The obvious way to do this is to list all the other variables on the right side of the squiggle, but a faster way is this:
.1 <- coxph(y ~ . - time - vstatus, data = myeloma)
ysummary(y.1)
## Call:
## coxph(formula = y ~ . - time - vstatus, data = myeloma)
##
## n= 65, number of events= 48
##
## coef exp(coef) se(coef) z Pr(>|z|)
## logbun 1.85557 6.39536 0.65628 2.827 0.00469 **
## hgb -0.12629 0.88136 0.07212 -1.751 0.07994 .
## platelet -0.25488 0.77501 0.51194 -0.498 0.61858
## age -0.01306 0.98702 0.01957 -0.668 0.50439
## logwbc 0.35389 1.42460 0.71576 0.494 0.62101
## frac 0.34232 1.40821 0.40725 0.841 0.40059
## logpbm 0.38165 1.46470 0.48743 0.783 0.43364
## protein 0.01302 1.01311 0.02612 0.498 0.61817
## scalc 0.12976 1.13856 0.10502 1.236 0.21659
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## logbun 6.3954 0.1564 1.7670 23.147
## hgb 0.8814 1.1346 0.7652 1.015
## platelet 0.7750 1.2903 0.2841 2.114
## age 0.9870 1.0131 0.9499 1.026
## logwbc 1.4246 0.7020 0.3503 5.794
## frac 1.4082 0.7101 0.6339 3.128
## logpbm 1.4647 0.6827 0.5634 3.808
## protein 1.0131 0.9871 0.9625 1.066
## scalc 1.1386 0.8783 0.9268 1.399
##
## Concordance= 0.675 (se = 0.051 )
## Likelihood ratio test= 17.62 on 9 df, p=0.04
## Wald test = 17.93 on 9 df, p=0.04
## Score (logrank) test = 18.97 on 9 df, p=0.03
The .
in this model formula means
“all the columns in the data frame”
(except for the response variable if it was in the data
frame, which here it was not). I used time
and
vstatus
to make y
, so I had to exclude them explicitly.
If you forget to exclude time
and vstatus
, you are
in danger of having a model that fits perfectly:
.00 <- coxph(y ~ ., data = myeloma) y
## Warning in coxph.fit(X, Y, istrat, offset, init, control, weights = weights, :
## Ran out of iterations and did not converge
summary(y.00)
## Call:
## coxph(formula = y ~ ., data = myeloma)
##
## n= 65, number of events= 48
##
## coef exp(coef) se(coef) z Pr(>|z|)
## time -1.109e+01 1.526e-05 9.799e-01 -11.318 <2e-16 ***
## vstatus 1.543e+01 5.031e+06 6.233e+02 0.025 0.98
## logbun 1.759e-04 1.000e+00 7.904e-01 0.000 1.00
## hgb 6.861e-06 1.000e+00 9.143e-02 0.000 1.00
## platelet 9.226e-05 1.000e+00 6.221e-01 0.000 1.00
## age -4.311e-06 1.000e+00 2.245e-02 0.000 1.00
## logwbc -1.220e-06 1.000e+00 8.940e-01 0.000 1.00
## frac -6.694e-05 9.999e-01 6.262e-01 0.000 1.00
## logpbm 2.400e-04 1.000e+00 7.307e-01 0.000 1.00
## protein 3.472e-05 1.000e+00 6.776e-02 0.001 1.00
## scalc -1.447e-05 1.000e+00 1.091e-01 0.000 1.00
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## time 1.526e-05 6.554e+04 2.236e-06 0.0001041
## vstatus 5.031e+06 1.988e-07 0.000e+00 Inf
## logbun 1.000e+00 9.998e-01 2.124e-01 4.7087742
## hgb 1.000e+00 1.000e+00 8.359e-01 1.1962633
## platelet 1.000e+00 9.999e-01 2.955e-01 3.3849492
## age 1.000e+00 1.000e+00 9.569e-01 1.0449881
## logwbc 1.000e+00 1.000e+00 1.734e-01 5.7673011
## frac 9.999e-01 1.000e+00 2.930e-01 3.4119518
## logpbm 1.000e+00 9.998e-01 2.389e-01 4.1885763
## protein 1.000e+00 1.000e+00 8.757e-01 1.1420792
## scalc 1.000e+00 1.000e+00 8.075e-01 1.2384242
##
## Concordance= 1 (se = 0 )
## Likelihood ratio test= 277 on 11 df, p=<2e-16
## Wald test = 128.1 on 11 df, p=<2e-16
## Score (logrank) test = 87.4 on 11 df, p=5e-14
The warning at the top is your clue that something has gone wrong. This kind of warning can happen with real data, but not often: it is usually an indication that something is wrong with the way you specified the model. If you look at the output, you’ll realize that predicting survival time from survival time makes no sense at all.
There is of course nothing wrong with typing out all the variable names, except that the first time you type them out, you will likely make a typo (unless you are more careful than I usually am).
\(\blacksquare\)
- In your model, which explanatory variables have a P-value less than 0.10? Fit a model containing only those and display the results.
Solution
Only logbun
and hgb
; the other P-values are
larger, usually much larger.
Because there are so many variables to remove, I am frightened
away from update
here (which I would normally try to use
in this situation). I’m going to copy-and-paste my code for
y.1
and edit it:
.2 <- coxph(y ~ logbun + hgb, data = myeloma)
ysummary(y.2)
## Call:
## coxph(formula = y ~ logbun + hgb, data = myeloma)
##
## n= 65, number of events= 48
##
## coef exp(coef) se(coef) z Pr(>|z|)
## logbun 1.71597 5.56209 0.61855 2.774 0.00553 **
## hgb -0.11966 0.88722 0.05742 -2.084 0.03717 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## logbun 5.5621 0.1798 1.6547 18.6961
## hgb 0.8872 1.1271 0.7928 0.9929
##
## Concordance= 0.675 (se = 0.043 )
## Likelihood ratio test= 12.27 on 2 df, p=0.002
## Wald test = 12.51 on 2 df, p=0.002
## Score (logrank) test = 13.07 on 2 df, p=0.001
That’s all I wanted, but you can note that hgb
has become
significant at \(\alpha=0.05\). I suspect it was somewhat correlated
with a variable that we removed, so that its value to the regression
has become clearer.
\(\blacksquare\)
- Do a test to compare the two models that you fit. Why do you prefer the second model? Explain briefly.
Solution
Comparing two models is anova
, which also works here. The
right test
is Chisq
:
anova(y.2, y.1, test = "Chisq")
## Analysis of Deviance Table
## Cox model: response is y
## Model 1: ~ logbun + hgb
## Model 2: ~ (time + vstatus + logbun + hgb + platelet + age + logwbc + frac + logpbm + protein + scalc) - time - vstatus
## loglik Chisq Df Pr(>|Chi|)
## 1 -148.06
## 2 -145.39 5.347 7 0.6177
The usual logic here: this is far from significant, so the null
hypothesis (that the two models are equally good) is not rejected, so
we prefer the smaller model y.2
because it is simpler.
I wasn’t sure about the Model 2
line of my anova
output (what are time
and vstatus
doing there?), but
the test has 7 degrees of freedom, which is correct since we started with 9
explanatory variables and finished with 2, so that we took out 7 of
them. I checked what went off the right side of the page: there is a
-time-vstatus
on the end, so that it is correct. What
happened is that the .
got expanded out into all the
variables separated by +
, and then whatever else (the
“minus” variables) were on the end.
In case you are curious, step
also works on models like these:
.3 <- step(y.1, direction = "backward", trace = 0)
ysummary(y.3)
## Call:
## coxph(formula = y ~ logbun + hgb, data = myeloma)
##
## n= 65, number of events= 48
##
## coef exp(coef) se(coef) z Pr(>|z|)
## logbun 1.71597 5.56209 0.61855 2.774 0.00553 **
## hgb -0.11966 0.88722 0.05742 -2.084 0.03717 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## logbun 5.5621 0.1798 1.6547 18.6961
## hgb 0.8872 1.1271 0.7928 0.9929
##
## Concordance= 0.675 (se = 0.043 )
## Likelihood ratio test= 12.27 on 2 df, p=0.002
## Wald test = 12.51 on 2 df, p=0.002
## Score (logrank) test = 13.07 on 2 df, p=0.001
The same model as the one we found by brute force. You can change the
value of trace
to see the progress, but in this case it’s not
very illuminating, since <none>
and the variables we end up
keeping are always at the bottom of the list to remove.
step
is built on add1
and drop1
. In this
case, drop1
is run repeatedly and the variable with lowest
AIC is removed. We had all numeric variables in this one, but if our
model had something categorical like treatment
with, let’s
say, 4 levels, drop1
would contemplate dropping all four of
these in one shot, the same way it works with a categorical variable
in a regression of any other kind.
\(\blacksquare\)
- There should be two explanatory variables left in your model. These are both numerical variables. Find their first and third quartiles, any way you like.
Solution
The obvious way is probably this:
quantile(myeloma$logbun)
## 0% 25% 50% 75% 100%
## 0.7782 1.1461 1.3222 1.5682 2.2355
quantile(myeloma$hgb)
## 0% 25% 50% 75% 100%
## 4.9 8.8 10.2 12.0 14.6
So the quartiles are 1.15 and 1.57 for logbun
, and 8.8 and
12.0 for hgb
.
There are (at least) three other ways to do it. This is the easiest:
summary(myeloma)
## time vstatus logbun hgb
## Min. : 1.25 Min. :0.0000 Min. :0.7782 Min. : 4.9
## 1st Qu.: 7.00 1st Qu.:0.0000 1st Qu.:1.1461 1st Qu.: 8.8
## Median :15.00 Median :1.0000 Median :1.3222 Median :10.2
## Mean :24.01 Mean :0.7385 Mean :1.3929 Mean :10.2
## 3rd Qu.:35.00 3rd Qu.:1.0000 3rd Qu.:1.5682 3rd Qu.:12.0
## Max. :92.00 Max. :1.0000 Max. :2.2355 Max. :14.6
## platelet age logwbc frac
## Min. :0.0000 Min. :38.00 Min. :3.362 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:51.00 1st Qu.:3.643 1st Qu.:1.0000
## Median :1.0000 Median :60.00 Median :3.732 Median :1.0000
## Mean :0.8615 Mean :60.15 Mean :3.769 Mean :0.7538
## 3rd Qu.:1.0000 3rd Qu.:67.00 3rd Qu.:3.875 3rd Qu.:1.0000
## Max. :1.0000 Max. :82.00 Max. :4.954 Max. :1.0000
## logpbm protein scalc
## Min. :0.4771 Min. : 0.000 Min. : 7.00
## 1st Qu.:1.3617 1st Qu.: 0.000 1st Qu.: 9.00
## Median :1.6232 Median : 1.000 Median :10.00
## Mean :1.5497 Mean : 3.615 Mean :10.12
## 3rd Qu.:1.8451 3rd Qu.: 4.000 3rd Qu.:10.00
## Max. :2.0000 Max. :27.000 Max. :18.00
from which you pick out the ones you need. Or, you select
the
ones you need first:
%>% select(logbun, hgb) %>% summary() myeloma
## logbun hgb
## Min. :0.7782 Min. : 4.9
## 1st Qu.:1.1461 1st Qu.: 8.8
## Median :1.3222 Median :10.2
## Mean :1.3929 Mean :10.2
## 3rd Qu.:1.5682 3rd Qu.:12.0
## Max. :2.2355 Max. :14.6
The obvious tidyverse
way is actually a bit
inelegant, because you have to calculate two things for two variables:50
%>% summarize(
myeloma logbun.q1 = quantile(logbun, 0.25),
logbun.q3 = quantile(logbun, 0.75),
hgb.q1 = quantile(hgb, 0.25),
hgb.q3 = quantile(hgb, 0.75)
)
## # A tibble: 1 × 4
## logbun.q1 logbun.q3 hgb.q1 hgb.q3
## <dbl> <dbl> <dbl> <dbl>
## 1 1.15 1.57 8.8 12
Next is the tidyverse-approved way to get both quartiles for both variables at once. Use across
to select the variables to use, and then something with a squiggle and a dot to say “do this on each of the columns selected in the across
”. If you have a cleverer way to select those two columns without naming them, go for it. Read this in English as “for each of the columns logbun
and hgb
, work out the first and third quantiles of it”, where the dot is read as “it”:
%>%
myeloma summarize(across(c(logbun, hgb),
quantile(x, c(0.25, 0.75)))) \(x)
## # A tibble: 2 × 2
## logbun hgb
## <dbl> <dbl>
## 1 1.15 8.8
## 2 1.57 12
We have lost which quartile is which, but of course the lower one must be Q1 and the higher one Q3 for each variable.51
\(\blacksquare\)
- Create a data frame containing all possible combinations of the two quartiles for each of the two variables, and display the result.
Solution
The usual crossing
:
<- c(1.1461, 1.5682)
logbuns <- c(8.8, 12)
hgbs <- crossing(logbun = logbuns, hgb = hgbs)
new new
## # A tibble: 4 × 2
## logbun hgb
## <dbl> <dbl>
## 1 1.15 8.8
## 2 1.15 12
## 3 1.57 8.8
## 4 1.57 12
Or anything equivalent to that. This is my procedure, but you are free
to shorten it if it will still work. The place you have to get to in
the end is a data frame with columns called logbun
and
hgb
, and the right four combinations of values. If you want
to round the logbun
values off more, for example to two
decimals, that’s fine; it won’t affect the graph that’s coming up.
\(\blacksquare\)
- Obtain predicted survival probabilities for each of the combinations of variables you created above. You don’t need to look at the results (they are rather long).
Solution
This seems as if it ought to be predict
, but the
survival
version of it is called survfit
:
<- survfit(y.2, new, data = myeloma) s
It works the same as predict
: a fitted model object (your
smaller survival model), and a data frame of values to predict
for. The data=
is not strictly needed here, but if you want
ggsurvplot
to work right, you do need it to be here.
\(\blacksquare\)
- Obtain a graph of the predicted survival curves for each combination of your variables.
Solution
This is easier than you think: it’s just ggsurvplot
from survminer
:
ggsurvplot(s, conf.int = F)
\(\blacksquare\)
- Is it better to have high or low values for each of the variables in your prediction? Explain briefly.
Solution
Those four “strata” are the four rows in your prediction data
frame (the four combinations). They are in the same order that
they were in new
(or whatever name you used):
new
## # A tibble: 4 × 2
## logbun hgb
## <dbl> <dbl>
## 1 1.15 8.8
## 2 1.15 12
## 3 1.57 8.8
## 4 1.57 12
The best survival curve is the top-right green one. This is
stratum52
2, from the legend at the top. In new
,
this goes with a low value of logbun
and a high value
of hgb
.
You can check this by looking at the worst survival curve,
which should be diametrically opposed. This is the blue one, stratum
3, which is high logbun
and low hgb
,
indeed exactly the opposite of the best one.
Things that are tests, like logbun
, are often set up so that
a high value is the abnormal one (so that an abnormal one will be easy
to spot). Things that are measurements, like hgb
, might have
an ideal range, but the better value could be high or low, depending
on what is being measured.
\(\blacksquare\)
29.8 Ovarian cancer
R’s survival
package contains several data
sets. One of these is called ovarian
; it comes from a study
of 26 ovarian cancer patients. The major purpose of this study was to
compare the effects of two treatments on survival time.
- Obtain and display (all of) the data set. This is as simple as loading the package and typing the data set’s name.
Solution
Thus. You may need to start with library(survival)
:
ovarian
## futime fustat age resid.ds rx ecog.ps
## 1 59 1 72.3315 2 1 1
## 2 115 1 74.4932 2 1 1
## 3 156 1 66.4658 2 1 2
## 4 421 0 53.3644 2 2 1
## 5 431 1 50.3397 2 1 1
## 6 448 0 56.4301 1 1 2
## 7 464 1 56.9370 2 2 2
## 8 475 1 59.8548 2 2 2
## 9 477 0 64.1753 2 1 1
## 10 563 1 55.1781 1 2 2
## 11 638 1 56.7562 1 1 2
## 12 744 0 50.1096 1 2 1
## 13 769 0 59.6301 2 2 2
## 14 770 0 57.0521 2 2 1
## 15 803 0 39.2712 1 1 1
## 16 855 0 43.1233 1 1 2
## 17 1040 0 38.8932 2 1 2
## 18 1106 0 44.6000 1 1 1
## 19 1129 0 53.9068 1 2 1
## 20 1206 0 44.2055 2 2 1
## 21 1227 0 59.5890 1 2 2
## 22 268 1 74.5041 2 1 2
## 23 329 1 43.1370 2 1 1
## 24 353 1 63.2192 1 2 2
## 25 365 1 64.4247 2 2 1
## 26 377 0 58.3096 1 2 1
There are indeed 26 rows. This is a data.frame
rather than a
tibble
, so you see the whole thing, in case you were
expecting something like this:
%>% as.tibble() ovarian
## Warning: `as.tibble()` was deprecated in tibble 2.0.0.
## ℹ Please use `as_tibble()` instead.
## ℹ The signature and semantics have changed, see `?as_tibble`.
## # A tibble: 26 × 6
## futime fustat age resid.ds rx ecog.ps
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 59 1 72.3 2 1 1
## 2 115 1 74.5 2 1 1
## 3 156 1 66.5 2 1 2
## 4 421 0 53.4 2 2 1
## 5 431 1 50.3 2 1 1
## 6 448 0 56.4 1 1 2
## 7 464 1 56.9 2 2 2
## 8 475 1 59.9 2 2 2
## 9 477 0 64.2 2 1 1
## 10 563 1 55.2 1 2 2
## # … with 16 more rows
which doesn’t change anything in ovarian
, but changes what
kind of thing it is (and thus how it displays). Usually when you read
something in from a file, you use something like read_delim
that makes a tibble
, but this one wasn’t read in from a
file. It was stored in the package as an old-fashioned
data.frame
, and so that’s how it stays.
\(\blacksquare\)
- The columns of interest to us are:
futime
: the time for which a patient was followed-up: the number of days until either they died or the study ended (or they withdrew from the study for some other reason).fustat
: follow-up status: 1 if the patient died of ovarian cancer, 0 if they were still alive when the study ended.age
: of patient, at diagnosis, in yearsrx
: treatment, numbered 1 or 2, but really labels for the two treatments.
Create and display a suitable response variable y
for a Cox
proportional-hazards model.
Solution
The idea is to use the appropriate one(s) of these columns in
Surv
. Remember that the response variable in a survival
model encodes two things: the survival time, and whether or not
the event (here death) actually happened to that patient or not.
I always forget whether the second thing in Surv
has to
be 1 or 0 if the event happened. The help says that it needs to be
1 or TRUE
if the event (death) happened, which is what
fustat
is, so we can use it as it is:
<- with(ovarian, Surv(futime, fustat))
y y
## [1] 59 115 156 421+ 431 448+ 464 475 477+ 563 638 744+
## [13] 769+ 770+ 803+ 855+ 1040+ 1106+ 1129+ 1206+ 1227+ 268 329 353
## [25] 365 377+
This creates a separate variable y
outside of any data
frame. This is how it has to be done, because although y
looks as if it is 26 long (one per patient), it’s actually more
complicated than that:
<- ovarian %>% mutate(y = Surv(futime, fustat)) ov2
This doesn’t work, because internally y
is a matrix
with two columns:
print.default(y)
## time status
## [1,] 59 1
## [2,] 115 1
## [3,] 156 1
## [4,] 421 0
## [5,] 431 1
## [6,] 448 0
## [7,] 464 1
## [8,] 475 1
## [9,] 477 0
## [10,] 563 1
## [11,] 638 1
## [12,] 744 0
## [13,] 769 0
## [14,] 770 0
## [15,] 803 0
## [16,] 855 0
## [17,] 1040 0
## [18,] 1106 0
## [19,] 1129 0
## [20,] 1206 0
## [21,] 1227 0
## [22,] 268 1
## [23,] 329 1
## [24,] 353 1
## [25,] 365 1
## [26,] 377 0
## attr(,"type")
## [1] "right"
## attr(,"class")
## [1] "Surv"
The attr(,"class")
thing at the bottom of this says that y
is actually a Surv
object, which is how it knows to display as it does, as 26 values some of which have plusses.53
The way R works out the length of something is its total number of
entries, which for this matrix is \(26 \times 2=52\). The only things
you can add via mutate
to a data frame with 26 rows have
length 26 (since everything in a data frame has to be the same length)
or length 1 (in which case it is repeated 26 times).
With some cajoling, you can get this into a data frame, but with more cajoling than seems necessary. It’s much easier to leave it as a separate thing outside a data frame.54
I gave ov2
a “temporary” name because I am not going to use
it again.
\(\blacksquare\)
- In the display of your response variable, some values are
marked with a
+
. Why is that? Explain briefly. (If you use a technical term, you should explain what it means.)
Solution
These are the censored observations. You can say this, but you
also need to say what that means (this is the “technical term”
referred to in the question). The observations with a +
are individuals who were never observed to die, or who were still
alive at the end of the study.
I want you to demonstrate that you know what censored means, not just that you know when you have a censored observation.
Extra: in a study like this, patients are typically “recruited”
into the study at various different times. Patients who happened
to be in the study near the beginning and who survived can have a
large (censored) value of y
(like those values over 1000
days). But a patient might join the study later on; if they
survive, they might produce a censored observation with a small
survival time, like the last value 377. I’m sure the doctor would
have liked to follow them for longer, but the funding ran out, and
the doctor had a paper to write. (There is some information
in these small censored values, but not much, because most of the
patients, even the ones who eventually died, survived for longer
than 377 days.)
The other thing that might have happened is that a patient with the 377-censored value died from something else unrelated to ovarian cancer. The study is only concerned with deaths from ovarian cancer, so such a patient is treated as censored at their death time. After this point we cannot assess how long this patient survived ovarian cancer.
\(\blacksquare\)
- Fit a Cox proportional-hazards model for predicting survival time from age and treatment. Note that the numeric values for treatment make sense only as labels for the two treatments, so in your model formula make treatment into a factor. Display the results.
Solution
The hint suggests something like this:
.1 <- coxph(y ~ age + factor(rx), data = ovarian)
timesummary(time.1)
## Call:
## coxph(formula = y ~ age + factor(rx), data = ovarian)
##
## n= 26, number of events= 12
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.14733 1.15873 0.04615 3.193 0.00141 **
## factor(rx)2 -0.80397 0.44755 0.63205 -1.272 0.20337
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.1587 0.863 1.0585 1.268
## factor(rx)2 0.4475 2.234 0.1297 1.545
##
## Concordance= 0.798 (se = 0.076 )
## Likelihood ratio test= 15.89 on 2 df, p=4e-04
## Wald test = 13.47 on 2 df, p=0.001
## Score (logrank) test = 18.56 on 2 df, p=9e-05
Alternatively, define the factor version of rx
in the data
frame first. This is the slick way to do that:
.1a <- ovarian %>%
timemutate(rxf = factor(rx)) %>%
coxph(y ~ age + rxf, data = .)
summary(time.1a)
## Call:
## coxph(formula = y ~ age + rxf, data = .)
##
## n= 26, number of events= 12
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.14733 1.15873 0.04615 3.193 0.00141 **
## rxf2 -0.80397 0.44755 0.63205 -1.272 0.20337
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.1587 0.863 1.0585 1.268
## rxf2 0.4475 2.234 0.1297 1.545
##
## Concordance= 0.798 (se = 0.076 )
## Likelihood ratio test= 15.89 on 2 df, p=4e-04
## Wald test = 13.47 on 2 df, p=0.001
## Score (logrank) test = 18.56 on 2 df, p=9e-05
The answer is the same either way.
The data=.
means “use the data frame that came out of the previous step, the one with rxf
in it.”
\(\blacksquare\)
- Is there a significant difference between the treatments in terms of their effects on survival (from ovarian cancer)?
Solution
Look at the P-value for my factor(rx)2
, 0.203. This is
not small, so there is no evidence of a difference between
treatments.
Extra: the reason for the odd label is that we have turned treatment into a categorical variable; treatment 1 is used as the baseline, and the negative slope says that the “hazard of death” is lower for treatment 2 than for treatment 1: that is, people survive longer on treatment 2, but the difference is not big enough to be significant (we also have a smallish sample size).
Since there are only two treatments, it would in fact have been OK
to leave them as numbers (with two numbers one unit apart, the
slope would have been the same size as here), but I think it’s a
good idea to treat categorical variables as categorical. My own habit
is to use letters or something non-numerical to distinguish
categories. I might have used t1
and t2
in this
case, or the names of the different treatments.
\(\blacksquare\)
- Is there a significant effect of age? If there is, describe the effect that age has on survival.
Solution
The P-value for age is 0.0014, small, so age definitely has a significant effect on survival. As to what kind of effect, look at the slope coefficient, 0.15, positive, which means that increased age-at-diagnosis goes with an increased hazard of death, or that older patients do not survive as long.
I would like you to get to the plain-English words at the end. Part of your job as a statistician is explaining what you got to people who are doctors, managers, etc., who won’t understand the terminology.
Thus, one mark for assessing significance via P-value, one for looking at the slope coefficient and noting that it is positive, and one for getting to “older patients do not survive as long”, or “older patients have a larger chance of dying sooner”. (Strictly, this is also “all else equal” as usual, since survival time might also have depended on treatment, but the point of this question is for you to get to “older patients do not survive as long”.)
(The interpretation of the slope may seem backwards: a positive slope means a shorter survival time for a larger age. This is why I talk about “hazard of death”, since that guides us to the correct interpretation.)
Extra: I was curious about what would happen if I just included
rx
in the model:
.2 <- update(time.1, . ~ . - age)
timesummary(time.2)
## Call:
## coxph(formula = y ~ factor(rx), data = ovarian)
##
## n= 26, number of events= 12
##
## coef exp(coef) se(coef) z Pr(>|z|)
## factor(rx)2 -0.5964 0.5508 0.5870 -1.016 0.31
##
## exp(coef) exp(-coef) lower .95 upper .95
## factor(rx)2 0.5508 1.816 0.1743 1.74
##
## Concordance= 0.608 (se = 0.07 )
## Likelihood ratio test= 1.05 on 1 df, p=0.3
## Wald test = 1.03 on 1 df, p=0.3
## Score (logrank) test = 1.06 on 1 df, p=0.3
Still not significant, but this model is a lot worse because I
took out the significant age
. What this is doing is mixing up
all the people of different ages (and we know that age has an effect
on survival) and trying (and failing) to discern an effect of
treatment.
We could have been badly misled by this model if one of the treatments
had predominantly older patients. We know that older patients have
worse survival, so the treatment with older patients would have looked
worse, even if it actually wasn’t. The model time.1
which
contained age
properly adjusted for the effect of age, so
that was the best way to see whether there was a difference between
treatments.
What you often see early on in a paper on this kind of stuff is a
graph showing that the treatment groups are similar in terms of
important things like age
. Here, that could be a boxplot:
ggplot(ovarian, aes(x = factor(rx), y = age)) + geom_boxplot()
I needed to do factor(rx)
because geom_boxplot
needs a genuine categorical variable, not just a numerical variable
masquerading as one. If you just leave it as rx
, as I
discovered, you get one boxplot of all the ages together
regardless of treatment. The key, for you as user of software, is not
(necessarily) to get it right the first time, but to know what to do
to fix up the errors you will inevitably get. If you have worked
through the boxplot examples in C32 and D29, you will have enough
experience to remember that a boxplot has to have a categorical x
(text will do, but definitely not numbers). This is why I give you so
many things to work through: so that you gain the experience to know
how to fix up problems.
Treatment 1 has a larger spread of ages and treatment 2 has a low outlier age, but the median ages are very similar.
\(\blacksquare\)
- Make a martingale residual plot for this model. Do you see any problems? Explain briefly.
Solution
The plot is just the same idea as the one in the notes. Make sure
you have survminer
installed and loaded:
ggcoxdiagnostics(time.1) + geom_smooth()
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Make a call about whether you think the smooth trend deviates too much from the red dotted line going across at zero. Martingale residuals can get very negative (and that is OK), so that residual of \(-2\) is not a problem, and this is pulling the smooth trend down a bit (it’s the reason for the dip on the right side of the smooth trend). So I’d call this OK, but you can make whatever call you like as long as it’s supported by what you see here.
I observe that the obvious fixable thing is where you have a curve
here, one that looks like a parabola (at which point you add squared
terms to your explanatory variables and see if that helps, as for
bmi
in one of the other problems). This one
is too wiggly to be a parabola (it bends twice), and so is more
like a cubic than anything.
The other thing you can note is that the grey envelope is “not significantly different from zero”, since 0 is clearly within the grey envelope all the way across.
\(\blacksquare\)
- Find the quartiles of
age
, and make a data frame containing all combinations of those two ages and the two treatments. Display what you have. (Feel free to copy the values by hand, rather than trying to save them and use them.)
Solution
I imagine you can guess what we are going to be doing with these:
predictions, so we’ll call the data frame new
when we get
there.
Quartiles first:
quantile(ovarian$age)
## 0% 25% 50% 75% 100%
## 38.89320 50.16712 56.84660 62.37810 74.50410
or, if you prefer,
%>%
ovarian summarize(
q1 = quantile(age, 0.25),
q3 = quantile(age, 0.75)
)
## q1 q3
## 1 50.16712 62.3781
The quartiles are 50.17 and 62.38 (rounding slightly).
Either way is good.
Then follow my standard procedure (or one of your own devising),
remembering that “treatment” is called rx
here:
<- c(50.17, 62.38)
ages <- c(1, 2)
rxs <- crossing(age = ages, rx = rxs)
new new
## # A tibble: 4 × 2
## age rx
## <dbl> <dbl>
## 1 50.2 1
## 2 50.2 2
## 3 62.4 1
## 4 62.4 2
\(\blacksquare\)
- Obtain predicted survival probabilities for each of your age-treatment combinations, for each of a variety of survival times. (This is only one thing, despite it sounding like a lot.)
Solution
The magic word here is survfit
. The structure is the same
as for predict
:
<- survfit(time.1, new, data = ovarian)
s summary(s)
## Call: survfit(formula = time.1, newdata = new, data = ovarian)
##
## time n.risk n.event survival1 survival2 survival3 survival4
## 59 26 1 0.993 0.997 0.959 0.981
## 115 25 1 0.985 0.993 0.911 0.959
## 156 24 1 0.973 0.988 0.846 0.928
## 268 23 1 0.959 0.981 0.777 0.893
## 329 22 1 0.932 0.969 0.653 0.826
## 353 21 1 0.905 0.956 0.548 0.764
## 365 20 1 0.877 0.943 0.452 0.701
## 431 17 1 0.843 0.926 0.356 0.630
## 464 15 1 0.805 0.908 0.271 0.557
## 475 14 1 0.768 0.888 0.202 0.489
## 563 12 1 0.701 0.853 0.117 0.382
## 638 11 1 0.634 0.816 0.064 0.292
I didn’t ask you to display it, so doing so is optional. Also, you
don’t need (here) that data=ovarian
(the predictions will
work just fine without it), but the plot coming up will not. So
my recommendation is to put it in.
Extra: The four columns survival1
through survival4
are the
survival probabilities for the times shown in the left-hand column
(these are numbers of days), for the four rows of my
new
, in the same order.
There are only a few different times, because these are the numbers of days at which somebody in the data set died, and the estimated survival probability does not change at the times in between these. (You’ll see this on a plot in a minute.)
These survival probabilities are pretty easy to eyeball: the best survival is in stratum 2, which is the younger patients in treatment 2. This we’ll come back to.
\(\blacksquare\)
- Draw a plot that compares the survival probabilities at the different times.
Solution
Thus. The conf.int=F
means to skip the confidence interval
“envelopes” that I find make the plot rather messy:
ggsurvplot(s, conf.int = F)
\(\blacksquare\)
- According to your plot, how would you describe the effects of treatment and of age?
Solution
The best survival curve, in terms of surviving longest, is up and to the right, so the green one is best and the blue one is worst.55 To figure out which those are, we have to go back to the data frame we created:
new
## # A tibble: 4 × 2
## age rx
## <dbl> <dbl>
## 1 50.2 1
## 2 50.2 2
## 3 62.4 1
## 4 62.4 2
To see the effect of age, compare strata 1 and 3 (or 2 and 4). This means comparing the red and blue curves; the red one is clearly better (in the sense of longer survival time), which means that age has a big effect on survival, with younger people living longer, other things being equal (that we saw earlier was significant). You could equally well compare the green and purple survival curves and come to the same conclusion.
To assess the effect of treatment, compare strata 1 and 2 (red and green), or strata 3 and 4 (blue and purple). In both cases, the stratum corresponding to treatment 2 has slightly better survival (has a higher chance of living for longer), but there is not as big an effect as for age. (You’ll recall that the treatment difference was not significant).
State or imply that you know which stratum is which, say something about the effects of age and of treatment, including which one is better.
Extra: recall the output from the Cox model:
summary(time.1)
## Call:
## coxph(formula = y ~ age + factor(rx), data = ovarian)
##
## n= 26, number of events= 12
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.14733 1.15873 0.04615 3.193 0.00141 **
## factor(rx)2 -0.80397 0.44755 0.63205 -1.272 0.20337
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.1587 0.863 1.0585 1.268
## factor(rx)2 0.4475 2.234 0.1297 1.545
##
## Concordance= 0.798 (se = 0.076 )
## Likelihood ratio test= 15.89 on 2 df, p=4e-04
## Wald test = 13.47 on 2 df, p=0.001
## Score (logrank) test = 18.56 on 2 df, p=9e-05
The slope coefficient for treatment 2 (as compared to the baseline treatment 1) was \(-0.83097\), negative, which meant that patients on treatment 2 had a lower hazard of death than patients on treatment 1: that is, that treatment 2 was better for survival than treatment 1. That is what the plot said also (and the relatively small difference is consistent with that difference not being significant).
\(\blacksquare\)
Heart attack.↩︎
Worcester is pronounced, by locals, Woo-stuh.↩︎
Barbecued pork in a bun. A staple of Chinese dim sum and Chinese bakeries, such as Ding Dong bakery on Spadina.↩︎
Heart attack.↩︎
Worcester is pronounced, by locals, Woo-stuh.↩︎
Not exactly the same as that output, because it is doing a test that would be the same if you had an infinitely large sample, but is slightly different with an ordinary finite number of observations.↩︎
Our categorical variable gender has only two levels.↩︎
It doesn’t work with text columns, but it does work if you temporarily turn the text columns into factors, eg. by using
mutate
withwhere
. However, we don’t have any text columns here, so what we do here is good for this data set.↩︎case_when
is much clearer than using nestedif-else
s when you have three or more categories, as forherco
.↩︎Some people define the response variable right inside the
coxph
, in the same way as putting something likelog(y)
as a response in anlm
, but I think, especially while you’re getting used to the process, it’s better to create the response variable first and look at it to make sure it’s the right thing.↩︎The researchers were probably relieved that there was not quite a significant effect of
site
.↩︎Not at the 0.05 level, anyway.↩︎
Barbecued pork in a bun. A staple of Chinese dim sum and Chinese bakeries, such as Ding Dong bakery on Spadina.↩︎
Don’t confuse this with
glance
frombroom
, which gives a one-line summary of a model, containing things like R-squared and a test for the overall model significance.↩︎Because summarize will only allow you to have a single-number answer.↩︎
The way, as we have seen elsewhere, is to use {} or {}, which produce a two-column data frame with the percentiles shown.↩︎
Strata is plural; the singular is stratum. Like data and datum.↩︎
Internally, printing a
Surv
thing calls the methodprint.Surv
, which is defined in thesurvival
package.↩︎In some respects, it’s like that two-column response in logistic regression, where each row of the data frame represents several observations pooled together.↩︎
In other cases, having the event happen sooner is better. For example, you might be poisoning rats, in which case you want them to die quicker. Or the event might be something desirable like becoming qualified to fly an airplane. In those cases, down and to the left is better.↩︎