```
library(survival)
library(tidyverse)
library(survminer)
```

# 30 Survival analysis

## 30.1 The Worcester survey

The Worcester survey was a long-term study of all myocardial-infarction^{1} victims admitted to hospitals in the Worcester, Massachusetts area.^{2} The data have been well studied, and can be found in the file link.

- Read the data and display the first few rows of the data frame. You might get an extra column, which you can ignore. For your information, the variables are:

patient ID code

admission date

date of last followup (this is the date of death if the patient died)

length of hospital stay (days)

followup time (days) (time between admission and last followup)

followup status: 1=dead, 0=alive

Age in years (at admission)

gender (0=male, 1=female)

body mass index (kg/m\(^2\))

Create a suitable response variable for a Cox proportional hazards model for time of survival, using the followup time and followup status.

Fit a Cox proportional hazards model predicting survival time from age, gender and BMI. Obtain the

`summary`

(but you don’t need to comment on it yet).Test the overall fit of the model. What does the result mean?

Can any of your explanatory variables be removed from the model? Explain briefly.

Remove your most non-significant explanatory variable from the model and fit again. Take a look at the results. Are all your remaining explanatory variables significant? (If all your explanatory variables were previously significant, you can skip this part.)

Calculate the 1st quartile, median, and 3rd quartiles of age and BMI. (

`quantile`

.) Round these off to the nearest whole number. (Do the rounding off yourself, though R has a function`round`

that does this, which you can investigate if you want.) As an alternative, you can get these by passing the whole data frame, or the columns of it you want, into`summary`

.Make a data frame out of all the combinations of age and BMI values (that you obtained in the previous part) suitable for predicting with.

Obtain predicted survival probabilities for each of the values in your new data frame. Use your best model. (You don’t need to look at the results, though you can if you want to.)

Make a graph depicting the survival curves from

`survfit`

with different colours distinguishing the different survival curves.What is the effect of age on survival? What is the effect of BMI on survival? Explain briefly. (You will have to disentangle the meaning of the different coloured lines on the plot to do this.)

## 30.2 Drug treatment programs

One of the goals of drug treatment programs is to lengthen the time until the patient returns to using drugs. (It is not generally possible to prevent patients from *ever* using drugs again.) In one study, over 600 former drug users took part. Two different programs, a short program and a long program, were offered at two different sites, labelled A and B. The data can be found in link. The variables are these:

`ID`

: patient ID number`age`

: patient age at enrollment into the study`ndrugtx`

: number of previous drug treatments`treat`

: 0 for short treatment program, 1 for long program`site`

: 0 for site A, 1 for site B`time`

: time until return to drug use`censor`

: whether the subject returned to drug use (1) or not (0) during the follow-up period`herco`

: whether subject used heroine or cocaine in the last 3 months: 1 is both, 2 is one (either heroine or cocaine), 3 is neither.

Read in the data and check in one way or another that you have what was promised above.

There are some missing values in the dataframe. Demonstrate this using

`summary`

. Pipe the dataframe into`drop_na`

and show that they have gone. (`drop_na`

removes all rows that have missing values in them.)Some of these variables are recorded as numbers but are actually categorical. Which ones? Re-define these variables in your data frame so that they have sensible (text) values.

Create a suitable reponse variable for a Cox proportional hazards regression that predicts time until return to drug use from the other variables. This requires some care, because you need to be sure about what the censoring variable actually represents and what you need it to represent.

Look at the first few values of your response variable. Why is the fifth one marked with a

`+`

? Explain briefly.Fit a Cox proportional hazards model, predicting from all the other variables (except for

`row`

and`ID`

) that you haven’t used yet. Display the results.Find which explanatory variables can be removed at \(\alpha=0.05\) (there should be two of them). Bear in mind that we have categorical variables, so that looking at the output from

`summary`

is not enough.Remove

*all*the non-significant explanatory variables and re-fit your model. By carrying out a suitable test demonstrate that your smaller model is the better one.* Display your better model. Are all of the explanatory variables significant? Do their slope coefficients have sensible signs (plus or minus), based on what you know or can guess about drug treatments? Explain briefly.

We have three variables left in our model,

`age`

,`ndrugtx`

and`treat`

. The quartiles of age are 27 and 37, the quartiles of`ndrugtx`

are 1 and 6, and the two possible values of`treat`

are`short`

and`long`

. Create a data frame with variables of these names and all possible combinations of their values (so there should be 8 rows in the resulting data frame). Display the resulting data frame.Obtain predicted survival probabilities for each of the values of

`age`

,`ndrugtx`

and`treat`

used in the previous part. You don’t need to display it (we are going to plot it shortly).Plot your predicted survival curves.

Which of your combinations of values is predicted to take the longest to return to drug use? Which is predicted to take the shortest time? Explain briefly.

Are your survival curve plot and your conclusions from part (here) consistent, or not? Explain briefly.

## 30.3 Multiple myeloma

Multiple myeloma is a kind of cancer. It forms in a plasma cell (which is a type of white blood cell). It causes cancer cells to accumulate in the bone marrow, where they crowd out healthy blood cells. Plasma cells make antibodies (to help fight infections), while the cancer cells don’t: they produce abnormal proteins that can cause kidney problems. (This adapted from link.) The variables are:

`time`

: survival time from diagnosis (months)`vstatus`

: 0=alive, 1=dead at end of study`logbun`

: log of BUN test score (BUN test is a test of kidney function, not to be confused with cha siu bao^{3}).`hgb`

: hemoglobin (at diagnosis).`platelet`

: platelets: 1=normal, 0=abnormal (at diagnosis).`age`

at diagnosis, in years`logwbc`

: log of WBC (white blood cell count, at diagnosis)`frac`

: fractures at diagnosis (0=absent, 1=present)`logpbm`

: log of percent of plasma cells in bone marrow`protein`

: proteinuria (protein in urine) at diagnosis. Most people have very little, so a larger than normal amount indicates illness of some kind.`scalc`

: serum calcium at diagnosis.

The data, on 65 patients with multiple myeloma, are in link. Some of the variables are logs because they could take very large values.

There are a lot of parts here, but each part is supposed to be short.

Read in the data and display (some of) the values. Confirm that you have the right number of observations and the right variables.

Create a suitable response variable for a Cox proportional-hazards survival model, bearing in mind that the “event” here is death. Display your response variable, and explain briefly what the

`+`

signs attached to some of the values mean, without using a technical term.What is the technical term for those patients that have a

`+`

by their values for the response variable?Fit a Cox proportional-hazards survival model predicting your response variable from all the other variables (except for the ones that you used to make the response variable). Display the

`summary`

of your model.In your model, which explanatory variables have a P-value less than 0.10? Fit a model containing only those and display the results.

Do a test to compare the two models that you fit. Why do you prefer the second model? Explain briefly.

There should be two explanatory variables left in your model. These are both numerical variables. Find their first and third quartiles, any way you like.

Create a data frame containing all possible combinations of the two quartiles for each of the two variables, and display the result.

Obtain predicted survival probabilities for each of the combinations of variables you created above. You don’t need to look at the results (they are rather long).

Obtain a graph of the predicted survival curves for each combination of your variables.

Is it better to have high or low values for each of the variables in your prediction? Explain briefly.

## 30.4 Ovarian cancer

R’s `survival`

package contains several data sets. One of these is called `ovarian`

; it comes from a study of 26 ovarian cancer patients. The major purpose of this study was to compare the effects of two treatments on survival time.

Obtain and display (all of) the data set. This is as simple as loading the package and typing the data set’s name.

The columns of interest to us are:

`futime`

: the time for which a patient was followed-up: the number of days until either they died or the study ended (or they withdrew from the study for some other reason).`fustat`

: follow-up status: 1 if the patient died of ovarian cancer, 0 if they were still alive when the study ended.`age`

: of patient, at diagnosis, in years`rx`

: treatment, numbered 1 or 2, but really labels for the two treatments.

Create and display a suitable response variable `y`

for a Cox proportional-hazards model.

In the display of your response variable, some values are marked with a

`+`

. Why is that? Explain briefly. (If you use a technical term, you should explain what it means.)Fit a Cox proportional-hazards model for predicting survival time from age and treatment. Note that the numeric values for treatment make sense only as labels for the two treatments, so in your model formula make treatment into a factor. Display the results.

Is there a significant difference between the treatments in terms of their effects on survival (from ovarian cancer)?

Is there a significant effect of age? If there is, describe the effect that age has on survival.

Make a martingale residual plot for this model. Do you see any problems? Explain briefly.

Find the quartiles of

`age`

, and make a data frame containing all combinations of those two ages and the two treatments. Display what you have. (Feel free to copy the values by hand, rather than trying to save them and use them.)Obtain predicted survival probabilities for each of your age-treatment combinations, for each of a variety of survival times. (This is only one thing, despite it sounding like a lot.)

Draw a plot that compares the survival probabilities at the different times.

According to your plot, how would you describe the effects of treatment and of age?

My solutions follow:

## 30.5 The Worcester survey

The Worcester survey was a long-term study of all myocardial-infarction^{4} victims admitted to hospitals in the Worcester, Massachusetts area.^{5} The data have been well studied, and can be found in the file link.

- Read the data and display the first few rows of the data frame. You might get an extra column, which you can ignore. For your information, the variables are:

patient ID code

admission date

date of last followup (this is the date of death if the patient died)

length of hospital stay (days)

followup time (days) (time between admission and last followup)

followup status: 1=dead, 0=alive

Age in years (at admission)

gender (0=male, 1=female)

body mass index (kg/m\(^2\))

Solution

```
<- "http://ritsokiguess.site/datafiles/whas100.csv"
my_url <- read_csv(my_url) whas100
```

```
New names:
Rows: 100 Columns: 10
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(2): admitdate, foldate dbl (8): ...1, id, los, lenfol, fstat, age, gender, bmi
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
```

` whas100`

I seem to have an extra column called `X1`

. This is because I saved my version of the data using the old `write.csv`

, which comes with row names, and I forgot to get rid of them. These came back as an extra unnamed variable to which `read_delim`

gave the name `X1`

.

\(\blacksquare\)

- Create a suitable response variable for a Cox proportional hazards model for time of survival, using the followup time and followup status.

Solution

`Surv`

. The event here is death, so the two parts of the response variable are followup time `lenfol`

and followup status, 1 being “dead”, `fstat`

:

```
<- with(whas100, Surv(lenfol, fstat == 1))
y y
```

```
[1] 6 374 2421 98 1205 2065 1002 2201 189 2719+ 2638+ 492
[13] 302 2574+ 2610+ 2641+ 1669 2624 2578+ 2595+ 123 2613+ 774 2012
[25] 2573+ 1874 2631+ 1907 538 104 6 1401 2710 841 148 2137+
[37] 2190+ 2173+ 461 2114+ 2157+ 2054+ 2124+ 2137+ 2031 2003+ 2074+ 274
[49] 1984+ 1993+ 1939+ 1172 89 128 1939+ 14 1011 1497 1929+ 2084+
[61] 107 451 2183+ 1876+ 936 363 1048 1889+ 2072+ 1879+ 1870+ 1859+
[73] 2052+ 1846+ 2061+ 1912+ 1836+ 114 1557 1278 1836+ 1916+ 1934+ 1923+
[85] 44 1922+ 274 1860+ 1806 2145+ 182 2013+ 2174+ 1624 187 1883+
[97] 1577 62 1969+ 1054
```

Just using `fstat`

alone as the second thing in `Surv`

also works, because anything that gives `TRUE`

or 1 when the event (death) occurs is equally good. (In R, `TRUE`

as a number is 1 and `FALSE`

as a number is 0.)

I listed the values by way of checking. The ones with a `+`

are censored: that is, the patient was still alive the last time the doctor saw them. Most of the censored values are longer times. Usually this happens because the patient was still alive at the end of the study.

This is perhaps now the old way of doing it, because you can now create `y`

as a new column in your dataframe:

`%>% mutate(y = Surv(lenfol, fstat == 1)) whas100 `

If you scroll across, this has a column `y`

that contains the same values as the stand-alone `y`

we defined earlier, including plus signs for censored ones. At the top of the column is an indication that this is a `Surv`

object: that is, not just a column of numbers, but something that also contains censorship information.

\(\blacksquare\)

- Fit a Cox proportional hazards model predicting survival time from age, gender and BMI. Obtain the
`summary`

(but you don’t need to comment on it yet).

Solution

This, using the response variable that we just created:

```
.1 <- coxph(y ~ age + gender + bmi, data = whas100)
whas100summary(whas100.1)
```

```
Call:
coxph(formula = y ~ age + gender + bmi, data = whas100)
n= 100, number of events= 51
coef exp(coef) se(coef) z Pr(>|z|)
age 0.03713 1.03783 0.01272 2.918 0.00352 **
gender 0.14325 1.15402 0.30604 0.468 0.63973
bmi -0.07083 0.93162 0.03607 -1.964 0.04956 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.0378 0.9636 1.0123 1.0640
gender 1.1540 0.8665 0.6334 2.1024
bmi 0.9316 1.0734 0.8680 0.9999
Concordance= 0.683 (se = 0.037 )
Likelihood ratio test= 21.54 on 3 df, p=8e-05
Wald test = 19.46 on 3 df, p=2e-04
Score (logrank) test = 20.82 on 3 df, p=1e-04
```

\(\blacksquare\)

- Test the overall fit of the model. What does the result mean?

Solution

Look at those three P-values at the bottom. They are all small, so something in the model is helping to predict survival. As to what? Well, that’s the next part.

\(\blacksquare\)

- Can any of your explanatory variables be removed from the model? Explain briefly.

Solution

`gender`

has a (very) large P-value, so that can be taken out of the model. The other two variables have small P-values (`bmi`

only just under 0.05), so they need to stay. The other way to think about this is `step`

, or `drop1`

:

`drop1(whas100.1, test = "Chisq")`

This is here equivalent to^{6} the output from `summary`

, but where it scores is if you have a categorical explanatory variable like “treatment” with more than two levels: `drop1`

will tell you about keeping or dropping it as a whole.^{7}

If you prefer:

`step(whas100.1, trace = 0, test = "Chisq")`

```
Call:
coxph(formula = y ~ age + bmi, data = whas100)
coef exp(coef) se(coef) z p
age 0.03927 1.04005 0.01187 3.309 0.000938
bmi -0.07116 0.93131 0.03614 -1.969 0.048952
Likelihood ratio test=21.32 on 2 df, p=2.346e-05
n= 100, number of events= 51
```

`gender`

comes out, but the others stay. As usual, put `trace=1`

or `trace=2`

to get more output, which will look like a sequence of `drop1`

’s one after the other.

\(\blacksquare\)

- Remove your most non-significant explanatory variable from the model and fit again. Take a look at the results. Are all your remaining explanatory variables significant? (If all your explanatory variables were previously significant, you can skip this part.)

Solution

So, take out `gender`

:

```
.2 <- update(whas100.1, . ~ . - gender)
whas100summary(whas100.2)
```

```
Call:
coxph(formula = y ~ age + bmi, data = whas100)
n= 100, number of events= 51
coef exp(coef) se(coef) z Pr(>|z|)
age 0.03927 1.04005 0.01187 3.309 0.000938 ***
bmi -0.07116 0.93131 0.03614 -1.969 0.048952 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.0401 0.9615 1.0161 1.0645
bmi 0.9313 1.0738 0.8676 0.9997
Concordance= 0.681 (se = 0.037 )
Likelihood ratio test= 21.32 on 2 df, p=2e-05
Wald test = 19 on 2 df, p=7e-05
Score (logrank) test = 19.99 on 2 df, p=5e-05
```

Both explanatory variables are significant: `age`

definitely, `bmi`

only just. This is the same model as `step`

gave me.

\(\blacksquare\)

- Calculate the 1st quartile, median, and 3rd quartiles of age and BMI. (
`quantile`

.) Round these off to the nearest whole number. (Do the rounding off yourself, though R has a function`round`

that does this, which you can investigate if you want.) As an alternative, you can get these by passing the whole data frame, or the columns of it you want, into`summary`

.

Solution

`quantile(whas100$age)`

```
0% 25% 50% 75% 100%
32.00 59.75 71.00 80.25 92.00
```

`quantile(whas100$bmi)`

```
0% 25% 50% 75% 100%
14.91878 23.53717 27.19158 30.34770 39.93835
```

or

```
%>%
whas100 select(age, bmi) %>%
summary()
```

```
age bmi
Min. :32.00 Min. :14.92
1st Qu.:59.75 1st Qu.:23.54
Median :71.00 Median :27.19
Mean :68.25 Mean :27.04
3rd Qu.:80.25 3rd Qu.:30.35
Max. :92.00 Max. :39.94
```

Or, pure tidyverse: summarize all the columns (after you’ve done the `select`

).

```
%>%
whas100 select(age, bmi) %>%
reframe(across(everything(), \(x) quantile(x)))
```

This one is `reframe`

rather than `summarize`

because the “summaries” are five numbers rather than one.

Using whichever of this multitude of ways appeals to you:

60, 71 and 80 for age, 24, 27 and 30 for BMI.

\(\blacksquare\)

- Make a data frame out of all the combinations of age and BMI values (that you obtained in the previous part) suitable for predicting with.

Solution

The inevitable `datagrid`

. This is probably quickest, with the best model being the second one:

```
<- datagrid(model = whas100.2, age = c(60, 71, 80), bmi = c(24, 27, 30))
whas100.new whas100.new
```

Extra: I set it up this way so that you would find the median and quartiles and then type the values into the `datagrid`

(easier conceptually), but there is nothing stopping us doing it all in one step:

```
datagrid(model = whas100.2,
age = quantile(whas100$age, c(0.25, 0.5, 0.75)),
bmi = quantile(whas100$bmi, c(0.25, 0.5, 0.75))) %>%
mutate(across(everything(), \(x) round(x)))
```

The last line rounds everything off to the (default) 0 decimal places. The repetitiousness of the preceding two lines makes me wonder whether I should have written a function.

\(\blacksquare\)

- Obtain predicted survival probabilities for each of the values in your new data frame. Use your best model. (You don’t need to look at the results, though you can if you want to.)

Solution

The magic word is `survfit`

(which plays the role of `predictions`

here). The best model is `whas100.2`

, with the non-significant `gender`

removed:

`<- survfit(whas100.2, whas100.new, data = whas100) pp2 `

This doesn’t need the `data=`

at the end (it works perfectly well without), but the plot (later) seems to need it to be there. I think the plot needs the information from the original data to be in the predictions somewhere.

This is kind of long to look at (`summary(pp2)`

would be the thing), so we will need to make a graph of it. I gave it a name, since I want to use it again later.

\(\blacksquare\)

- Make a graph depicting the survival curves from
`survfit`

with different colours distinguishing the different survival curves.

Solution

This is actually easy once you work out what to do:

`ggsurvplot(pp2, conf.int = FALSE)`

```
Warning: `gather_()` was deprecated in tidyr 1.2.0.
ℹ Please use `gather()` instead.
ℹ The deprecated feature was likely used in the survminer package.
Please report the issue at <https://github.com/kassambara/survminer/issues>.
```

Without the `conf.int`

thing, you get confidence intervals for each survival curve, which overlap each other and generally make the plot look messy.

The “strata” are the different age-BMI combinations that you predicted for, so it’s usually a good idea to list the “new” prediction data frame, either here or when you assess the effects of the variables (next part) so that you can see which is which:

` whas100.new`

\(\blacksquare\)

- What is the effect of age on survival? What is the effect of BMI on survival? Explain briefly. (You will have to disentangle the meaning of the different coloured lines on the plot to do this.)

Solution

Bear in mind that up-and-to-the-right is best for a survival curve, since that means that people in the upper-right group have a higher chance of surviving for longer.

The best survival curve is therefore the olive-green one. According to the legend, this goes with stratum 3, which is (according to the listing of `whas100.new`

) age 60 (the youngest) and BMI 30 (the highest). So it looks as if the best survival goes with a lower age (not surprising) and a higher BMI (surprising; see discussion about BMI below).

You can also leave one variable constant and see the effects of changing the other one. Let’s pick the oldest age 80: the BMI values are 24 (stratum 7, blue), 27 (stratum 8, purple), 30 (stratum 9, pink). These survival curves are the bottom one, the second bottom one, and the fourth bottom one. At this age, survival chances are not great, but having a higher BMI goes with a greater chance of surviving longer.

Or pick a BMI, say 30. These are strata 3 (olive green), 6 (light blue) and 9 (pink) respectively for ages 60, 71 and 80. These are the best, 3rd best and 5th best survival curves; that is, as age increases, the chance of surviving a long time decreases.

The effect of BMI, though, seems backwards: a higher BMI is associated with a *higher* chance of survival.

That’s the end of what I wanted you to do, but:

A higher BMI is usually associated with being obese (and therefore unhealthy), so you’d expect the effect of BMI to be the other way around. According to Wikipedia (link), the BMI values here are “overweight” or close to it. Maybe being heavier helps the body recover from a heart attack.

Let’s start with the martingale residual plot:

`ggcoxdiagnostics(whas100.2) + geom_smooth()`

```
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
```

There is a suspicion of bendiness here, though the left side of the curve is entirely because of that one positive residual on the left. In any case, this suggests that nonlinearity (evidently in terms of BMI, since that’s the relationship that currently makes no sense) would be worth exploring.

Thus:

```
.3 <- update(whas100.2, . ~ . + I(bmi^2))
whas100summary(whas100.3)
```

```
Call:
coxph(formula = y ~ age + bmi + I(bmi^2), data = whas100)
n= 100, number of events= 51
coef exp(coef) se(coef) z Pr(>|z|)
age 0.040542 1.041375 0.012035 3.369 0.000755 ***
bmi -0.848949 0.427864 0.231562 -3.666 0.000246 ***
I(bmi^2) 0.014500 1.014606 0.004227 3.430 0.000603 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.0414 0.9603 1.0171 1.0662
bmi 0.4279 2.3372 0.2718 0.6736
I(bmi^2) 1.0146 0.9856 1.0062 1.0230
Concordance= 0.693 (se = 0.04 )
Likelihood ratio test= 30.71 on 3 df, p=1e-06
Wald test = 32.56 on 3 df, p=4e-07
Score (logrank) test = 36.57 on 3 df, p=6e-08
```

Ah, that seems to be it. The significant positive coefficient on `bmi`

-squared means that the “hazard of dying” increases faster with increasing `bmi`

, so there ought to be an optimal BMI beyond which survival chances decrease again. Have we improved the residuals by adding the squared term?

`ggcoxdiagnostics(whas100.3) + geom_smooth()`

```
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
```

I call those “inconsequential wiggles” now, so I think we are good. Let’s explore the quadratic relationship on a graph.

I’m going to focus on a close-to-median age of 70, since, in this model, the effect of BMI is the same for all ages (to make it different, we would need an interaction term, ANOVA-style).

First we create a data frame with a bunch of different BMIs in, and one age 70:

```
<- seq(20, 36, 4)
bmis <- 70
ages .2 <- datagrid(model = whas100.3, bmi = bmis, age = ages)
whas100.new.2 whas100.new
```

It is rather absurd to have a plural `ages`

with only one age in it, but that’s the way it goes, if you’re me and trying to avoid thinking.

Predictions, using the model with the squared term in it:

`<- survfit(whas100.3, whas100.new.2, data = whas100) pp3 `

And then the plot:

`ggsurvplot(pp3, conf.int = F)`

and the customary reminder of which stratum is which, with its rather ungainly name:

`.2 whas100.new`

This time, the green survival curve is best, stratum 3, which means that survival is best at BMI 28, and worse for both higher BMIs and lower BMIs. You can follow the sequence of colours: red, olive-green, green, blue, pink, that goes up and then down again. But it’s still true that having a very *low* BMI is worst, which is why our (linear) model said that having a higher BMI was better.

It would have been better to have you put a squared term in the model, but the question was already long and complicated enough, and I didn’t want to make your lives more of a nightmare than they are already becoming!

\(\blacksquare\)

## 30.6 Drug treatment programs

One of the goals of drug treatment programs is to lengthen the time until the patient returns to using drugs. (It is not generally possible to prevent patients from *ever* using drugs again.) In one study, over 600 former drug users took part. Two different programs, a short program and a long program, were offered at two different sites, labelled A and B. The data can be found in link. The variables are these:

`ID`

: patient ID number`age`

: patient age at enrollment into the study`ndrugtx`

: number of previous drug treatments`treat`

: 0 for short treatment program, 1 for long program`site`

: 0 for site A, 1 for site B`time`

: time until return to drug use`censor`

: whether the subject returned to drug use (1) or not (0) during the follow-up period`herco`

: whether subject used heroine or cocaine in the last 3 months: 1 is both, 2 is one (either heroine or cocaine), 3 is neither.

- Read in the data and check in one way or another that you have what was promised above.

Solution

This:

```
<- "http://ritsokiguess.site/datafiles/drugusers.txt"
my_url <- read_delim(my_url, " ") drugusers
```

```
Rows: 628 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
dbl (9): row, ID, age, ndrugtx, treat, site, time, censor, herco
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

` drugusers`

This shows that you have over 600 rows and the variables described.

\(\blacksquare\)

- There are some missing values in the dataframe. Demonstrate this using
`summary`

. Pipe the dataframe into`drop_na`

and show that they have gone. (`drop_na`

removes all rows that have missing values in them.)

Solution

First off, `summary`

is a quick way to show how many missing values there are:^{8}

`summary(drugusers)`

```
row ID age ndrugtx
Min. : 1.0 Min. : 1.0 Min. :20.00 Min. : 0.000
1st Qu.:157.8 1st Qu.:157.8 1st Qu.:27.00 1st Qu.: 1.000
Median :314.5 Median :314.5 Median :32.00 Median : 3.000
Mean :314.5 Mean :314.5 Mean :32.37 Mean : 4.574
3rd Qu.:471.2 3rd Qu.:471.2 3rd Qu.:37.00 3rd Qu.: 6.000
Max. :628.0 Max. :628.0 Max. :56.00 Max. :40.000
NA's :5 NA's :17
treat site time censor
Min. :0.0000 Min. :0.000 Min. : 2.0 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.000 1st Qu.: 79.0 1st Qu.:1.0000
Median :0.0000 Median :0.000 Median : 166.0 Median :1.0000
Mean :0.4904 Mean :0.293 Mean : 234.7 Mean :0.8089
3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.: 365.2 3rd Qu.:1.0000
Max. :1.0000 Max. :1.000 Max. :1172.0 Max. :1.0000
herco
Min. :1.000
1st Qu.:1.000
Median :2.000
Mean :1.898
3rd Qu.:3.000
Max. :3.000
```

Age has five missing values and “number of previous drug treatments” has seventeen.

Following the instructions, and saving back into the original dataframe:

`%>% drop_na() -> drugusers drugusers `

and then

`summary(drugusers)`

```
row ID age ndrugtx
Min. : 1.0 Min. : 1.0 Min. :20.00 Min. : 0.000
1st Qu.:155.2 1st Qu.:155.2 1st Qu.:27.00 1st Qu.: 1.000
Median :312.5 Median :312.5 Median :32.00 Median : 3.000
Mean :313.8 Mean :313.8 Mean :32.39 Mean : 4.579
3rd Qu.:473.8 3rd Qu.:473.8 3rd Qu.:37.00 3rd Qu.: 6.000
Max. :628.0 Max. :628.0 Max. :56.00 Max. :40.000
treat site time censor
Min. :0.0000 Min. :0.0000 Min. : 2.0 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 79.0 1st Qu.:1.0000
Median :0.0000 Median :0.0000 Median : 166.0 Median :1.0000
Mean :0.4918 Mean :0.2984 Mean : 234.4 Mean :0.8115
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 361.8 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000 Max. :1172.0 Max. :1.0000
herco
Min. :1.00
1st Qu.:1.00
Median :2.00
Mean :1.89
3rd Qu.:3.00
Max. :3.00
```

No NA left. Gosh, as they say, that was easy. Extra: how many rows did we lose?

`nrow(drugusers)`

`[1] 610`

There were 628 rows before, so we lost 18. (There were 22 missing values, but some of them were two on one row, so we only lost 18 rows.)

This is a very unsophisticated way of dealing with missing values. Another way is to “impute” them, that is, to guess what they would have been, and then fill in the guessed values and use them as if they were the truth, for example by regressing the columns with missing values on all the others, and using the regression predictions in place of the missing values.

\(\blacksquare\)

- Some of these variables are recorded as numbers but are actually categorical. Which ones? Re-define these variables in your data frame so that they have sensible (text) values.

Solution

These variables are actually categorical rather than quantitative:

`treat`

`site`

`censor`

`herco`

Most of them have only two levels, so it doesn’t matter whether we make them categorical or leave them as numbers, but for `herco`

it matters. Let’s give them all sensible values, mostly with `ifelse`

,^{9} thus:

```
%>% mutate(
drugusers treat = ifelse(treat == 0, "short", "long"),
site = ifelse(site == 0, "A", "B"),
censor = ifelse(censor == 1, "returned", "no-return"),
herco = case_when(
== 1 ~ "both",
herco == 2 ~ "one",
herco == 3 ~ "neither"
herco
)-> drugusers )
```

I’m living on the edge and overwriting everything:

` drugusers`

\(\blacksquare\)

- Create a suitable reponse variable for a Cox proportional hazards regression that predicts time until return to drug use from the other variables. This requires some care, because you need to be sure about what the censoring variable actually represents and what you need it to represent.

Solution

This is `Surv`

in package `survival`

. The response variable needs to encode two things: the time until the event of interest (return to drug use) and whether or not that event happened for each patient.^{10} In this case, that is `censor="returned"`

.

`<- with(drugusers, Surv(time, censor == "returned")) y `

Use whatever name you gave to the level of `censor`

that means “returned to drug use”.

Once again, there is no problem with adding a new column `y`

to your dataframe, thus:

```
%>%
drugusers mutate(y = Surv(time, censor == "returned"))
```

Add your response as a stand-alone vector or as a new column; your choice.

\(\blacksquare\)

- Look at the first few values of your response variable. Why is the fifth one marked with a
`+`

? Explain briefly.

Solution

`head`

works as well with a vector (displaying the first six values) as it does with a data frame:

`head(y)`

`[1] 188 26 207 144 551+ 32 `

The fifth value is marked with a `+`

because it is a censored value: this is a patient who was never observed to go back to drug use. You can tell this by looking at the `head`

of the entire data frame:

`head(drugusers)`

since this patient has `censor="no-return"`

. The other ones have `censor="returned"`

; these are all “uncensored” in the jargon.

If you added a new column `y`

to your dataframe, you can see all this in one go by looking at the fifth row of the dataframe.

Typically, censored values will be bigger than uncensored ones, because (in general) the individual will be observed until the study ends, and studies of this kind carry on for years:

`ggplot(drugusers, aes(x = censor, y = time)) + geom_boxplot()`

Yep. The smallest time for a censored observation would be an upper outlier if it were observed for an uncensored observation.

One nice side-effect of turning `censor`

into a categorical variable is that it can now distinguish groups as a boxplot requires.

I discovered something rather amusing when I originally wrote this (a year ago). Suppose you want to compare times for the two treatment groups, and you *also* want to distinguish censored from non-censored observations. Then, this works:

```
ggplot(drugusers, aes(x = treat, y = time, colour = censor)) +
geom_boxplot()
```

For each treatment, you get side-by-side boxplots of the times for censored (red) and uncensored (blue) observations, and so you see for both treatments (short and long) the censored times are typically longer than the uncensored ones.

(This you may recognize as a “grouped boxplot”, for when we have two categorical variables and one quantitative one.)

I borrow this idea for two-way ANOVA (coming up later).

\(\blacksquare\)

- Fit a Cox proportional hazards model, predicting from all the other variables (except for
`row`

and`ID`

) that you haven’t used yet. Display the results.

Solution

```
.1 <- coxph(y ~ age + ndrugtx + treat + site + herco, data = drugusers)
druguserssummary(drugusers.1)
```

```
Call:
coxph(formula = y ~ age + ndrugtx + treat + site + herco, data = drugusers)
n= 610, number of events= 495
coef exp(coef) se(coef) z Pr(>|z|)
age -0.023798 0.976483 0.007561 -3.148 0.00165 **
ndrugtx 0.034815 1.035429 0.007755 4.490 7.14e-06 ***
treatshort 0.254606 1.289953 0.091006 2.798 0.00515 **
siteB -0.173021 0.841120 0.102105 -1.695 0.09016 .
herconeither 0.125779 1.134032 0.103075 1.220 0.22236
hercoone 0.247318 1.280586 0.122759 2.015 0.04394 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 0.9765 1.0241 0.9621 0.9911
ndrugtx 1.0354 0.9658 1.0198 1.0513
treatshort 1.2900 0.7752 1.0792 1.5418
siteB 0.8411 1.1889 0.6886 1.0275
herconeither 1.1340 0.8818 0.9266 1.3879
hercoone 1.2806 0.7809 1.0067 1.6289
Concordance= 0.581 (se = 0.014 )
Likelihood ratio test= 35.08 on 6 df, p=4e-06
Wald test = 36.96 on 6 df, p=2e-06
Score (logrank) test = 37.36 on 6 df, p=1e-06
```

Another way to handle “all the other \(x\)’s except `row`

, `ID`

, `time`

and `censor`

” is this:

```
.1a <- coxph(y ~ . - row - ID - time - censor, data = drugusers)
druguserstidy(drugusers.1a)
```

Same. I used `tidy`

from `broom`

to shorten the output a bit.

\(\blacksquare\)

- Find which explanatory variables can be removed at \(\alpha=0.05\) (there should be two of them). Bear in mind that we have categorical variables, so that looking at the output from
`summary`

is not enough.

Solution

The hint is meant to suggest to you that looking at `drop1`

is the right way to go:

`drop1(drugusers.1, test = "Chisq")`

Note that `herco`

, a categorical variable with three levels, has 2 degrees of freedom here, since a test of “no effect of `herco`

” is testing that survival is the same at *all three* levels of `herco`

.

\(\blacksquare\)

- Remove
*all*the non-significant explanatory variables and re-fit your model. By carrying out a suitable test demonstrate that your smaller model is the better one.

Solution

`site`

and `herco`

are the two variables to come out.^{11} I like `update`

, but there is no problem about copying-pasting your `coxph`

and taking out what you no longer need.

`.2 <- update(drugusers.1, . ~ . - site - herco) drugusers`

Having fit two models, we can use `anova`

to compare them. The right test gets done, so no need for `test=`

:

`anova(drugusers.2, drugusers.1)`

There is no significant difference between these two models,^{12} so we can go with the smaller, simpler one (with just `age`

, `ndrugtx`

and `treat`

).

\(\blacksquare\)

- * Display your better model. Are all of the explanatory variables significant? Do their slope coefficients have sensible signs (plus or minus), based on what you know or can guess about drug treatments? Explain briefly.

Solution

`summary(drugusers.2)`

```
Call:
coxph(formula = y ~ age + ndrugtx + treat, data = drugusers)
n= 610, number of events= 495
coef exp(coef) se(coef) z Pr(>|z|)
age -0.020801 0.979414 0.007419 -2.804 0.00505 **
ndrugtx 0.035567 1.036207 0.007621 4.667 3.05e-06 ***
treatshort 0.231055 1.259929 0.090175 2.562 0.01040 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 0.9794 1.0210 0.9653 0.9938
ndrugtx 1.0362 0.9651 1.0208 1.0518
treatshort 1.2599 0.7937 1.0558 1.5035
Concordance= 0.572 (se = 0.014 )
Likelihood ratio test= 27.87 on 3 df, p=4e-06
Wald test = 30.5 on 3 df, p=1e-06
Score (logrank) test = 30.62 on 3 df, p=1e-06
```

The three remaining explanatory variables are all clearly significant: the patient’s age, the number of previous drug treatments, and whether the treatment was short or long. This is legit (we don’t need to run `drop1`

again) because the remaining explanatory variables are all quantitative or have only two levels, so that the single-df tests in `summary`

are what we need.

Do their slope coefficients have sensible signs? Well, this requires careful thought. A positive coefficient means that increasing that variable *increases the hazard of the event*: ie., it makes the event likelier to happen sooner. Here, the “event” is “return to drug use”:

Age has a negative coefficient, so an older person is likely to take

*longer*to return to drug use, other things being equal. This makes some kind of sense, if you imagine drug use as being related to maturity, or an older drug user as being more strongly committed to “turning their life around”, so that a drug treatment of any kind is going to be more effective on an older patient.The number of previous treatments has a positive coefficient, so that a patient who has had a lot of previous treatments is likely to go back to drugs sooner. Such a person might be an “addict” for whom treatments really do not work, or might not be committed to giving up drugs.

`treatshort`

has a positive coefficient. This says that if you give a patient a short treatment, they are more likely (other things being equal) to go back to drugs sooner, as compared to the baseline long treatment. That is, a longer treatment is more effective than a shorter one. Given a significant effect of treatment length, this is the way around you would expect it to be.

\(\blacksquare\)

- We have three variables left in our model,
`age`

,`ndrugtx`

and`treat`

. The quartiles of age are 27 and 37, the quartiles of`ndrugtx`

are 1 and 6, and the two possible values of`treat`

are`short`

and`long`

. Create a data frame with variables of these names and all possible combinations of their values (so there should be 8 rows in the resulting data frame). Display the resulting data frame.

Solution

This data frame is going to be used for prediction, so I will call it `new`

and construct it in pieces as I did before (thus meaning that I don’t have to think too hard about what I’m doing):

```
<- c(27, 37)
ages <- c(1, 6)
ndrugtxs <- c("short", "long")
treats <- datagrid(model = drugusers.2, age = ages, ndrugtx = ndrugtxs, treat = treats)
new new
```

8 rows as promised.

\(\blacksquare\)

- Obtain predicted survival probabilities for each of the values of
`age`

,`ndrugtx`

and`treat`

used in the previous part. You don’t need to display it (we are going to plot it shortly).

Solution

`survfit`

is the survival analysis version of `predict`

and works the same way, so this is all you need:

`<- survfit(drugusers.2, new, data = drugusers) pp `

Make sure that you use your best model, ie. the second one. The `data=`

is needed for the plot below, not in itself for the prediction.

\(\blacksquare\)

- Plot your predicted survival curves.

Solution

This:

`ggsurvplot(pp, conf.int = F)`

The only thing to remember is that you plot your *predictions*, not the model from which they came.

If your plot didn’t come out, you may have to go back and re-do the `survfit`

with the `data=`

at the end.

For reference in a minute:

` new`

\(\blacksquare\)

- Which of your combinations of values is predicted to take the longest to return to drug use? Which is predicted to take the shortest time? Explain briefly.

Solution

Remember that “up and to the right” is the best survival curve: that is, the people on this survival curve are predicted to take the longest to return to drug use. On my plot, this is the pale blue survival curve, stratum 5. Going back to my combinations data frame, this is 37-year-olds with only one previous drug treatment and the longer drug treatment this time.

The worst is my green survival curve, stratum 4, which is the exact opposite of this: 27-year-olds, 6 previous drug treatments, shorter treatment this time.

“Returning to drug use” is like “death” in that you want it to be a long time before it happens, so “best” is top right on the plot of survival curves. In other circumstances, you might want the event to happen *sooner*, in which case the lower-left survival curve would be the “best” one.

\(\blacksquare\)

- Are your survival curve plot and your conclusions from part (here) consistent, or not? Explain briefly.

Solution

The survival curves say that being older, having fewer previous treatments and being on the long treatment are better in terms of taking longer to return to drug use. Our analysis of whether the slope coefficients in `drugusers.2`

were positive or negative came to exactly the same conclusion. So the survival curves and part (here) are entirely consistent.

On my plot with the legend, you can assess the effects of the individual variables: for example, to assess the effect of age, find two combos that differ only in age, say strata 1 and 5, the red and light blue ones. Of these, the light blue survival curve is higher, so age 37 is better in terms of survival than age 27. This will work whichever such pair you pick: for example, strata 3 and 7, the olive green and purple curves, compare the same way.

Extra: more comparisons for you to do: to assess the effect of number of previous treatments, compare eg. strata 1 and 3, red and olive green, and to assess the effect of treatment length, compare eg. strata 5 and 6, light blue and darker blue.

All this struggling to identify colours makes me think of link, in which the guy behind the webcomic XKCD did a survey where he showed people a whole bunch of different colours and asked the people to name the colours.

\(\blacksquare\)

## 30.7 Multiple myeloma

Multiple myeloma is a kind of cancer. It forms in a plasma cell (which is a type of white blood cell). It causes cancer cells to accumulate in the bone marrow, where they crowd out healthy blood cells. Plasma cells make antibodies (to help fight infections), while the cancer cells don’t: they produce abnormal proteins that can cause kidney problems. (This adapted from link.) The variables are:

`time`

: survival time from diagnosis (months)`vstatus`

: 0=alive, 1=dead at end of study`logbun`

: log of BUN test score (BUN test is a test of kidney function, not to be confused with cha siu bao^{13}).`hgb`

: hemoglobin (at diagnosis).`platelet`

: platelets: 1=normal, 0=abnormal (at diagnosis).`age`

at diagnosis, in years`logwbc`

: log of WBC (white blood cell count, at diagnosis)`frac`

: fractures at diagnosis (0=absent, 1=present)`logpbm`

: log of percent of plasma cells in bone marrow`protein`

: proteinuria (protein in urine) at diagnosis. Most people have very little, so a larger than normal amount indicates illness of some kind.`scalc`

: serum calcium at diagnosis.

The data, on 65 patients with multiple myeloma, are in link. Some of the variables are logs because they could take very large values.

There are a lot of parts here, but each part is supposed to be short.

- Read in the data and display (some of) the values. Confirm that you have the right number of observations and the right variables.

Solution

The usual:

```
<- "http://ritsokiguess.site/datafiles/myeloma.csv"
my_url <- read_csv(my_url) myeloma
```

```
Rows: 65 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (11): time, vstatus, logbun, hgb, platelet, age, logwbc, frac, logpbm, p...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

` myeloma`

65 observations, and all the variables listed. If you want to go further (not necessary here), you can check that the variables `vstatus`

, `platelet`

and `frac`

that should be zero and one actually *are* zero and one, at least for the values shown (they are), and the ages look like ages (they do).

The tidyverse also offers:

`glimpse(myeloma)`

```
Rows: 65
Columns: 11
$ time <dbl> 1.25, 1.25, 2.00, 2.00, 2.00, 3.00, 5.00, 5.00, 6.00, 6.00, 6…
$ vstatus <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ logbun <dbl> 2.2175, 1.9395, 1.5185, 1.7482, 1.3010, 1.5441, 2.2355, 1.681…
$ hgb <dbl> 9.4, 12.0, 9.8, 11.3, 5.1, 6.7, 10.1, 6.5, 9.0, 10.2, 9.7, 10…
$ platelet <dbl> 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1…
$ age <dbl> 67, 38, 81, 75, 57, 46, 50, 74, 77, 70, 60, 67, 48, 61, 53, 5…
$ logwbc <dbl> 3.6628, 3.9868, 3.8751, 3.8062, 3.7243, 4.4757, 4.9542, 3.732…
$ frac <dbl> 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1…
$ logpbm <dbl> 1.9542, 1.9542, 2.0000, 1.2553, 2.0000, 1.9345, 1.6628, 1.732…
$ protein <dbl> 12, 20, 2, 0, 3, 12, 4, 5, 1, 1, 0, 0, 5, 1, 1, 0, 0, 1, 1, 0…
$ scalc <dbl> 10, 18, 15, 12, 9, 10, 9, 9, 8, 8, 10, 8, 10, 10, 13, 12, 10,…
```

which gives a bit more of a picture of the values.^{14} Or if you were serious about checking, you could do

`summary(myeloma)`

```
time vstatus logbun hgb
Min. : 1.25 Min. :0.0000 Min. :0.7782 Min. : 4.9
1st Qu.: 7.00 1st Qu.:0.0000 1st Qu.:1.1461 1st Qu.: 8.8
Median :15.00 Median :1.0000 Median :1.3222 Median :10.2
Mean :24.01 Mean :0.7385 Mean :1.3929 Mean :10.2
3rd Qu.:35.00 3rd Qu.:1.0000 3rd Qu.:1.5682 3rd Qu.:12.0
Max. :92.00 Max. :1.0000 Max. :2.2355 Max. :14.6
platelet age logwbc frac
Min. :0.0000 Min. :38.00 Min. :3.362 Min. :0.0000
1st Qu.:1.0000 1st Qu.:51.00 1st Qu.:3.643 1st Qu.:1.0000
Median :1.0000 Median :60.00 Median :3.732 Median :1.0000
Mean :0.8615 Mean :60.15 Mean :3.769 Mean :0.7538
3rd Qu.:1.0000 3rd Qu.:67.00 3rd Qu.:3.875 3rd Qu.:1.0000
Max. :1.0000 Max. :82.00 Max. :4.954 Max. :1.0000
logpbm protein scalc
Min. :0.4771 Min. : 0.000 Min. : 7.00
1st Qu.:1.3617 1st Qu.: 0.000 1st Qu.: 9.00
Median :1.6232 Median : 1.000 Median :10.00
Mean :1.5497 Mean : 3.615 Mean :10.12
3rd Qu.:1.8451 3rd Qu.: 4.000 3rd Qu.:10.00
Max. :2.0000 Max. :27.000 Max. :18.00
```

which gives means and five-number summaries for each of the variables (the numeric ones, but they all are here, even the ones coded as 0 or 1 that are really categorical).

\(\blacksquare\)

- Create a suitable response variable for a Cox proportional-hazards survival model, bearing in mind that the “event” here is death. Display your response variable, and explain briefly what the
`+`

signs attached to some of the values mean, without using a technical term.

Solution

I seem to call my response variables `y`

, but you can call yours whatever you like. Two things to consider: the survival times, here `time`

, and the indicator of the event, here `vstatus`

being 1.

The modern way is to define the response variable right back into the dataframe, thus:

```
%>% mutate(y = Surv(time, vstatus == 1)) -> myeloma
myeloma myeloma
```

You see that the new column (on the end) is of type `Surv`

, to reflect the fact that it is not just a number (a survival time) but also encodes whether or not the individual was observed to die at that time.

The old way (that still works) is to create `y`

*outside* the dataframe, like this:

```
<- with(myeloma, Surv(time, vstatus == 1))
y y
```

```
[1] 1.25 1.25 2.00 2.00 2.00 3.00 5.00 5.00 6.00 6.00
[11] 6.00 6.00 7.00 7.00 7.00 9.00 11.00 11.00 11.00 11.00
[21] 11.00 13.00 14.00 15.00 16.00 16.00 17.00 17.00 18.00 19.00
[31] 19.00 24.00 25.00 26.00 32.00 35.00 37.00 41.00 41.00 51.00
[41] 52.00 54.00 58.00 66.00 67.00 88.00 89.00 92.00 4.00+ 4.00+
[51] 7.00+ 7.00+ 8.00+ 12.00+ 11.00+ 12.00+ 13.00+ 16.00+ 19.00+ 19.00+
[61] 28.00+ 41.00+ 53.00+ 57.00+ 77.00+
```

Or use `myeloma$`

(twice) before the variable names.

The values of `y`

that have a `+`

by them go with patients who were never observed to die (or were still alive at the end of the study). There were 17 of these, listed at the end of the data frame. Usually, these values of the response will be higher than the others, but they weren’t here. (Maybe some of these patients were withdrawn from the study, or they joined it late.)

The reason we used to have to do it this way is that `tibble`

s didn’t until recently have the ability to store a thing like the above `y`

as a column, because it wasn’t just a number. Now, it is stored like a list-column, and all that matters is that it is a column of *something*, one survival-time-plus-censorship-status for each observation.

\(\blacksquare\)

- What is the technical term for those patients that have a
`+`

by their values for the response variable?

Solution

Censored. A quick one. I was trying to dissuade you from using the word “censored” in the previous part, since I wanted you to demonstrate that you understood what it *meant*. But you should know the technical term as well, which is why I asked you for it here. Grading note: if this part and the previous one contain, somewhere, the word “censored” *and* a clear explanation of what “censored” means, then I don’t mind what is where.

\(\blacksquare\)

- Fit a Cox proportional-hazards survival model predicting your response variable from all the other variables (except for the ones that you used to make the response variable). Display the
`summary`

of your model.

Solution

The obvious way to do this is to list all the other variables on the right side of the squiggle, but a faster way is this:

```
.1 <- coxph(y ~ . - time - vstatus, data = myeloma)
ysummary(y.1)
```

```
Call:
coxph(formula = y ~ . - time - vstatus, data = myeloma)
n= 65, number of events= 48
coef exp(coef) se(coef) z Pr(>|z|)
logbun 1.85557 6.39536 0.65628 2.827 0.00469 **
hgb -0.12629 0.88136 0.07212 -1.751 0.07994 .
platelet -0.25488 0.77501 0.51194 -0.498 0.61858
age -0.01306 0.98702 0.01957 -0.668 0.50439
logwbc 0.35389 1.42460 0.71576 0.494 0.62101
frac 0.34232 1.40821 0.40725 0.841 0.40059
logpbm 0.38165 1.46470 0.48743 0.783 0.43364
protein 0.01302 1.01311 0.02612 0.498 0.61817
scalc 0.12976 1.13856 0.10502 1.236 0.21659
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
logbun 6.3954 0.1564 1.7670 23.147
hgb 0.8814 1.1346 0.7652 1.015
platelet 0.7750 1.2903 0.2841 2.114
age 0.9870 1.0131 0.9499 1.026
logwbc 1.4246 0.7020 0.3503 5.794
frac 1.4082 0.7101 0.6339 3.128
logpbm 1.4647 0.6827 0.5634 3.808
protein 1.0131 0.9871 0.9625 1.066
scalc 1.1386 0.8783 0.9268 1.399
Concordance= 0.675 (se = 0.051 )
Likelihood ratio test= 17.62 on 9 df, p=0.04
Wald test = 17.93 on 9 df, p=0.04
Score (logrank) test = 18.97 on 9 df, p=0.03
```

The `.`

in this model formula means “all the columns in the data frame” (except for the response variable if it was in the data frame, which here it was not). I used `time`

and `vstatus`

to make `y`

, so I had to exclude them explicitly.

If you forget to exclude `time`

and `vstatus`

, you are in danger of having a model that fits perfectly:

`.00 <- coxph(y ~ ., data = myeloma) y`

```
Warning in coxph.fit(X, Y, istrat, offset, init, control, weights = weights, :
Ran out of iterations and did not converge
```

`summary(y.00)`

```
Call:
coxph(formula = y ~ ., data = myeloma)
n= 65, number of events= 48
coef exp(coef) se(coef) z Pr(>|z|)
time -1.109e+01 1.526e-05 9.799e-01 -11.318 <2e-16 ***
vstatus 1.543e+01 5.031e+06 6.233e+02 0.025 0.98
logbun 1.759e-04 1.000e+00 7.904e-01 0.000 1.00
hgb 6.861e-06 1.000e+00 9.143e-02 0.000 1.00
platelet 9.226e-05 1.000e+00 6.221e-01 0.000 1.00
age -4.311e-06 1.000e+00 2.245e-02 0.000 1.00
logwbc -1.220e-06 1.000e+00 8.940e-01 0.000 1.00
frac -6.694e-05 9.999e-01 6.262e-01 0.000 1.00
logpbm 2.400e-04 1.000e+00 7.307e-01 0.000 1.00
protein 3.472e-05 1.000e+00 6.776e-02 0.001 1.00
scalc -1.447e-05 1.000e+00 1.091e-01 0.000 1.00
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
time 1.526e-05 6.554e+04 2.236e-06 0.0001041
vstatus 5.031e+06 1.988e-07 0.000e+00 Inf
logbun 1.000e+00 9.998e-01 2.124e-01 4.7087742
hgb 1.000e+00 1.000e+00 8.359e-01 1.1962633
platelet 1.000e+00 9.999e-01 2.955e-01 3.3849492
age 1.000e+00 1.000e+00 9.569e-01 1.0449881
logwbc 1.000e+00 1.000e+00 1.734e-01 5.7673011
frac 9.999e-01 1.000e+00 2.930e-01 3.4119518
logpbm 1.000e+00 9.998e-01 2.389e-01 4.1885763
protein 1.000e+00 1.000e+00 8.757e-01 1.1420792
scalc 1.000e+00 1.000e+00 8.075e-01 1.2384242
Concordance= 1 (se = 0 )
Likelihood ratio test= 277 on 11 df, p=<2e-16
Wald test = 128.1 on 11 df, p=<2e-16
Score (logrank) test = 87.4 on 11 df, p=5e-14
```

The warning at the top is your clue that something has gone wrong. This kind of warning *can* happen with real data, but not often: it is usually an indication that something is wrong with the way you specified the model. If you look at the output, you’ll realize that predicting survival time from survival time makes no sense at all.

There is of course nothing wrong with typing out all the variable names, except that the first time you type them out, you will likely make a typo (unless you are more careful than I usually am).

\(\blacksquare\)

- In your model, which explanatory variables have a P-value less than 0.10? Fit a model containing only those and display the results.

Solution

Only `logbun`

and `hgb`

; the other P-values are larger, usually much larger. Because there are so many variables to remove, I am frightened away from `update`

here (which I would normally try to use in this situation). I’m going to copy-and-paste my code for `y.1`

and edit it:

```
.2 <- coxph(y ~ logbun + hgb, data = myeloma)
ysummary(y.2)
```

```
Call:
coxph(formula = y ~ logbun + hgb, data = myeloma)
n= 65, number of events= 48
coef exp(coef) se(coef) z Pr(>|z|)
logbun 1.71597 5.56209 0.61855 2.774 0.00553 **
hgb -0.11966 0.88722 0.05742 -2.084 0.03717 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
logbun 5.5621 0.1798 1.6547 18.6961
hgb 0.8872 1.1271 0.7928 0.9929
Concordance= 0.675 (se = 0.043 )
Likelihood ratio test= 12.27 on 2 df, p=0.002
Wald test = 12.51 on 2 df, p=0.002
Score (logrank) test = 13.07 on 2 df, p=0.001
```

That’s all I wanted, but you can note that `hgb`

has become significant at \(\alpha=0.05\). I suspect it was somewhat correlated with a variable that we removed, so that its value to the regression has become clearer.

\(\blacksquare\)

- Do a test to compare the two models that you fit. Why do you prefer the second model? Explain briefly.

Solution

Comparing two models is `anova`

, which also works here. The right `test`

is `Chisq`

:

`anova(y.2, y.1, test = "Chisq")`

The usual logic here: this is far from significant, so the null hypothesis (that the two models are equally good) is not rejected, so we prefer the smaller model `y.2`

because it is simpler.

I wasn’t sure about the `Model 2`

line of my `anova`

output (what are `time`

and `vstatus`

doing there?), but the test has 7 degrees of freedom, which is correct since we started with 9 explanatory variables and finished with 2, so that we took out 7 of them. I checked what went off the right side of the page: there is a `-time-vstatus`

on the end, so that it is correct. What happened is that the `.`

got expanded out into all the variables separated by `+`

, and then whatever else (the “minus” variables) were on the end.

In case you are curious, `step`

also works on models like these:

```
.3 <- step(y.1, direction = "backward", trace = 0)
ysummary(y.3)
```

```
Call:
coxph(formula = y ~ logbun + hgb, data = myeloma)
n= 65, number of events= 48
coef exp(coef) se(coef) z Pr(>|z|)
logbun 1.71597 5.56209 0.61855 2.774 0.00553 **
hgb -0.11966 0.88722 0.05742 -2.084 0.03717 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
logbun 5.5621 0.1798 1.6547 18.6961
hgb 0.8872 1.1271 0.7928 0.9929
Concordance= 0.675 (se = 0.043 )
Likelihood ratio test= 12.27 on 2 df, p=0.002
Wald test = 12.51 on 2 df, p=0.002
Score (logrank) test = 13.07 on 2 df, p=0.001
```

The same model as the one we found by brute force. You can change the value of `trace`

to see the progress, but in this case it’s not very illuminating, since `<none>`

and the variables we end up keeping are always at the bottom of the list to remove.

`step`

is built on `add1`

and `drop1`

. In this case, `drop1`

is run repeatedly and the variable with lowest AIC is removed. We had all numeric variables in this one, but if our model had something categorical like `treatment`

with, let’s say, 4 levels, `drop1`

would contemplate dropping all four of these in one shot, the same way it works with a categorical variable in a regression of any other kind.

\(\blacksquare\)

- There should be two explanatory variables left in your model. These are both numerical variables. Find their first and third quartiles, any way you like.

Solution

The obvious way is probably this:

`quantile(myeloma$logbun)`

```
0% 25% 50% 75% 100%
0.7782 1.1461 1.3222 1.5682 2.2355
```

`quantile(myeloma$hgb)`

```
0% 25% 50% 75% 100%
4.9 8.8 10.2 12.0 14.6
```

So the quartiles are 1.15 and 1.57 for `logbun`

, and 8.8 and 12.0 for `hgb`

.

There are (at least) three other ways to do it. This is the easiest:

`summary(myeloma)`

```
time vstatus logbun hgb
Min. : 1.25 Min. :0.0000 Min. :0.7782 Min. : 4.9
1st Qu.: 7.00 1st Qu.:0.0000 1st Qu.:1.1461 1st Qu.: 8.8
Median :15.00 Median :1.0000 Median :1.3222 Median :10.2
Mean :24.01 Mean :0.7385 Mean :1.3929 Mean :10.2
3rd Qu.:35.00 3rd Qu.:1.0000 3rd Qu.:1.5682 3rd Qu.:12.0
Max. :92.00 Max. :1.0000 Max. :2.2355 Max. :14.6
platelet age logwbc frac
Min. :0.0000 Min. :38.00 Min. :3.362 Min. :0.0000
1st Qu.:1.0000 1st Qu.:51.00 1st Qu.:3.643 1st Qu.:1.0000
Median :1.0000 Median :60.00 Median :3.732 Median :1.0000
Mean :0.8615 Mean :60.15 Mean :3.769 Mean :0.7538
3rd Qu.:1.0000 3rd Qu.:67.00 3rd Qu.:3.875 3rd Qu.:1.0000
Max. :1.0000 Max. :82.00 Max. :4.954 Max. :1.0000
logpbm protein scalc
Min. :0.4771 Min. : 0.000 Min. : 7.00
1st Qu.:1.3617 1st Qu.: 0.000 1st Qu.: 9.00
Median :1.6232 Median : 1.000 Median :10.00
Mean :1.5497 Mean : 3.615 Mean :10.12
3rd Qu.:1.8451 3rd Qu.: 4.000 3rd Qu.:10.00
Max. :2.0000 Max. :27.000 Max. :18.00
y.time y.status
Min. : 1.25000 Min. :0.0000000
1st Qu.: 7.00000 1st Qu.:0.0000000
Median :15.00000 Median :1.0000000
Mean :24.00769 Mean :0.7384615
3rd Qu.:35.00000 3rd Qu.:1.0000000
Max. :92.00000 Max. :1.0000000
```

from which you pick out the ones you need. Or, you `select`

the ones you need first:

`%>% select(logbun, hgb) %>% summary() myeloma `

```
logbun hgb
Min. :0.7782 Min. : 4.9
1st Qu.:1.1461 1st Qu.: 8.8
Median :1.3222 Median :10.2
Mean :1.3929 Mean :10.2
3rd Qu.:1.5682 3rd Qu.:12.0
Max. :2.2355 Max. :14.6
```

The obvious `tidyverse`

way is actually a bit inelegant, because you have to calculate two things for two variables:^{15}

```
%>% summarize(
myeloma logbun.q1 = quantile(logbun, 0.25),
logbun.q3 = quantile(logbun, 0.75),
hgb.q1 = quantile(hgb, 0.25),
hgb.q3 = quantile(hgb, 0.75)
)
```

Next is the tidyverse-approved way to get both quartiles for both variables at once. Use `across`

to select the variables to use, and then something with a squiggle and a dot to say “do this on each of the columns selected in the `across`

”. If you have a cleverer way to select those two columns without naming them, go for it. Read this in English as “for each of the columns `logbun`

and `hgb`

, work out the first and third quantiles of it”, where the dot is read as “it”:

```
%>%
myeloma summarize(across(c(logbun, hgb),
quantile(x, c(0.25, 0.75)))) \(x)
```

```
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
```

We have lost which quartile is which, but of course the lower one must be Q1 and the higher one Q3 for each variable.^{16}

\(\blacksquare\)

- Create a data frame containing all possible combinations of the two quartiles for each of the two variables, and display the result.

Solution

This is `datagrid`

. My best model is the one I called `y.2`

(I had to scroll back a ways to find it), so:

`<- datagrid(model = y.2, logbun = c(1.14561, 1.5682), hgb = c(8.8, 12.0)) new `

```
Warning: Matrix columns are not supported as predictors and are therefore
omitted. This may prevent computation of the quantities of interest. You
can construct your own prediction dataset and supply it explicitly to
the `newdata` argument.
```

` new`

Or anything equivalent to that.

The place you have to get to in the end is a data frame with columns called `logbun`

and `hgb`

, and the right four combinations of values. If you want to round the `logbun`

values off more, for example to two decimals, that’s fine; it won’t affect the graph that’s coming up.

\(\blacksquare\)

- Obtain predicted survival probabilities for each of the combinations of variables you created above. You don’t need to look at the results (they are rather long).

Solution

This seems as if it ought to be `predictions`

, but the `survival`

version of it is called `survfit`

:

`<- survfit(y.2, new, data = myeloma) s `

It works the same as `predictions`

: a fitted model object (your smaller survival model), and a data frame of values to predict for. The `data=`

is not strictly needed here, but if you want `ggsurvplot`

to work right, you *do* need it to be here.

\(\blacksquare\)

- Obtain a graph of the predicted survival curves for each combination of your variables.

Solution

This is easier than you think: it’s just `ggsurvplot`

from `survminer`

:

`ggsurvplot(s, conf.int = FALSE)`

\(\blacksquare\)

- Is it better to have high or low values for each of the variables in your prediction? Explain briefly.

Solution

Those four “strata” are the four rows in your prediction data frame (the four combinations). They are in the same order that they were in `new`

(or whatever name you used):

` new`

The best survival curve is the top-right green one. This is stratum^{17} 2, from the legend at the top. In `new`

, this goes with a low value of `logbun`

and a *high* value of `hgb`

.

You can check this by looking at the *worst* survival curve, which should be diametrically opposed. This is the blue one, stratum 3, which is *high* `logbun`

and *low* `hgb`

, indeed exactly the opposite of the best one.

Things that are tests, like `logbun`

, are often set up so that a high value is the abnormal one (so that an abnormal one will be easy to spot). Things that are measurements, like `hgb`

, might have an ideal range, but the better value could be high or low, depending on what is being measured.

\(\blacksquare\)

## 30.8 Ovarian cancer

R’s `survival`

package contains several data sets. One of these is called `ovarian`

; it comes from a study of 26 ovarian cancer patients. The major purpose of this study was to compare the effects of two treatments on survival time.

- Obtain and display (all of) the data set. This is as simple as loading the package and typing the data set’s name.

Solution

Thus. You may need to start with `library(survival)`

:

` ovarian`

There are indeed 26 rows. This is a `data.frame`

rather than a `tibble`

, so you might see the whole thing, in case you were expecting something like this:

`%>% as_tibble() ovarian `

which doesn’t change anything in `ovarian`

, but changes what kind of thing it is (and thus how it displays). Usually when you read something in from a file, you use something like `read_delim`

that makes a `tibble`

, but this one wasn’t read in from a file. It was stored in the package as an old-fashioned `data.frame`

, and so that’s how it stays.

\(\blacksquare\)

- The columns of interest to us are:

`futime`

: the time for which a patient was followed-up: the number of days until either they died or the study ended (or they withdrew from the study for some other reason).`fustat`

: follow-up status: 1 if the patient died of ovarian cancer, 0 if they were still alive when the study ended.`age`

: of patient, at diagnosis, in years`rx`

: treatment, numbered 1 or 2, but really labels for the two treatments.

Create and display a suitable response variable `y`

for a Cox proportional-hazards model.

Solution

The idea is to use the appropriate one(s) of these columns in `Surv`

. Remember that the response variable in a survival model encodes two things: the survival time, and whether or not the event (here death) actually happened to that patient or not. I always forget whether the second thing in `Surv`

has to be 1 or 0 if the event happened. The help says that it needs to be 1 or `TRUE`

if the event (death) happened, which is what `fustat`

is, so we can use it as it is:

```
<- with(ovarian, Surv(futime, fustat))
y y
```

```
[1] 59 115 156 421+ 431 448+ 464 475 477+ 563 638 744+
[13] 769+ 770+ 803+ 855+ 1040+ 1106+ 1129+ 1206+ 1227+ 268 329 353
[25] 365 377+
```

This creates a separate variable `y`

outside of any data frame. This is how, in the past, it had to be done, because although `y`

looks as if it is 26 long (one per patient), it’s actually more complicated than that. But adding it as a column to a dataframe now works just fine:

```
%>% mutate(y = Surv(futime, fustat)) -> ov2
ovarian ov2
```

Internally `y`

is a `matrix`

with two columns:

`print.default(y)`

```
time status
[1,] 59 1
[2,] 115 1
[3,] 156 1
[4,] 421 0
[5,] 431 1
[6,] 448 0
[7,] 464 1
[8,] 475 1
[9,] 477 0
[10,] 563 1
[11,] 638 1
[12,] 744 0
[13,] 769 0
[14,] 770 0
[15,] 803 0
[16,] 855 0
[17,] 1040 0
[18,] 1106 0
[19,] 1129 0
[20,] 1206 0
[21,] 1227 0
[22,] 268 1
[23,] 329 1
[24,] 353 1
[25,] 365 1
[26,] 377 0
attr(,"type")
[1] "right"
attr(,"class")
[1] "Surv"
```

but everything is handled properly when you add it to a dataframe (note the `Surv`

at the top of the column to indicated that it is a survival-time object, encoding a survival time and survivorship status both).

\(\blacksquare\)

- In the display of your response variable, some values are marked with a
`+`

. Why is that? Explain briefly. (If you use a technical term, you should explain what it means.)

Solution

These are the censored observations. You can say this, but you also need to say what that means (this is the “technical term” referred to in the question). The observations with a `+`

are individuals who were never observed to die, or who were still alive at the end of the study.

I want you to demonstrate that you know what censored *means*, not just that you know when you have a censored observation.

Extra: in a study like this, patients are typically “recruited” into the study at various different times. Patients who happened to be in the study near the beginning and who survived can have a large (censored) value of `y`

(like those values over 1000 days). But a patient might join the study later on; if they survive, they might produce a censored observation with a small survival time, like the last value 377. I’m sure the doctor would have liked to follow them for longer, but the funding ran out, and the doctor had a paper to write. (There is *some* information in these small censored values, but not much, because most of the patients, even the ones who eventually died, survived for longer than 377 days.)

The other thing that might have happened is that a patient with the 377-censored value died *from something else* unrelated to ovarian cancer. The study is only concerned with deaths from ovarian cancer, so such a patient is treated as censored at their death time. After this point we cannot assess how long this patient survived *ovarian cancer*.

\(\blacksquare\)

- Fit a Cox proportional-hazards model for predicting survival time from age and treatment. Note that the numeric values for treatment make sense only as labels for the two treatments, so in your model formula make treatment into a factor. Display the results.

Solution

The hint suggests something like this:

```
.1 <- coxph(y ~ age + factor(rx), data = ovarian)
timesummary(time.1)
```

```
Call:
coxph(formula = y ~ age + factor(rx), data = ovarian)
n= 26, number of events= 12
coef exp(coef) se(coef) z Pr(>|z|)
age 0.14733 1.15873 0.04615 3.193 0.00141 **
factor(rx)2 -0.80397 0.44755 0.63205 -1.272 0.20337
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.1587 0.863 1.0585 1.268
factor(rx)2 0.4475 2.234 0.1297 1.545
Concordance= 0.798 (se = 0.076 )
Likelihood ratio test= 15.89 on 2 df, p=4e-04
Wald test = 13.47 on 2 df, p=0.001
Score (logrank) test = 18.56 on 2 df, p=9e-05
```

Alternatively, define the factor version of `rx`

in the data frame first. This is the slick way to do that:

```
.1a <- ovarian %>%
timemutate(rxf = factor(rx)) %>%
coxph(y ~ age + rxf, data = .)
summary(time.1a)
```

```
Call:
coxph(formula = y ~ age + rxf, data = .)
n= 26, number of events= 12
coef exp(coef) se(coef) z Pr(>|z|)
age 0.14733 1.15873 0.04615 3.193 0.00141 **
rxf2 -0.80397 0.44755 0.63205 -1.272 0.20337
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.1587 0.863 1.0585 1.268
rxf2 0.4475 2.234 0.1297 1.545
Concordance= 0.798 (se = 0.076 )
Likelihood ratio test= 15.89 on 2 df, p=4e-04
Wald test = 13.47 on 2 df, p=0.001
Score (logrank) test = 18.56 on 2 df, p=9e-05
```

The answer is the same either way. The `data=.`

means “use the data frame that came out of the previous step, the one with `rxf`

in it.”

Alternatively, use the dataframe I called `ov2`

with `y`

in it:

```
.1b <- coxph(y ~ age + factor(rx), data = ov2)
timesummary(time.1b)
```

```
Call:
coxph(formula = y ~ age + factor(rx), data = ov2)
n= 26, number of events= 12
coef exp(coef) se(coef) z Pr(>|z|)
age 0.14733 1.15873 0.04615 3.193 0.00141 **
factor(rx)2 -0.80397 0.44755 0.63205 -1.272 0.20337
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.1587 0.863 1.0585 1.268
factor(rx)2 0.4475 2.234 0.1297 1.545
Concordance= 0.798 (se = 0.076 )
Likelihood ratio test= 15.89 on 2 df, p=4e-04
Wald test = 13.47 on 2 df, p=0.001
Score (logrank) test = 18.56 on 2 df, p=9e-05
```

\(\blacksquare\)

- Is there a significant difference between the treatments in terms of their effects on survival (from ovarian cancer)?

Solution

Look at the P-value for my `factor(rx)2`

, 0.203. This is not small, so there is no evidence of a difference between treatments.

Extra: the reason for the odd label is that we have turned treatment into a categorical variable; treatment 1 is used as the baseline, and the negative slope says that the “hazard of death” is lower for treatment 2 than for treatment 1: that is, people survive longer on treatment 2, but the difference is not big enough to be significant (we also have a smallish sample size).

Since there are only two treatments, it would in fact have been OK to leave them as numbers (with two numbers one unit apart, the slope would have been the same size as here), but I think it’s a good idea to treat categorical variables as categorical. My own habit is to use letters or something non-numerical to distinguish categories. I might have used `t1`

and `t2`

in this case, or the names of the different treatments.

\(\blacksquare\)

- Is there a significant effect of age? If there is, describe the effect that age has on survival.

Solution

The P-value for age is 0.0014, small, so age definitely has a significant effect on survival. As to what kind of effect, look at the slope coefficient, 0.15, positive, which means that increased age-at-diagnosis goes with an *increased* hazard of death, or that older patients do not survive as long.

I would like you to get to the plain-English words at the end. Part of your job as a statistician is explaining what you got to people who are doctors, managers, etc., who won’t understand the terminology.

Thus, one mark for assessing significance via P-value, one for looking at the slope coefficient and noting that it is positive, and one for getting to “older patients do not survive as long”, or “older patients have a larger chance of dying sooner”. (Strictly, this is also “all else equal” as usual, since survival time might also have depended on treatment, but the point of this question is for you to get to “older patients do not survive as long”.)

(The interpretation of the slope may seem backwards: a positive slope means a *shorter* survival time for a larger age. This is why I talk about “hazard of death”, since that guides us to the correct interpretation.)

Extra: I was curious about what would happen if I just included `rx`

in the model:

```
.2 <- update(time.1, . ~ . - age)
timesummary(time.2)
```

```
Call:
coxph(formula = y ~ factor(rx), data = ovarian)
n= 26, number of events= 12
coef exp(coef) se(coef) z Pr(>|z|)
factor(rx)2 -0.5964 0.5508 0.5870 -1.016 0.31
exp(coef) exp(-coef) lower .95 upper .95
factor(rx)2 0.5508 1.816 0.1743 1.74
Concordance= 0.608 (se = 0.07 )
Likelihood ratio test= 1.05 on 1 df, p=0.3
Wald test = 1.03 on 1 df, p=0.3
Score (logrank) test = 1.06 on 1 df, p=0.3
```

Still not significant, but this model is a *lot worse* because I took out the significant `age`

. What this is doing is mixing up all the people of different ages (and we know that age has an effect on survival) and trying (and failing) to discern an effect of treatment.

We could have been badly misled by this model if one of the treatments had predominantly older patients. We know that older patients have worse survival, so the treatment with older patients would have looked worse, even if it actually wasn’t. The model `time.1`

which contained `age`

properly adjusted for the effect of age, so that was the best way to see whether there was a difference between treatments.

What you often see early on in a paper on this kind of stuff is a graph showing that the treatment groups are similar in terms of important things like `age`

. Here, that could be a boxplot:

`ggplot(ovarian, aes(x = factor(rx), y = age)) + geom_boxplot()`

0.14733

I needed to do `factor(rx)`

because `geom_boxplot`

needs a genuine categorical variable, not just a numerical variable masquerading as one. If you just leave it as `rx`

, as I discovered, you get *one* boxplot of all the ages together regardless of treatment. The key, for you as user of software, is not (necessarily) to get it right the first time, but to know what to do to fix up the errors you will inevitably get. If you have worked through the boxplot examples in C32 and D29, you will have enough experience to remember that a boxplot has to have a categorical `x`

(text will do, but definitely not numbers). This is why I give you so many things to work through: so that you gain the experience to know how to fix up problems.

Treatment 1 has a larger spread of ages and treatment 2 has a low outlier age, but the median ages are very similar.

\(\blacksquare\)

- Make a martingale residual plot for this model. Do you see any problems? Explain briefly.

Solution

The plot is just the same idea as the one in the notes. Make sure you have `survminer`

installed and loaded:

`ggcoxdiagnostics(time.1) + geom_smooth()`

```
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
```

Make a call about whether you think the smooth trend deviates too much from the red dotted line going across at zero. Martingale residuals can get very negative (and that is OK), so that residual of \(-2\) is not a problem, and this is pulling the smooth trend down a bit (it’s the reason for the dip on the right side of the smooth trend). So I’d call this OK, but you can make whatever call you like as long as it’s supported by what you see here.

I observe that the obvious fixable thing is where you have a curve here, one that looks like a parabola (at which point you add squared terms to your explanatory variables and see if that helps, as for `bmi`

in one of the other problems). This one is too wiggly to be a parabola (it bends *twice*), and so is more like a cubic than anything.

The other thing you can note is that the grey envelope is “not significantly different from zero”, since 0 is clearly within the grey envelope all the way across.

\(\blacksquare\)

- Find the quartiles of
`age`

, and make a data frame containing all combinations of those two ages and the two treatments. Display what you have. (Feel free to copy the values by hand, rather than trying to save them and use them.)

Solution

I imagine you can guess what we are going to be doing with these: predictions, so we’ll call the data frame `new`

when we get there.

Quartiles first:

`quantile(ovarian$age)`

```
0% 25% 50% 75% 100%
38.89320 50.16712 56.84660 62.37810 74.50410
```

or, if you prefer,

```
%>%
ovarian summarize(
q1 = quantile(age, 0.25),
q3 = quantile(age, 0.75)
)
```

The quartiles are 50.17 and 62.38 (rounding slightly).

Either way is good.

Then follow my standard procedure (or one of your own devising), remembering that “treatment” is called `rx`

here:

```
<- c(50.17, 62.38)
ages <- c(1, 2)
rxs <- datagrid(model = time.1, age = ages, rx = rxs) new
```

\(\blacksquare\)

- Obtain predicted survival probabilities for each of your age-treatment combinations, for each of a variety of survival times. (This is only one thing, despite it sounding like a lot.)

Solution

The magic word here is `survfit`

. The structure is the same as for `predictions`

:

```
<- survfit(time.1, new, data = ovarian)
s summary(s)
```

```
Call: survfit(formula = time.1, newdata = new, data = ovarian)
time n.risk n.event survival1 survival2 survival3 survival4
59 26 1 0.993 0.997 0.959 0.981
115 25 1 0.985 0.993 0.911 0.959
156 24 1 0.973 0.988 0.846 0.928
268 23 1 0.959 0.981 0.777 0.893
329 22 1 0.932 0.969 0.653 0.826
353 21 1 0.905 0.956 0.548 0.764
365 20 1 0.877 0.943 0.452 0.701
431 17 1 0.843 0.926 0.356 0.630
464 15 1 0.805 0.908 0.271 0.557
475 14 1 0.768 0.888 0.202 0.489
563 12 1 0.701 0.853 0.117 0.382
638 11 1 0.634 0.816 0.064 0.292
```

I didn’t ask you to display it, so doing so is optional. Also, you don’t need (here) that `data=ovarian`

(the predictions will work just fine without it), but the plot coming up *will not*. So my recommendation is to put it in.

Extra: The four columns `survival1`

through `survival4`

are the survival probabilities for the times shown in the left-hand column (these are numbers of days), for the four *rows* of my `new`

, in the same order.

There are only a few different times, because these are the numbers of days at which somebody in the data set died, and the estimated survival probability does not change at the times in between these. (You’ll see this on a plot in a minute.)

These survival probabilities are pretty easy to eyeball: the best survival is in stratum 2, which is the younger patients in treatment 2. This we’ll come back to.

\(\blacksquare\)

- Draw a plot that compares the survival probabilities at the different times.

Solution

Thus. The `conf.int=F`

means to skip the confidence interval “envelopes” that I find make the plot rather messy:

`ggsurvplot(s, conf.int = FALSE)`

\(\blacksquare\)

- According to your plot, how would you describe the effects of treatment and of age? 50.17 1

50.17 2

Solution

The best survival curve, in terms of surviving longest, is up and to the right, so the green one is best and the blue one is worst.^{18} To figure out which those are, we have to go back to the data frame we created:

` new`

To see the effect of age, compare strata 1 and 3 (or 2 and 4). This means comparing the red and blue curves; the red one is clearly better (in the sense of longer survival time), which means that age has a big effect on survival, with younger people living longer, other things being equal (that we saw earlier was significant). You could equally well compare the green and purple survival curves and come to the same conclusion.

To assess the effect of treatment, compare strata 1 and 2 (red and green), or strata 3 and 4 (blue and purple). In both cases, the stratum corresponding to treatment 2 has slightly better survival (has a higher chance of living for longer), but there is not as big an effect as for age. (You’ll recall that the treatment difference was not significant).

State or imply that you know which stratum is which, say something about the effects of age and of treatment, including which one is better.

Extra: recall the output from the Cox model:

`summary(time.1)`

```
Call:
coxph(formula = y ~ age + factor(rx), data = ovarian)
n= 26, number of events= 12
coef exp(coef) se(coef) z Pr(>|z|)
age 0.14733 1.15873 0.04615 3.193 0.00141 **
factor(rx)2 -0.80397 0.44755 0.63205 -1.272 0.20337
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.1587 0.863 1.0585 1.268
factor(rx)2 0.4475 2.234 0.1297 1.545
Concordance= 0.798 (se = 0.076 )
Likelihood ratio test= 15.89 on 2 df, p=4e-04
Wald test = 13.47 on 2 df, p=0.001
Score (logrank) test = 18.56 on 2 df, p=9e-05
```

The slope coefficient for treatment 2 (as compared to the baseline treatment 1) was \(-0.83097\), *negative*, which meant that patients on treatment 2 had a *lower* hazard of death than patients on treatment 1: that is, that treatment 2 was better for survival than treatment 1. That is what the plot said also (and the relatively small difference is consistent with that difference not being significant).

\(\blacksquare\)

Heart attack.↩︎

Worcester is pronounced, by locals,

*Woo-stuh*.↩︎Barbecued pork in a bun. A staple of Chinese dim sum and Chinese bakeries, such as Ding Dong bakery on Spadina.↩︎

Heart attack.↩︎

Worcester is pronounced, by locals,

*Woo-stuh*.↩︎Not exactly the same as that output, because it is doing a test that would be the same if you had an infinitely large sample, but is slightly different with an ordinary finite number of observations.↩︎

Our categorical variable

*gender*has only two levels.↩︎It doesn’t work with text columns, but it

*does*work if you temporarily turn the text columns into factors, eg. by using`mutate`

with`where`

. However, we don’t have any text columns here, so what we do here is good for this data set.↩︎`case_when`

is much clearer than using nested`if-else`

s when you have three or more categories, as for`herco`

.↩︎Some people define the response variable right inside the

`coxph`

, in the same way as putting something like`log(y)`

as a response in an`lm`

, but I think, especially while you’re getting used to the process, it’s better to create the response variable first and look at it to make sure it’s the right thing.↩︎The researchers were probably relieved that there was not quite a significant effect of

`site`

.↩︎Not at the 0.05 level, anyway.↩︎

Barbecued pork in a bun. A staple of Chinese dim sum and Chinese bakeries, such as Ding Dong bakery on Spadina.↩︎

Don’t confuse this with

`glance`

from`broom`

, which gives a one-line summary of a*model*, containing things like R-squared and a test for the overall model significance.↩︎Because

*summarize*will only allow you to have a single-number answer.↩︎The way, as we have seen elsewhere, is to use

`tidy(quantile)`

or`enframe(quantile)`

, which produce a two-column data frame with the percentiles shown.↩︎Strata is plural; the singular is

*stratum*. Like data and datum.↩︎In other cases, having the event happen sooner is better. For example, you might be poisoning rats, in which case you want them to die quicker. Or the event might be something desirable like becoming qualified to fly an airplane. In those cases, down and to the left is better.↩︎