Worksheet 5

Published

February 6, 2025

Packages

library(tidyverse)
library(marginaleffects)
library(survival)

Hypernephroma

Hypernephroma, also known as renal cell carcinoma or kidney cancer, is a type of cancer that starts in the kidneys. It’s one of the most common types of kidney cancer in adults. Source.

The 33 patients in http://ritsokiguess.site/datafiles/lee_hypernephroma.csv all had hypernephromia and were all treated with chemotherapy, immunotherapy, and hormonal therapy. There are a lot of columns in our data, among them:

age of patient in years
gender of patient, noted as F or M.
date of treatment_start, as text (month - day - year)
date of treatment_end (last followup or date of death), as text
status of patient when last seen
the last five columns are the results of skin tests taken at the start of treatment.

The researchers wanted to see whether any of the skin test results, as well as the age and gender of the patient, helped in predicting survival time after the start of treatment.

Read in and display (some of) the data.

my_url <- "http://ritsokiguess.site/datafiles/lee_hypernephroma.csv"
hypernephroma <- read_csv(my_url)

Rows: 33 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): gender, treatment_start, treatment_end, status
dbl (8): patient, age, response, monilia, mumps, PPD, PHA, SK_SD

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

hypernephroma

I gave my dataframe a long name, but I don’t have to type it out every time; I type something like hyp, wait, and watch for the autocomplete to tell me the rest of it. Hitting Enter will select it, or I can move up and down if there are several choices.

There are 33 patients and 12 variables, including the ones mentioned above.

If you can’t see all the column names (that you will need later), you have the glimpse trick¹ at your disposal:

glimpse(hypernephroma)

Rows: 33
Columns: 12
$ patient         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ age             <dbl> 53, 61, 53, 48, 55, 62, 57, 53, 45, 58, 61, 61, 77, 55…
$ gender          <chr> "F", "M", "F", "M", "M", "F", "M", "M", "M", "M", "F",…
$ treatment_start <chr> "3/31/77", "6/18/76", "2/1/77", "12/19/74", "11/10/75"…
$ response        <dbl> 1, 0, 3, 2, 0, 2, 0, 2, 0, 3, 3, 1, 0, 2, 3, 0, 0, 3, …
$ treatment_end   <chr> "10/1/77", "8/21/76", "10/1/77", "1/15/76", "1/15/76",…
$ status          <chr> "alive", "dead", "alive", "dead", "dead", "dead", "dea…
$ monilia         <dbl> 7, 10, 0, 0, 12, 10, 15, 0, 6, 13, 0, 9, 0, 0, 0, 11, …
$ mumps           <dbl> 23, 15, 7, 0, 0, 5, 15, 0, 4, 13, 8, 12, 0, 0, 14, 7, …
$ PPD             <dbl> 0, 0, 0, 0, 10, 0, 0, 0, 0, 22, 17, 0, 0, 15, 5, 0, 0,…
$ PHA             <dbl> 25, 13, 25, 0, 8, 7, 0, 12, 0, 23, 11, 20, 0, 10, 32, …
$ SK_SD           <dbl> 0, 9, 0, 0, 5, 5, 10, 0, 0, 0, 0, 0, 0, 0, 21, 0, 0, 1…

Extra (long): this data set came from a book that I have as a pdf, so I did something very like what I did on the last worksheet with the NBA data to make a dataframe of it. The table of data was split over two pages, so I needed to process two pages in the same way. With that in mind, I wrote a function saying which page I wanted to extract and how I wanted to pull out the data (and how many columns to extract it to). Then I ran the function twice, and glued the results together:

library(pdftools)
lee_data <- function(my_url, wanted, n_col) {
  lee <- pdf_text(my_url)
  lee_strings <- strsplit(lee, "\n")
  lee_strings[wanted] %>% 
    unlist() %>%  
    trimws(which = "left") %>% 
    str_split_fixed(" {2,}", n_col) %>% 
    as_tibble() %>% 
    filter(str_detect(V4, "/")) %>% 
    separate_wider_delim(V11, delim = " ", names = c("V11a", "V11b"), too_few = "align_start") %>% 
    mutate(across(V8:V11b, \(x) ifelse(x == "ND", 0, parse_number(x)))) %>% 
    mutate(V11 = V11a,
           V12 = coalesce(V11b, V12)) %>% 
    mutate(V7 = ifelse(V7 == "1", "dead", "alive")) %>% 
    mutate(V2 = as.numeric(V2)) %>% 
    select(
      patient = V1,
      age = V2,
      gender = V3,
      treatment_start = V4,
      response = V5,
      treatment_end = V6,
      status = V7, 
      monilia = V8,
      mumps = V9,
      PPD = V10,
      PHA = V11,
      SK_SD = V12
    )
}
a1 <- lee_data(my_url, 61, 12)
a2 <- lee_data(my_url, 62, 12)
hypernephroma <- bind_rows(a1, a2)
hypernephroma

The function takes the filename of the whole book (which is not really a URL, but that’s what it got called), the page I want to extract data from, and the number of columns of data there are. The function is a rather terrifying pipe, but I built it up one line at a time, returned what I had made so far, and tested it as I went, just as you would. Some of the ideas are exactly as for the NBA data on a previous worksheet. Here’s how it goes:

using pdftools, extract the pdf into a vector of pages.
turn the \n in the vector of pages into genuine newlines, creating a list of pages, with the text of each line on a page as an element of a vector within that page.
grab the page I want
remove any list attributes it has, so that it’s now just a vector of lines
remove any whitespace on the left of each line
break into as many columns as I asked for, splitting on two or more spaces
turn into a dataframe. This far is exactly what I did with the NBA data; in fact, I copied and pasted my code and made some small edits.
there are some header and footer rows. In fact, just as for the NBA data, the lines I want have dates in them. I found that choosing the rows with a slash in the treatment start date (V4) was enough; I could have used a regular expression like /7.$, since the years are all in the 1970s, but I didn’t need to be that clever.
Some of the skin test results that should have been in columns V11 and V12 ended up both in column V11, with V12 being empty. (I think they were separated by only one space rather than two, but I didn’t want to have a “Golden State” problem, as for the NBA data.) They were separated in column V11 by a space, so I split them up into new columns V11a and V11b. If there was actually only one thing in V11, it ends up in V11a with a missing in V11b, which is what the too_few thing does.
if the value in any of V8 through V12 is not recorded (recorded as ND for “not done”), arbitrarily replace it by zero so that you wouldn’t have missing values to deal with; otherwise, make sure it’s a number.
put V11 and V12 back as they should be: V11 should be whatever ended up in V11a, and V12 should be whatever is in V11b if it is not missing, and left alone otherwise (there is a value already in V12, so use that). coalesce takes several columns; the first one that is not missing is the answer that is used.
turn the number codes in V7 into what they actually are.
turn age into a number (it seems mysteriously to be text).
finally, the great long select grabs the columns I want and gives them sensible names. I used rename for this before, but I wasn’t sure I wanted all the columns this time, and you can do renaming in select also. (It seems that I did use all the columns in the end.)

After defining and testing the function (as I said, one line at a time), I ran it on what I had previously found to be pages 61 and 62 of the pdf (by searching for the text hypernephroma, which only appears twice in the book), glued those two dataframes together above and below (they have the same column names), and saved the result for you.

Convert the treatment start and treatment end dates into actual dates.

The obvious way is to do two mutates, one for the treatment start date, and the other for the treatment end date. The treatment dates are written as month-day-year with two-digit years (which will get mapped to the right century), thus (I overwrote my dates, but you can create new columns if you prefer):

hypernephroma %>% 
  mutate(treatment_start = mdy(treatment_start)) %>% 
  mutate(treatment_end = mdy(treatment_end))

These are indeed now dates, and display in ISO format with the year first. Save your dataframe, since we are going to do more with it.

I actually did it differently: I saw that the two dates both start with treatment, so I can turn them both into dates at once:

hypernephroma %>% 
  mutate(across(starts_with("treatment"), \(tr) mdy(tr))) -> hypernephroma
hypernephroma

In English, that says “for each of the columns whose name starts with treatment, replace it by the mdy version of itself”. (mutate with across by default changes the columns in place rather than creating new columns.)

Work out the number of days between the start and the end of the treatment. Check that your result is indeed a number of days. Turn it, if necessary, into an actual number (with as.numeric), for plots later.

Back near the beginning of dates and times, I said that you could do arithmetic with dates. You can add a number of days to a date and get another date, or you can subtract two dates and get a number of days. Here, subtracting the two dates is what we want:

hypernephroma %>% 
  mutate(surv_time = treatment_end - treatment_start) -> hypernephroma
hypernephroma %>% select(treatment_start, treatment_end, surv_time)

The column I called surv_time is indeed a number of days because it says so. You can also eyeball the two dates and see that the number of days is about right. The first is about six months, the second is just over two months, and the fourth one is about a month more than a year.²

Using a duration like this works all right for modelling, but plot_predictions later needs an actual number, so we need to make it one:

hypernephroma %>% 
  mutate(surv_time = as.numeric(surv_time)) -> hypernephroma
hypernephroma %>% select(treatment_start, treatment_end, surv_time)

Extra: in this case, we got lucky in that the time difference got counted as a number of days, and that was what we wanted. This was unlike the example in the lecture notes, where it came out as a number of hours. If you want to make sure you control the units, the idea in the lecture notes is what you need: turn it into a period, then divide by the number of seconds in one of your chosen time unit:

hypernephroma %>% 
  mutate(surv_time = as.period(treatment_start %--% treatment_end) / days(1)) %>% 
  select(treatment_start, treatment_end, surv_time)

This time, surv_time is a number. We know it is a number of days because we set it up to be that. Except: our treatment start and end are dates, so surv_time should be a whole number of days. Why isn’t it? I dunno. Dates and times are tricky. (They do seem to round to the right thing, but they are some number of eighths of days off.)

Create a suitable response variable y for a Cox proportional-hazards model, and display it. (You don’t need to save it.) Does it distinguish correctly between patients whose treatment_end was their date of death, and the patients who were still alive at this point?

This uses Surv from the survival package (which you remembered to install and load). Surv requires two things: a column with survival times (which can be numbers of something like days or months, or a date difference like ours³), together with something that will be TRUE if the event (here death) has happened:

hypernephroma %>% 
  mutate(y = Surv(surv_time, status == "dead")) %>% 
  select(surv_time, status, y)

It is a fairly recent development that it displays properly in a dataframe; the Surv[,2] at the top of the column indicates that it’s actually two columns in one. (This bit is not, as I write, in the lecture notes. I need to remember to put it in before class today.)

Some of the values of the response variable have a + next to them; these are the patients that were still alive the last time they were seen. In the jargon of survival analysis, these are “censored” observations. All we know about the first patient is that they lived at least 184 days, but we don’t know how long they lived after that. The second patient, on the other hand, lived for exactly 64 days after the start of treatment, and then they died.

This is more detailed than you need to be: say something about how the values of the response for the patients who were still alive are distinguished from the ones that died (the + sign), and you are good.

Fit a Cox proportional-hazards model, predicting survival time from age, gender, and the five skin test results. Display the summary of the model. (Hint: copy your Surv from above into your modelling function.)

My model has a long name too (you could give yours a shorter name). The way to fit these in the lecture notes is to repeat the Surv code in the coxph:

hypernephroma.1 <- coxph(Surv(surv_time, status == "dead") ~ age + gender + 
                           monilia + mumps + PPD + PHA + SK_SD, data = hypernephroma)
summary(hypernephroma.1)

Call:
coxph(formula = Surv(surv_time, status == "dead") ~ age + gender + 
    monilia + mumps + PPD + PHA + SK_SD, data = hypernephroma)

  n= 33, number of events= 18 

             coef exp(coef)  se(coef)      z Pr(>|z|)  
age      0.046159  1.047241  0.029412  1.569   0.1166  
genderM  0.047637  1.048790  0.664885  0.072   0.9429  
monilia  0.087752  1.091717  0.057014  1.539   0.1238  
mumps   -0.153552  0.857656  0.067896 -2.262   0.0237 *
PPD      0.001477  1.001479  0.044425  0.033   0.9735  
PHA      0.028134  1.028534  0.039535  0.712   0.4767  
SK_SD    0.034313  1.034909  0.074009  0.464   0.6429  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age        1.0472     0.9549    0.9886    1.1094
genderM    1.0488     0.9535    0.2849    3.8605
monilia    1.0917     0.9160    0.9763    1.2208
mumps      0.8577     1.1660    0.7508    0.9797
PPD        1.0015     0.9985    0.9180    1.0926
PHA        1.0285     0.9723    0.9518    1.1114
SK_SD      1.0349     0.9663    0.8952    1.1965

Concordance= 0.704  (se = 0.062 )
Likelihood ratio test= 10.83  on 7 df,   p=0.1
Wald test            = 10.13  on 7 df,   p=0.2
Score (logrank) test = 10.98  on 7 df,   p=0.1

This tends to make the model formula long, so you might need to split it over two lines, as I have.

Note that only a few of the explanatory variables appear to be significant. You could also use drop1 here:

drop1(hypernephroma.1, test = "Chisq")

The P-values for summary and drop1 are almost the same but not quite. See the Extra to the next part for some more discussion about this.

(2 points) Use step to remove explanatory variables that do not help to predict survival time. Save and display the model that comes out of step. (Some of the explanatory variables will only be significant at 0.10, not 0.05. Keep those.)

This is actually not difficult at all:

hypernephroma.2 <- step(hypernephroma.1)

Start:  AIC=108.93
Surv(surv_time, status == "dead") ~ age + gender + monilia + 
    mumps + PPD + PHA + SK_SD

          Df    AIC
- PPD      1 106.93
- gender   1 106.93
- SK_SD    1 107.12
- PHA      1 107.43
<none>       108.93
- monilia  1 109.34
- age      1 109.41
- mumps    1 113.40

Step:  AIC=106.93
Surv(surv_time, status == "dead") ~ age + gender + monilia + 
    mumps + PHA + SK_SD

          Df    AIC
- gender   1 104.93
- SK_SD    1 105.12
- PHA      1 105.46
<none>       106.93
- monilia  1 107.40
- age      1 107.48
- mumps    1 111.49

Step:  AIC=104.93
Surv(surv_time, status == "dead") ~ age + monilia + mumps + PHA + 
    SK_SD

          Df    AIC
- SK_SD    1 103.13
- PHA      1 103.48
<none>       104.93
- monilia  1 105.41
- age      1 105.49
- mumps    1 109.57

Step:  AIC=103.13
Surv(surv_time, status == "dead") ~ age + monilia + mumps + PHA

          Df    AIC
- PHA      1 101.62
<none>       103.13
- monilia  1 103.81
- age      1 103.84
- mumps    1 107.58

Step:  AIC=101.62
Surv(surv_time, status == "dead") ~ age + monilia + mumps

          Df    AIC
<none>       101.62
- monilia  1 102.36
- age      1 102.45
- mumps    1 106.58

summary(hypernephroma.2)

Call:
coxph(formula = Surv(surv_time, status == "dead") ~ age + monilia + 
    mumps, data = hypernephroma)

  n= 33, number of events= 18 

            coef exp(coef) se(coef)      z Pr(>|z|)  
age      0.04664   1.04774  0.02808  1.661   0.0968 .
monilia  0.09374   1.09828  0.05625  1.667   0.0956 .
mumps   -0.12328   0.88401  0.05211 -2.366   0.0180 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age         1.048     0.9544    0.9916    1.1070
monilia     1.098     0.9105    0.9836    1.2263
mumps       0.884     1.1312    0.7982    0.9791

Concordance= 0.718  (se = 0.055 )
Likelihood ratio test= 10.14  on 3 df,   p=0.02
Wald test            = 9.67  on 3 df,   p=0.02
Score (logrank) test = 10.21  on 3 df,   p=0.02

or, for the last line,

drop1(hypernephroma.2, test = "Chisq")

Two of the explanatory variables are only significant at 0.10 rather than 0.05. This tends to happen with step. We will keep them.

The step is easier to follow than the one on the last assignment:

drop PPD
then drop gender (it is common in these kinds of study for women to have better health outcomes than men, but not here)
then drop SK_SD
then drop PHA
then stop.

Age is left, along with two of the skin tests.

Extra: Another thing to notice is that the P-values in the summary output and the drop1 output are not the same, though they will usually be close. The tests in summary are what is called Wald tests: the fitting works out standard errors for the coefficients, and then says “if the sample size is large, the coefficient divided by its standard error will be approximately normal under the null hypothesis” and works out a P-value on that basis. This is like what you would do in ordinary regression, where there are formulas that you learn in B27.⁴ The P-values from drop1 are based on likelihood ratio tests (hence the LRT at the top of the column); this time the test is based on the difference in AIC⁵ between the model with and the model without the term listed. In ordinary regression, these two ways of testing are identical, but in a Cox model they would only be identical if you had an infinite sample size, and for actual real-life sample sizes they will be slightly different (and for small samples they may be noticeably different).

As far as I am concerned, here you can pick either summary or drop1 and go with it. The place where it makes a difference is if you have a categorical variable with more than two categories, in which case only drop1 would give you a test for keeping or dropping the categorical variable as a whole. An example in this context would be if each patient was randomly assigned to one of the skin tests; then you would have a categorical variable skin_test with five levels monilia through SK_SD, and you would have to use drop1 to see whether you could drop it as a whole or needed to keep it. This, though, is not what happened; the skin tests were testing for different things, and the doctor ordered all (or most: see the Extra to (a)) of them for each patient.

Plot predicted survival probabilities over time for five representative ages. Hint: your procedure will use representative values for the other variables, so you do not need to supply values for those.

This is plot_predictions. The things to keep in mind are that the $x$-axis is time (which was called surv_time in this data set), and that the “strata” whose survival probabilities we are comparing are the five values of age, distinguished by colour, so they go second. (Any other variables make facets, and they go on the end, but there are none here.)

So:

plot_predictions(model = hypernephroma.2, condition = c("surv_time", "age"),
                 type = "survival")

Describe the effect of increasing age on your plot, and explain briefly how this is consistent with the summary output from your model.

As age increases, the survival curves move down, indicating that an older patient has a lower chance of surviving longer (worse survival).

Looking again at the summary:

summary(hypernephroma.2)

Call:
coxph(formula = Surv(surv_time, status == "dead") ~ age + monilia + 
    mumps, data = hypernephroma)

  n= 33, number of events= 18 

            coef exp(coef) se(coef)      z Pr(>|z|)  
age      0.04664   1.04774  0.02808  1.661   0.0968 .
monilia  0.09374   1.09828  0.05625  1.667   0.0956 .
mumps   -0.12328   0.88401  0.05211 -2.366   0.0180 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age         1.048     0.9544    0.9916    1.1070
monilia     1.098     0.9105    0.9836    1.2263
mumps       0.884     1.1312    0.7982    0.9791

Concordance= 0.718  (se = 0.055 )
Likelihood ratio test= 10.14  on 3 df,   p=0.02
Wald test            = 9.67  on 3 df,   p=0.02
Score (logrank) test = 10.21  on 3 df,   p=0.02

the coefficient for age is positive, meaning that an older patient has a higher hazard of death, which also points to worse survival.

Repeat the previous two questions for mumps skin test values.

Copy, paste, and edit:

plot_predictions(model = hypernephroma.2, condition = c("surv_time", "mumps"),
                 type = "survival")

This time, mumps varies, and representative (mean) values are used for the other variables. Here, a higher score here is associated with a better chance of living longer.

Looking back at the summary, mumps has a negative coefficient, so if mumps is higher, the hazard of death is lower (and thus survival is better).

Footnotes

Not to be confused with glance from the broom package.↩︎
I don’t know what drtn stands for; I know of this kind of thing as a “difftime”, especially when it’s a time as well as a date. A little research reveals that it’s short for “duration”.↩︎
For plotting survival curves, which we usually want to do after fitting a model, the survival times actually need to be numbers.↩︎
Those formulas have estimated standard deviations in them, so the distribution you get P-values from is $t$ rather than normal, but the idea is the same.↩︎
which is itself based on the likelihood.↩︎