STAD29 Assignment 5

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

1 Hypernephroma

Hypernephroma, also known as renal cell carcinoma or kidney cancer, is a type of cancer that starts in the kidneys. It’s one of the most common types of kidney cancer in adults. Source.

The 33 patients in http://ritsokiguess.site/datafiles/lee_hypernephroma.csv all had hypernephromia and were all treated with chemotherapy, immunotherapy, and hormonal therapy. There are a lot of columns in our data, among them:

  • age of patient in years
  • gender of patient, noted as F or M.
  • date of treatment_start, as text (month - day - year)
  • date of treatment_end (last followup or date of death), as text
  • status of patient when last seen
  • the last five columns are the results of skin tests taken at the start of treatment.

The researchers wanted to see whether any of the skin test results, as well as the age and gender of the patient, helped in predicting survival time after the start of treatment.

(a) (1 point) Read in and display (some of) the data.

my_url <- "http://ritsokiguess.site/datafiles/lee_hypernephroma.csv"
hypernephroma <- read_csv(my_url)
Rows: 33 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): gender, treatment_start, treatment_end, status
dbl (8): patient, age, response, monilia, mumps, PPD, PHA, SK_SD

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hypernephroma

I gave my dataframe a long name, but I don’t have to type it out every time; I type something like hyp, wait, and watch for the autocomplete to tell me the rest of it. Hitting Enter will select it, or I can move up and down if there are several choices.

There are 33 patients and 12 variables, including the ones mentioned above.

If you can’t see all the column names (that you will need later), you have the glimpse trick1 at your disposal:

glimpse(hypernephroma)
Rows: 33
Columns: 12
$ patient         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ age             <dbl> 53, 61, 53, 48, 55, 62, 57, 53, 45, 58, 61, 61, 77, 55…
$ gender          <chr> "F", "M", "F", "M", "M", "F", "M", "M", "M", "M", "F",…
$ treatment_start <chr> "3/31/77", "6/18/76", "2/1/77", "12/19/74", "11/10/75"…
$ response        <dbl> 1, 0, 3, 2, 0, 2, 0, 2, 0, 3, 3, 1, 0, 2, 3, 0, 0, 3, …
$ treatment_end   <chr> "10/1/77", "8/21/76", "10/1/77", "1/15/76", "1/15/76",…
$ status          <chr> "alive", "dead", "alive", "dead", "dead", "dead", "dea…
$ monilia         <dbl> 7, 10, 0, 0, 12, 10, 15, 0, 6, 13, 0, 9, 0, 0, 0, 11, …
$ mumps           <dbl> 23, 15, 7, 0, 0, 5, 15, 0, 4, 13, 8, 12, 0, 0, 14, 7, …
$ PPD             <dbl> 0, 0, 0, 0, 10, 0, 0, 0, 0, 22, 17, 0, 0, 15, 5, 0, 0,…
$ PHA             <dbl> 25, 13, 25, 0, 8, 7, 0, 12, 0, 23, 11, 20, 0, 10, 32, …
$ SK_SD           <dbl> 0, 9, 0, 0, 5, 5, 10, 0, 0, 0, 0, 0, 0, 0, 21, 0, 0, 1…

Extra: this data set came from a book that I have as a pdf, so I did something very like what I did on the last assignment with the NBA data to make a dataframe of it. The table of data was split over two pages, so I needed to process two pages in the same way. With that in mind, I wrote a function saying which page I wanted to extract and how I wanted to pull out the data (and how many columns to extract it to). Then I ran the function twice, and glued the results together:

library(pdftools)
lee_data <- function(my_url, wanted, n_col) {
  lee <- pdf_text(my_url)
  lee_strings <- strsplit(lee, "\n")
  lee_strings[wanted] %>% 
    unlist() %>%  
    trimws(which = "left") %>% 
    str_split_fixed(" {2,}", n_col) %>% 
    as_tibble() %>% 
    filter(str_detect(V4, "/")) %>% 
    separate_wider_delim(V11, delim = " ", names = c("V11a", "V11b"), too_few = "align_start") %>% 
    mutate(across(V8:V11b, \(x) ifelse(x == "ND", 0, parse_number(x)))) %>% 
    mutate(V11 = V11a,
           V12 = coalesce(V11b, V12)) %>% 
    mutate(V7 = ifelse(V7 == "1", "dead", "alive")) %>% 
    mutate(V2 = as.numeric(V2)) %>% 
    select(
      patient = V1,
      age = V2,
      gender = V3,
      treatment_start = V4,
      response = V5,
      treatment_end = V6,
      status = V7, 
      monilia = V8,
      mumps = V9,
      PPD = V10,
      PHA = V11,
      SK_SD = V12
    )
}
a1 <- lee_data(my_url, 61, 12)
a2 <- lee_data(my_url, 62, 12)
hypernephroma <- bind_rows(a1, a2)
hypernephroma

The function takes the filename of the whole book (which is not really a URL, but that’s what it got called), the page I want to extract data from, and the number of columns of data there are. The function is a rather terrifying pipe, but I built it up one line at a time, returned what I had made so far, and tested it as I went, just as you would. Some of the ideas are exactly as for the NBA data on a previous assignment. Here’s how it goes:

  • using pdftools, extract the pdf into a vector of pages.
  • turn the \n in the vector of pages into genuine newlines, creating a list of pages, with the text of each line on a page as an element of a vector within that page.
  • grab the page I want
  • remove any list attributes it has, so that it’s now just a vector of lines
  • remove any whitespace on the left of each line
  • break into as many columns as I asked for, splitting on two or more spaces
  • turn into a dataframe. This far is exactly what I did with the NBA data; in fact, I copied and pasted my code and made some small edits.
  • there are some header and footer rows. In fact, just as for the NBA data, the lines I want have dates in them. I found that choosing the rows with a slash in the treatment start date (V4) was enough; I could have used a regular expression like /7.$, since the years are all in the 1970s, but I didn’t need to be that clever.
  • Some of the skin test results that should have been in columns V11 and V12 ended up both in column V11, with V12 being empty. (I think they were separated by only one space rather than two, but I didn’t want to have a “Golden State” problem, as for the NBA data.) They were separated in column V11 by a space, so I split them up into new columns V11a and V11b. If there was actually only one thing in V11, it ends up in V11a with a missing in V11b, which is what the too_few thing does.
  • if the value in any of V8 through V12 is not recorded (recorded as ND for “not done”), arbitrarily replace it by zero so that you wouldn’t have missing values to deal with; otherwise, make sure it’s a number.
  • put V11 and V12 back as they should be: V11 should be whatever ended up in V11a, and V12 should be whatever is in V11b if it is not missing, and left alone otherwise (there is a value already in V12, so use that). coalesce takes several columns; the first one that is not missing is the answer that is used.
  • turn the number codes in V7 into what they actually are.
  • turn age into a number (it seems mysteriously to be text).
  • finally, the great long select grabs the columns I want and gives them sensible names. I used rename for this before, but I wasn’t sure I wanted all the columns this time, and you can do renaming in select also. (It seems that I did use all the columns in the end.)

After defining and testing the function (as I said, one line at a time), I ran it on what I had previously found to be pages 61 and 62 of the pdf (by searching for the text hypernephroma, which only appears twice in the book), glued those two dataframes together above and below (they have the same column names), and saved the result for you.

(b) (2 points) Convert the treatment start and treatment end dates into actual dates.

The obvious way is to do two mutates, one for the treatment start date, and the other for the treatment end date. The treatment dates are written as month-day-year with two-digit years (which will get mapped to the right century), thus (I overwrote my dates, but you can create new columns if you prefer):

hypernephroma %>% 
  mutate(treatment_start = mdy(treatment_start)) %>% 
  mutate(treatment_end = mdy(treatment_end))

These are indeed now dates, and display in ISO format with the year first. Save your dataframe, since we are going to do more with it.

I actually did it differently: I saw that the two dates both start with treatment, so I can turn them both into dates at once:

hypernephroma %>% 
  mutate(across(starts_with("treatment"), \(tr) mdy(tr))) -> hypernephroma
hypernephroma

In English, that says “for each of the columns whose name starts with treatment, replace it by the mdy version of itself”. (mutate with across by default changes the columns in place rather than creating new columns.)

(c) (2 points) Work out the number of days between the start and the end of the treatment. Check that your result is indeed a number of days.

Back near the beginning of dates and times, I said that you could do arithmetic with dates. You can add a number of days to a date and get another date, or you can subtract two dates and get a number of days. Here, subtracting the two dates is what we want:

hypernephroma %>% 
  mutate(surv_time = treatment_end - treatment_start) -> hypernephroma
hypernephroma %>% select(treatment_start, treatment_end, surv_time)

The column I called surv_time is indeed a number of days because it says so. You can also eyeball the two dates and see that the number of days is about right. The first is about six months, the second is just over two months, and the fourth one is about a month more than a year.2

(d) (2 points) Create a suitable response variable for a Cox proportional-hazards model, and display it. Does it distinguish correctly between patients whose treatment_end was their date of death, and the patients who were still alive at this point?

This uses Surv from the survival package (which you remembered to install and load). Surv requires two things: a column with survival times (which can be numbers of something like days or months, or a date difference like ours), together with something that will be TRUE if the event (here death) has happened:

hypernephroma %>% 
  mutate(y = Surv(surv_time, status == "dead")) -> hypernephroma
hypernephroma %>% select(surv_time, status, y)

I called my response variable y. It is a fairly recent development that it displays properly in a dataframe; the Surv[,2] at the top of the column indicates that it’s actually two columns in one.

Some of the values of the response variable have a + next to them; these are the patients that were still alive the last time they were seen. In the jargon of survival analysis, these are “censored” observations. All we know about the first patient is that they lived at least 184 days, but we don’t know how long they lived after that. The second patient, on the other hand, lived for exactly 64 days after the start of treatment, and then they died.

This is more detailed than you need to be: say something about how the values of the response for the patients who were still alive are distinguished from the ones that died (the + sign), and you are good.

(e) (3 points) Fit a Cox proportional-hazards model, predicting survival time from age, gender, and the five skin test results. Display the summary of the model.

My model has a long name too (you could call yours y.1 or whatever name you used for your response column):

hypernephroma.1 <- coxph(y ~ age + gender + monilia + mumps + PPD + PHA + SK_SD, data = hypernephroma)
summary(hypernephroma.1)
Call:
coxph(formula = y ~ age + gender + monilia + mumps + PPD + PHA + 
    SK_SD, data = hypernephroma)

  n= 33, number of events= 18 

             coef exp(coef)  se(coef)      z Pr(>|z|)  
age      0.046159  1.047241  0.029412  1.569   0.1166  
genderM  0.047637  1.048790  0.664885  0.072   0.9429  
monilia  0.087752  1.091717  0.057014  1.539   0.1238  
mumps   -0.153552  0.857656  0.067896 -2.262   0.0237 *
PPD      0.001477  1.001479  0.044425  0.033   0.9735  
PHA      0.028134  1.028534  0.039535  0.712   0.4767  
SK_SD    0.034313  1.034909  0.074009  0.464   0.6429  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age        1.0472     0.9549    0.9886    1.1094
genderM    1.0488     0.9535    0.2849    3.8605
monilia    1.0917     0.9160    0.9763    1.2208
mumps      0.8577     1.1660    0.7508    0.9797
PPD        1.0015     0.9985    0.9180    1.0926
PHA        1.0285     0.9723    0.9518    1.1114
SK_SD      1.0349     0.9663    0.8952    1.1965

Concordance= 0.704  (se = 0.062 )
Likelihood ratio test= 10.83  on 7 df,   p=0.1
Wald test            = 10.13  on 7 df,   p=0.2
Score (logrank) test = 10.98  on 7 df,   p=0.1

Note that only a few of the explanatory variables appear to be significant. You could also use drop1 here:

drop1(hypernephroma.1, test = "Chisq")

The P-values for summary and drop1 are almost the same but not quite. See the Extra to the next part for some more discussion about this.

(f) (2 points) Use step to remove explanatory variables that do not help to predict survival time. Save and display the model that comes out of step.

This is actually not difficult at all:

hypernephroma.2 <- step(hypernephroma.1)
Start:  AIC=108.93
y ~ age + gender + monilia + mumps + PPD + PHA + SK_SD

          Df    AIC
- PPD      1 106.93
- gender   1 106.93
- SK_SD    1 107.12
- PHA      1 107.43
<none>       108.93
- monilia  1 109.34
- age      1 109.41
- mumps    1 113.40

Step:  AIC=106.93
y ~ age + gender + monilia + mumps + PHA + SK_SD

          Df    AIC
- gender   1 104.93
- SK_SD    1 105.12
- PHA      1 105.46
<none>       106.93
- monilia  1 107.40
- age      1 107.48
- mumps    1 111.49

Step:  AIC=104.93
y ~ age + monilia + mumps + PHA + SK_SD

          Df    AIC
- SK_SD    1 103.13
- PHA      1 103.48
<none>       104.93
- monilia  1 105.41
- age      1 105.49
- mumps    1 109.57

Step:  AIC=103.13
y ~ age + monilia + mumps + PHA

          Df    AIC
- PHA      1 101.62
<none>       103.13
- monilia  1 103.81
- age      1 103.84
- mumps    1 107.58

Step:  AIC=101.62
y ~ age + monilia + mumps

          Df    AIC
<none>       101.62
- monilia  1 102.36
- age      1 102.45
- mumps    1 106.58
summary(hypernephroma.2)
Call:
coxph(formula = y ~ age + monilia + mumps, data = hypernephroma)

  n= 33, number of events= 18 

            coef exp(coef) se(coef)      z Pr(>|z|)  
age      0.04664   1.04774  0.02808  1.661   0.0968 .
monilia  0.09374   1.09828  0.05625  1.667   0.0956 .
mumps   -0.12328   0.88401  0.05211 -2.366   0.0180 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age         1.048     0.9544    0.9916    1.1070
monilia     1.098     0.9105    0.9836    1.2263
mumps       0.884     1.1312    0.7982    0.9791

Concordance= 0.718  (se = 0.055 )
Likelihood ratio test= 10.14  on 3 df,   p=0.02
Wald test            = 9.67  on 3 df,   p=0.02
Score (logrank) test = 10.21  on 3 df,   p=0.02

or, for the last line,

drop1(hypernephroma.2, test = "Chisq")

Two of the explanatory variables are only significant at 0.10 rather than 0.05. This tends to happen with step. We will keep them.

The step is easier to follow than the one on the last assignment:

  • drop PPD
  • then drop gender (it is common in these kinds of study for women to have better health outcomes than men, but not here)
  • then drop SK_SD
  • then drop PHA
  • then stop.

Age is left, along with two of the skin tests.

Extra: Another thing to notice is that the P-values in the summary output and the drop1 output are not the same, though they will usually be close. The tests in summary are what is called Wald tests: the fitting works out standard errors for the coefficients, and then says “if the sample size is large, the coefficient divided by its standard error will be approximately normal under the null hypothesis” and works out a P-value on that basis. This is like what you would do in ordinary regression, where there are formulas that you learn in B27.3 The P-values from drop1 are based on likelihood ratio tests (hence the LRT at the top of the column); this time the test is based on the difference in AIC4 between the model with and the model without the term listed. In ordinary regression, these two ways of testing are identical, but in a Cox model they would only be identical if you had an infinite sample size, and for actual real-life sample sizes they will be slightly different (and for small samples they may be noticeably different).

As far as I am concerned, here you can pick either summary or drop1 and go with it. The place where it makes a difference is if you have a categorical variable with more than two categories, in which case only drop1 would give you a test for keeping or dropping the categorical variable as a whole. An example in this context would be if each patient was randomly assigned to one of the skin tests; then you would have a categorical variable skin_test with five levels monilia through SK_SD, and you would have to use drop1 to see whether you could drop it as a whole or needed to keep it. This, though, is not what happened; the skin tests were testing for different things, and the doctor ordered all (or most: see the Extra to (a)) of them for each patient.

(g) (3 points) Obtain and plot predicted survival probabilities over time for ages 40 through 80 in steps of 10. Hint: your procedure will use representative values for the other variables, so you do not need to supply values for those.

Remember that predictions for a survival model are obtained using survfit and plotted using ggsurvplot (from the survminer package). The first thing is to use datagrid to select those ages and representative values (means) for the other variables, as you can see by looking at what I called new1.5 After that, run survfit, supplying the model, the “new data”, and the original dataframe. Then plot them using ggsurvplot. For these, it is clearer to omit the confidence interval envelopes, which overlap each other.

new1 <- datagrid(model = hypernephroma.2, age = seq(40, 80, 10))
Warning: Matrix columns are not supported as predictors and are therefore
  omitted. This may prevent computation of the quantities of interest. You
  can construct your own prediction dataset and supply it explicitly to
  the `newdata` argument.
new1
s1 <- survfit(hypernephroma.2, newdata = new1, data = hypernephroma)
ggsurvplot(s1, conf.int = FALSE)
Warning: `gather_()` was deprecated in tidyr 1.2.0.
ℹ Please use `gather()` instead.
ℹ The deprecated feature was likely used in the survminer package.
  Please report the issue at <https://github.com/kassambara/survminer/issues>.

List out the five age values using c if you prefer.

There are two warnings:

  • the one below new1 is talking about what I called y before, which is not something we want to predict for, so we can ignore this.
  • the one above the plot is directed at the person who wrote ggsurvplot, not at us.

(h) (2 points) Describe the effect of increasing age on your plot, and explain briefly how this is consistent with the summary output from your model.

The strata have the same numbers as the rows of what I called new1, with higher numbers going with increasing age. As age increases, the survival curves move down, indicating that an older patient has a lower chance of surviving longer (worse survival).

Looking again at the summary:

summary(hypernephroma.2)
Call:
coxph(formula = y ~ age + monilia + mumps, data = hypernephroma)

  n= 33, number of events= 18 

            coef exp(coef) se(coef)      z Pr(>|z|)  
age      0.04664   1.04774  0.02808  1.661   0.0968 .
monilia  0.09374   1.09828  0.05625  1.667   0.0956 .
mumps   -0.12328   0.88401  0.05211 -2.366   0.0180 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age         1.048     0.9544    0.9916    1.1070
monilia     1.098     0.9105    0.9836    1.2263
mumps       0.884     1.1312    0.7982    0.9791

Concordance= 0.718  (se = 0.055 )
Likelihood ratio test= 10.14  on 3 df,   p=0.02
Wald test            = 9.67  on 3 df,   p=0.02
Score (logrank) test = 10.21  on 3 df,   p=0.02

the coefficient for age is positive, meaning that an older patient has a higher hazard of death, which also points to worse survival.

(i) (3 points) Repeat the previous two parts for mumps skin test values of 0, 6, and 12.

Copy, paste, and edit:

new2 <- datagrid(model = hypernephroma.2, mumps = c(0, 6, 12))
Warning: Matrix columns are not supported as predictors and are therefore
  omitted. This may prevent computation of the quantities of interest. You
  can construct your own prediction dataset and supply it explicitly to
  the `newdata` argument.
new2
s2 <- survfit(hypernephroma.2, newdata = new2, data = hypernephroma)
ggsurvplot(s2, conf.int = FALSE)

This time, mumps varies, and representative (mean) values are used for the other variables. Here, the best survival goes with the blue curve, which is stratum 3 or a mumps test score of 12, so a higher score here is associated with a better chance of living longer.

Looking back at the summary, mumps has a negative coefficient, so if mumps is higher, the hazard of death is lower (and thus survival is better).

Footnotes

  1. Not to be confused with glance from the broom package.↩︎

  2. I don’t know what drtn stands for; I know of this kind of thing as a “difftime”, especially when it’s a time as well as a date. A little research reveals that it’s short for “duration”.↩︎

  3. Those formulas have estimated standard deviations in them, so the distribution you get P-values from is \(t\) rather than normal, but the idea is the same.↩︎

  4. which is itself based on the likelihood.↩︎

  5. I will be doing some more predictions in a moment, and I wanted to use different names for those.↩︎