Worksheet 12

Published

March 27, 2025

Packages

library(tidyverse)

Intoxicant use according to gender and race

In a survey, 2,276 high-school students were classified according to whether or not they had ever used alcohol, cigarettes, or marijuana (the response variables). Each student’s race and gender (as self-reported) were also recorded (the explanatory variables). The data are in http://ritsokiguess.site/datafiles/intoxicant.csv. Each of these five columns is labelled by the initial letter of its variable (a, c, m, r, g), and a column count says how many students fell into each combination of categories.

  1. Read in and display some of the data. How do you know you have the correct total number of students?
  2. Fit a log-linear model with up to two-way associations to these data. To do this, use (a+c+m+r+g)^2 on the right side of your model formula (instead of the a*c*r*m*g that you were probably expecting). Run a suitable drop1 on this model.
  3. Build a better model. Why did you stop where you did?
  4. For each of your significant associations, draw a graph to explore it, and say what you conclude. Note that there is a logical distinction between associations that contain both a response variable and an explanatory one, and those that contain two variables of the same type.
  5. We can also use step to do the model-building (rather than removing terms one by one). Starting from a model with all three-way interactions, run step on it, saving the result, and then run drop1 on that result. Is everything remaining significant? (Hint: copy and paste your code from question 2, and change the 2 to a 3.)
  6. In your final model from the previous question, are there any significant terms that you did not see previously? If so, in each case draw a suitable graph and say what it means.
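
To see what the ^2 notation in the model formula does, and how it differs from the version with * between the variables, here is a minimal sketch that expands each formula into its individual terms. It uses terms() from base R, reads no data, and fits no model. The response name count is taken from the data description above; the object names two_way, three_way and full are made up for illustration.

```r
# Expand each formula into its individual terms; no data are needed for this.
two_way <- attr(terms(count ~ (a + c + m + r + g)^2), "term.labels")
three_way <- attr(terms(count ~ (a + c + m + r + g)^3), "term.labels")
full <- attr(terms(count ~ a * c * m * r * g), "term.labels")

# The terms in the two-way model: main effects first, then pairs such as a:c.
two_way

# How many terms each formula produces.
length(two_way)    # main effects plus all two-way associations
length(three_way)  # the above plus all three-way associations
length(full)       # every association up to the five-way one
```

Changing the power from 2 to 3 only adds the three-way terms, which is why the hint in question 5 works.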

Vaccination and severe COVID cases

In Israel in August 2021, each person in the country was classified according to whether they were aged under or over 50, had been vaccinated against COVID-19 or not, and whether or not they had a severe case of COVID-19 requiring hospitalization. The data are in http://ritsokiguess.site/datafiles/israel-covid.csv, in tidy format with a column for each categorical variable and a column of frequencies.

  1. Read in and display the data.
  2. Build a suitable log-linear model for the relationships among age, vaccination status and severity.
  3. For each of your significant effects, describe the nature of the significance, and thus what the significance means for the data. (Hint: the number of severe cases is very small, so you may need to “zoom in” to see the relative sizes of the proportions of severe cases. To do this, add to your graph coord_cartesian(ylim = v) where v is a vector of two values, the lower and upper limits of the \(y\)-scale you want to zoom in to.)
  4. Find the total number of severe cases among vaccinated people and among unvaccinated people.
  5. Somebody says to you “there are more severe cases among the vaccinated than among the unvaccinated, therefore it is more dangerous to get vaccinated.” How would you argue against this? Do a calculation that better compares the numbers of severe cases in the two groups. (Note: cases of rare diseases are often measured “per 100,000” to make the numbers human-sized.)
  6. The author of the study from which I got these numbers defines the “vaccine efficacy” as \(1-V/N\), where \(V\) is the number of severe cases per 100,000 for vaccinated people and \(N\) is the number of severe cases per 100,000 for non-vaccinated people. The author calculates the overall vaccine efficacy to be 67.5%. The author then calculates the vaccine efficacy for younger people to be 91.8% and for older people to be 85.2%. Explain as briefly as possible how these numbers, which are correct, seem to make no sense, but actually have a rational explanation in the light of what we have seen.
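
The “per 100,000” rate and the efficacy formula \(1-V/N\) from the last two questions can be sketched in R as below. The function names per_100k and efficacy are made up for illustration, and the counts are hypothetical, not the Israeli data. This is one plausible way to do the arithmetic, not necessarily how the study's author did it.

```r
# Rate of severe cases per 100,000 people.
per_100k <- function(cases, population) {
  cases / population * 100000
}

# Vaccine efficacy as defined in the question: 1 - V/N, where V and N are the
# severe-case rates per 100,000 for vaccinated and non-vaccinated people.
efficacy <- function(v, n) {
  1 - v / n
}

# Hypothetical counts, NOT the Israeli data.
v <- per_100k(cases = 30, population = 600000)  # vaccinated
n <- per_100k(cases = 40, population = 200000)  # non-vaccinated
efficacy(v, n)
```

Because both inputs are rates rather than raw counts, the result does not depend on how many people happen to be in each group.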