Worksheet 12
Packages
library(tidyverse)
Intoxicant use according to gender and race
In a survey, 2,276 high-school students were classified according to whether or not they have ever used alcohol, cigarettes, or marijuana (responses). In the survey, each student’s race and gender (as they reported them) were also recorded (explanatory). The data are in http://ritsokiguess.site/datafiles/intoxicant.csv. The columns are labelled by the initial letter of each of these, with a column count that says how many students fell into that combination of categories.
- Read in and display some of the data. How do you know you have the correct total number of students?
- Fit a log-linear model with up to two-way associations to these data. To do this, use (a+c+m+r+g)^2 on the right side of your model formula (instead of the a*c*r*m*g that you were probably expecting). Run a suitable drop1 on this model.
- Build a better model. Why did you stop where you did?
- For each of your significant associations, draw a graph to explore it, and say what you conclude. Note that there is a logical distinction between associations that contain both a response variable and an explanatory one, and those that contain two variables of the same type.
- We can also use step to do the model-building (rather than removing terms one by one). Starting from all three-way interactions, run step on this model, saving the result, and then run drop1 on that result. Is everything remaining significant? (Hint: copy and paste your code from question 2, and change the 2 to a 3.)
- In your final model from the previous question, are there any significant terms that you did not see previously? If so, in each case draw a suitable graph and say what it means.
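The mechanics behind the questions above can be sketched on a small made-up example. This is only an illustration under stated assumptions: the data frame toy, its three factors and its counts are hypothetical (they are not the intoxicant data), and the term removed with update is chosen arbitrarily to show the syntax. In a real analysis you would remove the least significant term that drop1 reports, one at a time.

```r
# Hypothetical toy data: three two-level factors with invented counts.
# These are NOT the survey data from the worksheet.
toy <- expand.grid(
  a = c("no", "yes"),
  c = c("no", "yes"),
  m = c("no", "yes")
)
toy$count <- c(120, 40, 30, 60, 10, 15, 20, 90)

# (a + c + m)^2 expands to all main effects and all two-way
# interactions, but no three-way interaction.
fit2 <- glm(count ~ (a + c + m)^2, family = "poisson", data = toy)
attr(terms(fit2), "term.labels")

# Test each term that can be removed (here, the two-way interactions).
drop1(fit2, test = "Chisq")

# Remove one term with update() and test again.
# a:m is removed purely to show the syntax, not because it "should" go.
fit2a <- update(fit2, . ~ . - a:m)
drop1(fit2a, test = "Chisq")

# Alternative: start from all interactions up to three-way and let
# step() do backward elimination, then check what is left.
fit3 <- glm(count ~ (a + c + m)^3, family = "poisson", data = toy)
fit_step <- step(fit3, trace = 0)
drop1(fit_step, test = "Chisq")
```

Because step chooses terms by AIC rather than by P-value, the model it returns can still contain terms that drop1 reports as non-significant, which is why the final drop1 check is worth running.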
Vaccination and severe COVID cases
In Israel in August 2021, each person in the country was classified according to whether they were aged under or over 50, had been vaccinated against COVID-19 or not, and whether or not they had a severe case of COVID-19 requiring hospitalization. The data are in http://ritsokiguess.site/datafiles/israel-covid.csv, in tidy format with a column for each categorical variable and a column of frequencies.
- Read in and display the data.
- Build a suitable log-linear model for the relationships among age, vaccination status and severity.
- For each of your significant effects, describe the nature of the significance, and thus what the significance means for the data. (Hint: the number of severe cases is very small, so you may need to “zoom in” to see the relative sizes of the proportions of severe cases. To do this, add to your graph coord_cartesian(ylim = v), where v is a vector of two values, the lower and upper limits of the \(y\)-scale you want to zoom in to.)
- Find the total number of severe cases among vaccinated people and among unvaccinated people.
- Somebody says to you “there are more severe cases among the vaccinated than among the unvaccinated, therefore it is more dangerous to get vaccinated.” How would you argue against this? Do a calculation that better compares the numbers of severe cases in the two groups. (Note: cases of rare diseases are often measured “per 100,000” to make the numbers human-sized.)
- The author of the study from which I got these numbers defines the “vaccine efficacy” as \(1-V/N\), where \(V\) is the number of severe cases per 100,000 for vaccinated people and \(N\) is the number of severe cases per 100,000 for non-vaccinated people. The author calculates the overall vaccine efficacy to be 67.5%. The author then calculates the vaccine efficacy for younger people to be 91.8% and for older people to be 85.2%. Explain as briefly as possible how these numbers, which are correct, seem to make no sense, but actually have a rational explanation in the light of what we have seen.
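The "per 100,000" rate, the efficacy formula \(1-V/N\), and the coord_cartesian zoom from the hint can be sketched as follows. This is a minimal illustration, not a solution: the data frame toy and all its frequencies are invented, the column names vaccinated, severe and n are assumptions (the real file may use different ones), and age is left out entirely.

```r
library(tidyverse)

# Invented frequencies for illustration only (NOT the Israeli data).
# Column names here are assumptions, not those of the real file.
toy <- tribble(
  ~vaccinated, ~severe,     ~n,
  "no",        "yes",       50,
  "no",        "no",     99950,
  "yes",       "yes",       60,
  "yes",       "no",    599940
)

# Severe cases per 100,000 people within each vaccination group.
rates <- toy %>%
  pivot_wider(names_from = severe, values_from = n) %>%
  mutate(per_100k = yes / (yes + no) * 100000)
rates

# Efficacy as defined in the text: 1 - V/N.
V <- rates %>% filter(vaccinated == "yes") %>% pull(per_100k)
N <- rates %>% filter(vaccinated == "no") %>% pull(per_100k)
efficacy <- 1 - V / N
efficacy

# Proportions of severe cases, zoomed in on a narrow part of the
# y-scale; on the full 0-1 scale the small category is hard to see.
p <- ggplot(toy, aes(x = vaccinated, y = n, fill = severe)) +
  geom_col(position = "fill") +
  coord_cartesian(ylim = c(0, 0.001))
p
```

In this made-up example the vaccinated group has more severe cases in total (60 against 50) but a much lower rate per 100,000, because it contains six times as many people. coord_cartesian is used rather than setting limits on the scale because it zooms without discarding any of the data behind the bars.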