library(tidyverse)
Worksheet 12
Packages
Intoxicant use according to gender and race
In a survey, 2,276 high-school students were classified according to whether or not they have ever used alcohol, cigarettes, or marijuana (responses). In the survey, each student’s race and gender (as they reported them) was also recorded (explanatory). The data are in http://ritsokiguess.site/datafiles/intoxicant.csv. The columns are labelled by the initial letter of each of these, with a column count
that says how many students fell into that combination of categories.
- Read in and display some of the data. How do you know you have the correct total number of students?
The usual, to kick off with:
<- "http://ritsokiguess.site/datafiles/intoxicant.csv"
my_url <- read_csv(my_url) use
Rows: 32 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): a, c, m, r, g
dbl (1): count
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
use
The first five columns are respectively whether the students have ever used alcohol, cigarettes, or marijuana, then their race (classified as “white” or “other”), then their reported gender (classified as “male” or “female” only).
There are \(32 = 2^5\) rows, one for each of the two levels of each of the five categorical variables. But each row contains a lot of students, namely, the number in count
, so to find out how many students here are altogether, sum up the values in count
:
%>% summarize(total = sum(count)) use
which checks with the question.
Extra: the tidyverse has a function amusingly called uncount
that does the opposite of what count
does:
%>% uncount(count) use
This produces one row for each individual student (you will have to scroll down a long way to get past the 405 rows of white female students who have used all three intoxicants), and hence 2,276 rows altogether.
- Fit a log-linear model with up to two-way associations to these data. To do this, use
(a+c+m+r+g)^2
on the right side of your model formula (instead of thea*c*r*m*g
that you were probably expecting). Run a suitabledrop1
on this model.
.1 <- glm(count ~ (a+c+m+r+g)^2, data = use, family = "poisson")
usedrop1(use.1, test = "Chisq")
You see that in the drop1
output are all the two-way interactions (“associations” for this kind of model). This is what ^
does in a model formula, and is why we had to use I(x^2)
to add an \(x\)-squared term in a regression, meaning “literally \(x\)-squared, not any kind of an interaction”.
The reason for doing it this way here is that if you start with a*c*r*m*g
, you will have to eliminate the five-way interaction, then all the four-way interactions, then all the three-way interactions, before you even get down to business, where “remove” means “remove and re-fit, one by one”. I didn’t want to put you through that.
- Build a better model. Why did you stop where you did?
Just as in regression, remove the least significant term, re-fit, then repeat until everything is significant and there is nothing left to drop.
First, remove c:r
(P-value 0.51), then see what to do next:
.2 <- update(use.1, . ~ . - c:r)
usedrop1(use.2, test = "Chisq")
Next is r:g
, P-value 0.37:
.3 <- update(use.2, . ~ . - r:g)
usedrop1(use.3, test = "Chisq")
Then c:g
, P-value 0.33:
.4 <- update(use.3, . ~ . - c:g)
usedrop1(use.4, test = "Chisq")
Those P-values are looking pretty small now. m:r
, with P-value 0.084, seems to be the last one to remove:
.5 <- update(use.4, . ~ . - m:r)
usedrop1(use.5, test = "Chisq")
and so it proves. Everything remaining is significant, with the largest remaining P-value being 0.019 and the others being a lot smaller than that. Thus, this is where we stop.
Extra: we have a lot of data here (over 2000 observations), so the effects, though significant, might not all be very big. This is something to bear in mind in the next question.
- For each of your significant associations, draw a graph to explore them, and say what you conclude. Note that there is a logical distinction between associations that contain both a response variable and an explanatory one, and those that contain two variables of the same type.
The point of the log-linear analysis is to see which associations are worth looking at. If the association is not significant, there is no effect and that association does nothing to explain what is going on.
I begin by classifying the remaining associations by what type of variables they are:
a:c
: response and responsea:m
: response and responsea:r
: response and explanatorya:g
: response and explanatoryc:m
: response and responsem:g
: response and explanatory
What we are looking for in this kind of analysis is how a response variable depends on explanatory variable(s), so the third, fourth, and sixth associations above are the most interesting. Let’s focus on them first.
In drawing a graph of a response and an explanatory, what you want to be able to say is “for each category of the explanatory variable, what happens to the response?”, so for those graphs you want the explanatory as x
and the response as fill
. So I’ll start with these. I think the best graph is stacked bars but scaled to have the same height so that you can compare proportions.1 The (relative) height of the bars is something we already have (in count
), so this is geom_col
with count
in y
, and not geom_bar
which would count things for us.
In a:r
, alcohol and race, the former is a response, so it goes as fill
:
ggplot(use, aes(x = r, y = count, fill = a)) + geom_col(position = "fill")
The right-hand blue piece is slightly bigger, so if a student is white, they are more likely to have used alcohol than if they were a student of some other race. (This difference, though small, is significant, which is why we are looking at it.)
a:g
is done the same way as a:r
:
ggplot(use, aes(x = g, y = count, fill = a)) + geom_col(position = "fill")
If a student is female, they are very slightly more likely to have used alcohol than if they are male. This tiny difference is in fact significant, though the P-value is 0.019, which is not very small. Because there is so much data, significant effects can be small, and this appears to be one of those.
m:g
is done the same way again:
ggplot(use, aes(x = g, y = count, fill = m)) + geom_col(position = "fill")
If a student is male, they are (a little) more likely to use marijuana than if the student is female. Again, a small but significant difference.
When you have two variables of the same type (in this problem, both responses), it doesn’t matter which one is x
and which one is fill
, as long as your explanation matches up: for each category of the thing in x
, what happens to the thing in fill
?
a:c
:
ggplot(use, aes(x = a, y = count, fill = c)) + geom_col(position = "fill")
Students that have used alcohol are more likely to have used cigarettes as well (and the ones that haven’t used alcohol probably haven’t used cigarettes either). That is to say, alcohol and cigarette use go together. This is interesting, but it is a sort of secondary conclusion; you might imagine that the survey people were really interested in a question like “what kinds of young people tend to use alcohol or cigarettes or marijuana?” and this graph doesn’t help in answering that question.
Or, if you prefer:
ggplot(use, aes(x = c, y = count, fill = a)) + geom_col(position = "fill")
and the relevant conclusion is that if a student has used cigarettes, they have almost certainly used alcohol as well (and if they haven’t used cigarettes, they are much less likely to have used alcohol, even though the chance is not small).
The implication here is that these patterns are consistent over race. If they were not, we would have an a:c:r
interaction, implying that race affects the combination of alcohol and cigarette use. But see the next question.
a:m
are also both responses:
ggplot(use, aes(x = a, y = count, fill = m)) + geom_col(position = "fill")
Almost all the students who have not used alcohol have not used marijuana either, but of the students that have used alcohol, about half of them have used marijuana as well. Once again, students who have used one are more likely to have used the other as well.
Once again, the implication is that this pattern holds over all genders and races, otherwise we would have had a higher-order interaction in the model.2
Finally, c:m
:
ggplot(use, aes(x = c, y = count, fill = m)) + geom_col(position = "fill")
A student who has not used cigarettes will most likely not have used marijuana either, but a student who has used cigarettes has a slightly better than even chance of having used marijuana: once again, the categorical-variable version of a “positive correlation”.
I wouldn’t give you a question this long on an exam, but on a worksheet, there is no problem with giving you plenty of practice.
- We can also use
step
to do the model-building (rather than removing terms one by one). Starting from all three-way interactions, runstep
on this model, saving the result, and then rundrop1
on that result. Is everything remaining significant? (Hint: copy and paste your code from question 2, and change the 2 to a 3.)
Following the hint(s). The output is rather long, but (i) you are not handing this in, and (ii) we have no problems with letting R do the work:
.6 <- glm(count ~ (a+c+m+r+g)^3, data = use, family = "poisson")
use.7 <- step(use.6) use
Start: AIC=193.51
count ~ (a + c + m + r + g)^3
Df Deviance AIC
- c:m:r 1 5.2820 191.52
- m:r:g 1 5.3000 191.54
- a:m:g 1 5.3924 191.63
- a:m:r 1 5.4349 191.67
- a:c:g 1 5.5868 191.82
- a:c:m 1 5.6749 191.91
- c:r:g 1 6.4173 192.66
- c:m:g 1 6.7775 193.02
<none> 5.2720 193.51
- a:r:g 1 7.8942 194.13
- a:c:r 1 7.9474 194.19
Step: AIC=191.52
count ~ a + c + m + r + g + a:c + a:m + a:r + a:g + c:m + c:r +
c:g + m:r + m:g + r:g + a:c:m + a:c:r + a:c:g + a:m:r + a:m:g +
a:r:g + c:m:g + c:r:g + m:r:g
Df Deviance AIC
- m:r:g 1 5.3091 189.55
- a:m:g 1 5.4028 189.64
- a:m:r 1 5.4545 189.69
- a:c:g 1 5.5977 189.84
- a:c:m 1 5.6939 189.93
- c:r:g 1 6.4173 190.66
- c:m:g 1 6.7824 191.02
<none> 5.2820 191.52
- a:r:g 1 7.8976 192.14
- a:c:r 1 8.0316 192.27
Step: AIC=189.55
count ~ a + c + m + r + g + a:c + a:m + a:r + a:g + c:m + c:r +
c:g + m:r + m:g + r:g + a:c:m + a:c:r + a:c:g + a:m:r + a:m:g +
a:r:g + c:m:g + c:r:g
Df Deviance AIC
- a:m:g 1 5.4343 187.67
- a:m:r 1 5.4746 187.71
- a:c:g 1 5.6324 187.87
- a:c:m 1 5.7215 187.96
- c:r:g 1 6.5189 188.76
- c:m:g 1 6.8129 189.05
<none> 5.3091 189.55
- a:r:g 1 8.0975 190.34
- a:c:r 1 8.1046 190.34
Step: AIC=187.67
count ~ a + c + m + r + g + a:c + a:m + a:r + a:g + c:m + c:r +
c:g + m:r + m:g + r:g + a:c:m + a:c:r + a:c:g + a:m:r + a:r:g +
c:m:g + c:r:g
Df Deviance AIC
- a:m:r 1 5.5688 185.81
- a:c:g 1 5.7041 185.94
- a:c:m 1 5.8192 186.06
- c:r:g 1 6.6417 186.88
- c:m:g 1 6.8615 187.10
<none> 5.4343 187.67
- a:r:g 1 8.2153 188.45
- a:c:r 1 8.2529 188.49
Step: AIC=185.81
count ~ a + c + m + r + g + a:c + a:m + a:r + a:g + c:m + c:r +
c:g + m:r + m:g + r:g + a:c:m + a:c:r + a:c:g + a:r:g + c:m:g +
c:r:g
Df Deviance AIC
- a:c:g 1 5.8372 184.07
- a:c:m 1 5.9021 184.14
- c:r:g 1 6.7800 185.02
- c:m:g 1 7.0006 185.24
<none> 5.5688 185.81
- m:r 1 7.7021 185.94
- a:r:g 1 8.3839 186.62
- a:c:r 1 8.6868 186.93
Step: AIC=184.08
count ~ a + c + m + r + g + a:c + a:m + a:r + a:g + c:m + c:r +
c:g + m:r + m:g + r:g + a:c:m + a:c:r + a:r:g + c:m:g + c:r:g
Df Deviance AIC
- a:c:m 1 6.1515 182.39
- c:r:g 1 7.1650 183.40
- c:m:g 1 7.4487 183.69
<none> 5.8372 184.07
- m:r 1 7.9676 184.21
- a:r:g 1 8.9101 185.15
- a:c:r 1 9.1890 185.43
Step: AIC=182.39
count ~ a + c + m + r + g + a:c + a:m + a:r + a:g + c:m + c:r +
c:g + m:r + m:g + r:g + a:c:r + a:r:g + c:m:g + c:r:g
Df Deviance AIC
- c:r:g 1 7.475 181.71
- c:m:g 1 7.791 182.03
<none> 6.151 182.39
- m:r 1 8.319 182.56
- a:r:g 1 9.214 183.45
- a:c:r 1 9.529 183.77
- a:m 1 97.016 271.25
Step: AIC=181.71
count ~ a + c + m + r + g + a:c + a:m + a:r + a:g + c:m + c:r +
c:g + m:r + m:g + r:g + a:c:r + a:r:g + c:m:g
Df Deviance AIC
- c:m:g 1 9.215 181.45
<none> 7.475 181.71
- a:r:g 1 9.476 181.71
- m:r 1 9.613 181.85
- a:c:r 1 11.589 183.83
- a:m 1 98.342 270.58
Step: AIC=181.45
count ~ a + c + m + r + g + a:c + a:m + a:r + a:g + c:m + c:r +
c:g + m:r + m:g + r:g + a:c:r + a:r:g
Df Deviance AIC
- c:g 1 10.31 180.54
<none> 9.22 181.45
- a:r:g 1 11.26 181.50
- m:r 1 11.35 181.59
- a:c:r 1 13.39 183.63
- m:g 1 19.04 189.28
- a:m 1 99.62 269.86
- c:m 1 506.21 676.45
Step: AIC=180.54
count ~ a + c + m + r + g + a:c + a:m + a:r + a:g + c:m + c:r +
m:r + m:g + r:g + a:c:r + a:r:g
Df Deviance AIC
- a:r:g 1 12.24 180.48
<none> 10.31 180.54
- m:r 1 12.44 180.68
- a:c:r 1 14.36 182.60
- m:g 1 19.22 187.46
- a:m 1 100.73 268.97
- c:m 1 506.39 674.63
Step: AIC=180.48
count ~ a + c + m + r + g + a:c + a:m + a:r + a:g + c:m + c:r +
m:r + m:g + r:g + a:c:r
Df Deviance AIC
- r:g 1 13.05 179.29
<none> 12.24 180.48
- m:r 1 14.48 180.71
- a:c:r 1 16.32 182.56
- a:g 1 17.53 183.77
- m:g 1 21.32 187.56
- a:m 1 102.60 268.84
- c:m 1 508.34 674.58
Step: AIC=179.29
count ~ a + c + m + r + g + a:c + a:m + a:r + a:g + c:m + c:r +
m:r + m:g + a:c:r
Df Deviance AIC
<none> 13.05 179.29
- m:r 1 15.15 179.39
- a:c:r 1 17.13 181.37
- a:g 1 18.56 182.80
- m:g 1 21.95 186.19
- a:m 1 103.43 267.67
- c:m 1 509.16 673.39
drop1(use.7, test = "Chisq")
As you see, the process is rather lengthy, which is why I didn’t have you do it by hand.
There is one term in use.7
, m:r
, that is not significant (P-value 0.15). You could remove it if you wish.
- In your final model from the previous question, are there any significant terms that you did not see previously? If so, in each case draw a suitable graph and say what it means.
There is one remaining three-way interaction a:c:r
. One way of interpreting this is that race is associated with the alcohol-cigarettes combination.
With three variables, we have to use x
, fill
and facets. I think the one explanatory variable r
belongs in facets (but you can experiment and see which you prefer):
ggplot(use, aes(x = a, y = count, fill = c)) + geom_col(position = "fill") +
facet_wrap(~ r)
The pattern in the two facets is more or less the same: if a student has used alcohol, they are more likely to have used cigarettes as well. Perhaps the reason for the interaction term being significant is that the no-no pattern is a bit more pronounced for the white students than it is for students of other races: if a white student has not used alcohol (left bar in the right facet), they are more likely to have not used cigarettes either, compared to a student of another race. (The three-way association is only just significant, with a P-value of 0.043, so we would not expect the effect to be very large, and indeed it is not.)