gender contacts glasses none
female 121 32 129
male 42 37 85
Is there association between eyewear and gender?
Normally answer this with a chi-squared test (based on observed frequencies and the frequencies expected under the null hypothesis of no association).
The data: two categorical variables and a frequency.
Here, assess the association in a way that generalizes to more categorical variables.
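For comparison, the usual chi-squared test can be sketched as follows. The counts are typed in from the table above; the object names eyewear_table and eyewear_test are made up for this illustration.

```r
# Counts copied from the table above: rows are gender, columns are eyewear
eyewear_table <- matrix(
  c(121, 32, 129,
     42, 37,  85),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    gender = c("female", "male"),
    eyewear = c("contacts", "glasses", "none")
  )
)

eyewear_test <- chisq.test(eyewear_table)
eyewear_test$expected  # expected frequencies if there were no association
eyewear_test           # test statistic, df = (2 - 1) * (3 - 1) = 2, P-value
```

A small P-value here says the same thing the log-linear analysis will say: eyewear and gender are associated.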
gender contacts glasses none
female 121 32 129
male 42 37 85
This is not tidy!
The two variables are gender and eyewear, and the numbers are all frequencies.
Tidy with pivot_longer (one column for eyewear, one for frequency).
Model: glm with poisson family.
drop1 says what we can remove at this step. Significant = must stay.
Here: nothing! Cannot remove anything.
Frequency depends on gender-eyewear combination, cannot be simplified further.
Gender and eyewear are associated.
For modelling, stop here.
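A minimal sketch of the steps just described, assuming the tidyverse is installed. The data frame is typed in from the table above rather than read from a file, and the names eyes, eyes.1 and eyes.drop are assumptions made for this illustration.

```r
library(tidyverse)

# The untidy table from above, typed in by hand
eyewear <- tribble(
  ~gender,  ~contacts, ~glasses, ~none,
  "female", 121,       32,       129,
  "male",   42,        37,       85
)

# One row per gender-eyewear combination, with its frequency
eyes <- eyewear %>%
  pivot_longer(contacts:none, names_to = "eyewear", values_to = "frequency")

# Complete log-linear model, then see what (if anything) can come out
eyes.1 <- glm(frequency ~ gender * eyewear, data = eyes, family = "poisson")
eyes.drop <- drop1(eyes.1, test = "Chisq")
eyes.drop
```

The only term drop1 offers is gender:eyewear, because main effects cannot be tested while their interaction is still in the model.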
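For the graph, position fill is clearer than dodge.
The data already contain the frequencies (in frequency), so use geom_col rather than geom_bar.
geom_col takes a y that should be the frequency.
x is the explanatory variable (here gender); eyewear is response and goes in fill.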
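The graph described above might look like this. It is a sketch only: the tidy data frame eyes is typed in by hand, and the names eyes and g are assumptions.

```r
library(tidyverse)

# Tidy version of the eyewear data: one row per combination
eyes <- tribble(
  ~gender,  ~eyewear,   ~frequency,
  "female", "contacts", 121,
  "female", "glasses",  32,
  "female", "none",     129,
  "male",   "contacts", 42,
  "male",   "glasses",  37,
  "male",   "none",     85
)

# geom_col because the frequencies are already counted;
# position = "fill" scales each bar to proportions within gender
g <- ggplot(eyes, aes(x = gender, y = frequency, fill = eyewear)) +
  geom_col(position = "fill")
g
```

With position = "fill", each bar has the same height, so the eyewear proportions for females and males can be compared directly.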
eyes.2 <- glm(frequency ~ gender * eyewear,
  data = eyes2,
  family = "poisson"
)
drop1(eyes.2, test = "Chisq")
No longer any association. Take out interaction.
More females (gender effect) over all eyewear
fewer glasses-wearers (eyewear effect) over both genders
no association (no interaction).
In a hospital emergency department, 176 subjects who attended for acute chest pain took part in a study.
Each subject had a normal or abnormal electrocardiogram (ECG) reading, was overweight (as judged by BMI) or not, and was a smoker or not.
How are these three variables related, or not?
In modelling-friendly format:
ecg bmi smoke count
abnormal overweight yes 47
abnormal overweight no 10
abnormal normalweight yes 8
abnormal normalweight no 6
normal overweight yes 25
normal overweight no 15
normal normalweight yes 35
normal normalweight no 30
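library(tidyverse) # for read_delim
my_url <- "http://ritsokiguess.site/datafiles/ecg.txt"
chest <- read_delim(my_url, " ")
# complete model: all interactions among the three factors
chest.1 <- glm(count ~ ecg * bmi * smoke,
  data = chest,
  family = "poisson"
)
# test the highest-order term(s) that could be removed
drop1(chest.1, test = "Chisq")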
That 3-way interaction comes out.
At \(\alpha=0.05\), bmi:smoke comes out.
After removing bmi:smoke, ecg:smoke has become significant. So we have to stop.
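The removal steps above can be sketched with update. The data are typed in from the table earlier (rather than read from the URL) so that the block is self-contained; the model names chest.2 and chest.3 are assumptions.

```r
library(tidyverse)

chest <- tribble(
  ~ecg,       ~bmi,           ~smoke, ~count,
  "abnormal", "overweight",   "yes",  47,
  "abnormal", "overweight",   "no",   10,
  "abnormal", "normalweight", "yes",  8,
  "abnormal", "normalweight", "no",   6,
  "normal",   "overweight",   "yes",  25,
  "normal",   "overweight",   "no",   15,
  "normal",   "normalweight", "yes",  35,
  "normal",   "normalweight", "no",   30
)

chest.1 <- glm(count ~ ecg * bmi * smoke, data = chest, family = "poisson")

# Step 1: take out the 3-way interaction, then look again
chest.2 <- update(chest.1, . ~ . - ecg:bmi:smoke)
drop1(chest.2, test = "Chisq")

# Step 2: take out bmi:smoke, then look again
chest.3 <- update(chest.2, . ~ . - bmi:smoke)
drop1(chest.3, test = "Chisq")
```

Removing one term at a time and re-running drop1 matters because P-values of the remaining terms change after each removal.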
ecg is associated with both bmi and smoke, but separately (it doesn’t depend on the combination of bmi and smoke).
ecg is response (patients came into the study being smokers or overweight) so use as fill in both graphs.
y is the frequency column.
Graph for ecg:bmi.
Graph for ecg:smoke: Most nonsmokers have a normal ECG, but smokers are about 50–50 normal and abnormal ECG.
Don’t look at smoke:bmi since not significant.
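One way to get the ecg:smoke graph, along with the proportions behind it, is sketched below. The data are typed in from the table earlier; the name ecg_by_smoke is an assumption. Swapping smoke for bmi throughout gives the ecg:bmi version.

```r
library(tidyverse)

chest <- tribble(
  ~ecg,       ~bmi,           ~smoke, ~count,
  "abnormal", "overweight",   "yes",  47,
  "abnormal", "overweight",   "no",   10,
  "abnormal", "normalweight", "yes",  8,
  "abnormal", "normalweight", "no",   6,
  "normal",   "overweight",   "yes",  25,
  "normal",   "overweight",   "no",   15,
  "normal",   "normalweight", "yes",  35,
  "normal",   "normalweight", "no",   30
)

# Add up the counts over bmi, then find proportions within each smoke group
ecg_by_smoke <- chest %>%
  count(smoke, ecg, wt = count, name = "frequency") %>%
  group_by(smoke) %>%
  mutate(proportion = frequency / sum(frequency)) %>%
  ungroup()
ecg_by_smoke

# ecg is the response, so it goes in fill
ggplot(ecg_by_smoke, aes(x = smoke, y = frequency, fill = ecg)) +
  geom_col(position = "fill")
```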
                 Alaska Airlines     America West
Airport          On time  Delayed    On time  Delayed
Los Angeles          497       62        694      117
Phoenix              221       12       4840      415
San Diego            212       20        383       65
San Francisco        503      102        320      129
Seattle             1841      305        201       61
Total               3274      501       6438      787
Use status as variable name for “on time/delayed”.
Alaska: 13.3% of flights delayed (\(501/(3274+501)\)).
America West: 10.9% (\(787/(6438+787)\)).
America West more punctual, right?
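The two overall percentages can be checked directly from the Total row. A small sketch; the vector names are made up for this illustration.

```r
# Totals over all five airports, from the Total row of the table
delayed <- c(alaska = 501, america_west = 787)
on_time <- c(alaska = 3274, america_west = 6438)

# Percentage of each airline's flights that were delayed
pct_delayed <- round(100 * delayed / (delayed + on_time), 1)
pct_delayed
```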
airport aa_ontime aa_delayed aw_ontime aw_delayed
LosAngeles 497 62 694 117
Phoenix 221 12 4840 415
SanDiego 212 20 383 65
SanFrancisco 503 102 320 129
Seattle 1841 305 201 61
Tidy with pivot_longer; call the result punctual.
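A possible version of the tidying step is sketched below. The column names encode airline and status separated by an underscore, so pivot_longer with names_sep can split them in one go. The data frame name airlines and the frequency column name freq are assumptions.

```r
library(tidyverse)

airlines <- tribble(
  ~airport,       ~aa_ontime, ~aa_delayed, ~aw_ontime, ~aw_delayed,
  "LosAngeles",   497,        62,          694,        117,
  "Phoenix",      221,        12,          4840,       415,
  "SanDiego",     212,        20,          383,        65,
  "SanFrancisco", 503,        102,         320,        129,
  "Seattle",      1841,       305,         201,        61
)

# Split names like "aa_ontime" into airline ("aa") and status ("ontime")
punctual <- airlines %>%
  pivot_longer(-airport,
    names_to = c("airline", "status"),
    names_sep = "_",
    values_to = "freq"
  )
punctual
```

The result has one row per airport-airline-status combination, which is the layout glm needs.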
We now have three categorical variables, so use one of the explanatories (for me, airport) as facets:
America West more punctual overall,
but worse at every single airport!
How is that possible?
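The airport-by-airport comparison can be checked with a little arithmetic on the wide table. A sketch, with the data typed in from above and the names airlines and delay_rates assumed:

```r
library(tidyverse)

airlines <- tribble(
  ~airport,       ~aa_ontime, ~aa_delayed, ~aw_ontime, ~aw_delayed,
  "LosAngeles",   497,        62,          694,        117,
  "Phoenix",      221,        12,          4840,       415,
  "SanDiego",     212,        20,          383,        65,
  "SanFrancisco", 503,        102,         320,        129,
  "Seattle",      1841,       305,         201,        61
)

# Percent of flights delayed, for each airline at each airport
delay_rates <- airlines %>%
  mutate(
    aa_pct_delayed = 100 * aa_delayed / (aa_ontime + aa_delayed),
    aw_pct_delayed = 100 * aw_delayed / (aw_ontime + aw_delayed)
  ) %>%
  select(airport, aa_pct_delayed, aw_pct_delayed)
delay_rates
```

In every row the America West percentage is the larger one, even though its overall percentage is the smaller.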
Log-linear analysis sheds some light.
Stop here, and draw graphs to understand significant results.
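The text does not show the modelling steps for this example, so the following is only a sketch of how they might go, following the same recipe as before. It assumes the 3-way interaction is the one term that comes out, which is consistent with all three 2-way interactions being discussed below. The names airlines, punctual, freq, punctual.1 and punctual.2 are assumptions.

```r
library(tidyverse)

airlines <- tribble(
  ~airport,       ~aa_ontime, ~aa_delayed, ~aw_ontime, ~aw_delayed,
  "LosAngeles",   497,        62,          694,        117,
  "Phoenix",      221,        12,          4840,       415,
  "SanDiego",     212,        20,          383,        65,
  "SanFrancisco", 503,        102,         320,        129,
  "Seattle",      1841,       305,         201,        61
)

punctual <- airlines %>%
  pivot_longer(-airport,
    names_to = c("airline", "status"),
    names_sep = "_",
    values_to = "freq"
  )

# Complete model, then see whether the 3-way interaction can come out
punctual.1 <- glm(freq ~ airport * airline * status,
  data = punctual, family = "poisson"
)
drop1(punctual.1, test = "Chisq")

# Without the 3-way interaction: now the 2-way interactions get tested
punctual.2 <- update(punctual.1, . ~ . - airport:airline:status)
drop1(punctual.2, test = "Chisq")
```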
airline:status: We did this one before.
Slightly more of Alaska Airlines’ flights delayed overall.
airport:status: Flights into San Francisco (and maybe Seattle) are often late, and flights into Phoenix are usually on time.
Considerable variation among airports.
airport:airline: What fraction of each airline’s flights are to each airport.
Most of Alaska Airlines’ flights to Seattle and San Francisco.
Most of America West’s flights to Phoenix.
Most of America West’s flights to Phoenix, where it is easy to be on time.
Most of Alaska Airlines’ flights to San Francisco and Seattle, where it is difficult to be on time.
Overall comparison looks bad for Alaska because of this.
But, comparing like with like, if you compare each airline’s performance to the same airport, Alaska does better.
Aggregating over the very different airports was a (big) mistake: that was the cause of Simpson’s paradox here.
Alaska Airlines is more punctual when you do the proper comparison.
Retrospective study of ovarian cancer done in 1973.
Information about 299 women operated on for ovarian cancer 10 years previously.
Recorded:
stage of cancer (early or advanced)
type of operation (radical or limited)
X-ray treatment received (yes or no)
10-year survival (yes or no)
Survival looks like response (suggests logistic regression).
Log-linear model finds any associations at all.
After tidying:
stage operation xray survival freq
early radical no no 10
early radical no yes 41
early radical yes no 17
early radical yes yes 64
early limited no no 1
early limited no yes 13
early limited yes no 3
early limited yes yes 9
advanced radical no no 38
advanced radical no yes 6
advanced radical yes no 64
advanced radical yes yes 11
advanced limited no no 3
advanced limited no yes 1
advanced limited yes no 13
advanced limited yes yes 5
Fit the complete model with all interactions (hopefully looking familiar by now).
See what we can remove with drop1.
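A sketch of the model fit and the first drop1, using base R only. The frequencies are typed in from the table above in the same row order (survival changes fastest, stage slowest), which is the order expand.grid produces. The names cancer and cancer.1 are assumptions.

```r
# All 16 combinations, in the same order as the rows of the table above
cancer <- expand.grid(
  survival = c("no", "yes"),
  xray = c("no", "yes"),
  operation = c("radical", "limited"),
  stage = c("early", "advanced"),
  stringsAsFactors = FALSE
)
cancer$freq <- c(10, 41, 17, 64, 1, 13, 3, 9, 38, 6, 64, 11, 3, 1, 13, 5)

# Complete model with all interactions up to the 4-way
cancer.1 <- glm(freq ~ stage * operation * xray * survival,
  data = cancer, family = "poisson"
)

# Only the 4-way interaction can be tested at this step
drop1(cancer.1, test = "Chisq")
```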
Non-significant interaction can come out.
Least significant term is stage:xray:survival: remove.
After removing stage:xray:survival, operation:xray:survival comes out next.
After removing operation:xray:survival, stage:operation:xray has now become significant, so won’t remove that.
Shows value of removing terms one at a time.
There are no higher-order interactions containing both xray and survival, so now we get to test (and remove) xray:survival.
After removing xray:survival, stage:operation:survival comes out.
Remove operation:survival.
Finally done!
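The whole sequence of removals can be sketched as a chain of update calls, in the order given above. In practice, run drop1 after each step to choose the next term; here only the last drop1 is shown. The model names cancer.1 to cancer.7 are assumptions.

```r
cancer <- expand.grid(
  survival = c("no", "yes"),
  xray = c("no", "yes"),
  operation = c("radical", "limited"),
  stage = c("early", "advanced"),
  stringsAsFactors = FALSE
)
cancer$freq <- c(10, 41, 17, 64, 1, 13, 3, 9, 38, 6, 64, 11, 3, 1, 13, 5)

cancer.1 <- glm(freq ~ stage * operation * xray * survival,
  data = cancer, family = "poisson"
)
# One term comes out at each step
cancer.2 <- update(cancer.1, . ~ . - stage:operation:xray:survival)
cancer.3 <- update(cancer.2, . ~ . - stage:xray:survival)
cancer.4 <- update(cancer.3, . ~ . - operation:xray:survival)
cancer.5 <- update(cancer.4, . ~ . - xray:survival)
cancer.6 <- update(cancer.5, . ~ . - stage:operation:survival)
cancer.7 <- update(cancer.6, . ~ . - operation:survival)

# Terms still removable in the final model
drop1(cancer.7, test = "Chisq")
```

The final drop1 tests stage:operation:xray and stage:survival, the two highest-order terms left.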
What matters is things associated with survival (survival is “response”).
Only significant such term is stage:survival.
Most people in early stage of cancer survived, and most people in advanced stage did not survive.
This is true regardless of type of operation or whether or not X-ray treatment was received. These things have no impact on survival.
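The stage:survival association can be seen by adding up the frequencies over operation and xray. A base R sketch, with the data typed in from the table above and the names cancer, stage_surv and surv_prop assumed:

```r
cancer <- expand.grid(
  survival = c("no", "yes"),
  xray = c("no", "yes"),
  operation = c("radical", "limited"),
  stage = c("early", "advanced"),
  stringsAsFactors = FALSE
)
cancer$freq <- c(10, 41, 17, 64, 1, 13, 3, 9, 38, 6, 64, 11, 3, 1, 13, 5)

# Total frequency for each stage-survival combination
stage_surv <- xtabs(freq ~ stage + survival, data = cancer)
stage_surv

# Proportion surviving and not surviving within each stage (rows sum to 1)
surv_prop <- prop.table(stage_surv, margin = 1)
surv_prop
```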
For stage:operation:xray: stage and xray are associated only for those who had the limited operation.
For those who had the radical operation, there was no association between stage and xray.
This is of less interest than associations with survival.
Start with “complete model” including all possible interactions.
drop1 gives highest-order interaction(s) remaining; remove the least significant one if it is non-significant.
Repeat as necessary until everything that drop1 shows is significant.
Look at graphs of significant interactions.
Main effects not usually very interesting.
Interactions with “response” usually of most interest: show association with response.