Chapter 41 Frequency table analysis

Packages for this chapter:
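The work below uses tidyverse functions throughout (read_delim, count, pivot_longer and friends), along with base R's glm, drop1, update, xtabs and prop.table, so:

library(tidyverse)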

41.1 College plans

5199 male high school seniors in Wisconsin were classified by socio-economic status (low, lower-middle, upper-middle, high), by the degree that their parents encouraged them in their education (low or high), and whether or not they had plans to go to college (yes or no). How, if at all, are these categorical variables related? The data can be found at link.

  1. Read in the data and check that you have a column for each variable and a column of frequencies.

  2. Fit a log-linear model containing all possible interactions. You don’t need to examine it yet.

  3. Find out which terms (interactions) could be removed. Do you think removing any of them is a good idea?

  4. Remove anything you can, and fit again. Hint: update.

  5. Continue to examine what can be removed, and if reasonable, remove it, until you need to stop. Which terms are left in your final model?

  6. Make two-way tables of any remaining two-way interactions, and describe any relationships that you see.

41.2 Predicting voting

1257 British voters were classified according to their social class, age (categorized), sex and the political party they voted for (Labour or Conservative). Which, if any, of these factors influences the party that someone votes for? The data are in link, one voter per line.

  1. Read in the data and display (some of) the data frame.

  2. There is no frequency column here, because each row of the data frame only represents one voter. Count up the frequencies for each combo of the categorical variables, and save it (this is the data frame that we will use for the analysis). Display the first few rows of the result. Do you now have something that you need?

  3. Fit a log-linear model with the appropriate interaction (as a starting point).

  4. Refine your model by taking out suitable non-significant terms, in multiple steps. What model do you finish with?

  5. If we think of the party someone votes for as the final outcome (that depends on all the other things), what does our final model say that someone’s vote depends on?

  6. Obtain sub-tables that explain how vote depends on any of the things it’s related to.

41.3 Brand M laundry detergent

A survey was carried out comparing respondents’ preferences between a laundry detergent M and a mystery brand X. For each respondent, the researchers recorded the temperature of the laundry load (low or high), whether or not they had previously used brand M (yes or no), and the softness of the water used for the laundry load (hard, medium or soft). The aim of the survey was to find out what was associated with respondents preferring brand M. The data are in http://ritsokiguess.site/datafiles/brand_m.csv.

  1. Read in and display (some of) the data. Explain briefly how the data is laid out appropriately to fit a log-linear model.

  2. Using backward elimination, build a suitable log-linear model for the associations between the variables. (Do not use step; do the elimination yourself).

  3. What is associated with the brand a respondent prefers? By obtaining suitable frequency tables, describe the nature of these associations.

My solutions follow:

41.4 College plans

5199 male high school seniors in Wisconsin were classified by socio-economic status (low, lower-middle, upper-middle, high), by the degree that their parents encouraged them in their education (low or high), and whether or not they had plans to go to college (yes or no). How, if at all, are these categorical variables related? The data can be found at link.

  1. Read in the data and check that you have a column for each variable and a column of frequencies.

Solution

Delimited by one space:
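A minimal sketch of the reading-in; my_url stands for wherever the data file lives (the link above), and plans is my choice of name for the data frame:

plans <- read_delim(my_url, " ")   # my_url: address of the space-delimited file
plans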

## 
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   social.stratum = col_character(),
##   encouragement = col_character(),
##   college.plans = col_character(),
##   frequency = col_double()
## )
## # A tibble: 16 x 4
##    social.stratum encouragement college.plans frequency
##    <chr>          <chr>         <chr>             <dbl>
##  1 lower          low           no                  749
##  2 lower          low           yes                  35
##  3 lower          high          no                  233
##  4 lower          high          yes                 133
##  5 lowermiddle    low           no                  627
##  6 lowermiddle    low           yes                  38
##  7 lowermiddle    high          no                  330
##  8 lowermiddle    low           no                  303
##  9 uppermiddle    low           no                  627
## 10 uppermiddle    low           yes                  38
## 11 uppermiddle    high          no                  374
## 12 uppermiddle    high          yes                 467
## 13 higher         low           no                  153
## 14 higher         low           yes                  26
## 15 higher         high          no                  266
## 16 higher         high          yes                 800

As promised. We only have 16 observations, because we have all possible combinations of the categorical variables: 4 social strata, times 2 levels of encouragement, times 2 levels of college plans.

Each line of the data file summarizes a number of students, not just one. For example, the first line says that 749 students were in the lower social stratum, received low encouragement and have no college plans. If we sum up the frequencies, we should get 5199 because there were that many students altogether:
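One way to do the check, assuming the data frame is called plans as above:

plans %>% summarize(tot = sum(frequency))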

## # A tibble: 1 x 1
##     tot
##   <dbl>
## 1  5199

\(\blacksquare\)

  2. Fit a log-linear model containing all possible interactions. You don’t need to examine it yet.

Solution
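A sketch, again assuming the data frame is called plans (plans.1 is my name for the model): the frequencies are the response, the family is poisson, and * between the factors gives all the interactions:

plans.1 <- glm(frequency ~ social.stratum * encouragement * college.plans,
               data = plans, family = "poisson")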

\(\blacksquare\)

  3. Find out which terms (interactions) could be removed. Do you think removing any of them is a good idea?

Solution

This is drop1. If you forget the test=, you won’t get any P-values:
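With plans.1 as the all-interactions model from above, this would be:

drop1(plans.1, test = "Chisq")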

## Single term deletions
## 
## Model:
## frequency ~ social.stratum * encouragement * college.plans
##                                            Df Deviance    AIC   LRT Pr(>Chi)
## <none>                                          115.28 259.52               
## social.stratum:encouragement:college.plans  2   118.98 259.22 3.697   0.1575

This P-value is not small, so the three-way interaction can be removed.

\(\blacksquare\)

  4. Remove anything you can, and fit again. Hint: update.

Solution

In this kind of modelling, it’s easier to describe what changes should be made to get from one model to another, rather than writing out the whole thing from scratch again. Anyway, the three-way interaction can come out:
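A sketch, continuing with my names from above (plans.2 is the new model):

plans.2 <- update(plans.1, . ~ . - social.stratum:encouragement:college.plans)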

\(\blacksquare\)

  5. Continue to examine what can be removed, and if reasonable, remove it, until you need to stop. Which terms are left in your final model?

Solution

Start with drop1:

## Single term deletions
## 
## Model:
## frequency ~ social.stratum + encouragement + college.plans + 
##     social.stratum:encouragement + social.stratum:college.plans + 
##     encouragement:college.plans
##                              Df Deviance     AIC    LRT  Pr(>Chi)    
## <none>                            118.98  259.22                     
## social.stratum:encouragement  3   379.18  513.42 260.20 < 2.2e-16 ***
## social.stratum:college.plans  3   331.86  466.10 212.88 < 2.2e-16 ***
## encouragement:college.plans   1  1024.69 1162.94 905.72 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

These are all strongly significant, so they have to stay. There is nothing else we can remove. All the two-way interactions have to stay in the model.

\(\blacksquare\)

  6. Make two-way tables of any remaining two-way interactions, and describe any relationships that you see.

Solution

We have three two-way tables to make.

My first one is social stratum by parental encouragement. Neither of these is really a response, but I thought that social stratum would influence parental encouragement rather than the other way around, hence:
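A sketch of how such a table might be made, with plans and tab1 as my names; the frequencies go on the left of the squiggle, and margin = 1 makes each row add up to 1:

tab1 <- xtabs(frequency ~ social.stratum + encouragement, data = plans)
prop.table(tab1, margin = 1)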

##               encouragement
## social.stratum      high       low
##    higher      0.8562249 0.1437751
##    lower       0.3182609 0.6817391
##    lowermiddle 0.2542373 0.7457627
##    uppermiddle 0.5584329 0.4415671

This says that there tends to be more parental encouragement, the higher the social stratum. Next, this:

##               college.plans
## social.stratum         no        yes
##    higher      0.33654618 0.66345382
##    lower       0.85391304 0.14608696
##    lowermiddle 0.97072419 0.02927581
##    uppermiddle 0.66467463 0.33532537

In this one (and the next), college.plans is the response, in columns, so we want to have the rows adding up to 1.

The higher the social stratum, the more likely a male high school senior is to have plans to go to college. (The rows of the table are not in order of social stratum, so you’ll have to jump from the second row to the third, to the fourth, and then back to the first to assess this. The proportions for lower and lower-middle are not quite in that increasing order, but the others are.)

Finally, this:

##              college.plans
## encouragement        no       yes
##          high 0.4621590 0.5378410
##          low  0.9472265 0.0527735

And here you see an enormous effect of parental encouragement on college plans: if it is low, the high-school senior is very unlikely to be considering college.

Nothing, in all honesty, that is very surprising here. But the two-way interactions are easier to interpret than a three-way one would have been.

Here, we think of college plans as being a response, and this analysis has shown that whether or not a student has plans to go to college depends separately on the socio-economic status and the level of parental encouragement (rather than on the combination of both, as would have been the case had the three-way interaction been significant).

\(\blacksquare\)

41.5 Predicting voting

1257 British voters were classified according to their social class, age (categorized), sex and the political party they voted for (Labour or Conservative). Which, if any, of these factors influences the party that someone votes for? The data are in link, one voter per line.

  1. Read in the data and display (some of) the data frame.

Solution

Space-delimited:

## 
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   id = col_double(),
##   class = col_character(),
##   age = col_character(),
##   sex = col_character(),
##   vote = col_character()
## )
## # A tibble: 1,257 x 5
##       id class        age   sex    vote        
##    <dbl> <chr>        <chr> <chr>  <chr>       
##  1     1 upper middle >75   male   conservative
##  2     2 upper middle >75   male   conservative
##  3     3 upper middle >75   male   conservative
##  4     4 upper middle >75   male   conservative
##  5     5 upper middle >75   female conservative
##  6     6 upper middle >75   female conservative
##  7     7 upper middle >75   female conservative
##  8     8 upper middle >75   female conservative
##  9     9 upper middle >75   female conservative
## 10    10 upper middle >75   female conservative
## # … with 1,247 more rows

I gave it a “disposable” name, since we make the “real” data set shortly.

\(\blacksquare\)

  2. There is no frequency column here, because each row of the data frame only represents one voter. Count up the frequencies for each combo of the categorical variables, and save it (this is the data frame that we will use for the analysis). Display the first few rows of the result. Do you now have something that you need?

Solution

I changed my mind about how to do this from last year. Using count is alarmingly more direct than the method I had before:
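A sketch, assuming the raw one-row-per-voter data frame was called votes0; I save the counted version as votes, the name that appears in the model output later:

votes <- votes0 %>% count(class, age, sex, vote)
votes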

## # A tibble: 58 x 5
##    class        age   sex    vote             n
##    <chr>        <chr> <chr>  <chr>        <int>
##  1 lower middle <26   female conservative    13
##  2 lower middle <26   female labour           7
##  3 lower middle <26   male   conservative     9
##  4 lower middle <26   male   labour           9
##  5 lower middle >75   female conservative     9
##  6 lower middle >75   female labour           2
##  7 lower middle >75   male   conservative     8
##  8 lower middle >75   male   labour           4
##  9 lower middle 26-40 female conservative    17
## 10 lower middle 26-40 female labour          13
## # … with 48 more rows

Exactly the right thing now: note the new column n with frequencies in it. (Without a column of frequencies we can’t fit a log-linear model.) There are now only 58 combinations of the four categorical variables, as opposed to 1257 rows in the original data set (with, inevitably, a lot of repeats).

\(\blacksquare\)

  3. Fit a log-linear model with the appropriate interaction (as a starting point).

Solution
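Something like this (vote.1 is my name for the model); n, the frequency column made by count, is the response:

vote.1 <- glm(n ~ class * age * sex * vote, data = votes, family = "poisson")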

\(\blacksquare\)

  4. Refine your model by taking out suitable non-significant terms, in multiple steps. What model do you finish with?

Solution

Alternating drop1 and update until everything remaining is significant:

## Single term deletions
## 
## Model:
## n ~ class * age * sex * vote
##                    Df Deviance    AIC   LRT Pr(>Chi)
## <none>                   0.000 381.49               
## class:age:sex:vote  7    8.086 375.58 8.086   0.3251

Not anywhere near significant, so out it comes:

## Single term deletions
## 
## Model:
## n ~ class + age + sex + vote + class:age + class:sex + age:sex + 
##     class:vote + age:vote + sex:vote + class:age:sex + class:age:vote + 
##     class:sex:vote + age:sex:vote
##                Df Deviance    AIC     LRT Pr(>Chi)  
## <none>               8.086 375.58                   
## class:age:sex   8   11.244 362.74  3.1583  0.92404  
## class:age:vote  7   21.962 375.46 13.8759  0.05343 .
## class:sex:vote  2   10.142 373.64  2.0564  0.35765  
## age:sex:vote    4   14.239 373.73  6.1528  0.18802  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Take out the first one, since it has the highest P-value:

## Single term deletions
## 
## Model:
## n ~ class + age + sex + vote + class:age + class:sex + age:sex + 
##     class:vote + age:vote + sex:vote + class:age:vote + class:sex:vote + 
##     age:sex:vote
##                Df Deviance    AIC     LRT Pr(>Chi)  
## <none>              11.244 362.74                   
## class:age:vote  7   25.171 362.66 13.9262  0.05251 .
## class:sex:vote  2   12.794 360.29  1.5498  0.46074  
## age:sex:vote    4   19.248 362.74  8.0041  0.09143 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Next, out comes class:sex:vote:

## Single term deletions
## 
## Model:
## n ~ class + age + sex + vote + class:age + class:sex + age:sex + 
##     class:vote + age:vote + sex:vote + class:age:vote + age:sex:vote
##                Df Deviance    AIC     LRT Pr(>Chi)  
## <none>              12.794 360.29                   
## class:sex       2   13.477 356.97  0.6830  0.71070  
## class:age:vote  7   26.698 360.19 13.9036  0.05292 .
## age:sex:vote    4   21.211 360.71  8.4172  0.07744 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Then class:sex comes out:

## Single term deletions
## 
## Model:
## n ~ class + age + sex + vote + class:age + age:sex + class:vote + 
##     age:vote + sex:vote + class:age:vote + age:sex:vote
##                Df Deviance    AIC     LRT Pr(>Chi)  
## <none>              13.477 356.97                   
## class:age:vote  7   27.633 357.13 14.1555  0.04848 *
## age:sex:vote    4   22.081 357.57  8.6037  0.07181 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

I don’t like having three-way interactions, so I’m going to yank age:sex:vote now, even though its P-value is smallish:

## Single term deletions
## 
## Model:
## n ~ class + age + sex + vote + class:age + age:sex + class:vote + 
##     age:vote + sex:vote + class:age:vote
##                Df Deviance    AIC     LRT  Pr(>Chi)    
## <none>              22.081 357.57                      
## age:sex         4   22.918 350.41  0.8372 0.9333914    
## sex:vote        1   33.018 366.51 10.9376 0.0009423 ***
## class:age:vote  7   36.236 357.73 14.1555 0.0484843 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The age-sex interaction can go, but we must be near the end now:

## Single term deletions
## 
## Model:
## n ~ class + age + sex + vote + class:age + class:vote + age:vote + 
##     sex:vote + class:age:vote
##                Df Deviance    AIC    LRT  Pr(>Chi)    
## <none>              22.918 350.41                     
## sex:vote        1   33.808 359.30 10.890 0.0009667 ***
## class:age:vote  7   37.073 350.57 14.155 0.0484843 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

And that’s it. The age and sex main effects are not included in the list of droppable things because both variables are part of higher-order interactions that are still in the model.

If you want to, you can look at the summary of your final model:

## 
## Call:
## glm(formula = n ~ class + age + sex + vote + class:age + class:vote + 
##     age:vote + sex:vote + class:age:vote, family = "poisson", 
##     data = votes)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.82417  -0.39708  -0.00015   0.41445   1.41435  
## 
## Coefficients: (1 not defined because of singularities)
##                                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                            2.50737    0.21613  11.601  < 2e-16 ***
## classupper middle                     -0.45199    0.34188  -1.322 0.186151    
## classworking                           0.37469    0.27696   1.353 0.176088    
## age>75                                -0.25783    0.32292  -0.798 0.424622    
## age26-40                               0.34294    0.27877   1.230 0.218619    
## age41-50                               0.93431    0.25162   3.713 0.000205 ***
## age51-75                               0.89794    0.25293   3.550 0.000385 ***
## sexmale                               -0.23242    0.08016  -2.900 0.003737 ** 
## votelabour                            -0.50081    0.33324  -1.503 0.132882    
## classupper middle:age>75               0.25783    0.49713   0.519 0.604013    
## classworking:age>75                    0.01097    0.41896   0.026 0.979113    
## classupper middle:age26-40             0.82466    0.41396   1.992 0.046358 *  
## classworking:age26-40                  0.41083    0.35167   1.168 0.242713    
## classupper middle:age41-50             0.37788    0.39239   0.963 0.335542    
## classworking:age41-50                 -0.28917    0.33310  -0.868 0.385327    
## classupper middle:age51-75             0.43329    0.39277   1.103 0.269954    
## classworking:age51-75                  0.10223    0.32668   0.313 0.754325    
## classupper middle:votelabour          -0.12338    0.53898  -0.229 0.818936    
## classworking:votelabour                1.05741    0.39259   2.693 0.007073 ** 
## age>75:votelabour                     -0.72300    0.57745  -1.252 0.210547    
## age26-40:votelabour                    0.21667    0.41944   0.517 0.605452    
## age41-50:votelabour                   -0.93431    0.43395  -2.153 0.031315 *  
## age51-75:votelabour                   -0.62601    0.41724  -1.500 0.133526    
## sexmale:votelabour                     0.37323    0.11334   3.293 0.000992 ***
## classupper middle:age>75:votelabour         NA         NA      NA       NA    
## classworking:age>75:votelabour        -0.29039    0.68720  -0.423 0.672607    
## classupper middle:age26-40:votelabour -0.53698    0.65445  -0.821 0.411931    
## classworking:age26-40:votelabour      -0.28479    0.49429  -0.576 0.564516    
## classupper middle:age41-50:votelabour -0.01015    0.68338  -0.015 0.988147    
## classworking:age41-50:votelabour       1.06121    0.50772   2.090 0.036603 *  
## classupper middle:age51-75:votelabour -0.06924    0.65903  -0.105 0.916328    
## classworking:age51-75:votelabour       0.16608    0.49036   0.339 0.734853    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 797.594  on 57  degrees of freedom
## Residual deviance:  22.918  on 27  degrees of freedom
## AIC: 350.41
## 
## Number of Fisher Scoring iterations: 4

These tend to be rather unwieldy, and we’ll see a better way of understanding the results below, but you can look for the very significant results, bearing in mind that the first category of each factor is the baseline. For example, more of the males in the survey voted Labour (than Conservative).

\(\blacksquare\)

  5. If we think of the party someone votes for as the final outcome (that depends on all the other things), what does our final model say that someone’s vote depends on?

Solution

Find out which of the surviving terms are interactions with vote. Here, there are two such things that vote depends on, separately:

  • sex

  • The age-class interaction.

\(\blacksquare\)

  6. Obtain sub-tables that explain how vote depends on any of the things it’s related to.

Solution

This is xtabs again. The 3-way interaction is a bit tricky, so we’ll do the simple one first:
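A sketch, using my names again: vote in rows, sex in columns, and margin = 2 so that each sex’s column adds up to 1:

tab2 <- xtabs(n ~ vote + sex, data = votes)
prop.table(tab2, margin = 2)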

##               sex
## vote              female      male
##   conservative 0.5474339 0.4543974
##   labour       0.4525661 0.5456026

The female voters slightly preferred to vote Conservative and the male voters slightly preferred to vote Labour. This is a small effect, but I guess the large number of voters made it big enough to be significant.

I took it this way around because vote is the outcome, and therefore I want to address things like “if a voter is female, how likely are they to vote Labour”, rather than conditioning the other way around (which would be “if a voter voted Labour, how likely are they to be female”, which doesn’t make nearly so much sense).

Then the tricky one:

## , , class = lower middle
## 
##               age
## vote           <26 >75 26-40 41-50 51-75
##   conservative  22  17    31    56    54
##   labour        16   6    28    16    21
## 
## , , class = upper middle
## 
##               age
## vote           <26 >75 26-40 41-50 51-75
##   conservative  14  14    45    52    53
##   labour         9   0    21    13    17
## 
## , , class = working
## 
##               age
## vote           <26 >75 26-40 41-50 51-75
##   conservative  32  25    68    61    87
##   labour        67  19   133   145   115

Doing it this way has produced different subtables for each class. This is actually OK, because we can say “if a voter was of lower middle class” and then talk about the relationship between age and vote, as if we were looking at a simple effect:

  • If a voter was lower-middle-class, they strongly favoured voting Conservative in all age groups except <26 and 26–40.

  • If a voter was upper-middle-class, they favoured voting Conservative even more strongly, in all age groups except "under 26" and maybe 26–40.

  • If a voter was working-class, they strongly favoured voting Labour, except in the "over 75" age group (and maybe 51–75 as well).

If the anomalous age group(s) had been the same one every time, there would no longer have been an interaction between age and class in their effect on vote. But the anomalous age groups were different for each class ("different pattern"), and that explains why there was a vote:age:class interaction: the way someone votes depends on the combination of age and social class.

For prop.table in three dimensions, as we have here, we have to be a little more careful about what to make add up to 1. For example, to make the social classes each add up to 1, which is the third dimension:
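For example, if the three-way table was made with vote, age and class in that order, class is the third dimension, so (a sketch, with tab3 as my name for the table):

tab3 <- xtabs(n ~ vote + age + class, data = votes)
prop.table(tab3, margin = 3)   # each class slice sums to 1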

## , , class = lower middle
## 
##               age
## vote                  <26        >75      26-40      41-50      51-75
##   conservative 0.08239700 0.06367041 0.11610487 0.20973783 0.20224719
##   labour       0.05992509 0.02247191 0.10486891 0.05992509 0.07865169
## 
## , , class = upper middle
## 
##               age
## vote                  <26        >75      26-40      41-50      51-75
##   conservative 0.05882353 0.05882353 0.18907563 0.21848739 0.22268908
##   labour       0.03781513 0.00000000 0.08823529 0.05462185 0.07142857
## 
## , , class = working
## 
##               age
## vote                  <26        >75      26-40      41-50      51-75
##   conservative 0.04255319 0.03324468 0.09042553 0.08111702 0.11569149
##   labour       0.08909574 0.02526596 0.17686170 0.19281915 0.15292553

What happened here is that each of the three subtables adds up to 1, so that we have a “joint distribution” in each table. We can put two variables into prop.table, and see what happens then:
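Continuing the sketch: putting both age and class (dimensions 2 and 3 of tab3) into margin makes each age-class combination add up to 1:

prop.table(tab3, margin = c(2, 3))   # age and class together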

## , , class = lower middle
## 
##               age
## vote                 <26       >75     26-40     41-50     51-75
##   conservative 0.5789474 0.7391304 0.5254237 0.7777778 0.7200000
##   labour       0.4210526 0.2608696 0.4745763 0.2222222 0.2800000
## 
## , , class = upper middle
## 
##               age
## vote                 <26       >75     26-40     41-50     51-75
##   conservative 0.6086957 1.0000000 0.6818182 0.8000000 0.7571429
##   labour       0.3913043 0.0000000 0.3181818 0.2000000 0.2428571
## 
## , , class = working
## 
##               age
## vote                 <26       >75     26-40     41-50     51-75
##   conservative 0.3232323 0.5681818 0.3383085 0.2961165 0.4306931
##   labour       0.6767677 0.4318182 0.6616915 0.7038835 0.5693069

This is making each class-age combination add up to 1, so that we can clearly see what fraction of voters voted for each party in each case. In the first two subtables, the two youngest subgroups are clearly different from the others, with a smaller proportion of people voting Conservative rather than Labour than for the older subgroups. If that same pattern persisted for the third subtable, with the two youngest age groups being different from the three older ones, then we would have an age by vote interaction rather than the age by class by vote interaction that we actually have. So the third class group should be different. It is: it seems that the three youngest age groups are different from the other two, with ages 41–50 being more inclined to vote Labour, like the younger groups. That’s where the interaction came from.

The Labour Party in the UK is like the NDP here, in that it has strong ties with “working people”, trades unions in particular. The Conservatives are like the Conservatives here (indeed, the nickname “Tories” comes from the UK; the Conservatives there were officially known as the Tories many years ago). Many people are lifelong voters for their party, and would never think of voting for the “other side”, in the same way that many Americans vote either Democrat or Republican without thinking about it too much. Our parliamentary system comes from the UK system (vote for a candidate in a riding, the leader of the party with the most elected candidates becomes Prime Minister), and a “landslide” victory often comes from persuading enough of the voters open to persuasion to switch sides. In the UK, as here, the parties’ share of the popular vote doesn’t change all that much from election to election, even though the number of seats in Parliament might change quite a lot.

\(\blacksquare\)

41.6 Brand M laundry detergent

A survey was carried out comparing respondents’ preferences between a laundry detergent M and a mystery brand X. For each respondent, the researchers recorded the temperature of the laundry load (low or high), whether or not they had previously used brand M (yes or no), and the softness of the water used for the laundry load (hard, medium or soft). The aim of the survey was to find out what was associated with respondents preferring brand M. The data are in http://ritsokiguess.site/datafiles/brand_m.csv.

  1. Read in and display (some of) the data. Explain briefly how the data is laid out appropriately to fit a log-linear model.

Solution

The reading-in is entirely familiar:
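For completeness, the sort of thing that would do it (detergent is my choice of name for the data frame):

my_url <- "http://ritsokiguess.site/datafiles/brand_m.csv"
detergent <- read_csv(my_url)
detergent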

## 
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   softness = col_character(),
##   m_user = col_character(),
##   temperature = col_character(),
##   prefer = col_character(),
##   frequency = col_double()
## )
## # A tibble: 24 x 5
##    softness m_user temperature prefer frequency
##    <chr>    <chr>  <chr>       <chr>      <dbl>
##  1 hard     no     low         x             68
##  2 hard     no     low         m             42
##  3 hard     no     high        x             42
##  4 hard     no     high        m             30
##  5 hard     yes    low         x             37
##  6 hard     yes    low         m             52
##  7 hard     yes    high        x             24
##  8 hard     yes    high        m             43
##  9 medium   no     low         x             66
## 10 medium   no     low         m             50
## # … with 14 more rows

This is good because we have each “observation” (frequency, here) in one row, or, said differently, we have a column of frequencies and each of the factors in a column of its own. (See the Extra for the kind of layout we might have had to deal with.)

Extra: as you might expect, this is very much not how the data came to me. It was originally in a textbook, laid out as a two-way table, with softness down the side and the combinations of the other three variables across the top, under multiple header rows.

This is a common layout for frequency data, because it saves a lot of space. Multiple header rows are hard to deal with, though, so I combined the three column variables into one with a layout like this (aligned in columns):

softness  no_low_x  no_low_m  no_high_x  no_high_m  yes_low_x  yes_low_m  yes_high_x  yes_high_m
hard            68        42         42         30         37         52          24          43
medium          66        50         33         23         47         55          23          47
soft            63        53         29         27         57         49          19          29

Let’s read this in and then think about what to do with it:

## 
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   softness = col_character(),
##   no_low_x = col_double(),
##   no_low_m = col_double(),
##   no_high_x = col_double(),
##   no_high_m = col_double(),
##   yes_low_x = col_double(),
##   yes_low_m = col_double(),
##   yes_high_x = col_double(),
##   yes_high_m = col_double()
## )
## # A tibble: 3 x 9
##   softness no_low_x no_low_m no_high_x no_high_m yes_low_x yes_low_m yes_high_x yes_high_m
##   <chr>       <dbl>    <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>      <dbl>
## 1 hard           68       42        42        30        37        52         24         43
## 2 medium         66       50        33        23        47        55         23         47
## 3 soft           63       53        29        27        57        49         19         29

We want to get all those frequencies into one column, which suggests some kind of pivot_longer. There are two ways to go about this. One is to try a regular pivot_longer and see what happens. I had to think for a moment about what to call the column that ended up as combo:
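A sketch, assuming the wide data frame was read in as brands_wide: everything except softness gets gathered up, with the old column names going into combo and the frequencies into frequency:

brands_wide %>%
  pivot_longer(-softness, names_to = "combo", values_to = "frequency")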

## # A tibble: 24 x 3
##    softness combo      frequency
##    <chr>    <chr>          <dbl>
##  1 hard     no_low_x          68
##  2 hard     no_low_m          42
##  3 hard     no_high_x         42
##  4 hard     no_high_m         30
##  5 hard     yes_low_x         37
##  6 hard     yes_low_m         52
##  7 hard     yes_high_x        24
##  8 hard     yes_high_m        43
##  9 medium   no_low_x          66
## 10 medium   no_low_m          50
## # … with 14 more rows

This is the right kind of shape, but those things in the column combo are three variables all smooshed together: respectively, previous user (of brand M), temperature, preference (you can tell by the values, which are all different). These can be split up with separate, thus:
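Continuing the sketch, separate splits combo at the underscores:

brands_wide %>%
  pivot_longer(-softness, names_to = "combo", values_to = "frequency") %>%
  separate(combo, into = c("prev_user", "temperature", "preference"), sep = "_")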

## # A tibble: 24 x 5
##    softness prev_user temperature preference frequency
##    <chr>    <chr>     <chr>       <chr>          <dbl>
##  1 hard     no        low         x                 68
##  2 hard     no        low         m                 42
##  3 hard     no        high        x                 42
##  4 hard     no        high        m                 30
##  5 hard     yes       low         x                 37
##  6 hard     yes       low         m                 52
##  7 hard     yes       high        x                 24
##  8 hard     yes       high        m                 43
##  9 medium   no        low         x                 66
## 10 medium   no        low         m                 50
## # … with 14 more rows

That works, but the combination of pivot_longer and separate is a common one, and so there is an “advanced” version of pivot_longer that does it all at once. The idea is that you enter three columns into names_to and then use names_sep to say what they’re separated by:
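The all-at-once version of the sketch: three names in names_to, with names_sep saying that they are separated by underscores:

brands_wide %>%
  pivot_longer(-softness,
               names_to = c("m_user", "temperature", "prefer"),
               names_sep = "_",
               values_to = "frequency")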

## # A tibble: 24 x 5
##    softness m_user temperature prefer frequency
##    <chr>    <chr>  <chr>       <chr>      <dbl>
##  1 hard     no     low         x             68
##  2 hard     no     low         m             42
##  3 hard     no     high        x             42
##  4 hard     no     high        m             30
##  5 hard     yes    low         x             37
##  6 hard     yes    low         m             52
##  7 hard     yes    high        x             24
##  8 hard     yes    high        m             43
##  9 medium   no     low         x             66
## 10 medium   no     low         m             50
## # … with 14 more rows

This data frame is what I saved for you.

\(\blacksquare\)

  2. Using backward elimination, build a suitable log-linear model for the associations between the variables. (Do not use step; do the elimination yourself).

Solution

The first step is to fit a model containing all the interactions among the factors, using frequency as the response, and then to use drop1 with test="Chisq" to see what can come out. Don’t forget the family = "poisson", since that’s what drives the modelling. I think it’s easiest to number these models, since there might be a lot of them:
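A sketch of that first step, with detergent as my name for the data frame and brands.1 for the model:

brands.1 <- glm(frequency ~ softness * m_user * temperature * prefer,
                data = detergent, family = "poisson")
drop1(brands.1, test = "Chisq")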

## Single term deletions
## 
## Model:
## frequency ~ softness * m_user * temperature * prefer
##                                    Df Deviance    AIC     LRT Pr(>Chi)
## <none>                                 0.00000 180.41                 
## softness:m_user:temperature:prefer  2  0.73732 177.15 0.73732   0.6917

To our relief, the four-way interaction is not significant and can be removed. (I was not looking forward to the prospect of interpreting that!)

Now write an update line that removes that four-way interaction from your model, as shown below, and copy-paste your drop1 line from above, changing the number of your model to the one coming out of update. Copy-paste the complicated interaction from the drop1 output:
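Following that recipe with my model names:

brands.2 <- update(brands.1, . ~ . - softness:m_user:temperature:prefer)
drop1(brands.2, test = "Chisq")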

## Single term deletions
## 
## Model:
## frequency ~ softness + m_user + temperature + prefer + softness:m_user + 
##     softness:temperature + m_user:temperature + softness:prefer + 
##     m_user:prefer + temperature:prefer + softness:m_user:temperature + 
##     softness:m_user:prefer + softness:temperature:prefer + m_user:temperature:prefer
##                             Df Deviance    AIC    LRT Pr(>Chi)
## <none>                           0.7373 177.15                
## softness:m_user:temperature  2   2.1146 174.53 1.3773   0.5022
## softness:m_user:prefer       2   5.3086 177.72 4.5713   0.1017
## softness:temperature:prefer  2   0.8991 173.31 0.1618   0.9223
## m_user:temperature:prefer    1   2.9594 177.37 2.2220   0.1361

There are now four three-way interactions that could be removed. You might suspect that they are all going to go eventually, but as in regression, we take them one at a time, starting with the one that has the highest P-value (just in case, for example, the P-value of the second one goes under 0.05 when we remove the others). The easiest way to do the coding is a vigorous amount of copying and pasting. Copy-paste your last code chunk:

Change the interaction in the update to the one you want to remove (from the drop1 table), which is softness:temperature:prefer (you can copy-paste that too), and then increase all three of the model numbers by 1:

## Single term deletions
## 
## Model:
## frequency ~ softness + m_user + temperature + prefer + softness:m_user + 
##     softness:temperature + m_user:temperature + softness:prefer + 
##     m_user:prefer + temperature:prefer + softness:m_user:temperature + 
##     softness:m_user:prefer + m_user:temperature:prefer
##                             Df Deviance    AIC    LRT Pr(>Chi)
## <none>                           0.8991 173.31                
## softness:m_user:temperature  2   2.2506 170.67 1.3514   0.5088
## softness:m_user:prefer       2   5.4952 173.91 4.5961   0.1005
## m_user:temperature:prefer    1   3.1148 173.53 2.2157   0.1366

Then, as they say, rinse and repeat. This one takes a while, but each step is just like the others.

drop softness:m_user:temperature

## Single term deletions
## 
## Model:
## frequency ~ softness + m_user + temperature + prefer + softness:m_user + 
##     softness:temperature + m_user:temperature + softness:prefer + 
##     m_user:prefer + temperature:prefer + softness:m_user:prefer + 
##     m_user:temperature:prefer
##                           Df Deviance    AIC    LRT Pr(>Chi)  
## <none>                         2.2506 170.67                  
## softness:temperature       2   7.8155 172.23 5.5649  0.06189 .
## softness:m_user:prefer     2   7.0585 171.47 4.8080  0.09036 .
## m_user:temperature:prefer  1   4.5184 170.93 2.2678  0.13209  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

drop m_user:temperature:prefer

## Single term deletions
## 
## Model:
## frequency ~ softness + m_user + temperature + prefer + softness:m_user + 
##     softness:temperature + m_user:temperature + softness:prefer + 
##     m_user:prefer + temperature:prefer + softness:m_user:prefer
##                        Df Deviance    AIC    LRT Pr(>Chi)  
## <none>                      4.5184 170.93                  
## softness:temperature    2  10.6035 173.02 6.0851  0.04771 *
## m_user:temperature      1   5.2477 169.66 0.7293  0.39311  
## temperature:prefer      1   8.2467 172.66 3.7283  0.05350 .
## softness:m_user:prefer  2   9.8462 172.26 5.3278  0.06967 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

drop m_user:temperature

## Single term deletions
## 
## Model:
## frequency ~ softness + m_user + temperature + prefer + softness:m_user + 
##     softness:temperature + softness:prefer + m_user:prefer + 
##     temperature:prefer + softness:m_user:prefer
##                        Df Deviance    AIC    LRT Pr(>Chi)  
## <none>                      5.2477 169.66                  
## softness:temperature    2  11.2951 171.71 6.0474  0.04862 *
## temperature:prefer      1   9.5576 171.97 4.3099  0.03789 *
## softness:m_user:prefer  2  10.5860 171.00 5.3383  0.06931 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

drop softness:m_user:prefer

## Single term deletions
## 
## Model:
## frequency ~ softness + m_user + temperature + prefer + softness:m_user + 
##     softness:temperature + softness:prefer + m_user:prefer + 
##     temperature:prefer
##                      Df Deviance    AIC     LRT  Pr(>Chi)    
## <none>                    10.586 171.00                      
## softness:m_user       2   11.543 167.96  0.9569   0.61974    
## softness:temperature  2   16.633 173.05  6.0474   0.04862 *  
## softness:prefer       2   10.798 167.21  0.2120   0.89942    
## m_user:prefer         1   31.049 189.46 20.4633 6.079e-06 ***
## temperature:prefer    1   14.896 173.31  4.3099   0.03789 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

More two-ways.

drop softness:prefer

## Single term deletions
## 
## Model:
## frequency ~ softness + m_user + temperature + prefer + softness:m_user + 
##     softness:temperature + m_user:prefer + temperature:prefer
##                      Df Deviance    AIC     LRT  Pr(>Chi)    
## <none>                    10.798 167.21                      
## softness:m_user       2   11.886 164.30  1.0885   0.58027    
## softness:temperature  2   16.910 169.33  6.1125   0.04706 *  
## m_user:prefer         1   31.393 185.81 20.5949 5.675e-06 ***
## temperature:prefer    1   15.173 169.59  4.3750   0.03647 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

One more?

drop softness:m_user

## Single term deletions
## 
## Model:
## frequency ~ softness + m_user + temperature + prefer + softness:temperature + 
##     m_user:prefer + temperature:prefer
##                      Df Deviance    AIC     LRT  Pr(>Chi)    
## <none>                    11.886 164.30                      
## softness:temperature  2   17.986 166.40  6.0991   0.04738 *  
## m_user:prefer         1   32.468 182.88 20.5815 5.715e-06 ***
## temperature:prefer    1   16.248 166.66  4.3616   0.03676 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

And finally we are done! There are three two-way interactions left, which shouldn’t be too hard to interpret. That’s in the next part.

\(\blacksquare\)

  3. What is associated with the brand a respondent prefers? By obtaining suitable frequency tables, describe the nature of these associations.

Solution

To see what is associated with brand preference, look for significant associations with prefer. There are two of them, one with m_user, and, separately, one with temperature. This means that a respondent’s brand preference depends on whether or not they previously used brand M, and also on what temperature the laundry was washed at.

To investigate these, use xtabs to get a frequency table (easiest), and if necessary use prop.table to get row or column proportions as appropriate. The first thing in the model formula in xtabs is the frequency column; the first thing after the squiggle is rows and the second thing is columns. (If there happened to be a third thing, it would be “layers” in the table.) You might find that the table of frequencies is enough to interpret what is going on, but if not, use prop.table to get a table of proportions. I talk about that in a moment.
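For the m_user-by-prefer table, for example (a sketch, with detergent and tab4 as my names):

tab4 <- xtabs(frequency ~ m_user + prefer, data = detergent)
tab4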

##       prefer
## m_user   m   x
##    no  225 301
##    yes 275 207

prefer is the response (outcome), which I put in the columns, so I look along the rows to see what is going on. Out of the people who were previous users of Brand M (second row), slightly more of them preferred Brand M; out of the people who were not Brand M users (first row), somewhat more of them preferred Brand X.

This was not hard to see, because the opposite frequencies were bigger each time. But you might find it easier to compute percentages and compare those. In my table, the response was rows, so the right percentages to compute are row ones. That is done like this. 1 is rows, 2 is columns:
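Continuing the sketch:

prop.table(tab4, margin = 1)   # margin = 1: each row adds up to 1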

##       prefer
## m_user         m         x
##    no  0.4277567 0.5722433
##    yes 0.5705394 0.4294606

Out of the people who were previous Brand M users, 57% preferred Brand M; out of the people who were not previous Brand M users, only 43% of them preferred Brand M.

Advertisers use terms like “brand familiarity” to capture ideas like this: more people prefer Brand M in the survey if they have used it before. Not altogether surprising.

On to the effects of temperature on preference:

##            prefer
## temperature   m   x
##        high 199 170
##        low  301 338
##            prefer
## temperature         m         x
##        high 0.5392954 0.4607046
##        low  0.4710485 0.5289515

Out of the people who used a high-temperature wash, 54% of them preferred brand M, but out of the people who used a low-temperature wash, only 47% of them preferred brand M.

I’m making it seem like this is a big difference, and of course it’s a very small one, but the size of the survey makes even this tiny difference significant.

Those are really the two effects of interest, since they are the ones associated with brand preference. But there was one more association that was significant: between temperature and softness. The softness in this case was of the water used to do the laundry (and not, for example, the softness of the clothes after they come out of the dryer). That goes like this:

##            softness
## temperature hard medium soft
##        high  139    126  104
##        low   199    218  222
##            softness
## temperature      hard    medium      soft
##        high 0.4112426 0.3662791 0.3190184
##        low  0.5887574 0.6337209 0.6809816

I decided to condition on the softness of the water, since that cannot be controlled by the person doing the laundry (the water just is hard or soft, depending on where you live and where your water comes from).

In each case, a majority of the washes were done at low temperature, but the softer the water, the bigger that majority was. Once again, the effect is not all that big, because the association was only just significant.

\(\blacksquare\)


  1. I don’t know why Wisconsin again, but that’s what it is.

  2. I don’t know why Wisconsin again, but that’s what it is.

  3. The reason I thought of doing this is that these two are all the variables except response.