Chapter 27 Logistic regression with ordinal response
library(MASS)
library(tidyverse)
27.1 Do you like your mobile phone?
A phone company commissioned a survey of their customers’ satisfaction with their mobile devices. The responses to the survey were on a so-called Likert scale of “very unsatisfied”, “unsatisfied”, “satisfied”, “very satisfied”. Also recorded were each customer’s gender and age group (under 18, 18–24, 25–30, 31 or older). (A survey of this kind does not ask its respondents for their exact age, only which age group they fall in.) The data, as frequencies of people falling into each category combination, are in link.
- * Read in the data and take a look at the format. Use a tool that you know about to arrange the frequencies in just one column, with other columns labelling the response categories that the frequencies belong to. Save the new data frame. (Take a look at it if you like.)
- We are going to fit ordered logistic models below. To do that, we need our response variable to be a factor with its levels in the right order. By looking at the data frame you just created, determine what kind of thing your intended response variable currently is.
- If your intended response variable is not a factor, create a factor in your data frame with levels in the right order. Hint: look at the order your levels are in the data.
- * Fit ordered logistic models to predict satisfaction from (i) gender and age group, (ii) gender only, (iii) age group only. (You don’t need to examine the models.) Don’t forget a suitable weights!
- Use drop1 on your model containing both explanatory variables to determine whether you can remove either of them. Use test="Chisq" to obtain P-values.
- Use anova to decide whether we are justified in removing gender from a model containing both gender and age.group. Compare your P-value with the one from drop1.
- Use anova to see whether we are justified in removing age.group from a model containing both gender and age.group. Compare your P-value with the one from drop1 above.
- Which of the models you have fit so far is the most appropriate one? Explain briefly.
- Obtain predicted probabilities of a customer falling in the various satisfaction categories, as it depends on gender and age group. To do that, you need to feed predict three things: the fitted model that contains both age group and gender, the data frame that you read in from the file back in part (here) (which contains all the combinations of age group and gender), and an appropriate type.
- * Describe any patterns you see in the predictions, bearing in mind the significance or not of the explanatory variables.
27.2 Finding non-missing values
* This is to prepare you for something in the next question. It’s meant to be easy.
In R, the code NA stands for “missing value” or “value not known”. In R, NA should not have quotes around it. (It is a special code, not a piece of text.)
- Create a vector v that contains some numbers and some missing values, using c(). Put those values into a one-column data frame.
- Obtain a new column containing is.na(v). When is this true and when is this false?
- The symbol ! means “not” in R (and other programming languages). What does !is.na(v) do? Create a new column containing that.
- Use filter to display just the rows of your data frame that have a non-missing value of v.
27.3 High School and Beyond
A survey called High School and Beyond was given to a large number of American high school seniors (grade 12) by the National Center of Education Statistics. The data set at link is a random sample of 200 of those students.
The variables collected are:
- gender: student’s gender, female or male.
- race: the student’s race (African-American, Asian, Hispanic, White).
- ses: Socio-economic status of student’s family (low, middle, or high).
- schtyp: School type, public or private.
- prog: Student’s program, general, academic, or vocational.
- read: Score on standardized reading test.
- write: Score on standardized writing test.
- math: Score on standardized math test.
- science: Score on standardized science test.
- socst: Score on standardized social studies test.
Our aim is to see how socio-economic status is related to the other variables.
- Read in and display (some of) the data.
- Explain briefly why an ordinal logistic regression is appropriate for our aims.
- Fit an ordinal logistic regression predicting socio-economic status from the scores on the five standardized tests. (You don’t need to display the results.) You will probably go wrong the first time. What kind of thing does your response variable have to be?
- Remove any non-significant explanatory variables one at a time. Use drop1 to decide which one to remove next.
- The quartiles of the science test score are 44 and 58. The quartiles of the socst test score are 46 and 61. Make a data frame that has all combinations of those quartiles. If your best regression had any other explanatory variables in it, also put the means of those variables into this data frame.
- Use the data frame you created in the previous part, together with your best model, to obtain predicted probabilities of being in each ses category. Display these predicted probabilities so that they are easy to read.
- What is the effect of an increased science score on the likelihood of a student being in the different socioeconomic groups, all else equal? Explain briefly. In your explanation, state clearly how you are using your answer to the previous part.
27.4 How do you like your steak?
When you order a steak in a restaurant, the server will ask you how you would like it cooked, or to be precise, how much you would like it cooked: rare (hardly cooked at all), through medium rare, medium, medium well to well (which means “well done”, so that the meat has only a little red to it). Could you guess how a person likes their steak cooked, from some other information about them? The website link commissioned a survey where they asked a number of people how they preferred their steak, along with as many other things as they could think of to ask. (Many of the variables below are related to risk-taking, which was something the people designing the survey thought might have something to do with liking steak rare.) The variables of interest are all factors or true/false:
- respondent_ID: a ten-digit number identifying each person who responded to the survey.
- lottery_a: true if the respondent preferred lottery A with a small chance to win a lot of money, to lottery B, with a larger chance to win less money.
- smoke: true if the respondent is currently a smoker.
- alcohol: true if the respondent at least occasionally drinks alcohol.
- gamble: true if the respondent likes to gamble (eg. betting on horse racing or playing the lottery).
- skydiving: true if the respondent has ever been skydiving.
- speed: true if the respondent likes to drive fast.
- cheated: true if the respondent has ever cheated on a spouse or girlfriend/boyfriend.
- steak: true if the respondent likes to eat steak.
- steak_prep (response): how the respondent likes their steak cooked (factor, as described above, with 5 levels).
- female: true if the respondent is female.
- age: age group, from 18–29 to 60+.
- hhold_income: household income group, from $0–24,999 to $150,000+.
- educ: highest level of education attained, from “less than high school” up to “graduate degree”.
- region: region (of the US) that the respondent lives in (five values).
The data are in link. This is the cleaned-up data from a previous question, with the missing values removed.
- Read in the data and display the first few lines.
- We are going to predict steak_prep from some of the other variables. Why is the model-fitting function polr from package MASS the best choice for these data (alternatives being glm and multinom from package nnet)?
- What are the levels of steak_prep, in the order that R thinks they are in? If they are not in a sensible order, create an ordered factor where the levels are in a sensible order.
- Fit a model predicting preferred steak preparation in an ordinal logistic regression from educ, female and lottery_a. This ought to be easy from your previous work, but you have to be careful about one thing. No need to print out the results.
- Run drop1 on your fitted model, with test="Chisq". Which explanatory variable should be removed first, if any? Bear in mind that the variable with the smallest AIC should come out first, in case your table doesn’t get printed in order.
- Remove the variable that should come out first, using update. (If all the variables should stay, you can skip this part.)
- Using the best model that you have so far, predict the probabilities of preferring each different steak preparation (method of cooking) for each combination of the variables that remain. (Some of the variables are TRUE and FALSE rather than factors. Bear this in mind.) Describe the effects of each variable on the predicted probabilities, if any. Note that there is exactly one person in the study whose educational level is “less than high school”.
- Is it reasonable to remove all the remaining explanatory variables from your best model so far? Fit a model with no explanatory variables, and do a test. (In R, if the right side of the squiggle is a 1, that means “just an intercept”. Or you can remove whatever remains using update.) What do you conclude? Explain briefly.
- In the article for which these data were collected, link, does the author obtain consistent conclusions with yours? Explain briefly. (It’s not a very long article, so it won’t take you long to skim through, and the author’s point is pretty clear.)
27.5 How do you like your steak – the data
This question takes you through the data preparation for one of the other questions. You don’t have to do this question, but you may find it interesting or useful.
When you order a steak in a restaurant, the server will ask you how you would like it cooked, or to be precise, how much you would like it cooked: rare (hardly cooked at all), through medium rare, medium, medium well to well (which means “well done”, so that the meat has only a little red to it). Could you guess how a person likes their steak cooked, from some other information about them? The website link commissioned a survey where they asked a number of people how they preferred their steak, along with as many other things as they could think of to ask. (Many of the variables below are related to risk-taking, which was something the people designing the survey thought might have something to do with liking steak rare.) The variables of interest are all factors or true/false:
- respondent_ID: a ten-digit number identifying each person who responded to the survey.
- lottery_a: true if the respondent preferred lottery A with a small chance to win a lot of money, to lottery B, with a larger chance to win less money.
- smoke: true if the respondent is currently a smoker.
- alcohol: true if the respondent at least occasionally drinks alcohol.
- gamble: true if the respondent likes to gamble (eg. betting on horse racing or playing the lottery).
- skydiving: true if the respondent has ever been skydiving.
- speed: true if the respondent likes to drive fast.
- cheated: true if the respondent has ever cheated on a spouse or girlfriend/boyfriend.
- steak: true if the respondent likes to eat steak.
- steak_prep (response): how the respondent likes their steak cooked (factor, as described above, with 5 levels).
- female: true if the respondent is female.
- age: age group, from 18–29 to 60+.
- hhold_income: household income group, from $0–24,999 to $150,000+.
- educ: highest level of education attained, from “less than high school” up to “graduate degree”.
- region: region (of the US) that the respondent lives in (five values).
The data are in link.
- Read in the data and display the first few lines.
- What do you immediately notice about your data frame? Run summary on the entire data frame. Would you say you have a lot of missing values, or only a few?
- What does the function drop_na do when applied to a data frame with missing values? To find out, pass the data frame into drop_na(), then into summary again. What has happened?
- Write the data into a .csv file, with a name like steak1.csv. Open this file in a spreadsheet and (quickly) verify that you have the right columns and no missing values.
My solutions follow:
27.6 Do you like your mobile phone?
A phone company commissioned a survey of their customers’ satisfaction with their mobile devices. The responses to the survey were on a so-called Likert scale of “very unsatisfied”, “unsatisfied”, “satisfied”, “very satisfied”. Also recorded were each customer’s gender and age group (under 18, 18–24, 25–30, 31 or older). (A survey of this kind does not ask its respondents for their exact age, only which age group they fall in.) The data, as frequencies of people falling into each category combination, are in link.
- * Read in the data and take a look at the format. Use a tool that you know about to arrange the frequencies in just one column, with other columns labelling the response categories that the frequencies belong to. Save the new data frame. (Take a look at it if you like.)
Solution
<- "http://ritsokiguess.site/datafiles/mobile.txt"
my_url <- read_delim(my_url, " ") mobile
## Rows: 8 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: " "
## chr (2): gender, age.group
## dbl (4): very.unsat, unsat, sat, very.sat
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
mobile
## # A tibble: 8 × 6
## gender age.group very.unsat unsat sat very.sat
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 male 0-17 3 9 18 24
## 2 male 18-24 6 13 16 28
## 3 male 25-30 9 13 17 20
## 4 male 31+ 5 7 16 16
## 5 female 0-17 4 8 11 25
## 6 female 18-24 8 14 20 18
## 7 female 25-30 10 15 16 12
## 8 female 31+ 5 14 12 8
With multiple columns that are all frequencies, this is a job for
pivot_longer
:
mobile %>%
  pivot_longer(very.unsat:very.sat,
               names_to = "satisfied",
               values_to = "frequency") -> mobile.long
mobile.long
## # A tibble: 32 × 4
## gender age.group satisfied frequency
## <chr> <chr> <chr> <dbl>
## 1 male 0-17 very.unsat 3
## 2 male 0-17 unsat 9
## 3 male 0-17 sat 18
## 4 male 0-17 very.sat 24
## 5 male 18-24 very.unsat 6
## 6 male 18-24 unsat 13
## 7 male 18-24 sat 16
## 8 male 18-24 very.sat 28
## 9 male 25-30 very.unsat 9
## 10 male 25-30 unsat 13
## # … with 22 more rows
Yep, all good. See how mobile.long
contains what it should?
(For those keeping track, the original data frame had 8 rows and 4
columns to collect up, and the new one has \(8\times 4=32\) rows.)
\(\blacksquare\)
- We are going to fit ordered logistic models below. To do that, we need our response variable to be a factor with its levels in the right order. By looking at the data frame you just created, determine what kind of thing your intended response variable currently is.
Solution
I looked at mobile.long
in the previous part, but if you
didn’t, look at it here:
mobile.long
## # A tibble: 32 × 4
## gender age.group satisfied frequency
## <chr> <chr> <chr> <dbl>
## 1 male 0-17 very.unsat 3
## 2 male 0-17 unsat 9
## 3 male 0-17 sat 18
## 4 male 0-17 very.sat 24
## 5 male 18-24 very.unsat 6
## 6 male 18-24 unsat 13
## 7 male 18-24 sat 16
## 8 male 18-24 very.sat 28
## 9 male 25-30 very.unsat 9
## 10 male 25-30 unsat 13
## # … with 22 more rows
My intended response variable is what I called satisfied
.
This is chr
or “text”, not the factor
that I
want.
\(\blacksquare\)
- If your intended response variable is not a factor, create a factor in your data frame with levels in the right order. Hint: look at the order your levels are in the data.
Solution
My intended response satisfied
is text, not a factor, so
I need to do this part.
The hint is to look at the column satisfied
in
mobile.long
and note that the satisfaction categories
appear in the data in the order that we want. This is good
news, because we can use fct_inorder
like this:
mobile.long %>%
  mutate(satis = fct_inorder(satisfied)) -> mobile.long
If you check, by looking at the data frame, satis
is
a factor
, and you can also do this to verify that its levels
are in the right order:
with(mobile.long, levels(satis))
## [1] "very.unsat" "unsat" "sat" "very.sat"
Success.
Extra: so now you are asking, what if the levels are in the wrong order in the data? Well, below is what you used to have to do, and it will work for this as well. I’ll first find what levels of satisfaction I have. This can be done by counting them, or by finding the distinct ones:
mobile.long %>% count(satisfied)
## # A tibble: 4 × 2
## satisfied n
## <chr> <int>
## 1 sat 8
## 2 unsat 8
## 3 very.sat 8
## 4 very.unsat 8
or
mobile.long %>% distinct(satisfied)
## # A tibble: 4 × 1
## satisfied
## <chr>
## 1 very.unsat
## 2 unsat
## 3 sat
## 4 very.sat
If you count them, they come out in alphabetical order. If you ask for
the distinct ones, they come out in the order they were in
mobile.long
, which is the order the columns of those
names were in mobile
, which is the order we want.
To actually grab those satisfaction levels as a vector (that we will
need in a minute), use pluck
to pull the column out of the
data frame as a vector:
v1 <- mobile.long %>%
  distinct(satisfied) %>%
  pluck("satisfied")
v1
## [1] "very.unsat" "unsat" "sat" "very.sat"
which is in the correct order, or
v2 <- mobile.long %>%
  count(satisfied) %>%
  pluck("satisfied")
v2
## [1] "sat" "unsat" "very.sat" "very.unsat"
which is in alphabetical order. The problem with the second one is
that we know the correct order, but there isn’t a good way to code
that, so we have to rearrange it ourselves. The correct order from
v2
is 4, 2, 1, 3, so:
v3 <- c(v2[4], v2[2], v2[1], v2[3])
v3
## [1] "very.unsat" "unsat" "sat" "very.sat"
v4 <- v2[c(4, 2, 1, 3)]
v4
## [1] "very.unsat" "unsat" "sat" "very.sat"
Either of these will work. The first one is more typing, but is
perhaps more obvious. There is a third way, which is to keep things as
a data frame until the end, and use slice
to pick out the
rows in the right order:
v5 <- mobile.long %>%
  count(satisfied) %>%
  slice(c(4, 2, 1, 3)) %>%
  pluck("satisfied")
v5
## [1] "very.unsat" "unsat" "sat" "very.sat"
If you don’t see how that works, run it yourself, one line at a time.
The other way of doing this is to physically type them into a vector, but this carries the usual warnings of requiring you to be very careful and that it won’t be reproducible (eg. if you do another survey with different response categories).
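For the record, typing them in would look something like this (a minimal sketch; v_typed is a name I made up, and the level names are copied in by hand, which is exactly the error-prone step being warned about):
v_typed <- c("very.unsat", "unsat", "sat", "very.sat")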
So now create the proper response variable thus, using your vector of categories:
mobile.long %>%
  mutate(satis = ordered(satisfied, v1)) -> mobile.long2
mobile.long2
## # A tibble: 32 × 5
## gender age.group satisfied frequency satis
## <chr> <chr> <chr> <dbl> <ord>
## 1 male 0-17 very.unsat 3 very.unsat
## 2 male 0-17 unsat 9 unsat
## 3 male 0-17 sat 18 sat
## 4 male 0-17 very.sat 24 very.sat
## 5 male 18-24 very.unsat 6 very.unsat
## 6 male 18-24 unsat 13 unsat
## 7 male 18-24 sat 16 sat
## 8 male 18-24 very.sat 28 very.sat
## 9 male 25-30 very.unsat 9 very.unsat
## 10 male 25-30 unsat 13 unsat
## # … with 22 more rows
satis has the same values as satisfied, but its label ord means that it is an ordered factor, as we want.
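If you want more reassurance than the column label, asking for the class is a quick check (a small aside; an ordered factor has class “ordered” as well as “factor”):
with(mobile.long2, class(satis))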
\(\blacksquare\)
- * Fit ordered logistic models to predict satisfaction from (i) gender and age group, (ii) gender only, (iii) age group only. (You don’t need to examine the models.) Don’t forget a suitable weights!
Solution
(i):
library(MASS)
mobile.1 <- polr(satis ~ gender + age.group, weights = frequency, data = mobile.long)
For (ii) and (iii), update
is the thing (it works for any
kind of model):
mobile.2 <- update(mobile.1, . ~ . - age.group)
mobile.3 <- update(mobile.1, . ~ . - gender)
We’re not going to look at these, because the output from
summary
is not very illuminating. What we do next is to try
to figure out which (if either) of the explanatory variables
age.group
and gender
we need.
\(\blacksquare\)
- Use drop1 on your model containing both explanatory variables to determine whether you can remove either of them. Use test="Chisq" to obtain P-values.
Solution
drop1
takes a fitted model, and tests each term in
it in turn, and says which (if any) should be removed. Here’s how it goes:
drop1(mobile.1, test = "Chisq")
## Single term deletions
##
## Model:
## satis ~ gender + age.group
## Df AIC LRT Pr(>Chi)
## <none> 1101.8
## gender 1 1104.2 4.4089 0.035751 *
## age.group 3 1109.0 13.1641 0.004295 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The possibilities are to remove gender
, to remove
age.group
or to remove nothing.
The best one is “remove nothing”, because it’s the one on the output with the smallest
AIC. Both P-values are small, so it would be a mistake to remove
either of the explanatory variables.
\(\blacksquare\)
- Use anova to decide whether we are justified in removing gender from a model containing both gender and age.group. Compare your P-value with the one from drop1.
Solution
This is a comparison of the model with both variables
(mobile.1
) and the model with gender
removed
(mobile.3
). Use anova
for this, smaller
(fewer-\(x\)) model first:
anova(mobile.3, mobile.1)
## Likelihood ratio tests of ordinal regression models
##
## Response: satis
## Model Resid. df Resid. Dev Test Df LR stat. Pr(Chi)
## 1 age.group 414 1092.213
## 2 gender + age.group 413 1087.804 1 vs 2 1 4.40892 0.03575146
The P-value is (just) less than 0.05, so the models are significantly
different. That means that the model with both variables in fits
significantly better than the model with only age.group
, and
therefore that taking gender
out is a mistake.
The P-value is identical to the one from drop1
(because they
are both doing the same test).
\(\blacksquare\)
- Use anova to see whether we are justified in removing age.group from a model containing both gender and age.group. Compare your P-value with the one from drop1 above.
Solution
Exactly the same idea as the last part. In my case, I’m comparing
models mobile.2
and mobile.1
:
anova(mobile.2, mobile.1)
## Likelihood ratio tests of ordinal regression models
##
## Response: satis
## Model Resid. df Resid. Dev Test Df LR stat. Pr(Chi)
## 1 gender 416 1100.968
## 2 gender + age.group 413 1087.804 1 vs 2 3 13.16411 0.004294811
This one is definitely significant, so I need to keep
age.group
for sure. Again, the P-value is the same as the one
in drop1
.
\(\blacksquare\)
- Which of the models you have fit so far is the most appropriate one? Explain briefly.
Solution
I can’t drop either of my variables, so I have to keep them both:
mobile.1
, with both age.group
and gender
.
\(\blacksquare\)
- Obtain predicted probabilities of a customer falling in the various satisfaction categories, as it depends on gender and age group. To do that, you need to feed predict three things: the fitted model that contains both age group and gender, the data frame that you read in from the file back in part (here) (which contains all the combinations of age group and gender), and an appropriate type.
Solution
My model containing both \(x\)s was mobile.1
, the data frame
read in from the file was called mobile
, and I need
type="p"
to get probabilities:
probs <- predict(mobile.1, mobile, type = "p")
mobile %>%
  select(gender, age.group) %>%
  cbind(probs)
## gender age.group very.unsat unsat sat very.sat
## 1 male 0-17 0.06070852 0.1420302 0.2752896 0.5219716
## 2 male 18-24 0.09212869 0.1932086 0.3044732 0.4101894
## 3 male 25-30 0.13341018 0.2438110 0.3084506 0.3143283
## 4 male 31+ 0.11667551 0.2252966 0.3097920 0.3482359
## 5 female 0-17 0.08590744 0.1840411 0.3011751 0.4288763
## 6 female 18-24 0.12858414 0.2387296 0.3091489 0.3235374
## 7 female 25-30 0.18290971 0.2853880 0.2920053 0.2396970
## 8 female 31+ 0.16112036 0.2692996 0.3008711 0.2687089
This worked for me, but this might happen to you, with the same commands as above:
## Error in MASS::select(., gender, age.group): unused arguments (gender, age.group)
Oh, this didn’t work. Why not? There don’t seem to be any errors.
This is the kind of thing that can bother you for days. The
resolution (that it took me a long time to discover) is that you might
have the tidyverse
and also MASS
loaded, in
the wrong order, and MASS
also has a select
(that
takes different inputs and does something different). If you look back
at part (here), you might have seen a message there when you
loaded MASS
that select
was “masked”. When you
have two packages that both contain a function with the same name, the
one that you can see (and that will get used) is the one that was
loaded last, which is the MASS
select (not the one we
actually wanted, which is the tidyverse
select). There are a
couple of ways around this. One is to un-load the package we no longer
need (when we no longer need it). The mechanism for this is shown at
the end of part (here). The other is to say explicitly
which package you want your function to come from, so that there is no
doubt. The tidyverse
is actually a collection of
packages. The best way to find out which one our select
comes
from is to go to the Console window in R Studio and ask for the help
for select
. With both tidyverse
and MASS
loaded, the help window offers you a choice of both select
s;
the one we want is “select/rename variables by name”, and the actual
package it comes from is dplyr
.
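In code, that second fix is the same pipeline as above, just with the package spelled out (a sketch, assuming the probs object from the earlier chunk is still around):
mobile %>%
  dplyr::select(gender, age.group) %>%
  cbind(probs)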
There is a third choice, which is the one I prefer now: install and load the package conflicted
. When you run your code and it calls for something like select
that is in two packages that you have loaded, it gives an error, like this:
Error: [conflicted] `select` found in 2 packages.
Either pick the one you want with `::`
* MASS::select
* dplyr::select
Or declare a preference with `conflict_prefer()`
* conflict_prefer("select", "MASS")
* conflict_prefer("select", "dplyr")
Fixing this costs you a bit of time upfront, but once you have fixed it, you know that the right thing is being run. What I do is to copy-paste one of those conflict_prefer
lines, in this case the second one, and put it before the select
that now causes the error. Right after the library(conflicted)
is a good place. When you use conflicted
, you will probably have to run several times to fix up all the conflicts, which will be a bit frustrating, and you will end up with several conflict_prefer
lines, but once you have them there, you won’t have to worry about the right function being called because you have explicitly said which one you want.
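Concretely, the top of such a script might end up looking something like this (a sketch only; which conflict_prefer lines you need depends on which clashes your own code actually runs into):
library(conflicted)
library(MASS)
library(tidyverse)
conflict_prefer("select", "dplyr")
conflict_prefer("filter", "dplyr")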
This is a non-standard use of cbind
because I wanted to grab
only the gender and age group columns from mobile
first, and
then cbind
that to the predicted probabilities. The
missing first input to cbind
is
“whatever came out of the previous step”,
that is, the first two columns of mobile
.
I only included the first two columns of mobile
in the
cbind
, because the rest of the columns of mobile
were frequencies, which I don’t need to see. (Having said that, it
would be interesting to make a plot using the observed
proportions and predicted probabilities, but I didn’t ask you for that.)
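In case you are curious, a sketch of that plot might go something like this (entirely optional; it assumes mobile, mobile.long and probs from the chunks above are still in the workspace, and obs and fits are names I made up):
obs <- mobile.long %>%
  group_by(gender, age.group) %>%
  mutate(proportion = frequency / sum(frequency)) %>%
  ungroup()
fits <- cbind(mobile %>% dplyr::select(gender, age.group), probs) %>%
  pivot_longer(very.unsat:very.sat, names_to = "satisfied", values_to = "fitted")
obs %>%
  left_join(fits, by = c("gender", "age.group", "satisfied")) %>%
  ggplot(aes(x = satis, y = fitted, group = gender, colour = gender)) +
  geom_line() +
  geom_point(aes(y = proportion)) +
  facet_wrap(~ age.group)
The points are the observed proportions and the lines are the fitted probabilities, so any large gaps between them would flag gender and age-group combinations that the model describes poorly.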
This is an unusual predict
, because we didn’t have to make a
data frame (with my usual name new
) containing all the
combinations of things to predict for. We were lucky enough to have
those already in the original data frame mobile
.
The usual way to do this is something like the trick that we did for getting the different satisfaction levels:
genders <- mobile.long %>% distinct(gender) %>% pluck("gender")
age.groups <- mobile.long %>%
  distinct(age.group) %>%
  pluck("age.group")
This is getting perilously close to deserving a function written to do it (strictly speaking, we should, since this is three times we’ve used this idea now).
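A minimal sketch of what such a function might look like (distinct_values is a made-up name; the curly-curly notation passes a column name into the function):
distinct_values <- function(d, column) {
  d %>% distinct({{ column }}) %>% pluck(1)
}
and then, for example, genders <- mobile.long %>% distinct_values(gender) would do the first line above in one step.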
Then crossing
to get the combinations, and then
predict
:
new <- crossing(gender = genders, age.group = age.groups)
new
## # A tibble: 8 × 2
## gender age.group
## <chr> <chr>
## 1 female 0-17
## 2 female 18-24
## 3 female 25-30
## 4 female 31+
## 5 male 0-17
## 6 male 18-24
## 7 male 25-30
## 8 male 31+
pp <- predict(mobile.1, new, type = "p")
cbind(new, pp)
## gender age.group very.unsat unsat sat very.sat
## 1 female 0-17 0.08590744 0.1840411 0.3011751 0.4288763
## 2 female 18-24 0.12858414 0.2387296 0.3091489 0.3235374
## 3 female 25-30 0.18290971 0.2853880 0.2920053 0.2396970
## 4 female 31+ 0.16112036 0.2692996 0.3008711 0.2687089
## 5 male 0-17 0.06070852 0.1420302 0.2752896 0.5219716
## 6 male 18-24 0.09212869 0.1932086 0.3044732 0.4101894
## 7 male 25-30 0.13341018 0.2438110 0.3084506 0.3143283
## 8 male 31+ 0.11667551 0.2252966 0.3097920 0.3482359
This method is fine if you want to do this question this way; the way I suggested first ought to be easier, but the nice thing about this is its mindlessness: you always do it about the same way.
\(\blacksquare\)
- * Describe any patterns you see in the predictions, bearing in mind the significance or not of the explanatory variables.
Solution
I had both explanatory variables being significant, so I would expect to see both an age-group effect and a gender effect. For both males and females, there seems to be a decrease in satisfaction as the customers get older, at least until age 30 or so. I can see this because the predicted prob. of “very satisfied” decreases, and the predicted prob. of “very unsatisfied” increases. The 31+ age group are very similar to the 25–30 group for both males and females. So that’s the age group effect. What about a gender effect? Well, for all the age groups, the males are more likely to be very satisfied than the females of the corresponding age group, and also less likely to be very unsatisfied. So the gender effect is that males are more satisfied than females overall. (Or, the males are less discerning. Take your pick.) When we did the tests above, age group was very definitely significant, and gender less so (P-value around 0.03). This suggests that the effect of age group ought to be large, and the effect of gender not so large. This is about what we observed: the age group effect was pretty clear, and the gender effect was noticeable but small: the females were less satisfied than the males, but there wasn’t all that much difference.
\(\blacksquare\)
27.7 Finding non-missing values
* This is to prepare you for something in the next question. It’s meant to be easy.
In R, the code NA stands for “missing value” or “value not known”. In R, NA should not have quotes around it. (It is a special code, not a piece of text.)
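For example (a quick check, nothing to do with the question itself):
is.na(NA)
is.na("NA")
The first one is TRUE, because NA really is a missing value; the second one is FALSE, because "NA" in quotes is just a two-letter piece of text.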
- Create a vector v that contains some numbers and some missing values, using c(). Put those values into a one-column data frame.
Solution
Like this. The arrangement of numbers and missing values doesn’t matter, as long as you have some of each:
v <- c(1, 2, NA, 4, 5, 6, 9, NA, 11)
mydata <- tibble(v)
mydata
## # A tibble: 9 × 1
## v
## <dbl>
## 1 1
## 2 2
## 3 NA
## 4 4
## 5 5
## 6 6
## 7 9
## 8 NA
## 9 11
This has one column called v
.
\(\blacksquare\)
- Obtain a new column containing is.na(v). When is this true and when is this false?
Solution
mydata <- mydata %>% mutate(isna = is.na(v))
mydata
## # A tibble: 9 × 2
## v isna
## <dbl> <lgl>
## 1 1 FALSE
## 2 2 FALSE
## 3 NA TRUE
## 4 4 FALSE
## 5 5 FALSE
## 6 6 FALSE
## 7 9 FALSE
## 8 NA TRUE
## 9 11 FALSE
This is TRUE
if the corresponding element of v
is
missing (in my case, the third value and the second-last one), and
FALSE
otherwise (when there is an actual value there).
\(\blacksquare\)
- The symbol ! means “not” in R (and other programming languages). What does !is.na(v) do? Create a new column containing that.
Solution
Try it and see. Give it whatever name you like. My name reflects that I know what it’s going to do:
mydata <- mydata %>% mutate(notisna = !is.na(v))
mydata
## # A tibble: 9 × 3
## v isna notisna
## <dbl> <lgl> <lgl>
## 1 1 FALSE TRUE
## 2 2 FALSE TRUE
## 3 NA TRUE FALSE
## 4 4 FALSE TRUE
## 5 5 FALSE TRUE
## 6 6 FALSE TRUE
## 7 9 FALSE TRUE
## 8 NA TRUE FALSE
## 9 11 FALSE TRUE
This is the logical opposite of is.na
: it’s true if there is
a value, and false if it’s missing.
\(\blacksquare\)
- Use filter to display just the rows of your data frame that have a non-missing value of v.
Solution
filter
takes a column to say which rows to pick, in
which case the column should contain something that either is
TRUE
or FALSE
, or something that can be
interpreted that way:
mydata %>% filter(notisna)
## # A tibble: 7 × 3
## v isna notisna
## <dbl> <lgl> <lgl>
## 1 1 FALSE TRUE
## 2 2 FALSE TRUE
## 3 4 FALSE TRUE
## 4 5 FALSE TRUE
## 5 6 FALSE TRUE
## 6 9 FALSE TRUE
## 7 11 FALSE TRUE
or you can provide filter
something that can be calculated
from what’s in the data frame, and also returns something that is
either true or false:
mydata %>% filter(!is.na(v))
## # A tibble: 7 × 3
## v isna notisna
## <dbl> <lgl> <lgl>
## 1 1 FALSE TRUE
## 2 2 FALSE TRUE
## 3 4 FALSE TRUE
## 4 5 FALSE TRUE
## 5 6 FALSE TRUE
## 6 9 FALSE TRUE
## 7 11 FALSE TRUE
In either case, I only have non-missing values of v
.
\(\blacksquare\)
27.8 High School and Beyond
A survey called High School and Beyond was given to a large number of American high school seniors (grade 12) by the National Center of Education Statistics. The data set at link is a random sample of 200 of those students.
The variables collected are:
- gender: student’s gender, female or male.
- race: the student’s race (African-American, Asian, Hispanic, White).
- ses: Socio-economic status of student’s family (low, middle, or high).
- schtyp: School type, public or private.
- prog: Student’s program, general, academic, or vocational.
- read: Score on standardized reading test.
- write: Score on standardized writing test.
- math: Score on standardized math test.
- science: Score on standardized science test.
- socst: Score on standardized social studies test.
Our aim is to see how socio-economic status is related to the other variables.
- Read in and display (some of) the data.
Solution
This is a .csv
file (I tried to make it easy for you):
<- "http://ritsokiguess.site/datafiles/hsb.csv"
my_url <- read_csv(my_url) hsb
## Rows: 200 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): race, ses, schtyp, prog, gender
## dbl (6): id, read, write, math, science, socst
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hsb
## # A tibble: 200 × 11
## id race ses schtyp prog read write math science socst gender
## <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 70 white low public gene… 57 52 41 47 57 male
## 2 121 white middle public voca… 68 59 53 63 61 female
## 3 86 white high public gene… 44 33 54 58 31 male
## 4 141 white high public voca… 63 44 47 53 56 male
## 5 172 white middle public acad… 47 52 57 53 61 male
## 6 113 white middle public acad… 44 52 51 63 61 male
## 7 50 african-amer middle public gene… 50 59 42 53 61 male
## 8 11 hispanic middle public acad… 34 46 45 39 36 male
## 9 84 white middle public gene… 63 57 54 58 51 male
## 10 48 african-amer middle public acad… 57 55 52 50 51 male
## # … with 190 more rows
\(\blacksquare\)
- Explain briefly why an ordinal logistic regression is appropriate for our aims.
Solution
The response variable ses
is categorical, with
categories that come in order (low less than middle less than
high).
\(\blacksquare\)
- Fit an ordinal logistic regression predicting socio-economic status from the scores on the five standardized tests. (You don’t need to display the results.) You will probably go wrong the first time. What kind of thing does your response variable have to be?
Solution
It has to be an ordered
factor, which you can create in
the data frame (or outside, if you prefer):
hsb <- hsb %>% mutate(ses = ordered(ses, c("low", "middle", "high")))
hsb
## # A tibble: 200 × 11
## id race ses schtyp prog read write math science socst gender
## <dbl> <chr> <ord> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 70 white low public gene… 57 52 41 47 57 male
## 2 121 white middle public voca… 68 59 53 63 61 female
## 3 86 white high public gene… 44 33 54 58 31 male
## 4 141 white high public voca… 63 44 47 53 56 male
## 5 172 white middle public acad… 47 52 57 53 61 male
## 6 113 white middle public acad… 44 52 51 63 61 male
## 7 50 african-amer middle public gene… 50 59 42 53 61 male
## 8 11 hispanic middle public acad… 34 46 45 39 36 male
## 9 84 white middle public gene… 63 57 54 58 51 male
## 10 48 african-amer middle public acad… 57 55 52 50 51 male
## # … with 190 more rows
ses
is now ord
. Good. Now fit the model:
ses.1 <- polr(ses ~ read + write + math + science + socst, data = hsb)
No errors is good.
\(\blacksquare\)
- Remove any non-significant explanatory variables one at a time. Use drop1 to decide which one to remove next.
Solution
drop1(ses.1, test = "Chisq")
## Single term deletions
##
## Model:
## ses ~ read + write + math + science + socst
## Df AIC LRT Pr(>Chi)
## <none> 404.63
## read 1 403.09 0.4620 0.496684
## write 1 403.81 1.1859 0.276167
## math 1 403.19 0.5618 0.453517
## science 1 404.89 2.2630 0.132499
## socst 1 410.08 7.4484 0.006349 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I would have expected the AIC column to come out in order, but it
doesn’t. Never mind. Scan for the largest P-value, which belongs to
read
. (This also has the lowest AIC.) So, remove read
:
ses.2 <- update(ses.1, . ~ . - read)
drop1(ses.2, test = "Chisq")
## Single term deletions
##
## Model:
## ses ~ write + math + science + socst
## Df AIC LRT Pr(>Chi)
## <none> 403.09
## write 1 402.10 1.0124 0.314325
## math 1 402.04 0.9541 0.328689
## science 1 404.29 3.1968 0.073782 .
## socst 1 410.58 9.4856 0.002071 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note how the P-value for science
has come down a long way.
A close call, but math
goes next. The update
doesn’t take long to type:
ses.3 <- update(ses.2, . ~ . - math)
drop1(ses.3, test = "Chisq")
## Single term deletions
##
## Model:
## ses ~ write + science + socst
## Df AIC LRT Pr(>Chi)
## <none> 402.04
## write 1 400.60 0.5587 0.4547813
## science 1 405.41 5.3680 0.0205095 *
## socst 1 411.07 11.0235 0.0008997 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
science has become significant now, probably because it was strongly correlated with at least one of the variables we removed (at my guess, math). That is, we didn’t need both science and math, but we do need one of them.
I think we can guess what will happen now: write
comes out,
and the other two variables will stay, so that’ll be where we stop:
ses.4 <- update(ses.3, . ~ . - write)
drop1(ses.4, test = "Chisq")
## Single term deletions
##
## Model:
## ses ~ science + socst
## Df AIC LRT Pr(>Chi)
## <none> 400.60
## science 1 403.45 4.8511 0.0276291 *
## socst 1 409.74 11.1412 0.0008443 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Indeed so. We need just the science and social studies test scores to predict socio-economic status.
Using AIC to decide on which variable to remove next will give the
same answer here, but I would like to see the test=
part in
your drop1
to give P-values (expect to lose something, but
not too much, if that’s not there).
Extras: I talked about correlation among the explanatory variables earlier, which I can explore:
hsb %>% dplyr::select(read:socst) %>% cor()
## read write math science socst
## read 1.0000000 0.5967765 0.6622801 0.6301579 0.6214843
## write 0.5967765 1.0000000 0.6174493 0.5704416 0.6047932
## math 0.6622801 0.6174493 1.0000000 0.6307332 0.5444803
## science 0.6301579 0.5704416 0.6307332 1.0000000 0.4651060
## socst 0.6214843 0.6047932 0.5444803 0.4651060 1.0000000
The first time I did this, I forgot that I had MASS
loaded
(for the polr
), and so, to get the right select
, I
needed to say which one I wanted.
Anyway, the correlations are all moderately high. There’s nothing that stands out as being much higher than the others. The lowest two are between social studies and math, and social studies and science. That would be part of the reason that social studies needs to stay. The highest correlation is between math and reading, which surprises me (they seem to be different skills).
So there was not as much insight there as I expected.
The other thing is that you can use step
for the
variable-elimination task as well:
ses.5 <- step(ses.1, direction = "backward", test = "Chisq")
## Start: AIC=404.63
## ses ~ read + write + math + science + socst
##
## Df AIC LRT Pr(>Chi)
## - read 1 403.09 0.4620 0.496684
## - math 1 403.19 0.5618 0.453517
## - write 1 403.81 1.1859 0.276167
## <none> 404.63
## - science 1 404.89 2.2630 0.132499
## - socst 1 410.08 7.4484 0.006349 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=403.09
## ses ~ write + math + science + socst
##
## Df AIC LRT Pr(>Chi)
## - math 1 402.04 0.9541 0.328689
## - write 1 402.10 1.0124 0.314325
## <none> 403.09
## - science 1 404.29 3.1968 0.073782 .
## - socst 1 410.58 9.4856 0.002071 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=402.04
## ses ~ write + science + socst
##
## Df AIC LRT Pr(>Chi)
## - write 1 400.60 0.5587 0.4547813
## <none> 402.04
## - science 1 405.41 5.3680 0.0205095 *
## - socst 1 411.07 11.0235 0.0008997 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=400.6
## ses ~ science + socst
##
## Df AIC LRT Pr(>Chi)
## <none> 400.60
## - science 1 403.45 4.8511 0.0276291 *
## - socst 1 409.74 11.1412 0.0008443 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I would accept you doing it this way, again as long as you have
the test=
there as well.
\(\blacksquare\)
- The quartiles of the science test score are 44 and 58. The quartiles of the socst test score are 46 and 61. Make a data frame that has all combinations of those quartiles. If your best regression had any other explanatory variables in it, also put the means of those variables into this data frame.
Solution
This is what datagrid
does by default (from package marginaleffects
):
library(marginaleffects)
new <- datagrid(model = ses.5, science = c(44, 58), socst = c(46, 61))
new
## science socst
## 1: 44 46
## 2: 44 61
## 3: 58 46
## 4: 58 61
You can feed datagrid
either a model
(probably clearer for you) or a dataset (with newdata
):
datagrid(newdata = hsb, science = c(44, 58), socst = c(46, 61))
## id race ses schtyp prog read write math gender science socst
## 1: 100.5 white middle public academic 52.23 52.775 52.645 female 44 46
## 2: 100.5 white middle public academic 52.23 52.775 52.645 female 44 61
## 3: 100.5 white middle public academic 52.23 52.775 52.645 female 58 46
## 4: 100.5 white middle public academic 52.23 52.775 52.645 female 58 61
This explicitly fills in mean values or most frequent categories for all the other variables in the dataset, even though those other variables are not in the model. The two variables you actually care about are over on the right. Either version of datagrid
will work, but the one based on the model is easier for you to check than the one based on the dataset.
Since there are only two variables left, this new
data frame has only
\(2^2=4\) rows.
There is a veiled hint here that these are the two variables that
should have remained in your regression. If that was not what you got,
the means of the other variables in the model will go automatically into your new
:
datagrid(model = ses.1, science = c(44, 58), socst = c(46, 61))
## read write math science socst
## 1: 52.23 52.775 52.645 44 46
## 2: 52.23 52.775 52.645 44 61
## 3: 52.23 52.775 52.645 58 46
## 4: 52.23 52.775 52.645 58 61
so you don’t have to do anything extra.
\(\blacksquare\)
- Use the data frame you created in the previous part, together with your best model, to obtain predicted probabilities of being in each ses category. Display these predicted probabilities so that they are easy to read.
Solution
This is predict
, and we’ve done the setup. My best model
was called ses.4
. Remember that these come out “long”
predictions(ses.4, newdata = new, type = "probs") %>%
select(group, predicted, science.x, socst.x) %>%
pivot_wider(names_from = group, values_from = predicted)
##
## Re-fitting to get Hessian
## # A tibble: 4 × 5
## science.x socst.x low middle high
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 44 46 0.326 0.506 0.168
## 2 44 61 0.189 0.515 0.296
## 3 58 46 0.228 0.523 0.249
## 4 58 61 0.124 0.467 0.409
The easiest strategy seems to be to run predictions
first, see that it comes out long, and then wonder how to fix it. (If you forget the type
, the error message will tell you which type
to use.) Then pick the columns you care about: the predicted group
, the predictions, and the columns for science and social studies (which have gained an x
on the end for some reason), and then pivot wider.
\(\blacksquare\)
- What is the effect of an increased science score on the likelihood of a student being in the different socioeconomic groups, all else equal? Explain briefly. In your explanation, state clearly how you are using your answer to the previous part.
Solution
Use your predictions; hold the socst
score constant (that’s
the all else equal part). So compare the first and third rows (or,
if you like, the second and fourth rows) of your predictions and see
what happens as the science score goes from 44 to 58.
What I see is that the probability of being low
goes
noticeably down as the science score increases, the
probability of middle
stays about the same, and the
probability of high
goes up
(by about the same
amount as the probability of low
went down).
In other words, an increased science score goes with an increased
chance of high
(and a decreased chance of low
).
If your best model doesn’t have science
in it, then you
need to say something like “science
has no effect on socio-economic status”,
consistent with what you concluded before: if
you took it out, it’s because you thought it had no effect.
Extra: the effect of an increased social studies score is almost
exactly the same as an increased science score (so I didn’t ask you
about that). From a social-science point of view, this makes
perfect sense: the higher the social-economic stratum a student
comes from, the better they are likely to do in school.
I’ve been phrasing this as “association”, because really the cause
and effect is the other way around: a student’s family socioeconomic
status is explanatory, and school performance is response. But this
was the nicest example I could find of an ordinal response data set.
\(\blacksquare\)
27.9 How do you like your steak?
When you order a steak in a restaurant, the server will ask you how you would like it cooked, or to be precise, how much you would like it cooked: rare (hardly cooked at all), through medium rare, medium, medium well to well (which means “well done”, so that the meat has only a little red to it). Could you guess how a person likes their steak cooked, from some other information about them? The website link commissioned a survey where they asked a number of people how they preferred their steak, along with as many other things as they could think of to ask. (Many of the variables below are related to risk-taking, which was something the people designing the survey thought might have something to do with liking steak rare.) The variables of interest are all factors or true/false:
- respondent_ID: a ten-digit number identifying each person who responded to the survey.
- lottery_a: true if the respondent preferred lottery A with a small chance to win a lot of money, to lottery B, with a larger chance to win less money.
- smoke: true if the respondent is currently a smoker.
- alcohol: true if the respondent at least occasionally drinks alcohol.
- gamble: true if the respondent likes to gamble (eg. betting on horse racing or playing the lottery).
- skydiving: true if the respondent has ever been skydiving.
- speed: true if the respondent likes to drive fast.
- cheated: true if the respondent has ever cheated on a spouse or girlfriend/boyfriend.
- steak: true if the respondent likes to eat steak.
- steak_prep (response): how the respondent likes their steak cooked (factor, as described above, with 5 levels).
- female: true if the respondent is female.
- age: age group, from 18–29 to 60+.
- hhold_income: household income group, from $0–24,999 to $150,000+.
- educ: highest level of education attained, from “less than high school” up to “graduate degree”.
- region: region (of the US) that the respondent lives in (five values).
The data are in link. This is the cleaned-up data from a previous question, with the missing values removed.
- Read in the data and display the first few lines.
Solution
The usual:
steak <- read_csv("steak1.csv")
## Rows: 331 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): steak_prep, age, hhold_income, educ, region
## dbl (1): respondent_id
## lgl (9): lottery_a, smoke, alcohol, gamble, skydiving, speed, cheated, steak...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
steak
## # A tibble: 331 × 15
## respondent_id lottery_a smoke alcohol gamble skydiving speed cheated steak
## <dbl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 3234982343 TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
## 2 3234973379 TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE
## 3 3234972383 FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
## 4 3234958833 FALSE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
## 5 3234955240 TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
## 6 3234955010 TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE
## 7 3234953052 TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE
## 8 3234951249 FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE
## 9 3234948883 FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE
## 10 3234948197 TRUE FALSE FALSE TRUE FALSE TRUE FALSE TRUE
## # … with 321 more rows, and 6 more variables: steak_prep <chr>, female <lgl>,
## # age <chr>, hhold_income <chr>, educ <chr>, region <chr>
\(\blacksquare\)
- We are going to predict steak_prep from some of the other variables. Why is the model-fitting function polr from package MASS the best choice for these data (alternatives being glm and multinom from package nnet)?
Solution
It all depends on the kind of response variable. We have a
response variable with five ordered levels from Rare to
Well. There are more than two levels (it is more than a
“success” and “failure”), which rules out glm
, and
the levels are ordered, which rules out multinom
. As we
know, polr
handles an ordered response, so it is the
right choice.
\(\blacksquare\)
- What are the levels of steak_prep, in the order that R thinks they are in? If they are not in a sensible order, create an ordered factor where the levels are in a sensible order.
Solution
This is the most direct way to find out:
preps <- steak %>% distinct(steak_prep) %>% pull(steak_prep)
preps
## [1] "Medium rare" "Rare" "Medium" "Medium Well" "Well"
This is almost the right order (distinct
uses the order in
the data frame). We just need to switch the first two around, and then
we’ll be done:
preps1 <- preps[c(2, 1, 3, 4, 5)]
preps1
## [1] "Rare" "Medium rare" "Medium" "Medium Well" "Well"
If you used count
, there’s a bit more work to do:
preps2 <- steak %>% count(steak_prep) %>% pull(steak_prep)
preps2
## [1] "Medium" "Medium rare" "Medium Well" "Rare" "Well"
because count
puts them in alphabetical order, so:
preps3 <- preps2[c(4, 2, 1, 3, 5)]
preps3
## [1] "Rare" "Medium rare" "Medium" "Medium Well" "Well"
These use the idea in the
attitudes-to-abortion question: create a vector of the levels in the
right order, then create an ordered factor with
ordered()
. If you like, you can type the levels in the right
order (I won’t penalize you for that here), but it’s really better to
get the levels without typing or copying-pasting, so that you don’t
make any silly errors copying them (which will mess everything up
later).
So now I create my ordered response:
steak <- steak %>% mutate(steak_prep_ord = ordered(steak_prep, preps1))
or using one of the other preps
vectors containing the levels
in the correct order.
As far as polr
is concerned,
it doesn’t matter whether I start at Rare
and go “up”, or
start at Well
and go “down”. So if you do it the other way
around, that’s fine. As long as you get the levels in a sensible
order, you’re good.
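For the record, the other way around would be something like this (a sketch; steak_prep_rev is a name I invented, and rev simply reverses the vector of levels):
steak %>% mutate(steak_prep_rev = ordered(steak_prep, rev(preps1)))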
\(\blacksquare\)
- Fit a model predicting preferred steak preparation in an ordinal logistic regression from educ, female and lottery_a. This ought to be easy from your previous work, but you have to be careful about one thing. No need to print out the results.
Solution
The thing you have to be careful about is that you use the ordered factor that you just created as the response:
steak.1 <- polr(steak_prep_ord ~ educ + female + lottery_a, data = steak)
\(\blacksquare\)
- Run drop1 on your fitted model, with test="Chisq". Which explanatory variable should be removed first, if any? Bear in mind that the variable with the smallest AIC should come out first, in case your table doesn’t get printed in order.
Solution
This:
drop1(steak.1, test = "Chisq")
## Single term deletions
##
## Model:
## steak_prep_ord ~ educ + female + lottery_a
## Df AIC LRT Pr(>Chi)
## <none> 910.69
## educ 4 912.10 9.4107 0.05162 .
## female 1 908.70 0.0108 0.91715
## lottery_a 1 909.93 1.2425 0.26498
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
My table is indeed out of order (which is why I warned you about it,
in case that happens to you as well). The smallest AIC goes with
female
, which also has a very non-significant P-value, so
this one should come out first.
\(\blacksquare\)
- Remove the variable that should come out first, using update. (If all the variables should stay, you can skip this part.)
Solution
You could type or copy-paste the whole model again, but
update
is quicker:
steak.2 <- update(steak.1, . ~ . - female)
That’s all.
I wanted to get some support for my drop1
above (since I was
a bit worried about those out-of-order rows). Now that we have fitted
a model with female
and one without, we can compare them
using anova
:
anova(steak.2, steak.1, test = "Chisq")
## Likelihood ratio tests of ordinal regression models
##
## Response: steak_prep_ord
## Model Resid. df Resid. Dev Test Df LR stat.
## 1 educ + lottery_a 322 890.7028
## 2 educ + female + lottery_a 321 890.6920 1 vs 2 1 0.0108221
## Pr(Chi)
## 1
## 2 0.9171461
Don’t get taken in by that “LR stat” on the end of the first row of
the output table; the P-value wrapped onto the second line, and is in
fact exactly the same as in the drop1
output (it is doing
exactly the same test). As non-significant as you could wish for.
Extra: I was curious about whether either of the other \(x\)’s could come out now:
drop1(steak.2, test = "Chisq")
## Single term deletions
##
## Model:
## steak_prep_ord ~ educ + lottery_a
## Df AIC LRT Pr(>Chi)
## <none> 908.70
## educ 4 910.13 9.4299 0.05121 .
## lottery_a 1 907.96 1.2599 0.26167
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lottery_a
should come out, but educ
is edging
towards significance. We are about to do predictions; in those, the above suggests that there may be some visible effect of education, but there may not be much effect of lottery_a
.
All right, so what happens when we remove lottery_a? That we find out later.
\(\blacksquare\)
- Using the best model that you have so far, predict the probabilities of preferring each different steak preparation (method of cooking) for each combination of the variables that remain. (Some of the variables are TRUE and FALSE rather than factors. Bear this in mind.) Describe the effects of each variable on the predicted probabilities, if any. Note that there is exactly one person in the study whose educational level is “less than high school”.
Solution
Again, I’m leaving it to you to follow all the steps. My variables
remaining are educ
and lottery_a
.
predictions(steak.2, variables = c("educ", "lottery_a"),
type = "probs") %>%
select(rowid, group, predicted, educ.x, lottery_a.x) %>%
pivot_wider(names_from = group, values_from = predicted)
##
## Re-fitting to get Hessian
## # A tibble: 10 × 8
## rowid educ.x lottery_a.x Rare `Medium rare` Medium `Medium Well` Well
## <int> <chr> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Some co… TRUE 4.43e-2 0.351 3.44e-1 0.193 0.0679
## 2 2 Some co… FALSE 5.50e-2 0.395 3.30e-1 0.165 0.0549
## 3 3 Graduat… TRUE 6.44e-2 0.428 3.16e-1 0.145 0.0468
## 4 4 Graduat… FALSE 7.95e-2 0.469 2.92e-1 0.122 0.0376
## 5 5 Bachelo… TRUE 4.18e-2 0.339 3.47e-1 0.201 0.0719
## 6 6 Bachelo… FALSE 5.19e-2 0.383 3.35e-1 0.172 0.0581
## 7 7 High sc… TRUE 4.25e-2 0.342 3.46e-1 0.198 0.0707
## 8 8 High sc… FALSE 5.28e-2 0.387 3.33e-1 0.170 0.0571
## 9 9 Less th… TRUE 6.78e-8 0.000000886 3.19e-6 0.0000159 1.00
## 10 10 Less th… FALSE 8.51e-8 0.00000111 4.00e-6 0.0000200 1.00
There are 5 levels of education, 2 levels of lottery_a
, and 5 ways in which you might ask for your steak to be cooked, so the original output from predictions
has \(5 \times 2 \times 5 = 50\) rows, and the output you see above has \(5 \times 2 = 10\) rows.
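If you don't happen to have the package that provides predictions, a minimal sketch with base R's predict gets you the same ten rows of probabilities (assuming the steak dataframe and the fitted model steak.2 from above; crossing is from the tidyverse):
new <- crossing(educ = unique(steak$educ), lottery_a = c(TRUE, FALSE))
probs <- predict(steak.2, newdata = new, type = "probs")
cbind(new, round(probs, 3))
The rows may come out in a different order, but the probabilities should match the table above.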
Aside: sometimes, the output from predictions
glues an x
or a y
onto the end of the names of the things in variables
, as here. I haven’t yet discovered why this is, but I suspect it has something to do with looking things up in a dataframe. This happens when the two dataframes you are left-joining have a column by the same name that is not part of what you are looking up. Example:
xx <- tribble(~u, ~v,
  "a", 10,
  "a", 11,
  "b", 12)
yy <- tribble(~u, ~v,
  "a", 100,
  "b", 103)
Think of xx
as containing sales of products a
and b
, and yy
as containing some other information about those products, like their price.
:
xx %>% left_join(yy, by = "u")
## # A tibble: 3 × 3
## u v.x v.y
## <chr> <dbl> <dbl>
## 1 a 10 100
## 2 a 11 100
## 3 b 12 103
v.x
is the column v
that came from the first dataframe (the one called xx
that had the sales in it), and v.y
is the one that came from the second dataframe (called yy
, with prices in it).
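If those .x and .y endings bother you in your own joins, left_join has a suffix argument that lets you pick something more descriptive (a small sketch; it only affects joins you do yourself, not whatever predictions does internally):
xx %>% left_join(yy, by = "u", suffix = c("_sales", "_price"))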
So that’s where the names educ.x
and lottery_a.x
came from (I think), though I still don’t know exactly what predictions is joining internally to produce them.
I find this hard to read, so I’m going to round off those
predictions. Three or four decimals seems to be sensible. The time to do this is while they are all in one column, that is, before the pivot_wider
. On my screen, the education levels also came out rather long, so I’m going to shorten them as well:
predictions(steak.2, variables = c("educ", "lottery_a"),
type = "probs") %>%
select(rowid, group, predicted, educ.x, lottery_a.x) %>%
mutate(predicted = round(predicted, 3),
educ.x = abbreviate(educ.x, 15)) %>%
pivot_wider(names_from = group, values_from = predicted)
##
## Re-fitting to get Hessian
## # A tibble: 10 × 8
## rowid educ.x lottery_a.x Rare `Medium rare` Medium `Medium Well` Well
## <int> <chr> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 SmcllgorAss… TRUE 0.044 0.351 0.344 0.193 0.068
## 2 2 SmcllgorAss… FALSE 0.055 0.395 0.33 0.165 0.055
## 3 3 Graduate de… TRUE 0.064 0.428 0.316 0.145 0.047
## 4 4 Graduate de… FALSE 0.08 0.469 0.292 0.122 0.038
## 5 5 Bachelor de… TRUE 0.042 0.339 0.347 0.201 0.072
## 6 6 Bachelor de… FALSE 0.052 0.383 0.335 0.172 0.058
## 7 7 Highschoold… TRUE 0.043 0.342 0.346 0.198 0.071
## 8 8 Highschoold… FALSE 0.053 0.387 0.333 0.17 0.057
## 9 9 Lssthnhghsc… TRUE 0 0 0 0 1
## 10 10 Lssthnhghsc… FALSE 0 0 0 0 1
That’s about as much as I can shorten the education levels while still having them readable.
Then, say something about the effect of changing educational level on the
predictions, and say something about the effect of favouring Lottery A
vs. not. I don’t much mind what: you can say that there is not much
effect (of either variable), or you can say something like “people with a graduate degree are slightly more likely to like their steak rare and less likely to like it well done” (for education level) and
“people who preferred Lottery A are slightly less likely to like their steak rare and slightly more likely to like it well done” (for
effect of Lottery A). You can see these by comparing the odd-numbered
rows with each other to assess the effect of education while holding attitudes towards lottery_a
constant (or the even-numbered rows, if you
prefer), and you can compare eg. rows 1 and 2 to assess the effect of
Lottery A (compare two lines with the same educational level
but different preferences re Lottery A).
I would keep away from saying anything about education level “less than high school”, since this entire level is represented by exactly one person.
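Extra: a picture sometimes makes these small effects easier to see. A sketch, assuming the column names shown in the predictions output above, and leaving out the one-person “less than high school” level for the reason just given:
predictions(steak.2, variables = c("educ", "lottery_a"), type = "probs") %>%
  filter(educ.x != "Less than high school degree") %>%
  ggplot(aes(x = educ.x, y = predicted, colour = group, group = group)) +
  geom_point() + geom_line() +
  facet_wrap(~lottery_a.x) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
One line per steak doneness, one facet per Lottery A preference; the lines move around only a little with education and shift only slightly between facets, consistent with the modest effects described above.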
\(\blacksquare\)
- Is it reasonable to remove all the remaining
explanatory variables from your best model so far? Fit a model with no explanatory variables,
and do a test. (In R, if the right side of the squiggle is a
1
, that means “just an intercept”. Or you can remove whatever remains usingupdate
.) What do you conclude? Explain briefly.
Solution
The fitting part is the challenge, since the testing part is
anova
again. The direct fit is this:
steak.3 <- polr(steak_prep_ord ~ 1, data = steak)
and the update
version is this, about equally long, starting
from steak.2
since that is the best model so far:
steak.3a <- update(steak.2, . ~ . - educ - lottery_a)
You can use whichever you like. Either way, the second part is
anova
, and the two possible answers should be the same:
anova(steak.3, steak.2, test = "Chisq")
## Likelihood ratio tests of ordinal regression models
##
## Response: steak_prep_ord
## Model Resid. df Resid. Dev Test Df LR stat. Pr(Chi)
## 1 1 327 901.4467
## 2 educ + lottery_a 322 890.7028 1 vs 2 5 10.74387 0.05670146
or
anova(steak.3a, steak.2, test = "Chisq")
## Likelihood ratio tests of ordinal regression models
##
## Response: steak_prep_ord
## Model Resid. df Resid. Dev Test Df LR stat. Pr(Chi)
## 1 1 327 901.4467
## 2 educ + lottery_a 322 890.7028 1 vs 2 5 10.74387 0.05670146
At the 0.05 level, removing both of the remaining variables is fine: that is, nothing (out of these variables) has any impact on the probability that a diner will prefer their steak cooked a particular way. However, it is a very close call; the P-value is only just bigger than 0.05.
With data like this and a rather exploratory analysis, though, I might think about using a larger \(\alpha\) like 0.10, and at that level, taking out both of these variables is a bad idea. You could say that one or both of them is “potentially useful” or “provocative” or something like that.
If you think that removing these two variables is questionable, you
might like to go back to that drop1
output I had above:
drop1(steak.2, test = "Chisq")
## Single term deletions
##
## Model:
## steak_prep_ord ~ educ + lottery_a
## Df AIC LRT Pr(>Chi)
## <none> 908.70
## educ 4 910.13 9.4299 0.05121 .
## lottery_a 1 907.96 1.2599 0.26167
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The smallest AIC goes with lottery_a
, so that comes out (it
is nowhere near significant):
steak.4 <- update(steak.2, . ~ . - lottery_a)
drop1(steak.4, test = "Chisq")
## Single term deletions
##
## Model:
## steak_prep_ord ~ educ
## Df AIC LRT Pr(>Chi)
## <none> 907.96
## educ 4 909.45 9.484 0.05008 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
and what you see is that educational level is right on the edge of significance, so that may or may not have any impact. Make a call. But if anything, it’s educational level that makes a difference.
\(\blacksquare\)
- In the article for which these data were collected, link, does the author come to conclusions consistent with yours? Explain briefly. (It’s not a very long article, so it won’t take you long to skim through, and the author’s point is pretty clear.)
Solution
The article says that nothing has anything to do with steak preference. Whether you agree or not depends on what you thought above about dropping those last two variables. So say something consistent with what you said in the previous part. Two points for saying that the author said “nothing has any effect”, and one point for how your findings square with that.
Extra: now that you have worked through this great long question, this is where I tell you that I simplified things a fair bit for you! There were lots of other variables that might have had an impact on how people like their steaks, and we didn’t even consider those. Why did I choose what I did here? Well, I wanted to fit a regression predicting steak preference from everything else and do a big backward elimination, but:
steak.5 <- polr(steak_prep_ord ~ ., data = steak)
## Warning: glm.fit: algorithm did not converge
## Error in polr(steak_prep_ord ~ ., data = steak): attempt to find suitable starting values failed
The .
in place of explanatory
variables means “all the other variables”, including the nonsensical
personal ID. That saved me having to type them all out.
Unfortunately, however, it didn’t work. The problem is a numerical
one. Regular regression has a well-defined procedure, where the computer follows
through the steps and gets to the answer, every time. Once you go
beyond regression, however, the answer is obtained by a step-by-step
method: the computer makes an initial guess, tries to improve it, then
tries to improve it again, until it can’t improve things any more, at
which point it calls it good. The problem here is that polr
cannot even get the initial guess! (It apparently is known to suck at
this, in problems as big and complicated as this one.)
I don’t normally recommend forward selection, but I wonder whether it works here:
steak.5 <- polr(steak_prep_ord ~ 1, data = steak)
steak.6 <- step(steak.5,
  scope = . ~ lottery_a + smoke + alcohol + gamble + skydiving +
    speed + cheated + female + age + hhold_income + educ + region,
  direction = "forward", test = "Chisq", trace = 0
)
drop1(steak.6, test = "Chisq")
## Single term deletions
##
## Model:
## steak_prep_ord ~ educ
## Df AIC LRT Pr(>Chi)
## <none> 907.96
## educ 4 909.45 9.484 0.05008 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It does, and it says the only thing to add out of all the variables is education level. So, for you, I picked this along with a couple of other plausible-sounding variables and had you start from there.
Forward selection starts from a model containing nothing and asks
“what can we add?”. This is a bit more complicated than backward
elimination, because now you have to say what the candidate things to
add are. That’s the purpose of that scope
piece, and
there I had no alternative but to type the names of all the
variables. Backward elimination is easier, because the candidate
variables to remove are the ones in the model, and you don’t need a
scope
. The trace=0
says “don’t give me any output”
(you can change it to a different value if you want to see what that
does), and last, the drop1
looks at what is actually
in the final model (with a view to asking what can be removed,
but we don’t care about that here).
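For comparison, a sketch of the backward version, starting from the earlier model steak.1 with all three explanatory variables; no scope is needed because the candidates for removal are just the variables already in the model. Judging by the drop1 tables above, this should also end up keeping only educ:
steak.back <- step(steak.1, direction = "backward", test = "Chisq", trace = 0)
drop1(steak.back, test = "Chisq")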
\(\blacksquare\)
27.10 How do you like your steak – the data
This question takes you through the data preparation for one of the other questions. You don’t have to do this question, but you may find it interesting or useful.
When you order a steak in a restaurant, the server will ask you how you would like it cooked, or to be precise, how much you would like it cooked: rare (hardly cooked at all), through medium rare, medium, medium well to well (which means “well done”, so that the meat has only a little red to it). Could you guess how a person likes their steak cooked, from some other information about them? The website link commissioned a survey where they asked a number of people how they preferred their steak, along with as many other things as they could think of to ask. (Many of the variables below are related to risk-taking, which was something the people designing the survey thought might have something to do with liking steak rare.) The variables of interest are all factors or true/false:
respondent_ID
: a ten-digit number identifying each person who responded to the survey.lottery_a
: true if the respondent preferred lottery A with a small chance to win a lot of money, to lottery B, with a larger chance to win less money.smoke
: true if the respondent is currently a smokeralcohol
: true if the respondent at least occasionally drinks alcohol.gamble
: true if the respondent likes to gamble (eg. betting on horse racing or playing the lottery)skydiving
: true if the respondent has ever been skydiving.speed
: true if the respondent likes to drive fastcheated
: true if the respondent has ever cheated on a spouse or girlfriend/boyfriendsteak
: true if the respondent likes to eat steaksteak_prep
(response): how the respondent likes their steak cooked (factor, as described above, with 5 levels).female
: true if the respondent is femaleage
: age group, from 18–29 to 60+.hhold_income
: household income group, from $0–24,999 to $150,000+.educ
: highest level of education attained, from “less than high school” up to “graduate degree”region
: region (of the US) that the respondent lives in (five values).
The data are in link.
- Read in the data and display the first few lines.
Solution
The usual:
<- "https://raw.githubusercontent.com/nxskok/datafiles/master/steak.csv"
my_url <- read_csv(my_url) steak0
## Rows: 550 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): steak_prep, age, hhold_income, educ, region
## dbl (1): respondent_id
## lgl (9): lottery_a, smoke, alcohol, gamble, skydiving, speed, cheated, steak...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
steak0
## # A tibble: 550 × 15
## respondent_id lottery_a smoke alcohol gamble skydiving speed cheated steak
## <dbl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 3237565956 FALSE NA NA NA NA NA NA NA
## 2 3234982343 TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
## 3 3234973379 TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE
## 4 3234972383 FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
## 5 3234958833 FALSE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
## 6 3234955240 TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
## 7 3234955097 TRUE FALSE TRUE FALSE FALSE TRUE TRUE FALSE
## 8 3234955010 TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE
## 9 3234953052 TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE
## 10 3234951249 FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE
## # … with 540 more rows, and 6 more variables: steak_prep <chr>, female <lgl>,
## # age <chr>, hhold_income <chr>, educ <chr>, region <chr>
I’m using a temporary name for reasons that will become clear shortly.
\(\blacksquare\)
- What do you immediately notice about your data frame? Run
summary
on the entire data frame. Would you say you have a lot of missing values, or only a few?
Solution
I see missing values, starting in the very first row.
Running the data frame through summary
gives this, either as summary(steak0)
or this way:
steak0 %>% summary()
## respondent_id lottery_a smoke alcohol
## Min. :3.235e+09 Mode :logical Mode :logical Mode :logical
## 1st Qu.:3.235e+09 FALSE:279 FALSE:453 FALSE:125
## Median :3.235e+09 TRUE :267 TRUE :84 TRUE :416
## Mean :3.235e+09 NA's :4 NA's :13 NA's :9
## 3rd Qu.:3.235e+09
## Max. :3.238e+09
## gamble skydiving speed cheated
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:280 FALSE:502 FALSE:59 FALSE:447
## TRUE :257 TRUE :36 TRUE :480 TRUE :92
## NA's :13 NA's :12 NA's :11 NA's :11
##
##
## steak steak_prep female age
## Mode :logical Length:550 Mode :logical Length:550
## FALSE:109 Class :character FALSE:246 Class :character
## TRUE :430 Mode :character TRUE :268 Mode :character
## NA's :11 NA's :36
##
##
## hhold_income educ region
## Length:550 Length:550 Length:550
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
Make a call about whether you think that’s a lot of missing values or only a few. This might not be all of them, because missing text doesn’t show here (we see later how to make it show up).
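A quick sketch of one way to count the missing values in every column, including the text columns that summary doesn't report:
steak0 %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "column", values_to = "n_missing")
This counts the NAs column by column and then turns the result sideways so it is easier to read.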
\(\blacksquare\)
- What does the function
drop_na
do when applied to a data frame with missing values? To find out, pass the data frame intodrop_na()
, then intosummary
again. What has happened?
Solution
Let’s try it and see.
steak0 %>% drop_na() %>% summary()
## respondent_id lottery_a smoke alcohol
## Min. :3.235e+09 Mode :logical Mode :logical Mode :logical
## 1st Qu.:3.235e+09 FALSE:171 FALSE:274 FALSE:65
## Median :3.235e+09 TRUE :160 TRUE :57 TRUE :266
## Mean :3.235e+09
## 3rd Qu.:3.235e+09
## Max. :3.235e+09
## gamble skydiving speed cheated steak
## Mode :logical Mode :logical Mode :logical Mode :logical Mode:logical
## FALSE:158 FALSE:308 FALSE:28 FALSE:274 TRUE:331
## TRUE :173 TRUE :23 TRUE :303 TRUE :57
##
##
##
## steak_prep female age hhold_income
## Length:331 Mode :logical Length:331 Length:331
## Class :character FALSE:174 Class :character Class :character
## Mode :character TRUE :157 Mode :character Mode :character
##
##
##
## educ region
## Length:331 Length:331
## Class :character Class :character
## Mode :character Mode :character
##
##
##
The missing values, the ones we can see anyway, have all gone. To be precise, drop_na
, as its
name suggests, drops all the rows that have missing values in them
anywhere. This is potentially wasteful, since a row might be missing
only one value, and we drop the entire rest of the row, throwing away
the good data as well. If you check, we started with 550 rows, and we
now have only 331 left. Ouch.
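A quick sketch to check those row counts:
nrow(steak0)                       # 550 rows to start with
steak0 %>% drop_na() %>% nrow()    # 331 rows with nothing missing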
So now we’ll save this into our “good” data frame, which means doing it again (now that we know it works):
steak0 %>% drop_na() -> steak
Extra: another way to handle missing data is called “imputation”: what you do is to estimate a value for any missing data, and then use that later on as if it were the truth. One way of estimating missing values is to do a regression (of appropriate kind: regular or logistic) to predict a column with missing values from all the other columns.
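For instance, a minimal sketch of that idea for just one column (smoke), using a logistic regression on a couple of the other columns; this is only an illustration of the mechanism, not something this question asks you to do:
# fit smoke from two other columns (rows with anything missing are dropped in the fit),
# then fill in a predicted TRUE/FALSE wherever smoke itself is missing
smoke.fit <- glm(smoke ~ alcohol + gamble, data = steak0, family = binomial)
p.smoke <- predict(smoke.fit, newdata = steak0, type = "response")
steak0 %>% mutate(smoke = ifelse(is.na(smoke), p.smoke > 0.5, smoke))
Rows where the predictors are also missing still come out missing, which is one reason real imputation methods are more elaborate than this.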
Extra extra: below we see how we used to have to do this, for your information.
First, we run complete.cases
on the data frame:
complete.cases(steak0)
## [1] FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
## [13] TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE FALSE TRUE
## [25] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE FALSE
## [37] TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE TRUE TRUE
## [49] TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE
## [61] FALSE TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
## [73] FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
## [85] FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE
## [97] TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE TRUE FALSE TRUE
## [109] FALSE TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE
## [121] TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
## [133] FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE
## [145] TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE
## [157] TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
## [169] TRUE FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE
## [181] FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE FALSE
## [193] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE TRUE
## [205] TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
## [217] FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE
## [229] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
## [241] TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE TRUE
## [253] FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
## [265] FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE
## [277] TRUE FALSE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [289] TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE
## [301] FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE
## [313] TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
## [325] FALSE FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE
## [337] TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
## [349] FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
## [361] FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
## [373] FALSE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE TRUE
## [385] TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
## [397] TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE
## [409] TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE
## [421] TRUE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE
## [433] TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE
## [445] TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE
## [457] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE
## [469] TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
## [481] FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE FALSE
## [493] FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE
## [505] TRUE FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE
## [517] TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE
## [529] FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE
## [541] FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE
You might be able to guess what this does, in the light of what we
just did, but if not, you can investigate. Let’s pick three rows where
complete.cases
is
TRUE
and three where it’s
FALSE
, and see what happens.
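If you don't feel like scanning all that TRUE and FALSE output by eye, which() will find suitable row numbers for you (a small sketch):
head(which(complete.cases(steak0)), 3)     # some rows with no missing values
head(which(!complete.cases(steak0)), 3)    # some rows with at least one missing value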
I’ll pick rows 496, 497, and 498 for the TRUE rows, and 540, 541 and
542 for the FALSE ones. Let’s assemble these rows into a vector and
use slice
to display the rows with these numbers:
rows <- c(496, 497, 498, 540, 541, 542)
rows
## [1] 496 497 498 540 541 542
Like this:
steak0 %>% slice(rows)
## # A tibble: 6 × 15
## respondent_id lottery_a smoke alcohol gamble skydiving speed cheated steak
## <dbl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 3234776895 FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## 2 3234776815 TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE
## 3 3234776702 FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## 4 3234763650 TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## 5 3234763171 TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE
## 6 3234762715 FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## # … with 6 more variables: steak_prep <chr>, female <lgl>, age <chr>,
## # hhold_income <chr>, educ <chr>, region <chr>
What’s the difference?
The rows where complete.cases
is FALSE have one (or more)
missing values in them; where complete.cases
is TRUE the
rows have no missing values. (Depending on the rows you choose,
you may not see the missing value(s), as I didn’t.)
Extra (within “extra extra”: I hope you are keeping track): this
is a bit tricky to investigate more thoroughly, because the text
variables might have missing values in them, and they won’t show
up unless we turn them into a factor first:
steak0 %>%
  mutate(across(where(is.character), ~factor(.))) %>%
  summary()
## respondent_id lottery_a smoke alcohol
## Min. :3.235e+09 Mode :logical Mode :logical Mode :logical
## 1st Qu.:3.235e+09 FALSE:279 FALSE:453 FALSE:125
## Median :3.235e+09 TRUE :267 TRUE :84 TRUE :416
## Mean :3.235e+09 NA's :4 NA's :13 NA's :9
## 3rd Qu.:3.235e+09
## Max. :3.238e+09
##
## gamble skydiving speed cheated
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:280 FALSE:502 FALSE:59 FALSE:447
## TRUE :257 TRUE :36 TRUE :480 TRUE :92
## NA's :13 NA's :12 NA's :11 NA's :11
##
##
##
## steak steak_prep female age
## Mode :logical Medium :132 Mode :logical >60 :131
## FALSE:109 Medium rare:166 FALSE:246 18-29:110
## TRUE :430 Medium Well: 75 TRUE :268 30-44:133
## NA's :11 Rare : 23 NA's :36 45-60:140
## Well : 36 NA's : 36
## NA's :118
##
## hhold_income educ
## $0 - $24,999 : 51 Bachelor degree :174
## $100,000 - $149,999: 76 Graduate degree :133
## $150,000+ : 54 High school degree : 39
## $25,000 - $49,999 : 77 Less than high school degree : 2
## $50,000 - $99,999 :172 Some college or Associate degree:164
## NA's :120 NA's : 38
##
## region
## Pacific : 91
## South Atlantic : 88
## East North Central: 86
## Middle Atlantic : 72
## West North Central: 42
## (Other) :133
## NA's : 38
There are missing values everywhere. What the where
does is to do something for each column where the first thing is true:
here, if the column is text, then replace it by the factor version of
itself. This makes for a better summary, one that shows how many
observations are in each category, and, more important for us, how
many are missing (a lot).
All right, so there are 15 columns, so let’s investigate missingness
in our rows by looking at the columns 1 through 8 and then 9 through
15, so they all fit on the screen. Recall that you can select
columns by number:
steak0 %>% select(1:8) %>% slice(rows)
## # A tibble: 6 × 8
## respondent_id lottery_a smoke alcohol gamble skydiving speed cheated
## <dbl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 3234776895 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 3234776815 TRUE TRUE TRUE FALSE FALSE TRUE FALSE
## 3 3234776702 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 4 3234763650 TRUE FALSE FALSE FALSE FALSE TRUE FALSE
## 5 3234763171 TRUE FALSE TRUE TRUE FALSE TRUE FALSE
## 6 3234762715 FALSE FALSE FALSE FALSE FALSE TRUE FALSE
and
steak0 %>% select(9:15) %>% slice(rows)
## # A tibble: 6 × 7
## steak steak_prep female age hhold_income educ region
## <lgl> <chr> <lgl> <chr> <chr> <chr> <chr>
## 1 TRUE Medium rare TRUE 45-60 $0 - $24,999 Some college or Ass… West …
## 2 TRUE Medium TRUE 45-60 $150,000+ Bachelor degree New E…
## 3 TRUE Medium rare TRUE >60 $50,000 - $99,999 Graduate degree Mount…
## 4 FALSE <NA> FALSE 45-60 $100,000 - $149,999 Some college or Ass… South…
## 5 TRUE Medium FALSE >60 <NA> Graduate degree Pacif…
## 6 FALSE <NA> FALSE 18-29 $50,000 - $99,999 Some college or Ass… West …
In this case, the first three rows have no missing values anywhere,
and the last three rows have exactly one missing value. This
corresponds to what we would expect, with complete.cases
identifying rows that have any missing values.
What we now need to do is to obtain a data frame that contains only
the rows with non-missing values. This can be done by saving the
result of complete.cases
in a variable first; filter
can take anything that produces a true or a false for each row, and
will return the rows for which the thing it was fed was true.
cc <- complete.cases(steak0)
steak0 %>% filter(cc) -> steak.complete
A quick check that we got rid of the missing values:
steak.complete
## # A tibble: 331 × 15
## respondent_id lottery_a smoke alcohol gamble skydiving speed cheated steak
## <dbl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 3234982343 TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
## 2 3234973379 TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE
## 3 3234972383 FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
## 4 3234958833 FALSE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
## 5 3234955240 TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
## 6 3234955010 TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE
## 7 3234953052 TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE
## 8 3234951249 FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE
## 9 3234948883 FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE
## 10 3234948197 TRUE FALSE FALSE TRUE FALSE TRUE FALSE TRUE
## # … with 321 more rows, and 6 more variables: steak_prep <chr>, female <lgl>,
## # age <chr>, hhold_income <chr>, educ <chr>, region <chr>
There are no missing values there. Of course, this is not a proof, and there might be some missing values further down, but at least it suggests that we might be good.
For proof, this is the easiest way I know:
steak.complete %>%
  mutate(across(where(is.character), ~factor(.))) %>%
  summary()
## respondent_id lottery_a smoke alcohol
## Min. :3.235e+09 Mode :logical Mode :logical Mode :logical
## 1st Qu.:3.235e+09 FALSE:171 FALSE:274 FALSE:65
## Median :3.235e+09 TRUE :160 TRUE :57 TRUE :266
## Mean :3.235e+09
## 3rd Qu.:3.235e+09
## Max. :3.235e+09
##
## gamble skydiving speed cheated steak
## Mode :logical Mode :logical Mode :logical Mode :logical Mode:logical
## FALSE:158 FALSE:308 FALSE:28 FALSE:274 TRUE:331
## TRUE :173 TRUE :23 TRUE :303 TRUE :57
##
##
##
##
## steak_prep female age hhold_income
## Medium :109 Mode :logical >60 :82 $0 - $24,999 : 37
## Medium rare:128 FALSE:174 18-29:70 $100,000 - $149,999: 66
## Medium Well: 56 TRUE :157 30-44:93 $150,000+ : 39
## Rare : 18 45-60:86 $25,000 - $49,999 : 55
## Well : 20 $50,000 - $99,999 :134
##
##
## educ region
## Bachelor degree :120 South Atlantic :68
## Graduate degree : 86 Pacific :57
## High school degree : 20 East North Central:48
## Less than high school degree : 1 Middle Atlantic :46
## Some college or Associate degree:104 West North Central:29
## Mountain :24
## (Other) :59
If there were any missing values, they would be listed on the end of the counts of observations for each level, or on the bottom of the five-number summaries. But there aren’t. So here’s your proof.
\(\blacksquare\)
- Write the data into a
.csv
file, with a name likesteak1.csv
. Open this file in a spreadsheet and (quickly) verify that you have the right columns and no missing values.
Solution
This is write_csv
, using my output from
drop_na
:
write_csv(steak, "steak1.csv")
Open up Excel, or whatever you have, and take a look. You should have all the right columns, and, scrolling down, no visible missing values.
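Or, if you would rather check without leaving R, a quick sketch: read the file back in and confirm its size and that nothing is missing.
check <- read_csv("steak1.csv")
dim(check)          # should be 331 rows and 15 columns
sum(is.na(check))   # should be 0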
\(\blacksquare\)