A phone company commissioned a survey of their customers’ satisfaction with their mobile devices. The responses to the survey were on a so-called Likert scale of “very unsatisfied”, “unsatisfied”, “satisfied”, “very satisfied”. Also recorded were each customer’s gender and age group (under 18, 18–24, 25–30, 31 or older). (A survey of this kind does not ask its respondents for their exact age, only which age group they fall in.) The data, as frequencies of people falling into each category combination, are in link.
Solution
##
## ── Column specification ───────────────────────────────────
## cols(
## gender = col_character(),
## age.group = col_character(),
## very.unsat = col_double(),
## unsat = col_double(),
## sat = col_double(),
## very.sat = col_double()
## )
With multiple columns that are all frequencies, this is a job for pivot_longer:
mobile %>%
pivot_longer(very.unsat:very.sat,
names_to="satisfied",
values_to="frequency") -> mobile.long
mobile.long
Yep, all good. See how mobile.long contains what it should?
(For those keeping track, the original data frame had 8 rows and 4
columns to collect up, and the new one has \(8\times 4=32\) rows.)
Solution
I looked at mobile.long in the previous part, but if you didn’t, look at it here:
My intended response variable is what I called satisfied. This is chr or “text”, not the factor that I want.
Solution
My intended response satisfied is text, not a factor, so I need to do this part.
The hint is to look at the column satisfied in mobile.long and note that the satisfaction categories appear in the data in the order that we want. This is good news, because we can use fct_inorder like this:
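A sketch of the fct_inorder step, shown on a small stand-in for mobile.long so that it runs on its own (the real data frame also has gender, age.group and frequency columns; satis is the name checked below):

```r
library(tidyverse)

# stand-in for mobile.long: only the satisfied column matters for this step
mobile.long <- tibble(
  satisfied = rep(c("very.unsat", "unsat", "sat", "very.sat"), times = 8)
)

# fct_inorder: levels in order of first appearance in the data
mobile.long <- mobile.long %>%
  mutate(satis = fct_inorder(satisfied))

levels(mobile.long$satis)
# [1] "very.unsat" "unsat" "sat" "very.sat"
```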
If you check, by looking at the data frame, satis is a factor, and you can also do this to verify that its levels are in the right order:
## [1] "very.unsat" "unsat" "sat" "very.sat"
Success.
Extra: so now you are asking, what if the levels are in the wrong order in the data? Well, below is what you used to have to do, and it will work for this as well. I’ll first find what levels of satisfaction I have. This can be done by counting them, or by finding the distinct ones:
or
If you count them, they come out in alphabetical order. If you ask for the distinct ones, they come out in the order they were in mobile.long, which is the order the columns of those names were in mobile, which is the order we want.
To actually grab those satisfaction levels as a vector (that we will need in a minute), use pluck to pull the column out of the data frame as a vector:
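A sketch of the distinct-then-pluck step, again on a stand-in for mobile.long; v1 is my name for the resulting vector:

```r
library(tidyverse)

# stand-in for mobile.long, with the categories appearing in the right order
mobile.long <- tibble(
  satisfied = rep(c("very.unsat", "unsat", "sat", "very.sat"), times = 8)
)

v1 <- mobile.long %>%
  distinct(satisfied) %>%
  pluck("satisfied")
v1
# [1] "very.unsat" "unsat" "sat" "very.sat"
```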
## [1] "very.unsat" "unsat" "sat" "very.sat"
which is in the correct order, or
## [1] "sat" "unsat" "very.sat" "very.unsat"
which is in alphabetical order. The problem with the second one is that we know the correct order, but there isn’t a good way to code that, so we have to rearrange it ourselves. The correct order from v2 is 4, 2, 1, 3, so:
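In code, with v2 as it came out above (alphabetical order):

```r
# v2 as it came out of count: alphabetical order
v2 <- c("sat", "unsat", "very.sat", "very.unsat")

# one element at a time
v3 <- c(v2[4], v2[2], v2[1], v2[3])
# or all at once, with a vector of indices
v4 <- v2[c(4, 2, 1, 3)]
v3
# [1] "very.unsat" "unsat" "sat" "very.sat"
```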
## [1] "very.unsat" "unsat" "sat" "very.sat"
## [1] "very.unsat" "unsat" "sat" "very.sat"
Either of these will work. The first one is more typing, but is perhaps more obvious. There is a third way, which is to keep things as a data frame until the end, and use slice to pick out the rows in the right order:
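A sketch of the slice version, on a stand-in for mobile.long:

```r
library(tidyverse)

# stand-in for mobile.long again
mobile.long <- tibble(
  satisfied = rep(c("very.unsat", "unsat", "sat", "very.sat"), times = 8)
)

v5 <- mobile.long %>%
  count(satisfied) %>%        # rows now in alphabetical order
  slice(c(4, 2, 1, 3)) %>%    # pick out the rows in the order we want
  pluck("satisfied")
v5
# [1] "very.unsat" "unsat" "sat" "very.sat"
```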
## [1] "very.unsat" "unsat" "sat" "very.sat"
If you don’t see how that works, run it yourself, one line at a time.
The other way of doing this is to physically type them into a vector, but this carries the usual warnings of requiring you to be very careful and that it won’t be reproducible (eg. if you do another survey with different response categories).
So now create the proper response variable thus, using your vector of categories:
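A sketch, using the vector of categories in the correct order (which I called v1) and a stand-in for mobile.long:

```r
library(tidyverse)

# the categories, in the correct order (as obtained above)
v1 <- c("very.unsat", "unsat", "sat", "very.sat")

# stand-in for mobile.long
mobile.long <- tibble(satisfied = rep(v1, times = 8))

# ordered() makes an ordered factor with the levels in the order given
mobile.long <- mobile.long %>%
  mutate(satis = ordered(satisfied, levels = v1))

class(mobile.long$satis)
# [1] "ordered" "factor"
```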
satis has the same values as satisfied, but its label ord means that it is an ordered factor, as we want.
weights!
Solution
(i):
For (ii) and (iii), update is the thing (it works for any kind of model):
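Since the model-fitting code is not reproduced here, here is the pattern sketched on the housing data that comes with MASS (an ordered-response frequency data set of the same shape); with the real data, the call for (i) would be polr(satis ~ age.group + gender, weights = frequency, data = mobile.long), with the corresponding updates after that:

```r
library(MASS)

# (i): ordered logistic regression, with the cell frequencies as weights
house.1 <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)

# (ii) and (iii): update drops one explanatory variable at a time
house.2 <- update(house.1, . ~ . - Infl)
house.3 <- update(house.1, . ~ . - Type)
```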
We’re not going to look at these, because the output from summary is not very illuminating. What we do next is to try to figure out which (if either) of the explanatory variables age.group and gender we need.
Use drop1 on your model containing both explanatory variables to determine whether you can remove either of them. Use test="Chisq" to obtain P-values.
Solution
drop1 takes a fitted model, and tests each term in it in turn, and says which (if any) should be removed. Here’s how it goes:
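The call is drop1 with the fitted model and test="Chisq". Sketched on MASS’s built-in housing data (since the mobile model object is not reproduced here; with the real data it would be drop1(mobile.1, test = "Chisq")):

```r
library(MASS)

house.1 <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)
drop1(house.1, test = "Chisq")
```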
The possibilities are to remove gender, to remove age.group, or to remove nothing. The best one is “remove nothing”, because it’s the one on the output with the smallest AIC. Both P-values are small, so it would be a mistake to remove either of the explanatory variables.
Use anova to decide whether we are justified in removing gender from a model containing both gender and age.group. Compare your P-value with the one from drop1.
Solution
This is a comparison of the model with both variables (mobile.1) and the model with gender removed (mobile.3). Use anova for this, smaller (fewer-\(x\)) model first:
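Sketched again on MASS’s built-in housing data; with the real models the call would be anova(mobile.3, mobile.1):

```r
library(MASS)

house.1 <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)
house.3 <- update(house.1, . ~ . - Type)
anova(house.3, house.1)  # smaller model first
```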
The P-value is (just) less than 0.05, so the models are significantly different. That means that the model with both variables in fits significantly better than the model with only age.group, and therefore that taking gender out is a mistake.
The P-value is identical to the one from drop1 (because they are both doing the same test).
Use anova to see whether we are justified in removing age.group from a model containing both gender and age.group. Compare your P-value with the one from drop1 above.
Solution
Exactly the same idea as the last part. In my case, I’m comparing models mobile.2 and mobile.1:
This one is definitely significant, so I need to keep age.group for sure. Again, the P-value is the same as the one in drop1.
Solution
I can’t drop either of my variables, so I have to keep them both: mobile.1, with both age.group and gender.
Feed predict three things: the fitted model that contains both age group and gender, the data frame that you read in from the file back in part (here) (which contains all the combinations of age group and gender), and an appropriate type.
Solution
My model containing both \(x\)s was mobile.1, the data frame read in from the file was called mobile, and I need type="p" to get probabilities:
probs <- predict(mobile.1, mobile, type = "p")
mobile %>%
select(gender, age.group) %>%
cbind(probs)
This worked for me, but this might happen to you, with the same commands as above:
## Error in MASS::select(., gender, age.group): unused arguments (gender, age.group)
Oh, this didn’t work. Why not? There doesn’t seem to be anything wrong with the code.
This is the kind of thing that can bother you for days. The resolution (that it took me a long time to discover) is that you might have the tidyverse and also MASS loaded, in the wrong order, and MASS also has a select (that takes different inputs and does something different). If you look back at part (here), you might have seen a message there when you loaded MASS that select was “masked”. When you have two packages that both contain a function with the same name, the one that you can see (and that will get used) is the one that was loaded last, which is the MASS select (not the one we actually wanted, which is the tidyverse select). There are a couple of ways around this. One is to un-load the package we no longer need (when we no longer need it). The mechanism for this is shown at the end of part (here). The other is to say explicitly which package you want your function to come from, so that there is no doubt. The tidyverse is actually a collection of packages. The best way to find out which one our select comes from is to go to the Console window in R Studio and ask for the help for select. With both tidyverse and MASS loaded, the help window offers you a choice of both selects; the one we want is “select/rename variables by name”, and the actual package it comes from is dplyr.
There is a third choice, which is the one I prefer now: install and load the package conflicted. When you run your code and it calls for something like select that is in two packages that you have loaded, it gives an error, like this:
Error: [conflicted] `select` found in 2 packages.
Either pick the one you want with `::`
* MASS::select
* dplyr::select
Or declare a preference with `conflict_prefer()`
* conflict_prefer("select", "MASS")
* conflict_prefer("select", "dplyr")
Fixing this costs you a bit of time upfront, but once you have fixed it, you know that the right thing is being run. What I do is to copy-paste one of those conflict_prefer lines, in this case the second one, and put it before the select that now causes the error. Right after the library(conflicted) is a good place. When you use conflicted, you will probably have to run several times to fix up all the conflicts, which will be a bit frustrating, and you will end up with several conflict_prefer lines, but once you have them there, you won’t have to worry about the right function being called, because you have explicitly said which one you want.
This is a non-standard use of cbind because I wanted to grab only the gender and age group columns from mobile first, and then cbind that to the predicted probabilities. The missing first input to cbind is “whatever came out of the previous step”, that is, the first two columns of mobile.
I only included the first two columns of mobile in the cbind, because the rest of the columns of mobile were frequencies, which I don’t need to see. (Having said that, it would be interesting to make a plot using the observed proportions and predicted probabilities, but I didn’t ask you for that.)
This is an unusual predict, because we didn’t have to make a data frame (with my usual name new) containing all the combinations of things to predict for. We were lucky enough to have those already in the original data frame mobile.
The usual way to do this is something like the trick that we did for getting the different satisfaction levels:
genders <- mobile.long %>% distinct(gender) %>% pluck("gender")
age.groups <- mobile.long %>%
distinct(age.group) %>%
pluck("age.group")
This is getting perilously close to deserving a function written to do it (strictly speaking, we should, since this is three times we’ve used this idea now).
Then crossing to get the combinations, and then predict:
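A sketch of the whole crossing-then-predict pattern, on MASS’s built-in housing data (the idea with mobile is identical):

```r
library(MASS)
library(tidyverse)

house.1 <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)

# one vector of distinct values per explanatory variable
infls <- housing %>% distinct(Infl) %>% pluck("Infl")
types <- housing %>% distinct(Type) %>% pluck("Type")
conts <- housing %>% distinct(Cont) %>% pluck("Cont")

# all combinations, then predicted probabilities for each one
new <- crossing(Infl = infls, Type = types, Cont = conts)
probs <- predict(house.1, new, type = "p")
new %>% cbind(probs)
```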
This method is fine if you want to do this question this way; the way I suggested first ought to be easier, but the nice thing about this is its mindlessness: you always do it about the same way.
Solution
I had both explanatory variables being significant, so I would expect to see both an age-group effect and a gender effect.
For both males and females, there seems to be a decrease in satisfaction as the customers get older, at least until age 30 or so. I can see this because the predicted probability of “very satisfied” decreases, and the predicted probability of “very unsatisfied” increases. The 31+ age group are very similar to the 25–30 group for both males and females. So that’s the age group effect.
What about a gender effect? Well, for all the age groups, the males are more likely to be very satisfied than the females of the corresponding age group, and also less likely to be very unsatisfied. So the gender effect is that males are more satisfied than females overall. (Or, the males are less discerning. Take your pick.)
When we did the tests above, age group was very definitely significant, and gender less so (P-value around 0.03). This suggests that the effect of age group ought to be large, and the effect of gender not so large. This is about what we observed: the age group effect was pretty clear, and the gender effect was noticeable but small: the females were less satisfied than the males, but there wasn’t all that much difference.
This is to prepare you for something in the next question. It’s meant to be easy.
In R, the code NA stands for “missing value” or “value not known”. NA should not have quotes around it. (It is a special code, not a piece of text.)
Create a vector v that contains some numbers and some missing values, using c(). Put those values into a one-column data frame.
Solution
Like this. The arrangement of numbers and missing values doesn’t matter, as long as you have some of each:
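A sketch consistent with the description below (my missing values are the third and the second-last); mydata is my name for the data frame:

```r
library(tidyverse)

# NA with no quotes; third and second-last values missing
v <- c(1, 2, NA, 4, 5, 6, 7, NA, 9)
mydata <- tibble(v)
mydata
```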
This has one column called v.
is.na(v): when is this true and when is this false?
Solution
This is TRUE if the corresponding element of v is missing (in my case, the third value and the second-last one), and FALSE otherwise (when there is an actual value there).
! means “not” in R (and other programming languages). What does !is.na(v) do? Create a new column containing that.
Solution
Try it and see. Give it whatever name you like. My name reflects that I know what it’s going to do:
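A sketch, continuing with my vector v from before; notmissing is my choice of name:

```r
library(tidyverse)

v <- c(1, 2, NA, 4, 5, 6, 7, NA, 9)
mydata <- tibble(v)

# TRUE exactly where there is an actual value
mydata <- mydata %>%
  mutate(notmissing = !is.na(v))
mydata
```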
This is the logical opposite of is.na: it’s true if there is a value, and false if it’s missing.
Use filter to display just the rows of your data frame that have a non-missing value of v.
Solution
filter takes a column to say which rows to pick, in which case the column should contain something that either is TRUE or FALSE, or something that can be interpreted that way:
or you can provide filter something that can be calculated from what’s in the data frame, and also returns something that is either true or false:
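Both versions, sketched with my v and notmissing from before:

```r
library(tidyverse)

v <- c(1, 2, NA, 4, 5, 6, 7, NA, 9)
mydata <- tibble(v) %>% mutate(notmissing = !is.na(v))

# using the column that is already TRUE or FALSE
mydata %>% filter(notmissing)

# or computing the condition on the fly
mydata %>% filter(!is.na(v))
```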
In either case, I only have non-missing values of v.
A survey called High School and Beyond was given to a large number of American high school seniors (grade 12) by the National Center of Education Statistics. The data set at link is a random sample of 200 of those students.
The variables collected are:
gender: student’s gender, female or male.
race: the student’s race (African-American, Asian, Hispanic, White). (I’m always amused at how Americans put all Asians into one group.)
ses: Socio-economic status of student’s family (low, middle, or high).
schtyp: School type, public or private.
prog: Student’s program, general, academic, or vocational.
read: Score on standardized reading test.
write: Score on standardized writing test.
math: Score on standardized math test.
science: Score on standardized science test.
socst: Score on standardized social studies test.
Our aim is to see how socio-economic status is related to the other variables.
Solution
This is a .csv file (I tried to make it easy for you):
##
## ── Column specification ───────────────────────────────────
## cols(
## id = col_double(),
## race = col_character(),
## ses = col_character(),
## schtyp = col_character(),
## prog = col_character(),
## read = col_double(),
## write = col_double(),
## math = col_double(),
## science = col_double(),
## socst = col_double(),
## gender = col_character()
## )
Solution
The response variable ses is categorical, with categories that come in order (low less than middle less than high).
Solution
It has to be an ordered factor, which you can create in the data frame (or outside, if you prefer):
ses is now ord. Good. Now fit the model:
No errors is good.
Use drop1 to decide which one to remove next.
Solution
I would have expected the AIC column to come out in order, but it doesn’t. Never mind. Scan for the largest P-value, which belongs to read. (This also has the lowest AIC.) So, remove read:
Note how the P-value for science has come down a long way.
A close call, but math goes next. The update doesn’t take long to type:
science has become significant now, probably because it was strongly correlated with at least one of the variables we removed (at my guess, math). That is, we didn’t need both science and math, but we do need one of them.
I think we can guess what will happen now: write comes out, and the other two variables will stay, so that’ll be where we stop:
Indeed so. We need just the science and social studies test scores to predict socio-economic status.
Using AIC to decide on which variable to remove next will give the same answer here, but I would like to see the test= part in your drop1 to give P-values (expect to lose something, but not too much, if that’s not there).
Extras: I talked about correlation among the explanatory variables earlier, which I can explore:
## read write math science socst
## read 1.0000000 0.5967765 0.6622801 0.6301579 0.6214843
## write 0.5967765 1.0000000 0.6174493 0.5704416 0.6047932
## math 0.6622801 0.6174493 1.0000000 0.6307332 0.5444803
## science 0.6301579 0.5704416 0.6307332 1.0000000 0.4651060
## socst 0.6214843 0.6047932 0.5444803 0.4651060 1.0000000
The first time I did this, I forgot that I had MASS loaded (for the polr), and so, to get the right select, I needed to say which one I wanted.
Anyway, the correlations are all moderately high. There’s nothing that stands out as being much higher than the others. The lowest two are between social studies and math, and social studies and science. That would be part of the reason that social studies needs to stay. The highest correlation is between math and reading, which surprises me (they seem to be different skills).
So there was not as much insight there as I expected.
The other thing is that you can use step for the variable-elimination task as well:
## Start: AIC=404.63
## ses ~ read + write + math + science + socst
##
## Df AIC LRT Pr(>Chi)
## - read 1 403.09 0.4620 0.496684
## - math 1 403.19 0.5618 0.453517
## - write 1 403.81 1.1859 0.276167
## <none> 404.63
## - science 1 404.89 2.2630 0.132499
## - socst 1 410.08 7.4484 0.006349 **
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=403.09
## ses ~ write + math + science + socst
##
## Df AIC LRT Pr(>Chi)
## - math 1 402.04 0.9541 0.328689
## - write 1 402.10 1.0124 0.314325
## <none> 403.09
## - science 1 404.29 3.1968 0.073782 .
## - socst 1 410.58 9.4856 0.002071 **
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=402.04
## ses ~ write + science + socst
##
## Df AIC LRT Pr(>Chi)
## - write 1 400.60 0.5587 0.4547813
## <none> 402.04
## - science 1 405.41 5.3680 0.0205095 *
## - socst 1 411.07 11.0235 0.0008997 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=400.6
## ses ~ science + socst
##
## Df AIC LRT Pr(>Chi)
## <none> 400.60
## - science 1 403.45 4.8511 0.0276291 *
## - socst 1 409.74 11.1412 0.0008443 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I would accept you doing it this way, again as long as you have the test= there as well.
The quartiles of the science test score are 44 and 58. The quartiles of the socst test score are 46 and 61. Make a data frame that has all combinations of those quartiles. If your best regression had any other explanatory variables in it, also put the medians of those variables into this data frame.
Solution
Thus, most obviously:
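A sketch, using crossing with the two pairs of quartiles given in the question; new is my usual name:

```r
library(tidyverse)

new <- crossing(science = c(44, 58), socst = c(46, 61))
new
```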
Since there are only two variables left, this new data frame has only \(2^2=4\) rows.
There is a veiled hint here that these are the two variables that should have remained in your regression. If that was not what you got, find the median of any other variables you had, and put that into your new. For example, if you still had math, you’d do this:
That is maths as the apparently-plural of math, not as the British name for mathematics.
ses category.
Solution
This is predict, and we’ve done the setup. My best model was called ses.4:
If you forgot which type to use, guess one that’s obviously wrong, and it will tell you what your options are. That ought to be enough to trigger your memory of the right thing.
If you get the type right, but the predict still doesn’t work, check the construction of your new data frame very carefully. That’s likely where your problem is.
Solution
Use your predictions; hold the socst score constant (that’s the all else equal part). So compare the first and third rows (or, if you like, the second and fourth rows) of your predictions and see what happens as the science score goes from 44 to 58.
What I see is that the probability of being low goes noticeably down as the science score increases, the probability of middle stays about the same, and the probability of high goes up (by about the same amount as the probability of low went down).
In other words, an increased science score goes with an increased chance of high (and a decreased chance of low).
If your best model doesn’t have science in it, then you need to say something like “science has no effect on socio-economic status”, consistent with what you concluded before: if you took it out, it’s because you thought it had no effect.
Extra: the effect of an increased social studies score is almost exactly the same as an increased science score (so I didn’t ask you about that). From a social-science point of view, this makes perfect sense: the higher the socio-economic stratum a student comes from, the better they are likely to do in school.
I’ve been phrasing this as “association”, because really the cause and effect is the other way around: a student’s family socioeconomic status is explanatory, and school performance is response. But this was the nicest example I could find of an ordinal response data set.
When you order a steak in a restaurant, the server will ask you how you would like it cooked, or to be precise, how much you would like it cooked: rare (hardly cooked at all), through medium rare, medium, medium well to well (which means “well done”, so that the meat has only a little red to it). Could you guess how a person likes their steak cooked, from some other information about them? The website link commissioned a survey where they asked a number of people how they preferred their steak, along with as many other things as they could think of to ask. (Many of the variables below are related to risk-taking, which was something the people designing the survey thought might have something to do with liking steak rare.) The variables of interest are all factors or true/false:
respondent_ID: a ten-digit number identifying each person who responded to the survey.
lottery_a: true if the respondent preferred lottery A, with a small chance to win a lot of money, to lottery B, with a larger chance to win less money.
smoke: true if the respondent is currently a smoker.
alcohol: true if the respondent at least occasionally drinks alcohol.
gamble: true if the respondent likes to gamble (eg. betting on horse racing or playing the lottery).
skydiving: true if the respondent has ever been skydiving.
speed: true if the respondent likes to drive fast.
cheated: true if the respondent has ever cheated on a spouse or girlfriend/boyfriend.
steak: true if the respondent likes to eat steak.
steak_prep (response): how the respondent likes their steak cooked (factor, as described above, with 5 levels).
female: true if the respondent is female.
age: age group, from 18–29 to 60+.
hhold_income: household income group, from $0–24,999 to $150,000+.
educ: highest level of education attained, from “less than high school” up to “graduate degree”.
region: region (of the US) that the respondent lives in (five values).
The data are in link. This is the cleaned-up data from a previous question, with the missing values removed.
Solution
The usual:
##
## ── Column specification ───────────────────────────────────
## cols(
## respondent_id = col_double(),
## lottery_a = col_logical(),
## smoke = col_logical(),
## alcohol = col_logical(),
## gamble = col_logical(),
## skydiving = col_logical(),
## speed = col_logical(),
## cheated = col_logical(),
## steak = col_logical(),
## steak_prep = col_character(),
## female = col_logical(),
## age = col_character(),
## hhold_income = col_character(),
## educ = col_character(),
## region = col_character()
## )
We should also (re-)load MASS, since we’ll be needing it:
We are going to predict steak_prep from some of the other variables. Why is the model-fitting function polr from package MASS the best choice for these data (alternatives being glm and multinom from package nnet)?
Solution
It all depends on the kind of response variable. We have a response variable with five ordered levels, from Rare to Well. There are more than two levels (more than just “success” and “failure”), which rules out glm, and the levels are ordered, which rules out multinom. As we know, polr handles an ordered response, so it is the right choice.
What are the levels of steak_prep, in the order that R thinks they are in? If they are not in a sensible order, create an ordered factor where the levels are in a sensible order.
Solution
This is the most direct way to find out:
## [1] "Medium rare" "Rare" "Medium" "Medium Well"
## [5] "Well"
This is almost the right order (distinct uses the order in the data frame). We just need to switch the first two around, and then we’ll be done:
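In code, with the levels as distinct produced them:

```r
# the levels as distinct produced them (from the output above)
preps <- c("Medium rare", "Rare", "Medium", "Medium Well", "Well")

# switch the first two around
preps1 <- preps[c(2, 1, 3, 4, 5)]
preps1
# [1] "Rare" "Medium rare" "Medium" "Medium Well" "Well"
```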
## [1] "Rare" "Medium rare" "Medium" "Medium Well"
## [5] "Well"
If you used count, there’s a bit more work to do:
## [1] "Medium" "Medium rare" "Medium Well" "Rare"
## [5] "Well"
because count puts them in alphabetical order, so:
## [1] "Rare" "Medium rare" "Medium" "Medium Well"
## [5] "Well"
These use the idea in the attitudes-to-abortion question: create a vector of the levels in the right order, then create an ordered factor with ordered(). If you like, you can type the levels in the right order (I won’t penalize you for that here), but it’s really better to get the levels without typing or copying-pasting, so that you don’t make any silly errors copying them (which will mess everything up later).
So now I create my ordered response:
or using one of the other preps vectors containing the levels in the correct order.
As far as polr is concerned, it doesn’t matter whether I start at Rare and go “up”, or start at Well and go “down”. So if you do it the other way around, that’s fine. As long as you get the levels in a sensible order, you’re good.
Fit a model predicting steak preference from educ, female and lottery_a. This ought to be easy from your previous work, but you have to be careful about one thing. No need to print out the results.
Solution
The thing you have to be careful about is that you use the ordered factor that you just created as the response:
Run drop1 on your fitted model, with test="Chisq". Which explanatory variable should be removed first, if any? Bear in mind that the variable with the smallest AIC should come out first, in case your table doesn’t get printed in order.
Solution
This:
My table is indeed out of order (which is why I warned you about it, in case that happens to you as well). The smallest AIC goes with female, which also has a very non-significant P-value, so this one should come out first.
Remove that variable using update. (If all the variables should stay, you can skip this part.)
Solution
You could type or copy-paste the whole model again, but update is quicker:
That’s all.
I wanted to get some support for my drop1 above (since I was a bit worried about those out-of-order rows). Now that we have fitted a model with female and one without, we can compare them using anova:
Don’t get taken in by that “LR stat” on the end of the first row of the output table; the P-value wrapped onto the second line, and is in fact exactly the same as in the drop1 output (it is doing exactly the same test). As non-significant as you could wish for.
I was curious about whether either of the other \(x\)’s could come out now:
lottery_a should come out, but educ is edging towards significance.
Solution
Again, I’m leaving it to you to follow all the steps. My variables remaining are educ and lottery_a. The second of these is a logical, TRUE and FALSE, so I just filled in the two values:
educs <- steak %>% distinct(educ) %>% pull(educ)
lottery_as <- c(FALSE, TRUE)
steak.new <- crossing(educ = educs, lottery_a = lottery_as)
steak.new
My best model so far is the one I called steak.2, without female, so I use that. As before, if you forget what type should be, put in something, and it will tell you what your choices are:
I find this hard to read, so I’m going to round off those predictions. Three or four decimals seems to be sensible:
Say something about the effect of changing educational level on the predictions, and say something about the effect of favouring Lottery A vs. not. I don’t much mind what: you can say that there is not much effect (of either variable), or you can say something like “people with a graduate degree are slightly more likely to like their steak rare and less likely to like it well done” (for education level) and “people who preferred Lottery A are slightly less likely to like their steak rare and slightly more likely to like it well done” (for effect of Lottery A). You can see these by comparing the first five rows to assess the effect of education (or the last five rows, if you prefer), and you can compare eg. rows 1 and 6 to assess the effect of Lottery A (compare two lines with the same educational level but different preferences re Lottery A).
I would keep away from saying anything about education level “less than high school”, since this entire level is represented by exactly one person.
1, that means “just an intercept”. Or you can remove whatever remains using update.) What do you conclude? Explain briefly.
Solution
The fitting part is the challenge, since the testing part is anova again. The direct fit is this:
and the update version is this, about equally long, starting from steak.2 since that is the best model so far:
You can use whichever you like. Either way, the second part is anova, and the two possible answers should be the same:
or
At the 0.05 level, removing both of the remaining variables is fine: that is, nothing (out of these variables) has any impact on the probability that a diner will prefer their steak cooked a particular way.
However, with data like this and a rather exploratory analysis, I might think about using a larger \(\alpha\) like 0.10, and at this level, taking out both these two variables is a bad idea. You could say that one or both of them is “potentially useful” or “provocative” or something like that.
If you think that removing these two variables is questionable, you might like to go back to that drop1 output I had above:
The smallest AIC goes with lottery_a, so that comes out (it is nowhere near significant):
and what you see is that educational level is right on the edge of significance, so that may or may not have any impact. Make a call. But if anything, it’s educational level that makes a difference.
Solution
The article says that nothing has anything to do with steak preference. Whether you agree or not depends on what you thought above about dropping those last two variables. So say something consistent with what you said in the previous part. Two points for saying that the author said “nothing has any effect”, and one point for how your findings square with that. Now that you have worked through this great long question, this is where I tell you that I simplified things a fair bit for you! There were lots of other variables that might have had an impact on how people like their steaks, and we didn’t even consider those. Why did I choose what I did here? Well, I wanted to fit a regression predicting steak preference from everything else, do a big backward elimination, but:
## Warning: glm.fit: algorithm did not converge
## Error in polr(steak_prep_ord ~ ., data = steak): attempt to find suitable starting values failed
The . in place of explanatory variables means “all the other variables”, including the nonsensical personal ID. That saved me having to type them all out.
Unfortunately, however, it didn’t work. The problem is a numerical one. Regular regression has a well-defined procedure, where the computer follows through the steps and gets to the answer, every time. Once you go beyond regression, however, the answer is obtained by a step-by-step method: the computer makes an initial guess, tries to improve it, then tries to improve it again, until it can’t improve things any more, at which point it calls it good. The problem here is that polr cannot even get the initial guess! (It apparently is known to suck at this, in problems as big and complicated as this one.)
I don’t normally recommend forward selection, but I wonder whether it works here:
steak.5 <- polr(steak_prep_ord ~ 1, data = steak)
steak.6 <- step(steak.5,
scope = . ~ lottery_a + smoke + alcohol + gamble + skydiving +
speed + cheated + female + age + hhold_income + educ + region,
direction = "forward", test = "Chisq", trace = 0
)
drop1(steak.6, test = "Chisq")
## Single term deletions
##
## Model:
## steak_prep_ord ~ educ
## Df AIC LRT Pr(>Chi)
## <none> 907.96
## educ 4 909.45 9.484 0.05008 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It does, and it says the only thing to add out of all the variables is education level. So, for you, I picked this along with a couple of other plausible-sounding variables and had you start from there.
Forward selection starts from a model containing nothing and asks
“what can we add?”. This is a bit more complicated than backward
elimination, because now you have to say what the candidate things to
add are. That’s the purpose of that scope
piece, and
there I had no alternative but to type the names of all the
variables. Backward elimination is easier, because the candidate
variables to remove are the ones in the model, and you don’t need a
scope. The trace=0
says “don’t give me any output”
(you can change it to a different value if you want to see what that
does), and last, the drop1
looks at what is actually
in the final model (with a view to asking what can be removed,
but we don’t care about that here).
This question takes you through the data preparation for one of the other questions. You don’t have to do this question, but you may find it interesting or useful.
When you order a steak in a restaurant, the server will ask you how you would like it cooked, or to be precise, how much you would like it cooked: rare (hardly cooked at all), through medium rare, medium, medium well to well (which means “well done”, so that the meat has only a little red to it). Could you guess how a person likes their steak cooked, from some other information about them? The website link commissioned a survey where they asked a number of people how they preferred their steak, along with as many other things as they could think of to ask. (Many of the variables below are related to risk-taking, which was something the people designing the survey thought might have something to do with liking steak rare.) The variables of interest are all factors or true/false:
respondent_ID: a ten-digit number identifying each person who responded to the survey.
lottery_a: true if the respondent preferred lottery A, with a small chance to win a lot of money, to lottery B, with a larger chance to win less money.
smoke: true if the respondent is currently a smoker.
alcohol: true if the respondent at least occasionally drinks alcohol.
gamble: true if the respondent likes to gamble (e.g. betting on horse racing or playing the lottery).
skydiving: true if the respondent has ever been skydiving.
speed: true if the respondent likes to drive fast.
cheated: true if the respondent has ever cheated on a spouse or girlfriend/boyfriend.
steak: true if the respondent likes to eat steak.
steak_prep (response): how the respondent likes their steak cooked (factor, as described above, with 5 levels).
female: true if the respondent is female.
age: age group, from 18–29 to 60+.
hhold_income: household income group, from $0–24,999 to $150,000+.
educ: highest level of education attained, from “less than high school” up to “graduate degree”.
region: region (of the US) that the respondent lives in (five values).
The data are in link.
Solution
The usual:
my_url <- "https://raw.githubusercontent.com/nxskok/datafiles/master/steak.csv"
steak0 <- read_csv(my_url)
##
## ── Column specification ───────────────────────────────────
## cols(
## respondent_id = col_double(),
## lottery_a = col_logical(),
## smoke = col_logical(),
## alcohol = col_logical(),
## gamble = col_logical(),
## skydiving = col_logical(),
## speed = col_logical(),
## cheated = col_logical(),
## steak = col_logical(),
## steak_prep = col_character(),
## female = col_logical(),
## age = col_character(),
## hhold_income = col_character(),
## educ = col_character(),
## region = col_character()
## )
I’m using a temporary name for reasons that will become clear shortly.
Run summary
on the entire data frame. Would you say you have a lot of missing values, or only a few?
Solution
I see missing values, starting in the very first row.
Running the data frame through summary
gives this, either as summary(steak0)
or this way:
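(The code chunk itself is missing from this extract; the “this way” version would be the pipeline form, something like:)

```r
steak0 %>% summary()
```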
## respondent_id lottery_a smoke
## Min. :3.235e+09 Mode :logical Mode :logical
## 1st Qu.:3.235e+09 FALSE:279 FALSE:453
## Median :3.235e+09 TRUE :267 TRUE :84
## Mean :3.235e+09 NA's :4 NA's :13
## 3rd Qu.:3.235e+09
## Max. :3.238e+09
## alcohol gamble skydiving
## Mode :logical Mode :logical Mode :logical
## FALSE:125 FALSE:280 FALSE:502
## TRUE :416 TRUE :257 TRUE :36
## NA's :9 NA's :13 NA's :12
##
##
## speed cheated steak
## Mode :logical Mode :logical Mode :logical
## FALSE:59 FALSE:447 FALSE:109
## TRUE :480 TRUE :92 TRUE :430
## NA's :11 NA's :11 NA's :11
##
##
## steak_prep female age
## Length:550 Mode :logical Length:550
## Class :character FALSE:246 Class :character
## Mode :character TRUE :268 Mode :character
## NA's :36
##
##
## hhold_income educ region
## Length:550 Length:550 Length:550
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
Make a call about whether you think that’s a lot of missing values or only a few. This might not be all of them, because missing text doesn’t show here (we see later how to make it show up).
What does drop_na
do when applied to a data frame with missing values? To find out, pass the data frame into drop_na()
, then into summary
again. What has happened?
Solution
Let’s try it and see.
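The code chunk is missing from this extract; following the question’s instructions, it would be something like:

```r
steak0 %>%
  drop_na() %>%
  summary()
```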
## respondent_id lottery_a smoke
## Min. :3.235e+09 Mode :logical Mode :logical
## 1st Qu.:3.235e+09 FALSE:171 FALSE:274
## Median :3.235e+09 TRUE :160 TRUE :57
## Mean :3.235e+09
## 3rd Qu.:3.235e+09
## Max. :3.235e+09
## alcohol gamble skydiving
## Mode :logical Mode :logical Mode :logical
## FALSE:65 FALSE:158 FALSE:308
## TRUE :266 TRUE :173 TRUE :23
##
##
##
## speed cheated steak
## Mode :logical Mode :logical Mode:logical
## FALSE:28 FALSE:274 TRUE:331
## TRUE :303 TRUE :57
##
##
##
## steak_prep female age
## Length:331 Mode :logical Length:331
## Class :character FALSE:174 Class :character
## Mode :character TRUE :157 Mode :character
##
##
##
## hhold_income educ region
## Length:331 Length:331 Length:331
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
The missing values, the ones we can see anyway, have all gone. Precisely, drop_na
, as its
name suggests, drops all the rows that have missing values in them
anywhere. This is potentially wasteful, since a row might be missing
only one value, and we drop the entire rest of the row, throwing away
the good data as well. If you check, we started with 550 rows, and we
now have only 331 left. Ouch.
So now we’ll save this into our “good” data frame, which means doing it again (now that we know it works):
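A sketch of that saving step (the name steak for the “good” data frame is my assumption, chosen to match the write_csv part at the end):

```r
# drop every row with any missing value, and save the result
steak0 %>% drop_na() -> steak
```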
Extra: another way to handle missing data is called “imputation”: what you do is to estimate a value for any missing data, and then use that later on as if it were the truth. One way of estimating missing values is to do a regression (of appropriate kind: regular or logistic) to predict a column with missing values from all the other columns.
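For flavour, here is a minimal sketch of a much cruder kind of imputation than the regression-based kind just described: filling each missing numeric value with its column mean. (Most of the columns in this data frame are logical or text, so in practice you would need something cleverer, such as the most common category; this is only to show the shape of the idea.)

```r
library(tidyverse)
# crude mean imputation, numeric columns only (illustrative sketch)
steak0 %>%
  mutate(across(where(is.numeric),
                ~ replace_na(., mean(., na.rm = TRUE)))) -> steak.imputed
```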
Extra extra: below we see how we used to have to do this, for your information.
First, we run complete.cases
on the data frame:
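That chunk is not shown in this extract; it would have been simply:

```r
complete.cases(steak0)
```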
## [1] FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
## [10] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [19] TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
## [28] TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE FALSE
## [37] TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
## [46] TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
## [55] FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE
## [64] FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
## [73] FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
## [82] TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE
## [91] TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE
## [100] TRUE FALSE TRUE FALSE TRUE TRUE TRUE FALSE TRUE
## [109] FALSE TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE
## [118] FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
## [127] FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE
## [136] FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE
## [145] TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
## [154] TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE TRUE
## [163] TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
## [172] FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE
## [181] FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE TRUE
## [190] TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [199] FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE TRUE
## [208] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
## [217] FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE
## [226] FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
## [235] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
## [244] TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE TRUE
## [253] FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [262] FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE
## [271] TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE
## [280] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [289] TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE
## [298] TRUE FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE
## [307] TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE
## [316] TRUE TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
## [325] FALSE FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE
## [334] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [343] FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
## [352] TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
## [361] FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
## [370] TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
## [379] TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE
## [388] TRUE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
## [397] TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE
## [406] TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE
## [415] FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
## [424] TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE
## [433] TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE FALSE
## [442] FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE
## [451] TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
## [460] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE
## [469] TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE FALSE
## [478] FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE
## [487] TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE
## [496] TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE
## [505] TRUE FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [514] TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
## [523] FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE
## [532] FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE
## [541] FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE
## [550] FALSE
You might be able to guess what this does, in the light of what we
just did, but if not, you can investigate. Let’s pick three rows where
complete.cases
is
TRUE
and three where it’s
FALSE
, and see what happens.
I’ll pick rows 496, 497, and 498 for the TRUE rows, and 540, 541 and
542 for the FALSE ones. Let’s assemble these rows into a vector and
use slice
to display the rows with these numbers:
## [1] 496 497 498 540 541 542
Like this:
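The slice itself (the display is not reproduced in this extract; rows is the hypothetical vector of row numbers from just above):

```r
steak0 %>% slice(rows)
```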
What’s the difference?
The rows where complete.cases
is FALSE have one (or more)
missing values in them; where complete.cases
is TRUE the
rows have no missing values. (Depending on the rows you choose,
you may not see the missing value(s), as I didn’t.)
Extra (within “extra extra”: I hope you are keeping track): this
is a bit tricky to investigate more thoroughly, because the text
variables might have missing values in them, and they won’t show
up unless we turn them into a factor first:
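The chunk that produced the summary below is not shown in this extract; given the discussion of where further down, it was presumably something like:

```r
# turn every text column into a factor, then summarize
steak0 %>%
  mutate(across(where(is.character), as.factor)) %>%
  summary()
```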
## respondent_id lottery_a smoke
## Min. :3.235e+09 Mode :logical Mode :logical
## 1st Qu.:3.235e+09 FALSE:279 FALSE:453
## Median :3.235e+09 TRUE :267 TRUE :84
## Mean :3.235e+09 NA's :4 NA's :13
## 3rd Qu.:3.235e+09
## Max. :3.238e+09
##
## alcohol gamble skydiving
## Mode :logical Mode :logical Mode :logical
## FALSE:125 FALSE:280 FALSE:502
## TRUE :416 TRUE :257 TRUE :36
## NA's :9 NA's :13 NA's :12
##
##
##
## speed cheated steak
## Mode :logical Mode :logical Mode :logical
## FALSE:59 FALSE:447 FALSE:109
## TRUE :480 TRUE :92 TRUE :430
## NA's :11 NA's :11 NA's :11
##
##
##
## steak_prep female age
## Medium :132 Mode :logical >60 :131
## Medium rare:166 FALSE:246 18-29:110
## Medium Well: 75 TRUE :268 30-44:133
## Rare : 23 NA's :36 45-60:140
## Well : 36 NA's : 36
## NA's :118
##
## hhold_income
## $0 - $24,999 : 51
## $100,000 - $149,999: 76
## $150,000+ : 54
## $25,000 - $49,999 : 77
## $50,000 - $99,999 :172
## NA's :120
##
## educ
## Bachelor degree :174
## Graduate degree :133
## High school degree : 39
## Less than high school degree : 2
## Some college or Associate degree:164
## NA's : 38
##
## region
## Pacific : 91
## South Atlantic : 88
## East North Central: 86
## Middle Atlantic : 72
## West North Central: 42
## (Other) :133
## NA's : 38
There are missing values everywhere. What the where
does is to do something for each column where the first thing is true:
here, if the column is text, then replace it by the factor version of
itself. This makes for a better summary, one that shows how many
observations are in each category, and, more important for us, how
many are missing (a lot).
All right, so there are 15 columns, so let’s investigate missingness
in our rows by looking at the columns 1 through 8 and then 9 through
15, so they all fit on the screen. Recall that you can select
columns by number:
and
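The two displays (not reproduced in this extract) would come from selecting the two sets of columns, something like this, again assuming the six row numbers are saved in rows:

```r
steak0 %>% slice(rows) %>% select(1:8)
steak0 %>% slice(rows) %>% select(9:15)
```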
In this case, the first three rows have no missing values anywhere,
and the last three rows have exactly one missing value. This
corresponds to what we would expect, with complete.cases
identifying rows that have any missing values.
What we now need to do is to obtain a data frame that contains only
the rows with non-missing values. This can be done by saving the
result of complete.cases
in a variable first; filter
can take anything that produces a true or a false for each row, and
will return the rows for which the thing it was fed was true.
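In code, that is something like this (the names cc and steak.complete are my assumptions):

```r
# save the true/false vector, then keep only the rows where it is TRUE
cc <- complete.cases(steak0)
steak0 %>% filter(cc) -> steak.complete
```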
A quick check that we got rid of the missing values:
There are no missing values there. Of course, this is not a proof, and there might be some missing values further down, but at least it suggests that we might be good.
For proof, this is the easiest way I know:
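The chunk is missing from this extract; it would be the same text-to-factor trick as before, applied to the complete-cases data frame (here called steak.complete, my assumed name):

```r
steak.complete %>%
  mutate(across(where(is.character), as.factor)) %>%
  summary()
```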
## respondent_id lottery_a smoke
## Min. :3.235e+09 Mode :logical Mode :logical
## 1st Qu.:3.235e+09 FALSE:171 FALSE:274
## Median :3.235e+09 TRUE :160 TRUE :57
## Mean :3.235e+09
## 3rd Qu.:3.235e+09
## Max. :3.235e+09
##
## alcohol gamble skydiving
## Mode :logical Mode :logical Mode :logical
## FALSE:65 FALSE:158 FALSE:308
## TRUE :266 TRUE :173 TRUE :23
##
##
##
##
## speed cheated steak
## Mode :logical Mode :logical Mode:logical
## FALSE:28 FALSE:274 TRUE:331
## TRUE :303 TRUE :57
##
##
##
##
## steak_prep female age
## Medium :109 Mode :logical >60 :82
## Medium rare:128 FALSE:174 18-29:70
## Medium Well: 56 TRUE :157 30-44:93
## Rare : 18 45-60:86
## Well : 20
##
##
## hhold_income
## $0 - $24,999 : 37
## $100,000 - $149,999: 66
## $150,000+ : 39
## $25,000 - $49,999 : 55
## $50,000 - $99,999 :134
##
##
## educ
## Bachelor degree :120
## Graduate degree : 86
## High school degree : 20
## Less than high school degree : 1
## Some college or Associate degree:104
##
##
## region
## South Atlantic :68
## Pacific :57
## East North Central:48
## Middle Atlantic :46
## West North Central:29
## Mountain :24
## (Other) :59
If there were any missing values, they would be listed on the end of the counts of observations for each level, or on the bottom of the five-number summaries. But there aren’t. So here’s your proof.
Save the data frame of complete cases into a .csv
file, with a name like
steak1.csv
. Open this file in a spreadsheet and (quickly)
verify that you have the right columns and no missing values.
Solution
This is write_csv
, using my output from
drop_na
:
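The chunk is not shown in this extract; it would be something like this, assuming the drop_na output was saved as steak:

```r
write_csv(steak, "steak1.csv")
```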
Open up Excel, or whatever you have, and take a look. You should have all the right columns, and, scrolling down, no visible missing values.