29  Logistic regression with nominal response

library(nnet)
library(tidyverse)
library(marginaleffects)

29.1 Finding non-missing values

* This is to prepare you for something in the next question. It’s meant to be easy.

In R, the code NA stands for “missing value” or “value not known”. In R, NA should not have quotes around it. (It is a special code, not a piece of text.)

  1. Create a vector v that contains some numbers and some missing values, using c(). Put those values into a one-column data frame.

  2. Obtain a new column containing is.na(v). When is this true and when is this false?

  3. The symbol ! means “not” in R (and other programming languages). What does !is.na(v) do? Create a new column containing that.

  4. Use filter to display just the rows of your data frame that have a non-missing value of v.

29.2 European Social Survey and voting

The European Social Survey is a giant survey carried out across Europe covering demographic information, attitudes to and amount of education, politics and so on. In this question, we will investigate what might make British people vote for a certain political party.

The information for this question is in a (large) spreadsheet at link. There is also a “codebook” at link that tells you what all the variables are. The ones we will use are the last five columns of the spreadsheet, described on pages 11 onwards of the codebook. (I could have given you more, but I didn’t want to make this any more complicated than it already was.)

  1. Read in the .csv file, and verify that you have lots of rows and columns.

  2. * Use the codebook to find out what the columns prtvtgb, gndr, agea, eduyrs and inwtm are. What do the values 1 and 2 for gndr mean? (You don’t, at this point, have to worry about the values for the other variables.)

  3. The three major political parties in Britain are the Conservative, Labour and Liberal Democrat. (These, for your information, correspond roughly to the Canadian Progressive Conservative, NDP and Liberal parties.) For the variable that corresponds to “political party voted for at the last election”, which values correspond to these three parties?

  4. Normally, I would give you a tidied-up data set. But I figure you could use some practice tidying this one up. As the codebook shows, there are some numerical codes for missing values, and we want to omit those. We want just the columns prtvtgb through inwtm from the right side of the spreadsheet. Use dplyr or tidyr tools to (i) select only these columns, (ii) include the rows that correspond to people who voted for one of the three major parties, (iii) include the rows that have an age at interview less than 999, (iv) include the rows that have less than 40 years of education, (v) include the rows that are not missing on inwtm (use the idea from Question here for (v)). The last four of those (the inclusion of rows) can be done in one go.

  5. Why is my response variable nominal rather than ordinal? How can I tell? Which R function should I use, therefore, to fit my model?

  6. * Take the political party voted for, and turn it into a factor, by feeding it into factor. Fit an appropriate model to predict political party voted for at the last election (as a factor) from all the other variables. Gender is really a categorical variable too, but since there are only two possible values it can be treated as a number.

  7. We have a lot of explanatory variables. The standard way to test whether we need all of them is to take one of them out at a time, and test which ones we can remove. This is a lot of work. We won’t do that. Instead, the R function step does what you want. You feed step two things: a fitted model object, and the option trace=0 (otherwise you get a lot of output). The final part of the output from step tells you which explanatory variables you need to keep. Run step on your fitted model. Which explanatory variables need to stay in the model here?

  8. Fit the model indicated by step (in the last part).

  9. I didn’t think that interview length could possibly be relevant to which party a person voted for. Test whether interview length can be removed from your model of the last part. What do you conclude? (Note that step and this test may disagree.)

  10. Use your best model to obtain predictions from some suitably chosen combinations of values of the explanatory variables that remain. (If you have quantitative explanatory variables left, you could use their first and third quartiles as values to predict from. Running summary on the data frame will get summaries of all the variables.)

  11. What is the effect of increasing age? What is the effect of an increase in years of education?

29.3 Alligator food

What do alligators most like to eat? 219 alligators were captured in four Florida lakes. Each alligator’s stomach contents were observed, and the food that the alligator had eaten was classified into one of five categories: fish, invertebrates (such as snails or insects), reptiles (such as turtles), birds, and “other” (such as amphibians, plants or rocks). The researcher noted for each alligator what that alligator had most of in its stomach, as well as the gender of each alligator and whether it was “large” or “small” (greater or less than 2.3 metres in length). The data can be found in link. The numbers in the data set (apart from the first column) are all frequencies. (You can ignore that first column “profile”.)

Our aim is to predict food type from the other variables.

  1. Read in the data and display the first few lines. Describe how the data are not “tidy”.

  2. Use pivot_longer to arrange the data suitably for analysis (which will be using multinom). Demonstrate (by looking at the first few rows of your new data frame) that you now have something tidy.

  3. What is different about this problem, compared to Question here, that would make multinom the right tool to use?

  4. Fit a suitable multinomial model predicting food type from gender, size and lake. Does each row represent one alligator or more than one? If more than one, account for this in your modelling.

  5. Do a test to see whether Gender should stay in the model. (This will entail fitting another model.) What do you conclude?

  6. Predict the probability that an alligator prefers each food type, given its size, gender (if necessary) and the lake it was found in, using the more appropriate of the two models that you have fitted so far. This means (i) making a data frame for prediction, and (ii) obtaining and displaying the predicted probabilities in a way that is easy to read.

  7. What do you think is the most important way in which the lakes differ? (Hint: look at where the biggest predicted probabilities are.)

  8. How would you describe the major difference between the diets of the small and large alligators?

29.4 Crimes in San Francisco

The data in link is a subset of a huge dataset of crimes committed in San Francisco between 2003 and 2015. The variables are:

  • Dates: the date and time of the crime

  • Category: the category of crime, eg. “larceny” or “vandalism” (response).

  • Descript: detailed description of crime.

  • DayOfWeek: the day of the week of the crime.

  • PdDistrict: the name of the San Francisco Police Department district in which the crime took place.

  • Resolution: how the crime was resolved

  • Address: approximate street address of crime

  • X: longitude

  • Y: latitude

Our aim is to see whether the category of crime depends on the day of the week and the district in which it occurred. However, there are a lot of crime categories, so we will focus on the top four “interesting” ones, which are the ones included in this data file.

Some of the model-fitting takes a while (you’ll see why below). In your Quarto document, you can wait for the models to fit each time you re-run your document, or you can insert #| cache: true below the top line of your code chunk (above the first line of actual code). What that does is to re-run that code chunk only if it changes; if it hasn’t changed it will use the saved results from last time it was run. That can save you a lot of waiting around.
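To illustrate, here is a minimal sketch of what the top of such a chunk might look like. The names crimes.1 and sfcrime are placeholders for whatever you call your fitted model and your data frame:

#| cache: true
# slow model fit goes below the option line; this chunk re-runs only when its code changes
crimes.1 <- multinom(Category ~ DayOfWeek + PdDistrict, data = sfcrime)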

  1. Read in the data and display the dataset (or, at least, part of it).

  2. Fit a multinomial logistic regression that predicts crime category from day of week and district. (You don’t need to look at it.) The model-fitting produces some output in the console to convince you that it has worked.

  3. Fit a model that predicts Category from only the district. Hand in the output from the fitting process as well.

  4. Use anova to compare the two models you just obtained. What does the anova tell you?

  5. Using your preferred model, obtain predicted probabilities that a crime will be of each of these four categories for each day of the week in the TENDERLOIN district (the name is ALL CAPS). This will mean constructing a data frame to predict from, obtaining the predictions and then displaying them suitably.

  6. Describe briefly how the weekend days Saturday and Sunday differ from the rest.

29.5 Crimes in San Francisco – the data

The data in link is a huge dataset of crimes committed in San Francisco between 2003 and 2015. The variables are:

  • Dates: the date and time of the crime

  • Category: the category of crime, eg. “larceny” or “vandalism” (response).

  • Descript: detailed description of crime.

  • DayOfWeek: the day of the week of the crime.

  • PdDistrict: the name of the San Francisco Police Department district in which the crime took place.

  • Resolution: how the crime was resolved

  • Address: approximate street address of crime

  • X: longitude

  • Y: latitude

Our aim is to see whether the category of crime depends on the day of the week and the district in which it occurred. However, there are a lot of crime categories, so we will focus on the top four “interesting” ones, which we will have to discover.

  1. Read in the data and verify that you have these columns and a lot of rows. (The data may take a moment to read in. You will see why.)

  2. How is the response variable here different to the one in the question about steak preferences (and therefore why would multinom from package nnet be the method of choice)?

  3. Find out which crime categories there are, and arrange them in order of how many crimes there were in each category.

  4. Which are the four most frequent “interesting” crime categories, that is to say, not including “other offenses” and “non-criminal”? Get them into a vector called my.crimes. See if you can find a way of doing this that doesn’t involve typing them in (for full marks).

  5. (Digression, but needed for the next part.) The R vector letters contains the lowercase letters from a to z. Consider the vector ('a','m',3,'Q'). Some of these are found amongst the lowercase letters, and some not. Type these into a vector v and explain briefly why v %in% letters produces what it does. (A small setup sketch follows this list.)

  6. We are going to filter only the rows of our data frame that have one of the crimes in my.crimes as their Category. Also, select only the columns Category, DayOfWeek and PdDistrict. Save the resulting data frame and display its structure. (You should have a lot fewer rows than you did before.)

  7. Save these data in a file sfcrime1.csv.
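For the digression in part 5, here is a sketch of setting up the vector; explaining why the result comes out the way it does is the part the question asks of you. The names v and letters are as in the question:

v <- c("a", "m", 3, "Q")   # the 3 silently becomes the text "3"
v %in% letters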

29.6 What sports do these athletes play?

The data at link are physical and physiological measurements of 202 male and female Australian elite athletes. The data values are separated by tabs. We are going to see whether we can predict the sport an athlete plays from their height and weight.

The sports, if you care, are respectively basketball, “field athletics” (eg. shot put, javelin throw, long jump etc.), gymnastics, netball, rowing, swimming, 400m running, tennis, sprinting (100m or 200m running), water polo.

  1. Read in the data and display the first few rows.

  2. Make a scatterplot of height vs. weight, with the points coloured by what sport the athlete plays. Put height on the \(x\)-axis and weight on the \(y\)-axis.

  3. Explain briefly why a multinomial model (multinom from nnet) would be the best thing to use to predict sport played from the other variables.

  4. Fit a suitable model for predicting sport played from height and weight. (You don’t need to look at the results.) 100 steps isn’t quite enough, so set maxit equal to a larger number to allow the estimation to finish.

  5. Demonstrate using anova that Wt should not be removed from this model.

  6. Make a data frame consisting of all combinations of Ht 160, 180 and 200 (cm), and Wt 50, 75, and 100 (kg), and use it to obtain predicted probabilities of athletes of those heights and weights playing each of the sports. Display the results. You might have to display them smaller, or reduce the number of decimal places to fit them on the page.

  7. For an athlete who is 180 cm tall and weighs 100 kg, what sport would you guess they play? How sure are you that you are right? Explain briefly.

My solutions follow:

29.7 Finding non-missing values

* This is to prepare you for something in the next question. It’s meant to be easy.

In R, the code NA stands for “missing value” or “value not known”. In R, NA should not have quotes around it. (It is a special code, not a piece of text.)

  1. Create a vector v that contains some numbers and some missing values, using c(). Put those values into a one-column data frame.

Solution

Like this. The arrangement of numbers and missing values doesn’t matter, as long as you have some of each:

v <- c(1, 2, NA, 4, 5, 6, 9, NA, 11)
mydata <- tibble(v)
mydata

This has one column called v.

\(\blacksquare\)

  2. Obtain a new column containing is.na(v). When is this true and when is this false?

Solution

mydata <- mydata %>% mutate(isna = is.na(v))
mydata

This is TRUE if the corresponding element of v is missing (in my case, the third value and the second-last one), and FALSE otherwise (when there is an actual value there).

\(\blacksquare\)

  3. The symbol ! means “not” in R (and other programming languages). What does !is.na(v) do? Create a new column containing that.

Solution

Try it and see. Give it whatever name you like. My name reflects that I know what it’s going to do:

mydata <- mydata %>% mutate(notisna = !is.na(v))
mydata

This is the logical opposite of is.na: it’s true if there is a value, and false if it’s missing.

\(\blacksquare\)

  4. Use filter to display just the rows of your data frame that have a non-missing value of v.

Solution

filter takes a column to say which rows to pick, in which case the column should contain something that either is TRUE or FALSE, or something that can be interpreted that way:

mydata %>% filter(notisna)

or you can give filter something that can be calculated from what’s in the data frame, and that also returns something that is either true or false:

mydata %>% filter(!is.na(v))

In either case, I only have non-missing values of v.

\(\blacksquare\)

29.8 European Social Survey and voting

The European Social Survey is a giant survey carried out across Europe covering demographic information, attitudes to and amount of education, politics and so on. In this question, we will investigate what might make British people vote for a certain political party.

The information for this question is in a (large) spreadsheet at link. There is also a “codebook” at link that tells you what all the variables are. The ones we will use are the last five columns of the spreadsheet, described on pages 11 onwards of the codebook. (I could have given you more, but I didn’t want to make this any more complicated than it already was.)

  1. Read in the .csv file, and verify that you have lots of rows and columns.

Solution

The obvious way. Printing it out will display some of the data and tell you how many rows and columns you have:

my_url <- "http://ritsokiguess.site/datafiles/ess.csv"
ess <- read_csv(my_url)
Rows: 2286 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): cntry, cname, cproddat, name
dbl (13): cedition, cseqno, essround, edition, idno, dweight, pspwght, pweig...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ess

2286 rows and 17 columns.

\(\blacksquare\)

  2. * Use the codebook to find out what the columns prtvtgb, gndr, agea, eduyrs and inwtm are. What do the values 1 and 2 for gndr mean? (You don’t, at this point, have to worry about the values for the other variables.)

Solution

Respectively, political party voted for at last election, gender (of respondent), age at interview, years of full-time education, length of interview (in minutes). For gndr, male is 1 and female is 2.

\(\blacksquare\)

  3. The three major political parties in Britain are the Conservative, Labour and Liberal Democrat. (These, for your information, correspond roughly to the Canadian Progressive Conservative, NDP and Liberal parties.) For the variable that corresponds to “political party voted for at the last election”, which values correspond to these three parties?

Solution

1, 2 and 3 respectively. (That was easy!)

\(\blacksquare\)

  4. Normally, I would give you a tidied-up data set. But I figure you could use some practice tidying this one up. As the codebook shows, there are some numerical codes for missing values, and we want to omit those. We want just the columns prtvtgb through inwtm from the right side of the spreadsheet. Use dplyr or tidyr tools to (i) select only these columns, (ii) include the rows that correspond to people who voted for one of the three major parties, (iii) include the rows that have an age at interview less than 999, (iv) include the rows that have less than 40 years of education, (v) include the rows that are not missing on inwtm (use the idea from Question here for (v)). The last four of those (the inclusion of rows) can be done in one go.

Solution

The major parties are numbered 1, 2 and 3, so we can keep the ones less than 4 (or <= 3). The reference back to the last question is a hint to use !is.na(). It also works to use drop_na, if you are familiar with that.

ess %>%
  select(prtvtgb:inwtm) %>%
  filter(prtvtgb < 4, agea < 999, eduyrs < 40, !is.na(inwtm)) -> ess.major
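If you prefer drop_na, here is a sketch of the same pipeline. This assumes, as above, that inwtm is the only column whose missing values we need to worry about:

ess %>%
  select(prtvtgb:inwtm) %>%
  filter(prtvtgb < 4, agea < 999, eduyrs < 40) %>%
  drop_na(inwtm) -> ess.major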

You might get a weird error in select, something about “unused argument”. If this happens to you, it’s not because you used select wrong, it’s because you used the wrong select! There is one in MASS, and you need to make sure that this package is “detached” so that you use the select you want, namely the one in dplyr, loaded with the tidyverse. Use the instructions at the end of the mobile phones question or the abortion question to do this.
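If you do need to do the detaching, a line like this (assuming MASS is in fact loaded in your session) is one way:

detach("package:MASS", unload = TRUE)  # also unloads MASS's namespace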

The other way around this is to say, instead of select, dplyr::select with two colons. This means “the select that lives in dplyr, no other”, and is what Wikipedia calls “disambiguation”: out of several things with the same name, you say which one you mean.
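That is, something like this (the same first step of the pipeline, with the package spelled out):

ess %>%
  dplyr::select(prtvtgb:inwtm)  # use dplyr's select even if MASS is also loaded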

If you do the pipeline, you will probably not get it right the first time. (I didn’t.) For debugging, try out one step at a time, and summarize what you have so far, so that you can check it for correctness. A handy trick for that is to make the last piece of your pipeline summary(), which produces a summary of the columns of the resulting data frame. For example, I first did this (note that my filter is a lot simpler than the one above):

ess %>%
  select(prtvtgb:inwtm) %>%
  filter(prtvtgb < 4, !is.na(inwtm)) %>%
  summary()
    prtvtgb           gndr            agea            eduyrs     
 Min.   :1.000   Min.   :1.000   Min.   : 18.00   Min.   : 0.00  
 1st Qu.:1.000   1st Qu.:1.000   1st Qu.: 44.00   1st Qu.:11.00  
 Median :2.000   Median :2.000   Median : 58.00   Median :13.00  
 Mean   :1.803   Mean   :1.572   Mean   : 61.74   Mean   :14.23  
 3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.: 71.00   3rd Qu.:16.00  
 Max.   :3.000   Max.   :2.000   Max.   :999.00   Max.   :88.00  
     inwtm       
 Min.   :  7.00  
 1st Qu.: 35.00  
 Median : 41.00  
 Mean   : 43.54  
 3rd Qu.: 50.00  
 Max.   :160.00  

The mean of a categorical variable like party voted for or gender doesn’t make much sense, but it looks as if all the values are sensible ones (1 to 3 and 1, 2 respectively). However, the maximum values of age and years of education look like missing value codes, hence the other requirements I put in the question.

Displaying as the last step of your pipeline also works, but the advantage of summary is that you get to see whether there are any unusual values, in this case unusually large values that are missing value codes.
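If you want to see how many of those missing-value codes there are before filtering them out, a quick count is one way. This is a sketch: count groups by whether each condition holds, and the exact numbers will be whatever your data give:

ess %>%
  filter(prtvtgb < 4) %>%
  count(agea == 999, eduyrs >= 40, is.na(inwtm))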

\(\blacksquare\)

  5. Why is my response variable nominal rather than ordinal? How can I tell? Which R function should I use, therefore, to fit my model?

Solution

The response variable is political party voted for. There is no (obvious) ordering to this (unless you want to try to place the parties on a left-right spectrum), so this is nominal, and you’ll need multinom from package nnet.

If I had included the minor parties and you were working on a left-right spectrum, you would have had to decide where to put the somewhat libertarian Greens or the parties that exist only in Northern Ireland.

\(\blacksquare\)

  6. * Take the political party voted for, and turn it into a factor, by feeding it into factor. Fit an appropriate model to predict political party voted for at the last election (as a factor) from all the other variables. Gender is really a categorical variable too, but since there are only two possible values it can be treated as a number.

Solution

This, or something like it. multinom lives in package nnet, which you’ll have to install first if you haven’t already:

library(nnet)
ess.1 <- multinom(factor(prtvtgb) ~ gndr + agea + eduyrs + inwtm, data = ess.major)
# weights:  18 (10 variable)
initial  value 1343.602829 
iter  10 value 1256.123798
final  value 1247.110080 
converged

Or create a factor version of your response in the data frame first:

ess.major <- ess.major %>% mutate(party = factor(prtvtgb))

and then:

ess.1a <- multinom(party ~ gndr + agea + eduyrs + inwtm, data = ess.major)
# weights:  18 (10 variable)
initial  value 1343.602829 
iter  10 value 1256.123798
final  value 1247.110080 
converged

\(\blacksquare\)

  7. We have a lot of explanatory variables. The standard way to test whether we need all of them is to take one of them out at a time, and test which ones we can remove. This is a lot of work. We won’t do that. Instead, the R function step does what you want. You feed step two things: a fitted model object, and the option trace=0 (otherwise you get a lot of output). The final part of the output from step tells you which explanatory variables you need to keep. Run step on your fitted model. Which explanatory variables need to stay in the model here?

Solution

I tried to give you lots of hints here:

ess.2a <- step(ess.1, trace = 0)
trying - gndr 
trying - agea 
trying - eduyrs 
trying - inwtm 
# weights:  15 (8 variable)
initial  value 1343.602829 
iter  10 value 1248.343563
final  value 1248.253638 
converged
trying - agea 
trying - eduyrs 
trying - inwtm 
ess.2a
Call:
multinom(formula = factor(prtvtgb) ~ agea + eduyrs + inwtm, data = ess.major)

Coefficients:
  (Intercept)        agea     eduyrs       inwtm
2    1.632266 -0.02153694 -0.0593757 0.009615167
3   -1.281031 -0.01869263  0.0886487 0.009337084

Residual Deviance: 2496.507 
AIC: 2512.507 

If you didn’t save your output in a variable, you’ll get my last bit automatically.

The end of the output gives us coefficients for (and thus tells us we need to keep) age, years of education and interview length.

The actual numbers don’t mean much; it’s the indication that the variable has stayed in the model that makes a difference.

If you’re wondering about the process: first step tries to take out each explanatory variable, one at a time, from the starting model (the one that contains all the variables). Then it finds the best model out of those and fits it. (It doesn’t tell us which model this is, but evidently it’s the one without gender.) Then it takes that model and tries to remove its explanatory variables one at a time (there are only three of them left). Having decided it cannot remove any of them, it stops, and shows us what’s left.

Leaving out the trace=0 shows more output and more detail on the process, but I figured this was enough (and this way, you don’t have to wade through all of that output). Try values like 1 or 2 for trace and see what you get.

\(\blacksquare\)

  8. Fit the model indicated by step (in the last part).

Solution

Copy and paste, and take out the variables you don’t need. Or, better, save the output from step in a variable. This then becomes a fitted model object and you can look at it any of the ways you can look at a model fit. I found that gender needed to be removed, but if yours is different, follow through with whatever your step said to do.

ess.2 <- multinom(party ~ agea + eduyrs + inwtm, data = ess.major)
# weights:  15 (8 variable)
initial  value 1343.602829 
iter  10 value 1248.343563
final  value 1248.253638 
converged

If you saved the output from step, you’ll already have this and you don’t need to do it again:

anova(ess.2, ess.2a)

Same model.

\(\blacksquare\)

  9. I didn’t think that interview length could possibly be relevant to which party a person voted for. Test whether interview length can be removed from your model of the last part. What do you conclude? (Note that step and this test may disagree.)

Solution

Fit the model without inwtm:

ess.3 <- multinom(party ~ agea + eduyrs, data = ess.major)
# weights:  12 (6 variable)
initial  value 1343.602829 
iter  10 value 1250.418281
final  value 1250.417597 
converged

and then use anova to compare them:

anova(ess.3, ess.2)

The P-value, 0.1149, is not small, which says that the smaller model is good, ie. the one without interview length.

I thought drop1 would also work here, but it appears not to:

drop1(ess.1, test = "Chisq")
trying - gndr 
Error in if (trace) {: argument is not interpretable as logical

I think that’s a bug in multinom, since normally if step works, then drop1 will work too (normally step uses drop1).

The reason for the disagreement between step and anova is that step will tend to keep marginal explanatory variables, that is, ones that are “potentially interesting” but whose P-values might not be less than 0.05. There is still no substitute for your judgement in figuring out what to do! step uses a thing called AIC to decide what to do, rather than actually doing a test. If you know about “adjusted R-squared” in choosing explanatory variables for a regression, it’s the same idea: a variable can be not quite significant but still make the adjusted R-squared go up (typically only a little).
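If you want to see the numbers that step is trading off, you can look at the AIC of the two competing models directly. This is a sketch using the fits from above; AIC works here because nnet supplies a log-likelihood for multinom fits:

AIC(ess.2)  # model with interview length
AIC(ess.3)  # model without interview length

Smaller AIC is better: step keeps inwtm because dropping it makes the AIC (slightly) worse, even though the test above says the difference is not significant.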

\(\blacksquare\)

  10. Use your best model to obtain predictions from some suitably chosen combinations of values of the explanatory variables that remain. (If you have quantitative explanatory variables left, you could use their first and third quartiles as values to predict from. Running summary on the data frame will get summaries of all the variables.)

Solution

First make our new data frame of values to predict from. You can use quantile or summary to find the quartiles. I only had agea and eduyrs left, having decided that interview time really ought to come out:

summary(ess.major)
    prtvtgb           gndr            agea           eduyrs     
 Min.   :1.000   Min.   :1.000   Min.   :18.00   Min.   : 0.00  
 1st Qu.:1.000   1st Qu.:1.000   1st Qu.:44.00   1st Qu.:11.00  
 Median :2.000   Median :2.000   Median :58.00   Median :13.00  
 Mean   :1.803   Mean   :1.574   Mean   :57.19   Mean   :13.45  
 3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:71.00   3rd Qu.:16.00  
 Max.   :3.000   Max.   :2.000   Max.   :94.00   Max.   :33.00  
     inwtm       party  
 Min.   :  7.0   1:484  
 1st Qu.: 35.0   2:496  
 Median : 41.0   3:243  
 Mean   : 43.7          
 3rd Qu.: 50.0          
 Max.   :160.0          

Quartiles for age are 44 and 71, and for years of education are 11 and 16.
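Or, if you would rather use quantile, here is a sketch that pulls out just the quartiles we need (the output column names are whatever you choose):

ess.major %>%
  summarize(age_q1 = quantile(agea, 0.25), age_q3 = quantile(agea, 0.75),
            ed_q1 = quantile(eduyrs, 0.25), ed_q3 = quantile(eduyrs, 0.75))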

This time, instead of predicting for variable values that predictions chooses for us (like a five-number summary), we are predicting for “custom” values, ones that we chose. To set that up, the marginaleffects way is to use datagrid like this:

new <- datagrid(model = ess.3, agea = c(44, 71), eduyrs = c(11, 16))
new

What datagrid does is to make all combinations of the variable values you give it. Because we fed it our model ess.3, it only includes the variables in that model, which is easy to make sense of. If instead you feed it the whole data frame as newdata, it fills in “typical” values for all the other columns: the mean, in the case of quantitative variables like inwtm, and the most common category for categorical ones like party:

datagrid(newdata = ess.major, agea = c(44, 71), eduyrs = c(11, 16))

The other variables don’t make much sense, since they are really categorical but expressed as numbers, but they are not in the best model, so that doesn’t do any harm. (In other cases, you might need to be more careful.)

Next, we feed this into predictions, using the above dataframe as newdata, and with our best model, ess.3 (the one without interview length). The results might confuse you at first, since you will probably get an error:

cbind(predictions(ess.3, newdata = new))

The error message gives you a hint about what to do: add a type = "probs" to predictions (which is a consequence of how multinom works):

cbind(predictions(ess.3, newdata = new, type = "probs"))

There are twelve rows for our four predictions, because there are three predictions for each of our four “people”: the probabilities of each one voting for each of the three parties. The party predicted for is in the column group, and the probability of each person (labelled by rowid) voting for that party is in estimate. Let’s simplify things by keeping only those columns and the ones we are predicting for:

cbind(predictions(ess.3, newdata = new, type = "probs")) %>% 
  select(rowid, group, estimate, agea, eduyrs)

and then pivot wider to get all three predictions for each person on one line:

cbind(predictions(ess.3, newdata = new, type = "probs")) %>% 
  select(rowid, group, estimate, agea, eduyrs) %>% 
  pivot_wider(names_from = group, values_from = estimate)

\(\blacksquare\)

  11. What is the effect of increasing age? What is the effect of an increase in years of education?

Solution

To assess the effect of age, hold years of education constant. Thus, compare lines 1 and 3 (or 2 and 4): increasing age tends to increase the chance that a person will vote Conservative (party 1), and decrease the chance that a person will vote Labour (party 2). There doesn’t seem to be much effect of age on the chance that a person will vote Liberal Democrat.

To assess education, hold age constant, and thus compare rows 1 and 2 (or rows 3 and 4). This time, there isn’t much effect on the chances of voting Conservative, but as education increases, the chance of voting Labour goes down, and the chance of voting Liberal Democrat goes up.

A little history: back 150 or so years ago, Britain had two political parties, the Tories and the Whigs. The Tories became the Conservative party (and hence, in Britain and in Canada, the Conservatives are nicknamed Tories). The Whigs became Liberals. At about the same time as working people got to vote (not women, yet, but working men) the Labour Party came into existence. The Labour Party has always been affiliated with working people and trades unions, like the NDP here. But power has typically alternated between Conservative and Labour governments, with the Liberals as a third party. In the 1980s a new party called the Social Democrats came onto the scene, but on realizing that they couldn’t make much of a dent by themselves, they merged with the Liberals to form the Liberal Democrats, which became a slightly stronger third party.

I was curious about what the effect of interview length would be. Presumably, the effect is small, but I have no idea which way it would be. To assess this, we use predictions again, but this time we let predictions pick some values of inwtm for us, and leave everything else at its mean. This time we have to remember to use the model ess.2, the one that contained interview length:

cbind(predictions(ess.2, variables = "inwtm", type = "probs")) %>% 
  select(rowid, group, estimate, agea, eduyrs, inwtm) %>% 
  pivot_wider(names_from = group, values_from = estimate)