library(nnet)
library(tidyverse)
29 Logistic regression with nominal response
29.1 Finding non-missing values
* This is to prepare you for something in the next question. It’s meant to be easy.
In R, the code NA
stands for “missing value” or “value not known”. In R, NA
should not have quotes around it. (It is a special code, not a piece of text.)
Create a vector
v
that contains some numbers and some missing values, usingc()
. Put those values into a one-column data frame.Obtain a new column containing
is.na(v)
. When is this true and when is this false?The symbol
!
means “not” in R (and other programming languages). What does!is.na(v)
do? Create a new column containing that.Use
filter
to display just the rows of your data frame that have a non-missing value ofv
.
29.3 Alligator food
What do alligators most like to eat? 219 alligators were captured in four Florida lakes. Each alligator’s stomach contents were observed, and the food that the alligator had eaten was classified into one of five categories: fish, invertebrates (such as snails or insects), reptiles (such as turtles), birds, and “other” (such as amphibians, plants or rocks). The researcher noted for each alligator what that alligator had most of in its stomach, as well as the gender of each alligator and whether it was “large” or “small” (greater or less than 2.3 metres in length). The data can be found in link. The numbers in the data set (apart from the first column) are all frequencies. (You can ignore that first column “profile”.)
Our aim is to predict food type from the other variables.
Read in the data and display the first few lines. Describe how the data are not “tidy”.
Use
pivot_longer
to arrange the data suitably for analysis (which will be usingmultinom
). Demonstrate (by looking at the first few rows of your new data frame) that you now have something tidy.What is different about this problem, compared to Question here, that would make
multinom
the right tool to use?Fit a suitable multinomial model predicting food type from gender, size and lake. Does each row represent one alligator or more than one? If more than one, account for this in your modelling.
Do a test to see whether
Gender
should stay in the model. (This will entail fitting another model.) What do you conclude?Predict the probability that an alligator prefers each food type, given its size, gender (if necessary) and the lake it was found in, using the more appropriate of the two models that you have fitted so far. This means (i) making a data frame for prediction, and (ii) obtaining and displaying the predicted probabilities in a way that is easy to read.
What do you think is the most important way in which the lakes differ? (Hint: look at where the biggest predicted probabilities are.)
How would you describe the major difference between the diets of the small and large alligators?
29.4 Crimes in San Francisco
The data in link is a subset of a huge dataset of crimes committed in San Francisco between 2003 and 2015. The variables are:
Dates
: the date and time of the crimeCategory
: the category of crime, eg. “larceny” or “vandalism” (response).Descript
: detailed description of crime.DayOfWeek
: the day of the week of the crime.PdDistrict
: the name of the San Francisco Police Department district in which the crime took place.Resolution
: how the crime was resolvedAddress
: approximate street address of crimeX
: longitudeY
: latitude
Our aim is to see whether the category of crime depends on the day of the week and the district in which it occurred. However, there are a lot of crime categories, so we will focus on the top four “interesting” ones, which are the ones included in this data file.
Some of the model-fitting takes a while (you’ll see why below). In your Quarto document, you can wait for the models to fit each time you re-run your document, or you can insert #| cache: true
below the top line of your code chunk (above the first line of actual code). What that does is to re-run that code chunk only if it changes; if it hasn’t changed it will use the saved results from last time it was run. That can save you a lot of waiting around.
Read in the data and display the dataset (or, at least, part of it).
Fit a multinomial logistic regression that predicts crime category from day of week and district. (You don’t need to look at it.) The model-fitting produces some output in the console to convince you that it has worked.
Fit a model that predicts Category from only the district. Hand in the output from the fitting process as well.
Use
anova
to compare the two models you just obtained. What does theanova
tell you?Using your preferred model, obtain predicted probabilities that a crime will be of each of these four categories for each day of the week in the
TENDERLOIN
district (the name is ALL CAPS). This will mean constructing a data frame to predict from, obtaining the predictions and then displaying them suitably.Describe briefly how the weekend days Saturday and Sunday differ from the rest.
29.5 Crimes in San Francisco – the data
The data in link is a huge dataset of crimes committed in San Francisco between 2003 and 2015. The variables are:
Dates
: the date and time of the crimeCategory
: the category of crime, eg. “larceny” or “vandalism” (response).Descript
: detailed description of crime.DayOfWeek
: the day of the week of the crime.PdDistrict
: the name of the San Francisco Police Department district in which the crime took place.Resolution
: how the crime was resolvedAddress
: approximate street address of crimeX
: longitudeY
: latitude
Our aim is to see whether the category of crime depends on the day of the week and the district in which it occurred. However, there are a lot of crime categories, so we will focus on the top four “interesting” ones, which we will have to discover.
Read in the data and verify that you have these columns and a lot of rows. (The data may take a moment to read in. You will see why.)
How is the response variable here different to the one in the question about steak preferences (and therefore why would
multinom
from packagennet
be the method of choice)?Find out which crime categories there are, and arrange them in order of how many crimes there were in each category.
Which are the four most frequent “interesting” crime categories, that is to say, not including “other offenses” and “non-criminal”? Get them into a vector called
my.crimes
. See if you can find a way of doing this that doesn’t involve typing them in (for full marks).(Digression, but needed for the next part.) The R vector
letters
contains the lowercase letters froma
toz
. Consider the vector('a','m',3,'Q')
. Some of these are found amongst the lowercase letters, and some not. Type these into a vectorv
and explain briefly whyv %in% letters
produces what it does.We are going to
filter
only the rows of our data frame that have one of the crimes inmy.crimes
as theirCategory
. Also,select
only the columnsCategory
,DayOfWeek
andPdDistrict
. Save the resulting data frame and display its structure. (You should have a lot fewer rows than you did before.)Save these data in a file
sfcrime1.csv
.
29.6 What sports do these athletes play?
The data at link are physical and physiological measurements of 202 male and female Australian elite athletes. The data values are separated by tabs. We are going to see whether we can predict the sport an athlete plays from their height and weight.
The sports, if you care, are respectively basketball, “field athletics” (eg. shot put, javelin throw, long jump etc.), gymnastics, netball, rowing, swimming, 400m running, tennis, sprinting (100m or 200m running), water polo.
Read in the data and display the first few rows.
Make a scatterplot of height vs. weight, with the points coloured by what sport the athlete plays. Put height on the \(x\)-axis and weight on the \(y\)-axis.
Explain briefly why a multinomial model (
multinom
fromnnet
) would be the best thing to use to predict sport played from the other variables.Fit a suitable model for predicting sport played from height and weight. (You don’t need to look at the results.) 100 steps isn’t quite enough, so set
maxit
equal to a larger number to allow the estimation to finish.Demonstrate using
anova
thatWt
should not be removed from this model.Make a data frame consisting of all combinations of
Ht
160, 180 and 200 (cm), andWt
50, 75, and 100 (kg), and use it to obtain predicted probabilities of athletes of those heights and weights playing each of the sports. Display the results. You might have to display them smaller, or reduce the number of decimal places1 to fit them on the page.For an athlete who is 180 cm tall and weighs 100 kg, what sport would you guess they play? How sure are you that you are right? Explain briefly.
My solutions follow:
29.7 Finding non-missing values
* This is to prepare you for something in the next question. It’s meant to be easy.
In R, the code NA
stands for “missing value” or “value not known”. In R, NA
should not have quotes around it. (It is a special code, not a piece of text.)
- Create a vector
v
that contains some numbers and some missing values, usingc()
. Put those values into a one-column data frame.
Solution
Like this. The arrangement of numbers and missing values doesn’t matter, as long as you have some of each:
<- c(1, 2, NA, 4, 5, 6, 9, NA, 11)
v <- tibble(v)
mydata mydata
This has one column called v
.
\(\blacksquare\)
- Obtain a new column containing
is.na(v)
. When is this true and when is this false?
Solution
<- mydata %>% mutate(isna = is.na(v))
mydata mydata
This is TRUE
if the corresponding element of v
is missing (in my case, the third value and the second-last one), and FALSE
otherwise (when there is an actual value there).
\(\blacksquare\)
- The symbol
!
means “not” in R (and other programming languages). What does!is.na(v)
do? Create a new column containing that.
Solution
Try it and see. Give it whatever name you like. My name reflects that I know what it’s going to do:
<- mydata %>% mutate(notisna = !is.na(v))
mydata mydata
This is the logical opposite of is.na
: it’s true if there is a value, and false if it’s missing.
\(\blacksquare\)
- Use
filter
to display just the rows of your data frame that have a non-missing value ofv
.
Solution
filter
takes a column to say which rows to pick, in which case the column should contain something that either is TRUE
or FALSE
, or something that can be interpreted that way:
%>% filter(notisna) mydata
or you can provide filter
something that can be calculated from what’s in the data frame, and also returns something that is either true or false:
%>% filter(!is.na(v)) mydata
In either case, I only have non-missing values of v
.
\(\blacksquare\)