STAD29 Assignment 4

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

1 Choice-box

A psychology experiment began by showing a video in which four German children demonstrated how to use a device called a “choice-box”, which consisted of three pipes. Three of the children in the video used pipe #1, demonstrating how to throw a ball into the pipe and receive a toy from a dispenser. The other child in the video used pipe #2, also throwing a ball into the pipe and receiving a toy from the dispenser. Pipe #3 was never used on the video.

The pipes on the choice-box were actually different colours, and different versions of the video were used in which the identity of pipes #1, #2, and #3 were varied at random, and the order of children using pipes #1 and #2 on the video were also varied at random: sometimes the three children demonstrating the same pipe appeared first, and sometimes the one child demonstrating the other pipe appeared first.

The 629 subjects of the experiment, who were other children of various ages, were each given one ball to use in the choice-box. The experimenter noted which pipe each subject threw the ball into, and how it related to the pipes used in the video that subject had watched. These are in the column y:

  • majority: the subject threw their ball into the pipe demonstrated by three children on their video (what I called pipe #1).
  • minority: the subject threw their ball into the pipe demonstrated by only one child on their video (what I called pipe #2).
  • unchosen: the subject threw their ball into the pipe demonstrated by none of the children on their video (what I called pipe #3). I should probably point out that these subjects got a toy from the dispenser as well.

The aim of the experiment was to see whether the subjects were influenced by what happened on the video they saw: for example, was a subject more likely to choose the pipe demonstrated three times on their video? The experimenters also recorded the gender, age, and culture of each subject (coded as C1 through C8), along with whether the video showed three children using pipe #1 first, or one child using pipe #2 first. Did these other variables have an effect on which pipe a subject chose? This kind of experiment might shed some light about how children are influenced by what they see and what changes it.

The data are in http://ritsokiguess.site/datafiles/Boxes.csv.

(a) (1 point) Read in and display (some of) the data.

As ever:

my_url <- "http://ritsokiguess.site/datafiles/Boxes.csv"
boxes <- read_csv(my_url)
Rows: 629 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): y, gender, culture, majority_first
dbl (1): age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
boxes

There are 629 rows, one subject per row (which will mean later that we don’t have to deal with that weights thing).

Extra: what are the culture values?

boxes %>% count(culture)

I had to work this out to finish writing the question. There are more children from culture C3 than any other.

(b) (2 points) What assumption is made about the response categories in order to use multinom from package nnet?

The response categories are assumed to be unordered (or, labels for different categories in no particular order).

You can decide for yourself whether you think the categories of y are actually ordered or unordered. For example, you might think that a sensible ordering would be majority, minority, unchosen, in descending order of how much influence the video had over the subject, but we are going to treat them as unordered here.
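Extra: not needed here, but if you did want to treat the categories as ordered, the usual tool is polr from package MASS. A minimal sketch, assuming the ordering I suggested above (the names y_ord, boxes_ord, and boxes.ordinal are mine); calling MASS::polr directly avoids loading all of MASS, whose select would otherwise mask the one in dplyr:

boxes %>% 
  mutate(y_ord = ordered(y, levels = c("majority", "minority", "unchosen"))) -> boxes_ord
# polr wants the response to be an ordered factor, which y_ord now is
boxes.ordinal <- MASS::polr(y_ord ~ gender + age + culture + majority_first, data = boxes_ord)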

(c) (2 points) Fit an appropriate model for predicting the (treated as unordered) category of y from the other variables. Include a squared term in age. You don’t need to display any results.

That means something like this:

boxes.1 <- multinom(y ~ gender + age + I(age^2) + culture + majority_first, 
                    data = boxes)
# weights:  39 (24 variable)
initial  value 691.027130 
iter  10 value 619.662201
iter  20 value 615.538678
final  value 615.520419 
converged

(d) (3 points) To find out what, if anything, you can remove from your model, use step. The input to step is a model (here, the one you fitted in the previous part). The output from step is another model, the one obtained by removing everything that can be removed. Save this model. Running step displays some additional output, showing you what it is doing. (You might find that there is a lot of additional output; that is fine to hand in on this assignment.)

Aside: In most cases, drop1 works just fine, and you can remove non-significant things one at a time. This, however, is not one of those cases:

drop1(boxes.1, test = "Chisq")
trying - gender 
Error in if (trace) {: argument is not interpretable as logical

This is, as far as I can tell, an actual bug in multinom, or in the way multinom interfaces with drop1, one that no-one has gotten around to fixing.
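If you wanted to do the one-at-a-time removal by hand anyway, here is a sketch of a possible workaround (the model name boxes.1a is mine): fit the model without the term you are testing, with trace = FALSE to quiet the progress report, and compare the two fits with a likelihood ratio test, which anova will do for a pair of nested multinom models. I haven’t leaned on this very hard, so treat it as a sketch rather than a guarantee:

# the same model as boxes.1 but without gender, refit from scratch
boxes.1a <- multinom(y ~ age + I(age^2) + culture + majority_first, 
                     data = boxes, trace = FALSE)
# likelihood ratio test of boxes.1a (smaller) against boxes.1 (bigger)
anova(boxes.1a, boxes.1)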

End of aside.

So what you can do is to run step and let it do the removal of unimportant explanatory variables for you:

boxes.2 <- step(boxes.1)
Start:  AIC=1279.04
y ~ gender + age + I(age^2) + culture + majority_first

trying - gender 
# weights:  36 (22 variable)
initial  value 691.027130 
iter  10 value 623.801093
iter  20 value 618.975316
final  value 618.958678 
converged
trying - age 
# weights:  36 (22 variable)
initial  value 691.027130 
iter  10 value 620.650429
iter  20 value 617.395922
final  value 617.381771 
converged
trying - I(age^2) 
# weights:  36 (22 variable)
initial  value 691.027130 
iter  10 value 619.003503
iter  20 value 617.499856
final  value 617.497924 
converged
trying - culture 
# weights:  18 (10 variable)
initial  value 691.027130 
iter  10 value 624.110284
final  value 623.287626 
converged
trying - majority_first 
# weights:  36 (22 variable)
initial  value 691.027130 
iter  10 value 649.159141
iter  20 value 646.799762
final  value 646.785972 
converged
                 Df      AIC
- culture        10 1266.575
- age            22 1278.764
- I(age^2)       22 1278.996
<none>           24 1279.041
- gender         22 1281.917
- majority_first 22 1337.572
# weights:  18 (10 variable)
initial  value 691.027130 
iter  10 value 624.110284
final  value 623.287626 
converged

Step:  AIC=1266.58
y ~ gender + age + I(age^2) + majority_first

trying - gender 
# weights:  15 (8 variable)
initial  value 691.027130 
iter  10 value 626.524047
final  value 626.513205 
converged
trying - age 
# weights:  15 (8 variable)
initial  value 691.027130 
iter  10 value 627.242036
final  value 625.727509 
converged
trying - I(age^2) 
# weights:  15 (8 variable)
initial  value 691.027130 
iter  10 value 626.363996
final  value 626.092979 
converged
trying - majority_first 
# weights:  15 (8 variable)
initial  value 691.027130 
iter  10 value 654.784947
final  value 654.773132 
converged
                 Df      AIC
<none>           10 1266.575
- age             8 1267.455
- I(age^2)        8 1268.186
- gender          8 1269.026
- majority_first  8 1325.546

At the bottom of the extra output is a table showing what is in the final model from step: age, age-squared, gender, and majority_first. The top line <none> says that it is best, according to AIC (which is what step uses), to take out nothing else.

Extra: the reason there is a lot of extra output is that running multinom even once produces a progress report with lines “initial”, “iter” (sometimes a few of those), and “final”. The word “converged” below “final” shows that it worked properly (which it appears to have done every time here).

The main reason for the long extra output, though, is that multinom had to be run a bunch of times:

  • starting from what I called boxes.1, take out each of the explanatory variables from the model one at a time and fit the model without each one, saving the AIC from each fit
  • make a table showing the resulting AIC for the removal of each explanatory variable, including removing nothing
  • remove the best thing to remove (here culture)
  • repeat the first two steps for the current best model
  • the best thing to remove is now nothing, so we are done, and the table shows what is still in our model.
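If you would rather not wade through all that output, here is a sketch that should quieten things down (I have not stress-tested this combination): fit the original model with trace = FALSE, which silences multinom’s progress report, and give step its own trace = 0, which silences the step-by-step commentary. The refits that step does are built from the original call, so they should inherit the trace = FALSE as well:

# quieter versions of boxes.1 and boxes.2 (the q names are mine)
boxes.1q <- multinom(y ~ gender + age + I(age^2) + culture + majority_first,
                     data = boxes, trace = FALSE)
boxes.2q <- step(boxes.1q, trace = 0)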

(e) (3 points) For your best model, create a dataframe for predicting the probability that a child will choose the majority, minority, or unchosen pipe, for ages 5 through 13. What values have been used for the other explanatory variables?

This means using datagrid to supply some representative values for the other variables, and the ages we want. Use the model that came out of step, which I called boxes.2:

new <- datagrid(model = boxes.2, age = 5:13)
new

As per tradition, my resulting dataframe is called new.

Two points for that much. The other point is for saying what values of the other variables found their way into your dataframe and why. These are gender boy and majority_first being no (and y being majority, but this is not used in the predictions). These are categorical, and these categories appear here because they are the most frequent categories in the data.

Extra: you don’t need to demonstrate that your assertion is correct, but if you wanted to:

boxes %>% count(gender)
boxes %>% count(majority_first)

It is a close call, but there are a few more boys than girls in the dataset, and slightly more of the time, what I called pipe #2 was demonstrated first in the video that a subject saw.1

(f) (3 points) Calculate and display your predictions side by side with the corresponding explanatory variable values. Arrange your predictions in a way that makes them easier to compare.

Feed your dataframe of new values into predictions. Select the output columns you want to keep, and then pivot the group (predicted response category) wider:

cbind(predictions(boxes.2, newdata = new)) %>% 
  select(group, age, gender, majority_first, estimate) %>% 
  pivot_wider(names_from = group, values_from = estimate)

Two points for the first two lines, and a third point for doing some sensible rearrangement of the predictions to make them easier to read.

Extra: the effect of age is that the youngest and oldest children are most likely to follow the majority (what I called pipe #1); older children are less likely to follow the minority (vs younger children) and more likely to choose the pipe that was never chosen on their video. Come up with your own explanation of why you think that is.

(The age effect for girls is the same as for boys, at least according to our model, because there is no interaction between gender and age in the model. I tried adding one, but it was not significant, so the effect of age seems to be the same for boys and for girls.)
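If you want to see how that check might go, here is a sketch (boxes.3 is my name for the bigger model): add the interaction to the model that came out of step, and compare the two fits with a likelihood ratio test. A large P-value would say that the interaction is not needed, which is what I found:

# boxes.2 plus a gender-by-age interaction; gender * age means gender + age + gender:age
boxes.3 <- multinom(y ~ gender * age + I(age^2) + majority_first, 
                    data = boxes, trace = FALSE)
# small P-value would say the interaction is worth keeping
anova(boxes.2, boxes.3)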

(g) (2 points) Plot the predictions as they depend on age. Hint: use the simplified procedure shown in lecture, not the complicated one shown in the lecture notes.

Probably the easiest way to think about this is to start with what comes out of predictions before you tidy it up:

cbind(predictions(boxes.2, newdata = new)) 

The category of y being predicted for is called group (then, as you recall, we took the estimate for each group and pivoted them wider to make them easier to look at).

Then think back to what we did when plotting the predictions for the coal miners: we put the group on the end of the condition. In this case, that means having age first and then group:

plot_predictions(model = boxes.2, condition = c("age", "group"))

It is perfectly good (though more work than you need to do) to grapple with what is in the lecture notes:

plot_predictions(boxes.2, condition = c("age"), type = "probs", draw = FALSE)  %>% 
  ggplot(aes(x = age, y = estimate, colour = group)) +
  geom_line() 

Extra: You might be wondering what values are being used for the other variables gender and majority_first in the graph. I think it is the same as for datagrid: the most common category. I haven’t checked this; one way you might investigate is to compare the predictions you got in the previous part with your graph here.
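One way to do that check, as a sketch: ask plot_predictions for the numbers behind the graph rather than the graph itself, by adding draw = FALSE:

plot_predictions(boxes.2, condition = c("age", "group"), draw = FALSE)

If gender and majority_first show up as columns there, those are the values being used; failing that, you can line the estimate values up against the predictions from the previous part and see whether they match.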

The way plot_predictions works is to put the first thing in condition on the \(x\)-axis, to use coloured curves for the second thing, and to use facets for the third thing (if there is one). This suggests that you might add gender to the plot like this:

plot_predictions(model = boxes.2, condition = c("age", "group", "gender"))

It seems to make sense to keep the colours to represent the outcome category.

What we see here is that regardless of age, girls are more likely than boys to go with the minority (pipe #2) on their video, and less likely to go with the majority (pipe #1).

It also looks as if the plot I had you make was indeed based on boys.

2 NBA schedule

The NBA (National Basketball Association) runs North America’s major basketball league, whose games are played from October to April. The 2023-2024 schedule is at http://ritsokiguess.site/datafiles/nba_sched.csv.

(a) (1 point) Read in and display some of the data.

As you well know by now:

my_url <- "http://ritsokiguess.site/datafiles/nba_sched.csv"
nba <- read_csv(my_url)
Rows: 1200 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): date, road, home, start_eastern, TV

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nba

Extra: there is (as you might guess) a story to tell here. I went searching for the season schedule, and most of what you can find is something like the current week’s games (on, for example, a TV network website). The schedule you just read in is based on what looks like a press release from the NBA directly with the entire season’s schedule in it, but it was a PDF, and I didn’t know how to extract things from a PDF in R. Until I found out, that is. Most of what I do here uses ideas at https://crimebythenumbers.com/scrape-table.html.

There is a package pdftools that will help you extract text from PDF files, such as the one with the schedule in it here.

So:

library(pdftools)
Using poppler version 22.02.0

as well as the usual library(tidyverse), and then:

my_url <- "https://ak-static.cms.nba.com/wp-content/uploads/sites/46/2023/08/2023-24-NBA-Schedule-By-Date-as-of-8-17-23.pdf"
nba_pdf <- pdf_text(my_url)
str(nba_pdf)
 chr [1:24] "2023-24 NBA REGULAR SEASON SCHEDULE\nAS OF AUGUST 17, 2023 | SUBJECT TO CHANGE\n\n\n DAY      DATE         AWAY"| __truncated__ ...

pdf_text, from pdftools, extracts the text from a pdf file. Not in a very convenient way, though. There were 24 pages in the original, and the last line above (which shows the “structure”2 of nba_pdf without displaying all of it, because it is long) shows that each one of the pages ended up as one element of a text vector.

Let’s take a look at the last page (which is the shortest). This is one very long line, so scroll right to see more. Square brackets for pulling things out of a vector:

nba_pdf[24]
[1] "                         2023-24 NBA REGULAR SEASON SCHEDULE\n\n          DAY        DATE            AWAY                HOME        LOCAL       ET       NAT TV   #\n          Sun.      4/14/24           Detroit         San Antonio     2:30 PM   3:30 PM\n          Sun.      4/14/24            Utah           Golden State   12:30 PM   3:30 PM\n          Sun.      4/14/24          Houston          LA Clippers    12:30 PM   3:30 PM\n          Sun.      4/14/24          Portland         Sacramento     12:30 PM   3:30 PM\n\n\n        # - SCHEDULE/ARENA NOTES:\n        I - In-Season Tournament Group Play\n        A - Game to be played in Mexico City Arena, Mexico City\n        B - Game to be played in Accor Arena, Paris\n        C - Games to be played in Moody Center, Austin\n\n\n\n\nSUBJECT TO CHANGE                                  PAGE 24 of 24                          AS OF AUGUST 17, 2023\n"

What you can do now is to compare this to the last page of the original pdf. It contains all the same text, but not formatted as nicely. In particular, everywhere you see a \n in the above is the start of a new line in the original.3 So a good first step would be to turn those newline symbols into real newlines, which you can do like this:

nba_strings <- strsplit(nba_pdf, split = "\n")
str(nba_strings)
List of 24
 $ : chr [1:53] "2023-24 NBA REGULAR SEASON SCHEDULE" "AS OF AUGUST 17, 2023 | SUBJECT TO CHANGE" "" "" ...
 $ : chr [1:57] "                         2023-24 NBA REGULAR SEASON SCHEDULE" "" "          DAY        DATE        AWAY              HOME        LOCAL       ET       NAT TV   #" "          Mon.      10/30/23      Detroit      Oklahoma City   7:00 PM    8:00 PM" ...
 $ : chr [1:58] "                         2023-24 NBA REGULAR SEASON SCHEDULE" "" "          DAY        DATE         AWAY              HOME        LOCAL        ET       NAT TV   #" "          Mon.       11/6/23       Atlanta      Oklahoma City    7:00 PM    8:00 PM" ...
 $ : chr [1:58] "                         2023-24 NBA REGULAR SEASON SCHEDULE" "" "          DAY        DATE         AWAY              HOME        LOCAL       ET       NAT TV   #" "          Tue.      11/14/23    LA Clippers        Denver       8:00 PM   10:00 PM    TNT     I" ...
 $ : chr [1:58] "                         2023-24 NBA REGULAR SEASON SCHEDULE" "" "          DAY        DATE         AWAY              HOME        LOCAL       ET       NAT TV   #" "          Wed.      11/22/23      Denver           Orlando      7:00 PM    7:00 PM" ...
 $ : chr [1:57] "                         2023-24 NBA REGULAR SEASON SCHEDULE" "" "          DAY        DATE         AWAY              HOME        LOCAL       ET       NAT TV   #" "          Wed.      11/29/23        Utah          Memphis       7:00 PM    8:00 PM" ...
 $ : chr [1:57] "                         2023-24 NBA REGULAR SEASON SCHEDULE" "" "          DAY        DATE         AWAY              HOME        LOCAL       ET       NAT TV   #" "          Wed.      12/13/23      Atlanta          Toronto      7:30 PM    7:30 PM" ...
 $ : chr [1:57] "                         2023-24 NBA REGULAR SEASON SCHEDULE" "" "          DAY        DATE        AWAY              HOME        LOCAL        ET     NAT TV     #" "          Wed.      12/20/23    Charlotte         Indiana       7:00 PM    7:00 PM" ...
 $ : chr [1:57] "                         2023-24 NBA REGULAR SEASON SCHEDULE" "" "          DAY        DATE         AWAY              HOME        LOCAL       ET       NAT TV   #" "          Wed.      12/27/23     Milwaukee         Brooklyn     7:30 PM    7:30 PM" ...
 $ : chr [1:57] "                        2023-24 NBA REGULAR SEASON SCHEDULE" "" "          DAY       DATE         AWAY              HOME        LOCAL       ET       NAT TV   #" "          Wed.       1/3/24      Brooklyn         Houston      7:00 PM    8:00 PM" ...
 $ : chr [1:57] "                        2023-24 NBA REGULAR SEASON SCHEDULE" "" "          DAY       DATE         AWAY              HOME         LOCAL       ET       NAT TV   #" "          Wed.      1/10/24    Philadelphia         Atlanta     7:30 PM    7:30 PM   ESPN" ...
 $ : chr [1:57] "                        2023-24 NBA REGULAR SEASON SCHEDULE" "" "          DAY       DATE         AWAY              HOME         LOCAL        ET       NAT TV   #" "          Wed.      1/17/24      Houston          New York       7:30 PM    7:30 PM" ...
 $ : chr [1:57] "                        2023-24 NBA REGULAR SEASON SCHEDULE" "" "          DAY       DATE         AWAY              HOME         LOCAL        ET       NAT TV   #" "          Wed.      1/24/24      Phoenix            Dallas       9:00 PM   10:00 PM   ESPN" ...
 $ : chr [1:57] "                        2023-24 NBA REGULAR SEASON SCHEDULE" "" "          DAY       DATE         AWAY              HOME         LOCAL       ET       NAT TV   #" "          Wed.      1/31/24      Phoenix          Brooklyn      7:30 PM    7:30 PM   ESPN" ...
 $ : chr [1:57] "                        2023-24 NBA REGULAR SEASON SCHEDULE" "" "          DAY       DATE         AWAY              HOME         LOCAL        ET       NAT TV   #" "          Wed.       2/7/24      Atlanta            Boston       7:30 PM    7:30 PM   ESPN" ...
 $ : chr [1:57] "                        2023-24 NBA REGULAR SEASON SCHEDULE" "" "          DAY       DATE         AWAY              HOME         LOCAL       ET       NAT TV   #" "          Wed.      2/14/24      Chicago          Cleveland     7:30 PM    7:30 PM   ESPN" ...
 $ : chr [1:57] "                        2023-24 NBA REGULAR SEASON SCHEDULE" "" "          DAY       DATE         AWAY              HOME         LOCAL        ET       NAT TV   #" "          Tue.      2/27/24      Brooklyn          Orlando       7:00 PM    7:00 PM" ...
 $ : chr [1:57] "                        2023-24 NBA REGULAR SEASON SCHEDULE" "" "          DAY       DATE         AWAY              HOME         LOCAL        ET       NAT TV   #" "          Tue.       3/5/24      Orlando          Charlotte      7:00 PM    7:00 PM" ...
 $ : chr [1:57] "                        2023-24 NBA REGULAR SEASON SCHEDULE" "" "          DAY       DATE         AWAY              HOME         LOCAL        ET       NAT TV   #" "          Mon.      3/11/24      Toronto           Denver        7:00 PM    9:00 PM" ...
 $ : chr [1:57] "                        2023-24 NBA REGULAR SEASON SCHEDULE" "" "          DAY       DATE         AWAY              HOME         LOCAL        ET       NAT TV   #" "          Mon.      3/18/24     Memphis          Sacramento      7:00 PM   10:00 PM" ...
 $ : chr [1:57] "                        2023-24 NBA REGULAR SEASON SCHEDULE" "" "          DAY       DATE         AWAY              HOME         LOCAL       ET       NAT TV   #" "          Mon.      3/25/24       Dallas             Utah       7:00 PM    9:00 PM" ...
 $ : chr [1:57] "                        2023-24 NBA REGULAR SEASON SCHEDULE" "" "          DAY       DATE        AWAY              HOME        LOCAL        ET       NAT TV   #" "          Tue.      4/2/24   Oklahoma City     Philadelphia    7:00 PM    7:00 PM" ...
 $ : chr [1:57] "                        2023-24 NBA REGULAR SEASON SCHEDULE" "" "          DAY       DATE        AWAY              HOME         LOCAL       ET       NAT TV   #" "          Tue.       4/9/24      Detroit        Philadelphia   7:00 PM    7:00 PM" ...
 $ : chr [1:19] "                         2023-24 NBA REGULAR SEASON SCHEDULE" "" "          DAY        DATE            AWAY                HOME        LOCAL       ET       NAT TV   #" "          Sun.      4/14/24           Detroit         San Antonio     2:30 PM   3:30 PM" ...

strsplit comes from base R (so is old). There is undoubtedly something in stringr (beginning with str_) that will do the same job, but this works and is clear enough: “take the text in nba_pdf and turn it into a new piece of text, splitting at newlines”.
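For the record, the stringr way to do the same splitting is str_split; a sketch (I give the result a different name so as not to disturb anything):

# same idea as strsplit: one list element per page, split at newlines
nba_strings2 <- str_split(nba_pdf, "\n")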

This is actually now a list, with one element for each page.4 Each element of the list is a text vector with one element for each line on that page. Each page has about 57 lines of text on it.

I don’t care which page anything is on, so I can use unlist to remove the “list attribute” (whatever makes it into a list), which turns it into a vector with one element per line. There is also some whitespace (spaces, tabs) at the beginning of each line, and things will work better later if I get rid of that. This does actually work with pipes, even though we are not working with dataframes:

nba_strings %>% 
  unlist() %>% 
  trimws(which = "left") -> nba1
str(nba1)
 chr [1:1329] "2023-24 NBA REGULAR SEASON SCHEDULE" ...

The non-pipe equivalent has nested brackets, which I’m not a fan of:

nba1 <- trimws(unlist(nba_strings), which = "left")

Reading the nested version from the inside out: turn nba_strings into a vector (from a list), and then remove any whitespace at the beginning (left side) of each line.

Either way, there are now 1329 lines, with no indication of pages.

This now looks a lot better:5

head(nba1, 10)
 [1] "2023-24 NBA REGULAR SEASON SCHEDULE"                                               
 [2] "AS OF AUGUST 17, 2023 | SUBJECT TO CHANGE"                                         
 [3] ""                                                                                  
 [4] ""                                                                                  
 [5] "DAY      DATE         AWAY            HOME         LOCAL       ET       NAT TV   #"
 [6] "Tue.    10/24/23    L.A. Lakers       Denver       5:30 PM    7:30 PM    TNT"      
 [7] "Tue.    10/24/23      Phoenix      Golden State    7:00 PM   10:00 PM    TNT"      
 [8] "Wed.    10/25/23      Atlanta        Charlotte     7:00 PM    7:00 PM"             
 [9] "Wed.    10/25/23    Washington        Indiana      7:00 PM    7:00 PM"             
[10] "Wed.    10/25/23      Boston         New York      7:00 PM    7:00 PM   ESPN"      

If this were a datafile, you could edit out the header lines and read it in with read_table, because the columns that interest us are aligned. I did actually think about doing it that way, but we can do better, without any hand-editing needed.

The columns in here are separated by two or more spaces (and a variable number of spaces at that), as you can check. We have to be a little careful, because if you pretended that the columns were separated by one or more spaces, you would get messed up by team names that were more than one word, like the LA Lakers or Golden State (this was where read_table got tripped up for me).

The stringr package6 has a function str_split_fixed that does exactly what we want. The inputs are:

  • a vector of text to be split up (our nba1)
  • a regular expression that says how to split it up
  • the maximum number of columns to split it up into.

The regular expression below is how you encode “two or more spaces”. The first character inside the quotes is one space. The curly-bracket stuff says to match any number of that first character from 2 up to infinity. (You would expect to see another number after the comma, which would be the maximum number of spaces to match, but the fact that there isn’t one means that the number of spaces you match can be as big as you like.)
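To see the regular expression earning its keep, here is a toy check of my own, using (a copy of) one of the lines above and asking for just four pieces:

str_split_fixed("Tue.    10/24/23    L.A. Lakers       Denver", " {2,}", 4)

The single space inside “L.A. Lakers” does not cause a split, because it is only one space, but the longer runs of spaces between the columns do.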

There are actually up to 8 columns, including the last one labelled # (which, according to the notes at the bottom of the last page, marks schedule/arena notes such as the In-Season Tournament games). The result is big, so I am using str again to show its structure:

str(str_split_fixed(nba1, " {2,}", 8))
 chr [1:1329, 1:8] "2023-24 NBA REGULAR SEASON SCHEDULE" ...

This is actually a matrix now (an array with rows and columns: 1329 rows and the 8 columns we just made). You can think of str_split_fixed as “adding a dimension”: we had a one-dimensional vector of text, and now we have a two-dimensional matrix of text, because we split each original piece of text into 8 parts.

I don’t like dealing with matrices much, but as_tibble will turn anything that looks like a dataframe into an actual dataframe:

str_split_fixed(nba1, " {2,}", 8) %>% as_tibble() -> nba2
Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
`.name_repair` is omitted as of tibble 2.0.0.
ℹ Using compatibility `.name_repair`.
nba2

The warning says that the matrix we made didn’t have any column names, and the action as_tibble took was to make up some column names for us. (The actual column names we want to use are the sort of thing that’s in the fourth row, but we will get to that shortly.)

You can see that some of the rows of this dataframe represent actual scheduled basketball games, but some of them don’t: the others are the header and footer lines that came from the pages of the original pdf. We need some way to distinguish between these two kinds of rows. The first thing I noticed was that the actual games have a date that ends in either /23 or /24; you might also look across to V5 or V6 and see that the start times of actual games have an AM or a PM in them (or maybe it would be enough to match the : there). Anyway, the date thing is what I saw.

stringr has a function str_detect that returns TRUE if the first input matches the second one. The first input needs to be a piece of text, and the second one needs to be a regular expression (again!). This one says that V2 has to match a literal slash, followed by a 2, followed by either a 3 or a 4, followed by the end of the text. (The season spans two calendar years, so a game can be played in either 2023 or 2024.)
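A quick sanity check of that regular expression, on some made-up pieces of text of my own:

str_detect(c("10/24/23", "4/14/24", "PAGE 24 of 24"), "/2[34]$")

The first two look like game dates and should come out TRUE; the last one ends in 24 but without a slash immediately in front of it, so it should come out FALSE.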

nba2 %>% 
  filter(str_detect(V2, "/2[34]$")) %>% 
  rename(dow = V1,
         date = V2,
         road = V3,
         home = V4,
         start_eastern = V6,
         TV = V7) %>% 
  select(-dow, -V5, -V8) -> nba3
nba3

Having picked out the rows with games in them, I finally pick out the columns I want to save for you, and give them good names. I could probably do something more elegant than this using the column names that were in the original pdf, but this was enough for me. The pdf included the day of the week for each game, but I wanted you to work those out from the date, so I left them out.

Then I saved nba3 as a csv for you.
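That last step was something like this (a sketch; the file name matches the one you read in at the start of this question):

write_csv(nba3, "nba_sched.csv")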

(b) (4 points) NBA games are played on different days of the week. Which day of the week has the most games, and which day of the week has the fewest? Use the tools you saw in lecture, in this course and STAC32, to work this out. (Hint: does it seem to matter that the year only has two digits here?)

Before you start coding, do some thinking:

  • to get the days of the week, you can use wday.
  • to use wday, you need actual dates, so you have to make those first. These are American-style dates, with the month first, so we will need mdy for this.
  • once you have worked out the day of week of each game, you can use count to count them up, and arrange to put them in order.

That leads to something like this:

nba %>% 
  mutate(date1 = mdy(date)) %>% 
  mutate(dow = wday(date1, label = TRUE)) %>% 
  count(dow) %>% 
  arrange(desc(n))

Wednesday is the most common day for a game, and Thursday is the least common.

Points: two for making dates, one for extracting the day of the week, and one for counting them up and sorting them (or otherwise making clear which was the largest and which was the smallest count).

To pick up on the hint: let’s run the first mutate again:

nba %>% 
  mutate(date1 = mdy(date)) %>% 
  select(date, date1)

mdy inferred that the dates were in the 21st century, so that the two-digit years were actually all right. (It would be smart to check this before you get too far into the question.)

Extra: I was actually surprised by the Wednesday thing; I was expecting a weekend day to be most common. I used to follow the NHL (back in the days of Wayne Gretzky as a player, which shows you how old I am). As I remember, there were most games on a Saturday, and fewest on a Monday, and that was the sort of thing I was expecting to see here. My guess is that a lot of people who follow the NBA also follow college basketball, many of whose games are played on Saturdays.

(c) (4 points) You have a friend who lives in Auckland, New Zealand, who is a big basketball fan. They have a streaming package that enables them to watch any NBA game live. They get home from work at 4:00pm (local time) every day. What are some games they would be able to watch from start to finish as they happen? Use tools we have seen in lecture to find this out. For the purposes of this question, you can include games that have already been played (ie. in 2023). Hints below:

  • your friend needs to know about games that start at 4:00pm or later Auckland time (16:00 or later).
  • you can use unite to glue a date and time together as text
  • if your time does not have seconds, omit the s in the appropriate function
  • when you create a date-time that needs to be in a certain timezone, add the timezone when you create it
  • America/Toronto will do for Eastern time; use OlsonNames() to get a list of all the time zone names that R knows about. (The output from OlsonNames() is long, so do not hand it in; just find what you need and use that.)

Again, thinking first:

  • we are talking time zones, so we need actual date-times with timezones.
  • the unite hint above says how to make a date-time as text
  • use something like ymd_hms to make an actual date-time. Here, the months are first and the years last, and there are no seconds, so use mdy_hm instead. (All these permutations exist.)
  • add a timezone when you are creating the date-times (Eastern time), otherwise you’ll get the UTC timezone.
  • find out using OlsonNames() that the time zone that Auckland, NZ is in is called Pacific/Auckland (down near the bottom; there is also a timezone called simply NZ that should also work. New Zealand as a whole has only one timezone). There is a way to search the names instead of scrolling, sketched just after this list.
  • use with_tz to move a date-time from one timezone to another.
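As promised, a way to search the timezone names instead of scrolling through all of them; a sketch using str_subset from stringr, which keeps only the elements that match:

str_subset(OlsonNames(), "Auckland")

This should return just Pacific/Auckland.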

I know the start of a basketball game as “tip” or “tip-off”, hence my names for the columns below. Something like this is what you need:

nba %>% unite(tip_text, date, start_eastern) %>% 
  mutate(tip = mdy_hm(tip_text, tz = "America/Toronto")) %>% 
  mutate(tip_nz = with_tz(tip, "Pacific/Auckland"))

Two points for making Eastern timezone date-times (the first two lines above); one point for getting the same date-times in Auckland time (the third line); the last point for somehow finding out which games your friend can watch (see below).

On that last point, you could scroll down and find some games that start at or after 16:00 in the tip_nz column. But we can do better than that: pull out the hour from tip_nz, and grab the ones that are 16 or more:

nba %>% unite(tip_text, date, start_eastern) %>% 
  mutate(tip = mdy_hm(tip_text, tz = "America/Toronto")) %>% 
  mutate(tip_nz = with_tz(tip, "Pacific/Auckland")) %>% 
  mutate(tip_nz_hour = hour(tip_nz)) %>% 
  filter(tip_nz_hour >= 16)

Thus, for example, Golden State at Denver and Portland at Sacramento on Nov 8, LA Lakers at Phoenix and Oklahoma City at Sacramento on Nov 10, and so on. Say something like this for the last point. However you find them is good, but the way I just did it is best. There are 132 games your friend can watch after they get home from work, some of which are at the same time (so they will have to choose which one to watch live).

Extra 1: As you page through the games, you will see that they are all on the west coast (Pacific time zone). This is because the time difference between North America and New Zealand is such that all the other, earlier, games are on while your friend is at work. (You might argue that a game that starts at 6:00am Auckland time is also one your friend might be able to watch, if they get up early enough, because it would be before they go to work.)

Extra 2: If you look at the columns I called tip and tip_nz (in the dataframe with all 1200 games), you can see that at the start, they are 17 hours apart and all of the games are too early for your friend to watch (even the ones that start at 10:00pm Eastern). Once you get into November, the time difference becomes 18 hours, because daylight savings ends in November in North America. Near the end of the dataframe, on March 10, the difference again becomes 17 hours, because daylight saving time starts again in North America, but on April 6, it changes again to 16 hours. Why? Because New Zealand’s daylight saving time ends (it is at this point fall in New Zealand).7

What that means is that the NBA games start late enough for your Auckland friend to watch only when there is an 18 hour time difference, because only then do the 10:00pm Eastern (7:00pm Pacific) games match up to 4:00pm the next day in Auckland. When the time difference is less than that, the 10:00pm Eastern games start earlier than 4:00pm in Auckland.

I don’t know about you, but I’m quite glad to have with_tz handle all of that for us, and get an answer for your friend that actually works.

Footnotes

  1. The usual consent form for a study like this, probably signed by each subject child’s parents in this case, allows the subject to withdraw from the study at any time, resulting in some missing data if this happens. My guess is that the experimenters started with equal numbers of boys and girls, and equal numbers of subjects seeing pipe #1 first in their video, but a few of the subject children didn’t complete the study for whatever reason.↩︎

  2. Not to be confused with the functions from stringr that start with str_.↩︎

  3. This is the same principle as \t representing “tab” in the athletes data back in C32.↩︎

  4. If it were a column in a dataframe, it would be a list-column.↩︎

  5. Nothing here is a dataframe, so if I display something, it displays all of it, no matter how big it is. This displays the first 10 lines only.↩︎

  6. Part of the tidyverse, so we actually already loaded it.↩︎

  7. The other end of NZ’s daylight saving time is in September, before the NBA season starts. That is when it starts, because at that point it is spring in NZ.↩︎