Worksheet 11

Published

March 26, 2025

Packages

library(ggbiplot)
library(tidyverse)
library(tmaptools)
library(leaflet)
library(conflicted)
conflicts_prefer(dplyr::summarize)
conflicts_prefer(dplyr::filter)

Rugby league teams by location

There are 37 teams that play professional or semi-professional rugby league in Europe.1 These are listed in the file at http://ritsokiguess.site/datafiles/rugby-league-teams.csv, including the name of each team, its location, and the league in which it plays (from Super League, the best, to League One, the worst). Our aim is to make a map of the locations of the teams, to see what we can learn about where the teams tend to be from.

  1. Read in and display (some of) the data.
my_url <- "http://ritsokiguess.site/datafiles/rugby-league-teams.csv"
teams <- read_csv(my_url)
Rows: 37 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): team, location, league

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
teams
  2. Look up the latitudes and longitudes of the location where each team plays.

This is geocode_OSM, which needs some disentangling afterwards. Remember that geocode_OSM returns three things, so you need to wrap its output in list inside the mutate. I saved the result back into the same dataframe:

teams %>% 
  rowwise() %>% 
  mutate(latlong = list(geocode_OSM(location))) -> teams

This is the slowest part, so it’s best to save the answer so that you only do it once.
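One way to do that saving, if you want the geocoded coordinates to survive between sessions, is to write the dataframe out to an .rds file and read it back in later. This is only a sketch; the file name geocoded-teams.rds is my own invention:

write_rds(teams, "geocoded-teams.rds")
# then, in a later session, skip the geocoding entirely and do this instead:
teams <- read_rds("geocoded-teams.rds")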

This is where we are at now:

teams

We want to take a look at the things in what I called latlong, which will be some kind of unnest. My take here is that I want to keep one row for each team, so I am going to unnest wider:

teams %>% unnest_wider(latlong)

The latitudes and longitudes we want are in coords, so unnest_wider again:

teams %>% unnest_wider(latlong) %>% 
  unnest_wider(coords) -> teams
teams

Now we have, over on the right, columns x (longitudes) and y (latitudes). These places are all in the northern hemisphere, so the latitudes are positive, but some of the longitudes are negative (west of the Greenwich meridian) and some are positive (east of it). These are the right sign conventions for drawing maps.
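If you want to reassure yourself about those signs, a quick look at the ranges of the two coordinate columns will do it. A sketch (the ungroup is there because the earlier rowwise may still be in effect):

# latitudes (y) should all be positive; longitudes (x) should straddle zero
teams %>% 
  ungroup() %>% 
  summarize(min_long = min(x), max_long = max(x),
            min_lat = min(y), max_lat = max(y))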

  3. Draw a map showing where these 37 teams play.

This is where leaflet comes in:

leaflet(data = teams) %>%
  addTiles() %>% 
  addCircleMarkers(lng = ~x, lat = ~y)

As an option, you can use the default markers, which look like Google map pins:

leaflet(data = teams) %>%
  addTiles() %>% 
  addMarkers(lng = ~x, lat = ~y)

I think I like the circles better, because the pins hide behind each other and you don’t know how dense they are until you zoom in.
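If you do want pins but the overlapping bothers you, leaflet can also gather nearby markers into clusters that show a count and expand as you zoom in. A sketch, using the clusterOptions argument:

# nearby pins get grouped into a numbered cluster until you zoom in far enough
leaflet(data = teams) %>%
  addTiles() %>% 
  addMarkers(lng = ~x, lat = ~y, clusterOptions = markerClusterOptions())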

Talking of zooming in, this works like a Google map: click and drag to centre the map, and use the mouse wheel to zoom in or out (or click the + or - top left).

  4. Where are most of the teams found?

Most of the teams are in a band across northern England from about Liverpool to Hull. There are some teams elsewhere: around the coast of north-west England, in Wales, in London (actually two teams2), in the south of France, and in other odd places. But the vast majority of the teams are in this narrow band.

  5. What can you find out about why most of the teams are located where they are?

See what Google has to say. For example, I searched for “why are there so many rugby league teams in the north of England”, and found this article that tells the story pretty well. If you have played or watched rugby, the version of it you probably saw is called Rugby Union. It has 15 players a side, and “scrums” in which a mass of players fight for the ball. Rugby league has only 13 players a side and no scrums, and is a faster, more open game. The reason there are two very similar but distinct sports has to do with being paid to play the game. Rugby union players were traditionally “gentlemen” who had well-paying jobs or who were independently wealthy, and so rugby union was for a long time (even into my lifetime) an amateur game where players did not get paid to play. In the north of England, rugby was always a working-class game, and the best players had to take time off work to train and play, for which they could not at first be paid. This was the reason that rugby league became a separate game: it has been professional since the start. (Rugby union is now professional too.)

So, “because the northern players were working class, and they wanted to get paid for playing.”

(As far as England is concerned, there are rugby union teams all over the country, but mostly only rugby league teams in the north.)

Also part of the rugby league heartland was Cumbria, at the top left of the map (the three teams are the long-standing ones of Barrow, Whitehaven and Workington). The other teams are attempts to foster interest in rugby league in other parts of the country. One of the newest teams is Cornwall, at the bottom left of England. (The south-west of France has also been a place where rugby league is played, for reasons I know not.)

  6. Re-draw your map, but now colouring the points according to which league the team plays in.

Borrow the idea from the clusters in class: make a colour palette first. Note that color in these functions has to be spelled the American way:

pal <- colorFactor("Set1", teams$league)

Then use this palette with a color argument in your markers line:

leaflet(data = teams) %>%
  addTiles() %>% 
  addCircleMarkers(lng = ~x, lat = ~y, color = ~pal(league))

Go back to the data to determine which league is which. They are, from highest to lowest level of play: Super League (green), Championship (red), League 1 (blue).3 Apart from those two teams in the south of France, all the Super League teams are in that band across northern England, and most of the teams in non-traditional places (for Rugby League) are in the lowest-level League 1. Thus, attempts to grow the sport in other places have not been very successful.
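If you would rather not go back to the data to match up the colours, you can have the map label them for you by adding a legend. A sketch, reusing the palette pal from above:

# same map as before, plus a legend matching each colour to its league
leaflet(data = teams) %>%
  addTiles() %>% 
  addCircleMarkers(lng = ~x, lat = ~y, color = ~pal(league)) %>% 
  addLegend(pal = pal, values = ~league, title = "League")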

Need for cognition

Questionnaires are often used to identify aspects of personality. The data in http://www.utsc.utoronto.ca/~butler/d29/cognition.txt were the results of a questionnaire intended to identify people’s “need for cognition”. The questionnaire items are shown here.4 Each response is on a scale of 1–5, with 5 denoting “strongly agree” and 1 denoting “strongly disagree”. Some of the items are “reverse-coded”, which means that “strongly disagree” is intended to correspond to a strong “need for cognition”. (This is to stop people from simply answering “strongly agree” all the way down the list.) The questionnaire item file indicates which items were reverse-coded. The column id is a serial number identifying the respondent, and the questionnaire responses are in columns c1 through c18.

  1. Read in and display some of the data. Hint: the values are separated by single spaces, but there are some extra spaces at the beginning of the rows to make the columns line up. You might get one “parsing failure” warning that you can ignore.

You might be thinking read_delim, but the hint (about the columns being lined up) ought to make you prefer read_table:

my_url <- "http://www.utsc.utoronto.ca/~butler/d29/cognition.txt"
cognition <- read_table(my_url)

── Column specification ────────────────────────────────────────────────────────
cols(
  id = col_double(),
  c1 = col_double(),
  c2 = col_double(),
  c3 = col_double(),
  c4 = col_double(),
  c5 = col_double(),
  c6 = col_double(),
  c7 = col_double(),
  c8 = col_double(),
  c9 = col_double(),
  c10 = col_double(),
  c11 = col_double(),
  c12 = col_double(),
  c13 = col_double(),
  c14 = col_double(),
  c15 = col_double(),
  c16 = col_double(),
  c17 = col_double(),
  c18 = col_double()
)
Warning: 1 parsing failure.
row col   expected     actual                                                    file
101  -- 19 columns 20 columns 'http://www.utsc.utoronto.ca/~butler/d29/cognition.txt'
cognition

Extra: this is a dataset I haven’t used for a while. I discovered the other day that I still have access to my UTSC webspace (and I could even remember the password!), so this file has been just sitting there for some large number of years, gathering virtual dust since it was used on an assignment maybe 10 years ago.

The warning says that row 101 has an extra column, but the numbers there seem to have been read in correctly.
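If you want to check that for yourself, readr keeps a record of what went wrong, and you can also display the row the warning was about. A sketch:

# details of the parsing problem readr complained about
problems(cognition)
# and the offending row, to check that it was read in sensibly
cognition %>% slice(101)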

  2. Check that all of the responses to the questionnaire items are between 1 and 5, except for some missing ones. (In this question and the next one, you might want to use a search engine to hunt for ideas, but you have seen the solution ideas before, so that when you see the search results, you need to be able to look at them and say “ah yes, we saw that before” or something similar that tells you that you know it will solve your problem because it has done so before.)

Find a way that works for you. I think the easiest is to pass the dataframe into summary:

summary(cognition)
       id            c1              c2              c3              c4       
 Min.   :  1   Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.: 51   1st Qu.:2.000   1st Qu.:3.000   1st Qu.:1.000   1st Qu.:1.000  
 Median :101   Median :4.000   Median :4.000   Median :1.000   Median :2.000  
 Mean   :101   Mean   :3.224   Mean   :3.915   Mean   :1.811   Mean   :1.975  
 3rd Qu.:151   3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:2.000   3rd Qu.:2.000  
 Max.   :201   Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
                                                                              
       c5              c6              c7              c8       
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:1.000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:2.000  
 Median :2.000   Median :4.000   Median :2.000   Median :2.000  
 Mean   :1.861   Mean   :3.286   Mean   :2.468   Mean   :2.527  
 3rd Qu.:2.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
                 NA's   :2                                      
       c9             c10            c11             c12             c13       
 Min.   :1.000   Min.   :1.00   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:1.000   1st Qu.:4.00   1st Qu.:4.000   1st Qu.:1.000   1st Qu.:3.000  
 Median :2.000   Median :4.00   Median :4.000   Median :1.000   Median :3.000  
 Mean   :2.251   Mean   :4.07   Mean   :4.211   Mean   :1.776   Mean   :3.294  
 3rd Qu.:3.000   3rd Qu.:5.00   3rd Qu.:5.000   3rd Qu.:2.000   3rd Qu.:4.000  
 Max.   :5.000   Max.   :5.00   Max.   :5.000   Max.   :5.000   Max.   :5.000  
 NA's   :2       NA's   :2      NA's   :2                                      
      c14             c15             c16            c17           c18       
 Min.   :1.000   Min.   :1.000   Min.   :1.00   Min.   :1.0   Min.   :1.000  
 1st Qu.:3.000   1st Qu.:3.000   1st Qu.:1.00   1st Qu.:1.0   1st Qu.:2.000  
 Median :4.000   Median :4.000   Median :2.00   Median :2.0   Median :4.000  
 Mean   :3.735   Mean   :3.668   Mean   :2.27   Mean   :1.9   Mean   :3.325  
 3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:3.00   3rd Qu.:2.0   3rd Qu.:4.000  
 Max.   :5.000   Max.   :5.000   Max.   :5.00   Max.   :5.0   Max.   :5.000  
 NA's   :1       NA's   :2       NA's   :1      NA's   :1     NA's   :1      

This gives you the answer to the whole thing right away: check that the max is no bigger than 5, the min is no smaller than 1, and that there are a few NA values.

To go the tidyverse way, it’s easier if you pivot-longer first: it then becomes a group-by and summarize:

cognition %>%
  pivot_longer(starts_with("c"), names_to = "qname", values_to = "response") %>% 
  group_by(qname) %>% 
  summarize(mn = min(response, na.rm = TRUE), 
            mx = max(response, na.rm = TRUE), 
            nas = sum(is.na(response)))

There are some gotchas here:

  • min and max of anything with missing values is missing (as for a mean), and the workaround, as for a mean, is to explicitly remove the missings before finding the min or max.
  • is.na returns TRUE if a value is missing and FALSE otherwise. Remember that you can do arithmetic with these, in which case TRUE counts as 1 and FALSE as 0, so that summing it up counts the number of 1’s, that is, the number of TRUEs, that is, the number of missing values. The sum(is.na(x)) thing is rather an R idiom: when you have seen it enough times to memorize it, you can code it without thinking too much.

Either way, there are 201 respondents, all the min scores are 1 and all the max scores are 5, and there are only a few missing values.
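Yet another route, if pivoting longer doesn’t appeal, is to summarize all the response columns at once with across. A sketch (the names mn, mx, and nas are again my own choices):

cognition %>% 
  summarize(across(starts_with("c"), 
                   list(mn = ~min(.x, na.rm = TRUE), 
                        mx = ~max(.x, na.rm = TRUE), 
                        nas = ~sum(is.na(.x)))))

The output comes out wide, with three columns for every item, so the pivot-longer version is easier to read, but the conclusions are the same.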

  3. Create and save a dataframe in which the respondents with any missing values at all are dropped.

The handy function for this is drop_na. Running it without any inputs checks all the columns for missing values, which is exactly what we want:

cognition %>% drop_na() -> cognition_clean
cognition_clean

We now have 195 respondents, so we only lost 6 of them in getting rid of the missing values.

I created a new dataframe, but you can most likely overwrite the original.
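If you are curious about exactly which respondents got dropped, an anti_join of the original dataframe against the cleaned one will show them. A sketch:

# rows of the original with no matching id in the cleaned dataframe:
# these should be the six respondents with at least one missing response
cognition %>% anti_join(cognition_clean, by = "id")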

  4. Run a principal components analysis on your new dataframe, having removed the column of IDs. Make a screeplot.

This:

cognition_clean %>% select(-id) -> cognition0
cognition.1 <- princomp(cognition0, cor = TRUE)
ggscreeplot(cognition.1)

There are some naming issues here. I needed a temporary name for the version of my cleaned dataframe without id, so that I could feed it into princomp. Or, you could do it this way:

cognition_clean %>% 
  select(-id) %>% 
  princomp(cor = TRUE) -> cognition.1a
ggscreeplot(cognition.1a)

They both get to the same place (the scree plot is the same), and you might argue that the second way is cleaner, since it avoids making the temporary dataframe cognition0, but to use it you’ll need to recognize that the (invisible) dataframe input to princomp is “whatever came out of the previous step”.

  5. Based on your screeplot, how many principal components do you think you should use? Explain briefly (in your head, if nowhere else).

We are looking for an elbow. There is a big one at 2, meaning one principal component; there is a small one at 4 (meaning 3), and another small one at 6 (meaning 5 principal components). Beyond that, you might see more small elbows (say at 9, or 12, or even 17), but considering that there were only 18 questions, having that many principal components isn’t really offering any meaningful insight. The goal of principal components is to “reduce the dimensionality of the data”, and so you would like to have a lot fewer principal components than you have variables.

  6. Look at the summary output of your principal components analysis. What does this tell you about your proposed number(s) of components?

All right, so this:

summary(cognition.1)
Importance of components:
                          Comp.1     Comp.2     Comp.3     Comp.4     Comp.5
Standard deviation     2.4040402 1.17972556 1.10357933 1.00886083 0.99653725
Proportion of Variance 0.3210783 0.07731958 0.06766041 0.05654445 0.05517147
Cumulative Proportion  0.3210783 0.39839788 0.46605829 0.52260274 0.57777421
                           Comp.6     Comp.7     Comp.8     Comp.9    Comp.10
Standard deviation     0.94897885 0.94008101 0.87702947 0.83063143 0.82292935
Proportion of Variance 0.05003116 0.04909735 0.04273226 0.03833048 0.03762293
Cumulative Proportion  0.62780537 0.67690272 0.71963498 0.75796546 0.79558838
                          Comp.11    Comp.12    Comp.13   Comp.14    Comp.15
Standard deviation     0.80748393 0.75681360 0.73777569 0.7041429 0.66084053
Proportion of Variance 0.03622391 0.03182038 0.03023961 0.0275454 0.02426168
Cumulative Proportion  0.83181229 0.86363267 0.89387228 0.9214177 0.94567936
                          Comp.16   Comp.17    Comp.18
Standard deviation     0.62244632 0.5498780 0.53662497
Proportion of Variance 0.02152441 0.0167981 0.01599813
Cumulative Proportion  0.96720377 0.9840019 1.00000000

Look at the Cumulative Proportion (of variance explained) by your candidate numbers of components. This is like R-squared:

  • 1 component explains 32% of the variance (which seems small)
  • 3 components explain 47% of the variance
  • 5 components explain 58% of the variance

My take is that one component explains too little of the variance, and that either of the others could be considered acceptable. To square this with the screeplot: there is a long “tail” of scree that falls off very slowly, so it is impossible to explain a very large fraction of the variance without taking a lot of components. There just is a lot of random variation, as you would probably expect in data from humans. It all comes down to what you think is tolerable, or, to put it another way, where you think the mountain ends and the scree begins. (The number of components is what you think is the last point on the mountain, and after that is all scree. Saying that one component explains too little of the variance is the same thing as saying that its elbow, at 2, is too far up the mountain.)
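If you want those cumulative proportions as numbers you can compute with, rather than reading them off the printout, they come straight from the standard deviations stored in the fitted object. A sketch:

# squared sdevs are the component variances; the running total divided by
# the overall total gives the cumulative proportion of variance explained
cumsum(cognition.1$sdev^2) / sum(cognition.1$sdev^2)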

  7. Look at the loadings on component 1. Can you explain the pattern of plus and minus signs on the loadings? (Hint: look at the description of the items linked above.)
cognition.1$loadings

Loadings:
    Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
c1   0.251  0.136  0.110  0.140  0.155  0.428                0.568        
c2   0.309                                     0.199  0.223         0.404 
c3  -0.253         0.109                0.565 -0.177  0.138 -0.249  0.131 
c4  -0.275  0.275  0.103  0.205               -0.125  0.166  0.324  0.204 
c5  -0.289  0.170         0.362               -0.201 -0.131  0.166  0.128 
c6   0.183  0.134  0.267               -0.123 -0.702  0.327        -0.282 
c7  -0.227  0.166 -0.143  0.176  0.470               -0.166 -0.375        
c8  -0.206  0.117  0.205 -0.414  0.577 -0.205                0.204        
c9  -0.234  0.203  0.315 -0.180 -0.150                0.230 -0.386        
c10  0.259  0.129 -0.280               -0.235 -0.372 -0.277               
c11  0.282  0.102        -0.122  0.359        -0.218        -0.146  0.418 
c12 -0.259        -0.147  0.443 -0.111 -0.127 -0.136 -0.311         0.111 
c13  0.232  0.429 -0.116                0.264  0.117 -0.144               
c14  0.230  0.389  0.155        -0.313  0.128        -0.321 -0.193        
c15  0.169  0.317         0.456  0.243 -0.133  0.324  0.258 -0.156 -0.498 
c16 -0.164  0.373 -0.333        -0.257 -0.341         0.389         0.283 
c17 -0.229  0.398        -0.290 -0.150               -0.351  0.215 -0.304 
c18                0.692  0.191        -0.345  0.185 -0.234         0.219 
    Comp.11 Comp.12 Comp.13 Comp.14 Comp.15 Comp.16 Comp.17 Comp.18
c1                   0.157   0.325   0.414   0.131   0.150         
c2   0.153   0.248   0.102          -0.112  -0.449  -0.450  -0.328 
c3          -0.414  -0.184   0.322  -0.101  -0.187  -0.306   0.116 
c4   0.219   0.278  -0.135          -0.552   0.153           0.358 
c5          -0.146  -0.264  -0.420   0.178                  -0.582 
c6  -0.285   0.154   0.222                  -0.103                 
c7  -0.310   0.490           0.309                   0.129  -0.106 
c8                          -0.232   0.216          -0.398   0.218 
c9   0.519           0.306           0.252   0.269          -0.129 
c10  0.340  -0.103  -0.179   0.362           0.401  -0.329         
c11  0.185  -0.292                          -0.269   0.541   0.112 
c12                  0.551           0.206  -0.220  -0.182   0.334 
c13 -0.294  -0.127   0.415  -0.339  -0.290   0.345  -0.171         
c14          0.286  -0.388  -0.212   0.284  -0.222           0.326 
c15  0.202  -0.275                                           0.107 
c16 -0.334  -0.202           0.122   0.314                         
c17                  0.149   0.316  -0.210  -0.372   0.151  -0.267 
c18 -0.264  -0.253           0.207           0.189                 

               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
Proportion Var  0.056  0.056  0.056  0.056  0.056  0.056  0.056  0.056  0.056
Cumulative Var  0.056  0.111  0.167  0.222  0.278  0.333  0.389  0.444  0.500
               Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15 Comp.16 Comp.17
SS loadings      1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000
Proportion Var   0.056   0.056   0.056   0.056   0.056   0.056   0.056   0.056
Cumulative Var   0.556   0.611   0.667   0.722   0.778   0.833   0.889   0.944
               Comp.18
SS loadings      1.000
Proportion Var   0.056
Cumulative Var   1.000

Looking down the column Comp.1, the negative signs are on items 3, 4, 5, 7, 8, 9, 12, 16 and 17. If you look at the description of the items (linked in the description of this scenario), you’ll see that these are exactly the items that are reverse-coded. Hence a high score on component 1 goes with “strongly agree” on all the normally-coded items and with “strongly disagree” on the reverse-coded ones: that is, with a high need for cognition overall.
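If scanning that big table for signs is hard on the eyes, you can pull out just the component 1 column of loadings and sort it, so that the reverse-coded items (the negative loadings) come out at the top. A sketch, using enframe to turn the named vector into a dataframe:

cognition.1$loadings[, 1] %>% 
  enframe(name = "item", value = "loading") %>% 
  arrange(loading)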

In many principal components analyses, the loadings on component 1 are similar in size, so that component 1 distinguishes overall-big from overall-small: a “measure of size”, you might say. Here, that plays out as a high need for cognition overall vs. a low one.

  8. What items does component 2 mainly depend on? (For me, there were five loadings noticeably higher than the rest.) What do these items seem to have in common?

The loadings for me on component 2 are highest for items 13 through 17, two of which (16 and 17) are reverse-coded. Let’s take a look at these, paraphrased:

  • 13: I like having puzzles I must solve
  • 14: I like thinking abstractly
  • 15: I prefer tasks that are intellectual/difficult/important
  • 16: I feel relief (not satisfaction) after a task requiring mental effort
  • 17: I want to get the job done, not think about how or why

I guess the thing these have in common is that a high-scorer likes thinking, but also likes the thinking to be over. (Or, likes thinking some, but not too much.)

Extra: In practice, you would now go on and interpret the other components you thought were important. For example, component 3 depends mainly on item 18 (thinking about issues that don’t affect you), and component 4 depends on items 8 (negatively), 12, and 15, and also maybe 5. Untangling the negative sign, this is preferring to think about long-term projects, not wanting to learn new ways to think, preferring intellectual tasks, and maybe preferring to avoid having to think in depth. This is an odd mixture of normal and reverse-coded items; maybe it indicates a person who likes to think only about certain things and in certain ways.

  9. Make a biplot of these data, using the id values from your cleaned (no NAs) dataset.

Just this:

ggbiplot(cognition.1, labels = cognition_clean$id)

The arrows point mostly left and right. If you can see which is which, the normally-coded items point to the right and the reverse-coded ones to the left. The items that point upward more are the ones that feature in component 2.

Extra: to get the same plot, minus the arrows, plot the scores on component 1 against the scores on component 2, using each id as a text label rather than drawing the points as circles:

cbind(id = cognition_clean$id, cognition.1$scores) %>% 
  as_tibble() %>% 
  ggplot(aes(x = Comp.1, y = Comp.2, label = id)) + geom_text()

  10. Find a person who scored low on component 1. By looking at their original data, show that it makes sense that they would have scored low on component 1.

The prime candidate seems to be person 68, way over on the left:

cognition_clean %>% filter(id == 68)

To score low on component 1, a person needs to score low on the normally-coded questions and high on the reverse-coded ones. The reverse-coded ones were 3–5, 7–9, 12, 16, and 17. Eyeball the scores to see that this is about right, with the exception of items 12 (unexpectedly low) and 15 (unexpectedly high).
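To make the eyeballing easier, you could lay this person’s responses out long, with a flag marking the reverse-coded items (the vector of item names below is typed in by hand, following the list above). A sketch: a low scorer on component 1 should have mostly low responses where reverse_coded is FALSE and mostly high ones where it is TRUE.

# person 68's responses, one per row, flagged by whether the item is reverse-coded
reversed <- c("c3", "c4", "c5", "c7", "c8", "c9", "c12", "c16", "c17")
cognition_clean %>% 
  filter(id == 68) %>% 
  pivot_longer(starts_with("c"), names_to = "item", values_to = "response") %>% 
  mutate(reverse_coded = item %in% reversed)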

Extra: this assessment turns out to be easier here than it usually is, because we know that 1 is low and 5 is high for all the items, so there is no question about how low is low or how high is high. In the places rated example in lecture, the variables were on different scales, so we had to compare an individual’s values with a summary, or work out percentile ranks, to see whether the value on a variable was high or low.

  11. Find a person who scored high on component 2. By looking at their original data, show that this too makes sense.

I picked person 57, at the top of the graph. Component 2 only depended on questions 13 through 17, so these are the only ones we need to look at:

cognition_clean %>% filter(id == 57) %>% 
  select(id, c13:c17)

To be a high scorer on component 2, the person would need to score high on these items, and mostly, it seems that they did.

Footnotes

  1. Rugby league is also played in Australia, traditionally around the city of Sydney, and in places in the Pacific like New Zealand and Papua New Guinea.

  2. As I re-read this in 2025, there is now only one team in London, but there is a team in Goole (the Goole Vikings). You might wish to find that on your map.

  3. There are actually two teams in Hull (green), and two in London (one red and one blue, which is why London appears to be purple).

  4. This was rather evidently made by putting a textbook on top of a photocopier and saying “scan this”.