Worksheet 11
Packages
library(ggbiplot)
library(tidyverse)
library(tmaptools)
library(leaflet)
library(conflicted)
conflicts_prefer(dplyr::summarize)
conflicts_prefer(dplyr::filter)
Rugby league teams by location
There are 37 teams that play professional or semi-professional rugby league in Europe.[1] These are listed in the file at http://ritsokiguess.site/datafiles/rugby-league-teams.csv, including the name of each team, its location, and the league in which they play (from Super League, best, to League One, worst). Our aim is to make a map of the locations of the teams, to see what we can learn about where the teams tend to be from.
- Read in and display (some of) the data.
my_url <- "http://ritsokiguess.site/datafiles/rugby-league-teams.csv"
teams <- read_csv(my_url)
Rows: 37 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): team, location, league
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
teams
- Look up the latitudes and longitudes of the location where each team plays.
This is geocode_OSM, which needs some disentangling afterwards. Remember that geocode_OSM returns three things, so you need to put it in a list first. I saved this back into the same dataframe:
teams %>%
  rowwise() %>%
  mutate(latlong = list(geocode_OSM(location))) -> teams
This is the slowest part, so it’s best to save the answer so that you only do it once.
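One way to make sure you really do run the geocoding only once is to save the result to a file and read that back in later sessions (a sketch; the filename is my own invention):
# after running the geocoding once, save the result:
write_rds(teams, "teams_geocoded.rds")
# in later sessions, skip the geocoding and read the saved version instead:
teams <- read_rds("teams_geocoded.rds")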
This is where we are at now:
teams
We want to take a look at the things in what I called latlong, which will be some kind of unnest. My take here is that I want to keep one row for each team, so I am going to unnest wider:
teams %>% unnest_wider(latlong)
The latitudes and longitudes we want are in coords, so unnest_wider again:
teams %>% unnest_wider(latlong) %>%
  unnest_wider(coords) -> teams
teams
Now we have, over on the right, columns x (longitudes) and y (latitudes). These places are all in the northern hemisphere, so the latitudes are positive, but some of the longitudes are negative (west of the Greenwich meridian) and some are positive (east of it). These are the right sign conventions for drawing maps.
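As a quick sanity check before mapping, you can look at the ranges of the coordinates; the latitudes should all be positive and the longitudes a mix of negative and positive (a sketch):
teams %>%
  summarize(min_long = min(x), max_long = max(x),
            min_lat = min(y), max_lat = max(y))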
- Draw a map showing where these 37 teams play.
This is where leaflet comes in:
leaflet(data = teams) %>%
addTiles() %>%
addCircleMarkers(lng = ~x, lat = ~y)
As an option, you can use the default markers, which look like Google map pins:
leaflet(data = teams) %>%
addTiles() %>%
addMarkers(lng = ~x, lat = ~y)
I think I like the circles better, because the pins hide behind each other and you don’t know how dense they are until you zoom in.
Talking of zooming in, this works like a Google map: click and drag to re-centre the map, and use the mouse wheel to zoom in or out (or click the + or − at the top left).
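If you want to know which team each circle represents without zooming in, you can attach the team name as a hover label (a sketch; label is a standard option for leaflet markers):
leaflet(data = teams) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~x, lat = ~y, label = ~team)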
- Where are most of the teams found?
Most of the teams are in a band across northern England from about Liverpool to Hull. There are some teams elsewhere: around the coast of north-west England, in Wales, in London (actually two teams[2]), in the south of France, and in other odd places. But the vast majority of the teams are in this narrow band.
- What can you find out about why most of the teams are located where they are?
See what Google has to say. For example, I searched for “why are there so many rugby league teams in the north of England”, and found this article that tells the story pretty well. If you have played or watched rugby, the version of it you probably saw is called Rugby Union. It has 15 players a side, and “scrums” in which a mass of players fight for the ball. Rugby league has only 13 players a side and no scrums, and is a faster, more open game. The reason there are two very similar but distinct sports has to do with being paid to play the game. Rugby union players were traditionally “gentlemen” who had well-paying jobs or who were independently wealthy, and so rugby union was for a long time (even into my lifetime) an amateur game where players did not get paid to play. In the north of England, rugby was always a working-class game, and the best players had to take time off work to train and play, for which they could not at first be paid. This was the reason that rugby league became a separate game: it has been professional since the start. (Rugby union is now professional too.)
So, “because the northern players were working class, and they wanted to get paid for playing.”
(As far as England is concerned, there are rugby union teams all over the country, but mostly only rugby league teams in the north.)
Also part of the rugby league heartland was Cumbria, at the top left of the map (the three teams are the long-standing ones of Barrow, Whitehaven and Workington). The other teams are attempts to foster interest in rugby league in other parts of the country. One of the newest teams is Cornwall, at the bottom left of England. (The south-west of France has also been a place where rugby league is played, for reasons I know not.)
- Re-draw your map, but now colouring the points according to which league the team plays in.
Borrow the idea from the clusters in class: make a colour palette first. Note that color in here has to be thus spelled:
pal <- colorFactor("Set1", teams$league)
Then use this with a color in your markers line:
leaflet(data = teams) %>%
addTiles() %>%
addCircleMarkers(lng = ~x, lat = ~y, color = ~pal(league))
Go back to the data to determine which league is which. They are, from highest to lowest level of play: Super League (green), Championship (red), League 1 (blue).[3] Apart from those two teams in the south of France, all the Super League teams are in that band across northern England, and most of the teams in non-traditional places (for rugby league) are in the lowest-level League 1. Thus, attempts to grow the sport in other places have not been very successful.
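Rather than going back to the data, you could also put a legend on the map, so the colour coding is visible directly (a sketch; addLegend is part of leaflet):
leaflet(data = teams) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~x, lat = ~y, color = ~pal(league)) %>%
  addLegend(pal = pal, values = ~league, title = "League")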
Need for cognition
Questionnaires are often used to identify aspects of personality. The data in http://www.utsc.utoronto.ca/~butler/d29/cognition.txt were the results of a questionnaire intended to identify people's "need for cognition". The questionnaire items are shown here.[4] Each response is on a scale of 1–5, with 5 denoting "strongly agree" and 1 denoting "strongly disagree". Some of the items are "reverse-coded", which means that "strongly disagree" is intended to correspond to a strong "need for cognition". (This is to stop people from simply answering "strongly agree" all the way down the list.) The questionnaire item file indicates which items were reverse-coded. The column id is a serial number identifying the respondent, and the questionnaire responses are in columns c1 through c18.
- Read in and display some of the data. Hint: the values are separated by single spaces, but there are some extra spaces on the beginning of the rows to make the columns line up. You might get one “parsing failure” warning that you can ignore.
You might be thinking read_delim, but the hint (about the columns being lined up) ought to make you prefer read_table:
my_url <- "http://www.utsc.utoronto.ca/~butler/d29/cognition.txt"
cognition <- read_table(my_url)
── Column specification ────────────────────────────────────────────────────────
cols(
id = col_double(),
c1 = col_double(),
c2 = col_double(),
c3 = col_double(),
c4 = col_double(),
c5 = col_double(),
c6 = col_double(),
c7 = col_double(),
c8 = col_double(),
c9 = col_double(),
c10 = col_double(),
c11 = col_double(),
c12 = col_double(),
c13 = col_double(),
c14 = col_double(),
c15 = col_double(),
c16 = col_double(),
c17 = col_double(),
c18 = col_double()
)
Warning: 1 parsing failure.
row col expected actual file
101 -- 19 columns 20 columns 'http://www.utsc.utoronto.ca/~butler/d29/cognition.txt'
cognition
Extra: this is a dataset I haven’t used for a while. I discovered the other day that I still have access to my UTSC webspace (and I could even remember the password!), so this file has been just sitting there for some large number of years, gathering virtual dust since it was used on an assignment maybe 10 years ago.
The warning says that row 101 has an extra column, but the numbers there seem to have been read in correctly.
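If you want to reassure yourself, you can pull out that row of the dataframe and take a look (a sketch):
cognition %>% slice(101)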
- Check that all of the responses to the questionnaire items are between 1 and 5, except for some missing ones. (In this question and the next one, you might want to use a search engine to hunt for ideas, but you have seen the solution ideas before, so that when you see the search results, you need to be able to look at them and say “ah yes, we saw that before” or something similar that tells you that you know it will solve your problem because it has done so before.)
Find a way that works for you. I think the easiest is to pass the dataframe into summary:
summary(cognition)
id c1 c2 c3 c4
Min. : 1 Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
1st Qu.: 51 1st Qu.:2.000 1st Qu.:3.000 1st Qu.:1.000 1st Qu.:1.000
Median :101 Median :4.000 Median :4.000 Median :1.000 Median :2.000
Mean :101 Mean :3.224 Mean :3.915 Mean :1.811 Mean :1.975
3rd Qu.:151 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:2.000 3rd Qu.:2.000
Max. :201 Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
c5 c6 c7 c8
Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
1st Qu.:1.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:2.000
Median :2.000 Median :4.000 Median :2.000 Median :2.000
Mean :1.861 Mean :3.286 Mean :2.468 Mean :2.527
3rd Qu.:2.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
NA's :2
c9 c10 c11 c12 c13
Min. :1.000 Min. :1.00 Min. :1.000 Min. :1.000 Min. :1.000
1st Qu.:1.000 1st Qu.:4.00 1st Qu.:4.000 1st Qu.:1.000 1st Qu.:3.000
Median :2.000 Median :4.00 Median :4.000 Median :1.000 Median :3.000
Mean :2.251 Mean :4.07 Mean :4.211 Mean :1.776 Mean :3.294
3rd Qu.:3.000 3rd Qu.:5.00 3rd Qu.:5.000 3rd Qu.:2.000 3rd Qu.:4.000
Max. :5.000 Max. :5.00 Max. :5.000 Max. :5.000 Max. :5.000
NA's :2 NA's :2 NA's :2
c14 c15 c16 c17 c18
Min. :1.000 Min. :1.000 Min. :1.00 Min. :1.0 Min. :1.000
1st Qu.:3.000 1st Qu.:3.000 1st Qu.:1.00 1st Qu.:1.0 1st Qu.:2.000
Median :4.000 Median :4.000 Median :2.00 Median :2.0 Median :4.000
Mean :3.735 Mean :3.668 Mean :2.27 Mean :1.9 Mean :3.325
3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:3.00 3rd Qu.:2.0 3rd Qu.:4.000
Max. :5.000 Max. :5.000 Max. :5.00 Max. :5.0 Max. :5.000
NA's :1 NA's :2 NA's :1 NA's :1 NA's :1
This gives you the answer to the whole thing right away: check that the max is no bigger than 5, the min is no smaller than 1, and that there are a few NA values.
To go the tidyverse way, it’s easier if you pivot-longer first: it then becomes a group-by and summarize:
cognition %>%
  pivot_longer(starts_with("c"), names_to = "qname", values_to = "response") %>%
  group_by(qname) %>%
  summarize(mn = min(response, na.rm = TRUE),
            mx = max(response, na.rm = TRUE),
            nas = sum(is.na(response)))
There are some gotchas here:
- min and max of anything with missing values is missing (as for a mean), and the workaround, as for a mean, is to explicitly remove the missings before finding the min or max.
- is.na returns TRUE if a value is missing and FALSE otherwise. Remember that you can do arithmetic with these, in which case TRUE counts as 1 and FALSE as 0, so that summing them up counts the number of 1's, that is, the number of TRUEs, that is, the number of missing values. The sum(is.na(x)) thing is rather an R idiom: when you have seen it enough times to memorize it, you can code it without thinking too much.
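If you would rather not pivot longer, across will run the same three summaries on every c-column and keep the dataframe wide (a sketch of an equivalent approach):
cognition %>%
  summarize(across(starts_with("c"),
                   list(mn = ~ min(.x, na.rm = TRUE),
                        mx = ~ max(.x, na.rm = TRUE),
                        nas = ~ sum(is.na(.x)))))
This gives one wide row with columns like c1_mn, c1_mx, and c1_nas: the same information as the long version, just harder to scan.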
Whichever way you do it, there are 201 respondents, all the min scores are 1 and all the max scores are 5, and there are only a few missing values.
- Create and save a dataframe in which the respondents with any missing values at all are dropped.
The handy function for this is drop_na. Running it without any inputs checks all the columns for missing values, which is exactly what we want:
cognition %>% drop_na() -> cognition_clean
cognition_clean
We now have 195 respondents, so we only lost 6 of them in getting rid of the missing values.
I created a new dataframe, but you can most likely overwrite the original.
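If you are curious which six respondents were dropped, an anti_join by id will show you their (incomplete) rows (a sketch):
cognition %>% anti_join(cognition_clean, by = "id")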
- Run a principal components analysis on your new dataframe, having removed the column of IDs. Make a screeplot.
This:
cognition_clean %>% select(-id) -> cognition0
cognition.1 <- princomp(cognition0, cor = TRUE)
ggscreeplot(cognition.1)
There are some naming issues here. I needed a temporary name for the version of my cleaned dataframe without id, so that I could feed it into princomp. Or, you could do it this way:
cognition_clean %>%
  select(-id) %>%
  princomp(cor = TRUE) -> cognition.1a
ggscreeplot(cognition.1a)
They both get to the same place (the scree plot is the same), and you might argue that the second way is cleaner, since it avoids making the temporary dataframe cognition0, but to use it you'll need to recognize that the (invisible) dataframe input to princomp is "whatever came out of the previous step".
- Based on your screeplot, how many principal components do you think you should use? Explain briefly (in your head, if nowhere else).
We are looking for an elbow. There is a big one at 2, meaning one principal component; there is a small one at 4 (meaning 3), and another small one at 6 (meaning 5 principal components). Beyond that, you might see more small elbows (say at 9, or 12, or even 17), but considering that there were only 18 questions, having that many principal components isn’t really offering any meaningful insight. The goal of principal components is to “reduce the dimensionality of the data”, and so you would like to have a lot fewer principal components than you have variables.
- Look at the summary output of your principal components analysis. What does this tell you about your proposed number(s) of components?
All right, so this:
summary(cognition.1)
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
Standard deviation 2.4040402 1.17972556 1.10357933 1.00886083 0.99653725
Proportion of Variance 0.3210783 0.07731958 0.06766041 0.05654445 0.05517147
Cumulative Proportion 0.3210783 0.39839788 0.46605829 0.52260274 0.57777421
Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
Standard deviation 0.94897885 0.94008101 0.87702947 0.83063143 0.82292935
Proportion of Variance 0.05003116 0.04909735 0.04273226 0.03833048 0.03762293
Cumulative Proportion 0.62780537 0.67690272 0.71963498 0.75796546 0.79558838
Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
Standard deviation 0.80748393 0.75681360 0.73777569 0.7041429 0.66084053
Proportion of Variance 0.03622391 0.03182038 0.03023961 0.0275454 0.02426168
Cumulative Proportion 0.83181229 0.86363267 0.89387228 0.9214177 0.94567936
Comp.16 Comp.17 Comp.18
Standard deviation 0.62244632 0.5498780 0.53662497
Proportion of Variance 0.02152441 0.0167981 0.01599813
Cumulative Proportion 0.96720377 0.9840019 1.00000000
Look at the Cumulative Proportion (of variance explained) by your candidate numbers of components. This is like R-squared:
- 1 component explains 32% of the variance (which seems small)
- 3 components explain 47% of the variance
- 5 components explain 58% of the variance
My take is that one component explains too little of the variance, and that either of the others could be considered acceptable. To square this with the screeplot: there is a long “tail” of scree that falls off very slowly, so it is impossible to explain a very large fraction of the variance without taking a lot of components. There just is a lot of random variation, as you would probably expect in data from humans. It all comes down to what you think is tolerable, or, to put it another way, where you think the mountain ends and the scree begins. (The number of components is what you think is the last point on the mountain, and after that is all scree. Saying that one component explains too little of the variance is the same thing as saying that its elbow, at 2, is too far up the mountain.)
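If you want just the cumulative proportions of variance explained, without the rest of the summary table, you can compute them from the component standard deviations stored in the princomp output (a sketch):
cumsum(cognition.1$sdev^2) / sum(cognition.1$sdev^2)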
- Look at the loadings on component 1. Can you explain the pattern of plus and minus signs on the loadings? (Hint: look at the description of the items linked above.)
cognition.1$loadings
Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
c1 0.251 0.136 0.110 0.140 0.155 0.428 0.568
c2 0.309 0.199 0.223 0.404
c3 -0.253 0.109 0.565 -0.177 0.138 -0.249 0.131
c4 -0.275 0.275 0.103 0.205 -0.125 0.166 0.324 0.204
c5 -0.289 0.170 0.362 -0.201 -0.131 0.166 0.128
c6 0.183 0.134 0.267 -0.123 -0.702 0.327 -0.282
c7 -0.227 0.166 -0.143 0.176 0.470 -0.166 -0.375
c8 -0.206 0.117 0.205 -0.414 0.577 -0.205 0.204
c9 -0.234 0.203 0.315 -0.180 -0.150 0.230 -0.386
c10 0.259 0.129 -0.280 -0.235 -0.372 -0.277
c11 0.282 0.102 -0.122 0.359 -0.218 -0.146 0.418
c12 -0.259 -0.147 0.443 -0.111 -0.127 -0.136 -0.311 0.111
c13 0.232 0.429 -0.116 0.264 0.117 -0.144
c14 0.230 0.389 0.155 -0.313 0.128 -0.321 -0.193
c15 0.169 0.317 0.456 0.243 -0.133 0.324 0.258 -0.156 -0.498
c16 -0.164 0.373 -0.333 -0.257 -0.341 0.389 0.283
c17 -0.229 0.398 -0.290 -0.150 -0.351 0.215 -0.304
c18 0.692 0.191 -0.345 0.185 -0.234 0.219
Comp.11 Comp.12 Comp.13 Comp.14 Comp.15 Comp.16 Comp.17 Comp.18
c1 0.157 0.325 0.414 0.131 0.150
c2 0.153 0.248 0.102 -0.112 -0.449 -0.450 -0.328
c3 -0.414 -0.184 0.322 -0.101 -0.187 -0.306 0.116
c4 0.219 0.278 -0.135 -0.552 0.153 0.358
c5 -0.146 -0.264 -0.420 0.178 -0.582
c6 -0.285 0.154 0.222 -0.103
c7 -0.310 0.490 0.309 0.129 -0.106
c8 -0.232 0.216 -0.398 0.218
c9 0.519 0.306 0.252 0.269 -0.129
c10 0.340 -0.103 -0.179 0.362 0.401 -0.329
c11 0.185 -0.292 -0.269 0.541 0.112
c12 0.551 0.206 -0.220 -0.182 0.334
c13 -0.294 -0.127 0.415 -0.339 -0.290 0.345 -0.171
c14 0.286 -0.388 -0.212 0.284 -0.222 0.326
c15 0.202 -0.275 0.107
c16 -0.334 -0.202 0.122 0.314
c17 0.149 0.316 -0.210 -0.372 0.151 -0.267
c18 -0.264 -0.253 0.207 0.189
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
Proportion Var 0.056 0.056 0.056 0.056 0.056 0.056 0.056 0.056 0.056
Cumulative Var 0.056 0.111 0.167 0.222 0.278 0.333 0.389 0.444 0.500
Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15 Comp.16 Comp.17
SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
Proportion Var 0.056 0.056 0.056 0.056 0.056 0.056 0.056 0.056
Cumulative Var 0.556 0.611 0.667 0.722 0.778 0.833 0.889 0.944
Comp.18
SS loadings 1.000
Proportion Var 0.056
Cumulative Var 1.000
Looking down the column Comp.1, the negative signs are on items 3, 4, 5, 7, 8, 9, 12, 16 and 17. If you look at the description of the items (linked in the description of this scenario), you'll see that these are exactly the items that are reverse-coded. Hence a high score on component 1 goes with "strongly agree" on all the normally-coded items and with "strongly disagree" on the reverse-coded ones: that is, with a high need for cognition overall.
In many principal components analyses, the loadings on component 1 are often similar in size, so that component 1 is overall-big vs overall-small, a “measure of size” you might say. Here, that plays out in a high need for cognition overall vs. a low one.
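If you want to check the signs without scanning the big table by eye, you can pull out just the component-1 loadings and sort them (a sketch):
tibble(item = rownames(cognition.1$loadings),
       comp1 = cognition.1$loadings[, 1]) %>%
  arrange(comp1)
The reverse-coded items should then appear at the top (most negative loadings) and the normally-coded ones at the bottom.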
- What items does component 2 mainly depend on? (For me, there were five loadings noticeably higher than the rest.) What do these items seem to have in common?
The loadings for me on component 2 are highest for items 13 through 17, two of which (16 and 17) are reverse-coded. Let’s take a look at these, paraphrased:
- 13: I like having puzzles I must solve
- 14: I like thinking abstractly
- 15: I prefer tasks that are intellectual/difficult/important
- 16: I feel relief (not satisfaction) after a task requiring mental effort
- 17: I want to get the job done, not think about how or why
I guess the thing these have in common is that a high-scorer likes thinking, but also likes the thinking to be over. (Or, likes thinking some, but not too much.)
Extra: In practice, you would now go on and interpret the other components you thought were important. For example, component 3 depends mainly on item 18 (thinking about issues that don’t affect you), and component 4 depends on items 8 (negatively), 12, and 15, and also maybe 5. Untangling the negative sign, this is preferring to think about long-term projects, not wanting to learn new ways to think, preferring intellectual tasks, and maybe preferring to avoid having to think in depth. This is an odd mixture of normal and reverse-coded items; maybe it indicates a person who likes to think only about certain things and in certain ways.
- Make a biplot of these data, using the id values from your cleaned (no NAs) dataset.
Just this:
ggbiplot(cognition.1, labels = cognition_clean$id)
The arrows point mostly left and right. If you can see which is which, the normally-coded items point to the right and the reverse-coded ones to the left. The items that point upward more are the ones that feature in component 2.
Extra: to get the same plot, minus the arrows, plot the scores on component 1 vs. the scores on component 2, plotting each id as text rather than as a circle:
cbind(id = cognition_clean$id, cognition.1$scores) %>%
as_tibble() %>%
ggplot(aes(x = Comp.1, y = Comp.2, label = id)) + geom_text()
- Find a person who scored low on component 1. By looking at their original data, show that it makes sense that they would have scored low on component 1.
The prime candidate seems to be person 68, way over on the left:
cognition_clean %>% filter(id == 68)
To score low on component 1, a person needs to score low on the normally-coded questions and high on the reverse-coded ones. The reverse-coded ones were 3–5, 7–9, 12, 16, and 17. Eyeball the scores to see that this is about right, with the exception of items 12 (unexpectedly low) and 15 (unexpectedly high).
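To make the eyeballing a little more systematic, you could lay person 68's responses out long, next to a flag for whether each item is reverse-coded (a sketch; the reversed vector just encodes the list of items given above):
reversed <- c("c3", "c4", "c5", "c7", "c8", "c9", "c12", "c16", "c17")
cognition_clean %>%
  filter(id == 68) %>%
  pivot_longer(starts_with("c"), names_to = "item", values_to = "response") %>%
  mutate(reverse_coded = item %in% reversed)
A low scorer on component 1 should then have mostly low responses where reverse_coded is FALSE and mostly high ones where it is TRUE.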
Extra: this assessment turns out to be easier here than it usually is, because we know that 1 is low and 5 is high for all the items, so there is no question about how low is low or how high is high. In the places rated example in lecture, the variables were on different scales, so we had to compare an individual’s values with a summary, or work out percentile ranks, to see whether the value on a variable was high or low.
- Find a person who scored high on component 2. By looking at their original data, show that this too makes sense.
I picked person 57, at the top of the graph. Component 2 only depended on questions 13 through 17, so these are the only ones we need to look at:
cognition_clean %>% filter(id == 57) %>%
  select(id, c13:c17)
To be a high scorer on component 2, the person would need to score high on these items, and mostly, it seems that they did.
Footnotes
[1] Rugby league is also played in Australia, traditionally around the city of Sydney, and in places in the Pacific like New Zealand and Papua New Guinea.
[2] As I re-read this in 2025, there is now only one team in London, but there is a team in Goole (the Goole Vikings). You might wish to find that on your map.
[3] There are actually two teams in Hull (green), and two in London (one red and one blue, which is why London appears to be purple).
[4] This was rather evidently made by putting a textbook on top of a photocopier and saying "scan this".