library(tidyverse)
library(MASS, exclude = "select")
Worksheet 10
Packages
Protein intake
For 25 European countries, the percent of protein its inhabitants1 get from various food sources. The food sources are: red meat, white meat, eggs, milk, fish, cereals, starch, nuts, fruit and vegetables.2 The original data, which we are not going to use, are in http://www.biz.uiowa.edu/faculty/jledolter/DataMining/protein.csv. The values date from 1973, and so some of the country names reflect how things were in those days (e.g. West Germany, Czechoslovakia, USSR3). Optionally, take a look at these data to see the kind of thing we have.
We are going to do a hierarchical cluster analysis with these data, for which we need dissimilarities. See the first question below.
- I wanted to have you work with the standardized data, to retain the country names, and to make dissimilarities. This is beyond the scope of what I expect you to be able to do at this point, and so I made the dist object for you. It is in https://www.utsc.utoronto.ca/~butler/food-dist.rds. Read it in using read_rds and display its first six values using head. (read_rds works the same way as the other read_ functions: give it a file name or a URL and save the thing read in into a variable.)
Solution
Act as if you know exactly what you are doing:
<- "https://www.utsc.utoronto.ca/~butler/food-dist.rds"
my_url <- read_rds(my_url)
protein head(protein)
[1] 7.471269 7.396024 3.401416 6.198577 8.160712 7.815497
For yourself, but definitely not to hand in, you can look at the whole thing:
protein
Albania Austria Belgium Bulgaria Czechoslovakia Denmark
Austria 7.471269
Belgium 7.396024 3.099220
Bulgaria 3.401416 5.957859 6.554831
Czechoslovakia 6.198577 2.656800 2.868434 4.777270
Denmark 8.160712 3.840392 3.234461 7.452355 4.254058
E Germany 7.815497 3.181823 2.767546 6.603974 2.377336 3.451644
Finland 7.194956 4.933604 4.379966 7.111582 4.823128 3.204996
France 7.957913 4.799181 2.906774 7.137557 4.502379 4.822257
Greece 5.409828 6.096708 5.669836 4.620622 5.763639 6.705858
Hungary 5.737880 4.039983 5.155925 4.051117 3.391800 6.307761
Ireland 8.374995 3.482975 2.062536 7.747775 4.018898 3.681214
Italy 5.121349 4.425125 4.652645 3.568798 4.014265 5.890457
Netherlands 7.258421 1.404659 2.831744 6.245991 2.661273 3.234770
Norway 6.789150 4.742837 3.780110 6.510339 4.329972 2.351019
Poland 7.245466 3.503826 3.916044 5.419236 2.680092 4.919476
Portugal 8.414011 8.154452 7.274264 7.544359 6.937859 7.312604
Romania 3.359990 5.682885 6.066833 2.329924 4.324704 6.868701
Spain 7.066255 5.846375 5.020004 5.985244 4.998131 6.223245
Sweden 6.971704 3.613096 3.290967 6.644175 4.025521 1.678281
Switzerland 6.282687 2.821679 2.918096 5.541022 3.264327 4.078588
UK 7.607015 4.784069 2.471612 7.445089 4.936572 4.474747
USSR 5.356583 5.012363 3.986679 4.708797 3.191555 5.102093
W Germany 7.758902 2.068391 1.730857 6.869085 2.721744 3.047377
Yugoslavia 3.769951 6.668188 7.140810 2.522356 5.310936 7.906832
E Germany Finland France Greece Hungary Ireland Italy
Austria
Belgium
Bulgaria
Czechoslovakia
Denmark
E Germany
Finland 4.919068
France 5.062210 5.932336
Greece 6.693762 6.671574 5.622348
Hungary 4.507261 6.561561 6.633202 5.029849
Ireland 3.900657 4.123558 4.076157 6.953154 6.137447
Italy 5.232568 6.066943 4.961277 2.560019 3.864237 6.046157
Netherlands 3.099948 4.038848 4.546428 6.087472 4.289232 3.011517 4.679145
Norway 3.958945 2.569742 5.152408 5.597229 6.042591 4.636293 4.988488
Poland 3.395950 5.136490 4.971731 5.265513 3.713577 4.914018 3.762990
Portugal 6.564306 8.137470 7.408134 6.010110 7.176570 9.030333 5.941018
Romania 5.845188 6.190826 7.230309 4.591958 2.988383 7.083140 3.966247
Spain 4.863365 6.738939 5.786259 3.823517 4.727567 6.613365 3.550828
Sweden 3.734958 2.520495 5.032349 6.004846 5.769452 3.680763 5.124902
Switzerland 4.487664 4.410449 3.180289 4.887186 4.924540 3.539993 3.620051
UK 5.082709 5.075806 3.220823 5.857390 6.719047 2.785847 5.489949
USSR 4.097469 4.215669 5.508253 5.047479 4.171149 4.895119 4.438906
W Germany 2.398274 4.433868 3.878960 6.374982 4.893410 2.264241 5.010897
Yugoslavia 6.765217 7.123807 8.228402 5.053126 3.642689 8.173209 4.578370
Netherlands Norway Poland Portugal Romania Spain
Austria
Belgium
Bulgaria
Czechoslovakia
Denmark
E Germany
Finland
France
Greece
Hungary
Ireland
Italy
Netherlands
Norway 4.075548
Poland 3.463712 4.645983
Portugal 7.941863 6.018194 6.010917
Romania 5.633626 5.809400 4.858136 7.140846
Spain 5.836904 5.129975 4.021494 3.795000 5.335352
Sweden 2.931571 1.854053 4.852402 7.377085 6.013290 5.906307
Switzerland 2.417076 4.226289 3.984684 7.773057 5.491966 5.673510
UK 4.501055 4.754449 5.970657 8.586599 7.125025 6.207222
USSR 4.611688 4.042494 3.603501 6.473995 3.425689 4.552131
W Germany 1.610811 4.054809 3.835623 7.742660 6.288741 5.581562
Yugoslavia 6.707908 6.730336 5.499985 7.373903 1.249137 5.753091
Sweden Switzerland UK USSR W Germany
Austria
Belgium
Bulgaria
Czechoslovakia
Denmark
E Germany
Finland
France
Greece
Hungary
Ireland
Italy
Netherlands
Norway
Poland
Portugal
Romania
Spain
Sweden
Switzerland 3.454412
UK 4.160704 3.623004
USSR 4.621931 4.670755 5.270167
W Germany 3.050378 2.835199 3.635288 4.702489
Yugoslavia 7.075591 6.570428 8.213646 4.223718 7.386264
It looks like the dist
objects you have seen before, though perhaps bigger.
Extra 1: RDS format is a way of saving any R object into a file, to share with other people, or to save for yourself (if, for example, it takes a long time to compute, like the results of a simulation). This is more flexible than what you have learned so far, because all we know is how to read in dataframes, and a dist
object is not a dataframe. (If you read the Extra below, you will see how much work I had to do to turn it into a dataframe to make it “tidy”.) read_rds
reads a saved R object from a file or URL, and write_rds
saves an R object into a file. You will see me use this below.
A rather more old-fashioned way of doing this is via the save
and load
mechanism. The problem with this is that load
creates an object in your workspace that has the same name as the object I saved. This always strikes me as a weird way of operating: why wouldn’t you want to choose your own name?
read_rds
and write_rds
are part of the tidyverse
, specifically part of readr
where all the other read_
functions live. So they have the same syntax as for reading anything else: a filename or URL for read_rds
. Base R has two functions called readRDS
and saveRDS
that do the same thing, but for me the names don’t match: I think the opposite of “read” is “write”, and the opposite of “save” is “load”. So I always had a hard time remembering what the names of the base R functions are, and eagerly grabbed the chance to use the tidyverse
ones that had names I could remember.
read_rds
and write_rds
are, I understand, nothing more than readRDS
and saveRDS
with better names.
Extra 2: I want to tell you what I did to get the distances in the form you just read in. As I said, I wanted to standardize the original values (if you look at the original data, even though they are all percentages, some of the columns are bigger and more variable than others, and I didn’t want that to affect the clustering). I also wanted to keep the country names, and this is more of an adventure than you really want to be dealing with yourself.
So: begin with the original data file, which goes like this:
<- "http://www.biz.uiowa.edu/faculty/jledolter/DataMining/protein.csv"
my_url <- read_csv(my_url) food
Rows: 25 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Country
dbl (9): RedMeat, WhiteMeat, Eggs, Milk, Fish, Cereals, Starch, Nuts, Fr&Veg
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
food
25 rows (countries) and 9 columns of food categories, in addition to the column of country names.
The only way to keep a column like Country
when making a dist
object is to have it be row names. To make that, we use column_to_rownames
. column_to_rownames
takes a column, turns it into rownames, and gets rid of the original column:
food %>%
  column_to_rownames("Country")
The column to become row names has to be in quotes, which I always seem to miss.
Note how the column of countries no longer has a heading (or a name of any kind); that’s how you know it is now row names.
I found a function that did everything I wanted the rest of the way, which seemed easier than figuring out scale
(see the lecture on K-means clustering) and then dist
. It lives in package cluster
and is called daisy
:4
library(cluster)
%>% column_to_rownames("Country") %>%
food daisy(stand = TRUE) -> d
The stand = TRUE
says to standardize the data before computing the distances.
Let’s check that at least the first few of these match what we read in from the file:
head(d)
[1] 7.471269 7.396024 3.401416 6.198577 8.160712 7.815497
head(protein)
[1] 7.471269 7.396024 3.401416 6.198577 8.160712 7.815497
They do. Here is a better check:
all(d == protein)
[1] TRUE
Asking “are all the elements of d
equal to the corresponding elements of protein
?”, and getting the answer “yes”.
And some other ways to do the same thing:
identical(d, protein)
[1] TRUE
all.equal(d, protein)
[1] TRUE
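By the way, if you did want to go the scale-and-dist route mentioned above, a sketch would look like this (d_alt is just a name I made up; note that daisy(stand = TRUE) standardizes using the mean absolute deviation rather than the standard deviation, so these distances will be close to, but not exactly equal to, the ones from daisy):

food %>%
  column_to_rownames("Country") %>%
  scale() %>%       # standardize each column
  dist() -> d_alt   # Euclidean distances, with country names as labels
head(d_alt)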
\(\blacksquare\)
- Run a hierarchical cluster analysis using single linkage, and display the dendrogram.
Solution
“Hierarchical” means to use the distances that you read in from the file:
.1 <- hclust(protein, method = "single")
proteinplot(protein.1)
Three points is rather generous for that, really, but this is the key part of the question, so I guess it works.
\(\blacksquare\)
- What characteristic of single linkage is displayed on your dendrogram? Explain briefly.
Solution
There are a lot of instances of a single country being joined on to a cluster containing several other countries. For example, starting with the cluster containing Austria and the Netherlands, West Germany, Belgium, Ireland and Switzerland are joined on in sequence. This is characteristic of single linkage, because each individual (country, here) only has to be close to one other individual in a cluster, not to all the individuals in a cluster.
Extra: I wanted you to see this, even though we are not going to do anything further with this analysis. Think about how you would pick a good number of clusters from this dendrogram: I think you would find it extremely difficult!
\(\blacksquare\)
- Run a hierarchical cluster analysis using Ward’s method, and display the dendrogram.
Solution
Again, use the distances. The only real difficulty here is in remembering how to specify Ward’s method, with a lowercase W and an uppercase extra D:
.2 <- hclust(protein, method = "ward.D")
proteinplot(protein.2)
You might like to note for yourself that Ward has (as usual) formed small clusters and then joined those, at least for the most part.
\(\blacksquare\)
- What seems to be a sensible number of clusters? Add these to your plot.
Solution
I can see about two sensible choices here, based on long vertical periods where the clustering stays the same: 2 clusters or 5 clusters.
To draw these, use rect.hclust
. This is a base R plot
(rather than a ggplot
) so you will probably need to draw the graph again first:
plot(protein.2)
rect.hclust(protein.2, 2)
If you went with 5, you get this:
plot(protein.2)
rect.hclust(protein.2, 5)
I like the result of the 5 clusters better, but looking at the dendrogram, it’s hard to argue with a choice of 2 clusters.
\(\blacksquare\)
- What do the countries in each of your clusters seem to have in common, based on what you know or can find out? Explain briefly.
Solution
The easiest way to tackle this is to haul out Google Maps and find all these countries. Then you can ask yourself what they have in common geographically. Or you could ask yourself what they have in common culturally.
If you said two clusters, it’s probably easier to start with the smaller cluster on the right. These are four countries in southeastern Europe (“the Balkans”), and four southern European countries that are commonly described as “Mediterranean”. So these are southern, and the other cluster is “the rest”, or “Western and Northern”.
Five clusters actually divide up rather interestingly, if you know some history. From left to right:
- Scandinavia, or Northern.
- Eastern, or former Soviet Bloc communist countries.
- Western and Central (or “the rest” again).
- Southeastern, or Balkan, or the southern part of former-Communist countries (these were in the Soviet Bloc as well).
- Southern, or Mediterranean as they are often described.
The data came from 1973, and so the now former-Communist countries were still Communist. At that time, it makes sense to think that these countries had a different (probably worse) diet than those in the west. For example, the two Germanies, the capitalist West and the communist East, ended up in different clusters (of the 5), and even though Albania and Greece are next door to each other geographically, the former was Communist and the latter not at that time, and they also ended up in different clusters of the 5.
Extra: I was thinking about choosing between numbers of clusters in this context. There is (considerable) discussion about that in the K-means question (if you haven’t gotten to that yet, come back to this later).
One of the methods I talk about is due to Hartigan, and is based on total within-cluster sums of squares. The corresponding thing to that here is the sum of squared distances between individuals that are in the same cluster. Let’s see if we can work that out for our five-cluster solution, and then worry about generalizing to other numbers of clusters.
I think I want to start by making my distance object tidy. There is some awkwardness here, which I talk about in a moment:
protein %>% as.matrix() %>%
  as.data.frame() %>%
  rownames_to_column("row") %>%
  pivot_longer(-row, names_to = "column", values_to = "distance") %>%
  filter(row < column) -> protein_long
protein_long
- It seems not to be possible to turn a dist object directly into a dataframe, so it goes in two steps: turn it into a matrix first, and then turn that into a dataframe. (The reason is probably a historical one: matrices and dist objects go back into the far history of R, and dataframes don’t go back so far, in the same way that in Python, arrays are part of numpy and dataframes live in pandas.)
- In the tidyverse, row names are sort of an unwelcome intrusion; they are not handled in any sort of proper way. tibble-type dataframes are not allowed to have row names, but old-fashioned data.frame dataframes are allowed to have them. So if we use as_tibble to convert the matrix to a dataframe, we will lose the country names that we are going to such trouble to keep. The way around this is to make an old-fashioned data.frame instead from the things in the matrix.
- To do anything tidyverse with those country names, we have to make them into an actual column. Here, we create a column called row5 that contains the country names.
- I only want to count the distances once, not twice, so I made sure that the thing called row is alphabetically before the thing called column.
- Finally, I can make this tidy. The dataframe protein_long has a column called row, one called column (these are both countries), and a column distance that is the dissimilarity between them.
Next, I want to get hold of which cluster each country is in. This is cutree
:
cutree(protein.2, 5) %>% enframe() -> cluster
cluster
The enframe
, as in the lecture notes, makes a dataframe out of the countries and which cluster they belong to.
Now, we want to find out the sum of squared distances between the countries that are in the same cluster. The distances are here:
protein_long
and what we need to do now is to look up the cluster membership of the thing in row
first. “look up” implies left_join
:
%>% left_join(cluster, by = c("row"="name")) protein_long
The column called value
contains the cluster number of the country in row
, which is Albania on all the ones you can see (scroll down to see others).
Now we do the same thing again, but for the countries in column
:
%>% left_join(cluster, by = c("row"="name")) %>%
protein_long left_join(cluster, by = c("column"="name"))
Now we have two value
columns. The first one is from the lookup of the country in row
, and the second from the lookup of the country in column
. They would both be called value
, which would be very confusing, so left_join
gives them different names so that you can tell which one is which.
How do we determine whether two countries are in the same cluster or not? If they are in the same cluster, value.x
and value.y
will be the same:
%>% left_join(cluster, by = c("row"="name")) %>%
protein_long left_join(cluster, by = c("column"="name")) %>%
filter(value.x == value.y)
There are 56 pairs of countries in the same cluster: Albania and Bulgaria, … , Austria and Belgium, (scrolling down) Belgium and France, and so on.
The last step here is to work out the total within-cluster sum of squared distances. We have now gotten rid of all the pairs of countries in different clusters, so now this is literally the sum of squares of the distances you see here:
%>% left_join(cluster, by = c("row"="name")) %>%
protein_long left_join(cluster, by = c("column"="name")) %>%
filter(value.x == value.y) %>%
summarize(ssq = sum(distance^2))
We could do this all again for 2 clusters,6 but we’re going to be doing this for other numbers of clusters as well, so we should stop and write a function for this. Let’s keep the ingredients as general as we can, so we’ll make the inputs the long dataframe of distances,7 the output from hclust
from which we obtained the cluster memberships, and the number of clusters. With a generous amount of copying, pasting and editing, we come up with this (there are no new ideas here):
within_cluster_ssq <- function(dist_long, hc, how_many) {
  # cluster membership for that many clusters
  cutree(hc, how_many) %>% enframe() -> cluster
  # look up cluster memberships, save ones in the same cluster, work out
  # sum of sq distances
  dist_long %>% left_join(cluster, by = c("row"="name")) %>%
    left_join(cluster, by = c("column"="name")) %>%
    filter(value.x == value.y) %>%
    summarize(ssq = sum(distance^2)) %>%
    pull(ssq)
}
So, I lied: there is one new thing, pull, to return the sum of squares as a plain number.
I had to be a little careful with names because there are different things that might have cluster
in their name: the output from hclust
, here called hc
, the input desired number of clusters (called how_many
), and the cluster memberships of each individual (here actually getting the name cluster
).8
Let’s try it on one where we know the answer:
within_cluster_ssq(protein_long, protein.2, 5)
[1] 622.9649
Success! (Grab yourself a coffee.)
All right, so let’s do this for more numbers of clusters:
tibble(clusters = 1:10) %>%
rowwise() %>%
mutate(ssq = within_cluster_ssq(protein_long, protein.2, clusters)) -> ss
ss
If you have read ahead to the K-means stuff, you are probably thinking right now “scree plot”. Yes indeed. Let’s draw one. (As I write this, I am thinking about how to introduce this idea in hierarchical cluster analysis so that we can do a scree plot both ways.):
ggplot(ss, aes(x = clusters, y = ssq)) + geom_point() + geom_line()
Well, that one’s pretty clear. A big elbow at two clusters, and another elbow at 5. Those were the two possible numbers of clusters we got from the dendrogram. As with K-means, you might take the viewpoint that the elbow at 2 clusters is a long way up the mountain, and so there ought to be more insight with 5 clusters than with 2. That was my opinion also from looking at the results: the 5 clusters really did seem to separate the countries in a meaningful way.
All right, Hartigan. I’m going to copy the ideas from the Extra to the K-means problem (see there for more detailed discussion):
ss %>%
  ungroup() %>%
  mutate(ssq_next = lead(ssq)) %>%
  mutate(mult = 25 - clusters - 1) %>% # 25 countries here
  mutate(stat = (ssq - ssq_next) / ssq_next * mult)
Now, we look down the stat
column for values bigger than 10, which mean we should add one more cluster. The value for 1 cluster is a lot bigger than 10, so we should look at 2 clusters. The value on the 2 line is less than 10, so 2 clusters is a place to stop. (If you have two clusters, there is no benefit to having 3 clusters over having two.) But there are some more values over 10: if you have 3 clusters, you should look at 4, and if you have 4, you should look at 5, but there is no value in looking at more clusters.
If you prefer, another way of looking at Hartigan is to see where the stat
column is less than 10. A smallish number of clusters where that is true, like 2 or 5, is a good place to stop looking.
Once again, the same story (which is therefore very consistent): 2 or 5 clusters are the best choices.
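If you prefer to let the code do the looking, here is a sketch that re-runs the calculation above and keeps only the rows where the Hartigan statistic falls below 10 (the candidate stopping points):

ss %>%
  ungroup() %>%
  mutate(ssq_next = lead(ssq)) %>%
  mutate(mult = 25 - clusters - 1) %>%
  mutate(stat = (ssq - ssq_next) / ssq_next * mult) %>%
  filter(stat < 10)   # numbers of clusters where adding one more doesn't help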
Aside: the Hartigan statistic for 6 clusters is almost 10, which suggests that you might also entertain 7 clusters. Sure enough, on the scree plot, there is the tiniest of elbows at 7. To look further at that:9
plot(protein.2)
rect.hclust(protein.2, 7)
Compared to the 5-cluster solution, there are two more splits: France and the UK have split off from the rest of Western Europe,10 and Portugal and Spain have split off from the other Mediterranean countries.
\(\blacksquare\)
Eau-de-vie k-means
Eau-de-vie is an alcoholic drink. It is a type of brandy,11 but is distinct from traditional brandy (which is distilled from grapes) in that it is distilled from fruit. In the data set that we will investigate, 77 samples of three different types of eau-de-vie were obtained. These types were “poire”, made from pears, “mirabelle”, made from plums,12 and “kirsch”,13 made from cherries. For each sample, the content of six different chemicals was measured. Our aim is to find out whether the chemical content of an eau-de-vie could be used to identify what kind of fruit it was made from.
The chemicals measured were:
meoh: Methanol
acet: Ethyl acetate
bu1: Isobutanol
mepr: Monoterpenes
acal: Acetaldehyde
lnpro1: 1-propanol (apparently on a log scale)
The data are in http://ritsokiguess.site/datafiles/eau-de-vie.csv.
- Read in and display (some of) the data.
Solution
<- "http://ritsokiguess.site/datafiles/eau-de-vie.csv"
my_url <- read_csv(my_url) eaudevie
Rows: 77 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): type
dbl (6): meoh, acet, bu1, mepr, acal, lnpro1
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
eaudevie
There are 77 rows (samples), and the columns are the type of eau-de-vie (categorical) and the six chemicals.
\(\blacksquare\)
- By doing some calculation or making a graph, explain briefly why it would be a good idea to standardize the quantitative variables in this dataframe.
Solution
They are measured on very different scales. The easiest way to demonstrate this is via summary
:
summary(eaudevie)
type meoh acet bu1
Length:77 Min. : 3.0 Min. : 13.0 Min. : 0.20
Class :character 1st Qu.: 620.0 1st Qu.:127.0 1st Qu.: 9.30
Mode :character Median : 910.0 Median :181.0 Median :17.00
Mean : 845.6 Mean :211.7 Mean :14.74
3rd Qu.:1087.0 3rd Qu.:287.0 3rd Qu.:20.00
Max. :1548.0 Max. :495.0 Max. :30.00
mepr acal lnpro1
Min. : 9.00 Min. : 2.00 Min. :3.300
1st Qu.:26.00 1st Qu.: 8.60 1st Qu.:4.280
Median :33.00 Median :11.00 Median :5.260
Mean :35.76 Mean :12.51 Mean :5.301
3rd Qu.:46.00 3rd Qu.:15.00 3rd Qu.:6.220
Max. :72.00 Max. :28.00 Max. :8.010
The medians range from 5.26 (lnpro1
) to 910 (meoh
). In addition, the IQR values appear to be bigger when the median is bigger.14 These are very different scales.
Another way you might do it is to calculate some statistics, such as the mean and SD (or median and IQR) for all the quantitative columns:
eaudevie %>%
  summarize(across(where(is.numeric),
            list(m = ~mean(.), s = ~sd(.))))
If you scroll across, you’ll see that the bigger SDs tend to go with the bigger means, as well as the means being very different. (They are mean-SD pairs for each variable.)
If you didn’t think of this idea, you can also pivot-longer and summarize (using the same idea as for the boxplots you made before):
eaudevie %>%
  pivot_longer(-type) %>%
  group_by(name) %>%
  summarize(m = mean(value), sd = sd(value))
Once again, and perhaps more clearly, the means and SDs vary considerably, with the bigger SDs going with the bigger means.
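The median-and-IQR version of the same check is a small change to that code; a sketch, if you prefer more resistant summaries:

eaudevie %>%
  pivot_longer(-type) %>%
  group_by(name) %>%
  summarize(med = median(value), iqr = IQR(value))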
Or you could make a graph:
- we are not interested in
type
here (we are going to be making clusters without regard totype
) - a normal one-quantitative-variable plot is a histogram, but we are going to be comparing the plots for different variables, so I think a one-variable boxplot for each is better
- we want to compare the scales, so this time do not use
scales = "free"
:
eaudevie %>%
  pivot_longer(-type) %>%
  ggplot(aes(x = 1, y = value)) + geom_boxplot() +
  facet_wrap(~name)
This shows rather plainly that some of the variables have much bigger medians than others, and the ones that have bigger medians also have bigger spread (IQR).
Find a way to illustrate that some of the variables have much bigger means or medians than the others, and that the ones with bigger means (medians) also have bigger spread.
Extra: having the bigger means and the bigger SDs going together suggests a transformation. Let me redraw the graph I just did, but this time taking logs of everything:
eaudevie %>%
  pivot_longer(-type) %>%
  ggplot(aes(x = 1, y = log(value))) + geom_boxplot() +
  facet_wrap(~name)
This has evened out the variability (the boxes are now about the same height regardless of the median), but the medians are still different. So, for the K-means clustering that we are going to do, it would still make sense to standardize the variables.
I note that maybe the log transformation has gone too far. We had (apparently) symmetric shapes before, but we now have a lot of low-end outliers.
\(\blacksquare\)
- Standardize all the quantitative variables. Explain briefly how you know that the standardization has worked.
Solution
The coding is copied straight from the lecture notes:
eaudevie %>% mutate(across(where(is.numeric), ~scale(.))) -> eaudevie_scaled
eaudevie_scaled
(in words, “go across each column that is numeric and scale it”.)
The result of the standardizing should look at least roughly like a \(z\)-score for each column. Eyeball what you see and say that, if that’s what you see (values mostly within \(\pm 2\)). There are a few values more extreme than that, but not many, and not (as I see it) greatly out of line with what you’d expect.
You can probably save this back into the original dataframe, since I don’t anticipate using the original data values in this question.
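If you would rather have a numerical check than eyeballing z-scores, a quick sketch is to verify that each scaled column now has mean 0 and SD 1 (this reuses the summarize-across idiom from earlier):

eaudevie_scaled %>%
  summarize(across(where(is.numeric), list(m = ~mean(.), s = ~sd(.))))
# the means should all be (essentially) zero and the SDs all 1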
Extra: if you want, you can find where the extreme scaled values are, eg:
eaudevie_scaled %>%
  rowwise() %>%
  mutate(row_max = max(c_across(meoh:lnpro1))) %>%
  arrange(desc(row_max))
These are the rows with an extreme high one. The syntax c_across
is a way to select the columns you want a row statistic for.
acal
and lnpro1
seem to have the most unusually high values (that would show up as upper outliers if you drew your boxplots above with scales = "free"
):
eaudevie %>%
  pivot_longer(-type) %>%
  ggplot(aes(x = 1, y = value)) + geom_boxplot() +
  facet_wrap(~name, scales = "free")
The lnpro1
ones don’t show up as outliers, but there is a slightly long upper tail.
\(\blacksquare\)
- Make a scree plot for these data. For this, (i) copy the appropriate function from the lecture notes, (ii) run it for each number of clusters from 1 to 15, (iii) draw the scree plot.
Solution
This is really all ideas copied straight from the lecture notes.15 The function you need is this one:
ss <- function(i, d) {
  d %>%
    select(where(is.numeric)) %>%
    kmeans(i, nstart = 20) -> km
  km$tot.withinss
}
Then, set up a dataframe with the numbers of clusters you want:
tibble(clusters = 1:15)
and then, rowwise, run the function on each number of clusters and your (scaled) dataset:
tibble(cluster = 1:15) %>%
rowwise() %>%
mutate(ssq = ss(cluster, eaudevie_scaled)) -> d
d
(I saved this, because I am going to use it again later, but you don’t have to) and then make a plot of it, joining the points by lines:
ggplot(d, aes(x = cluster, y = ssq)) + geom_point() + geom_line()
\(\blacksquare\)
- How many clusters does your scree plot suggest? Justify your choice briefly.
Solution
The clearest elbow is at 4 clusters (the total within-cluster SS is clearly less than for 3 clusters, and not much greater than for 5).
You might reasonably take the point of view that 4 clusters is still a long way up the mountain (the variability within clusters is still too high, and we could get clusters that are more uniform by taking more of them). It is rather difficult to see elbows further along this scree plot, and any ones you find are likely to be specific to your plot: mine has tiny elbows at 8 or 11 clusters.
Talk about your thought process. The best answer includes some discussion of how high up the mountain you are, even if you eventually decide on 4 clusters.
Extra 1: There are numerous more automatic ways to choose the number of clusters. This post talks about elbows, the “average silhouette method” and the “gap test”, which is due to Tibshirani, Walther and Hastie, who are big names in machine learning.16 See also my Extra 2.
The paper by Tibshirani et al. talks about a simpler method due to Hartigan.17 The idea is to start from one cluster, and look for the benefit of adding more clusters.
We start from the dataframe we made to draw the scree plot:
d
The statistic is based on how much the total within-cluster sum of squares decreases to the one for the next number of clusters, which we make a column of. This will not work rowwise
so I have to get rid of the rowwise
property first. You might have seen lag
before, which gets the value from the previous row. In this case, we want the value from the next row, which is done with lead
(the opposite of lag
: think, if you have studied economics, of leading and lagging indicators):
d %>%
  ungroup() %>%
  mutate(ssq_next = lead(ssq))
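(A tiny aside: here is lag versus lead on a toy dataframe, just to show which way each one shifts. This is not part of the analysis:)

tibble(x = c(10, 20, 30)) %>%
  mutate(previous = lag(x),     # value from the row before: NA, 10, 20
         following = lead(x))   # value from the row after: 20, 30, NA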
The statistic is the proportional reduction in these, times a multiplier that depends on the number of observations (77) and the number of clusters we are currently looking at:
d %>%
  ungroup() %>%
  mutate(ssq_next = lead(ssq)) %>%
  mutate(mult = 77 - cluster - 1) %>%
  mutate(stat = (ssq - ssq_next) / ssq_next * mult)
The thing stat
is a kind of \(F\)-statistic, a scaled proportional reduction in total within-cluster SS. According to Hartigan, we should keep adding clusters while stat
is bigger than 10. This says here that we should keep adding clusters up to 5: that is, until we get to 6 clusters, but we should not (quite) add a 7th cluster, because 9.673 is just less than 10.
It is actually a complete coincidence that I will have you obtain a solution with six clusters later. I did that part first, choosing 6 because the answers came out interesting. Then I came back and did this, and, lo and behold, six clusters is the answer we get here as well. (The big gap between 15 and 10 actually corresponds to the elbow at 4 on the scree plot: you should definitely add a 4th cluster, but after that the scree plot (maybe) says that you should not add any more, while Hartigan says that you should add a fifth and a sixth.)
Extra 2: another idea is to think about “how many clusters could we get by chance”. My thinking here was guided by this paper, which really applies to the kind of scree plot you get for principal components, so the details are specific to that, but the idea is something we can apply here. This, it turns out, is Tibshirani et al’s gap test. I didn’t realize this at first, and wrote out something different, but my original idea was heading in the right direction.
Let’s first generate some independent random (standard) normal18 data that is the same shape as ours (77 rows and 6 (quantitative) columns). The easiest way to do that is as a matrix
:
m <- matrix(rnorm(77*6), nrow = 77, ncol = 6)
as_tibble(m) -> mm
Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
`.name_repair` is omitted as of tibble 2.0.0.
ℹ Using compatibility `.name_repair`.
mm
The point about this is that all \(77 \times 6\) normal random numbers are independent, so there should be no actual structure here, and any clusters found are just chance. Here is the total within-cluster SS for 4 clusters, for these random data:
ss(4, mm)
[1] 284.9839
and the same thing that we did before for 1 through 15 clusters:
tibble(cluster = 1:15) %>%
rowwise() %>%
mutate(ssq = ss(cluster, mm)) -> d2
d2
Now, let’s see whether we can put that on top of the scree plot:
ggplot(d, aes(x = cluster, y = ssq)) + geom_point() + geom_line() +
geom_point(data = d2, colour = "red") + geom_line(data = d2, colour = "red")
The black curve is a long way below the red one, which says that the clusters found within our actual data are a lot less variable within clusters than we would expect by chance. That is to say, any of these numbers of clusters for our data is better than chance by quite a long way. Except that I don’t know whether this is “significantly” better than chance because there is no idea of variability. Read on.
There is a problem here: we only generated one set of random data. We should do more, to see how the total within-cluster SS might vary. Let’s start with 3 simulations, and scale up later.
To begin, this code:
tibble(cluster = 1:15) %>%
rowwise() %>%
mutate(ssq = ss(cluster, mm)) -> d2
d2
has already been repeated several times, so we should make it into a function. Its input is a dataframe of data to analyze, and a maximum number of clusters, which can default to the number of rows in the dataframe minus one (although we are going to use the 15 clusters that was our max before):
ssq_table <- function(df, max_cluster = nrow(df) - 1) {
  tibble(cluster = 1:max_cluster) %>%
    rowwise() %>%
    mutate(ssq = ss(cluster, df)) %>%
    ungroup()
}
The ungroup
is to get rid of the rowwise
here, since there will probably be other things that are also rowwise
.
This should work on (i) our actual data:
ssq_table(eaudevie_scaled, max_cluster = 15)
and the simulated data:
ssq_table(mm, max_cluster = 15)
and those numbers seem to correspond.
All right, simulate:
tibble(sim = 1:3)
Simulate what? Well, first generate some random normal data and make into a dataframe:
tibble(sim = 1:3) %>%
rowwise() %>%
mutate(random_normal = list(as_tibble(matrix(rnorm(77*6), nrow = 77, ncol = 6))))
That is a bit complicated, but it’s the same thing as we did before (go back and check). You can unnest
it if you want to check it for sensibleness.
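For example, a quick check by unnesting (this re-runs the simulation, so the random values will differ from run to run):

tibble(sim = 1:3) %>%
  rowwise() %>%
  mutate(random_normal =
           list(as_tibble(matrix(rnorm(77*6), nrow = 77, ncol = 6)))) %>%
  unnest(random_normal)
# 3 x 77 = 231 rows: 77 rows of 6 random normals for each simulation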
Then, get the total within-cluster SS for each number of clusters of each of these. This is simpler than you think, because we just wrote a function to do it:
tibble(sim = 1:3) %>%
rowwise() %>%
mutate(random_normal =
list(as_tibble(matrix(rnorm(77*6), nrow = 77, ncol = 6)))) %>%
mutate(ssq_tab = list(ssq_table(random_normal, max_cluster = 15)))
and take a look at it:
tibble(sim = 1:3) %>%
rowwise() %>%
mutate(random_normal =
list(as_tibble(matrix(rnorm(77*6), nrow = 77, ncol = 6)))) %>%
mutate(ssq_tab = list(ssq_table(random_normal, max_cluster = 15))) %>%
unnest(ssq_tab)
We have three copies of a random SSQ for each number of clusters (one per simulation).
Tibshirani et al say to find the mean and SD of the logs of those, which is nothing more than a group_by
and summarize
(though an unusual one):
tibble(sim = 1:3) %>%
rowwise() %>%
mutate(random_normal =
list(as_tibble(matrix(rnorm(77*6), nrow = 77, ncol = 6)))) %>%
mutate(ssq_tab = list(ssq_table(random_normal, max_cluster = 15))) %>%
unnest(ssq_tab) %>%
group_by(cluster) %>%
summarize(mean_log = mean(log(ssq)), sd_log = sd(log(ssq)))
Now, let’s glue the original total within-cluster SSs (from our data) onto the side of that, and take the logs of those, since everything is on a log scale:
tibble(sim = 1:3) %>%
rowwise() %>%
mutate(random_normal =
list(as_tibble(matrix(rnorm(77*6), nrow = 77, ncol = 6)))) %>%
mutate(ssq_tab = list(ssq_table(random_normal, max_cluster = 15))) %>%
unnest(ssq_tab) %>%
group_by(cluster) %>%
summarize(mean_log = mean(log(ssq)), sd_log = sd(log(ssq))) %>%
left_join(d, by = "cluster") %>%
mutate(ssq_log = log(ssq))
and now what we do is to calculate two statistics as shown:
tibble(sim = 1:3) %>%
rowwise() %>%
mutate(random_normal =
list(as_tibble(matrix(rnorm(77*6), nrow = 77, ncol = 6)))) %>%
mutate(ssq_tab = list(ssq_table(random_normal, max_cluster = 15))) %>%
unnest(ssq_tab) %>%
group_by(cluster) %>%
summarize(mean_log = mean(log(ssq)), sd_log = sd(log(ssq))) %>%
left_join(d, by = "cluster") %>%
mutate(ssq_log = log(ssq)) %>%
mutate(stat = mean_log - ssq_log,
stat2 = stat + sd_log)
I’ll tell you what to do with those in a minute, but the thing to do, now that this seems to be working, is to up the number of simulations:
tibble(sim = 1:1000) %>%
rowwise() %>%
mutate(random_normal =
list(as_tibble(matrix(rnorm(77*6), nrow = 77, ncol = 6)))) %>%
mutate(ssq_tab = list(ssq_table(random_normal, max_cluster = 15))) %>%
unnest(ssq_tab) %>%
group_by(cluster) %>%
summarize(mean_log = mean(log(ssq)), sd_log = sd(log(ssq))) %>%
left_join(d, by = "cluster") %>%
mutate(ssq_log = log(ssq)) %>%
mutate(stat = mean_log - ssq_log,
stat2 = stat + sd_log)
To find a good number of clusters, find the values of stat2
that are bigger than the stat
on the next line. In this case, 2 is a good number of clusters (\(0.197 > 0.175\)), 3 is not quite (\(0.245 < 0.247\)), and 4 and above is good (\(0.317 > 0.280\) and so on). You want a small number of clusters, so 2 or 4 would be good choices according to this method.19
As ever, different ways may disagree. I think this method is reacting more sharply to the elbow than Hartigan’s method was. There is no “best” answer here.
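If you want to code up the decision rule rather than reading down the columns by eye, here is a sketch; it assumes you have saved the result of the pipeline above under a name like gap_results (my own made-up name):

gap_results %>%
  mutate(stat_next = lead(stat),     # stat for the next number of clusters
         good = stat2 > stat_next)   # TRUE where this many clusters is "good"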
\(\blacksquare\)
- (3 points) Run a K-means cluster analysis with six clusters. (This may or may not be the number of clusters you got from your scree plot.) Save the results.
Solution
Much like the lecture notes:
eaudevie_scaled %>%
  select(-type) %>%
  kmeans(6, nstart = 20) -> eaudevie_6
\(\blacksquare\)
- Make a table of
type
against cluster number. Do the clusters distinguish the types? Explain briefly.
Solution
I mentioned above that the clustering doesn’t use type
at all, so if the clusters do split by type, this means that the types are different according to the quantitative variables:
table(type = eaudevie$type, cluster = eaudevie_6$cluster)
cluster
type 1 2 3 4 5 6
KIRSCH 0 0 9 9 0 0
MIRAB 2 11 0 0 3 13
POIRE 10 1 0 0 12 7
Reminder that your cluster numbers are probably different from mine, so use yours.
On mine, Kirsch is in clusters 3 and 4 (as I predicted earlier), and there are no samples of any other type in those two clusters. My clusters 2 and 6 are mostly (but not exclusively) Mirabelle, and my clusters 1 and 5 are mainly Poire. Some of these two types ended up in the “wrong” cluster; these might be ones that might be misclassified in a discriminant analysis.
Extra (this was previously a question for you): we could use a discriminant analysis to make a plot of our K-means cluster analysis, bearing in mind that it would be good to include both the true types and your clusters on the plot.
One way of making a plot from a K-means cluster analysis is to use the plot of LD1 vs. LD2 from a discriminant analysis, but to label the points by which cluster they came from. Here, we have the true type
s as well, and it requires thought to get that all on a plot. My idea is to plot the points as cluster numbers instead of as points, which geom_text
will do, and then we can keep the colours as type
. This is different from the example in class, where we had no “true groups”.
First, do the lda
and get the predictions:
eaudevie.1 <- lda(type ~ meoh + acet + bu1 + mepr + acal + lnpro1, data = eaudevie)
p <- predict(eaudevie.1)
Then, make a dataframe containing everything needed for our graph:
- the true types from the original dataframe
- the cluster numbers from the K-means cluster analysis
- the LD scores from the predictions (in
x
)
I will do them in that order. The first two are vectors, so I can get them with tibble
; the last one is a matrix
so I use cbind
:
tibble(type = eaudevie$type, cluster = eaudevie_6$cluster) %>%
cbind(p$x) -> d_for_graph
d_for_graph
You can tell that I’m running out of good names for things.
All right, so the LD scores will be x
and y
on the graph; type
will be colour
, and cluster
will be label
, as if we were going to plot it with geom_text_repel
, but we’re not: we’ll plot it with geom_text
to plot it exactly where it should be instead of using geom_point
:20
ggplot(d_for_graph, aes(x = LD1, y = LD2, colour = type, label = cluster)) +
geom_text()
Clusters 3 and 4 are Kirsch, over on the left. They seem to be mixed up with each other, though; it is hard to see why cluster 3 is different from cluster 4. Cluster 1 (mainly Poire) is mostly somewhere near the middle of the plot. Cluster 2 (mainly Mirabelle) is mostly in the bottom right of the plot. Cluster 5 (mostly Poire) is mainly top right, though with exceptions. Finally, cluster 6 (mostly Mirabelle) is also mainly bottom right, and there is a lot of overlap with cluster 2.
This plot makes it look as if there are not really as many as six clusters. Recall that Hartigan said six, the scree plot said maybe four, and the gap test said two or four. This illustrates that picking the right number of clusters is still rather a guessing game, even if you have supposedly “objective” methods to work with.
Extra 1: I can think of two other ways to make the graph. They both use colour for one categorical thing and the shape of the points plotted for the other. Colour as cluster and shape as type:
ggplot(d_for_graph, aes(x = LD1, y = LD2, colour = factor(cluster), shape = type)) +
geom_point()
Remember to make the (numerical) clusters categorical for this plot, or else you’ll get shades of blue rather than six different colours.
The Kirsch (circles) are on the left, with the other two types somewhat mixed up on the right. The colours distinguish the clusters, at least somewhat.
The other way to do it is to switch the colour and shape around:
ggplot(d_for_graph, aes(x = LD1, y = LD2, shape = factor(cluster), colour = type)) +
geom_point()
My take is that it is easier to distinguish only three different colours, but now we are faced with six different shapes for the clusters, and they are getting hard to tell apart. So I prefer having six different colours and three different shapes.
Extra 2: well, I couldn’t resist trying this with two clusters after all. The code is the same as above, replacing a 6 by a 2, and the discriminant analysis is just the same one as before, so I don’t need to do it again:
eaudevie_scaled %>%
  select(-type) %>%
  kmeans(2, nstart = 20) -> eaudevie_2
eaudevie_2$centers
meoh acet bu1 mepr acal lnpro1
1 0.4803854 -0.0002036889 0.5070575 0.1531633 0.1086499 -0.2801827
2 -1.2810278 0.0005431703 -1.3521532 -0.4084354 -0.2897330 0.7471539
Cluster 2 is definitely lower on meoh
and bu1
, so we should be starting to have suspicions about what’s where.
table(type = eaudevie$type, cluster = eaudevie_2$cluster)
cluster
type 1 2
KIRSCH 0 18
MIRAB 28 1
POIRE 28 2
Well, there you go: Kirsch in cluster 2, everything else (more or less) in cluster 1.
Make a dataframe to plot, as above:
tibble(type = eaudevie$type, cluster = eaudevie_2$cluster) %>%
cbind(p$x) -> d_for_graph
d_for_graph
and then plot it:
ggplot(d_for_graph, aes(x = LD1, y = LD2, colour = type, label = cluster)) +
geom_text()
It’s actually not completely clear-cut: the Kirsch are all over on the left, in cluster 2, and cluster 1 is on the right, but there are three samples from cluster 2 on the left edge of cluster 1, with an LD1 score of about \(-0.5\). I don’t quite know why they ended up there; it is possible that there is something on the six variables for these that makes them look more like the Kirsch samples, or maybe I needed a bigger nstart
(or to run it again). Dunno.
\(\blacksquare\)
Footnotes
Presumably, a sample of each country’s inhabitants.↩︎
Fruit and vegetables is one category. I use the Oxford comma, so if fruit and vegetables had been two separate categories, I would have written “nuts, fruit, and vegetables” with an extra comma. Precision in writing is a good thing.↩︎
Now Russia, Ukraine, Belarus, Lithuania etc.↩︎
A lot of the functions in that package have girl’s names, for reasons that elude me.↩︎
Yes, I know.↩︎
For which the total within-cluster sum of squared distances should be bigger, because the countries within a smaller number of clusters will be more diverse.↩︎
To do this properly, it would also be useful to write a function to obtain the long dataframe of distances from a
dist
object. That was rather awkward to do, and the awkwardness would be usefully hidden away in a function.↩︎
The computer scientist Phil Karlton said, “there are two hard problems in computer science, cache invalidation and naming things”. I have no idea what cache invalidation is, but naming things is certainly a hard problem.↩︎
It is no trouble to look at different numbers in a hierarchical analysis; there is basically no calculation needed to make this graph.↩︎
I find it rather amusing that France and the UK have anything in common food-wise!↩︎
Brandy and other distilled drinks have a very high alcohol content. In France, eau-de-vie is drunk in small portions after a meal as what they call a “digestif”. This being France, you might imagine that diners have already had several glasses of wine, so it is not good for them to overdo something that already has a high alcohol content.↩︎
Not to be confused with Montreal’s other airport.↩︎
Kirsch is the German word for cherry.↩︎
This is what you would expect, since these are amounts and cannot go below zero.↩︎
When I was an undergrad in England, we called this kind of thing “book work”, meaning that very little actual thinking was required, even though it was a certain amount of work.↩︎
They literally wrote the book on machine learning.↩︎
I have no idea whether this link will work.↩︎
Tibshirani et al have a reason to use uniforms here, but I will use normals and see what happens. This is easier for me because I know my scaled data will have mean 0 and sd 1.↩︎
I have a feeling that in the two-cluster solution, one cluster is Kirsch and the other is the other two types. If you feel like it, see whether I was right. You might like to read the later discussion first.↩︎
The usual thing is to plot the point with a dot and then label it with some text that is near to the point, rather than at the point. But here, we have no dot, and we use geom_text to put the label at the point.↩︎