Worksheet 10
Packages
library(tidyverse)
library(MASS, exclude = "select")
Protein intake
These data give, for each of 25 European countries, the percentage of protein that its inhabitants¹ get from various food sources. The food sources are: red meat, white meat, eggs, milk, fish, cereals, starch, nuts, fruit and vegetables.² The original data, which we are not going to use, are in http://www.biz.uiowa.edu/faculty/jledolter/DataMining/protein.csv. The values date from 1973, so some of the country names reflect how things were in those days (e.g. West Germany, Czechoslovakia, USSR³). Optionally, take a look at these data to see the kind of thing we have.
We are going to do a hierarchical cluster analysis with these data, for which we need dissimilarities. See the first question below.
- I wanted to have you work with the standardized data, retain the country names, and make dissimilarities. This is beyond the scope of what I expect you to be able to do at this point, so I made the dist object for you. It is in https://www.utsc.utoronto.ca/~butler/food-dist.rds. Read it in using read_rds and display its first six values using head. (read_rds works the same way as the other read_ functions: give it a file name or a URL and save the thing read in into a variable.)
- Run a hierarchical cluster analysis using single linkage, and display the dendrogram.
- What characteristic of single linkage is displayed on your dendrogram? Explain briefly.
- Run a hierarchical cluster analysis using Ward’s method, and display the dendrogram.
- What seems to be a sensible number of clusters? Add these to your plot.
- What do the countries in each of your clusters seem to have in common, based on what you know or can find out? Explain briefly.
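As a rough guide to the mechanics of the steps above, here is a minimal sketch on different data. It assumes the built-in USArrests data frame as a stand-in for the protein data, and a temporary file as a stand-in for the URL. The choice of ward.D (rather than ward.D2) and of four clusters is arbitrary, for illustration only; neither is meant as the answer to the questions above.

```r
library(tidyverse)

# Illustration only: USArrests stands in for the protein data.
# scale() standardizes each column; dist() keeps the row names as labels.
d <- USArrests %>% scale() %>% dist()

# Round trip through an .rds file (a temp file here, a URL on the worksheet)
f <- tempfile(fileext = ".rds")
write_rds(d, f)
d <- read_rds(f)
head(d)   # the first six dissimilarities

# Single linkage, then the dendrogram
d.single <- hclust(d, method = "single")
plot(d.single)

# Ward's method (ward.D is an assumption), then the dendrogram
d.ward <- hclust(d, method = "ward.D")
plot(d.ward)

# Draw boxes around a chosen number of clusters (4 is arbitrary here)
rect.hclust(d.ward, k = 4)

# Cluster membership for the first few observations
cutree(d.ward, k = 4) %>% head()
```

rect.hclust adds to the most recent plot, so it has to come directly after the plot of the same hclust object.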
Eau-de-vie k-means
Eau-de-vie is an alcoholic drink. It is a type of brandy,⁴ but is distinct from traditional brandy (which is distilled from grapes) in that it is distilled from fruit. In the data set that we will investigate, 77 samples of three different types of eau-de-vie were obtained. These types were “poire”, made from pears, “mirabelle”, made from plums,⁵ and “kirsch”,⁶ made from cherries. For each sample, the content of six different chemicals was measured. Our aim is to find out whether the chemical content of an eau-de-vie could be used to identify what kind of fruit it was made from.
The chemicals measured were:
- meoh: Methanol
- acet: Ethyl acetate
- bu1: Isobutanol
- mepr: Monoterpenes
- acal: Acetaldehyde
- lnpro1: 1-propanol (apparently on a log scale)
The data are in http://ritsokiguess.site/datafiles/eau-de-vie.csv.
- Read in and display (some of) the data.
- By doing some calculation or making a graph, explain briefly why it would be a good idea to standardize the quantitative variables in this dataframe.
- Standardize all the quantitative variables. Explain briefly how you know that the standardization has worked.
- Make a scree plot for these data. For this, (i) copy the appropriate function from the lecture notes, (ii) run it for each number of clusters from 1 to 15, (iii) draw the scree plot.
- How many clusters does your scree plot suggest? Justify your choice briefly.
- (3 points) Run a K-means cluster analysis with six clusters. (This may or may not be the number of clusters you got from your scree plot.) Save the results.
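The workflow in the steps above can be sketched on different data. This sketch assumes the built-in iris data frame as a stand-in for the eau-de-vie data. The helper wss is hypothetical (the function in the lecture notes may well differ), and the seed, nstart = 20, and the three clusters at the end are arbitrary choices made for illustration.

```r
library(tidyverse)

# Illustration only: iris stands in for the eau-de-vie data.
# Means and SDs that differ a lot between columns suggest standardizing.
iris %>% summarize(across(where(is.numeric), list(mean = mean, sd = sd)))

# Standardize each quantitative column: subtract its mean, divide by its SD
iris_scaled <- iris %>%
  mutate(across(where(is.numeric), function(x) (x - mean(x)) / sd(x)))

# Check: every mean should now be (near) 0 and every SD 1
iris_scaled %>%
  summarize(across(where(is.numeric), list(mean = mean, sd = sd)))

# Hypothetical helper: total within-cluster sum of squares for k clusters
wss <- function(k, data) {
  kmeans(data, centers = k, nstart = 20)$tot.withinss
}

# K-means needs the quantitative columns only
x <- iris_scaled %>% select(where(is.numeric))

# Run the helper for each number of clusters from 1 to 15
set.seed(457299)   # arbitrary seed, so the random starts are reproducible
scree <- tibble(clusters = 1:15) %>%
  mutate(wss = map_dbl(clusters, function(k) wss(k, x)))

# Scree plot: look for an elbow
ggplot(scree, aes(x = clusters, y = wss)) + geom_point() + geom_line()

# K-means with a chosen number of clusters (3 is arbitrary), results saved
x.km <- kmeans(x, centers = 3, nstart = 20)
table(cluster = x.km$cluster, species = iris$Species)
```

The final table compares the clusters found with a known grouping, which is the same kind of comparison the eau-de-vie question is aiming at.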
Footnotes
1. Presumably, a sample of each country’s inhabitants.
2. Fruit and vegetables is one category. I use the Oxford comma, so if fruit and vegetables had been two separate categories, I would have written “nuts, fruit, and vegetables” with an extra comma. Precision in writing is a good thing.
3. Now Russia, Ukraine, Belarus, Lithuania etc.
4. Brandy and other distilled drinks have a very high alcohol content. In France, eau-de-vie is drunk in small portions after a meal as what they call a “digestif”. This being France, you might imagine that diners have already had several glasses of wine, so it is not good for them to overdo something that already has a high alcohol content.
5. Not to be confused with Montreal’s other airport.
6. Kirsch is the German word for cherry.