One side-effect of discriminant analysis: could draw a picture of the data (if the first two LDs told most of the story) and see which individuals are “close” to each other.
Discriminant analysis requires knowledge of groups.
Without knowledge of groups, use cluster analysis: see which individuals close together, which groups suggested by data.
Idea: see how individuals group into “clusters” of nearby individuals.
Base on “dissimilarities” between individuals.
Or base on standard deviations and correlations between variables (dissimilarity assessed behind the scenes).
|  | English | Norwegian | Danish | Dutch | German |
|---|---|---|---|---|---|
| 1 | one | en | en | een | eins |
| 2 | two | to | to | twee | zwei |
| 3 | three | tre | tre | drie | drei |
| 4 | four | fire | fire | vier | vier |
| 5 | five | fem | fem | vijf | funf |
| 6 | six | seks | seks | zes | sechs |
| 7 | seven | sju | syv | zeven | sieben |
| 8 | eight | atte | otte | acht | acht |
| 9 | nine | ni | ni | negen | neun |
| 10 | ten | ti | ti | tien | zehn |
|  | French | Spanish | Italian | Polish | Hungarian | Finnish |
|---|---|---|---|---|---|---|
| 1 | un | uno | uno | jeden | egy | yksi |
| 2 | deux | dos | due | dwa | ketto | kaksi |
| 3 | trois | tres | tre | trzy | harom | kolme |
| 4 | quatre | cuatro | quattro | cztery | negy | nelja |
| 5 | cinq | cinco | cinque | piec | ot | viisi |
| 6 | six | seis | sei | szesc | hat | kuusi |
| 7 | sept | siete | sette | siedem | het | seitseman |
| 8 | huit | ocho | otto | osiem | nyolc | kahdeksan |
| 9 | neuf | nueve | nove | dziewiec | kilenc | yhdeksan |
| 10 | dix | diez | dieci | dziesiec | tiz | kymmenen |
Can define dissimilarities how you like (whatever makes sense in application).
Sometimes defining “similarity” makes more sense; can turn this into dissimilarity by subtracting from some maximum.
Example: numbers 1–10 in various European languages. Define similarity between two languages by counting how often the same number has a name starting with the same letter (and dissimilarity by how often a number has names starting with different letters).
Crude (doesn’t even look at most of each word), but see how effective it is.
Looking at process of forming clusters (of similar languages): hierarchical cluster analysis (hclust).
Start with each individual in cluster by itself.
Join “closest” clusters one by one until all individuals in one cluster.
How to define closeness of two clusters? Not obvious, investigate in a moment.
Know how many clusters: which division into that many clusters is “best” for individuals? K-means clustering (kmeans).
How to measure distance between set of red points and set of blue ones?
Find the red point and the blue point that are closest together:
Single-linkage distance between 2 clusters is distance between their closest points.
Find the red and blue points that are farthest apart:
Complete-linkage distance is distance between farthest points.
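A sketch of both distances, on made-up “red” and “blue” point sets (coordinates are for illustration only):

```r
# Made-up 2-D point sets, for illustration only
red  <- matrix(c(1, 1,
                 2, 1,
                 1, 2), ncol = 2, byrow = TRUE)
blue <- matrix(c(5, 5,
                 6, 5), ncol = 2, byrow = TRUE)

# distance between red point i and blue point j
rb <- function(i, j) sqrt(sum((red[i, ] - blue[j, ])^2))

# all red-blue pairwise distances
pairwise <- outer(seq_len(nrow(red)), seq_len(nrow(blue)), Vectorize(rb))

min(pairwise)  # single linkage: closest red-blue pair
max(pairwise)  # complete linkage: farthest red-blue pair
```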
Work out mean of each cluster and join point to its mean:
Work out (i), the sum of squared distances of points from their own cluster means.
Now imagine combining the two clusters and working out overall mean. Join each point to this mean:
Calculate (ii), the sum of squared distances of points to the combined mean.
Sum of squares (ii) will be bigger than (i) (points closer to own cluster mean than combined mean).
Ward’s distance is (ii) minus (i).
Think of as “cost” of combining clusters:
if clusters close together, (ii) only a little larger than (i)
if clusters far apart, (ii) a lot larger than (i) (as in example).
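A sketch of Ward’s distance on two made-up clusters:

```r
# Two made-up clusters
a <- matrix(c(1, 1,  2, 1,  1, 2), ncol = 2, byrow = TRUE)
b <- matrix(c(5, 5,  6, 5), ncol = 2, byrow = TRUE)

# sum of squared distances of rows of m from their mean
ss_within <- function(m) sum(scale(m, center = TRUE, scale = FALSE)^2)

i  <- ss_within(a) + ss_within(b)  # (i): each cluster around its own mean
ii <- ss_within(rbind(a, b))       # (ii): everything around the combined mean

ii - i  # Ward's distance: the "cost" of combining the clusters
```

Here the clusters are far apart, so (ii) is much larger than (i).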
Single linkage, complete linkage, Ward are ways of measuring closeness of clusters.
Use them, starting with each observation in own cluster, to repeatedly combine two closest clusters until all points in one cluster.
They will give different answers (clustering stories).
Single linkage tends to make “stringy” clusters, because two clusters can be very different apart from their two closest points.
Complete linkage insists on whole clusters being similar.
Ward tends to form many small clusters first.
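A sketch of running all three methods on the same made-up one-dimensional data, using base R’s dist and hclust:

```r
# Made-up one-dimensional data, to show the mechanics only
x <- c(1, 2, 4, 8, 16)
d <- dist(x)  # pairwise (Euclidean) dissimilarities

hc.single   <- hclust(d, method = "single")
hc.complete <- hclust(d, method = "complete")
hc.ward     <- hclust(d, method = "ward.D2")

# merge records which clusters were combined at each step;
# the three methods can tell different clustering stories
hc.single$merge
hc.ward$merge
```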
Dissimilarities for language data were how many number names had a different first letter.
Lines of merge show what was combined:
First, languages 2 and 3 (no and dk).
Then languages 6 and 8 (fr and it).
Then #7 combined with the cluster formed at step 2 (es joined to fr and it).
Then en joined to no and dk.
Finally fi joined to all the others.
In Ward, Dutch and German get joined earlier (before joining to Germanic cluster).
Also Hungarian and Finnish get combined earlier.
Original data:
It would be a lot easier to extract the first letter if the number names were all in one column.
Suppose we wanted dissimilarity between English and Norwegian. It’s the number of first letters that are different.
First get the lines for English:
And now we want to put them side by side, matched by number. This is what left_join does. (A “join” is a lookup of values in one table using another.)
first.x is 1st letter of English word, first.y 1st letter of Norwegian word.
or
english %>% left_join(norwegian, join_by(number)) %>%
count(different=(first.x != first.y)) %>%
filter(different) %>% pull(n) -> ans
ans
[1] 2
Words for 1 and 8 start with different letter; rest are same.
If we compare English with itself, the answer should be zero:
english %>% left_join(english, join_by(number)) %>%
count(different=(first.x != first.y)) %>%
filter(different) %>% pull(n) -> ans
ans
integer(0)
Not zero, but a vector of length zero: no rows had different first letters, so there was nothing to count.
countdiff <- function(lang.1, lang.2, d) {
d %>% filter(language == lang.1) -> lang1d
d %>% filter(language == lang.2) -> lang2d
lang1d %>%
left_join(lang2d, join_by(number)) %>%
count(different = (first.x != first.y)) %>%
filter(different) %>% pull(n) -> ans
# if ans has length zero, set answer to (integer) zero.
ifelse(length(ans)==0, 0L, ans)
}
English and Norwegian have two different; English and English have none different.
Check.
Then run countdiff for all those language pairs: those dissimilarities are where we began.
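A base-R sketch of running a countdiff of this kind over all language pairs to build the dissimilarity matrix, on a miniature made-up dataset (three languages, numbers 1–3 only, first letters taken from the table earlier; countdiff_base is my stand-in, with merge playing the role of left_join):

```r
# Miniature made-up version of the long-format data: three languages,
# numbers 1-3 only, first letters from the table earlier
lang.long <- data.frame(
  language = rep(c("en", "no", "dk"), each = 3),
  number   = rep(1:3, times = 3),
  first    = c("o", "t", "t",   # one, two, three
               "e", "t", "t",   # en, to, tre
               "e", "t", "t")   # en, to, tre
)

# base-R stand-in for countdiff: merge() plays the role of left_join
countdiff_base <- function(lang.1, lang.2, d) {
  m <- merge(d[d$language == lang.1, ], d[d$language == lang.2, ],
             by = "number")
  sum(m$first.x != m$first.y)
}

langs <- unique(lang.long$language)
dm <- outer(langs, langs,
            Vectorize(function(a, b) countdiff_base(a, b, lang.long)))
dimnames(dm) <- list(langs, langs)
dm  # as.dist(dm) would then go into hclust
```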
Birth, death and infant mortality rates for 97 countries (variables not dissimilarities):
24.7 5.7 30.8 Albania 12.5 11.9 14.4 Bulgaria
13.4 11.7 11.3 Czechoslovakia 12 12.4 7.6 Former_E._Germany
11.6 13.4 14.8 Hungary 14.3 10.2 16 Poland
13.6 10.7 26.9 Romania 14 9 20.2 Yugoslavia
17.7 10 23 USSR 15.2 9.5 13.1 Byelorussia_SSR
13.4 11.6 13 Ukrainian_SSR 20.7 8.4 25.7 Argentina
46.6 18 111 Bolivia 28.6 7.9 63 Brazil
23.4 5.8 17.1 Chile 27.4 6.1 40 Columbia
32.9 7.4 63 Ecuador 28.3 7.3 56 Guyana
...
Want to find groups of similar countries (and how many groups, which countries in each group).
Tree would be unwieldy with 97 countries.
More automatic way of finding given number of clusters?
Infant mortality rate numbers bigger than others, consequence of measurement scale (arbitrary).
Standardize (numerical) columns of data frame to have mean 0, SD 1, done by scale.
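A sketch of what scale does, on a few rows of the data above:

```r
# A few rows of the data from above (birth, death, infant rates)
vital <- data.frame(birth  = c(24.7, 12.5, 13.4, 46.6),
                    death  = c(5.7, 11.9, 11.7, 18.0),
                    infant = c(30.8, 14.4, 11.3, 111.0))

vital.s <- scale(vital)        # each column: subtract mean, divide by SD
round(colMeans(vital.s), 10)   # all 0
apply(vital.s, 2, sd)          # all 1
```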
Pretend we know 3 clusters is good. Take off the column of countries, and run kmeans on the resulting data frame, asking for 3 clusters:
[1] "cluster" "centers" "totss"
[4] "withinss" "tot.withinss" "betweenss"
[7] "size" "iter" "ifault"
A lot of output, so look at these individually.
birth death infant
1 -1.0376994 -0.3289046 -0.90669032
2 1.1780071 1.3323130 1.32732200
3 0.3768062 -0.6297388 0.09639258
Cluster 1 compact relative to others (countries in cluster 1 more similar).
[1] 3 1 1 1 1 1 2 1 3 3 1 2 1 1 1 1 1 1 1 1 1 2 2 1 3 3 3 2
[29] 1 3 1 3 3 1 1 3 3 3 2 2 3 3 2 2 3 2 2 2 3 1 1 1 1 1 1 3
[57] 3 3 3 3 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 1 2 1 3 3 2 3 1 3
[85] 2 2 2 2 3 2 2 2 2 2 3 2 2
The cluster membership for each of the 97 countries.
Next, which countries in which cluster?
Write function to extract them:
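A sketch of what such a function might look like (get_countries is my own name, and the demonstration data are made up; the lecture’s function may differ):

```r
# Hypothetical helper: names of the countries in cluster i
get_countries <- function(i, km, d) {
  d$country[km$cluster == i]
}

# Tiny made-up demonstration (not the real 97-country data)
demo <- data.frame(country = c("A", "B", "C", "D"),
                   birth  = c(40, 42, 12, 13),
                   death  = c(15, 16, 10, 9),
                   infant = c(90, 95, 8, 7))
set.seed(1)
demo.km <- kmeans(scale(demo[, -1]), 2, nstart = 10)
get_countries(demo.km$cluster[1], demo.km, demo)  # cluster containing A
```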
[1] "Bolivia" "Mexico" "Afghanistan"
[4] "Iran" "Bangladesh" "Gabon"
[7] "Ghana" "Namibia" "Sierra_Leone"
[10] "Swaziland" "Uganda" "Zaire"
[13] "Cambodia" "Nepal" "Angola"
[16] "Congo" "Ethiopia" "Gambia"
[19] "Malawi" "Mozambique" "Nigeria"
[22] "Somalia" "Sudan" "Tanzania"
[25] "Zambia"
[1] "Albania" "Ecuador" "Paraguay"
[4] "Kuwait" "Oman" "Turkey"
[7] "India" "Mongolia" "Pakistan"
[10] "Algeria" "Botswana" "Egypt"
[13] "Libya" "Morocco" "South_Africa"
[16] "Zimbabwe" "Brazil" "Columbia"
[19] "Guyana" "Peru" "Venezuela"
[22] "Bahrain" "Iraq" "Jordan"
[25] "Lebanon" "Saudi_Arabia" "Indonesia"
[28] "Malaysia" "Philippines" "Vietnam"
[31] "Kenya" "Tunisia"
[1] "Czechoslovakia" "Hungary"
[3] "Romania" "USSR"
[5] "Ukrainian_SSR" "Chile"
[7] "Uruguay" "Finland"
[9] "France" "Greece"
[11] "Italy" "Norway"
[13] "Spain" "Switzerland"
[15] "Austria" "Canada"
[17] "Israel" "China"
[19] "Korea" "Singapore"
[21] "Thailand" "Bulgaria"
[23] "Former_E._Germany" "Poland"
[25] "Yugoslavia" "Byelorussia_SSR"
[27] "Argentina" "Belgium"
[29] "Denmark" "Germany"
[31] "Ireland" "Netherlands"
[33] "Portugal" "Sweden"
[35] "U.K." "Japan"
[37] "U.S.A." "United_Arab_Emirates"
[39] "Hong_Kong" "Sri_Lanka"
kmeans uses randomization, so the result of one run might be different from that of another.
Example: just run again with 3 clusters, and make a table of the results:
vital.s %>%
select(-country) %>% kmeans(3) -> vital.km3a
table(
first = vital.km3$cluster,
second = vital.km3a$cluster
)
second
first 1 2 3
1 40 0 0
2 0 24 1
3 4 0 28
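A sketch (on made-up data with two clear groups) of how the nstart option, discussed next, makes the result repeatable:

```r
# Made-up data with two clear groups
set.seed(457)
x <- data.frame(a = c(rnorm(20, 0), rnorm(20, 10)),
                b = c(rnorm(20, 0), rnorm(20, 10)))

# nstart = 20: run kmeans from 20 random starts, keep the best fit
km1 <- kmeans(x, 2, nstart = 20)
km2 <- kmeans(x, 2, nstart = 20)

# the best of many starts is the same fit both times
km1$tot.withinss == km2$tot.withinss  # TRUE
```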
The nstart option on kmeans runs the whole thing that many times and takes the best result, so the answer should be the same every time.
Three was just a guess at the number of clusters, though.
Idea: try a whole bunch of #clusters (say 2–20), obtain measure of goodness of fit for each, make plot.
Appropriate measure is tot.withinss.
Run kmeans for each #clusters, getting tot.withinss each time.
Write a function to return tot.withinss for an input number of clusters, using only the numeric columns of the input data frame:
Note: writing function to be as general as possible, so that we can re-use it later.
Make a data frame with desired numbers of clusters, and fill it with the total within-group sums of squares. ss expects a single number of clusters, not a vector of several, so run rowwise:
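A base-R sketch of the same idea, with sapply standing in for the rowwise loop, on made-up data (the details of ss here are my own rendering and may differ from the lecture’s):

```r
# tot.withinss for k clusters, using only the numeric columns
# (my own rendering of ss)
ss <- function(k, d) {
  num <- d[sapply(d, is.numeric)]
  kmeans(scale(num), k, nstart = 20)$tot.withinss
}

# made-up data with a non-numeric column, as in the countries data
set.seed(457)
demo <- data.frame(country = letters[1:30],
                   birth = rnorm(30), infant = rnorm(30))

clusters <- 2:10
wss <- sapply(clusters, ss, d = demo)  # one tot.withinss per #clusters
# plot(clusters, wss, type = "b") would show the elbow plot
```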
Lower wss better.
But wss always lower for larger #clusters, and more clusters are harder to explain.
Compromise: low-ish wss and low-ish #clusters.
Look for “elbow” in plot.
Idea: this is where wss switches from decreasing fast to decreasing slowly.
On our plot, small elbow at 6 clusters. Try this many clusters.
Run kmeans asking for 6 clusters (with nstart as before). Cluster sizes:
[1] 24 18 15 2 8 30
birth death infant
1 0.4160993 -0.5169988 0.2648754
2 1.2092406 0.7441347 1.0278003
3 -0.4357690 -1.1438599 -0.7281108
4 -0.2199722 2.1116577 -0.4544435
5 1.3043848 2.1896567 1.9470306
6 -1.1737104 -0.1856375 -0.9534370
Below-average death rate, though other rates a little higher than average:
[1] "Ecuador" "Paraguay" "Oman"
[4] "Turkey" "India" "Mongolia"
[7] "Pakistan" "Algeria" "Egypt"
[10] "Libya" "Morocco" "South_Africa"
[13] "Zimbabwe" "Brazil" "Guyana"
[16] "Peru" "Iraq" "Jordan"
[19] "Lebanon" "Saudi_Arabia" "Indonesia"
[22] "Philippines" "Vietnam" "Tunisia"
High on everything:
Low on everything:
Very high death rate, just below average on all else:
Very high on everything:
A bit below average on everything:
[1] "Czechoslovakia" "Hungary"
[3] "Romania" "USSR"
[5] "Ukrainian_SSR" "Uruguay"
[7] "Finland" "France"
[9] "Greece" "Italy"
[11] "Norway" "Spain"
[13] "Switzerland" "Austria"
[15] "Canada" "Bulgaria"
[17] "Former_E._Germany" "Poland"
[19] "Yugoslavia" "Byelorussia_SSR"
[21] "Belgium" "Denmark"
[23] "Germany" "Ireland"
[25] "Netherlands" "Portugal"
[27] "Sweden" "U.K."
[29] "Japan" "U.S.A."
six
three 1 2 3 4 5 6
1 0 0 9 1 0 30
2 0 16 0 1 8 0
3 24 2 6 0 0 0
Compared to 3-cluster solution:
most of (old) cluster 1 gone to (new) cluster 6
cluster 2 split into clusters 2 and 5 (two types of “poor” countries)
cluster 3 split into clusters 1 and 3 (two types of “intermediate” countries, divided by death rate).
So what makes the groups from kmeans different? Run a discriminant analysis, using the clusters as the known groups.
Uses package MASS (loaded):
[1] 21.687195 8.851811 1.773006
LD1 LD2 LD3
birth 2.6879695 1.1224202 -1.9483853
death 0.6652712 -2.7213044 -0.6049358
infant 2.1111801 0.7650912 2.3542296
LD1 is some of everything (high=poor, low=rich).
LD2 mainly death rate, high or low.
d contains country names, cluster memberships and discriminant scores.
Plot LD1 against LD2, colouring points by cluster and labelling by country.
It would be better to zoom in on parts of this plot.
An Ontario hockey league has teams in 21 cities. How can we arrange those teams into 4 geographical divisions?
Distance data in spreadsheet.
Take out spaces in team names.
Save as “text/csv”.
Distances, so back to hclust.
Can’t have divisions of 1 team!
“Southern” divisions way too big!
Try splitting into more. I found 7 to be good:
I want to put Huntsville and North Bay together with northern teams.
I’ll put the Eastern teams together. Gives:
North: Sault Ste Marie, Sudbury, Huntsville, North Bay
East: Brockville, Cornwall, Ottawa, Peterborough, Belleville, Kingston
West: Windsor, London, Sarnia
Central: Owen Sound, Barrie, Toronto, Niagara Falls, St Catharines, Brantford, Hamilton, Kitchener
Getting them all the same size is beyond us!
Comments
Tree shows how languages combined into clusters.
First (bottom), Spanish, French, Italian joined into one cluster, Norwegian and Danish into another.
Later, English joined to Norse languages, Polish to Romance group.
Then German, Dutch make a Germanic group.
Finally, Hungarian and Finnish joined to each other and everything else.