Discriminant Analysis

Discriminant analysis

  • ANOVA and MANOVA: predict a (counted/measured) response from group membership.

  • Discriminant analysis: predict group membership based on counted/measured variables.

  • Covers same ground as logistic regression (and its variations), but emphasis on classifying observed data into correct groups.

  • Does so by searching for the linear combination of the original variables that best separates the data into groups (canonical variables).

  • Assumption here that the groups are known (for the data we have). If trying to “best separate” data into unknown groups, see cluster analysis.

Packages

library(MASS, exclude = "select")
library(tidyverse)
library(ggrepel)
library(ggbiplot)
library(MVTests) # for Box M test
library(conflicted)
conflict_prefer("arrange", "dplyr")
conflict_prefer("summarize", "dplyr")
conflict_prefer("select", "dplyr")
conflict_prefer("filter", "dplyr")
conflict_prefer("mutate", "dplyr")
  • ggrepel allows labelling points on a plot so they don’t overwrite each other.
  • ggbiplot uses plyr rather than dplyr; these two packages have functions with the same names that do different things.

About select

  • Both dplyr (in tidyverse) and MASS have a function called select, and they do different things.

  • How do you know which select is going to get called?

  • With library, the one loaded last is visible, and others are not.

  • Thus we can access the select in dplyr but not the one in MASS. If we wanted that one, we’d have to say MASS::select.

  • Better: load the conflicted package. Any time you call a function whose name appears in two loaded packages, you get an error and have to choose between them.
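For example, a minimal sketch of making the choice (mtcars is just a convenient built-in dataset, not part of this example):

# With conflicted loaded, the conflict_prefer calls above settle the
# conflict, so a plain select() is dplyr's:
mtcars %>% select(mpg, cyl) %>% head(3)
# To reach the MASS version explicitly, qualify the call: MASS::select(...)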

Example 1: seed yields and weights

my_url <- "http://ritsokiguess.site/datafiles/manova1.txt"
hilo <- read_delim(my_url, " ")
g <- ggplot(hilo, aes(x = yield, y = weight,
  colour = fertilizer)) + geom_point(size = 4)
g

Recall this dataset from MANOVA: we needed a multivariate analysis to find a difference in seed yield and weight between the high-fertilizer and low-fertilizer plants.

Basic discriminant analysis

hilo.1 <- lda(fertilizer ~ yield + weight, data = hilo)
  • Uses lda from package MASS.

  • “Predicting” group membership from measured variables.

Output

hilo.1
Call:
lda(fertilizer ~ yield + weight, data = hilo)

Prior probabilities of groups:
high  low 
 0.5  0.5 

Group means:
     yield weight
high  35.0  13.25
low   32.5  12.00

Coefficients of linear discriminants:
              LD1
yield  -0.7666761
weight -1.2513563

Things to take from output

  • Group means: high-fertilizer plants have (slightly) higher mean yield and weight than low-fertilizer plants.

  • “Coefficients of linear discriminants”: are scores constructed from observed variables that best separate the groups.

  • For any plant, get LD1 score by taking \(-0.76\) times yield plus \(-1.25\) times weight, add up, standardize.

  • the LD1 coefficients are like slopes:

    • if yield higher, LD1 score for a plant lower
    • if weight higher, LD1 score for a plant lower
  • High-fertilizer plants have higher yield and weight, thus low (negative) LD1 score. Low-fertilizer plants have low yield and weight, thus high (positive) LD1 score.

  • One LD1 score for each observation. Plot with actual groups.

How many linear discriminants?

  • Smaller of these:

    • Number of variables

    • Number of groups minus 1

  • Seed yield and weight: 2 variables, 2 groups, \(\min(2,2-1)=1\).
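A quick check on the fitted object: the scaling matrix of coefficients has one column per linear discriminant, so (as a sketch) we can count them directly:

# One column of coefficients per linear discriminant:
ncol(hilo.1$scaling)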

Getting LD scores

Feed output from LDA into predict:

p <- predict(hilo.1)
hilo.2 <- cbind(hilo, p)
hilo.2
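As a check, the LD1 scores can be reconstructed by hand from the coefficients. A minimal sketch, assuming (as predict appears to do internally) that each variable is centred at the prior-weighted average of the group means before the coefficients are applied:

# Sketch: rebuild LD1 by hand. The centring at the prior-weighted average
# of the group means is an assumption about predict's internals:
centre <- colSums(hilo.1$prior * hilo.1$means)
X <- as.matrix(hilo[, c("yield", "weight")])
scale(X, center = centre, scale = FALSE) %*% hilo.1$scaling
# should match the LD1 column of hilo.2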

LD1 scores in order

Most positive LD1 score is most obviously low fertilizer, most negative is most obviously high:

hilo.2 %>% select(fertilizer, yield, weight, LD1) %>% 
  arrange(desc(LD1))

High-fertilizer plants have high yield and weight, and hence negative LD1 scores.

Plotting LD1 scores

With only one LD score, plot it against the (true) groups, e.g. as a boxplot:

ggplot(hilo.2, aes(x = fertilizer, y = LD1)) + geom_boxplot()

What else is in hilo.2?

  • class: predicted fertilizer level (based on values of yield and weight).

  • posterior: predicted probability of being low or high fertilizer given yield and weight.

  • LD1: scores for (each) linear discriminant (here is only LD1) on each observation.
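These are exactly the components of the list that predict returned, which cbind spread into columns:

# The pieces in the output from predict:
names(p)
# "class"     "posterior" "x"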

Predictions and predicted groups

based on yield and weight:

hilo.2 %>% select(yield, weight, fertilizer, class)

Count up correct and incorrect classifications:

with(hilo.2, table(obs = fertilizer, pred = class))
      pred
obs    high low
  high    4   0
  low     0   4
  • Each predicted fertilizer level is exactly the same as the observed one (perfect prediction).

  • Table shows no errors: all values on top-left to bottom-right diagonal.
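The table can also be summarized as an overall misclassification rate; a one-line sketch:

# Proportion of plants misclassified (zero here):
with(hilo.2, mean(fertilizer != class))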

Posterior probabilities

show how clear-cut the classification decisions were:

hilo.2 %>% 
  mutate(across(starts_with("posterior"), \(p) round(p, 4))) %>% 
  select(-LD1)

Only obs. 7 shows any doubt: its yield is low for a high-fertilizer plant, but its high weight makes up for that.
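predict will also classify brand-new observations. A sketch, with made-up values for yield and weight:

# Sketch: classify a new plant (hypothetical yield and weight):
new_plant <- tibble(yield = 34, weight = 13)
predict(hilo.1, new_plant)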

Example 2: the peanuts

my_url <- "http://ritsokiguess.site/datafiles/peanuts.txt"
peanuts <- read_delim(my_url, " ")
peanuts
  • Recall: location and variety were both significant in the MANOVA. Make a combination variable from them (over):

Location-variety combos

peanuts %>%
   unite(combo, c(variety, location)) -> peanuts.combo
peanuts.combo

Discriminant analysis

# peanuts.1 <- lda(str_c(location, variety, sep = "_") ~ y + smk + w, data = peanuts)
peanuts.1 <- lda(combo ~ y + smk + w, data = peanuts.combo)
peanuts.1
Call:
lda(combo ~ y + smk + w, data = peanuts.combo)

Prior probabilities of groups:
      5_1       5_2       6_1       6_2       8_1       8_2 
0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 

Group means:
         y    smk     w
5_1 194.80 160.40 52.55
5_2 185.05 130.30 49.95
6_1 199.45 161.40 47.80
6_2 200.15 163.95 57.25
8_1 190.25 164.80 58.20
8_2 200.75 170.30 66.10

Coefficients of linear discriminants:
           LD1         LD2         LD3
y    0.4027356  0.02967881  0.18839237
smk  0.1727459 -0.06794271 -0.09386294
w   -0.5792456 -0.16300221  0.07341123

Proportion of trace:
   LD1    LD2    LD3 
0.8424 0.1317 0.0258 

Comments

  • Now 3 LDs (3 variables, 6 groups, \(\min(3,6-1)=3\)).

  • Relationship of LDs to original variables. Look for coeffs far from zero:

peanuts.1$scaling
           LD1         LD2         LD3
y    0.4027356  0.02967881  0.18839237
smk  0.1727459 -0.06794271 -0.09386294
w   -0.5792456 -0.16300221  0.07341123
  • high LD1 mainly high y or low w.

  • high LD2 mainly low w.

  • Proportion of trace values show relative importance of LDs: LD1 much more important than LD2; LD3 worthless.
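The proportion-of-trace values can be recovered from the singular values stored in the fitted object; a sketch:

# "Proportion of trace" is the squared singular values, rescaled to sum to 1:
peanuts.1$svd^2 / sum(peanuts.1$svd^2)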

The predictions and misclassification

p <- predict(peanuts.1)
peanuts.2 <- cbind(peanuts.combo, p)
peanuts.2
with(peanuts.2, table(obs = combo, pred = class))
     pred
obs   5_1 5_2 6_1 6_2 8_1 8_2
  5_1   2   0   0   0   0   0
  5_2   0   2   0   0   0   0
  6_1   0   0   2   0   0   0
  6_2   1   0   0   1   0   0
  8_1   0   0   0   0   2   0
  8_2   0   0   0   0   0   2

Actually classified very well: only one 6_2 was classified as a 5_1; the rest were all correct.

Posterior probabilities

peanuts.2 %>% 
  mutate(across(starts_with("posterior"), \(p) round(p, 2))) %>% 
  select(combo,  class, starts_with("posterior"))

Some doubt about which combo each plant belongs in, but not too much. The one misclassified plant was a close call.

Discriminant scores, again

  • How are discriminant scores related to original variables?

  • Construct data frame with original data and discriminant scores side by side:

peanuts.1$scaling
           LD1         LD2         LD3
y    0.4027356  0.02967881  0.18839237
smk  0.1727459 -0.06794271 -0.09386294
w   -0.5792456 -0.16300221  0.07341123
  • LD1 positive if y large and/or w small.

  • LD2 positive if w small.
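Another way to see the relationship, as a sketch: correlate each original variable with the LD1 scores now sitting in peanuts.2:

# Sketch: correlations of the original variables with the LD1 scores give
# another view of what the first discriminant measures:
peanuts.2 %>% summarize(across(c(y, smk, w), \(v) cor(v, x.LD1)))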

Discriminant scores for data

peanuts.2 %>% select(y, w, starts_with("x"))
  • Obs. 5 and 6 have most positive LD1: large y, small w.

  • Obs. 4 has most positive LD2: small w.

Plot LD1 vs. LD2, labelling by combo

g <- ggplot(peanuts.2, aes(x = x.LD1, y = x.LD2, colour = combo, 
                    label = combo)) + geom_point() +
  geom_text_repel() + guides(colour = "none")
g

“Bi-plot” from ggbiplot

ggbiplot(peanuts.1, groups = factor(peanuts.combo$combo))

Installing ggbiplot

  • ggbiplot not on CRAN, so usual install.packages will not work.

  • Install package devtools first (once):

install.packages("devtools")
  • Then install ggbiplot (once):
library(devtools)
install_github("vqv/ggbiplot")

Cross-validation

  • So far, have predicted group membership from the same data used to build the classification rule — dishonest!

  • Better: cross-validation: build the rule from all observations except one, then predict group membership for that left-out observation; repeat for each observation in turn (see the sketch after this list).

  • No longer cheating!

  • Illustrate with peanuts data again.
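A sketch of what the built-in cross-validation (CV = TRUE, next page) is doing behind the scenes, written as an explicit leave-one-out loop:

# Sketch: leave-one-out by hand. For each observation, fit the model
# without it, then predict that observation's group:
n <- nrow(peanuts.combo)
cv_class <- character(n)
for (i in 1:n) {
  fit_i <- lda(combo ~ y + smk + w, data = peanuts.combo[-i, ])
  cv_class[i] <- as.character(predict(fit_i, peanuts.combo[i, ])$class)
}
table(obs = peanuts.combo$combo, pred = cv_class)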

Misclassifications

  • Fitting and prediction all in one go:
p <- lda(combo ~ y + smk + w,
  data = peanuts.combo, CV = TRUE)
peanuts.3 <- cbind(peanuts.combo, class = p$class, 
                   posterior = p$posterior)
with(peanuts.3, table(obs = combo, pred = class))
     pred
obs   5_1 5_2 6_1 6_2 8_1 8_2
  5_1   0   0   0   2   0   0
  5_2   0   1   0   0   1   0
  6_1   0   0   2   0   0   0
  6_2   1   0   0   1   0   0
  8_1   0   1   0   0   0   1
  8_2   0   0   0   0   0   2
  • Some more misclassification this time.

Repeat of LD plot

g

Posterior probabilities

peanuts.3 %>% 
  mutate(across(starts_with("posterior"), \(p) round(p, 3))) %>% 
  select(combo, class, starts_with("posterior"))

Why more misclassification?

  • When predicting group membership for one observation, the rule only gets to use the other observation in that group (each combo has just two plants).

  • So if two in a pair are far apart, or if two groups overlap, great potential for misclassification.

  • Groups 5_1 and 6_2 overlap.

  • The 5_2 closest to the 8_1s looks more like an 8_1 than a 5_2 (the other 5_2 being far away).

  • The 8_1s are relatively far apart and close to other groups, so one appears to be a 5_2 and the other an 8_2.

Example 3: professions and leisure activities

  • 15 individuals from three different professions (politicians, administrators and belly dancers) each participate in four different leisure activities: reading, dancing, TV watching and skiing. After each activity they rate it on a 0–10 scale.

  • How can we best use the scores on the activities to predict a person’s profession?

  • Or, what combination(s) of scores best separate data into profession groups?

The data

my_url <- "http://ritsokiguess.site/datafiles/profile.txt"
active <- read_delim(my_url, " ")
active

Discriminant analysis

active.1 <- lda(job ~ reading + dance + tv + ski, data = active)
active.1
Call:
lda(job ~ reading + dance + tv + ski, data = active)

Prior probabilities of groups:
      admin bellydancer  politician 
  0.3333333   0.3333333   0.3333333 

Group means:
            reading dance  tv ski
admin           5.0   2.0 1.8 3.8
bellydancer     6.6   9.4 5.8 7.4
politician      5.0   4.8 5.2 5.0

Coefficients of linear discriminants:
                LD1        LD2
reading -0.01297465 -0.4748081
dance   -0.95212396 -0.4614976
tv      -0.47417264  1.2446327
ski      0.04153684 -0.2033122

Proportion of trace:
   LD1    LD2 
0.8917 0.1083 

Comments

  • Two discriminants, the first a fair bit more important than the second.

  • LD1 depends (negatively) most on dance, a bit on tv.

  • LD2 depends mostly (positively) on tv.

Misclassification

p <- predict(active.1)
active.2 <- cbind(active, p)
with(active.2, table(obs = job, pred = class))
             pred
obs           admin bellydancer politician
  admin           5           0          0
  bellydancer     0           5          0
  politician      0           0          5

Everyone correctly classified.

Plotting LDs

g <- ggplot(active.2, aes(x = x.LD1, y = x.LD2, colour = job, label = job)) + 
  geom_point() + geom_text_repel() + guides(colour = "none")
g

Biplot

ggbiplot(active.1, groups = active$job)

Comments on plot

  • Groups well separated: bellydancers lower left, administrators lower right, politicians top middle.

  • Bellydancers most negative on LD1: like dancing most.

  • Administrators most positive on LD1: like dancing least.

  • Politicians most positive on LD2: high TV-watching relative to their other activities.

Plotting individual persons

Make the label the identifier of each person. Now we need a legend:

active.2 %>% mutate(person = row_number()) %>% 
  ggplot(aes(x = x.LD1, y = x.LD2,  colour = job, 
               label = person)) + 
  geom_point() + geom_text_repel()

Posterior probabilities

active.2 %>% mutate(across(starts_with("posterior"), \(p) round(p, 3))) %>% 
  select(job, class, starts_with("posterior"))

Not much doubt.

Cross-validating the jobs-activities data

Recall: with CV = TRUE, there is no need for predict:

p <- lda(job ~ reading + dance + tv + ski, data = active, CV = TRUE)
active.3 <- cbind(active, class = p$class, posterior = p$posterior)
with(active.3, table(obs = job, pred = class))
             pred
obs           admin bellydancer politician
  admin           5           0          0
  bellydancer     0           4          1
  politician      0           0          5

This time one of the bellydancers was classified as a politician.

and look at the posterior probabilities

active.3 %>% 
  mutate(across(starts_with("posterior"), \(p) round(p, 3))) %>% 
  select(job, class, starts_with("post"))

Comments

  • Bellydancer was “definitely” a politician!

  • One of the administrators might have been a politician too.

Why did things get misclassified?

Go back to plot of discriminant scores:

  • one bellydancer much closer to the politicians,

  • one administrator a bit closer to the politicians.

Example 4: remote-sensing data

  • View 25 crops from air, measure 4 variables x1-x4.

  • Go back and record what each crop was.

  • Can we use the 4 variables to distinguish crops?

The data

my_url <- "http://ritsokiguess.site/datafiles/remote-sensing.txt"
crops <- read_table(my_url)
crops %>% print(n = 25)
# A tibble: 25 × 6
   crop          x1    x2    x3    x4 cr   
   <chr>      <dbl> <dbl> <dbl> <dbl> <chr>
 1 Corn          16    27    31    33 r    
 2 Corn          15    23    30    30 r    
 3 Corn          16    27    27    26 r    
 4 Corn          18    20    25    23 r    
 5 Corn          15    15    31    32 r    
 6 Corn          15    32    32    15 r    
 7 Corn          12    15    16    73 r    
 8 Soybeans      20    23    23    25 y    
 9 Soybeans      24    24    25    32 y    
10 Soybeans      21    25    23    24 y    
11 Soybeans      27    45    24    12 y    
12 Soybeans      12    13    15    42 y    
13 Soybeans      22    32    31    43 y    
14 Cotton        31    32    33    34 t    
15 Cotton        29    24    26    28 t    
16 Cotton        34    32    28    45 t    
17 Cotton        26    25    23    24 t    
18 Cotton        53    48    75    26 t    
19 Cotton        34    35    25    78 t    
20 Sugarbeets    22    23    25    42 g    
21 Sugarbeets    25    25    24    26 g    
22 Sugarbeets    34    25    16    52 g    
23 Sugarbeets    54    23    21    54 g    
24 Sugarbeets    25    43    32    15 g    
25 Sugarbeets    26    54     2    54 g    

Discriminant analysis

crops.1 <- lda(crop ~ x1 + x2 + x3 + x4, data = crops)
crops.1
Call:
lda(crop ~ x1 + x2 + x3 + x4, data = crops)

Prior probabilities of groups:
      Corn     Cotton   Soybeans Sugarbeets 
      0.28       0.24       0.24       0.24 

Group means:
                 x1       x2       x3       x4
Corn       15.28571 22.71429 27.42857 33.14286
Cotton     34.50000 32.66667 35.00000 39.16667
Soybeans   21.00000 27.00000 23.50000 29.66667
Sugarbeets 31.00000 32.16667 20.00000 40.50000

Coefficients of linear discriminants:
           LD1          LD2           LD3
x1  0.14077479  0.007780184 -0.0312610362
x2  0.03006972  0.007318386  0.0085401510
x3 -0.06363974 -0.099520895 -0.0005309869
x4 -0.00677414 -0.035612707  0.0577718649

Proportion of trace:
   LD1    LD2    LD3 
0.8044 0.1832 0.0124 

Assessing

  • 3 LDs (4 variables, 4 groups, \(\min(4,4-1)=3\)).

  • First two important.

  • LD1 mostly x1 (positive).

  • LD2 mostly x3 (negative).

Predictions

  • Thus:
p <- predict(crops.1)
crops.2 <- cbind(crops, p)
with(crops.2, table(obs = crop, pred = class))
            pred
obs          Corn Cotton Soybeans Sugarbeets
  Corn          6      0        1          0
  Cotton        0      4        2          0
  Soybeans      2      0        3          1
  Sugarbeets    0      0        3          3
  • Not very good: e.g. only half of the Soybeans and half of the Sugarbeets were classified correctly.

Plotting the LDs

ggplot(crops.2, aes(x = x.LD1, y = x.LD2, colour = crop)) +
  geom_point()

Corn (red) mostly left, cotton (green) sort of right, soybeans and sugarbeets (blue and purple) mixed up.

Biplot

ggbiplot(crops.1, groups = crops$crop)

Comments

  • Corn low on LD1 (left), hence low on x1

  • Cotton tends to be high on LD1 (high x1)

  • one cotton very low on LD2 (high x3?)

  • Rather mixed up.

Posterior probs (some)

crops.2 %>% mutate(across(starts_with("posterior"), \(p) round(p, 3))) %>% 
  filter(crop != class) %>% 
  select(crop, class, starts_with("posterior"))

Comments

  • These were the misclassified ones, but the posterior probability of being correct was not usually too low.

  • The correctly-classified ones are not very clear-cut either.

MANOVA

We began discriminant analysis as a follow-up to MANOVA. Do our variables significantly separate the crops?

response <- with(crops, cbind(x1, x2, x3, x4))
crops.manova <- manova(response ~ crop, data = crops)
summary(crops.manova)
          Df Pillai approx F num Df den Df  Pr(>F)  
crop       3 0.9113   2.1815     12     60 0.02416 *
Residuals 21                                        
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Box’s M test

We should also run Box’s M test to check the assumption of equal covariance matrices across crops:

summary(BoxM(response, crops$crop))
       Box's M Test 

Chi-Squared Value = 69.42634 , df = 30  and p-value: 5.79e-05 
  • Apparently (from the MANOVA) at least one of the crops differs in means from the others, so it was worth doing this analysis.

  • But the P-value for the M test is smaller even than our guideline of 0.001, so we should not take that MANOVA seriously.

  • We did this the wrong way around, though!

The right way around

  • First, do a MANOVA to see whether any of the groups differ significantly on any of the variables.

  • Check that the MANOVA is believable by using Box’s M test.

  • If the MANOVA is significant, do a discriminant analysis in the hopes of understanding how the groups are different.

  • For remote-sensing data (without Clover):

    • LD1 a fair bit more important than LD2 (definitely ignore LD3).

    • LD1 depends mostly on x1, on which Cotton was high and Corn was low.

  • Discriminant analysis in MANOVA plays the same kind of role that Tukey does in ANOVA.