Discriminant Analysis

Discriminant analysis

  • ANOVA and MANOVA: predict a (counted/measured) response from group membership.

  • Discriminant analysis: predict group membership based on counted/measured variables.

  • Covers same ground as logistic regression (and its variations), but emphasis on classifying observed data into correct groups.

  • Does so by searching for the linear combination of the original variables that best separates the data into groups (canonical variables).

  • Assumption here that the groups are known (for the data we have). If trying to “best separate” data into unknown groups, see cluster analysis.

Packages

library(MASS, exclude = "select")
library(tidyverse)
library(ggrepel)
library(ggbiplot)
library(MVTests) # for Box M test
library(conflicted)
conflict_prefer("arrange", "dplyr")
conflict_prefer("summarize", "dplyr")
conflict_prefer("select", "dplyr")
conflict_prefer("filter", "dplyr")
conflict_prefer("mutate", "dplyr")
  • ggrepel allows labelling points on a plot so they don’t overwrite each other.
  • ggbiplot uses plyr rather than dplyr; these two packages have functions with the same names that do different things.

About select

  • Both dplyr (in tidyverse) and MASS have a function called select, and they do different things.

  • How do you know which select is going to get called?

  • With library, the one loaded last is visible, and others are not.

  • Thus we can access the select in dplyr but not the one in MASS. If we wanted that one, we’d have to say MASS::select.

  • Better: load the conflicted package. Any time you call a function whose name appears in two loaded packages, you get an error and have to choose between them.
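For example, a minimal sketch of making the choice (mtcars is just a convenient built-in dataset, not part of this example):

# With conflicted loaded, the conflict_prefer calls above settle the
# conflict, so a plain select() is dplyr's:
mtcars %>% select(mpg, cyl) %>% head(3)
# To reach the MASS version explicitly, qualify the call: MASS::select(...)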

Example 1: seed yields and weights

my_url <- "http://ritsokiguess.site/datafiles/manova1.txt"
hilo <- read_delim(my_url, " ")
g <- ggplot(hilo, aes(x = yield, y = weight,
  colour = fertilizer)) + geom_point(size = 4)
g

Recall this dataset from MANOVA: we needed a multivariate analysis to find a difference in seed yield and weight between the high-fertilizer and low-fertilizer plants.

Basic discriminant analysis

hilo.1 <- lda(fertilizer ~ yield + weight, data = hilo)
  • Uses lda from package MASS.

  • “Predicting” group membership from measured variables.

Output

hilo.1
Call:
lda(fertilizer ~ yield + weight, data = hilo)

Prior probabilities of groups:
high  low 
 0.5  0.5 

Group means:
     yield weight
high  35.0  13.25
low   32.5  12.00

Coefficients of linear discriminants:
              LD1
yield  -0.7666761
weight -1.2513563

Things to take from output

  • Group means: high-fertilizer plants have (slightly) higher mean yield and weight than low-fertilizer plants.

  • “Coefficients of linear discriminants”: are scores constructed from observed variables that best separate the groups.

  • For any plant, get LD1 score by taking \(-0.76\) times yield plus \(-1.25\) times weight, add up, standardize.

  • the LD1 coefficients are like slopes:

    • if yield higher, LD1 score for a plant lower
    • if weight higher, LD1 score for a plant lower
  • High-fertilizer plants have higher yield and weight, thus low (negative) LD1 score. Low-fertilizer plants have low yield and weight, thus high (positive) LD1 score.

  • One LD1 score for each observation. Plot with actual groups.

How many linear discriminants?

  • Smaller of these:

    • Number of variables

    • Number of groups minus 1

  • Seed yield and weight: 2 variables, 2 groups, \(\min(2,2-1)=1\).
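A quick check on the fitted object: the scaling matrix of coefficients has one column per linear discriminant, so (as a sketch) we can count them directly:

# One column of coefficients per linear discriminant:
ncol(hilo.1$scaling)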

Getting LD scores

Feed output from LDA into predict:

p <- predict(hilo.1)
hilo.2 <- cbind(hilo, p)
hilo.2
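As a check, the LD1 scores can be reconstructed by hand from the coefficients. A minimal sketch, assuming (as predict appears to do internally) that each variable is centred at the prior-weighted average of the group means before the coefficients are applied:

# Sketch: rebuild LD1 by hand. The centring at the prior-weighted average
# of the group means is an assumption about predict's internals:
centre <- colSums(hilo.1$prior * hilo.1$means)
X <- as.matrix(hilo[, c("yield", "weight")])
scale(X, center = centre, scale = FALSE) %*% hilo.1$scaling
# should match the LD1 column of hilo.2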

LD1 scores in order

Most positive LD1 score is most obviously low fertilizer, most negative is most obviously high:

hilo.2 %>% select(fertilizer, yield, weight, LD1) %>% 
  arrange(desc(LD1))

High-fertilizer plants have high yield and weight, and hence negative LD1 scores.

Plotting LD1 scores

With only one LD score, plot it against the (true) groups, e.g. as a boxplot:

ggplot(hilo.2, aes(x = fertilizer, y = LD1)) + geom_boxplot()

What else is in hilo.2?

  • class: predicted fertilizer level (based on values of yield and weight).

  • posterior: predicted probability of being low or high fertilizer given yield and weight.

  • LD1: scores for (each) linear discriminant (here is only LD1) on each observation.
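These are exactly the components of the list that predict returned, which cbind spread into columns:

# The pieces in the output from predict:
names(p)
# "class"     "posterior" "x"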

Predictions and predicted groups

based on yield and weight:

hilo.2 %>% select(yield, weight, fertilizer, class)

Count up correct and incorrect classifications:

with(hilo.2, table(obs = fertilizer, pred = class))
      pred
obs    high low
  high    4   0
  low     0   4
  • Each predicted fertilizer level is exactly the same as the observed one (perfect prediction).

  • Table shows no errors: all values on top-left to bottom-right diagonal.
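The table can also be summarized as an overall misclassification rate; a one-line sketch:

# Proportion of plants misclassified (zero here):
with(hilo.2, mean(fertilizer != class))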

Posterior probabilities

show how clear-cut the classification decisions were:

hilo.2 %>% 
  mutate(across(starts_with("posterior"), \(p) round(p, 4))) %>% 
  select(-LD1)

Only obs. 7 shows any doubt: its yield is low for a high-fertilizer plant, but its high weight makes up for that.
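predict will also classify brand-new observations. A sketch, with made-up values for yield and weight:

# Sketch: classify a new plant (hypothetical yield and weight):
new_plant <- tibble(yield = 34, weight = 13)
predict(hilo.1, new_plant)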

Example 2: the peanuts

my_url <- "http://ritsokiguess.site/datafiles/peanuts.txt"
peanuts <- read_delim(my_url, " ")
peanuts
  • Recall: location and variety were both significant in the MANOVA. Make a combination variable from them (over):

Location-variety combos

peanuts %>%
   unite(combo, c(variety, location)) -> peanuts.combo
peanuts.combo

Discriminant analysis

# peanuts.1 <- lda(str_c(location, variety, sep = "_") ~ y + smk + w, data = peanuts)
peanuts.1 <- lda(combo ~ y + smk + w, data = peanuts.combo)
peanuts.1
Call:
lda(combo ~ y + smk + w, data = peanuts.combo)

Prior probabilities of groups:
      5_1       5_2       6_1       6_2       8_1       8_2 
0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 

Group means:
         y    smk     w
5_1 194.80 160.40 52.55
5_2 185.05 130.30 49.95
6_1 199.45 161.40 47.80
6_2 200.15 163.95 57.25
8_1 190.25 164.80 58.20
8_2 200.75 170.30 66.10

Coefficients of linear discriminants:
           LD1         LD2         LD3
y    0.4027356  0.02967881  0.18839237
smk  0.1727459 -0.06794271 -0.09386294
w   -0.5792456 -0.16300221  0.07341123

Proportion of trace:
   LD1    LD2    LD3 
0.8424 0.1317 0.0258 

Comments

  • Now 3 LDs (3 variables, 6 groups, \(\min(3,6-1)=3\)).

  • Relationship of LDs to original variables. Look for coeffs far from zero:

peanuts.1$scaling
           LD1         LD2         LD3
y    0.4027356  0.02967881  0.18839237
smk  0.1727459 -0.06794271 -0.09386294
w   -0.5792456 -0.16300221  0.07341123
  • high LD1 mainly high y or low w.

  • high LD2 mainly low w.

  • Proportion of trace values show relative importance of LDs: LD1 much more important than LD2; LD3 worthless.
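The proportion-of-trace values can be recovered from the singular values stored in the fitted object; a sketch:

# "Proportion of trace" is the squared singular values, rescaled to sum to 1:
peanuts.1$svd^2 / sum(peanuts.1$svd^2)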

The predictions and misclassification

p <- predict(peanuts.1)
peanuts.2 <- cbind(peanuts.combo, p)
peanuts.2
with(peanuts.2, table(obs = combo, pred = class))
     pred
obs   5_1 5_2 6_1 6_2 8_1 8_2
  5_1   2   0   0   0   0   0
  5_2   0   2   0   0   0   0
  6_1   0   0   2   0   0   0
  6_2   1   0   0   1   0   0
  8_1   0   0   0   0   2   0
  8_2   0   0   0   0   0   2

Actually classified very well: only one 6_2 was classified as a 5_1; the rest were all correct.

Posterior probabilities

peanuts.2 %>% 
  mutate(across(starts_with("posterior"), \(p) round(p, 2))) %>% 
  select(combo,  class, starts_with("posterior"))

Some doubt about which combo each plant belongs in, but not too much. The one misclassified plant was a close call.

Discriminant scores, again

  • How are discriminant scores related to original variables?

  • Construct data frame with original data and discriminant scores side by side:

peanuts.1$scaling
           LD1         LD2         LD3
y    0.4027356  0.02967881  0.18839237
smk  0.1727459 -0.06794271 -0.09386294
w   -0.5792456 -0.16300221  0.07341123
  • LD1 positive if y large and/or w small.

  • LD2 positive if w small.
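Another way to see the relationship, as a sketch: correlate each original variable with the LD1 scores now sitting in peanuts.2:

# Sketch: correlations of the original variables with the LD1 scores give
# another view of what the first discriminant measures:
peanuts.2 %>% summarize(across(c(y, smk, w), \(v) cor(v, x.LD1)))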

Discriminant scores for data

peanuts.2 %>% select(y, w, starts_with("x"))
  • Obs. 5 and 6 have most positive LD1: large y, small w.

  • Obs. 4 has most positive LD2: small w.

Plot LD1 vs. LD2, labelling by combo

g <- ggplot(peanuts.2, aes(x = x.LD1, y = x.LD2, colour = combo, 
                    label = combo)) + geom_point() +
  geom_text_repel() + guides(colour = "none")
g

“Bi-plot” from ggbiplot

ggbiplot(peanuts.1, groups = factor(peanuts.combo$combo))

Installing ggbiplot

  • ggbiplot not on CRAN, so usual install.packages will not work.

  • Install package devtools first (once):

install.packages("devtools")
  • Then install ggbiplot (once):
library(devtools)
install_github("vqv/ggbiplot")

Cross-validation

  • So far, have predicted group membership from the same data used to build the classification rule — dishonest!

  • Better: cross-validation: build the rule from all observations except one, then predict group membership for that left-out observation; repeat for each observation in turn (see the sketch after this list).

  • No longer cheating!

  • Illustrate with peanuts data again.
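A sketch of what the built-in cross-validation (CV = TRUE, next page) is doing behind the scenes, written as an explicit leave-one-out loop:

# Sketch: leave-one-out by hand. For each observation, fit the model
# without it, then predict that observation's group:
n <- nrow(peanuts.combo)
cv_class <- character(n)
for (i in 1:n) {
  fit_i <- lda(combo ~ y + smk + w, data = peanuts.combo[-i, ])
  cv_class[i] <- as.character(predict(fit_i, peanuts.combo[i, ])$class)
}
table(obs = peanuts.combo$combo, pred = cv_class)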

Misclassifications

  • Fitting and prediction all in one go:
p <- lda(combo ~ y + smk + w,
  data = peanuts.combo, CV = TRUE)
peanuts.3 <- cbind(peanuts.combo, class = p$class, 
                   posterior = p$posterior)
with(peanuts.3, table(obs = combo, pred = class))
     pred
obs   5_1 5_2 6_1 6_2 8_1 8_2
  5_1   0   0   0   2   0   0
  5_2   0   1   0   0   1   0
  6_1   0   0   2   0   0   0
  6_2   1   0   0   1   0   0
  8_1   0   1   0   0   0   1
  8_2   0   0   0   0   0   2
  • Some more misclassification this time.

Repeat of LD plot

g

Posterior probabilities

peanuts.3 %>% 
  mutate(across(starts_with("posterior"), \(p) round(p, 3))) %>% 
  select(combo, class, starts_with("posterior"))

Why more misclassification?

  • When predicting group membership for one observation, the rule only gets to use the other observation in that group (each combo has just two plants).

  • So if two in a pair are far apart, or if two groups overlap, great potential for misclassification.

  • Groups 5_1 and 6_2 overlap.

  • The 5_2 closest to the 8_1s looks more like an 8_1 than a 5_2 (the other 5_2 being far away).

  • The 8_1s are relatively far apart and close to other groups, so one appears to be a 5_2 and the other an 8_2.

Example 3: professions and leisure activities

  • 15 individuals from three different professions (politicians, administrators and belly dancers) each participate in four different leisure activities: reading, dancing, TV watching and skiing. After each activity they rate it on a 0–10 scale.

  • How can we best use the scores on the activities to predict a person’s profession?

  • Or, what combination(s) of scores best separate data into profession groups?

The data

my_url <- "http://ritsokiguess.site/datafiles/profile.txt"
active <- read_delim(my_url, " ")
active

Discriminant analysis

active.1 <- lda(job ~ reading + dance + tv + ski, data = active)
active.1
Call:
lda(job ~ reading + dance + tv + ski, data = active)

Prior probabilities of groups:
      admin bellydancer  politician 
  0.3333333   0.3333333   0.3333333 

Group means:
            reading dance  tv ski
admin           5.0   2.0 1.8 3.8
bellydancer     6.6   9.4 5.8 7.4
politician      5.0   4.8 5.2 5.0

Coefficients of linear discriminants:
                LD1        LD2
reading -0.01297465 -0.4748081
dance   -0.95212396 -0.4614976
tv      -0.47417264  1.2446327
ski      0.04153684 -0.2033122

Proportion of trace:
   LD1    LD2 
0.8917 0.1083 

Comments

  • Two discriminants, the first a fair bit more important than the second.

  • LD1 depends (negatively) most on dance, a bit on tv.

  • LD2 depends mostly (positively) on tv.

Misclassification

p <- predict(active.1)
active.2 <- cbind(active, p)
with(active.2, table(obs = job, pred = class))
             pred
obs           admin bellydancer politician
  admin           5           0          0
  bellydancer     0           5          0
  politician      0           0          5

Everyone correctly classified.

Plotting LDs

g <- ggplot(active.2, aes(x = x.LD1, y = x.LD2, colour = job, label = job)) + 
  geom_point() + geom_text_repel() + guides(colour = "none")
g

Biplot

ggbiplot(active.1, groups = active$job)

Comments on plot

  • Groups well separated: bellydancers lower left, administrators lower right, politicians top middle.

  • Bellydancers most negative on LD1: like dancing most.

  • Administrators most positive on LD1: like dancing least.

  • Politicians most positive on LD2: high TV-watching relative to their other activities.

Plotting individual persons

Make the label the identifier of each person. Now we need a legend:

active.2 %>% mutate(person = row_number()) %>% 
  ggplot(aes(x = x.LD1, y = x.LD2,  colour = job, 
               label = person)) + 
  geom_point() + geom_text_repel()

Posterior probabilities

active.2 %>% mutate(across(starts_with("posterior"), \(p) round(p, 3))) %>% 
  select(job, class, starts_with("posterior"))

Not much doubt.

Cross-validating the jobs-activities data

Recall: with CV = TRUE, there is no need for predict:

p <- lda(job ~ reading + dance + tv + ski, data = active, CV = TRUE)
active.3 <- cbind(active, class = p$class, posterior = p$posterior)
with(active.3, table(obs = job, pred = class))
             pred
obs           admin bellydancer politician
  admin           5           0          0
  bellydancer     0           4          1
  politician      0           0          5

This time one of the bellydancers was classified as a politician.

and look at the posterior probabilities

active.3 %>% 
  mutate(across(starts_with("posterior"), \(p) round(p, 3))) %>% 
  select(job, class, starts_with("post"))

Comments

  • Bellydancer was “definitely” a politician!

  • One of the administrators might have been a politician too.

Why did things get misclassified?

Go back to plot of discriminant scores:

  • one bellydancer much closer to the politicians,

  • one administrator a bit closer to the politicians.

Example 4: remote-sensing data

  • View 25 crops from air, measure 4 variables x1-x4.

  • Go back and record what each crop was.

  • Can we use the 4 variables to distinguish crops?

The data

my_url <- "http://ritsokiguess.site/datafiles/remote-sensing.txt"
crops <- read_table(my_url)
crops %>% print(n = 25)
# A tibble: 25 × 6
   crop          x1    x2    x3    x4 cr   
   <chr>      <dbl> <dbl> <dbl> <dbl> <chr>
 1 Corn          16    27    31    33 r    
 2 Corn          15    23    30    30 r    
 3 Corn          16    27    27    26 r    
 4 Corn          18    20    25    23 r    
 5 Corn          15    15    31    32 r    
 6 Corn          15    32    32    15 r    
 7 Corn          12    15    16    73 r    
 8 Soybeans      20    23    23    25 y    
 9 Soybeans      24    24    25    32 y    
10 Soybeans      21    25    23    24 y    
11 Soybeans      27    45    24    12 y    
12 Soybeans      12    13    15    42 y    
13 Soybeans      22    32    31    43 y    
14 Cotton        31    32    33    34 t    
15 Cotton        29    24    26    28 t    
16 Cotton        34    32    28    45 t    
17 Cotton        26    25    23    24 t    
18 Cotton        53    48    75    26 t    
19 Cotton        34    35    25    78 t    
20 Sugarbeets    22    23    25    42 g    
21 Sugarbeets    25    25    24    26 g    
22 Sugarbeets    34    25    16    52 g    
23 Sugarbeets    54    23    21    54 g    
24 Sugarbeets    25    43    32    15 g    
25 Sugarbeets    26    54     2    54 g    

Discriminant analysis

crops.1 <- lda(crop ~ x1 + x2 + x3 + x4, data = crops)
crops.1
Call:
lda(crop ~ x1 + x2 + x3 + x4, data = crops)

Prior probabilities of groups:
      Corn     Cotton   Soybeans Sugarbeets 
      0.28       0.24       0.24       0.24 

Group means:
                 x1       x2       x3       x4
Corn       15.28571 22.71429 27.42857 33.14286
Cotton     34.50000 32.66667 35.00000 39.16667
Soybeans   21.00000 27.00000 23.50000 29.66667
Sugarbeets 31.00000 32.16667 20.00000 40.50000

Coefficients of linear discriminants:
           LD1          LD2           LD3
x1  0.14077479  0.007780184 -0.0312610362
x2  0.03006972  0.007318386  0.0085401510
x3 -0.06363974 -0.099520895 -0.0005309869
x4 -0.00677414 -0.035612707  0.0577718649

Proportion of trace:
   LD1    LD2    LD3 
0.8044 0.1832 0.0124 

Assessing

  • 3 LDs (4 variables, 4 groups, \(\min(4,4-1)=3\)).

  • First two important.

  • LD1 mostly x1 (positive).

  • LD2 mostly x3 (negative).

Predictions

  • Thus:
p <- predict(crops.1)
crops.2 <- cbind(crops, p)
with(crops.2, table(obs = crop, pred = class))
            pred
obs          Corn Cotton Soybeans Sugarbeets
  Corn          6      0        1          0
  Cotton        0      4        2          0
  Soybeans      2      0        3          1
  Sugarbeets    0      0        3          3
  • Not very good: e.g. only half of the Soybeans and half of the Sugarbeets were classified correctly.

Plotting the LDs

ggplot(crops.2, aes(x = x.LD1, y = x.LD2, colour = crop)) +
  geom_point()

Corn (red) mostly left, cotton (green) sort of right, soybeans and sugarbeets (blue and purple) mixed up.

Biplot

ggbiplot(crops.1, groups = crops$crop)

Comments

  • Corn low on LD1 (left), hence low on x1

  • Cotton tends to be high on LD1 (high x1)

  • one cotton very low on LD2 (high x3?)

  • Rather mixed up.

Posterior probs (some)

crops.2 %>% mutate(across(starts_with("posterior"), \(p) round(p, 3))) %>% 
  filter(crop != class) %>% 
  select(crop, class, starts_with("posterior"))

Comments

  • These were the misclassified ones, but the posterior probability of being correct was not usually too low.

  • The correctly-classified ones are not very clear-cut either.

MANOVA

We began discriminant analysis as a follow-up to MANOVA. Do our variables significantly separate the crops?

response <- with(crops, cbind(x1, x2, x3, x4))
crops.manova <- manova(response ~ crop, data = crops)
summary(crops.manova)
          Df Pillai approx F num Df den Df  Pr(>F)  
crop       3 0.9113   2.1815     12     60 0.02416 *
Residuals 21                                        
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Box’s M test

We should also run Box’s M test to check the assumption of equal covariance matrices across crops:

summary(BoxM(response, crops$crop))
       Box's M Test 

Chi-Squared Value = 69.42634 , df = 30  and p-value: 5.79e-05 
  • Apparently (from the MANOVA) at least one of the crops differs in means from the others, so it was worth doing this analysis.

  • But the P-value for the M test is smaller even than our guideline of 0.001, so we should not take that MANOVA seriously.

  • We did this the wrong way around, though!

The right way around

  • First, do a MANOVA to see whether any of the groups differ significantly on any of the variables.

  • Check that the MANOVA is believable by using Box’s M test.

  • If the MANOVA is significant, do a discriminant analysis in the hopes of understanding how the groups are different.

  • For remote-sensing data (without Clover):

    • LD1 a fair bit more important than LD2 (definitely ignore LD3).

    • LD1 depends mostly on x1, on which Cotton was high and Corn was low.

  • Discriminant analysis in MANOVA plays the same kind of role that Tukey does in ANOVA.