---
title: "Discriminant Analysis"
editor:
markdown:
wrap: 72
---
## Discriminant analysis
- ANOVA and MANOVA: predict a (counted/measured) response from group
membership.
- Discriminant analysis: predict group membership based on
counted/measured variables.
- Covers same ground as logistic regression (and its variations), but
emphasis on classifying observed data into correct groups.
- Does so by searching for linear combination of original variables
that best separates data into groups (canonical variables).
- Assumption here that groups are known (for data we have). If trying
to "best separate" data into unknown groups, see *cluster analysis*.
## Packages
```{r bDiscrim-1, message=FALSE, warning=FALSE}
library(MASS, exclude = "select")
library(tidyverse)
library(ggrepel)
library(ggbiplot)
library(MVTests) # for Box M test
library(conflicted)
conflict_prefer("arrange", "dplyr")
conflict_prefer("summarize", "dplyr")
conflict_prefer("select", "dplyr")
conflict_prefer("filter", "dplyr")
conflict_prefer("mutate", "dplyr")
```
- `ggrepel` allows labelling points on a plot so they don't overwrite
each other.
- `ggbiplot` uses `plyr` rather than `dplyr`, which has functions by
similar names.
## About `select`
- Both `dplyr` (in `tidyverse`) and `MASS` have a function called
`select`, and *they do different things*.
- How do you know which `select` is going to get called?
- With `library`, the one loaded *last* is visible, and others are
not.
- Thus we can access the `select` in `dplyr` but not the one in
`MASS`. If we wanted that one, we'd have to say `MASS::select`.
- Better: load `conflicted` package. Any time you load two packages
containing functions with same name, you get error and have to
choose between them.
## Example 1: seed yields and weights
```{r bDiscrim-2, message=FALSE}
my_url <- "http://ritsokiguess.site/datafiles/manova1.txt"
hilo <- read_delim(my_url, " ")
g <- ggplot(hilo, aes(x = yield, y = weight,
colour = fertilizer)) + geom_point(size = 4)
```
::: columns
::: {.column width="40%"}
Recall data from MANOVA: needed a multivariate analysis to find
difference in seed yield and weight based on whether they were high or
low fertilizer.
:::
::: {.column width="60%"}
```{r}
#| echo = FALSE
g
```
:::
:::
## Basic discriminant analysis
```{r bDiscrim-4 }
hilo.1 <- lda(fertilizer ~ yield + weight, data = hilo)
```
- Uses `lda` from package MASS.
- "Predicting" group membership from measured variables.
## Output
\small
```{r bDiscrim-5}
hilo.1
```
\normalsize
## Things to take from output
- Group means: high-fertilizer plants have (slightly) higher mean
yield and weight than low-fertilizer plants.
- "Coefficients of linear discriminants": \texttt{LD1,
LD2,}\ldots are scores constructed from observed variables that best
separate the groups.
- For any plant, get LD1 score by taking $-0.76$ times yield plus
$-1.25$ times weight, add up, standardize.
- the LD1 coefficients are like slopes:
- if yield higher, LD1 score for a plant lower
- if weight higher, LD1 score for a plant lower
- High-fertilizer plants have higher yield and weight, thus low
(negative) LD1 score. Low-fertilizer plants have low yield and
weight, thus high (positive) LD1 score.
- One LD1 score for each observation. Plot with actual groups.
## How many linear discriminants?
- Smaller of these:
- Number of variables
- Number of groups *minus 1*
- Seed yield and weight: 2 variables, 2 groups, $\min(2,2-1)=1$.
## Getting LD scores
Feed output from LDA into `predict`:
\scriptsize
```{r bDiscrim-6 }
p <- predict(hilo.1)
hilo.2 <- cbind(hilo, p)
hilo.2
```
\normalsize
## LD1 scores in order
Most positive LD1 score is most obviously low fertilizer, most negative
is most obviously high:
\footnotesize
```{r bDiscrim-7}
hilo.2 %>% select(fertilizer, yield, weight, LD1) %>%
arrange(desc(LD1))
```
\normalsize
High fertilizer have yield and weight high, negative LD1 scores.
## Plotting LD1 scores
With one LD score, plot against (true) groups, eg. boxplot:
```{r bDiscrim-8, fig.height=3.4}
ggplot(hilo.2, aes(x = fertilizer, y = LD1)) + geom_boxplot()
```
## What else is in `hilo.2`?
- `class`: predicted fertilizer level (based on values of `yield` and
`weight`).
- `posterior`: predicted probability of being low or high fertilizer
given `yield` and `weight`.
- `LD1`: scores for (each) linear discriminant (here is only LD1) on
each observation.
## Predictions and predicted groups
\ldots based on `yield` and `weight`:
\footnotesize
```{r bDiscrim-12}
hilo.2 %>% select(yield, weight, fertilizer, class)
```
\normalsize
## Count up correct and incorrect classificationot()
```{r}
with(hilo.2, table(obs = fertilizer, pred = class))
```
- Each predicted fertilizer level is exactly same as observed one
(perfect prediction).
- Table shows no errors: all values on top-left to bottom-right
diagonal.
## Posterior probabilities
\footnotesize
show how clear-cut the classification decisions were:
```{r bDiscrim-13}
hilo.2 %>%
mutate(across(starts_with("posterior"), \(p) round(p, 4))) %>%
select(-LD1)
```
Only obs. 7 has any doubt: `yield` low for a high-fertilizer, but high
`weight` makes up for it.
\normalsize
## Example 2: the peanuts
\scriptsize
```{r bDiscrim-14, message=F}
my_url <- "http://ritsokiguess.site/datafiles/peanuts.txt"
peanuts <- read_delim(my_url, " ")
peanuts
```
\normalsize
- Recall: `location` and `variety` both significant in MANOVA. Make
combo of them (over):
## Location-variety combos
\footnotesize
```{r combos}
peanuts %>%
unite(combo, c(variety, location)) -> peanuts.combo
peanuts.combo
```
\normalsize
## Discriminant analysis
\tiny
```{r bDiscrim-15}
# peanuts.1 <- lda(str_c(location, variety, sep = "_") ~ y + smk + w, data = peanuts)
peanuts.1 <- lda(combo ~ y + smk + w, data = peanuts.combo)
peanuts.1
```
\normalsize
## Comments
- Now 3 LDs (3 variables, 6 groups, $\min(3,6-1)=3$).
- Relationship of LDs to original variables. Look for coeffs far from
zero:
```{r}
peanuts.1$scaling
```
- high `LD1` mainly high `y` or low `w`.
- high `LD2` mainly low `w`.
- Proportion of trace values show relative importance of LDs: `LD1`
much more important than `LD2`; `LD3` worthless.
## The predictions and misclassification
```{r bDiscrim-17 }
p <- predict(peanuts.1)
peanuts.2 <- cbind(peanuts.combo, p)
peanuts.2
with(peanuts.2, table(obs = combo, pred = class))
```
Actually classified very well. Only one `6_2` classified as a `5_1`,
rest all correct.
## Posterior probabilities
\scriptsize
```{r bDiscrim-18}
peanuts.2 %>%
mutate(across(starts_with("posterior"), \(p) round(p, 2))) %>%
select(combo, class, starts_with("posterior"))
```
\small
*Some* doubt about which combo each plant belongs in, but not too much.
The one misclassified plant was a close call.
\normalsize
## Discriminant scores, again
- How are discriminant scores related to original variables?
- Construct data frame with original data and discriminant scores side
by side:
\footnotesize
```{r bDiscrim-19}
peanuts.1$scaling
```
\normalsize
- LD1 positive if `y` large and/or `w` small.
- LD2 positive if `w` small.
## Discriminant scores for data
\footnotesize
```{r bDiscrim-20}
peanuts.2 %>% select(y, w, starts_with("x"))
```
- Obs. 5 and 6 have most positive `LD1`: large `y`, small `w`.
- Obs. 4 has most positive `LD2`: small `w`.
\normalsize
## Plot LD1 vs. LD2, labelling by combo
\small
```{r bDiscrim-24, fig.height=4}
g <- ggplot(peanuts.2, aes(x = x.LD1, y = x.LD2, colour = combo,
label = combo)) + geom_point() +
geom_text_repel() + guides(colour = "none")
g
```
\normalsize
## "Bi-plot" from `ggbiplot`
```{r bDiscrim-25, fig.height=3.3}
ggbiplot(peanuts.1, groups = factor(peanuts.combo$combo))
```
## Installing `ggbiplot`
- `ggbiplot` not on CRAN, so usual `install.packages` will not work.
- Install package `devtools` first (once):
```{r bDiscrim-26, eval=F}
install.packages("devtools")
```
- Then install `ggbiplot` (once):
```{r bDiscrim-27, eval=F}
library(devtools)
install_github("vqv/ggbiplot")
```
## Cross-validation
- So far, have predicted group membership from same data used to form
the groups --- dishonest!
- Better: *cross-validation*: form groups from all observations
*except one*, then predict group membership for that left-out
observation.
- No longer cheating!
- Illustrate with peanuts data again.
## Misclassifications
- Fitting and prediction all in one go:
\small
```{r bDiscrim-28 }
p <- lda(combo ~ y + smk + w,
data = peanuts.combo, CV = TRUE)
peanuts.3 <- cbind(peanuts.combo, class = p$class,
posterior = p$posterior)
with(peanuts.3, table(obs = combo, pred = class))
```
\normalsize
- Some more misclassification this time.
## Repeat of LD plot
```{r graziani,fig.height=4.7}
g
```
## Posterior probabilities
\footnotesize
```{r bDiscrim-29}
peanuts.3 %>%
mutate(across(starts_with("posterior"), \(p) round(p, 3))) %>%
select(combo, class, starts_with("posterior"))
```
\normalsize
## Why more misclassification?
- When predicting group membership for one observation, only uses the
*other one* in that group.
- So if two in a pair are far apart, or if two groups overlap, great
potential for misclassification.
- Groups `5_1` and `6_2` overlap.
- `5_2` closest to `8_1`s looks more like an `8_1` than a `5_2` (other
one far away).
- `8_1`s relatively far apart and close to other things, so one
appears to be a `5_2` and the other an `8_2`.
## Example 3: professions and leisure activities
- 15 individuals from three different professions (politicians,
administrators and belly dancers) each participate in four different
leisure activities: reading, dancing, TV watching and skiing. After
each activity they rate it on a 0--10 scale.
- How can we best use the scores on the activities to predict a
person's profession?
- Or, what combination(s) of scores best separate data into profession
groups?
## The data
\footnotesize
```{r}
#| message = FALSE
my_url <- "http://ritsokiguess.site/datafiles/profile.txt"
active <- read_delim(my_url, " ")
active
```
\normalsize
## Discriminant analysis
\tiny
```{r bDiscrim-30, message=F}
active.1 <- lda(job ~ reading + dance + tv + ski, data = active)
active.1
```
\normalsize
## Comments
- Two discriminants, first fair bit more important than second.
- `LD1` depends (negatively) most on `dance`, a bit on `tv`.
- `LD2` depends mostly (negatively) on `tv`.
## Misclassification
```{r bDiscrim-31 }
p <- predict(active.1)
active.2 <- cbind(active, p)
with(active.2, table(obs = job, pred = class))
```
Everyone correctly classified.
## Plotting LDs
\small
```{r bDiscrim-32, fig.height=4}
g <- ggplot(active.2, aes(x = x.LD1, y = x.LD2, colour = job, label = job)) +
geom_point() + geom_text_repel() + guides(colour = "none")
g
```
\normalsize
## Biplot
```{r bDiscrim-33, fig.height=4.3}
ggbiplot(active.1, groups = active$job)
```
## Comments on plot
- Groups well separated: bellydancers top left, administrators top
right, politicians lower middle.
- Bellydancers most negative on `LD1`: like dancing most.
- Administrators most positive on `LD1`: like dancing least.
- Politicians most negative on `LD2`: like TV-watching most.
## Plotting individual `persons`
Make `label` be identifier of person. Now need legend:
```{r bDiscrim-34, fig.height=3.5}
active.2 %>% mutate(person = row_number()) %>%
ggplot(aes(x = x.LD1, y = x.LD2, colour = job,
label = person)) +
geom_point() + geom_text_repel()
```
## Posterior probabilities
\scriptsize
```{r bDiscrim-35}
active.2 %>% mutate(across(starts_with("posterior"), \(p) round(p, 3))) %>%
select(job, class, starts_with("posterior"))
```
\normalsize
Not much doubt.
## Cross-validating the jobs-activities data
Recall: no need for `predict`:
```{r bDiscrim-36 }
p <- lda(job ~ reading + dance + tv + ski, data = active, CV = TRUE)
active.3 <- cbind(active, class = p$class, posterior = p$posterior)
with(active.3, table(obs = job, pred = class))
```
This time one of the bellydancers was classified as a politician.
## and look at the posterior probabilities
\scriptsize
```{r bDiscrim-37}
active.3 %>%
mutate(across(starts_with("posterior"), \(p) round(p, 3))) %>%
select(job, class, starts_with("post"))
```
\normalsize
## Comments
- Bellydancer was "definitely" a politician!
- One of the administrators might have been a politician too.
## Why did things get misclassified?
::: columns
::: {.column width="40%"}
Go back to plot of discriminant scores:
- one bellydancer much closer to the politicians,
- one administrator a bit closer to the politicians.
:::
::: {.column width="60%"}
```{r}
#| echo = FALSE
g
```
:::
:::
## Example 4: remote-sensing data
- View 25 crops from air, measure 4 variables `x1-x4`.
- Go back and record what each crop was.
- Can we use the 4 variables to distinguish crops?
## The data
\footnotesize
```{r bDiscrim-39 }
#| message = FALSE
my_url <- "http://ritsokiguess.site/datafiles/remote-sensing.txt"
crops <- read_table(my_url)
crops %>% print(n = 25)
```
\normalsize
## Discriminant analysis
\tiny
```{r bDiscrim-40 }
crops.1 <- lda(crop ~ x1 + x2 + x3 + x4, data = crops)
crops.1
```
\normalsize
## Assessing
- 3 LDs (four variables, four groups).
- 1st two important.
- `LD1` mostly `x1` (minus)
- `LD2` `x3` (minus)
## Predictions
- Thus:
\footnotesize
```{r bDiscrim-43}
p <- predict(crops.1)
crops.2 <- cbind(crops, p)
with(crops.2, table(obs = crop, pred = class))
```
\normalsize
- Not very good, eg. only half the Soybeans and Sugarbeets classified
correctly.
## Plotting the LDs
```{r piacentini,fig.height=3.4}
ggplot(crops.2, aes(x = x.LD1, y = x.LD2, colour = crop)) +
geom_point()
```
Corn (red) mostly left, cotton (green) sort of right, soybeans and
sugarbeets (blue and purple) mixed up.
## Biplot
```{r bDiscrim-45, fig.height=6}
ggbiplot(crops.1, groups = crops$crop)
```
## Comments
- Corn low on LD1 (left), hence low on `x1`
- Cotton tends to be high on LD1 (high `x1`)
- one cotton very low on LD2 (high `x3`?)
- Rather mixed up.
## Posterior probs (some)
\scriptsize
```{r bDiscrim-52 }
crops.2 %>% mutate(across(starts_with("posterior"), \(p) round(p, 3))) %>%
filter(crop != class) %>%
select(crop, class, starts_with("posterior"))
```
\normalsize
## Comments
- These were the misclassified ones, but the posterior probability of
being correct was not usually too low.
- The correctly-classified ones are not very clear-cut either.
## MANOVA
Began discriminant analysis as a followup to MANOVA. Do our variables
significantly separate the crops?
```{r bDiscrim-53 }
response <- with(crops, cbind(x1, x2, x3, x4))
crops.manova <- manova(response ~ crop, data = crops)
summary(crops.manova)
```
## Box's M test
We should also run Box's M test to check for equal variance of each
variable across crops:
\small
```{r}
summary(BoxM(response, crops$crop))
```
\normalsize
- The P-value for the M test is smaller even than our guideline of
0.001. So we should not take the MANOVA seriously.
- *Apparently* at least one of the crops differs (in means) from the
others. So it is worth doing this analysis.
- We did this the wrong way around, though!
## The right way around
- *First*, do a MANOVA to see whether any of the groups differ
significantly on any of the variables.
- Check that the MANOVA is believable by using Box's M test.
- *If the MANOVA is significant*, do a discriminant analysis in the
hopes of understanding how the groups are different.
- For remote-sensing data (without Clover):
- LD1 a fair bit more important than LD2 (definitely ignore LD3).
- LD1 depends mostly on `x1`, on which Cotton was high and Corn
was low.
- Discriminant analysis in MANOVA plays the same kind of role that
Tukey does in ANOVA.