ANOVA and MANOVA: predict a (counted/measured) response from group membership.
Discriminant analysis: predict group membership based on counted/measured variables.
Covers same ground as logistic regression (and its variations), but emphasis on classifying observed data into correct groups.
Does so by searching for linear combination of original variables that best separates data into groups (canonical variables).
Assumption here that groups are known (for data we have). If trying to “best separate” data into unknown groups, see cluster analysis.
Packages
library(MASS, exclude ="select")library(tidyverse)library(ggrepel)library(ggbiplot) # this loads plyr (different from dplyr)library(MVTests) # for Box M testlibrary(conflicted)conflict_prefer("arrange", "dplyr")conflict_prefer("summarize", "dplyr")conflict_prefer("select", "dplyr")conflict_prefer("filter", "dplyr")conflict_prefer("mutate", "dplyr")conflicts_prefer(dplyr::count)
ggrepel allows labelling points on a plot so they don’t overwrite each other.
ggbiplot uses plyr rather than dplyr, which has functions by similar names.
About select
Both dplyr (in tidyverse) and MASS have a function called select, and they do different things.
How do you know which select is going to get called?
With library, the one loaded last is visible, and others are not.
Thus we can access the select in dplyr but not the one in MASS. If we wanted that one, we’d have to say MASS::select.
Better: load conflicted package. Any time you load two packages containing functions with same name, you get error and have to choose between them.
Recall data from MANOVA: needed a multivariate analysis to find difference in seed yield and weight based on whether they were high or low fertilizer.
Basic discriminant analysis
hilo.1<-lda(fertilizer ~ yield + weight, data = hilo)
Uses lda from package MASS.
“Predicting” group membership from measured variables.
Output
hilo.1
Call:
lda(fertilizer ~ yield + weight, data = hilo)
Prior probabilities of groups:
high low
0.5 0.5
Group means:
yield weight
high 35.0 13.25
low 32.5 12.00
Coefficients of linear discriminants:
LD1
yield -0.7666761
weight -1.2513563
Things to take from output
Group means: high-fertilizer plants have (slightly) higher mean yield and weight than low-fertilizer plants.
“Coefficients of linear discriminants”: are scores constructed from observed variables that best separate the groups.
For any plant, get LD1 score by taking \(-0.76\) times yield plus \(-1.25\) times weight, add up, standardize.
the LD1 coefficients are like slopes:
if yield higher, LD1 score for a plant lower
if weight higher, LD1 score for a plant lower
High-fertilizer plants have higher yield and weight, thus low (negative) LD1 score. Low-fertilizer plants have low yield and weight, thus high (positive) LD1 score.
One LD1 score for each observation. Plot with actual groups.
How many linear discriminants?
Smaller of these:
Number of variables
Number of groups minus 1
Seed yield and weight: 2 variables, 2 groups, \(\min(2,2-1)=1\).
Getting LD scores
Feed output from LDA into predict:
p <-predict(hilo.1)p
$class
[1] low low low low high high high high
Levels: high low
$posterior
high low
1 2.108619e-05 9.999789e-01
2 1.245320e-03 9.987547e-01
3 2.315016e-02 9.768498e-01
4 4.579036e-02 9.542096e-01
5 9.817958e-01 1.820422e-02
6 9.998195e-01 1.804941e-04
7 9.089278e-01 9.107216e-02
8 9.999109e-01 8.914534e-05
$x
LD1
1 3.0931414
2 1.9210963
3 1.0751090
4 0.8724245
5 -1.1456079
6 -2.4762756
7 -0.6609276
8 -2.6789600
hilo.2<-cbind(hilo, p)hilo.2
LD1 scores in order
Most positive LD1 score is most obviously low fertilizer, most negative is most obviously high:
When predicting group membership for one observation, only uses the other one in that group.
So if two in a pair are far apart, or if two groups overlap, great potential for misclassification.
Groups 5_1 and 6_2 overlap.
5_2 closest to 8_1s looks more like an 8_1 than a 5_2 (other one far away).
8_1s relatively far apart and close to other things, so one appears to be a 5_2 and the other an 8_2.
Example 3: professions and leisure activities
15 individuals from three different professions (politicians, administrators and belly dancers) each participate in four different leisure activities: reading, dancing, TV watching and skiing. After each activity they rate it on a 0–10 scale.
How can we best use the scores on the activities to predict a person’s profession?
Or, what combination(s) of scores best separate data into profession groups?
Comments
Now 3 LDs (3 variables, 6 groups, \(\min(3,6-1)=3\)).
Relationship of LDs to original variables. Look for coeffs far from zero:
high
LD1
mainly highy
or loww
.high
LD2
mainly loww
.Proportion of trace values show relative importance of LDs:
LD1
much more important thanLD2
;LD3
worthless.