Recall the data from MANOVA: we needed a multivariate analysis to find a difference in seed yield and weight according to whether the plants received high or low fertilizer.
Basic discriminant analysis
library(MASS)
hilo.1 <- lda(fertilizer ~ yield + weight, data = hilo)
Uses lda from package MASS.
“Predicting” group membership from measured variables.
Output
hilo.1
Call:
lda(fertilizer ~ yield + weight, data = hilo)
Prior probabilities of groups:
high  low
 0.5  0.5
Group means:
     yield weight
high  35.0  13.25
low   32.5  12.00
Coefficients of linear discriminants:
              LD1
yield  -0.7666761
weight -1.2513563
Things to take from output
Group means: high-fertilizer plants have (slightly) higher mean yield and weight than low-fertilizer plants.
“Coefficients of linear discriminants”: LD1, LD2, … are scores constructed from observed variables that best separate the groups.
For any plant, get the LD1 score by taking \(-0.77\) times yield plus \(-1.25\) times weight, adding these up, and standardizing.
The LD1 coefficients are like slopes:
if yield higher, LD1 score for a plant lower
if weight higher, LD1 score for a plant lower
High-fertilizer plants have higher yield and weight, thus low (negative) LD1 score. Low-fertilizer plants have low yield and weight, thus high (positive) LD1 score.
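The recipe for an LD1 score can be checked by hand. The sketch below is one way to do it, under two assumptions. First, the data frame holds hypothetical values chosen only so that the group means match the output above; it is not claimed to be the original data. Second, in MASS the "standardize" step appears to amount to centring each variable at the prior-weighted average of the group means, with the scaling already built into the coefficients.

```r
library(MASS)

# Hypothetical values, chosen only so the group means match the output above
hilo <- data.frame(
  fertilizer = factor(rep(c("low", "high"), each = 4)),
  yield  = c(34, 29, 35, 32, 33, 38, 34, 35),
  weight = c(10, 14, 11, 13, 14, 12, 13, 14)
)
hilo.1 <- lda(fertilizer ~ yield + weight, data = hilo)

X <- as.matrix(hilo[, c("yield", "weight")])

# Centre each variable at the prior-weighted average of the group means
centre <- colSums(hilo.1$prior * hilo.1$means)

# Coefficient times centred variable, added up over the variables
manual_scores <- scale(X, center = centre, scale = FALSE) %*% hilo.1$scaling

# Compare the hand-computed scores with the ones predict() returns
all.equal(as.numeric(manual_scores), as.numeric(predict(hilo.1)$x))
```

Only the direction of LD1 is arbitrary: with other data or software the signs of all coefficients may flip together, which swaps which group is "positive" without changing the separation.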
One LD1 score for each observation. Plot with actual groups.
How many linear discriminants?
Smaller of these:
Number of variables
Number of groups minus 1
Seed yield and weight: 2 variables, 2 groups, \(\min(2,2-1)=1\).
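The rule is simple enough to write as a one-line function. `n_discriminants` is a hypothetical helper for illustration, not something supplied by MASS.

```r
# Hypothetical helper: number of linear discriminants for a given problem
n_discriminants <- function(n_variables, n_groups) {
  min(n_variables, n_groups - 1)
}

n_discriminants(2, 2)  # seed yield and weight: 1
```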
Getting LD scores
Feed output from LDA into predict:
p <- predict(hilo.1)
hilo.2 <- cbind(hilo, p)
hilo.2
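It may help to look at what `predict` returns before gluing it onto the data. A minimal sketch, again using hypothetical data values chosen only to match the group means shown earlier:

```r
library(MASS)

# Hypothetical values, chosen only so the group means match the earlier output
hilo <- data.frame(
  fertilizer = factor(rep(c("low", "high"), each = 4)),
  yield  = c(34, 29, 35, 32, 33, 38, 34, 35),
  weight = c(10, 14, 11, 13, 14, 12, 13, 14)
)
hilo.1 <- lda(fertilizer ~ yield + weight, data = hilo)

p <- predict(hilo.1)
names(p)           # "class" "posterior" "x"
head(p$class)      # predicted group for each plant
head(p$posterior)  # posterior probability of each group
head(p$x)          # the LD scores, one column per discriminant
```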
LD1 scores in order
The most positive LD1 score is most obviously low fertilizer; the most negative is most obviously high.
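One possible way to put the scores in order is sketched below. The data values are hypothetical (chosen only to match the group means shown earlier), and `scores` is an illustrative name.

```r
library(MASS)

# Hypothetical values, chosen only so the group means match the earlier output
hilo <- data.frame(
  fertilizer = factor(rep(c("low", "high"), each = 4)),
  yield  = c(34, 29, 35, 32, 33, 38, 34, 35),
  weight = c(10, 14, 11, 13, 14, 12, 13, 14)
)
hilo.1 <- lda(fertilizer ~ yield + weight, data = hilo)
p <- predict(hilo.1)

# Actual group next to LD1 score, sorted from most negative to most positive
scores <- data.frame(fertilizer = hilo$fertilizer, LD1 = p$x[, "LD1"])
scores_sorted <- scores[order(scores$LD1), ]
scores_sorted
```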
When predicting group membership for one observation, its own group is represented only by the other observation in that group.
So if the two in a pair are far apart, or if two groups overlap, there is great potential for misclassification.
Groups 5_1 and 6_2 overlap.
The 5_2 closest to the 8_1s looks more like an 8_1 than a 5_2 (the other 5_2 is far away).
The 8_1s are relatively far apart and close to other things, so one appears to be a 5_2 and the other an 8_2.
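These comments read like leave-one-out cross-validation, where each observation is classified without its own help. Assuming that is what is meant, `lda` in MASS offers it through `CV = TRUE`. The sketch below uses hypothetical seed data (values chosen only to match the group means shown earlier), not the data behind the comments above.

```r
library(MASS)

# Hypothetical values, chosen only so the group means match the earlier output
hilo <- data.frame(
  fertilizer = factor(rep(c("low", "high"), each = 4)),
  yield  = c(34, 29, 35, 32, 33, 38, 34, 35),
  weight = c(10, 14, 11, 13, 14, 12, 13, 14)
)

# CV = TRUE: each plant is classified from a fit that leaves that plant out
hilo.cv <- lda(fertilizer ~ yield + weight, data = hilo, CV = TRUE)

# Actual group against cross-validated prediction
table(actual = hilo$fertilizer, predicted = hilo.cv$class)

# Cross-validated posterior probabilities
round(hilo.cv$posterior, 3)
```

With `CV = TRUE` the result is a list of predictions rather than a fitted model, so it cannot be passed on to `predict`.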
Example 3: professions and leisure activities
15 individuals from three different professions (politicians, administrators and belly dancers) each participate in four different leisure activities: reading, dancing, TV watching and skiing. After each activity they rate it on a 0–10 scale.
How can we best use the scores on the activities to predict a person’s profession?
Or, what combination(s) of scores best separate data into profession groups?
Comments
Now 3 LDs (3 variables, 6 groups, \(\min(3,6-1)=3\)).
Relationship of LDs to original variables. Look for coeffs far from zero:
high LD1 mainly high y or low w.
high LD2 mainly low w.
Proportion of trace values show relative importance of LDs: LD1 much more important than LD2; LD3 worthless.
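The "proportion of trace" values can be recovered from a fitted model: they appear to be the squared singular values in the `svd` component, divided by their total. The sketch below uses R's built-in `iris` data purely as a stand-in (3 groups, 4 variables, so 2 LDs); it is not the data discussed above.

```r
library(MASS)

# Stand-in data: iris has 3 species and 4 measurements, so min(4, 3 - 1) = 2 LDs
iris.1 <- lda(Species ~ ., data = iris)

# Squared singular values as a share of their total
prop_trace <- iris.1$svd^2 / sum(iris.1$svd^2)
names(prop_trace) <- colnames(iris.1$scaling)
round(prop_trace, 4)

# Coefficients: look for values far from zero in each LD column
iris.1$scaling
```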