Recall data from MANOVA: needed a multivariate analysis to find difference in seed yield and weight based on whether they were high or low fertilizer.
Basic discriminant analysis
hilo.1<-lda(fertilizer ~ yield + weight, data = hilo)
Uses lda from package MASS.
“Predicting” group membership from measured variables.
Output
hilo.1
Call:
lda(fertilizer ~ yield + weight, data = hilo)
Prior probabilities of groups:
high low
0.5 0.5
Group means:
yield weight
high 35.0 13.25
low 32.5 12.00
Coefficients of linear discriminants:
LD1
yield -0.7666761
weight -1.2513563
Things to take from output
Group means: high-fertilizer plants have (slightly) higher mean yield and weight than low-fertilizer plants.
“Coefficients of linear discriminants”: are scores constructed from observed variables that best separate the groups.
For any plant, get LD1 score by taking \(-0.76\) times yield plus \(-1.25\) times weight, add up, standardize.
the LD1 coefficients are like slopes:
if yield higher, LD1 score for a plant lower
if weight higher, LD1 score for a plant lower
High-fertilizer plants have higher yield and weight, thus low (negative) LD1 score. Low-fertilizer plants have low yield and weight, thus high (positive) LD1 score.
One LD1 score for each observation. Plot with actual groups.
How many linear discriminants?
Smaller of these:
Number of variables
Number of groups minus 1
Seed yield and weight: 2 variables, 2 groups, \(\min(2,2-1)=1\).
Getting LD scores
Feed output from LDA into predict:
p <-predict(hilo.1)hilo.2<-cbind(hilo, p)hilo.2
fertilizer yield weight class posterior.high posterior.low LD1
1 low 34 10 low 2.108619e-05 9.999789e-01 3.0931414
2 low 29 14 low 1.245320e-03 9.987547e-01 1.9210963
3 low 35 11 low 2.315016e-02 9.768498e-01 1.0751090
4 low 32 13 low 4.579036e-02 9.542096e-01 0.8724245
5 high 33 14 high 9.817958e-01 1.820422e-02 -1.1456079
6 high 38 12 high 9.998195e-01 1.804941e-04 -2.4762756
7 high 34 13 high 9.089278e-01 9.107216e-02 -0.6609276
8 high 35 14 high 9.999109e-01 8.914534e-05 -2.6789600
LD1 scores in order
Most positive LD1 score is most obviously low fertilizer, most negative is most obviously high:
High fertilizer have yield and weight high, negative LD1 scores.
Plotting LD1 scores
With one LD score, plot against (true) groups, eg. boxplot:
ggplot(hilo.2, aes(x = fertilizer, y = LD1)) +geom_boxplot()
What else is in hilo.2?
class: predicted fertilizer level (based on values of yield and weight).
posterior: predicted probability of being low or high fertilizer given yield and weight.
LD1: scores for (each) linear discriminant (here is only LD1) on each observation.
Predictions and predicted groups
based on yield and weight:
hilo.2%>%select(yield, weight, fertilizer, class)
yield weight fertilizer class
1 34 10 low low
2 29 14 low low
3 35 11 low low
4 32 13 low low
5 33 14 high high
6 38 12 high high
7 34 13 high high
8 35 14 high high
Count up correct and incorrect classificationot()
with(hilo.2, table(obs = fertilizer, pred = class))
pred
obs high low
high 4 0
low 0 4
Each predicted fertilizer level is exactly same as observed one (perfect prediction).
Table shows no errors: all values on top-left to bottom-right diagonal.
Posterior probabilities
show how clear-cut the classification decisions were:
When predicting group membership for one observation, only uses the other one in that group.
So if two in a pair are far apart, or if two groups overlap, great potential for misclassification.
Groups 5_1 and 6_2 overlap.
5_2 closest to 8_1s looks more like an 8_1 than a 5_2 (other one far away).
8_1s relatively far apart and close to other things, so one appears to be a 5_2 and the other an 8_2.
Example 3: professions and leisure activities
15 individuals from three different professions (politicians, administrators and belly dancers) each participate in four different leisure activities: reading, dancing, TV watching and skiing. After each activity they rate it on a 0–10 scale.
How can we best use the scores on the activities to predict a person’s profession?
Or, what combination(s) of scores best separate data into profession groups?
Comments
Now 3 LDs (3 variables, 6 groups, \(\min(3,6-1)=3\)).
Relationship of LDs to original variables. Look for coeffs far from zero:
high
LD1
mainly highy
or loww
.high
LD2
mainly loww
.Proportion of trace values show relative importance of LDs:
LD1
much more important thanLD2
;LD3
worthless.