ANOVA and MANOVA: predict a (counted/measured) response from group membership.
Discriminant analysis: predict group membership based on counted/measured variables.
Covers same ground as logistic regression (and its variations), but emphasis on classifying observed data into correct groups.
… continued
Does so by searching for linear combination of original variables that best separates data into groups (canonical variables).
Assumption here that groups are known (for data we have). If trying to “best separate” data into unknown groups, see cluster analysis.
Packages
library(MASS, exclude ="select")library(tidyverse)library(ggrepel)library(ggbiplot) # this loads plyr (different from dplyr)library(MVTests) # for Box M testlibrary(conflicted)conflict_prefer("arrange", "dplyr")conflict_prefer("summarize", "dplyr")conflict_prefer("select", "dplyr")conflict_prefer("filter", "dplyr")conflict_prefer("mutate", "dplyr")conflicts_prefer(dplyr::count)
ggrepel allows labelling points on a plot so they don’t overwrite each other.
ggbiplot uses plyr rather than dplyr, which has functions by similar names.
About select
Both dplyr (in tidyverse) and MASS have a function called select, and they do different things.
How do you know which select is going to get called?
With library: one loaded last visible, others not.
Thus we can access the select in dplyr but not the one in MASS.
Better: load conflicted package. Any time you load two packages containing functions with same name, get error, choose between them.
Recall data from MANOVA: needed a multivariate analysis to find difference in seed yield and weight based on whether they were high or low fertilizer.
Basic discriminant analysis
hilo.1<-lda(fertilizer ~ yield + weight, data = hilo)
Uses lda from package MASS.
“Predicting” group membership from measured variables.
Output (in hilo.1)
Call:
lda(fertilizer ~ yield + weight, data = hilo)
Prior probabilities of groups:
high low
0.5 0.5
Group means:
yield weight
high 35.0 13.25
low 32.5 12.00
Coefficients of linear discriminants:
LD1
yield -0.7666761
weight -1.2513563
Things to take from output 1/2
Group means: high-fertilizer plants have (slightly) higher mean yield and weight than low-fertilizer plants.
“Coefficients of linear discriminants”: are scores constructed from observed variables that best separate the groups.
For any plant, get LD1 score by taking \(-0.76\) times yield plus \(-1.25\) times weight, add up, standardize.
Things to take from output 1/2
the LD1 coefficients are like slopes:
if yield higher, LD1 score for a plant lower
if weight higher, LD1 score for a plant lower
High-fertilizer plants have higher yield and weight, thus low (negative) LD1 score. Low-fertilizer plants have low yield and weight, thus high (positive) LD1 score.
One LD1 score for each observation. Plot with actual groups.
How many linear discriminants?
Smaller of these:
Number of variables
Number of groups minus 1
Seed yield and weight: 2 variables, 2 groups, \(\min(2,2-1)=1\).
When predicting group membership for one observation, only uses the other one in that group.
So if two in a pair are far apart, or if two groups overlap, great potential for misclassification.
Groups 5_1 and 6_2 overlap.
5_2 closest to 8_1s looks more like an 8_1 than a 5_2 (other one far away).
8_1s relatively far apart and close to other things, so one appears to be a 5_2 and the other an 8_2.
Example 3: professions and leisure activities
15 individuals from three different professions (politicians, administrators and belly dancers) each participate in four different leisure activities: reading, dancing, TV watching and skiing. After each activity they rate it on a 0–10 scale.
How can we best use the scores on the activities to predict a person’s profession?
Or, what combination(s) of scores best separate data into profession groups?
Comments
Only obs. 7 has any doubt:
yield
low for a high-fertilizer, but highweight
makes up for it.