---
title: "Factor analysis"
---
## Vs. principal components
* Principal components:
* Purely mathematical.
* Find eigenvalues, eigenvectors of correlation matrix.
* No testing whether observed components reproducible, or even probability model behind it.
* Factor analysis:
* some way towards fixing this (get test of appropriateness)
* In factor analysis, each variable modelled as: "common factor" (eg. verbal ability) and "specific factor" (left over).
* Choose the common factors to "best" reproduce pattern seen in correlation matrix.
* Iterative procedure, different answer from principal components.
## Packages
```{r bFactor-1, warning=F, message=F}
library(ggbiplot)
library(tidyverse)
library(conflicted)
conflict_prefer("mutate", "dplyr")
conflict_prefer("select", "dplyr")
conflict_prefer("filter", "dplyr")
conflict_prefer("arrange", "dplyr")
```
## Example
* 145 children given 5 tests, called PARA, SENT, WORD, ADD and DOTS. 3 linguistic tasks (paragraph comprehension, sentence completion and word meaning), 2 mathematical ones (addition and counting dots).
* Correlation matrix of scores on the tests:
```
para 1 0.722 0.714 0.203 0.095
sent 0.722 1 0.685 0.246 0.181
word 0.714 0.685 1 0.170 0.113
add 0.203 0.246 0.170 1 0.585
dots 0.095 0.181 0.113 0.585 1
```
* Is there small number of underlying "constructs" (unobservable) that explains this pattern of correlations?
## To start: principal components
Using correlation matrix. Read that first:
```{r kids-scree,message=F}
my_url <- "http://ritsokiguess.site/datafiles/rex2.txt"
kids <- read_delim(my_url, " ")
kids
```
## Principal components on correlation matrix
Turn into R `matrix`, using column `test` as column names:
```{r}
kids %>%
column_to_rownames("test") %>%
as.matrix() -> m
```
Principal components:
```{r bFactor-2}
kids.0 <- princomp(covmat = m)
```
I used `kids.0` here since I want `kids.1` and `kids.2` later.
## Scree plot
```{r bFactor-3, fig.height=3.5}
ggscreeplot(kids.0)
```
## Principal component results
* Need 2 components. Loadings:
\footnotesize
```{r bFactor-4}
kids.0$loadings
```
\normalsize
## Comments
* First component has a bit of everything, though especially the
first three tests.
* Second component rather more clearly `add` and `dots`.
* No scores, plots since no actual data.
- See how factor analysis compares on these data.
## Factor analysis
* Specify number of factors first, get solution with exactly
that many factors.
* Includes hypothesis test, need to specify how many children
wrote the tests.
* Works from correlation matrix via `covmat` or actual
data, like `princomp`.
* Introduces extra feature, *rotation*, to make
interpretation of loadings (factor-variable relation) easier.
## Factor analysis for the kids data
* Create "covariance list" to include number of children who
wrote the tests.
* Feed this into `factanal`, specifying how many factors (2).
- Start with the matrix we made before.
```{r bFactor-5 }
m
ml <- list(cov = m, n.obs = 145)
kids.2 <- factanal(factors = 2, covmat = ml)
```
## Uniquenesses
```{r bFactor-6 }
kids.2$uniquenesses
```
* Uniquenesses say how "unique" a variable is (size of
specific factor). Small
uniqueness means that the variable is summarized by a factor (good).
* Very large uniquenesses are bad; `add`'s uniqueness is largest but not large enough to be worried about.
* Also see "communality" for this idea, where *large* is good and *small* is bad.
## Loadings
\footnotesize
```{r bFactor-7}
kids.2$loadings
```
\normalsize
* Loadings show how each factor depends on variables. Blanks
indicate "small", less than 0.1.
## Comments
* Factor 1 clearly the "linguistic" tasks, factor 2 clearly
the "mathematical" ones.
* Two factors together explain 68\% of variability (like
regression R-squared).
- Which variables belong to which factor is *much* clearer than with principal components.
## Are 2 factors enough?
```{r bFactor-8 }
kids.2$STATISTIC
kids.2$dof
kids.2$PVAL
```
P-value not small, so 2 factors OK.
## 1 factor
```{r bFactor-9 }
kids.1 <- factanal(factors = 1, covmat = ml)
kids.1$STATISTIC
kids.1$dof
kids.1$PVAL
```
1 factor rejected (P-value small). Definitely need more than 1.
## Places rated, again
- Read data, transform, rerun principal components, get biplot:
```{r bFactor-10, message=FALSE, fig.height=6}
my_url <- "http://ritsokiguess.site/datafiles/places.txt"
places0 <- read_table2(my_url)
places0 %>%
mutate(across(-id, \(x) log(x))) -> places
places %>% select(-id) -> places_numeric
places.1 <- princomp(places_numeric, cor = TRUE)
g <- ggbiplot(places.1, labels = places$id,
labels.size = 0.8)
```
- This is all exactly as for principal components (nothing new here).
## The biplot
```{r bFactor-11, fig.height=3}
g
```
## Comments
- Most of the criteria are part of components 1 *and* 2.
- If we can rotate the arrows counterclockwise:
- economy and crime would point straight up
- part of component 2 only
- health and education would point to the right
- part of component 1 only
- would be easier to see which variables belong to which component.
- Factor analysis includes a rotation to help with interpretation.
## Factor analysis
- Have to pick a number of factors *first*.
- Do this by running principal components and looking at scree plot.
- In this case, 3 factors seemed good (revisit later):
```{r bFactor-12}
places.3 <- factanal(places_numeric, 3, scores = "r")
```
- There are different ways to get factor scores. These called "regression" scores.
## A bad biplot
```{r bFactor-13, fig.height=4}
biplot(places.3$scores, places.3$loadings,
xlabs = places$id)
```
## Comments
- I have to find a way to make a better biplot!
- Some of the variables now point straight up and some straight across (if you look carefully for the red arrows among the black points).
- This should make the factors more interpretable than the components were.
## Factor loadings
\footnotesize
```{r bFactor-14}
places.3$loadings
```
\normalsize
## Comments on loadings
- These are at least somewhat clearer than for the principal components:
- Factor 1: health, education, arts: "well-being"
- Factor 2: housing, transportation, arts (again), recreation: "places to be"
- Factor 3: climate (only): "climate"
- In this analysis, economic factors don't seem to be important.
## Factor scores
- Make a dataframe with the city IDs and factor scores:
```{r bFactor-15}
cbind(id = places$id, places.3$scores) %>%
as_tibble() -> places_scores
```
- Make percentile ranks again (for checking):
```{r bFactor-16}
places %>%
mutate(across(-id, \(x) percent_rank(x))) -> places_pr
```
## Highest scores on factor 1, "well-being":
- for the top 4 places:
```{r bFactor-17}
places_scores %>%
slice_max(Factor1, n = 4)
```
## Check percentile ranks for factor 1
```{r bFactor-18}
places_pr %>%
select(id, health, educate, arts) %>%
filter(id %in% c(213, 65, 234, 314))
```
- These are definitely high on the well-being variables.
- City #213 is not so high on education, but is highest of all on the others.
## Highest scores on factor 2, "places to be":
```{r bFactor-19}
places_scores %>%
slice_max(Factor2, n = 4)
```
## Check percentile ranks for factor 2
```{r bFactor-20}
places_pr %>%
select(id, housing, trans, arts, recreate) %>%
filter(id %in% c(318, 12, 168, 44))
```
- These are definitely high on housing and recreation.
- Some are (very) high on transportation, but not so much on arts.
- Could look at more cities to see if #168 being low on arts is a fluke.
## Highest scores on factor 3, "climate":
```{r bFactor-21}
places_scores %>%
slice_max(Factor3, n = 4)
```
## Check percentile ranks for factor 3
```{r bFactor-22}
places_pr %>%
select(id, climate) %>%
filter(id %in% c(227, 218, 269, 270))
```
This is very clear.
## Uniquenesses
- We said earlier that the economy was not part of any of our factors:
```{r bFactor-23}
places.3$uniquenesses
```
- The higher the uniqueness, the less the variable concerned is part of any of our factors (and that maybe another factor is needed to accommodate it).
- This includes economy and maybe crime.
## Test of significance
We can test whether the three factors that we have is enough, or whether we need more to describe our data:
```{r bFactor-24}
places.3$PVAL
```
- 3 factors are not enough.
- What would 5 factors look like?
## Five factors
\footnotesize
```{r bFactor-25}
places.5 <- factanal(places_numeric, 5, scores = "r")
places.5$loadings
```
\normalsize
## Comments 1/2
- On (new) 5 factors:
- Factor 1 is health, education, arts: same as factor 1 before.
- Factor 2 is housing, transportation, arts, recreation: as factor 2 before.
- Factor 3 is economy.
- Factor 4 is crime.
- Factor 5 is climate and housing: like factor 3 before.
## Comments 2/2
- The two added factors include the two "missing" variables.
- Is this now enough?
```{r bFactor-26}
places.5$PVAL
```
- No. My guess is that the authors of Places Rated chose their 9 criteria to capture different aspects of what makes a city good or bad to live in, and so it was too much to hope that a small number of factors would come out of these.
## A bigger example: BEM sex role inventory
* 369 women asked to rate themselves on 60 traits, like "self-reliant" or "shy".
* Rating 1 "never or almost never true of me" to 7 ``always or
almost always true of me''.
* 60 personality traits is a lot. Can we find a smaller number
of factors that capture aspects of personality?
* The whole BEM sex role inventory on next page.
## The whole inventory
![](bem.png){width=450px}
## Some of the data
\scriptsize
```{r bem-scree, message=F}
my_url <- "http://ritsokiguess.site/datafiles/factor.txt"
bem <- read_tsv(my_url)
bem
```
\normalsize
## Principal components first
\ldots to decide on number of factors:
```{r bFactor-27 }
bem.pc <- bem %>%
select(-subno) %>%
princomp(cor = T)
```
## The scree plot
```{r genoa,fig.height=3.7}
(g <- ggscreeplot(bem.pc))
```
* No obvious elbow.
## Zoom in to search for elbow
Possible elbows at 3 (2 factors) and 6 (5):
```{r bem-scree-two,fig.height=3.3,warning=F}
g + scale_x_continuous(limits = c(0, 8))
```
## but is 2 really good?
```{r bFactor-28, include=FALSE}
options(width = 80)
```
\scriptsize
```{r bFactor-29 }
summary(bem.pc)
```
\normalsize
```{r bFactor-30, include=FALSE}
options(width = 60)
```
## Comments
* Want overall fraction of variance explained (``cumulative
proportion'') to be reasonably high.
* 2 factors, 28.5\%. Terrible!
* Even 56\% (10 factors) not that good!
* Have to live with that.
## Biplot
```{r bem-biplot,fig.height=3.5}
ggbiplot(bem.pc, alpha = 0.3)
```
## Comments
* Ignore individuals for now.
* Most variables point to 1 o'clock or 4 o'clock.
* Suggests factor analysis with rotation will get interpretable
factors (rotate to 12 o'clock and 3 o'clock, for example).
* Try for 2-factor solution (rough interpretation, will be bad):
```{r bFactor-31 }
bem %>%
select(-subno) %>%
factanal(factors = 2) -> bem.2
```
* Show output in pieces (just print `bem.2` to see all of it).
## Uniquenesses, sorted
\scriptsize
```{r bFactor-32, echo=-1}
options(width = 60)
sort(bem.2$uniquenesses)
```
\normalsize
## Comments
* Mostly high or very high (bad).
* Some smaller, eg.: Leadership ability (0.409),
Acts like leader (0.417),
Warm (0.476),
Tender (0.493).
* Smaller uniquenesses captured by one of our two factors.
- Larger uniquenesses are not: need more factors to capture them.
## Factor loadings, some
\scriptsize
```{r bFactor-33}
bem.2$loadings
```
\normalsize
## Making a data frame
There are too many to read easily, so make a data frame. A
bit tricky:
\footnotesize
```{r bFactor-34}
bem.2$loadings %>%
unclass() %>%
as_tibble() %>%
mutate(trait = rownames(bem.2$loadings)) -> loadings
loadings %>% slice(1:8)
```
\normalsize
## Pick out the big ones on factor 1
Arbitrarily defining $>0.4$ or $<-0.4$ as "big":
\scriptsize
```{r bFactor-35}
loadings %>% filter(abs(Factor1) > 0.4)
```
\normalsize
## Factor 2, the big ones
\footnotesize
```{r bFactor-36}
loadings %>% filter(abs(Factor2) > 0.4)
```
\normalsize
## Plotting the two factors
- A bi-plot, this time with the variables reduced in size. Looking for
unusual individuals.
- Have to run `factanal` again to get factor scores for plotting.
```{r biplot-two-again, eval=F}
bem %>% select(-subno) %>%
factanal(factors = 2, scores = "r") -> bem.2a
biplot(bem.2a$scores, bem.2a$loadings, cex = c(0.5, 0.5))
```
- Numbers on plot are row numbers of `bem`
data frame.
## The (awful) biplot
```{r biplot-two-ag,fig.height=4,echo=F}
bem.2a <- factanal(bem[, -1], factors = 2, scores = "r")
biplot(bem.2a$scores, bem.2a$loadings, cex = c(0.5, 0.5))
```
## Comments
* Variables mostly up ("feminine") and right ("masculine"),
accomplished by rotation.
* Some unusual individuals: 311, 214 (low on factor 2), 366
(high on factor 2),
359, 258
(low on factor 1), 230 (high on factor 1).
## Individual 366
\tiny
```{r bFactor-37}
bem %>% slice(366) %>% glimpse()
```
\normalsize
## Comments
* Individual 366 high on factor 2, but hard to see which traits should have high scores
(unless we remember).
- Idea 1: use percentile ranks as before.
* Idea 2: Rating scale is easy to interpret. So
*tidy* original data frame to make easier to look
things up.
## Tidying original data
\scriptsize
```{r bFactor-38}
bem %>%
ungroup() %>%
mutate(row = row_number()) %>%
pivot_longer(c(-subno, -row), names_to="trait",
values_to="score") -> bem_tidy
bem_tidy
```
\normalsize
## Recall data frame of loadings
\footnotesize
```{r bFactor-39}
loadings %>% slice(1:10)
```
\normalsize
Want to add the factor scores for each trait to our tidy data frame
`bem_tidy`. This is a left-join (over), matching on the column
`trait` that is in both data frames (thus, the default):
## Looking up loadings
\scriptsize
```{r bFactor-40}
bem_tidy %>% left_join(loadings) -> bem_tidy
bem_tidy %>% sample_n(12)
```
\normalsize
## Individual 366, high on Factor 2
So now pick out the rows of the tidy data frame that belong to
individual 366 (`row=366`) and for which the `Factor2`
score exceeds 0.4 in absolute value (our "big" from before):
\scriptsize
```{r bFactor-41}
bem_tidy %>% filter(row == 366, abs(Factor2) > 0.4)
```
\normalsize
As expected, high scorer on these.
## Several individuals
Rows 311 and 214 were *low* on Factor 2, so their scores should
be low. Can we do them all at once?
\scriptsize
```{r bFactor-42}
bem_tidy %>% filter(
row %in% c(366, 311, 214),
abs(Factor2) > 0.4
)
```
\normalsize
Can we display each individual in own column?
## Individual by column
Un-`tidy`, that is, `pivot_wider`:
\tiny
```{r bFactor-43}
bem_tidy %>%
filter(
row %in% c(366, 311, 214),
abs(Factor2) > 0.4
) %>%
select(-subno, -Factor1, -Factor2) %>%
pivot_wider(names_from=row, values_from=score)
```
\normalsize
366 high, 311 middling, 214 (sometimes) low.
## Individuals 230, 258, 359
These were high, low, low on factor 1. Adapt code:
\tiny
```{r bFactor-44}
bem_tidy %>%
filter(row %in% c(359, 258, 230), abs(Factor1) > 0.4) %>%
select(-subno, -Factor1, -Factor2) %>%
pivot_wider(names_from=row, values_from=score)
```
\normalsize
## Is 2 factors enough?
Suspect not:
```{r bFactor-45 }
bem.2$PVAL
```
2 factors resoundingly rejected. Need more. Have to go all the way to
15 factors to not reject:
```{r bFactor-46 }
bem %>%
select(-subno) %>%
factanal(factors = 15) -> bem.15
bem.15$PVAL
```
Even then, only just over 50\% of variability explained.
## What's important in 15 factors?
- Let's take a look at the important things in those 15 factors.
- Get 15-factor loadings into a data frame, as before:
\small
```{r bFactor-47}
bem.15$loadings %>%
unclass() %>%
as_tibble() %>%
mutate(trait = rownames(bem.15$loadings)) -> loadings
```
\normalsize
- then show the highest few loadings on each factor.
## Factor 1 (of 15)
\footnotesize
```{r bFactor-48}
loadings %>%
arrange(desc(abs(Factor1))) %>%
select(Factor1, trait) %>%
slice(1:10)
```
\normalsize
Compassionate, understanding, sympathetic, soothing: thoughtful of
others.
## Factor 2
\footnotesize
```{r bFactor-49}
loadings %>%
arrange(desc(abs(Factor2))) %>%
select(Factor2, trait) %>%
slice(1:10)
```
\normalsize
Strong personality, forceful, assertive, dominant: getting ahead.
## Factor 3
\footnotesize
```{r bFactor-50}
loadings %>%
arrange(desc(abs(Factor3))) %>%
select(Factor3, trait) %>%
slice(1:10)
```
\normalsize
Self-reliant, self-sufficient, independent: going it alone.
## Factor 4
\footnotesize
```{r bFactor-51}
loadings %>%
arrange(desc(abs(Factor4))) %>%
select(Factor4, trait) %>%
slice(1:10)
```
\normalsize
Gentle, tender, warm (affectionate): caring for others.
## Factor 5
\scriptsize
```{r bFactor-52}
loadings %>%
arrange(desc(abs(Factor5))) %>%
select(Factor5, trait) %>%
slice(1:10)
```
\normalsize
Ambitious, competitive (with a bit of risk-taking and individualism):
Being the best.
## Factor 6
\scriptsize
```{r bFactor-53}
loadings %>%
arrange(desc(abs(Factor6))) %>%
select(Factor6, trait) %>%
slice(1:10)
```
\normalsize
Acts like a leader, leadership ability (with a bit of Dominant):
Taking charge.
## Factor 7
\footnotesize
```{r bFactor-54}
loadings %>%
arrange(desc(abs(Factor7))) %>%
select(Factor7, trait) %>%
slice(1:10)
```
\normalsize
Happy and cheerful.
## Factor 8
\footnotesize
```{r bFactor-55}
loadings %>%
arrange(desc(abs(Factor8))) %>%
select(Factor8, trait) %>%
slice(1:10)
```
\normalsize
Affectionate, flattering: Making others feel good.
## Factor 9
\footnotesize
```{r bFactor-56}
loadings %>%
arrange(desc(abs(Factor9))) %>%
select(Factor9, trait) %>%
slice(1:10)
```
\normalsize
Taking a stand.
## Factor 10
\footnotesize
```{r bFactor-57}
loadings %>%
arrange(desc(abs(Factor10))) %>%
select(Factor10, trait) %>%
slice(1:10)
```
\normalsize
Feminine. (A little bit of not-masculine!)
## Factor 11
\footnotesize
```{r bFactor-58}
loadings %>%
arrange(desc(abs(Factor11))) %>%
select(Factor11, trait) %>%
slice(1:10)
```
\normalsize
Loyal.
## Factor 12
\footnotesize
```{r bFactor-59}
loadings %>%
arrange(desc(abs(Factor12))) %>%
select(Factor12, trait) %>%
slice(1:10)
```
\normalsize
Childlike. (With a bit of moody, shy, not-self-sufficient, not-conscientious.)
## Factor 13
\footnotesize
```{r bFactor-60}
loadings %>%
arrange(desc(abs(Factor13))) %>%
select(Factor13, trait) %>%
slice(1:10)
```
\normalsize
Truthful. (With a bit of happy and not-gullible.)
## Factor 14
\footnotesize
```{r bFactor-61}
loadings %>%
arrange(desc(abs(Factor14))) %>%
select(Factor14, trait) %>%
slice(1:10)
```
\normalsize
Decisive. (With a bit of self-sufficient and not-soft-spoken.)
## Factor 15
\footnotesize
```{r bFactor-62}
loadings %>%
arrange(desc(abs(Factor15))) %>%
select(Factor15, trait) %>%
slice(1:10)
```
\normalsize
Not-compassionate, athletic, sensitive: A mixed bag. ("Cares about self"?)
## Anything left out? Uniquenesses
\scriptsize
```{r bFactor-63}
enframe(bem.15$uniquenesses, name="quality", value="uniq") %>%
slice_max(uniq, n = 10)
```
\normalsize
Uses foul language especially, also loves children and analytical. So
could use even more factors.