41 Factor Analysis
Packages for this chapter:
library(ggbiplot)
library(tidyverse)
41.1 The Interpersonal Circumplex
The “IPIP Interpersonal Circumplex” (see link) is a personal behaviour survey, where respondents have to rate how accurately a number of statements about behaviour apply to them, on a scale from 1 (“very inaccurate”) to 5 (“very accurate”). A survey was done on 459 people, using a 44-item variant of the above questionnaire, where the statements were as follows. Put an “I” or an “I am” in front of each one:
talkative
find fault
do a thorough job
depressed
original
reserved
helpful
careless
relaxed
curious
full of energy
start quarrels
reliable
tense
ingenious
generate enthusiasm in others
forgiving
disorganized
worry
imaginative
quiet
trusting
lazy
emotionally stable
inventive
assertive
cold and aloof
persevere
moody
value artistic experiences
shy
considerate
efficient
calm in tense situations
prefer routine work
outgoing
sometimes rude
stick to plans
nervous
reflective
have few artistic interests
co-operative
distractible
sophisticated in art and music
I don’t know what a “circumplex” is, but I know it’s not one of those “hat” accents that they have in French. The data are in link. The columns PERS01 through PERS44 represent the above traits.
Read in the data and check that you have the right number of rows and columns.
There are some missing values among these responses. Eliminate all the individuals with any missing values (since princomp can’t handle them).
Carry out a principal components analysis and obtain a scree plot.
How many components/factors should you use? Explain briefly.
* Using your preferred number of factors, run a factor analysis. Obtain “r”-type factor scores, as in class. You don’t need to look at any output yet.
Obtain the factor loadings. How much of the variability does your chosen number of factors explain?
Interpret each of your chosen number of factors. That is, for each factor, identify the items that load heavily on it (you can be fairly crude about this, eg. use a cutoff like 0.4 in absolute value), and translate these items into the statements given in each item. Then, if you can, name what the items loading heavily on each factor have in common. Interpret a negative loading as “not” whatever the item says.
Find a person who is extreme on each of your first three factors (one at a time). For each of these people, what kind of data should they have for the relevant ones of the variables PERS01 through PERS44? Do they? Explain reasonably briefly.
Check the uniquenesses. Which one(s) seem unusually high? Check these against the factor loadings. Are these what you would expect?
41.2 A correlation matrix
Here is a correlation matrix between five variables. This correlation matrix was based on \(n=50\) observations. Save the data into a file.
1.00 0.90 -0.40 0.28 -0.05
0.90 1.00 -0.60 0.43 -0.20
-0.40 -0.60 1.00 -0.80 0.40
0.28 0.43 -0.80 1.00 -0.70
-0.05 -0.20 0.40 -0.70 1.00
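If you want to follow along, you could copy and paste the matrix into a plain-text file. As an alternative, here is a sketch of writing it out from R; the file name corr.txt is my choice, not part of the question:
# A sketch: save the correlation matrix as a plain-text file
# (the file name "corr.txt" is an assumption)
corr_lines <- c(
  "1.00 0.90 -0.40 0.28 -0.05",
  "0.90 1.00 -0.60 0.43 -0.20",
  "-0.40 -0.60 1.00 -0.80 0.40",
  "0.28 0.43 -0.80 1.00 -0.70",
  "-0.05 -0.20 0.40 -0.70 1.00"
)
writeLines(corr_lines, "corr.txt")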
Read in the data, using col_names=F (why?). Check that you have five variables with names invented by R.
Run a principal components analysis from this correlation matrix.
* Obtain a scree plot. Can you justify the use of two components (later, factors), bearing in mind that we have only five variables?
Take a look at the first two component loadings. Which variables appear to feature in which component? Do they have a positive or negative effect?
Create a “covariance list” (for the purposes of performing a factor analysis on the correlation matrix).
Carry out a factor analysis with two factors. We’ll investigate the bits of it in a moment.
* Look at the factor loadings. Describe how the factors are related to the original variables. Is the interpretation clearer than for the principal components analysis?
Look at the uniquenesses. Are there any that are unusually high? Does that surprise you, given your answer to (here)? (You will probably have to make a judgement call here.)
41.3 Air pollution
The data in link are measurements of air-pollution variables recorded at 12 noon on 42 different days at a location in Los Angeles. The file is in .csv format, since it came from a spreadsheet. Specifically, the variables (in suitable units), in the same order as in the data file, are:
wind speed
solar radiation
carbon monoxide
nitric oxide (also known as nitrogen monoxide)
nitrogen dioxide
ozone
hydrocarbons
The aim is to describe pollution using fewer variables than these seven.
Read in the data and demonstrate that you have the right number of rows and columns in your data frame.
* Obtain a five-number summary for each variable. You can do this in one go for all seven variables.
Obtain a principal components analysis. Do it on the correlation matrix, since the variables are measured on different scales. You don’t need to look at the results yet.
Obtain a scree plot. How many principal components might be worth looking at? Explain briefly. (There might be more than one possibility. If so, discuss them all.)
Look at the summary of the principal components object. What light does this shed on the choice of number of components? Explain briefly.
* How does each of the components you chose to keep depend on the variables that were measured? Explain briefly.
Make a data frame that contains (i) the original data, (ii) a column of row numbers, (iii) the principal component scores. Display some of it.
Display the row of your new data frame for the observation with the smallest (most negative) score on component 1. Which row is this? What makes this observation have the most negative score on component 1?
Which observation has the lowest (most negative) value on component 2? Which variables ought to be high or low for this observation? Are they? Explain briefly.
Obtain a biplot, with the row numbers labelled, and explain briefly how your conclusions from the previous two parts are consistent with it.
Run a factor analysis on the same data, obtaining two factors. Look at the factor loadings. Is it clearer which variables belong to which factor, compared to the principal components analysis? Explain briefly.
My solutions follow:
41.4 The Interpersonal Circumplex
The “IPIP Interpersonal Circumplex” (see link) is a personal behaviour survey, where respondents have to rate how accurately a number of statements about behaviour apply to them, on a scale from 1 (“very inaccurate”) to 5 (“very accurate”). A survey was done on 459 people, using a 44-item variant of the above questionnaire, where the statements were as follows. Put an “I” or an “I am” in front of each one:
talkative
find fault
do a thorough job
depressed
original
reserved
helpful
careless
relaxed
curious
full of energy
start quarrels
reliable
tense
ingenious
generate enthusiasm in others
forgiving
disorganized
worry
imaginative
quiet
trusting
lazy
emotionally stable
inventive
assertive
cold and aloof
persevere
moody
value artistic experiences
shy
considerate
efficient
calm in tense situations
prefer routine work
outgoing
sometimes rude
stick to plans
nervous
reflective
have few artistic interests
co-operative
distractible
sophisticated in art and music
I don’t know what a “circumplex” is, but I know it’s not one of those “hat” accents that they have in French. The data are in link. The columns PERS01 through PERS44 represent the above traits.
- Read in the data and check that you have the right number of rows and columns.
Solution
The data values are separated by single spaces, so read_delim with a space as the delimiter will work.
<- "http://ritsokiguess.site/datafiles/personality.txt"
my_url <- read_delim(my_url, " ") pers
Rows: 459 Columns: 45
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
dbl (45): id, PERS01, PERS02, PERS03, PERS04, PERS05, PERS06, PERS07, PERS08...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pers
Yep, 459 people (in rows), and 44 items (in columns), plus one column of ids for the people who took the survey.
In case you were wondering about the “I” vs. “I am” thing, the story seems to be that each behaviour needs to have a verb. If the behaviour has a verb, “I” is all you need, but if it doesn’t, you have to add one, ie. “I am”.
Another thing you might be concerned about is whether these data are “tidy” or not. To some extent, it depends on what you are going to do with it. You could say that the PERS columns are all survey-responses, just to different questions, and you might think of doing something like this:
pers %>% pivot_longer(-id, names_to="item", values_to="response")
to get a really long and skinny data frame. It all depends on what you are doing with it. Long-and-skinny is ideal if you are going to summarize the responses by survey item, or draw something like bar charts of responses facetted by item:
pers %>%
  pivot_longer(-id, names_to="item", values_to="response") %>%
  ggplot(aes(x = response)) + geom_bar() + facet_wrap(~item)
Warning: Removed 371 rows containing non-finite outside the scale range
(`stat_count()`).
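That warning is worth a second look: each of those removed rows is one missing response in the long data frame, which foreshadows the missing-value question below. A sketch of a quick check (just for illustration):
# A sketch: total number of missing responses, which should match the
# number of rows ggplot reported removing
sum(is.na(pers))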
The first time I did this, item PERS36 appeared out of order at the end, and I was wondering what happened, until I realized it was actually misspelled as PES36! I corrected it in the data file, and it should be good now (though I wonder how many years that error persisted for).
For us, in this problem, though, we need the wide format.
\(\blacksquare\)
- There are some missing values among these responses. Eliminate all the individuals with any missing values (since princomp can’t handle them).
Solution
This is actually much easier than it was in the past. A way of asking “are there any missing values anywhere?” is:
any(is.na(pers))
[1] TRUE
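If you also want to know where the missing values are, a sketch of a per-column count (base R) is:
# A sketch: count the missing values in each column (most will be zero)
colSums(is.na(pers))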
There are. To remove them, just this:
pers %>% drop_na() -> pers.ok
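If you are curious how many individuals were dropped, a sketch of a quick check is to compare row counts before and after:
# A sketch: number of individuals removed because of missing values
nrow(pers) - nrow(pers.ok)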
Are there any missings left?
any(is.na(pers.ok))
[1] FALSE
Nope. Extra: you might also have thought of the “tidy, remove, untidy” strategy here. The trouble with that is that you want to remove all the observations for a subject who has any missing ones. This is unlike the multidimensional scaling one, where we wanted to remove all the distances for two cities that we knew ahead of time.
That gives me an idea, though.
pers %>%
  pivot_longer(-id, names_to="item", values_to="rating")
To find out which subjects have any missing values, we can do a group_by and summarize on subjects (that means, the id column; the PERS in the column I called item means “personality”, not “person”!). What do we summarize? Any one of the standard things like mean will return NA if the thing whose mean you are finding has any NA values in it anywhere, and a number if it’s “complete”, so this kind of thing, adding to my pipeline:
pers %>%
  pivot_longer(-id, names_to="item", values_to="rating") %>%
  group_by(id) %>%
  summarize(m = mean(rating)) %>%
  filter(is.na(m))
This is different from drop_na, which would remove any rows (of the long data frame) that have a missing response. This, though, is exactly what we don’t want, since we are trying to keep track of the subjects that have missing values.
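To see the contrast concretely, here is a sketch (for illustration only, and assuming no subject skipped every single item): drop_na on the long data frame throws away the individual missing responses, but every subject still appears:
# A sketch for contrast: drop_na removes only the missing item responses,
# so the number of distinct subjects is unchanged
pers %>%
  pivot_longer(-id, names_to="item", values_to="rating") %>%
  drop_na() %>%
  summarize(n_subjects = n_distinct(id))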
Most of the subjects had an actual numerical mean here, whose value we don’t care about; all we care about here is whether the mean is missing, which implies that one (or more) of the responses was missing.
So now we define a column has_missing
that is true if the subject has any missing values and false otherwise:
pers %>%
  pivot_longer(-id, names_to="item", values_to="rating") %>%
  group_by(id) %>%
  summarize(m = mean(rating)) %>%
  mutate(has_missing = is.na(m)) -> pers.hm
pers.hm
This data frame pers.hm has the same number of rows as the original data frame pers, one per subject, so we can just glue it onto the end of that:
pers %>% bind_cols(pers.hm)
New names:
• `id` -> `id...1`
• `id` -> `id...46`
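Those New names messages appear because both data frames contain an id column, so bind_cols renames the two copies to keep them apart. If that bothers you, one option (a sketch, not the only way) is to drop id from pers.hm before binding:
# A sketch: drop the duplicate id column first, so no renaming is needed
pers %>% bind_cols(pers.hm %>% select(-id))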