library(ggbiplot)
library(tidyverse)
41 Factor Analysis
Packages for this chapter:
41.1 The Interpersonal Circumplex
The “IPIP Interpersonal Circumplex” (see link) is a personal behaviour survey, where respondents have to rate how accurately a number of statements about behaviour apply to them, on a scale from 1 (“very inaccurate”) to 5 (“very accurate”). A survey was done on 459 people, using a 44-item variant of the above questionnaire, where the statements were as follows. Put an “I” or an “I am” in front of each one:
talkative
find fault
do a thorough job
depressed
original
reserved
helpful
careless
relaxed
curious
full of energy
start quarrels
reliable
tense
ingenious
generate enthusiasm in others
forgiving
disorganized
worry
imaginative
quiet
trusting
lazy
emotionally stable
inventive
assertive
cold and aloof
persevere
moody
value artistic experiences
shy
considerate
efficient
calm in tense situations
prefer routine work
outgoing
sometimes rude
stick to plans
nervous
reflective
have few artistic interests
co-operative
distractible
sophisticated in art and music
I don’t know what a “circumplex” is, but I know it’s not one of those “hat” accents that they have in French. The data are in link. The columns PERS01
through PERS44
represent the above traits.
Read in the data and check that you have the right number of rows and columns.
There are some missing values among these responses. Eliminate all the individuals with any missing values (since
princomp
can’t handle them).Carry out a principal components analysis and obtain a scree plot.
How many components/factors should you use? Explain briefly.
* Using your preferred number of factors, run a factor analysis. Obtain “r”-type factor scores, as in class. You don’t need to look at any output yet.
Obtain the factor loadings. How much of the variability does your chosen number of factors explain?
Interpret each of your chosen number of factors. That is, for each factor, identify the items that load heavily on it (you can be fairly crude about this, eg. use a cutoff like 0.4 in absolute value), and translate these items into the statements given in each item. Then, if you can, name what the items loading heavily on each factor have in common. Interpret a negative loading as “not” whatever the item says.
Find a person who is extreme on each of your first three factors (one at a time). For each of these people, what kind of data should they have for the relevant ones of variables
PERS01
throughPERS44
? Do they? Explain reasonably briefly.Check the uniquenesses. Which one(s) seem unusually high? Check these against the factor loadings. Are these what you would expect?
41.2 A correlation matrix
Here is a correlation matrix between five variables. This correlation matrix was based on \(n=50\) observations. Save the data into a file.
1.00 0.90 -0.40 0.28 -0.05
0.90 1.00 -0.60 0.43 -0.20
-0.40 -0.60 1.00 -0.80 0.40
0.28 0.43 -0.80 1.00 -0.70
-0.05 -0.20 0.40 -0.70 1.00
Read in the data, using
col_names=F
(why?). Check that you have five variables with names invented by R.Run a principal components analysis from this correlation matrix.
* Obtain a scree plot. Can you justify the use of two components (later, factors), bearing in mind that we have only five variables?
Take a look at the first two component loadings. Which variables appear to feature in which component? Do they have a positive or negative effect?
Create a “covariance list” (for the purposes of performing a factor analysis on the correlation matrix).
Carry out a factor analysis with two factors. We’ll investigate the bits of it in a moment.
* Look at the factor loadings. Describe how the factors are related to the original variables. Is the interpretation clearer than for the principal components analysis?
Look at the uniquenesses. Are there any that are unusually high? Does that surprise you, given your answer to (here)? (You will probably have to make a judgement call here.)
41.3 Air pollution
The data in link are measurements of air-pollution variables recorded at 12 noon on 42 different days at a location in Los Angeles. The file is in .csv
format, since it came from a spreadsheet. Specifically, the variables (in suitable units), in the same order as in the data file, are:
wind speed
solar radiation
carbon monoxide
Nitric oxide (also known as nitrogen monoxide)
Nitrogen dioxide
Ozone
Hydrocarbons
The aim is to describe pollution using fewer than these seven variables.
Read in the data and demonstrate that you have the right number of rows and columns in your data frame.
* Obtain a five-number summary for each variable. You can do this in one go for all seven variables.
Obtain a principal components analysis. Do it on the correlation matrix, since the variables are measured on different scales. You don’t need to look at the results yet.
Obtain a scree plot. How many principal components might be worth looking at? Explain briefly. (There might be more than one possibility. If so, discuss them all.)
Look at the
summary
of the principal components object. What light does this shed on the choice of number of components? Explain briefly.* How do each of your preferred number of components depend on the variables that were measured? Explain briefly.
Make a data frame that contains (i) the original data, (ii) a column of row numbers, (iii) the principal component scores. Display some of it.
Display the row of your new data frame for the observation with the smallest (most negative) score on component 1. Which row is this? What makes this observation have the most negative score on component 1?
Which observation has the lowest (most negative) value on component 2? Which variables ought to be high or low for this observation? Are they? Explain briefly.
Obtain a biplot, with the row numbers labelled, and explain briefly how your conclusions from the previous two parts are consistent with it.
Run a factor analysis on the same data, obtaining two factors. Look at the factor loadings. Is it clearer which variables belong to which factor, compared to the principal components analysis? Explain briefly.
My solutions follow:
41.4 The Interpersonal Circumplex
The “IPIP Interpersonal Circumplex” (see link) is a personal behaviour survey, where respondents have to rate how accurately a number of statements about behaviour apply to them, on a scale from 1 (“very inaccurate”) to 5 (“very accurate”). A survey was done on 459 people, using a 44-item variant of the above questionnaire, where the statements were as follows. Put an “I” or an “I am” in front of each one:
talkative
find fault
do a thorough job
depressed
original
reserved
helpful
careless
relaxed
curious
full of energy
start quarrels
reliable
tense
ingenious
generate enthusiasm in others
forgiving
disorganized
worry
imaginative
quiet
trusting
lazy
emotionally stable
inventive
assertive
cold and aloof
persevere
moody
value artistic experiences
shy
considerate
efficient
calm in tense situations
prefer routine work
outgoing
sometimes rude
stick to plans
nervous
reflective
have few artistic interests
co-operative
distractible
sophisticated in art and music
I don’t know what a “circumplex” is, but I know it’s not one of those “hat” accents that they have in French. The data are in link. The columns PERS01
through PERS44
represent the above traits.
- Read in the data and check that you have the right number of rows and columns.
Solution
Separated by single spaces.
<- "http://ritsokiguess.site/datafiles/personality.txt"
my_url <- read_delim(my_url, " ") pers
Rows: 459 Columns: 45
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
dbl (45): id, PERS01, PERS02, PERS03, PERS04, PERS05, PERS06, PERS07, PERS08...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pers
Yep, 459 people (in rows), and 44 items (in columns), plus one column of id
s for the people who took the survey.
In case you were wondering about the “I” vs. “I am” thing, the story seems to be that each behaviour needs to have a verb. If the behaviour has a verb, “I” is all you need, but if it doesn’t, you have to add one, ie. “I am”.
Another thing you might be concerned about is whether these data are “tidy” or not. To some extent, it depends on what you are going to do with it. You could say that the PERS
columns are all survey-responses, just to different questions, and you might think of doing something like this:
%>% pivot_longer(-id, names_to="item", values_to="response") pers
to get a really long and skinny data frame. It all depends on what you are doing with it. Long-and-skinny is ideal if you are going to summarize the responses by survey item, or draw something like bar charts of responses facetted by item:
%>%
pers pivot_longer(-id, names_to="item", values_to="response") %>%
ggplot(aes(x = response)) + geom_bar() + facet_wrap(~item)
Warning: Removed 371 rows containing non-finite outside the scale range
(`stat_count()`).
The first time I did this, item PERS36
appeared out of order at the end, and I was wondering what happened, until I realized it was actually misspelled as PES36
! I corrected it in the data file, and it should be good now (though I wonder how many years that error persisted for).
For us, in this problem, though, we need the wide format.
\(\blacksquare\)
- There are some missing values among these responses. Eliminate all the individuals with any missing values (since
princomp
can’t handle them).
Solution
This is actually much easier than it was in the past. A way of asking “are there any missing values anywhere?” is:
any(is.na(pers))
[1] TRUE
There are. To remove them, just this:
%>% drop_na() -> pers.ok pers
Are there any missings left?
any(is.na(pers.ok))
[1] FALSE
Nope. Extra: you might also have thought of the “tidy, remove, untidy” strategy here. The trouble with that here is that you want to remove all the observations for a subject who has any missing ones. This is unlike the multidimensional scaling one where we wanted to remove all the distances for two cities that we knew ahead of time.
That gives me an idea, though.
%>%
pers pivot_longer(-id, names_to="item", values_to="rating")
To find out which subjects have any missing values, we can do a group_by
and summarize
on subjects (that means, the id
column; the PERS
in the column I called item
means “personality”, not “person”!). What do we summarize? Any one of the standard things like mean
will return NA
if the thing whose mean you are finding has any NA values in it anywhere, and a number if it’s “complete”, so this kind of thing, adding to my pipeline:
%>%
pers pivot_longer(-id, names_to="item", values_to="rating") %>%
group_by(id) %>%
summarize(m = mean(rating)) %>%
filter(is.na(m))
This is different from drop_na
, which would remove any rows (of the long data frame) that have a missing response. This, though, is exactly what we don’t want, since we are trying to keep track of the subjects that have missing values.
Most of the subjects had an actual numerical mean here, whose value we don’t care about; all we care about here is whether the mean is missing, which implies that one (or more) of the responses was missing.
So now we define a column has_missing
that is true if the subject has any missing values and false otherwise:
%>%
pers pivot_longer(-id, names_to="item", values_to="rating") %>%
group_by(id) %>%
summarize(m = mean(rating)) %>%
mutate(has_missing = is.na(m)) -> pers.hm
pers.hm
This data frame pers.hm
has the same number of rows as the original data frame pers
, one per subject, so we can just glue it onto the end of that:
%>% bind_cols(pers.hm) pers
New names:
• `id` -> `id...1`
• `id` -> `id...46`
and then filter out the rows for which has_missing
is true. What we did here is really a way of mimicking complete.cases
, which is the way we used to have to do it, before drop_na
came on the scene.
\(\blacksquare\)
- Carry out a principal components analysis and obtain a scree plot.
Solution
This ought to be straightforward, but we’ve got to remember to use only the columns with actual data in them: that is, PERS01
through PERS44
:
.1 <- pers.ok %>%
persselect(starts_with("PERS")) %>%
princomp(cor = T)
ggscreeplot(pers.1)
\(\blacksquare\)
- How many components/factors should you use? Explain briefly.
Solution
I think the clearest elbow is at 7, so we should use 6 components/factors. You could make a case that the elbow at 6 is also part of the scree, and therefore you should use 5 components/factors. Another one of those judgement calls. Ignore the “backwards elbow” at 5: this is definitely part of the mountain rather than the scree. Backwards elbows, as you’ll recall, don’t count as elbows anyway. When I drew this in R Studio, the elbow at 6 was clearer than the one at 7, so I went with 5 components/factors below. The other way to go is to take the number of eigenvalues bigger than 1:
summary(pers.1)
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
Standard deviation 2.6981084 2.04738207 1.74372011 1.59610543 1.50114586
Proportion of Variance 0.1654497 0.09526758 0.06910363 0.05789892 0.05121452
Cumulative Proportion 0.1654497 0.26071732 0.32982096 0.38771988 0.43893440
Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
Standard deviation 1.2627066 1.14816136 1.10615404 1.07405521 1.02180353
Proportion of Variance 0.0362370 0.02996078 0.02780856 0.02621806 0.02372915
Cumulative Proportion 0.4751714 0.50513218 0.53294074 0.55915880 0.58288795
Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
Standard deviation 0.98309198 0.97514006 0.94861102 0.90832065 0.90680594
Proportion of Variance 0.02196522 0.02161132 0.02045143 0.01875105 0.01868857
Cumulative Proportion 0.60485317 0.62646449 0.64691592 0.66566698 0.68435554
Comp.16 Comp.17 Comp.18 Comp.19 Comp.20
Standard deviation 0.86798188 0.85762608 0.84515849 0.82819534 0.8123579
Proportion of Variance 0.01712256 0.01671642 0.01623393 0.01558881 0.0149983
Cumulative Proportion 0.70147810 0.71819452 0.73442845 0.75001726 0.7650156
Comp.21 Comp.22 Comp.23 Comp.24 Comp.25
Standard deviation 0.80910333 0.80435744 0.76594963 0.75946741 0.75434835
Proportion of Variance 0.01487837 0.01470434 0.01333361 0.01310888 0.01293276
Cumulative Proportion 0.77989393 0.79459827 0.80793188 0.82104076 0.83397352
Comp.26 Comp.27 Comp.28 Comp.29 Comp.30
Standard deviation 0.74494825 0.73105470 0.6956473 0.68327155 0.67765233
Proportion of Variance 0.01261245 0.01214639 0.0109983 0.01061045 0.01043665
Cumulative Proportion 0.84658597 0.85873236 0.8697307 0.88034111 0.89077776
Comp.31 Comp.32 Comp.33 Comp.34
Standard deviation 0.66847179 0.660473737 0.651473777 0.629487724
Proportion of Variance 0.01015578 0.009914217 0.009645865 0.009005791
Cumulative Proportion 0.90093355 0.910847763 0.920493629 0.929499420
Comp.35 Comp.36 Comp.37 Comp.38
Standard deviation 0.618765271 0.605892700 0.594231727 0.581419871
Proportion of Variance 0.008701601 0.008343317 0.008025258 0.007682933
Cumulative Proportion 0.938201021 0.946544338 0.954569596 0.962252530
Comp.39 Comp.40 Comp.41 Comp.42
Standard deviation 0.568951666 0.560084703 0.547059522 0.524949694
Proportion of Variance 0.007356955 0.007129429 0.006801685 0.006263004
Cumulative Proportion 0.969609484 0.976738913 0.983540598 0.989803602
Comp.43 Comp.44
Standard deviation 0.490608152 0.456010047
Proportion of Variance 0.005470372 0.004726026
Cumulative Proportion 0.995273974 1.000000000
There are actually 10 of these. But if you look at the scree plot, there really seems to be no reason to take 10 factors rather than, say, 11 or 12. There are a lot of eigenvalues (standard deviations) close to (but just below) 1, and no obvious “bargains” in terms of variance explained: the “cumulative proportion” just keeps going gradually up.
\(\blacksquare\)
- * Using your preferred number of factors, run a factor analysis. Obtain “r”-type factor scores, as in class. You don’t need to look at any output yet.
Solution
I’m going to do the 5 factors that I preferred the first time I looked at this. Don’t forget to grab only the appropriate columns from pers.ok
:
.1 <- pers.ok %>%
pers.okselect(starts_with("PERS")) %>%
factanal(5, scores = "r")
If you think 6 is better, you should feel free to use that here.
\(\blacksquare\)
- Obtain the factor loadings. How much of the variability does your chosen number of factors explain?
Solution
.1$loadings pers.ok
Loadings:
Factor1 Factor2 Factor3 Factor4 Factor5
PERS01 0.130 0.752 0.279
PERS02 0.199 0.202 -0.545
PERS03 0.677 0.160 0.126
PERS04 -0.143 -0.239 0.528 -0.195
PERS05 0.191 0.258 -0.180 0.159 0.520
PERS06 0.104 -0.658 0.185 -0.143
PERS07 0.313 0.113 0.533 0.130
PERS08 -0.558 0.168 -0.173
PERS09 -0.641 0.136
PERS10 0.149 0.169 0.445
PERS11 0.106 0.282 -0.224 -0.107 0.271
PERS12 -0.157 -0.404
PERS13 0.577 0.331 0.126
PERS14 0.685 -0.135
PERS15 0.133 0.480
PERS16 0.106 0.481 0.335
PERS17 0.277 0.102
PERS18 -0.641
PERS19 -0.159 0.596 0.109
PERS20 0.215 0.103 0.498
PERS21 -0.805 0.121
PERS22 0.153 0.175 0.533
PERS23 -0.622 0.121 -0.259
PERS24 0.135 -0.582 0.153
PERS25 0.176 -0.124 0.517
PERS26 0.144 0.515 -0.215 -0.260 0.244
PERS27 -0.205 0.136 -0.438
PERS28 0.623 0.245 0.132
PERS29 0.477 -0.231
PERS30 0.500
PERS31 -0.619 0.191
PERS32 0.322 0.553 0.134
PERS33 0.622
PERS34 0.105 -0.566
PERS35 -0.180 -0.161
PERS36 0.675 -0.146 0.118 0.107
PERS37 -0.300 0.134 -0.495
PERS38 0.521 0.169
PERS39 -0.333 0.530
PERS40 0.162 0.585
PERS41 -0.171 -0.326
PERS42 0.271 0.179 0.552 0.139
PERS43 -0.465 0.236 -0.174
PERS44 0.449
Factor1 Factor2 Factor3 Factor4 Factor5
SS loadings 3.810 3.740 3.174 2.939 2.627
Proportion Var 0.087 0.085 0.072 0.067 0.060
Cumulative Var 0.087 0.172 0.244 0.311 0.370
The Cumulative Var line at the bottom says that our five factors together have explained 37% of the variability. This is not great, but is the kind of thing we have to live with in this kind of analysis (like the personality one in class).
\(\blacksquare\)
- Interpret each of your chosen number of factors. That is, for each factor, identify the items that load heavily on it (you can be fairly crude about this, eg. use a cutoff like 0.4 in absolute value), and translate these items into the statements given in each item. Then, if you can, name what the items loading heavily on each factor have in common. Interpret a negative loading as “not” whatever the item says.
Solution
This is a lot of work, but I figure you should go through it at least once! If you have some number of factors other than 5, your results will be different from mine. Keep going as long as you reasonably can. Factor 1: 3, 8 (negatively), 13, 18 (negative), 23 (negative), 28, 33, 38 and maybe 43 (negatively). These are: do a thorough job, not-careless, reliable, not-disorganized, not-lazy, persevere, efficient, stick to plans, not-distractible. These have the common theme of paying attention to detail and getting the job done properly.
Factor 2: 1, not-6, 16, not-21, 26, not-31, 36. Talkative, not-reserved, generates enthusiasm, not-quiet, assertive, not-shy, outgoing. “Extravert” seems to capture all of those. Factor 3: 4, not-9, 14, 19, not-24, 29, not-34, 39. Depressed, not-relaxed, tense, worried, not emotionally stable, moody, not-calm-when-tense, nervous. “Not happy” or something like that.
Notice how these seem to be jumping in steps of 5? The psychology scoring of assessments like this is that a person’s score on some dimension is found by adding up their scores on certain of the questions and subtracting their scores on others (“reverse-coded questions”). I’m guessing that these guys have 5 dimensions they are testing for, corresponding presumably to my five factors. The questionnaire at link is different, but you see the same idea there. (The jumps there seem to be in steps of 8, since they have 8 dimensions.)
Factor 4: not-2, 7, not-12 (just), 22, not-27, 32, not-37, 42. Doesn’t find fault, helpful, doesn’t start quarrels, trusting, not-cold-and-aloof, considerate, not-sometimes-rude, co-operative. “Helps without judgement” or similar.
Factor 5: 5, 10, 15, 20, 25, 30, 40, 44. Original, curious, ingenious, imaginative, inventive, values artistic experiences, reflective, sophisticated in art and music. Creative.
I remembered that psychologists like to talk about the “big 5” personality traits. These are extraversion (factor 2 here), agreeableness (factor 4), openness (factor 5?), conscientiousness (factor 1), and neuroticism (factor 3). The correspondence appears to be pretty good. (I wrote my answers above last year without thinking about “big 5” at all.)
I wonder whether 6 factors is different?
.2 <- pers.ok %>%
pers.okselect(starts_with("PERS")) %>%
factanal(6, scores = "r")
.2$loadings pers.ok
Loadings:
Factor1 Factor2 Factor3 Factor4 Factor5 Factor6
PERS01 0.114 -0.668 0.343 0.290
PERS02 0.204 -0.451 0.354
PERS03 0.658 0.200 0.131
PERS04 -0.141 0.215 0.538 -0.196
PERS05 0.212 -0.271 -0.168 0.115 0.571
PERS06 0.694 0.171 -0.102
PERS07 0.315 -0.114 0.530 0.138
PERS08 -0.602 0.145 0.180
PERS09 -0.669 0.118 0.106
PERS10 0.125 0.398 0.264
PERS11 -0.131 -0.257 0.173 0.445
PERS12 -0.178 -0.348 0.188
PERS13 0.572 0.332 0.129
PERS14 0.142 0.675 0.121
PERS15 0.141 0.495
PERS16 -0.322 -0.130 0.114 0.248 0.491
PERS17 -0.111 0.346 0.139
PERS18 -0.655
PERS19 0.139 0.598 0.101 0.104
PERS20 -0.165 0.489 0.166
PERS21 0.843 -0.110
PERS22 0.123 -0.105 0.613 0.110
PERS23 -0.631 0.115 -0.232
PERS24 -0.603 0.113 0.143
PERS25 -0.135 -0.119 0.513 0.166
PERS26 0.113 -0.385 -0.225 -0.163 0.176 0.454
PERS27 0.275 0.128 -0.365 0.177
PERS28 0.617 0.253 0.129
PERS29 0.466 -0.161 0.164
PERS30 0.497
PERS31 0.659 0.166
PERS32 0.297 0.630 0.102
PERS33 0.597 0.239
PERS34 -0.584 0.153
PERS35 0.221 -0.210
PERS36 -0.562 -0.171 0.214 0.385
PERS37 -0.320 -0.439 0.220
PERS38 0.501 0.123 0.130 0.182
PERS39 -0.123 0.410 0.512 0.118
PERS40 0.146 0.598
PERS41 -0.106 -0.378 0.117
PERS42 0.259 -0.139 0.587 0.118
PERS43 -0.515 0.210 0.253
PERS44 0.454
Factor1 Factor2 Factor3 Factor4 Factor5 Factor6
SS loadings 3.830 3.284 3.213 2.865 2.527 1.659
Proportion Var 0.087 0.075 0.073 0.065 0.057 0.038
Cumulative Var 0.087 0.162 0.235 0.300 0.357 0.395
Much of the same kind of thing seems to be happening, though it’s a bit fuzzier. I suspect the devisers of this survey were adherents to the “big 5” theory. The factor 6 here is items 11, 16 and 26, which would be expected to belong to factor 2 here, given what we found before. I think these particular items are about generating enthusiasm in others, rather than (necessarily) about being extraverted oneself.
\(\blacksquare\)
- Find a person who is extreme on each of your first three factors (one at a time). For each of these people, what kind of data should they have for the relevant ones of variables
PERS01
throughPERS44
? Do they? Explain reasonably briefly.
Solution
For this, we need the factor scores obtained in part (here).1 I’m thinking that I will create a data frame with the original data (with the missing values removed) and the factor scores together, and then look in there. This will have a lot of columns, but we’ll only need to display some each time. This is based on my 5-factor solution. I’m adding a column id
so that I know which of the individuals (with no missing data) we are looking at:
.1 <- as_tibble(pers.ok.1$scores) %>%
scoresbind_cols(pers.ok) %>%
mutate(id = row_number())
.1 scores
I did it this way, rather than using data.frame
, so that I would end up with a tibble
that would display nicely rather than running off the page. This meant turning the matrix of factor scores into a tibble
first and then gluing everything onto it. (There’s no problem in using data.frame
here if you prefer. You could even begin with data.frame
and pipe the final result into as_tibble
to end up with a tibble
.) Let’s start with factor 1. There are several ways to find the person who scores highest and/or lowest on that factor:
.1 %>% filter(Factor1 == max(Factor1)) scores
to display the maximum, or
.1 %>% arrange(Factor1) %>% slice(1:5) scores
to display the minimum (and in this case the five smallest ones), or
.1 %>% filter(abs(Factor1) == max(abs(Factor1))) scores
to display the largest one in size, whether positive or negative. The code is a bit obfuscated because I have to take absolute values twice. Maybe it would be clearer to create a column with the absolute values in it and look at that:
.1 %>%
scoresmutate(abso = abs(Factor1)) %>%
filter(abso == max(abso))
Does this work too?
.1 %>% arrange(desc(abs(Factor1))) %>% slice(1:5) scores
It looks as if it does: “sort the Factor 1 scores in descending order by absolute value, and display the first few”. The most extreme scores on Factor 1 are all negative: the most positive one (found above) was only about 1.70.
For you, you don’t have to be this sophisticated. It’s enough to eyeball the factor scores on factor 1 and find one that’s reasonably far away from zero. Then you note its row and slice
that row, later.
I think I like the idea of creating a new column with the absolute values in it and finding the largest of that. Before we pursue that, though, remember that we don’t need to look at all the PERS
columns, because only some of them load highly on each factor. These are the ones I defined into f1
first; the first ones have positive loadings and the last three have negative loadings:
<- c(3, 13, 28, 33, 38, 8, 18, 23, 43)
f1 .1 %>%
scoresmutate(abso = abs(Factor1)) %>%
filter(abso == max(abso)) %>%
select(id, Factor1, num_range("PERS", f1, width = 2))
I don’t think I’ve used num_range
before, like, ever. It is one of the select-helpers like starts_with
. It is used when you have column names that are some constant text followed by variable numbers, which is exactly what we have here: we want to select the PERS
columns with numbers that we specify. num_range
requires (at least) two things: the text prefix, followed by a vector of numbers that are to be glued onto the prefix. I defined that first so as not to clutter up the select
line. The third thing here is width
: all the PERS
columns have a name that ends with two digits, so PERS03
rather than PERS3
, and using width
makes sure that the zero gets inserted.
Individual 340 is a low scorer on factor 1, so they should have low scores on the first five items (the ones with positive loading on factor 1) and high scores on the last four. This is indeed what happened: 1s, 2s and 3s on the first five items and 4s and 5s on the last four. Having struggled through that, factors 2 and 3 are repeats of this. The high loaders on factor 2 are the ones shown in f2
below, with the first five loading positively and the last three negatively.2
<- c(1, 11, 16, 26, 36, 6, 21, 31)
f2 .1 %>%
scoresmutate(abso = abs(Factor2)) %>%
filter(abso == max(abso)) %>%
select(id, Factor2, num_range("PERS", f2, width = 2))
What kind of idiot, I was thinking, named the data frame of scores scores.1
when there are going to be three factors to assess?
The first five scores are lowish, but the last three are definitely high (three 5s). This idea of a low score on the positive-loading items and a high score on the negatively-loading ones is entirely consistent with a negative factor score.
Finally, factor 3, which loads highly on items 4, 9 (neg), 14, 19, 24 (neg), 29, 34 (neg), 39. (Item 44, which you’d expect to be part of this factor, is actually in factor 5.) First we see which individual this is:
<- c(4, 14, 19, 29, 39, 9, 24, 34)
f3 .1 %>%
scoresmutate(abso = abs(Factor3)) %>%
filter(abso == max(abso)) %>%
select(id, Factor3, num_range("PERS", f3, width = 2))
The only mysterious one there is item 19, which ought to be low, because it has a positive loading and the factor score is unusually negative. But it is 4 on a 5-point scale. The others that are supposed to be low are 1 and the ones that are supposed to be high are 4 or 5, so those all match up.
\(\blacksquare\)
- Check the uniquenesses. Which one(s) seem unusually high? Check these against the factor loadings. Are these what you would expect?
Solution
Mine are
.1$uniquenesses pers.ok
PERS01 PERS02 PERS03 PERS04 PERS05 PERS06 PERS07 PERS08
0.3276244 0.6155884 0.4955364 0.6035655 0.5689691 0.4980334 0.5884146 0.6299781
PERS09 PERS10 PERS11 PERS12 PERS13 PERS14 PERS15 PERS16
0.5546981 0.7460655 0.7740590 0.8016644 0.5336047 0.5035412 0.7381636 0.6352166
PERS17 PERS18 PERS19 PERS20 PERS21 PERS22 PERS23 PERS24
0.8978624 0.5881834 0.5949740 0.6900378 0.3274366 0.6564542 0.5279346 0.6107080
PERS25 PERS26 PERS27 PERS28 PERS29 PERS30 PERS31 PERS32
0.6795545 0.5412962 0.7438329 0.5289192 0.7114735 0.7386601 0.5762901 0.5592906
PERS33 PERS34 PERS35 PERS36 PERS37 PERS38 PERS39 PERS40
0.6029914 0.6573411 0.9306766 0.4966071 0.6396371 0.6804821 0.5933423 0.6204610
PERS41 PERS42 PERS43 PERS44
0.8560531 0.5684220 0.6898732 0.7872295
Yours will be different if you used a different number of factors. But the procedure you follow will be the same as mine.
I think the highest of these is 0.9307, for item 35. Also high is item 17, 0.8979. If you look back at the table of loadings, item 35 has low loadings on all the factors: the largest in size is only 0.180. The largest loading for item 17 is 0.277, on factor 4. This is not high either.
Looking down the loadings table, also item 41 has only a loading of \(-0.326\) on factor 5, so its uniqueness ought to be pretty high as well. At 0.8561, it is.
\(\blacksquare\)
41.5 A correlation matrix
Here is a correlation matrix between five variables. This correlation matrix was based on \(n=50\) observations. Save the data into a file.
1.00 0.90 -0.40 0.28 -0.05
0.90 1.00 -0.60 0.43 -0.20
-0.40 -0.60 1.00 -0.80 0.40
0.28 0.43 -0.80 1.00 -0.70
-0.05 -0.20 0.40 -0.70 1.00
- Read in the data, using
col_names=F
(why?). Check that you have five variables with names invented by R.
Solution
I saved my data into cov5.txt
,3 delimited by single spaces, so:
<- read_delim("cov5.txt", " ", col_names = F) corr
Rows: 5 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
dbl (5): X1, X2, X3, X4, X5
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
corr
I needed to say that I have no variable names and I want R to provide some. As you see, it did: X1
through X5
. You can also supply your own names in this fashion:
<- c("first", "second", "third", "fourth", "fifth")
my_names <- read_delim("cov5.txt", " ", col_names = my_names) corr2
Rows: 5 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
dbl (5): first, second, third, fourth, fifth
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
corr2
\(\blacksquare\)
- Run a principal components analysis from this correlation matrix.
Solution
Two lines, these:
<- as.matrix(corr)
corr.mat .1 <- princomp(covmat = corr.mat) corr
Or do it in one step as
.1a <- princomp(as.matrix(corr)) corr
if you like, but I think it’s less clear what’s going on.
\(\blacksquare\)
- * Obtain a scree plot. Can you justify the use of two components (later, factors), bearing in mind that we have only five variables?
Solution
ggscreeplot(corr.1)
There is kind of an elbow at 3, which would suggest two components/factors.4
You can also use the eigenvalue-bigger-than-1 thing:
summary(corr.1)
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
Standard deviation 1.7185460 1.1686447 0.70207741 0.36584870 0.2326177
Proportion of Variance 0.5906801 0.2731461 0.09858254 0.02676905 0.0108222
Cumulative Proportion 0.5906801 0.8638262 0.96240875 0.98917780 1.0000000
Only the first two eigenvalues are bigger than 1, and the third is quite a bit smaller. So this would suggest two factors also. The third eigenvalue is in that kind of nebulous zone between between being big and being small, and the percent of variance explained is also ambiguous: is 86% good enough or should I go for 96%? These questions rarely have good answers. But an issue is that you want to summarize your variables with a (much) smaller number of factors; with 5 variables, having two factors rather than more than two seems to be a way of gaining some insight.
\(\blacksquare\)
- Take a look at the first two component loadings. Which variables appear to feature in which component? Do they have a positive or negative effect?
Solution
.1$loadings corr
Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
X1 0.404 0.571 0.287 0.363 0.545
X2 0.482 0.439 0.144 -0.363 -0.650
X3 -0.501 0.122 0.652 0.412 -0.375
X4 0.490 -0.390 -0.171 0.689 -0.323
X5 -0.338 0.561 -0.665 0.304 -0.189
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
SS loadings 1.0 1.0 1.0 1.0 1.0
Proportion Var 0.2 0.2 0.2 0.2 0.2
Cumulative Var 0.2 0.4 0.6 0.8 1.0
This is messy.
In the first component, the loadings are all about the same in size, but the ones for X3
and X5
are negative and the rest are positive. Thus, component 1 is contrasting X3
and X5
with the others.
In the second component, the emphasis is on X1
, X2
and X5
, all with negative loadings, and possibly X4
, with a positive loading.
Note that the first component is basically “size”, since the component loadings are all almost equal in absolute value. This often happens in principal components; for example, it also happened with the running records in class.
I hope the factor analysis, with its rotation, will straighten things out some.
\(\blacksquare\)
- Create a “covariance list” (for the purposes of performing a factor analysis on the correlation matrix).
Solution
This is about the most direct way:
<- list(cov = corr.mat, n.obs = 50) corr.list
recalling that there were 50 observations. The idea is that we feed this into factanal
instead of the correlation matrix, so that the factor analysis knows how many individuals there were (for testing and such).
Note that you need the correlation matrix as a matrix
, not as a data frame. If you ran the princomp
all in one step, you’ll have to create the correlation matrix again, for example like this:
<- list(cov = as.matrix(corr), n.obs = 50) corr.list2
The actual list looks like this:
corr.list
$cov
X1 X2 X3 X4 X5
[1,] 1.00 0.90 -0.4 0.28 -0.05
[2,] 0.90 1.00 -0.6 0.43 -0.20
[3,] -0.40 -0.60 1.0 -0.80 0.40
[4,] 0.28 0.43 -0.8 1.00 -0.70
[5,] -0.05 -0.20 0.4 -0.70 1.00
$n.obs
[1] 50
An R list is a collection of things not all of the same type, here a matrix and a number, and is a handy way of keeping a bunch of connected things together. You use the same dollar-sign notation as for a data frame to access the things in a list:
$n.obs corr.list
[1] 50
and logically this is because, to R, a data frame is a special kind of a list, so anything that works for a list also works for a data frame, plus some extra things besides.5
The same idea applies to extracting things from the output of a regression (with lm
) or something like a t.test
: the output from those is also a list. But for those, I like tidy
from broom
better.
\(\blacksquare\)
- Carry out a factor analysis with two factors. We’ll investigate the bits of it in a moment.
Solution
.2 <- factanal(factors = 2, covmat = corr.list) corr
\(\blacksquare\)
- * Look at the factor loadings. Describe how the factors are related to the original variables. Is the interpretation clearer than for the principal components analysis?
Solution
.2$loadings corr
Loadings:
Factor1 Factor2
[1,] 0.905
[2,] -0.241 0.968
[3,] 0.728 -0.437
[4,] -0.977 0.201
[5,] 0.709
Factor1 Factor2
SS loadings 2.056 1.987
Proportion Var 0.411 0.397
Cumulative Var 0.411 0.809
Oh yes, this is a lot clearer. Factor 1 is variables 3 and 5 contrasted with variable 4; factor 2 is variables 1 and 2. No hand-waving required.
Perhaps now is a good time to look back at the correlation matrix and see why the factor analysis came out this way:
corr
Variable X1
is highly correlated with X2
but not really any of the others. Likewise variables X3, X4, X5
are more or less highly correlated among themselves but not with the others (X2
and X3
being an exception, but the big picture is as I described). So variables that appear in the same factor should be highly correlated with each other and not with variables that are in different factors. But it took the factor analysis to really show this up.
\(\blacksquare\)
- Look at the uniquenesses. Are there any that are unusually high? Does that surprise you, given your answer to (here)? (You will probably have to make a judgement call here.)
Solution
.2$uniquenesses corr
X1 X2 X3 X4 X5
0.1715682 0.0050000 0.2789445 0.0050000 0.4961028
The ones for X3
and X5
are higher than the rest; this is because their loadings on factor 1 are lower than the rest. Since those loadings are still high, I wouldn’t be worried about the uniquenesses.
The point here is that an (excessively) high uniqueness indicates a variable that doesn’t appear in any factor. The easy link to make is “all the variables appear in a factor, so there shouldn’t be any very high uniquenesses”. If, say, X3
doesn’t have a high loading on any factor, X3
would have a high uniqueness (like 0.9, and none of these values approach that).
\(\blacksquare\)
41.6 Air pollution
The data in link are measurements of air-pollution variables recorded at 12 noon on 42 different days at a location in Los Angeles. The file is in .csv
format, since it came from a spreadsheet. Specifically, the variables (in suitable units), in the same order as in the data file, are:
wind speed
solar radiation
carbon monoxide
Nitric oxide (also known as nitrogen monoxide)
Nitrogen dioxide
Ozone
Hydrocarbons
The aim is to describe pollution using fewer than these seven variables.
- Read in the data and demonstrate that you have the right number of rows and columns in your data frame.
Solution
This is a .csv
file, so:
<- "http://ritsokiguess.site/datafiles/airpollution.csv"
my_url <- read_csv(my_url) air
Rows: 42 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (7): wind, solar.radiation, CO, NO, NO2, O3, HC
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
air
There should be 42 rows (for the 42 days), and 7 columns (for the 7 variables), and there are.
\(\blacksquare\)
- * Obtain a five-number summary for each variable. You can do this in one go for all seven variables.
Solution
Like this (the cleanest):
%>%
air summarize(across(everything(), \(x) quantile(x)))
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
I have to figure out how to identify which number from the five number summary each of these is, but in this case you can easily figure it out since the min is the smallest and the max has to be the biggest in each column.
Or, with some more work, this:
%>%
air pivot_longer(everything(), names_to="xname", values_to="x") %>%
nest_by(xname) %>%
rowwise() %>%
mutate(q = list(enframe(quantile(data$x)))) %>%
unnest(q) %>%
pivot_wider(names_from=name, values_from=value) %>%
select(-data)
There’s a lot here. Run it one line at a time to see what it does:
put the names of the variables in one column and the values in a second. This is the same trick as when we want to make plots of all the variables facetted.
the
nest_by
says: for each variable (whose names are now inxname
), make a dataframe calleddata
of the observations (inx
) for that variablethe rest of the way, work one row at a time
work out the five-number summary for each variable, using the values
x
in the data framedata
of each row of the list-column, one at a time. This is the base Rquantile
, working on a vector (the columnx
of the data framedata
), so it gives you back a named vector. If you are not familiar with that, try runningquantile(1:10)
and see how the output has both the percentiles and, above them, the percents that they go with. The tidyverse doesn’t like names, so my favourite way of keeping them with a named vector is to run it throughenframe
. This makes a two-column dataframe, with a column calledname
that is in this case the percents, and a column calledvalue
that is the percentiles. This is a dataframe rather than a single number, so it needs alist
on the front as well (to make another list-column). There are rather a lot of brackets to close here; if you are not sure you have enough, type another close bracket, pause, and see what it matches (R Studio will show you). If it matches nothing, you have too many close brackets.show the values of the five-number summary for each variable (in long format, but with the percentages attached)
for human consumption, put the percentiles in columns, one row for each variable
finally, get rid of the dataframes of original values (that we don’t need any more now that we have summarized them).
Extra: say you wanted to make facetted histograms of each variable. You would begin the same way, with the pivot_longer
, and at the end, facet_wrap
with scales = "free"
(since the variables are measured on different scales):
%>%
air pivot_longer(everything(), names_to="xname", values_to="x") %>%
ggplot(aes(x=x)) + geom_histogram(bins = 6) +
facet_wrap(~xname, scales = "free")
Extra extra: I originally put a pipe symbol on the end of the line with the geom_histogram
on it, and got an impenetrable error. However, googling the error message (often a good plan) gave me a first hit that told me exactly what I had done.
\(\blacksquare\)
- Obtain a principal components analysis. Do it on the correlation matrix, since the variables are measured on different scales. You don’t need to look at the results yet.
Solution
This too is all rather like the previous question:
.1 <- princomp(air, cor = T) air
\(\blacksquare\)
- Obtain a scree plot. How many principal components might be worth looking at? Explain briefly. (There might be more than one possibility. If so, discuss them all.)
Solution
ggscreeplot
the thing you just obtained, having loaded package ggbiplot
:
ggscreeplot(air.1)
There is a technicality here, which is that ggbiplot
, the package, loads plyr
, which contains a lot of the same things as dplyr
(the latter is a cut-down version of the former). If you load dplyr
and then plyr
(that is to say, if you load the tidyverse
first and then ggbiplot
), you will end up with trouble, and probably the wrong version of a lot of functions. To avoid this, load ggbiplot
first, and then you’ll be OK.
Now, finally, we might diverge from the previous question. There are actually two elbows on this plot, at 2 and at 4, which means that we should entertain the idea of either 1 or 3 components. I would be inclined to say that the elbow at 2 is still “too high up” the mountain — there is still some more mountain below it.
The points at 3 and 6 components look like elbows too, but they are pointing the wrong way. What you are looking for when you search for elbows are points that are the end of the mountain and the start of the scree. The elbows at 2 (maybe) and 4 (definitely) are this kind of thing, but the elbows at 3 and at 6 are not.
\(\blacksquare\)
- Look at the
summary
of the principal components object. What light does this shed on the choice of number of components? Explain briefly.
Solution
summary(air.1)
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
Standard deviation 1.5286539 1.1772853 1.0972994 0.8526937 0.80837896
Proportion of Variance 0.3338261 0.1980001 0.1720094 0.1038695 0.09335379
Cumulative Proportion 0.3338261 0.5318262 0.7038356 0.8077051 0.90105889
Comp.6 Comp.7
Standard deviation 0.73259047 0.39484041
Proportion of Variance 0.07666983 0.02227128
Cumulative Proportion 0.97772872 1.00000000
The first component only explains 33% of the variability, not very much, but the first three components together explain 70%, which is much more satisfactory. So I would go with 3 components.
There are two things here: finding an elbow, and explaining a sensible fraction of the variability. You could explain more of the variability by taking more components, but if you are not careful you end up explaining seven variables with, um, seven variables.
If you go back and look at the scree plot, you’ll see that the first elbow is really rather high up the mountain, and it’s really the second elbow that is the start of the scree.
If this part doesn’t persuade you that three components is better than one, you need to pick a number of components to use for the rest of the question, and stick to it all the way through.
\(\blacksquare\)
- * How do each of your preferred number of components depend on the variables that were measured? Explain briefly.
Solution
When this was a hand-in question, there were three marks for it, which was a bit of a giveaway! Off we go:
.1$loadings air
Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
wind 0.237 0.278 0.643 0.173 0.561 0.224 0.241
solar.radiation -0.206 -0.527 0.224 0.778 -0.156
CO -0.551 -0.114 0.573 0.110 -0.585
NO -0.378 0.435 -0.407 0.291 0.450 0.461
NO2 -0.498 0.200 0.197 -0.745 0.338
O3 -0.325 -0.567 0.160 -0.508 0.331 0.417
HC -0.319 0.308 0.541 -0.143 -0.566 0.266 -0.314
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000
Proportion Var 0.143 0.143 0.143 0.143 0.143 0.143 0.143
Cumulative Var 0.143 0.286 0.429 0.571 0.714 0.857 1.000
You’ll have to decide where to draw the line between “zero” and “nonzero”. It doesn’t matter so much where you put the line, so your answers can differ from mine and still be correct.
We need to pick the loadings that are “nonzero”, however we define that, for example:
component 1 depends (negatively) on carbon monoxide and nitrogen dioxide.
component 2 depends (negatively) on solar radiation and ozone and possibly positively on nitric oxide.
component 3 depends (positively) on wind and hydrocarbons.
It is a good idea to translate the variable names (which are abbreviated) back into the long forms.
\(\blacksquare\)
- Make a data frame that contains (i) the original data, (ii) a column of row numbers, (iii) the principal component scores. Display some of it.
Solution
All the columns contain numbers, so cbind
will do it. (The component scores are seven columns, so bind_cols
won’t do it unless you are careful.):
cbind(air, air.1$scores) %>%
mutate(row = row_number()) -> d
head(d)
This is probably the easiest way, but you see that there is a mixture of base R and Tidyverse. The result is actually a base R data.frame
, so displaying it will display all of it, hence my use of head
. If you want to do it the all-Tidyverse way6 then you need to bear in mind that bind_cols
only accepts vectors or data frames, not matrices, so a bit of care is needed first:
.1$scores %>%
airas_tibble() %>%
bind_cols(air) %>%
mutate(row = row_number()) -> dd
dd
I think the best way to think about this is to start with what is farthest from being a data frame or a vector (the matrix of principal component scores, here), bash that into shape first, and then glue the rest of the things to it.
Note that we used all Tidyverse stuff here, so the result is a tibble
, and displaying it for me displays the first ten rows as you’d expect.
\(\blacksquare\)
- Display the row of your new data frame for the observation with the smallest (most negative) score on component 1. Which row is this? What makes this observation have the most negative score on component 1?
Solution
I think the best strategy is to sort by component 1 score (in the default ascending order), and then display the first row:
%>% arrange(Comp.1) %>% slice(1) d
It’s row 8.
We said earlier that component 1 depends negatively on carbon monoxide and nitrogen dioxide, so that an observation that is low on component 1 should be high on these things.7
So are these values high or low? That was the reason for having you make the five-number summary here. For observation 8, CO
is 6 and NO2
is 21; looking back at the five-number summary, the value of CO
is above Q3, and the value of NO2
is the highest of all. So this is entirely what we’d expect.
\(\blacksquare\)
- Which observation has the lowest (most negative) value on component 2? Which variables ought to be high or low for this observation? Are they? Explain briefly.
Solution
This is a repeat of the ideas we just saw:
%>% arrange(Comp.2) %>% slice(1) d
and for convenience, we’ll grab the quantiles again:
%>%
air summarize(across(everything(), \(x) quantile(x)))
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
Day 34 (at the end of the line). We said that component 2 depends (negatively) on solar radiation and ozone and possibly positively on nitric oxide. This means that day 34 ought to be high on the first two and low on the last one (since it’s at the low end of component 2). Solar radiation is, surprisingly, close to the median (75), but ozone, 24, is very near the highest, and nitric oxide, 1, is one of a large number of values equal to the lowest. So day 34 is pointing the right way, even if its variable values are not quite what you’d expect. This business about figuring out whether values on variables are high or low is kind of fiddly, since you have to refer back to the five-number summary to see where the values for a particular observation come. Another way to approach this is to calculate percentile ranks for everything. Let’s go back to our original data frame and replace everything with its percent rank:
%>% mutate(across(everything(), \(x) percent_rank(x))) -> pct_rank
air pct_rank
Observation 34 is row 34 of this:
%>% slice(34) pct_rank
Very high on ozone, (joint) lowest on nitric oxide, but middling on solar radiation. The one we looked at before, observation 8, is this:
%>% slice(8) pct_rank
High on carbon monoxide, the highest on nitrogen dioxide.
\(\blacksquare\)
- Obtain a biplot, with the row numbers labelled, and explain briefly how your conclusions from the previous two parts are consistent with it.
Solution
ggbiplot(air.1, labels = d$row)
Day 8 is way over on the left. The things that point in the direction of observation 8 (NO2, CO
and to a lesser extent NO
and HC
) are the things that observation 8 is high on. On the other hand, observation 8 is around the middle of the arrows for wind
, solar.radiation
and O3
, so that day is not especially remarkable for those.
Observation 34 is nearest the bottom, so we’d expect it to be high on ozone (yes), high on solar radiation (no), low on nitric oxide (since that points the most upward, yes) and also maybe low on wind, since observation 34 is at the “back end” of that arrow. Wind is 6, which is at the first quartile, low indeed.
The other thing that you see from the biplot is that there are four variables pointing more or less up and to the left, and at right angles to them, three other variables pointing up-and-right or down-and-left. You could imagine rotating those arrows so that the group of 4 point upwards, and the other three point left and right. This is what factor analysis does, so you might imagine that this technique might give a clearer picture of which variables belong in which factor than principal components does. Hence what follows.
\(\blacksquare\)
- Run a factor analysis on the same data, obtaining two factors. Look at the factor loadings. Is it clearer which variables belong to which factor, compared to the principal components analysis? Explain briefly.
Solution
.2 <- factanal(air, 2, scores = "r")
air.2$loadings air
Loadings:
Factor1 Factor2
wind -0.176 -0.249
solar.radiation 0.319
CO 0.797 0.391
NO 0.692 -0.152
NO2 0.602 0.152
O3 0.997
HC 0.251 0.147
Factor1 Factor2
SS loadings 1.573 1.379
Proportion Var 0.225 0.197
Cumulative Var 0.225 0.422
I got the factor scores since I’m going to look at a biplot shortly. If you aren’t, you don’t need them.
Factor 1 is rather more clearly carbon monoxide, nitric oxide and nitrogen dioxide. Factor 2 is mostly ozone, with a bit of solar radiation and carbon monoxide. I’d say this is clearer than before.
A biplot would tell us whether the variables are better aligned with the axes now:
biplot(air.2$scores, air.2$loadings)
At least somewhat. Ozone points straight up, since it is the dominant part of factor 2 and not part of factor 1 at all. Carbon monoxide and the two oxides of nitrogen point to the right.
Extra: wind
, solar.radiation
and HC
don’t appear in either of our factors, which also shows up here:
.2$uniquenesses air
wind solar.radiation CO NO NO2
0.9070224 0.8953343 0.2126417 0.4983564 0.6144170
O3 HC
0.0050000 0.9152467
Those variables all have high uniquenesses.
What with the high uniquenesses, and the fact that two factors explain only 42% of the variability, we really ought to look at 3 factors, the same way that we said we should look at 3 components:
.3 <- factanal(air, 3)
air.3$loadings air
Loadings:
Factor1 Factor2 Factor3
wind -0.210 -0.334
solar.radiation 0.318
CO 0.487 0.318 0.507
NO 0.238 -0.269 0.931
NO2 0.989
O3 0.987 0.124
HC 0.427 0.103 0.172
Factor1 Factor2 Factor3
SS loadings 1.472 1.312 1.288
Proportion Var 0.210 0.187 0.184
Cumulative Var 0.210 0.398 0.582
In case you are wondering, factanal
automatically uses the correlation matrix, and so takes care of variables measured on different scales without our having to worry about that.
The rotation has only helped somewhat here. Factor 1 is mainly NO2
with some influence of CO
and HC
; factor 2 is mainly ozone (with a bit of solar radiation and carbon monoxide), and factor 3 is mainly NO
with a bit of CO
.
I think I mentioned most of the variables in there, so the uniquenesses should not be too bad:
.3$uniquenesses air
wind solar.radiation CO NO NO2
0.8404417 0.8905074 0.4046425 0.0050000 0.0050000
O3 HC
0.0050000 0.7776557
Well, not great: wind
and solar.radiation
still have high uniquenesses because they are not strongly part of any factors.
If you wanted to, you could obtain the factor scores for the 3-factor solution, and plot them on a three-dimensional plot using rgl
, rotating them to see the structure. A three dimensional “biplot”8 would also be a cool thing to look at.
\(\blacksquare\)
There are two types of scores here: a person’s scores on the psychological test, 1 through 5, and their factor scores, which are decimal numbers centred at zero. Try not to get these confused.↩︎
I think the last four items in the entire survey are different; otherwise the total number of items would be a multiple of 5.↩︎
Not to be confused with covfefe, which was current news when I wrote this question.↩︎
There is also kind of an elbow at 4, which would suggest three factors, but that’s really too many with only 5 variables. That wouldn’t be much of a reduction in the number of variables, which is what principal components is trying to achieve.↩︎
In computer science terms, a data frame is said to inherit from a list: it is a list plus extra stuff.↩︎
There really ought to be a radio station CTDY: All Tidyverse, All The Time.↩︎
You might have said that component 1 depended on other things as well, in which case you ought to consider whether observation 8 is, as appropriate, high or low on these as well.↩︎
A three-dimensional biplot ought to be called a triplot.↩︎