41  Factor Analysis

Packages for this chapter:

library(ggbiplot)
library(tidyverse)

41.1 The Interpersonal Circumplex

The “IPIP Interpersonal Circumplex” (see link) is a personal behaviour survey, where respondents have to rate how accurately a number of statements about behaviour apply to them, on a scale from 1 (“very inaccurate”) to 5 (“very accurate”). A survey was done on 459 people, using a 44-item variant of the above questionnaire, where the statements were as follows. Put an “I” or an “I am” in front of each one:

  • talkative

  • find fault

  • do a thorough job

  • depressed

  • original

  • reserved

  • helpful

  • careless

  • relaxed

  • curious

  • full of energy

  • start quarrels

  • reliable

  • tense

  • ingenious

  • generate enthusiasm in others

  • forgiving

  • disorganized

  • worry

  • imaginative

  • quiet

  • trusting

  • lazy

  • emotionally stable

  • inventive

  • assertive

  • cold and aloof

  • persevere

  • moody

  • value artistic experiences

  • shy

  • considerate

  • efficient

  • calm in tense situations

  • prefer routine work

  • outgoing

  • sometimes rude

  • stick to plans

  • nervous

  • reflective

  • have few artistic interests

  • co-operative

  • distractible

  • sophisticated in art and music

I don’t know what a “circumplex” is, but I know it’s not one of those “hat” accents that they have in French. The data are in link. The columns PERS01 through PERS44 represent the above traits.

  1. Read in the data and check that you have the right number of rows and columns.

  2. There are some missing values among these responses. Eliminate all the individuals with any missing values (since princomp can’t handle them).

  3. Carry out a principal components analysis and obtain a scree plot.

  4. How many components/factors should you use? Explain briefly.

  5. * Using your preferred number of factors, run a factor analysis. Obtain “r”-type factor scores, as in class. You don’t need to look at any output yet.

  6. Obtain the factor loadings. How much of the variability does your chosen number of factors explain?

  7. Interpret each of your chosen number of factors. That is, for each factor, identify the items that load heavily on it (you can be fairly crude about this, eg. use a cutoff like 0.4 in absolute value), and translate these items into the statements given in each item. Then, if you can, name what the items loading heavily on each factor have in common. Interpret a negative loading as “not” whatever the item says.

  8. Find a person who is extreme on each of your first three factors (one at a time). For each of these people, what kind of data should they have for the relevant ones of variables PERS01 through PERS44? Do they? Explain reasonably briefly.

  9. Check the uniquenesses. Which one(s) seem unusually high? Check these against the factor loadings. Are these what you would expect?

41.2 A correlation matrix

Here is a correlation matrix between five variables. This correlation matrix was based on \(n=50\) observations. Save the data into a file.


1.00 0.90 -0.40 0.28 -0.05
0.90 1.00 -0.60 0.43 -0.20
-0.40 -0.60 1.00 -0.80 0.40
0.28 0.43 -0.80 1.00 -0.70
-0.05 -0.20 0.40 -0.70 1.00
  1. Read in the data, using col_names=F (why?). Check that you have five variables with names invented by R.

  2. Run a principal components analysis from this correlation matrix.

  3. * Obtain a scree plot. Can you justify the use of two components (later, factors), bearing in mind that we have only five variables?

  4. Take a look at the first two component loadings. Which variables appear to feature in which component? Do they have a positive or negative effect?

  5. Create a “covariance list” (for the purposes of performing a factor analysis on the correlation matrix).

  6. Carry out a factor analysis with two factors. We’ll investigate the bits of it in a moment.

  7. * Look at the factor loadings. Describe how the factors are related to the original variables. Is the interpretation clearer than for the principal components analysis?

  8. Look at the uniquenesses. Are there any that are unusually high? Does that surprise you, given your answer to (here)? (You will probably have to make a judgement call here.)

41.3 Air pollution

The data in link are measurements of air-pollution variables recorded at 12 noon on 42 different days at a location in Los Angeles. The file is in .csv format, since it came from a spreadsheet. Specifically, the variables (in suitable units), in the same order as in the data file, are:

  • wind speed

  • solar radiation

  • carbon monoxide

  • Nitric oxide (also known as nitrogen monoxide)

  • Nitrogen dioxide

  • Ozone

  • Hydrocarbons

The aim is to describe pollution using fewer than these seven variables.

  1. Read in the data and demonstrate that you have the right number of rows and columns in your data frame.

  2. * Obtain a five-number summary for each variable. You can do this in one go for all seven variables.

  3. Obtain a principal components analysis. Do it on the correlation matrix, since the variables are measured on different scales. You don’t need to look at the results yet.

  4. Obtain a scree plot. How many principal components might be worth looking at? Explain briefly. (There might be more than one possibility. If so, discuss them all.)

  5. Look at the summary of the principal components object. What light does this shed on the choice of number of components? Explain briefly.

  6. * How do each of your preferred number of components depend on the variables that were measured? Explain briefly.

  7. Make a data frame that contains (i) the original data, (ii) a column of row numbers, (iii) the principal component scores. Display some of it.

  8. Display the row of your new data frame for the observation with the smallest (most negative) score on component 1. Which row is this? What makes this observation have the most negative score on component 1?

  9. Which observation has the lowest (most negative) value on component 2? Which variables ought to be high or low for this observation? Are they? Explain briefly.

  10. Obtain a biplot, with the row numbers labelled, and explain briefly how your conclusions from the previous two parts are consistent with it.

  11. Run a factor analysis on the same data, obtaining two factors. Look at the factor loadings. Is it clearer which variables belong to which factor, compared to the principal components analysis? Explain briefly.

My solutions follow:

41.4 The Interpersonal Circumplex

The “IPIP Interpersonal Circumplex” (see link) is a personal behaviour survey, where respondents have to rate how accurately a number of statements about behaviour apply to them, on a scale from 1 (“very inaccurate”) to 5 (“very accurate”). A survey was done on 459 people, using a 44-item variant of the above questionnaire, where the statements were as follows. Put an “I” or an “I am” in front of each one:

  • talkative

  • find fault

  • do a thorough job

  • depressed

  • original

  • reserved

  • helpful

  • careless

  • relaxed

  • curious

  • full of energy

  • start quarrels

  • reliable

  • tense

  • ingenious

  • generate enthusiasm in others

  • forgiving

  • disorganized

  • worry

  • imaginative

  • quiet

  • trusting

  • lazy

  • emotionally stable

  • inventive

  • assertive

  • cold and aloof

  • persevere

  • moody

  • value artistic experiences

  • shy

  • considerate

  • efficient

  • calm in tense situations

  • prefer routine work

  • outgoing

  • sometimes rude

  • stick to plans

  • nervous

  • reflective

  • have few artistic interests

  • co-operative

  • distractible

  • sophisticated in art and music

I don’t know what a “circumplex” is, but I know it’s not one of those “hat” accents that they have in French. The data are in link. The columns PERS01 through PERS44 represent the above traits.

  1. Read in the data and check that you have the right number of rows and columns.

Solution

Separated by single spaces.

my_url <- "http://ritsokiguess.site/datafiles/personality.txt"
pers <- read_delim(my_url, " ")
Rows: 459 Columns: 45
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
dbl (45): id, PERS01, PERS02, PERS03, PERS04, PERS05, PERS06, PERS07, PERS08...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pers

Yep, 459 people (in rows), and 44 items (in columns), plus one column of ids for the people who took the survey.

In case you were wondering about the “I” vs. “I am” thing, the story seems to be that each behaviour needs to have a verb. If the behaviour has a verb, “I” is all you need, but if it doesn’t, you have to add one, ie. “I am”.

Another thing you might be concerned about is whether these data are “tidy” or not. To some extent, it depends on what you are going to do with it. You could say that the PERS columns are all survey-responses, just to different questions, and you might think of doing something like this:

pers %>% pivot_longer(-id, names_to="item", values_to="response")

to get a really long and skinny data frame. It all depends on what you are doing with it. Long-and-skinny is ideal if you are going to summarize the responses by survey item, or draw something like bar charts of responses facetted by item:

pers %>%
  pivot_longer(-id, names_to="item", values_to="response") %>%
  ggplot(aes(x = response)) + geom_bar() + facet_wrap(~item)
Warning: Removed 371 rows containing non-finite outside the scale range
(`stat_count()`).

The first time I did this, item PERS36 appeared out of order at the end, and I was wondering what happened, until I realized it was actually misspelled as PES36! I corrected it in the data file, and it should be good now (though I wonder how many years that error persisted for).

For us, in this problem, though, we need the wide format.

\(\blacksquare\)

  1. There are some missing values among these responses. Eliminate all the individuals with any missing values (since princomp can’t handle them).

Solution

This is actually much easier than it was in the past. A way of asking “are there any missing values anywhere?” is:

any(is.na(pers))
[1] TRUE

There are. To remove them, just this:

pers %>% drop_na() -> pers.ok

Are there any missings left?

any(is.na(pers.ok))
[1] FALSE

Nope. Extra: you might also have thought of the “tidy, remove, untidy” strategy here. The trouble with that here is that you want to remove all the observations for a subject who has any missing ones. This is unlike the multidimensional scaling one where we wanted to remove all the distances for two cities that we knew ahead of time.

That gives me an idea, though.

pers %>%
  pivot_longer(-id, names_to="item", values_to="rating")

To find out which subjects have any missing values, we can do a group_by and summarize on subjects (that means, the id column; the PERS in the column I called item means “personality”, not “person”!). What do we summarize? Any one of the standard things like mean will return NA if the thing whose mean you are finding has any NA values in it anywhere, and a number if it’s “complete”, so this kind of thing, adding to my pipeline:

pers %>%
  pivot_longer(-id, names_to="item", values_to="rating") %>% 
  group_by(id) %>%
  summarize(m = mean(rating)) %>%
  filter(is.na(m))

This is different from drop_na, which would remove any rows (of the long data frame) that have a missing response. This, though, is exactly what we don’t want, since we are trying to keep track of the subjects that have missing values.

Most of the subjects had an actual numerical mean here, whose value we don’t care about; all we care about here is whether the mean is missing, which implies that one (or more) of the responses was missing.

So now we define a column has_missing that is true if the subject has any missing values and false otherwise:

pers %>%
  pivot_longer(-id, names_to="item", values_to="rating") %>% 
  group_by(id) %>%
  summarize(m = mean(rating)) %>%
  mutate(has_missing = is.na(m)) -> pers.hm
pers.hm 

This data frame pers.hm has the same number of rows as the original data frame pers, one per subject, so we can just glue it onto the end of that:

pers %>% bind_cols(pers.hm)
New names:
• `id` -> `id...1`
• `id` -> `id...46`

and then filter out the rows for which has_missing is true. What we did here is really a way of mimicking complete.cases, which is the way we used to have to do it, before drop_na came on the scene.

\(\blacksquare\)

  1. Carry out a principal components analysis and obtain a scree plot.

Solution

This ought to be straightforward, but we’ve got to remember to use only the columns with actual data in them: that is, PERS01 through PERS44:

pers.1 <- pers.ok %>%
  select(starts_with("PERS")) %>%
  princomp(cor = T)
ggscreeplot(pers.1)

\(\blacksquare\)

  1. How many components/factors should you use? Explain briefly.

Solution

I think the clearest elbow is at 7, so we should use 6 components/factors. You could make a case that the elbow at 6 is also part of the scree, and therefore you should use 5 components/factors. Another one of those judgement calls. Ignore the “backwards elbow” at 5: this is definitely part of the mountain rather than the scree. Backwards elbows, as you’ll recall, don’t count as elbows anyway. When I drew this in R Studio, the elbow at 6 was clearer than the one at 7, so I went with 5 components/factors below. The other way to go is to take the number of eigenvalues bigger than 1:

summary(pers.1)
Importance of components:
                          Comp.1     Comp.2     Comp.3     Comp.4     Comp.5
Standard deviation     2.6981084 2.04738207 1.74372011 1.59610543 1.50114586
Proportion of Variance 0.1654497 0.09526758 0.06910363 0.05789892 0.05121452
Cumulative Proportion  0.1654497 0.26071732 0.32982096 0.38771988 0.43893440
                          Comp.6     Comp.7     Comp.8     Comp.9    Comp.10
Standard deviation     1.2627066 1.14816136 1.10615404 1.07405521 1.02180353
Proportion of Variance 0.0362370 0.02996078 0.02780856 0.02621806 0.02372915
Cumulative Proportion  0.4751714 0.50513218 0.53294074 0.55915880 0.58288795
                          Comp.11    Comp.12    Comp.13    Comp.14    Comp.15
Standard deviation     0.98309198 0.97514006 0.94861102 0.90832065 0.90680594
Proportion of Variance 0.02196522 0.02161132 0.02045143 0.01875105 0.01868857
Cumulative Proportion  0.60485317 0.62646449 0.64691592 0.66566698 0.68435554
                          Comp.16    Comp.17    Comp.18    Comp.19   Comp.20
Standard deviation     0.86798188 0.85762608 0.84515849 0.82819534 0.8123579
Proportion of Variance 0.01712256 0.01671642 0.01623393 0.01558881 0.0149983
Cumulative Proportion  0.70147810 0.71819452 0.73442845 0.75001726 0.7650156
                          Comp.21    Comp.22    Comp.23    Comp.24    Comp.25
Standard deviation     0.80910333 0.80435744 0.76594963 0.75946741 0.75434835
Proportion of Variance 0.01487837 0.01470434 0.01333361 0.01310888 0.01293276
Cumulative Proportion  0.77989393 0.79459827 0.80793188 0.82104076 0.83397352
                          Comp.26    Comp.27   Comp.28    Comp.29    Comp.30
Standard deviation     0.74494825 0.73105470 0.6956473 0.68327155 0.67765233
Proportion of Variance 0.01261245 0.01214639 0.0109983 0.01061045 0.01043665
Cumulative Proportion  0.84658597 0.85873236 0.8697307 0.88034111 0.89077776
                          Comp.31     Comp.32     Comp.33     Comp.34
Standard deviation     0.66847179 0.660473737 0.651473777 0.629487724
Proportion of Variance 0.01015578 0.009914217 0.009645865 0.009005791
Cumulative Proportion  0.90093355 0.910847763 0.920493629 0.929499420
                           Comp.35     Comp.36     Comp.37     Comp.38
Standard deviation     0.618765271 0.605892700 0.594231727 0.581419871
Proportion of Variance 0.008701601 0.008343317 0.008025258 0.007682933
Cumulative Proportion  0.938201021 0.946544338 0.954569596 0.962252530
                           Comp.39     Comp.40     Comp.41     Comp.42
Standard deviation     0.568951666 0.560084703 0.547059522 0.524949694
Proportion of Variance 0.007356955 0.007129429 0.006801685 0.006263004
Cumulative Proportion  0.969609484 0.976738913 0.983540598 0.989803602
                           Comp.43     Comp.44
Standard deviation     0.490608152 0.456010047
Proportion of Variance 0.005470372 0.004726026
Cumulative Proportion  0.995273974 1.000000000

There are actually 10 of these. But if you look at the scree plot, there really seems to be no reason to take 10 factors rather than, say, 11 or 12. There are a lot of eigenvalues (standard deviations) close to (but just below) 1, and no obvious “bargains” in terms of variance explained: the “cumulative proportion” just keeps going gradually up.

\(\blacksquare\)

  1. * Using your preferred number of factors, run a factor analysis. Obtain “r”-type factor scores, as in class. You don’t need to look at any output yet.

Solution

I’m going to do the 5 factors that I preferred the first time I looked at this. Don’t forget to grab only the appropriate columns from pers.ok:

pers.ok.1 <- pers.ok %>%
  select(starts_with("PERS")) %>%
  factanal(5, scores = "r")

If you think 6 is better, you should feel free to use that here.

\(\blacksquare\)

  1. Obtain the factor loadings. How much of the variability does your chosen number of factors explain?

Solution

pers.ok.1$loadings

Loadings:
       Factor1 Factor2 Factor3 Factor4 Factor5
PERS01  0.130   0.752           0.279         
PERS02          0.199   0.202  -0.545         
PERS03  0.677                   0.160   0.126 
PERS04 -0.143  -0.239   0.528  -0.195         
PERS05  0.191   0.258  -0.180   0.159   0.520 
PERS06  0.104  -0.658   0.185  -0.143         
PERS07  0.313   0.113           0.533   0.130 
PERS08 -0.558           0.168  -0.173         
PERS09                 -0.641           0.136 
PERS10  0.149   0.169                   0.445 
PERS11  0.106   0.282  -0.224  -0.107   0.271 
PERS12 -0.157                  -0.404         
PERS13  0.577                   0.331   0.126 
PERS14                  0.685  -0.135         
PERS15  0.133                           0.480 
PERS16  0.106   0.481                   0.335 
PERS17                          0.277   0.102 
PERS18 -0.641                                 
PERS19         -0.159   0.596   0.109         
PERS20          0.215   0.103           0.498 
PERS21         -0.805   0.121                 
PERS22  0.153   0.175           0.533         
PERS23 -0.622           0.121  -0.259         
PERS24          0.135  -0.582           0.153 
PERS25          0.176  -0.124           0.517 
PERS26  0.144   0.515  -0.215  -0.260   0.244 
PERS27         -0.205   0.136  -0.438         
PERS28  0.623                   0.245   0.132 
PERS29                  0.477  -0.231         
PERS30                                  0.500 
PERS31         -0.619   0.191                 
PERS32  0.322                   0.553   0.134 
PERS33  0.622                                 
PERS34  0.105          -0.566                 
PERS35         -0.180                  -0.161 
PERS36          0.675  -0.146   0.118   0.107 
PERS37 -0.300   0.134          -0.495         
PERS38  0.521                           0.169 
PERS39         -0.333   0.530                 
PERS40                          0.162   0.585 
PERS41                         -0.171  -0.326 
PERS42  0.271   0.179           0.552   0.139 
PERS43 -0.465           0.236  -0.174         
PERS44                                  0.449 

               Factor1 Factor2 Factor3 Factor4 Factor5
SS loadings      3.810   3.740   3.174   2.939   2.627
Proportion Var   0.087   0.085   0.072   0.067   0.060
Cumulative Var   0.087   0.172   0.244   0.311   0.370

The Cumulative Var line at the bottom says that our five factors together have explained 37% of the variability. This is not great, but is the kind of thing we have to live with in this kind of analysis (like the personality one in class).

\(\blacksquare\)

  1. Interpret each of your chosen number of factors. That is, for each factor, identify the items that load heavily on it (you can be fairly crude about this, eg. use a cutoff like 0.4 in absolute value), and translate these items into the statements given in each item. Then, if you can, name what the items loading heavily on each factor have in common. Interpret a negative loading as “not” whatever the item says.

Solution

This is a lot of work, but I figure you should go through it at least once! If you have some number of factors other than 5, your results will be different from mine. Keep going as long as you reasonably can. Factor 1: 3, 8 (negatively), 13, 18 (negative), 23 (negative), 28, 33, 38 and maybe 43 (negatively). These are: do a thorough job, not-careless, reliable, not-disorganized, not-lazy, persevere, efficient, stick to plans, not-distractible. These have the common theme of paying attention to detail and getting the job done properly.

Factor 2: 1, not-6, 16, not-21, 26, not-31, 36. Talkative, not-reserved, generates enthusiasm, not-quiet, assertive, not-shy, outgoing. “Extravert” seems to capture all of those. Factor 3: 4, not-9, 14, 19, not-24, 29, not-34, 39. Depressed, not-relaxed, tense, worried, not emotionally stable, moody, not-calm-when-tense, nervous. “Not happy” or something like that.

Notice how these seem to be jumping in steps of 5? The psychology scoring of assessments like this is that a person’s score on some dimension is found by adding up their scores on certain of the questions and subtracting their scores on others (“reverse-coded questions”). I’m guessing that these guys have 5 dimensions they are testing for, corresponding presumably to my five factors. The questionnaire at link is different, but you see the same idea there. (The jumps there seem to be in steps of 8, since they have 8 dimensions.)

Factor 4: not-2, 7, not-12 (just), 22, not-27, 32, not-37, 42. Doesn’t find fault, helpful, doesn’t start quarrels, trusting, not-cold-and-aloof, considerate, not-sometimes-rude, co-operative. “Helps without judgement” or similar.

Factor 5: 5, 10, 15, 20, 25, 30, 40, 44. Original, curious, ingenious, imaginative, inventive, values artistic experiences, reflective, sophisticated in art and music. Creative.

I remembered that psychologists like to talk about the “big 5” personality traits. These are extraversion (factor 2 here), agreeableness (factor 4), openness (factor 5?), conscientiousness (factor 1), and neuroticism (factor 3). The correspondence appears to be pretty good. (I wrote my answers above last year without thinking about “big 5” at all.)

I wonder whether 6 factors is different?

pers.ok.2 <- pers.ok %>%
  select(starts_with("PERS")) %>%
  factanal(6, scores = "r")
pers.ok.2$loadings

Loadings:
       Factor1 Factor2 Factor3 Factor4 Factor5 Factor6
PERS01  0.114  -0.668           0.343           0.290 
PERS02                  0.204  -0.451           0.354 
PERS03  0.658                   0.200           0.131 
PERS04 -0.141   0.215   0.538  -0.196                 
PERS05  0.212  -0.271  -0.168   0.115   0.571         
PERS06          0.694   0.171  -0.102                 
PERS07  0.315  -0.114           0.530   0.138         
PERS08 -0.602           0.145                   0.180 
PERS09                 -0.669   0.118           0.106 
PERS10  0.125                           0.398   0.264 
PERS11         -0.131  -0.257           0.173   0.445 
PERS12 -0.178                  -0.348           0.188 
PERS13  0.572                   0.332   0.129         
PERS14          0.142   0.675                   0.121 
PERS15  0.141                           0.495         
PERS16         -0.322  -0.130   0.114   0.248   0.491 
PERS17                 -0.111   0.346           0.139 
PERS18 -0.655                                         
PERS19          0.139   0.598   0.101   0.104         
PERS20         -0.165                   0.489   0.166 
PERS21          0.843                          -0.110 
PERS22  0.123  -0.105           0.613           0.110 
PERS23 -0.631           0.115  -0.232                 
PERS24                 -0.603           0.113   0.143 
PERS25         -0.135  -0.119           0.513   0.166 
PERS26  0.113  -0.385  -0.225  -0.163   0.176   0.454 
PERS27          0.275   0.128  -0.365           0.177 
PERS28  0.617                   0.253   0.129         
PERS29                  0.466  -0.161           0.164 
PERS30                                  0.497         
PERS31          0.659   0.166                         
PERS32  0.297                   0.630           0.102 
PERS33  0.597                                   0.239 
PERS34                 -0.584                   0.153 
PERS35          0.221                  -0.210         
PERS36         -0.562  -0.171   0.214           0.385 
PERS37 -0.320                  -0.439           0.220 
PERS38  0.501                   0.123   0.130   0.182 
PERS39 -0.123   0.410   0.512                   0.118 
PERS40                          0.146   0.598         
PERS41                         -0.106  -0.378   0.117 
PERS42  0.259  -0.139           0.587   0.118         
PERS43 -0.515           0.210                   0.253 
PERS44                                  0.454         

               Factor1 Factor2 Factor3 Factor4 Factor5 Factor6
SS loadings      3.830   3.284   3.213   2.865   2.527   1.659
Proportion Var   0.087   0.075   0.073   0.065   0.057   0.038
Cumulative Var   0.087   0.162   0.235   0.300   0.357   0.395

Much of the same kind of thing seems to be happening, though it’s a bit fuzzier. I suspect the devisers of this survey were adherents to the “big 5” theory. The factor 6 here is items 11, 16 and 26, which would be expected to belong to factor 2 here, given what we found before. I think these particular items are about generating enthusiasm in others, rather than (necessarily) about being extraverted oneself.

\(\blacksquare\)

  1. Find a person who is extreme on each of your first three factors (one at a time). For each of these people, what kind of data should they have for the relevant ones of variables PERS01 through PERS44? Do they? Explain reasonably briefly.

Solution

For this, we need the factor scores obtained in part (here).1 I’m thinking that I will create a data frame with the original data (with the missing values removed) and the factor scores together, and then look in there. This will have a lot of columns, but we’ll only need to display some each time. This is based on my 5-factor solution. I’m adding a column id so that I know which of the individuals (with no missing data) we are looking at:

scores.1 <- as_tibble(pers.ok.1$scores) %>%
  bind_cols(pers.ok) %>%
  mutate(id = row_number())
scores.1

I did it this way, rather than using data.frame, so that I would end up with a tibble that would display nicely rather than running off the page. This meant turning the matrix of factor scores into a tibble first and then gluing everything onto it. (There’s no problem in using data.frame here if you prefer. You could even begin with data.frame and pipe the final result into as_tibble to end up with a tibble.) Let’s start with factor 1. There are several ways to find the person who scores highest and/or lowest on that factor:

scores.1 %>% filter(Factor1 == max(Factor1))

to display the maximum, or

scores.1 %>% arrange(Factor1) %>% slice(1:5)

to display the minimum (and in this case the five smallest ones), or

scores.1 %>% filter(abs(Factor1) == max(abs(Factor1)))

to display the largest one in size, whether positive or negative. The code is a bit obfuscated because I have to take absolute values twice. Maybe it would be clearer to create a column with the absolute values in it and look at that:

scores.1 %>%
  mutate(abso = abs(Factor1)) %>%
  filter(abso == max(abso))

Does this work too?

scores.1 %>% arrange(desc(abs(Factor1))) %>% slice(1:5)

It looks as if it does: “sort the Factor 1 scores in descending order by absolute value, and display the first few”. The most extreme scores on Factor 1 are all negative: the most positive one (found above) was only about 1.70.

For you, you don’t have to be this sophisticated. It’s enough to eyeball the factor scores on factor 1 and find one that’s reasonably far away from zero. Then you note its row and slice that row, later.

I think I like the idea of creating a new column with the absolute values in it and finding the largest of that. Before we pursue that, though, remember that we don’t need to look at all the PERS columns, because only some of them load highly on each factor. These are the ones I defined into f1 first; the first ones have positive loadings and the last three have negative loadings:

f1 <- c(3, 13, 28, 33, 38, 8, 18, 23, 43)
scores.1 %>%
  mutate(abso = abs(Factor1)) %>%
  filter(abso == max(abso)) %>%
  select(id, Factor1, num_range("PERS", f1, width = 2)) 

I don’t think I’ve used num_range before, like, ever. It is one of the select-helpers like starts_with. It is used when you have column names that are some constant text followed by variable numbers, which is exactly what we have here: we want to select the PERS columns with numbers that we specify. num_range requires (at least) two things: the text prefix, followed by a vector of numbers that are to be glued onto the prefix. I defined that first so as not to clutter up the select line. The third thing here is width: all the PERS columns have a name that ends with two digits, so PERS03 rather than PERS3, and using width makes sure that the zero gets inserted.

Individual 340 is a low scorer on factor 1, so they should have low scores on the first five items (the ones with positive loading on factor 1) and high scores on the last four. This is indeed what happened: 1s, 2s and 3s on the first five items and 4s and 5s on the last four. Having struggled through that, factors 2 and 3 are repeats of this. The high loaders on factor 2 are the ones shown in f2 below, with the first five loading positively and the last three negatively.2

f2 <- c(1, 11, 16, 26, 36, 6, 21, 31)
scores.1 %>%
  mutate(abso = abs(Factor2)) %>%
  filter(abso == max(abso)) %>%
  select(id, Factor2, num_range("PERS", f2, width = 2))

What kind of idiot, I was thinking, named the data frame of scores scores.1 when there are going to be three factors to assess?

The first five scores are lowish, but the last three are definitely high (three 5s). This idea of a low score on the positive-loading items and a high score on the negatively-loading ones is entirely consistent with a negative factor score.

Finally, factor 3, which loads highly on items 4, 9 (neg), 14, 19, 24 (neg), 29, 34 (neg), 39. (Item 44, which you’d expect to be part of this factor, is actually in factor 5.) First we see which individual this is:

f3 <- c(4, 14, 19, 29, 39, 9, 24, 34)
scores.1 %>%
  mutate(abso = abs(Factor3)) %>%
  filter(abso == max(abso)) %>%
  select(id, Factor3, num_range("PERS", f3, width = 2)) 

The only mysterious one there is item 19, which ought to be low, because it has a positive loading and the factor score is unusually negative. But it is 4 on a 5-point scale. The others that are supposed to be low are 1 and the ones that are supposed to be high are 4 or 5, so those all match up.

\(\blacksquare\)

  1. Check the uniquenesses. Which one(s) seem unusually high? Check these against the factor loadings. Are these what you would expect?

Solution

Mine are

pers.ok.1$uniquenesses
   PERS01    PERS02    PERS03    PERS04    PERS05    PERS06    PERS07    PERS08 
0.3276244 0.6155884 0.4955364 0.6035655 0.5689691 0.4980334 0.5884146 0.6299781 
   PERS09    PERS10    PERS11    PERS12    PERS13    PERS14    PERS15    PERS16 
0.5546981 0.7460655 0.7740590 0.8016644 0.5336047 0.5035412 0.7381636 0.6352166 
   PERS17    PERS18    PERS19    PERS20    PERS21    PERS22    PERS23    PERS24 
0.8978624 0.5881834 0.5949740 0.6900378 0.3274366 0.6564542 0.5279346 0.6107080 
   PERS25    PERS26    PERS27    PERS28    PERS29    PERS30    PERS31    PERS32 
0.6795545 0.5412962 0.7438329 0.5289192 0.7114735 0.7386601 0.5762901 0.5592906 
   PERS33    PERS34    PERS35    PERS36    PERS37    PERS38    PERS39    PERS40 
0.6029914 0.6573411 0.9306766 0.4966071 0.6396371 0.6804821 0.5933423 0.6204610 
   PERS41    PERS42    PERS43    PERS44 
0.8560531 0.5684220 0.6898732 0.7872295 

Yours will be different if you used a different number of factors. But the procedure you follow will be the same as mine.

I think the highest of these is 0.9307, for item 35. Also high is item 17, 0.8979. If you look back at the table of loadings, item 35 has low loadings on all the factors: the largest in size is only 0.180. The largest loading for item 17 is 0.277, on factor 4. This is not high either.

Looking down the loadings table, also item 41 has only a loading of \(-0.326\) on factor 5, so its uniqueness ought to be pretty high as well. At 0.8561, it is.

\(\blacksquare\)

41.5 A correlation matrix

Here is a correlation matrix between five variables. This correlation matrix was based on \(n=50\) observations. Save the data into a file.


1.00 0.90 -0.40 0.28 -0.05
0.90 1.00 -0.60 0.43 -0.20
-0.40 -0.60 1.00 -0.80 0.40
0.28 0.43 -0.80 1.00 -0.70
-0.05 -0.20 0.40 -0.70 1.00
  1. Read in the data, using col_names=F (why?). Check that you have five variables with names invented by R.

Solution

I saved my data into cov5.txt,3 delimited by single spaces, so:

corr <- read_delim("cov5.txt", " ", col_names = F)
Rows: 5 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
dbl (5): X1, X2, X3, X4, X5

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
corr

I needed to say that I have no variable names and I want R to provide some. As you see, it did: X1 through X5. You can also supply your own names in this fashion:

my_names <- c("first", "second", "third", "fourth", "fifth")
corr2 <- read_delim("cov5.txt", " ", col_names = my_names)
Rows: 5 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
dbl (5): first, second, third, fourth, fifth

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
corr2

\(\blacksquare\)

  1. Run a principal components analysis from this correlation matrix.

Solution

Two lines, these:

corr.mat <- as.matrix(corr)
corr.1 <- princomp(covmat = corr.mat)

Or do it in one step as

corr.1a <- princomp(as.matrix(corr))

if you like, but I think it’s less clear what’s going on.

\(\blacksquare\)

  1. * Obtain a scree plot. Can you justify the use of two components (later, factors), bearing in mind that we have only five variables?

Solution

ggscreeplot(corr.1)

There is kind of an elbow at 3, which would suggest two components/factors.4

You can also use the eigenvalue-bigger-than-1 thing:

summary(corr.1)
Importance of components:
                          Comp.1    Comp.2     Comp.3     Comp.4    Comp.5
Standard deviation     1.7185460 1.1686447 0.70207741 0.36584870 0.2326177
Proportion of Variance 0.5906801 0.2731461 0.09858254 0.02676905 0.0108222
Cumulative Proportion  0.5906801 0.8638262 0.96240875 0.98917780 1.0000000

Only the first two eigenvalues are bigger than 1, and the third is quite a bit smaller. So this would suggest two factors also. The third eigenvalue is in that kind of nebulous zone between between being big and being small, and the percent of variance explained is also ambiguous: is 86% good enough or should I go for 96%? These questions rarely have good answers. But an issue is that you want to summarize your variables with a (much) smaller number of factors; with 5 variables, having two factors rather than more than two seems to be a way of gaining some insight.

\(\blacksquare\)

  1. Take a look at the first two component loadings. Which variables appear to feature in which component? Do they have a positive or negative effect?

Solution

corr.1$loadings

Loadings:
   Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
X1  0.404  0.571  0.287  0.363  0.545
X2  0.482  0.439  0.144 -0.363 -0.650
X3 -0.501  0.122  0.652  0.412 -0.375
X4  0.490 -0.390 -0.171  0.689 -0.323
X5 -0.338  0.561 -0.665  0.304 -0.189

               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
SS loadings       1.0    1.0    1.0    1.0    1.0
Proportion Var    0.2    0.2    0.2    0.2    0.2
Cumulative Var    0.2    0.4    0.6    0.8    1.0

This is messy.

In the first component, the loadings are all about the same in size, but the ones for X3 and X5 are negative and the rest are positive. Thus, component 1 is contrasting X3 and X5 with the others.

In the second component, the emphasis is on X1, X2 and X5, all with negative loadings, and possibly X4, with a positive loading.

Note that the first component is basically “size”, since the component loadings are all almost equal in absolute value. This often happens in principal components; for example, it also happened with the running records in class.

I hope the factor analysis, with its rotation, will straighten things out some.

\(\blacksquare\)

  1. Create a “covariance list” (for the purposes of performing a factor analysis on the correlation matrix).

Solution

This is about the most direct way:

corr.list <- list(cov = corr.mat, n.obs = 50)

recalling that there were 50 observations. The idea is that we feed this into factanal instead of the correlation matrix, so that the factor analysis knows how many individuals there were (for testing and such).

Note that you need the correlation matrix as a matrix, not as a data frame. If you ran the princomp all in one step, you’ll have to create the correlation matrix again, for example like this:

corr.list2 <- list(cov = as.matrix(corr), n.obs = 50)

The actual list looks like this:

corr.list
$cov
        X1    X2   X3    X4    X5
[1,]  1.00  0.90 -0.4  0.28 -0.05
[2,]  0.90  1.00 -0.6  0.43 -0.20
[3,] -0.40 -0.60  1.0 -0.80  0.40
[4,]  0.28  0.43 -0.8  1.00 -0.70
[5,] -0.05 -0.20  0.4 -0.70  1.00

$n.obs
[1] 50

An R list is a collection of things not all of the same type, here a matrix and a number, and is a handy way of keeping a bunch of connected things together. You use the same dollar-sign notation as for a data frame to access the things in a list:

corr.list$n.obs
[1] 50

and logically this is because, to R, a data frame is a special kind of a list, so anything that works for a list also works for a data frame, plus some extra things besides.5

The same idea applies to extracting things from the output of a regression (with lm) or something like a t.test: the output from those is also a list. But for those, I like tidy from broom better.

\(\blacksquare\)

  1. Carry out a factor analysis with two factors. We’ll investigate the bits of it in a moment.

Solution

corr.2 <- factanal(factors = 2, covmat = corr.list)

\(\blacksquare\)

  1. * Look at the factor loadings. Describe how the factors are related to the original variables. Is the interpretation clearer than for the principal components analysis?

Solution

corr.2$loadings

Loadings:
     Factor1 Factor2
[1,]          0.905 
[2,] -0.241   0.968 
[3,]  0.728  -0.437 
[4,] -0.977   0.201 
[5,]  0.709         

               Factor1 Factor2
SS loadings      2.056   1.987
Proportion Var   0.411   0.397
Cumulative Var   0.411   0.809

Oh yes, this is a lot clearer. Factor 1 is variables 3 and 5 contrasted with variable 4; factor 2 is variables 1 and 2. No hand-waving required.

Perhaps now is a good time to look back at the correlation matrix and see why the factor analysis came out this way:

corr

Variable X1 is highly correlated with X2 but not really any of the others. Likewise variables X3, X4, X5 are more or less highly correlated among themselves but not with the others (X2 and X3 being an exception, but the big picture is as I described). So variables that appear in the same factor should be highly correlated with each other and not with variables that are in different factors. But it took the factor analysis to really show this up.

\(\blacksquare\)

  1. Look at the uniquenesses. Are there any that are unusually high? Does that surprise you, given your answer to (here)? (You will probably have to make a judgement call here.)

Solution

corr.2$uniquenesses
       X1        X2        X3        X4        X5 
0.1715682 0.0050000 0.2789445 0.0050000 0.4961028 

The ones for X3 and X5 are higher than the rest; this is because their loadings on factor 1 are lower than the rest. Since those loadings are still high, I wouldn’t be worried about the uniquenesses.

The point here is that an (excessively) high uniqueness indicates a variable that doesn’t appear in any factor. The easy link to make is “all the variables appear in a factor, so there shouldn’t be any very high uniquenesses”. If, say, X3 doesn’t have a high loading on any factor, X3 would have a high uniqueness (like 0.9, and none of these values approach that).

\(\blacksquare\)

41.6 Air pollution

The data in link are measurements of air-pollution variables recorded at 12 noon on 42 different days at a location in Los Angeles. The file is in .csv format, since it came from a spreadsheet. Specifically, the variables (in suitable units), in the same order as in the data file, are:

  • wind speed

  • solar radiation

  • carbon monoxide

  • Nitric oxide (also known as nitrogen monoxide)

  • Nitrogen dioxide

  • Ozone

  • Hydrocarbons

The aim is to describe pollution using fewer than these seven variables.

  1. Read in the data and demonstrate that you have the right number of rows and columns in your data frame.

Solution

This is a .csv file, so:

my_url <- "http://ritsokiguess.site/datafiles/airpollution.csv"
air <- read_csv(my_url)
Rows: 42 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (7): wind, solar.radiation, CO, NO, NO2, O3, HC

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
air

There should be 42 rows (for the 42 days), and 7 columns (for the 7 variables), and there are.

\(\blacksquare\)

  1. * Obtain a five-number summary for each variable. You can do this in one go for all seven variables.

Solution

Like this (the cleanest):

air %>% 
  summarize(across(everything(), \(x) quantile(x)))
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
  always returns an ungrouped data frame and adjust accordingly.

I have to figure out how to identify which number from the five number summary each of these is, but in this case you can easily figure it out since the min is the smallest and the max has to be the biggest in each column.

Or, with some more work, this:

air %>%
  pivot_longer(everything(), names_to="xname", values_to="x") %>% 
  nest_by(xname) %>%
  rowwise() %>% 
  mutate(q = list(enframe(quantile(data$x)))) %>%
  unnest(q) %>%
  pivot_wider(names_from=name, values_from=value) %>% 
  select(-data)

There’s a lot here. Run it one line at a time to see what it does:

  • put the names of the variables in one column and the values in a second. This is the same trick as when we want to make plots of all the variables facetted.

  • the nest_by says: for each variable (whose names are now in xname), make a dataframe called data of the observations (in x) for that variable

  • the rest of the way, work one row at a time

  • work out the five-number summary for each variable, using the values x in the data frame data of each row of the list-column, one at a time. This is the base R quantile, working on a vector (the column x of the data frame data), so it gives you back a named vector. If you are not familiar with that, try running quantile(1:10) and see how the output has both the percentiles and, above them, the percents that they go with. The tidyverse doesn’t like names, so my favourite way of keeping them with a named vector is to run it through enframe. This makes a two-column dataframe, with a column called name that is in this case the percents, and a column called value that is the percentiles. This is a dataframe rather than a single number, so it needs a list on the front as well (to make another list-column). There are rather a lot of brackets to close here; if you are not sure you have enough, type another close bracket, pause, and see what it matches (R Studio will show you). If it matches nothing, you have too many close brackets.

  • show the values of the five-number summary for each variable (in long format, but with the percentages attached)

  • for human consumption, put the percentiles in columns, one row for each variable

  • finally, get rid of the dataframes of original values (that we don’t need any more now that we have summarized them).

Extra: say you wanted to make facetted histograms of each variable. You would begin the same way, with the pivot_longer, and at the end, facet_wrap with scales = "free" (since the variables are measured on different scales):

air %>%
  pivot_longer(everything(), names_to="xname", values_to="x") %>% 
  ggplot(aes(x=x)) + geom_histogram(bins = 6) +
  facet_wrap(~xname, scales = "free")

Extra extra: I originally put a pipe symbol on the end of the line with the geom_histogram on it, and got an impenetrable error. However, googling the error message (often a good plan) gave me a first hit that told me exactly what I had done.

\(\blacksquare\)

  1. Obtain a principal components analysis. Do it on the correlation matrix, since the variables are measured on different scales. You don’t need to look at the results yet.

Solution

This too is all rather like the previous question:

air.1 <- princomp(air, cor = T)

\(\blacksquare\)

  1. Obtain a scree plot. How many principal components might be worth looking at? Explain briefly. (There might be more than one possibility. If so, discuss them all.)

Solution

ggscreeplot the thing you just obtained, having loaded package ggbiplot:

ggscreeplot(air.1)

There is a technicality here, which is that ggbiplot, the package, loads plyr, which contains a lot of the same things as dplyr (the latter is a cut-down version of the former). If you load dplyr and then plyr (that is to say, if you load the tidyverse first and then ggbiplot), you will end up with trouble, and probably the wrong version of a lot of functions. To avoid this, load ggbiplot first, and then you’ll be OK.

Now, finally, we might diverge from the previous question. There are actually two elbows on this plot, at 2 and at 4, which means that we should entertain the idea of either 1 or 3 components. I would be inclined to say that the elbow at 2 is still “too high up” the mountain — there is still some more mountain below it.

The points at 3 and 6 components look like elbows too, but they are pointing the wrong way. What you are looking for when you search for elbows are points that are the end of the mountain and the start of the scree. The elbows at 2 (maybe) and 4 (definitely) are this kind of thing, but the elbows at 3 and at 6 are not.

\(\blacksquare\)

  1. Look at the summary of the principal components object. What light does this shed on the choice of number of components? Explain briefly.

Solution

summary(air.1)
Importance of components:
                          Comp.1    Comp.2    Comp.3    Comp.4     Comp.5
Standard deviation     1.5286539 1.1772853 1.0972994 0.8526937 0.80837896
Proportion of Variance 0.3338261 0.1980001 0.1720094 0.1038695 0.09335379
Cumulative Proportion  0.3338261 0.5318262 0.7038356 0.8077051 0.90105889
                           Comp.6     Comp.7
Standard deviation     0.73259047 0.39484041
Proportion of Variance 0.07666983 0.02227128
Cumulative Proportion  0.97772872 1.00000000

The first component only explains 33% of the variability, not very much, but the first three components together explain 70%, which is much more satisfactory. So I would go with 3 components.

There are two things here: finding an elbow, and explaining a sensible fraction of the variability. You could explain more of the variability by taking more components, but if you are not careful you end up explaining seven variables with, um, seven variables.

If you go back and look at the scree plot, you’ll see that the first elbow is really rather high up the mountain, and it’s really the second elbow that is the start of the scree.

If this part doesn’t persuade you that three components is better than one, you need to pick a number of components to use for the rest of the question, and stick to it all the way through.

\(\blacksquare\)

  1. * How do each of your preferred number of components depend on the variables that were measured? Explain briefly.

Solution

When this was a hand-in question, there were three marks for it, which was a bit of a giveaway! Off we go:

air.1$loadings

Loadings:
                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
wind             0.237  0.278  0.643  0.173  0.561  0.224  0.241
solar.radiation -0.206 -0.527  0.224  0.778 -0.156              
CO              -0.551        -0.114         0.573  0.110 -0.585
NO              -0.378  0.435 -0.407  0.291         0.450  0.461
NO2             -0.498  0.200  0.197               -0.745  0.338
O3              -0.325 -0.567  0.160 -0.508         0.331  0.417
HC              -0.319  0.308  0.541 -0.143 -0.566  0.266 -0.314

               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000
Proportion Var  0.143  0.143  0.143  0.143  0.143  0.143  0.143
Cumulative Var  0.143  0.286  0.429  0.571  0.714  0.857  1.000

You’ll have to decide where to draw the line between “zero” and “nonzero”. It doesn’t matter so much where you put the line, so your answers can differ from mine and still be correct.

We need to pick the loadings that are “nonzero”, however we define that, for example:

  • component 1 depends (negatively) on carbon monoxide and nitrogen dioxide.

  • component 2 depends (negatively) on solar radiation and ozone and possibly positively on nitric oxide.

  • component 3 depends (positively) on wind and hydrocarbons.

It is a good idea to translate the variable names (which are abbreviated) back into the long forms.

\(\blacksquare\)

  1. Make a data frame that contains (i) the original data, (ii) a column of row numbers, (iii) the principal component scores. Display some of it.

Solution

All the columns contain numbers, so cbind will do it. (The component scores are seven columns, so bind_cols won’t do it unless you are careful.):

cbind(air, air.1$scores) %>%  
  mutate(row = row_number()) -> d
head(d)

This is probably the easiest way, but you see that there is a mixture of base R and Tidyverse. The result is actually a base R data.frame, so displaying it will display all of it, hence my use of head. If you want to do it the all-Tidyverse way6 then you need to bear in mind that bind_cols only accepts vectors or data frames, not matrices, so a bit of care is needed first:

air.1$scores %>%
  as_tibble() %>%
  bind_cols(air) %>%
  mutate(row = row_number()) -> dd
dd

I think the best way to think about this is to start with what is farthest from being a data frame or a vector (the matrix of principal component scores, here), bash that into shape first, and then glue the rest of the things to it.

Note that we used all Tidyverse stuff here, so the result is a tibble, and displaying it for me displays the first ten rows as you’d expect.

\(\blacksquare\)

  1. Display the row of your new data frame for the observation with the smallest (most negative) score on component 1. Which row is this? What makes this observation have the most negative score on component 1?

Solution

I think the best strategy is to sort by component 1 score (in the default ascending order), and then display the first row:

d %>% arrange(Comp.1) %>% slice(1)

It’s row 8.

We said earlier that component 1 depends negatively on carbon monoxide and nitrogen dioxide, so that an observation that is low on component 1 should be high on these things.7

So are these values high or low? That was the reason for having you make the five-number summary here. For observation 8, CO is 6 and NO2 is 21; looking back at the five-number summary, the value of CO is above Q3, and the value of NO2 is the highest of all. So this is entirely what we’d expect.

\(\blacksquare\)

  1. Which observation has the lowest (most negative) value on component 2? Which variables ought to be high or low for this observation? Are they? Explain briefly.

Solution

This is a repeat of the ideas we just saw:

d %>% arrange(Comp.2) %>% slice(1)

and for convenience, we’ll grab the quantiles again:

air %>% 
  summarize(across(everything(), \(x) quantile(x)))
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
  always returns an ungrouped data frame and adjust accordingly.

Day 34 (at the end of the line). We said that component 2 depends (negatively) on solar radiation and ozone and possibly positively on nitric oxide. This means that day 34 ought to be high on the first two and low on the last one (since it’s at the low end of component 2). Solar radiation is, surprisingly, close to the median (75), but ozone, 24, is very near the highest, and nitric oxide, 1, is one of a large number of values equal to the lowest. So day 34 is pointing the right way, even if its variable values are not quite what you’d expect. This business about figuring out whether values on variables are high or low is kind of fiddly, since you have to refer back to the five-number summary to see where the values for a particular observation come. Another way to approach this is to calculate percentile ranks for everything. Let’s go back to our original data frame and replace everything with its percent rank:

air %>% mutate(across(everything(), \(x) percent_rank(x))) -> pct_rank
pct_rank

Observation 34 is row 34 of this:

pct_rank %>% slice(34)

Very high on ozone, (joint) lowest on nitric oxide, but middling on solar radiation. The one we looked at before, observation 8, is this:

pct_rank %>% slice(8)

High on carbon monoxide, the highest on nitrogen dioxide.

\(\blacksquare\)

  1. Obtain a biplot, with the row numbers labelled, and explain briefly how your conclusions from the previous two parts are consistent with it.

Solution

ggbiplot(air.1, labels = d$row)

Day 8 is way over on the left. The things that point in the direction of observation 8 (NO2, CO and to a lesser extent NO and HC) are the things that observation 8 is high on. On the other hand, observation 8 is around the middle of the arrows for wind, solar.radiation and O3, so that day is not especially remarkable for those.

Observation 34 is nearest the bottom, so we’d expect it to be high on ozone (yes), high on solar radiation (no), low on nitric oxide (since that points the most upward, yes) and also maybe low on wind, since observation 34 is at the “back end” of that arrow. Wind is 6, which is at the first quartile, low indeed.

The other thing that you see from the biplot is that there are four variables pointing more or less up and to the left, and at right angles to them, three other variables pointing up-and-right or down-and-left. You could imagine rotating those arrows so that the group of 4 point upwards, and the other three point left and right. This is what factor analysis does, so you might imagine that this technique might give a clearer picture of which variables belong in which factor than principal components does. Hence what follows.

\(\blacksquare\)

  1. Run a factor analysis on the same data, obtaining two factors. Look at the factor loadings. Is it clearer which variables belong to which factor, compared to the principal components analysis? Explain briefly.

Solution

air.2 <- factanal(air, 2, scores = "r")
air.2$loadings

Loadings:
                Factor1 Factor2
wind            -0.176  -0.249 
solar.radiation          0.319 
CO               0.797   0.391 
NO               0.692  -0.152 
NO2              0.602   0.152 
O3                       0.997 
HC               0.251   0.147 

               Factor1 Factor2
SS loadings      1.573   1.379
Proportion Var   0.225   0.197
Cumulative Var   0.225   0.422

I got the factor scores since I’m going to look at a biplot shortly. If you aren’t, you don’t need them.

Factor 1 is rather more clearly carbon monoxide, nitric oxide and nitrogen dioxide. Factor 2 is mostly ozone, with a bit of solar radiation and carbon monoxide. I’d say this is clearer than before.

A biplot would tell us whether the variables are better aligned with the axes now:

biplot(air.2$scores, air.2$loadings)

At least somewhat. Ozone points straight up, since it is the dominant part of factor 2 and not part of factor 1 at all. Carbon monoxide and the two oxides of nitrogen point to the right.

Extra: wind, solar.radiation and HC don’t appear in either of our factors, which also shows up here:

air.2$uniquenesses
           wind solar.radiation              CO              NO             NO2 
      0.9070224       0.8953343       0.2126417       0.4983564       0.6144170 
             O3              HC 
      0.0050000       0.9152467 

Those variables all have high uniquenesses.

What with the high uniquenesses, and the fact that two factors explain only 42% of the variability, we really ought to look at 3 factors, the same way that we said we should look at 3 components:

air.3 <- factanal(air, 3)
air.3$loadings

Loadings:
                Factor1 Factor2 Factor3
wind                    -0.210  -0.334 
solar.radiation          0.318         
CO               0.487   0.318   0.507 
NO               0.238  -0.269   0.931 
NO2              0.989                 
O3                       0.987   0.124 
HC               0.427   0.103   0.172 

               Factor1 Factor2 Factor3
SS loadings      1.472   1.312   1.288
Proportion Var   0.210   0.187   0.184
Cumulative Var   0.210   0.398   0.582

In case you are wondering, factanal automatically uses the correlation matrix, and so takes care of variables measured on different scales without our having to worry about that.

The rotation has only helped somewhat here. Factor 1 is mainly NO2 with some influence of CO and HC; factor 2 is mainly ozone (with a bit of solar radiation and carbon monoxide), and factor 3 is mainly NO with a bit of CO.

I think I mentioned most of the variables in there, so the uniquenesses should not be too bad:

air.3$uniquenesses
           wind solar.radiation              CO              NO             NO2 
      0.8404417       0.8905074       0.4046425       0.0050000       0.0050000 
             O3              HC 
      0.0050000       0.7776557 

Well, not great: wind and solar.radiation still have high uniquenesses because they are not strongly part of any factors.

If you wanted to, you could obtain the factor scores for the 3-factor solution, and plot them on a three-dimensional plot using rgl, rotating them to see the structure. A three dimensional “biplot”8 would also be a cool thing to look at.

\(\blacksquare\)


  1. There are two types of scores here: a person’s scores on the psychological test, 1 through 5, and their factor scores, which are decimal numbers centred at zero. Try not to get these confused.↩︎

  2. I think the last four items in the entire survey are different; otherwise the total number of items would be a multiple of 5.↩︎

  3. Not to be confused with covfefe, which was current news when I wrote this question.↩︎

  4. There is also kind of an elbow at 4, which would suggest three factors, but that’s really too many with only 5 variables. That wouldn’t be much of a reduction in the number of variables, which is what principal components is trying to achieve.↩︎

  5. In computer science terms, a data frame is said to inherit from a list: it is a list plus extra stuff.↩︎

  6. There really ought to be a radio station CTDY: All Tidyverse, All The Time.↩︎

  7. You might have said that component 1 depended on other things as well, in which case you ought to consider whether observation 8 is, as appropriate, high or low on these as well.↩︎

  8. A three-dimensional biplot ought to be called a triplot.↩︎