Principal components

Have measurements on a (possibly large) number of variables on some individuals.
Question: can we describe data using fewer variables (because original variables correlated in some way)?
Look for direction (linear combination of original variables) in which values most spread out. This is first principal component.
Second principal component then direction uncorrelated with this in which values then most spread out. And so on.
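
A minimal sketch of "direction in which values most spread out", using simulated (made-up) data rather than any dataset from these notes. It tries unit-length directions in 1-degree steps and picks the one whose linear combination has the largest variance. For two standardized variables, that largest variance should agree with the leading eigenvalue of the correlation matrix.

```r
# Simulated data (an assumption for illustration): two positively correlated variables
set.seed(1)
n <- 200
first <- rnorm(n)
second <- first + rnorm(n, sd = 0.3)
z <- scale(cbind(first, second))  # standardize each column

# Variance of the linear combination cos(a) * first + sin(a) * second
angles <- seq(0, pi, length.out = 181)  # 0 to 180 degrees in 1-degree steps
proj_var <- sapply(angles, function(a) var(drop(z %*% c(cos(a), sin(a)))))

best_angle <- angles[which.max(proj_var)] * 180 / pi
best_angle               # direction with the most spread, in degrees
max(proj_var)            # variance along that direction
eigen(cor(z))$values[1]  # leading eigenvalue of the correlation matrix
```

The second component would then be the direction at right angles to this one, which is the only uncorrelated direction left when there are two variables.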

Principal components

See whether small number of principal components captures most of variation in data.
Might try to interpret principal components.
If 2 components good, can make plot of data.
(Like discriminant analysis, but for individuals rather than groups.)
“What are important ways that these data vary?”

Packages

You might not have installed the first of these. See the next section for instructions.

library(ggbiplot) 
library(tidyverse)
library(ggrepel)
library(conflicted)
conflicts_prefer(dplyr::mutate)

Installing `ggbiplot`

ggbiplot is not on CRAN, so the usual install.packages will not work. This is the same procedure you used for smmr in C32:
Install package devtools first (once):

install.packages("devtools")

Then install ggbiplot (once):

library(devtools)
install_github("vqv/ggbiplot")

Small example: 2 test scores for 8 people

my_url <- "http://ritsokiguess.site/datafiles/test12.txt"
test12 <- read_table(my_url)
test12

A plot

ggplot(test12, aes(x = first, y = second, label = id)) +
  geom_point() + geom_text_repel() +
  geom_smooth(method = "lm", se = FALSE)

Principal component analysis

Grab just the numeric columns:

test12 %>% select(where(is.numeric)) -> test12_numbers

Strongly correlated, so data nearly 1-dimensional:

cor(test12_numbers)

          first   second
first  1.000000 0.989078
second 0.989078 1.000000

Finding principal components

Make a score summarizing this one dimension. Like this:

test12.pc <- princomp(test12_numbers, cor = TRUE)
summary(test12.pc)

Importance of components:
                         Comp.1      Comp.2
Standard deviation     1.410347 0.104508582
Proportion of Variance 0.994539 0.005461022
Cumulative Proportion  0.994539 1.000000000
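
With cor = TRUE, princomp works from the correlation matrix, so the summary above can be checked by hand from the single correlation 0.989078 shown earlier. A small sketch, where the variable names r, R and ev are just for illustration:

```r
r <- 0.989078                         # correlation between first and second, from cor() above
R <- matrix(c(1, r, r, 1), nrow = 2)  # the 2 x 2 correlation matrix
ev <- eigen(R)$values                 # variances of the two components

sqrt(ev)              # compare with the "Standard deviation" row
ev / sum(ev)          # compare with the "Proportion of Variance" row
cumsum(ev) / sum(ev)  # compare with the "Cumulative Proportion" row
```

For two standardized variables the eigenvalues are 1 + r and 1 - r, which is why a correlation near 1 leaves almost nothing for the second component.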

Comments

“Standard deviation” shows relative importance of components (as for LDs in discriminant analysis)
Here, first one explains almost all (99.5%) of variability.
That is, look only at first component and ignore second.
cor=TRUE standardizes all variables first. Usually wanted, because variables measured on different scales. (Only omit if variables measured on same scale and expect similar variability.)
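
To see why standardizing usually matters, here is a sketch with two made-up, unrelated variables on very different scales (the names height_cm and income are hypothetical, not from any data in these notes). Without cor = TRUE, the variable with the bigger numbers should dominate the first component simply because of its units.

```r
# Simulated data (an assumption for illustration): unrelated variables, very different scales
set.seed(2)
height_cm <- rnorm(100, mean = 170, sd = 10)
income <- rnorm(100, mean = 50000, sd = 15000)
x <- cbind(height_cm, income)

pc_cov <- princomp(x)              # covariance matrix: original units
pc_cor <- princomp(x, cor = TRUE)  # correlation matrix: standardized

# Proportion of variance explained by each component
prop_cov <- pc_cov$sdev^2 / sum(pc_cov$sdev^2)
prop_cor <- pc_cor$sdev^2 / sum(pc_cor$sdev^2)
prop_cov  # first component is essentially just income
prop_cor  # roughly even split, as expected for unrelated variables
```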

Scree plot

ggscreeplot(test12.pc)

Imagine the scree plot continuing at zero beyond the second component, so there is a big elbow at 2 components (take one component).
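
A scree plot can also be drawn by hand, which shows what is being plotted. This sketch assumes that ggscreeplot displays the proportion of variance explained by each component (which I believe is its default), and it uses R's built-in USArrests data, not a dataset from these notes, so that there are more than two components to look at.

```r
# Built-in data set with four numeric variables (an assumption for illustration)
pc <- princomp(USArrests, cor = TRUE)

# Proportion of variance explained by each component
prop <- pc$sdev^2 / sum(pc$sdev^2)
prop

# Base-graphics scree plot: look for the elbow
plot(seq_along(prop), prop, type = "b",
     xlab = "Principal component number",
     ylab = "Proportion of explained variance")
```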

Component loadings

These explain how each principal component depends on the (standardized) original variables (test scores):

test12.pc$loadings


Loadings:
       Comp.1 Comp.2
first   0.707  0.707
second  0.707 -0.707

               Comp.1 Comp.2
SS loadings       1.0    1.0
Proportion Var    0.5    0.5
Cumulative Var    0.5    1.0

First component basically sum of (standardized) test scores. That is, person tends to score similarly on two tests, and a composite score would summarize performance.

Component scores

d <- data.frame(test12, test12.pc$scores)
d

Person A is a low scorer, very negative comp.1 score.
Person D is high scorer, high positive comp.1 score.
Person E average scorer, near-zero comp.1 score.
comp.2 says basically nothing.
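
The scores are the standardized data multiplied by the loadings. A sketch of that calculation on made-up test scores for 8 people (these numbers are invented for illustration and are not the test12 data). It uses the center and scale components that princomp stores, because princomp standardizes using a divisor of n rather than n - 1.

```r
# Made-up scores for 8 people (an assumption for illustration)
x <- cbind(
  first  = c(2, 16, 8, 18, 10, 4, 10, 12),
  second = c(9, 40, 17, 43, 25, 10, 27, 30)
)
pc <- princomp(x, cor = TRUE)

# Standardize exactly as princomp did, then multiply by the loadings
z <- scale(x, center = pc$center, scale = pc$scale)
L <- unclass(pc$loadings)
manual_scores <- z %*% L

all.equal(unname(manual_scores), unname(pc$scores))  # should be TRUE
abs(L[, 1])  # both weights equal in size: Comp.1 is (up to sign) a sum of the z-scores
```

The signs of a whole column of loadings are arbitrary, so a different machine might flip a component; the interpretation flips with it.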

Plot of scores

ggplot(d, aes(x = Comp.1, y = Comp.2, label = id)) +
  geom_point() + geom_text_repel()

Comments

Vertical scale exaggerates importance of comp.2.
Fix up to get axes on same scale:

ggplot(d, aes(x = Comp.1, y = Comp.2, label = id)) +
  geom_point() + geom_text_repel() +
  coord_fixed() -> g

Shows how exam scores really spread out along one dimension:

The biplot

Plotting variables and individuals on one plot.
Shows how components and original variables related.
Shows how individuals score on each component, and therefore suggests how they score on each variable.
Add labels option to identify individuals:

g <- ggbiplot(test12.pc, labels = test12$id)

Comments

Variables point almost same direction (right). Thus very positive value on comp.1 goes with high scores on both tests, and test scores highly correlated.
Position of individuals on plot is according to scores on principal components, which implies values on original variables. E.g.:
D very positive on comp.1, high scorer on both tests.
A and F very negative on comp.1, poor scorers on both tests.
C positive on comp.2, high score on first test relative to second.
A negative on comp.2, high score on second test relative to first.

Places rated

Every year, a new edition of the Places Rated Almanac is produced. This rates a large number (in our data 329) of American cities on a number of different criteria, to help people find the ideal place for them to live (based on what are important criteria for them).

The data for one year are in http://ritsokiguess.site/datafiles/places.txt. The data columns are aligned but the column headings are not.

The criteria

There are nine of them:

climate: a higher value means that the weather is better
housing: a higher value means that there is more good housing or a greater choice of different types of housing
health: higher means better healthcare facilities
crime: higher means more crime (bad)
trans: higher means better transportation (this being the US, probably more roads)
educate: higher means better educational facilities, schools, colleges etc.
arts: higher means better access to the arts (theatre, music etc)
recreate: higher means better access to recreational facilities
econ: higher means a better economy (more jobs, spending power etc)

Each city also has a numbered id.

Read in the data

my_url <- "http://ritsokiguess.site/datafiles/places.txt"
places0 <- read_table(my_url)

Look at distributions of everything

places0 %>% 
  pivot_longer(-id, names_to = "criterion", 
               values_to = "rating") %>% 
  ggplot(aes(x = rating)) + geom_histogram(bins = 10) + 
  facet_wrap(~criterion, scales = "free") -> g

The histograms

Transformations

Several of these variables have long right tails.
Take logs of everything but id:

places0 %>% 
  mutate(across(-id, \(x) log(x))) -> places
places
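
A sketch of why logs help with long right tails, on simulated (made-up) ratings rather than the places data. The helper skew below is a hypothetical name for a simple moment-based skewness measure; values well above zero indicate a long right tail, and values near zero indicate rough symmetry.

```r
# Simple skewness measure (hypothetical helper, not from any package)
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Simulated right-skewed ratings (an assumption for illustration)
set.seed(3)
rating <- rlnorm(500, meanlog = 8, sdlog = 1)

skew(rating)       # clearly positive: long right tail
skew(log(rating))  # close to zero after taking logs
```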

Just the numerical columns

Get rid of the id column:

places %>% select(-id) -> places_numeric