Chapter 23 Vector and matrix algebra

23.1 Heights and foot lengths again

Earlier, we investigated some data on predicting the height of a person from the length of their foot. The data were in http://ritsokiguess.site/datafiles/heightfoot.csv.

  1. Read in and display (some of) the data.

  2. In your regression course, you learned (or will learn) the matrix formulation of the least squares estimates of intercept and slope. This produces a vector \(\hat\beta\) containing estimates of both the intercept and the slope, from the formula

\[ \hat\beta = (X^T X)^{-1} X^T y, \]

where:

  • \(X\) is a matrix containing a column of 1s followed by all the columns of explanatory variables
  • \(X^T\) denotes the (matrix) transpose of \(X\)
  • \(M^{-1}\) denotes the inverse of the matrix \(M\)
  • \(y\) denotes the column of response variable values.

Use the formula above to obtain the least squares estimates of the intercept and slope for this regression, using R’s vector-matrix algebra. Hint: you are advised to do the calculation in steps, or else it will be very hard to read, and hard for the grader to check that it is correct.

  3. Verify that your calculation is correct by running the regression.

My solutions follow:

23.2 Heights and foot lengths again

Earlier, we investigated some data on predicting the height of a person from the length of their foot. The data were in http://ritsokiguess.site/datafiles/heightfoot.csv.

  1. Read in and display (some of) the data.

Solution

Copy what you did before:
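
Something like this (I’m calling the dataframe hf, to match the data = hf in the lm output later):

```r
library(tidyverse)  # for read_csv and the pipe, if not already loaded
# read the data directly from the URL and display it
hf <- read_csv("http://ritsokiguess.site/datafiles/heightfoot.csv")
hf
```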

## 
## ── Column specification ──────────────────────────────────────────────────
## cols(
##   height = col_double(),
##   foot = col_double()
## )
## # A tibble: 33 x 2
##    height  foot
##     <dbl> <dbl>
##  1   66.5  27  
##  2   73.5  29  
##  3   70    25.5
##  4   71    27.9
##  5   73    27  
##  6   71    26  
##  7   71    29  
##  8   69.5  27  
##  9   73    29  
## 10   71    27  
## # … with 23 more rows

\(\blacksquare\)

  2. In your regression course, you learned (or will learn) the matrix formulation of the least squares estimates of intercept and slope. This produces a vector \(\hat\beta\) containing estimates of both the intercept and the slope, from the formula

\[ \hat\beta = (X^T X)^{-1} X^T y, \]

where:

  • \(X\) is a matrix containing a column of 1s followed by all the columns of explanatory variables
  • \(X^T\) denotes the (matrix) transpose of \(X\)
  • \(M^{-1}\) denotes the inverse of the matrix \(M\)
  • \(y\) denotes the column of response variable values.

Use the formula above to obtain the least squares estimates of the intercept and slope for this regression, using R’s vector-matrix algebra. Hint: you are advised to do the calculation in steps, or else it will be very hard to read, and hard for the grader to check that it is correct.

Solution

There is some setup first: we have to get hold of \(X\) and \(y\) from the data as a matrix and a vector respectively. I would use tidyverse ideas to do this, turning the result into a matrix at the end, which I think is best. Don’t forget to create a column of 1s to make the first column of \(X\).
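
A sketch of that approach (the column name one and the matrix name X are my choices):

```r
# add a column of 1s, keep it along with the explanatory variable,
# then turn the result into a matrix
X <- hf %>%
  mutate(one = 1) %>%
  select(one, foot) %>%
  as.matrix()
head(X)
```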

##      one foot
## [1,]   1 27.0
## [2,]   1 29.0
## [3,]   1 25.5
## [4,]   1 27.9
## [5,]   1 27.0
## [6,]   1 26.0

(head displays only the first six rows; otherwise you’d be displaying all 33, which is too many.)

Another approach is this:
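
using cbind from base R (see the discussion of recycling below):

```r
# cbind recycles the single 1 to the length of hf$foot
X <- cbind(1, hf$foot)
head(X)
```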

##      [,1] [,2]
## [1,]    1 27.0
## [2,]    1 29.0
## [3,]    1 25.5
## [4,]    1 27.9
## [5,]    1 27.0
## [6,]    1 26.0

Note that R’s recycling rules mean that the single value 1 is repeated to match the length of the other column, so this is better than working out how many observations there are and repeating 1 that many times.

The choice here is whether to use tidyverse tools and turn the result into a matrix at the end, or to make a matrix right at the start (which is what cbind from base R is doing). I don’t believe you’ve seen cbind in this course, so you ought to cite your source if you go that way.

The simplest choice for making \(y\) is this:
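
grab the column with a dollar sign, which gives a plain numeric vector:

```r
y <- hf$height
y
```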

##  [1] 66.5 73.5 70.0 71.0 73.0 71.0 71.0 69.5 73.0 71.0 69.0 69.0 73.0 75.0 73.0 72.0 69.0 68.0 72.5 78.0 79.0 71.0 74.0 66.0 71.0 71.0 71.0 84.0 77.0 72.0 70.0 76.0 68.0

This also works:
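
perhaps pull, which also extracts a single column as a vector:

```r
y <- hf %>% pull(height)
y
```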

##  [1] 66.5 73.5 70.0 71.0 73.0 71.0 71.0 69.5 73.0 71.0 69.0 69.0 73.0 75.0 73.0 72.0 69.0 68.0 72.5 78.0 79.0 71.0 74.0 66.0 71.0 71.0 71.0 84.0 77.0 72.0 70.0 76.0 68.0

(remembering that you don’t want to end up with a dataframe), or this:
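
(this one selects the column and then converts it to a matrix, which is why the column name height shows up below):

```r
# a one-column matrix, with the column name kept
y <- hf %>% select(height) %>% as.matrix()
head(y)
```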

##      height
## [1,]   66.5
## [2,]   73.5
## [3,]   70.0
## [4,]   71.0
## [5,]   73.0
## [6,]   71.0

remembering that a (column) vector and a one-column matrix are the same thing as far as R is concerned.

Now we want to construct the pieces of the formula. I would go at it in steps like this, rather than trying to do everything at once (if you do, you will either get lost now, or get lost in six months when you try to figure out what you did):
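
(A sketch: I’m using the no-names cbind version of \(X\) and the plain-vector version of \(y\), which is why the output below has no labels; the intermediate names are my choices.)

```r
X <- cbind(1, hf$foot)          # the version of X without column names
y <- hf$height                  # the plain-vector version of y
Xt <- t(X)                      # X-transpose
XtX <- Xt %*% X                 # X^T X
XtX_inv <- solve(XtX)           # its inverse
Xty <- Xt %*% y                 # X^T y
beta_hat <- XtX_inv %*% Xty     # the least squares estimates
beta_hat
```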

##           [,1]
## [1,] 34.336335
## [2,]  1.359062

The intercept is 34.33 and the slope is 1.36.

These compute, respectively, \(X^T\), \(X^T X\), the inverse of that, \(X^T y\) and \(\hat\beta\). Expect credit for laying out your calculation clearly.

Extra: the value of this formula is that it applies no matter whether you have one \(x\)-variable, as here (or in the windmill data), or whether you have a lot (as in the asphalt data). In either case, \(\hat\beta\) contains the estimate of the intercept followed by all the slope estimates, however many there are. There are also matrix formulas that tell you how the slopes or the residuals will change if you remove one observation or one explanatory variable, so that something like step can work very efficiently; the calculations for leverages work the same way.

\(\blacksquare\)

  3. Verify that your calculation is correct by running the regression.

Solution

The usual lm:
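
(I’m calling the fitted model object hf.1):

```r
hf.1 <- lm(height ~ foot, data = hf)
summary(hf.1)
```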

## 
## Call:
## lm(formula = height ~ foot, data = hf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7491 -1.3901 -0.0310  0.8918 12.9690 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  34.3363     9.9541   3.449 0.001640 ** 
## foot          1.3591     0.3581   3.795 0.000643 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.102 on 31 degrees of freedom
## Multiple R-squared:  0.3173, Adjusted R-squared:  0.2952 
## F-statistic: 14.41 on 1 and 31 DF,  p-value: 0.0006428

The same.

Extra: in this “well-conditioned” case (see the footnote at the end of this section), it makes no difference, but if \(X^T X\) is almost singular, so that it almost doesn’t have an inverse (for example, when some of your explanatory variables are highly correlated with each other), you can get into trouble. Regression calculations in practice use something more sophisticated, like the singular value decomposition of \(X^TX\), to diagnose whether \(X^TX\) is actually singular or nearly so (which from a numerical point of view is almost as bad), and to produce a sensible answer in that case.

I guess I should try to make up an example where the calculation struggles. Let me do one with two \(x\)’s that are strongly correlated:
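
One way to make such a dataset (the x2 values here are read off the matrix displayed further down; the original was presumably generated by adding tiny random perturbations to 30 minus x1):

```r
# x2 is 30 - x1 plus a tiny perturbation, so x1 and x2 are
# very nearly perfectly correlated
d <- tribble(
  ~x1, ~x2,     ~y,
   10, 20.0000, 55,
   11, 19.0001, 60,
   12, 17.9999, 61,
   13, 17.0001, 64,
   14, 15.9998, 66,
   15, 15.0001, 67
)
d
```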

## # A tibble: 6 x 3
##      x1    x2     y
##   <dbl> <dbl> <dbl>
## 1    10  20      55
## 2    11  19.0    60
## 3    12  18.0    61
## 4    13  17.0    64
## 5    14  16.0    66
## 6    15  15.0    67

x2 is almost exactly equal to 30 minus x1. What’s the right answer?
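
Fit the regression with both explanatory variables (the model name d.1 is mine), and show the coefficients by themselves as well:

```r
d.1 <- lm(y ~ x1 + x2, data = d)
summary(d.1)
coef(d.1)
```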

## 
## Call:
## lm(formula = y ~ x1 + x2, data = d)
## 
## Residuals:
##        1        2        3        4        5        6 
## -1.37530  1.26859  0.03118  0.63549  0.43765 -0.99760 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11837.3   138685.6  -0.085    0.937
## x1             398.0     4622.9   0.086    0.937
## x2             395.7     4622.8   0.086    0.937
## 
## Residual standard error: 1.303 on 3 degrees of freedom
## Multiple R-squared:  0.9485, Adjusted R-squared:  0.9141 
## F-statistic: 27.61 on 2 and 3 DF,  p-value: 0.0117
## (Intercept)          x1          x2 
## -11837.2938    398.0000    395.6835

You should be immediately suspicious here: the R-squared is high, but neither of the explanatory variables is significant! (This actually means that you can remove one of them, either one.) The standard errors are also suspiciously large, never a good sign. If you’ve done C67, you might be thinking about variance inflation factors here:
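
These presumably come from vif in package car:

```r
library(car)   # assuming the vif used here is car::vif
vif(d.1)
```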

##        x1        x2 
## 220326265 220326265

These are both huge (greater than 5 or 10 or whatever guideline you use), indicating that they are highly correlated with each other (as we know they are).

All right, how does the matrix algebra work? This is just the same as before:
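
Build \(X\) from d as in the tidyverse approach earlier (this time the column names survive, which is why they appear in the output):

```r
X <- d %>%
  mutate(one = 1) %>%
  select(one, x1, x2) %>%
  as.matrix()
X
```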

##      one x1      x2
## [1,]   1 10 20.0000
## [2,]   1 11 19.0001
## [3,]   1 12 17.9999
## [4,]   1 13 17.0001
## [5,]   1 14 15.9998
## [6,]   1 15 15.0001

and then
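
the same chain of steps as before, with \(y\) now coming from d:

```r
y <- d$y
Xt <- t(X)
XtX <- Xt %*% X
XtX_inv <- solve(XtX)
Xty <- Xt %*% y
beta_hat <- XtX_inv %*% Xty
beta_hat
```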

##            [,1]
## one -11837.2051
## x1     397.9970
## x2     395.6805

These answers are actually noticeably different from the right answers (with a few more decimals here):
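
They come from the lm fit, via coef:

```r
coef(d.1)
```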

## (Intercept)          x1          x2 
## -11837.2938    398.0000    395.6835

One way of finding out how close \(X^TX\) is to being singular is to look at its eigenvalues. You’ll recall that a singular matrix has one or more zero eigenvalues:
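
eigen does the work; XtX here is the \(X^TX\) I computed just above for this example:

```r
eigen(XtX)
```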

## eigen() decomposition
## $values
## [1] 2.781956e+03 3.404456e+01 8.805997e-11
## 
## $vectors
##             [,1]        [,2]        [,3]
## [1,] -0.04643281 -0.00782885  0.99889074
## [2,] -0.57893109 -0.81469635 -0.03329647
## [3,] -0.81405331  0.57983495 -0.03329628

The third eigenvalue is \(8.8 \times 10^{-11}\), which is very close to zero, especially compared to the other two, which are about 34 and nearly 2800. This is a very nearly singular matrix, and hence \((X^TX)^{-1}\) is very close to not existing at all; if it didn’t exist, you couldn’t even compute the intercept and slope estimates, never mind hope to get close to the right answer.

\(\blacksquare\)


Footnote: “well-conditioned” in this case means that the column of 1s and the actual \(x\)-values are not strongly correlated, which means that the \(x\)-values vary enough.