STAD29 Assignment 1

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

1 US education expenditures

What influences the amount that a US state spends per student on public school education? Some data were collected for the 50 states plus Washington, DC in 1970, and are available in http://ritsokiguess.site/datafiles/Anscombe.csv. The columns are:

  • state: abbreviated state name (text)
  • education: per-capita education expenditures, 1970 dollars.
  • income: per-capita income, 1970 dollars.
  • young: number of residents aged under 18, per 1000 population.
  • urban: number of urban residents, per 1000 population.

(a) (1 point) Read in and display (some of) the data.

As expected:

my_url <- "http://ritsokiguess.site/datafiles/Anscombe.csv"
Anscombe <- read_csv(my_url)
Rows: 51 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): state
dbl (4): education, income, young, urban

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Anscombe

You might have a bit of trouble deciding on a name for the dataframe, since education is one of its columns. I chose the name of the datafile, with a capital A.

Extra 1: This is the name of a statistician named Frank Anscombe, who wrote a 1981 book “Computing in Statistical Science Through APL”, from which these data come. About the only statistical software available in those days was SAS (which, as now, had a very expensive annual licence), so academic statisticians would often code things themselves in languages like FORTRAN. Anscombe wrote his book using a different language called APL1 The thing about APL is that it uses a whole bunch of different symbols, some of which are seen nowhere else, and the only way you could use APL back in 1981 was to buy a special keyboard that had those symbols on it. The idea was that the APL code for computing things like means and sums of squares (and hence slopes in regression) was very short, and so you could express a lot with a few (esoteric) symbols.

Extra 2: Anscombe is perhaps better known for the “Anscombe quartet”, four small data sets he made up, each containing an \(x\) and a \(y\), for which the mean and SD of all the \(x\)’s are the same, the mean and SD of all the \(y\)’s are the same, the correlations between each \(x\) and \(y\) are the same, and the intercepts and slopes of the regression lines for predicting \(y\) from \(x\) are all the same. So, you would think, the actual data values would be pretty much the same. This short video demonstrates that this is very much not the case. Anscombe’s purpose in devising these data sets (and publishing a paper about them) was to illustrate that it is not enough to do the calculations; you need to look at graphs of your data to decide whether those calculations are ones that you trust.

If you want to explore Anscombe’s quartet further, R has it as a built in dataset called anscombe (lowercase “a”!) with the four pairs of explanatory and response:

anscombe

from where you can compute means, SDs, correlations and regression lines, verify that all four of them are the same to a high degree of accuracy, and then look at the four scatterplots.

Extra 3: The Frank Anscombe Quartet sounds to me as if it should be the name of a 1970s-era jazz combo, presumably with a pianist, a bass player, a drummer and a jazz violinist, the sort of thing you would have found vinyl recordings of in the old Sam the Record Man on Yonge Street.

(b) (2 points) Fit a regression predicting education expenditure from the other three quantitative variables, and display the results.

This is straight C32 stuff so far:

Anscombe.1 <- lm(education ~ income + young + urban, data = Anscombe)
summary(Anscombe.1)

Call:
lm(formula = education ~ income + young + urban, data = Anscombe)

Residuals:
    Min      1Q  Median      3Q     Max 
-60.240 -15.738  -1.156  15.883  51.380 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.868e+02  6.492e+01  -4.418 5.82e-05 ***
income       8.065e-02  9.299e-03   8.674 2.56e-11 ***
young        8.173e-01  1.598e-01   5.115 5.69e-06 ***
urban       -1.058e-01  3.428e-02  -3.086  0.00339 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26.69 on 47 degrees of freedom
Multiple R-squared:  0.6896,    Adjusted R-squared:  0.6698 
F-statistic: 34.81 on 3 and 47 DF,  p-value: 5.337e-12

Use tidy and glance from broom if you prefer.

(c) (2 points) Why would you not consider removing any of the explanatory variables from the model you just fitted? Explain briefly.

All three of their P-values are (strongly) significant, and therefore removing any of them would be a mistake (for example, would cause the R-squared to drop substantially). That’s all you need.

(d) (4 points) Consider an imaginary state A with income 2800, young 370, and urban 790. Consider also a second imaginary state B with income 3200, young 350, and urban 650. For each of these two sets of explanatory variable values, predict the mean value of education for all states with these same explanatory variable values, and obtain a 95% confidence interval in each case. In your output, display only the predicted mean response, its lower and upper confidence limits, and the values of the explanatory variables.

This is prediction of a mean response, which is what predictions from the marginaleffects package does. To start this process off, define a dataframe (that I like to call new) with the two rows of values to predict for. Find a way to make this work. My preference is to use tribble, where you lay out the values as they are going to look:

new <- tribble(
  ~income, ~young, ~urban,
  2800, 370, 790,
  3200, 350, 650
)
new

If you prefer, you can use tibble instead, but you have to give all the values for each variable first:

new1 <- tibble(
  income = c(2800, 3200),
  young = c(370, 350),
  urban = c(790, 650)
)
new1

If you don’t think of either of those, put the values you are making predictions for into a text file (example: type them into a spreadsheet and save as csv), and then read them in from the file.

Having done that, then you run predictions. This seems to go better with cbind around the outside, in my experience (in 2023, it was my first time teaching marginaleffects, and we had a lot of confusion about what kind of output to expect, caused by this):

cbind(predictions(model = Anscombe.1, newdata = new))

This is the answer, or at least, it contains the answer plus some other things we don’t want. predictions is tidyverse-friendly, though, so figure out what you want to display and select it:

cbind(predictions(model = Anscombe.1, newdata = new)) %>% 
  select(estimate, conf.low, conf.high, income:urban)

Get here, somehow. You don’t need to take two steps, as I did (although you will probably need to think in two steps, even if you edit your work before you hand it in).

Extra: the reason for all those words about “all states with the same explanatory variable values” was to guide you towards a confidence interval for a mean response, the sort of thing that marginaleffects does. Contrast that with prediction intervals for a new individual. That doesn’t make any sense here because you can’t conjure up new states out of nowhere, although I suppose you might happen to know income, young, and urban for one of these same states in a different year, or something like that. (The US states are often a handy source of data, but they are not a random sample of anything.)

Having said that, let’s pretend that our values in new actually came from two new US-like states. Then we can ask what the expenditure on education might be for those new states, thus:

p <- predict(Anscombe.1, new, interval = "p")
cbind(new, p)

These intervals (from lwr to upr) are much wider than the ones we obtained before, because they are trying to get at the variability in a single response value (vs. the mean of all possible response values), and there is a lot more to go wrong when you only look at one observation. As we see below, the prediction interval for State B (the second one) is very slightly shorter than for State A, and for the same reason we see below.

(e) (3 points) For which set of explanatory variable values is the mean education expenditure estimated more accurately: the values for state A, or the values for state B? Why did things come out this way? Explain briefly (you might want to do another calculation to support your answer).

The more accurate estimate goes with a shorter confidence interval. If your number sense is good, you might be able to eyeball the two confidence intervals and say that the first one is about 34 wide (\(175-141\)) and the second one is only about 15 wide (\(196-181\)). You don’t need to be any more accurate than this; it’s clear that the second interval is shorter, and therefore the estimation for values like State B’s is more accurate.

But, you also have R at your disposal, so you can work out the width of each interval, by repeating2 and adding to the work you did before:

cbind(predictions(model = Anscombe.1, newdata = new)) %>% 
  select(estimate, conf.low, conf.high, income:urban) %>% 
  mutate(width = conf.high - conf.low)

and there you see that the second interval (for state B) is indeed shorter.

One point so far, because the point of this question is not so much the what as the why, which is coming up.

What made the second interval come out shorter? The usual thing in regression-based intervals is that if the values you are predicting for are closer to their means, the interval will be shorter, and if they are further away, the interval will be longer. The logic behind this is that if you are close to the mean(s), there are lots of nearby data points to support your prediction, but if you are farther away from any of the means, there won’t be as many nearby points to work from (you might even be extrapolating), and things are a lot less certain.

To see whether that has happened here, work out the means of each of the explanatory variables. The easiest way to do that is to use summary, which gives you things like quartiles as well:

summary(Anscombe)
    state             education         income         young      
 Length:51          Min.   :112.0   Min.   :2081   Min.   :326.2  
 Class :character   1st Qu.:165.0   1st Qu.:2786   1st Qu.:342.1  
 Mode  :character   Median :192.0   Median :3257   Median :354.1  
                    Mean   :196.3   Mean   :3225   Mean   :358.9  
                    3rd Qu.:228.5   3rd Qu.:3612   3rd Qu.:369.1  
                    Max.   :372.0   Max.   :4425   Max.   :439.7  
     urban       
 Min.   : 322.0  
 1st Qu.: 552.5  
 Median : 664.0  
 Mean   : 664.5  
 3rd Qu.: 790.5  
 Max.   :1000.0  

but you can literally work out just the means of just the variables you want if you prefer:

Anscombe %>% 
  summarize(across(income:urban, \(x) mean(x)))

What values were we predicting for again? These:

new

The values in the first row, from State A, are a bit off from the mean in each case: below the mean for income, and above the mean for the other two. (If you used summary, you also get to see that these values are close to the first, third, and third quartiles respectively). On the other hand, the values in the second row (State B) are all very close to their means, and so the prediction for this row should be much more accurate.

To summarize: remember that more accurate predictions go with explanatory variable values closer to their means, and then work out what the means are, so you can see which set of values is closer to its mean. The second set has the shorter CI and also the values are closer to their means, so all is as you expect.

Extra: earlier I mentioned that Anscombe’s book, from which our data came, used APL to do the calculations. Appendix 3 of the book contains the code of his “package” to do statistical analysis. (This is the earliest example I have seen of the word “package” to denote a collection of reusable functions to perform a task, although the term was already in use for statistical software that you can make do several different tasks, like Minitab or SPSS.) Anyway, I wanted to show you something small that demonstrates just how foreign APL looks:

This is a function to standardize an input vector X. Over coffee this morning, I did a little investigation of line 3:

  • \(\rho X\) means “the length of \(X\)”, that is, the number of data values in \(X\) (the sample size, what we would call \(n\))
  • negative numbers are represented with an “upstairs” minus sign, to make a distinct symbol from \(-\) for subtraction.3 Hence \(^- 1\) is negative 1.
  • \(\times\) means “multiply”, and \(*\) means “raise to power”. The latter is confusing, if you are an Excel user, or even if you were a FORTRAN user at the time (where \(*\) is “multiply” as we know it, and raising to a power is done with \(**\)).
  • \(\div\) means “divide”, with the extra proviso that if there is no number before it, 1 is assumed (hence \(\div 2\) is the same as \(1 \div 2\) or 0.5).
  • \(+/X\) means “reduce X by summing”: that is, add up all the numbers in \(X\). In APL, \(/\) plays a similar role to that played by summarize in the tidyverse.
  • Hence, for example:
    • \((+/X)\div \rho X\) is the mean of the values in \(X\) (the sum of them divided by the number of them)
    • \(\div ^- 1 + \rho X\) is \(1/(n-1)\), part of the formula for the variance.

Look at the very top line, and the character after the Y: a left arrow (which was one of the special keys on the APL keyboard), apparently being used to save something in Y, just as in R! This, I think, is precisely where R’s left-arrow for assignment came from.

Before there was R, there was a language called S, whose aim was to show off the capabilities of the new graphics terminals of the time (thus, doing statistics, but especially making graphs). Given what I said about Anscombe’s Quartet above, I think Anscombe would have approved.4 When I was in grad school, we had a room of (probably very expensive) Sun Workstations on which we ran S. The problem was that S was commercial software (and probably also very expensive). I’m pretty sure somebody’s research grant funded our room full of Sun Workstations and S to run on them.

The people who wrote S also knew APL, and so some things (like the left-arrow assignment, and the functions named sum and length, but not the funky keyboard) from APL found their way into S. In New Zealand in the mid 90s, Ross Ihaka and his colleagues wanted to make an open-source version of S, which they called R. The design of R was that it should look as much like S in operation as possible, but under the hood it actually ran very differently, because its internal design was very different. R has nothing to do with APL, except for the fact that it inherits ideas from S and S inherits from APL.

One last piece of Anscombe trivia: his wife, and John Tukey’s5 wife, were sisters, and so Anscombe and Tukey were brothers-in-law.

2 Driver’s seat position

Car drivers like to adjust the seat position for their own comfort. Car designers would find it helpful to know where different drivers will position the seat depending on their size and age. Researchers at the University of Michigan collected measurements on 38 drivers. The following variables were measured:

  • Age in years
  • Weightin lbs
  • HtShoes: Height (wearing shoes), in cm
  • Ht: Height (barefoot), in cm
  • Seated: Seated height in cm
  • Arm: lower arm length in cm
  • Thigh: Thigh length in cm
  • Leg: Lower leg length in cm
  • hipcenter: horizontal distance of the midpoint of the hips from a fixed location in the car in mm (response).

The data are in http://ritsokiguess.site/datafiles/seatpos.csv.

(a) (1 point) Read in and display (some of) the data.

Absolutely no surprises here:

my_url <- "http://ritsokiguess.site/datafiles/seatpos.csv"
seatpos <- read_csv(my_url)
Rows: 38 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (9): Age, Weight, HtShoes, Ht, Seated, Arm, Thigh, Leg, hipcenter

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
seatpos

38 drivers indeed. You choose the name for your dataframe. Something like drivers would also work.

(b) (3 points) Run a multiple regression to predict hipcenter from the other variables, and display the output from the regression. Hint: using . in a model formula means “all the other variables except for those explicitly named”.

This process is straight from C32. The hint enables us to avoid typing out all those explanatory variable names: we name the response variable, and then use the dot to stand for “all the other variables”:

seatpos.1 <- lm(hipcenter ~ ., data = seatpos)

If you don’t figure that out, type the names of the eight explanatory variables one by one, separated by plus signs, on the right side of the squiggle. You get to the same place, but it takes you rather a lot longer!

Displaying the model output is most easily done with summary:

summary(seatpos.1)

Call:
lm(formula = hipcenter ~ ., data = seatpos)

Residuals:
    Min      1Q  Median      3Q     Max 
-73.827 -22.833  -3.678  25.017  62.337 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept) 436.43213  166.57162   2.620   0.0138 *
Age           0.77572    0.57033   1.360   0.1843  
Weight        0.02631    0.33097   0.080   0.9372  
HtShoes      -2.69241    9.75304  -0.276   0.7845  
Ht            0.60134   10.12987   0.059   0.9531  
Seated        0.53375    3.76189   0.142   0.8882  
Arm          -1.32807    3.90020  -0.341   0.7359  
Thigh        -1.14312    2.66002  -0.430   0.6706  
Leg          -6.43905    4.71386  -1.366   0.1824  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 37.72 on 29 degrees of freedom
Multiple R-squared:  0.6866,    Adjusted R-squared:  0.6001 
F-statistic:  7.94 on 8 and 29 DF,  p-value: 1.306e-05

but if you like the broom output, you can do that instead. For what is coming up, you need both glance and tidy:

glance(seatpos.1)
tidy(seatpos.1)

(c) (3 points) Some apparently inconsistent things appear in your output. What are they, and how do you know they are inconsistent? Explain briefly.

These two things are inconsistent with each other (until you look more carefully):

  • The R-squared of about 69% is moderate to good (or whatever adjectives you want to use: the point is that it is reasonably large). Equivalently, the global F-test is very significant (P-value 0.000013).
  • None of the explanatory variables seem to add anything to the regression, because none of them are significant.

These are (apparently) inconsistent, because the first point says that something helps to predict hipcenter, and the second one says that nothing does!

Extra: you might be suspecting multicollinearity at about this point, with apparently inflated P-values for these explanatory variables.

(d) (3 points) Fit a model predicting hipcenter from only HtShoes, and show that it does not fit significantly worse than your first regression.

All right, so fit the model first:

seatpos.2 <- lm(hipcenter ~ HtShoes, data = seatpos)

If you want to, take a look at the results (but there is actually no need to do this yet):

summary(seatpos.2)

Call:
lm(formula = hipcenter ~ HtShoes, data = seatpos)

Residuals:
    Min      1Q  Median      3Q     Max 
-99.981 -27.150   2.983  22.637  73.731 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 565.5927    92.5794   6.109 4.97e-07 ***
HtShoes      -4.2621     0.5391  -7.907 2.21e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 36.55 on 36 degrees of freedom
Multiple R-squared:  0.6346,    Adjusted R-squared:  0.6244 
F-statistic: 62.51 on 1 and 36 DF,  p-value: 2.207e-09

The intuition to take away from here is that even though we took out all those explanatory variables, R-squared has hardly changed at all. But I said “significantly worse”, which means to do a test, and the right one is the ANOVA comparing the fit of the two models, smaller one first:

anova(seatpos.2, seatpos.1)

There is no significant difference in the fit of the two models, and so I would prefer to use seatpos.2 because it is (much) simpler.

Part marks for looking at the output from the second model and comparing the R-squared (say two out of three if everything else is correct), because doing so does not quite answer the question.

Extra: looking at the output from the second model, you see that HtShoes is now strongly significant, when it was not before.

(e) (2 points) Why do you think the P-value of HtShoes is so different between your two models? Explain briefly.

There are a couple of directions you might go here. You might take a multicollinearity route, and say that the explanatory variables (or at least some of them) must be highly correlated, and therefore that the P-value of HtShoes will depend very much on which other explanatory variables are in the model (or that you cannot really tell which explanatory variables predict hipcenter until you have removed all of them except one).

You could also focus on the idea that the P-value of an explanatory variable will depend on which others are in the model, and the two models being compared here are very different: HtShoes has nothing to add to the predictions from the others, but in a regression by itself it is very predictive. The reason behind this, though, is really the multicollinearity; if the explanatory variables had been unrelated, the P-value of HtShoes would have been very similar with the other explanatory variables in the model, and in a model by itself.

(f) (2 points) Based on what you know or can guess about the variables in this dataset, why do you think it makes practical sense that you would have seen the results that you did in parts (b) and (e)?

We now know that multicollinearity, or high correlation between the explanatory variables, is the cause of what we saw in (b) and (e). Does it make practical sense that this collection of explanatory variables would be highly correlated? Well, look at what they are: apart from age, they are all measurements of size. A bigger driver would be expected to be taller (and likely heavier), as well as having larger limb and body measurements (which is what the other explanatory variables are).

The results of the model seatpos.2 really are telling us that a bigger driver will have a more negative value of hipcenter. All the values of hipcenter seem to be negative, but the most negative ones go with the largest values on pretty much everything (except for age).

I wanted the main part of your answer to be an explanation in words of what it is about these explanatory variables that would make them strongly correlated. Once you have gotten that far, you could certainly support your answer with something indicating that the correlations actually are high:

cor(seatpos)
                  Age      Weight     HtShoes          Ht     Seated        Arm
Age        1.00000000  0.08068523 -0.07929694 -0.09012812 -0.1702040  0.3595111
Weight     0.08068523  1.00000000  0.82817733  0.82852568  0.7756271  0.6975524
HtShoes   -0.07929694  0.82817733  1.00000000  0.99814750  0.9296751  0.7519530
Ht        -0.09012812  0.82852568  0.99814750  1.00000000  0.9282281  0.7521416
Seated    -0.17020403  0.77562705  0.92967507  0.92822805  1.0000000  0.6251964
Arm        0.35951115  0.69755240  0.75195305  0.75214156  0.6251964  1.0000000
Thigh      0.09128584  0.57261442  0.72486225  0.73496041  0.6070907  0.6710985
Leg       -0.04233121  0.78425706  0.90843341  0.90975238  0.8119143  0.7538140
hipcenter  0.20517217 -0.64033298 -0.79659640 -0.79892742 -0.7312537 -0.5850950
                Thigh         Leg  hipcenter
Age        0.09128584 -0.04233121  0.2051722
Weight     0.57261442  0.78425706 -0.6403330
HtShoes    0.72486225  0.90843341 -0.7965964
Ht         0.73496041  0.90975238 -0.7989274
Seated     0.60709067  0.81191429 -0.7312537
Arm        0.67109849  0.75381405 -0.5850950
Thigh      1.00000000  0.64954120 -0.5912015
Leg        0.64954120  1.00000000 -0.7871685
hipcenter -0.59120155 -0.78716850  1.0000000

All of the correlations not involving Age are large, and the highest of all is between height with or without shoes (not a surprise). Age is not really correlated with anything, including the response, which is why it did not make it into our second model.

It would be a mistake to lead off your answer with this correlation matrix, because the point of this question was not to say that the correlations were high, but why they were high (which the correlation matrix gives no insight on).

Footnotes

  1. Which, amusingly, stands for A Programming Language.↩︎

  2. If you do this, you might realize that it would be worth saving the result of the first two lines above, so that you can start from there without recalculating.↩︎

  3. When I was learning “modern math” in the 1970s, we used the same \(^- 2\) to mean “negative 2”, so it was in the air at the time.↩︎

  4. Anscombe died in 2001, so he probably would have known about S and maybe even an early version of R as well.↩︎

  5. The man behind Honestly Significant Differences.↩︎