You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.
If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)
You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.
Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.
1 US education expenditures
What influences the amount that a US state spends per student on public school education? Some data were collected for the 50 states plus Washington, DC in 1970, and are available in http://ritsokiguess.site/datafiles/Anscombe.csv. The columns are the state name, education (the expenditure we are trying to predict), and three quantitative explanatory variables: income, young, and urban.
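A minimal sketch of reading in the data (this assumes the tidyverse is loaded; my_url is just a name I chose for the link):

my_url <- "http://ritsokiguess.site/datafiles/Anscombe.csv"
Anscombe <- read_csv(my_url)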
Rows: 51 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): state
dbl (4): education, income, young, urban
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Anscombe
You might have a bit of trouble deciding on a name for the dataframe, since education is one of its columns. I chose the name of the datafile, with a capital A.
Extra 1: This is the name of the statistician Frank Anscombe, who wrote a 1981 book “Computing in Statistical Science Through APL”, from which these data come. About the only statistical software available in those days was SAS (which, then as now, had a very expensive annual licence), so academic statisticians would often code things themselves in languages like FORTRAN. Anscombe wrote his book using a different language called APL. The thing about APL is that it uses a whole bunch of different symbols, some of which are seen nowhere else, and the only way you could use APL back in 1981 was to buy a special keyboard that had those symbols on it. The idea was that the APL code for computing things like means and sums of squares (and hence slopes in regression) was very short, and so you could express a lot with a few (esoteric) symbols.
Extra 2: Anscombe is perhaps better known for the “Anscombe quartet”, four small data sets he made up, each containing an \(x\) and a \(y\), for which the mean and SD of all the \(x\)’s are the same, the mean and SD of all the \(y\)’s are the same, the correlations between each \(x\) and \(y\) are the same, and the intercepts and slopes of the regression lines for predicting \(y\) from \(x\) are all the same. So, you would think, the actual data values would be pretty much the same. This short video demonstrates that this is very much not the case. Anscombe’s purpose in devising these data sets (and publishing a paper about them) was to illustrate that it is not enough to do the calculations; you need to look at graphs of your data to decide whether those calculations are ones that you trust.
If you want to explore Anscombe’s quartet further, R has it as a built-in dataset called anscombe (lowercase “a”!) with the four pairs of explanatory and response variables:
anscombe
from where you can compute means, SDs, correlations and regression lines, verify that all four of them are the same to a high degree of accuracy, and then look at the four scatterplots.
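For example, here is a sketch of the sort of checking you could do (assuming the tidyverse is loaded), shown for the first pair only; the other three pairs go the same way:

anscombe %>% summarize(across(everything(), list(mean = mean, sd = sd)))
with(anscombe, cor(x1, y1))
lm(y1 ~ x1, data = anscombe)
ggplot(anscombe, aes(x = x1, y = y1)) + geom_point() + geom_smooth(method = "lm", se = FALSE)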
Extra 3: The Frank Anscombe Quartet sounds to me as if it should be the name of a 1970s-era jazz combo, presumably with a pianist, a bass player, a drummer and a jazz violinist, the sort of thing you would have found vinyl recordings of in the old Sam the Record Man on Yonge Street.
(b) (2 points) Fit a regression predicting education expenditure from the other three quantitative variables, and display the results.
This is straight C32 stuff so far:
Anscombe.1 <- lm(education ~ income + young + urban, data = Anscombe)
summary(Anscombe.1)
Call:
lm(formula = education ~ income + young + urban, data = Anscombe)
Residuals:
Min 1Q Median 3Q Max
-60.240 -15.738 -1.156 15.883 51.380
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.868e+02 6.492e+01 -4.418 5.82e-05 ***
income 8.065e-02 9.299e-03 8.674 2.56e-11 ***
young 8.173e-01 1.598e-01 5.115 5.69e-06 ***
urban -1.058e-01 3.428e-02 -3.086 0.00339 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 26.69 on 47 degrees of freedom
Multiple R-squared: 0.6896, Adjusted R-squared: 0.6698
F-statistic: 34.81 on 3 and 47 DF, p-value: 5.337e-12
Use tidy and glance from broom if you prefer.
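For example, a sketch of that approach (assuming the broom package is installed):

library(broom)
tidy(Anscombe.1)
glance(Anscombe.1)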
(c) (2 points) Why would you not consider removing any of the explanatory variables from the model you just fitted? Explain briefly.
All three explanatory variables have (strongly) significant P-values, and therefore removing any of them would be a mistake (for example, it would cause the R-squared to drop substantially). That’s all you need.
(d) (4 points) Consider an imaginary state A with income 2800, young 370, and urban 790. Consider also a second imaginary state B with income 3200, young 350, and urban 650. For each of these two sets of explanatory variable values, predict the mean value of education for all states with these same explanatory variable values, and obtain a 95% confidence interval in each case. In your output, display only the predicted mean response, its lower and upper confidence limits, and the values of the explanatory variables.
This is prediction of a mean response, which is what predictions from the marginaleffects package does. To start this process off, define a dataframe (that I like to call new) with the two rows of values to predict for. Find a way to make this work. My preference is to use tribble, where you lay out the values as they are going to look:
new <- tribble(
  ~income, ~young, ~urban,
  2800, 370, 790,
  3200, 350, 650
)
new
If you prefer, you can use tibble instead, but then you have to give all the values for each variable first.
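A sketch of that alternative:

new <- tibble(income = c(2800, 3200),
              young = c(370, 350),
              urban = c(790, 650))
new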
If you don’t think of either of those, put the values you are making predictions for into a text file (example: type them into a spreadsheet and save as csv), and then read them in from the file.
Having done that, run predictions. In my experience this goes better with cbind around the outside (when I taught marginaleffects for the first time in 2023, the kind of output to expect without it caused a lot of confusion).
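A sketch of what I mean (assuming the marginaleffects package is loaded; preds is just a name I chose):

preds <- cbind(predictions(Anscombe.1, newdata = new))
preds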
This is the answer, or at least, it contains the answer plus some other things we don’t want. predictions is tidyverse-friendly, though, so figure out what you want to display and select it:
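One way to do that (the column names estimate, conf.low, and conf.high are what recent versions of marginaleffects use; check your own output if yours differ):

preds %>% select(estimate, conf.low, conf.high, income, young, urban)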
Get here, somehow. You don’t need to take two steps, as I did (although you will probably need to think in two steps, even if you edit your work before you hand it in).
Extra: the reason for all those words about “all states with the same explanatory variable values” was to guide you towards a confidence interval for a mean response, the sort of thing that marginaleffects does. Contrast that with prediction intervals for a new individual. That doesn’t make any sense here because you can’t conjure up new states out of nowhere, although I suppose you might happen to know income, young, and urban for one of these same states in a different year, or something like that. (The US states are often a handy source of data, but they are not a random sample of anything.)
Having said that, let’s pretend that our values in new actually came from two new US-like states. Then we can ask what the expenditure on education might be for those new states, thus:
p <- predict(Anscombe.1, new, interval = "p")
cbind(new, p)
These intervals (from lwr to upr) are much wider than the ones we obtained before, because they are trying to capture the variability in a single response value (vs. the mean of all possible response values), and there is a lot more to go wrong when you only look at one observation. The prediction interval for State B (the second one) is very slightly shorter than the one for State A, for the same reason that we discuss below.
(e) (3 points) For which set of explanatory variable values is the mean education expenditure estimated more accurately: the values for state A, or the values for state B? Why did things come out this way? Explain briefly (you might want to do another calculation to support your answer).
The more accurate estimate goes with a shorter confidence interval. If your number sense is good, you might be able to eyeball the two confidence intervals and say that the first one is about 34 wide (\(175-141\)) and the second one is only about 15 wide (\(196-181\)). You don’t need to be any more accurate than this; it’s clear that the second interval is shorter, and therefore the estimation for values like State B’s is more accurate.
But, you also have R at your disposal, so you can work out the width of each interval, by repeating and adding to the work you did before.
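A sketch, building on the preds dataframe from before (again assuming the estimate, conf.low, and conf.high column names):

preds %>% mutate(width = conf.high - conf.low) %>%
  select(width, income, young, urban)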
and there you see that the second interval (for state B) is indeed shorter.
One point so far, because the point of this question is not so much the what as the why, which is coming up.
What made the second interval come out shorter? The usual thing in regression-based intervals is that if the values you are predicting for are closer to their means, the interval will be shorter, and if they are further away, the interval will be longer. The logic behind this is that if you are close to the mean(s), there are lots of nearby data points to support your prediction, but if you are farther away from any of the means, there won’t be as many nearby points to work from (you might even be extrapolating), and things are a lot less certain.
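For intuition (this is the simple-regression formula, not something specific to these data): the standard error of the estimated mean response at a value \(x_0\) is \(s\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}\), which grows as \(x_0\) moves away from \(\bar{x}\). The multiple-regression version behaves the same way, with “distance” measured from the means of all the explanatory variables jointly.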
To see whether that has happened here, work out the means of each of the explanatory variables. The easiest way to do that is to use summary, which gives you things like quartiles as well:
summary(Anscombe)
state education income young
Length:51 Min. :112.0 Min. :2081 Min. :326.2
Class :character 1st Qu.:165.0 1st Qu.:2786 1st Qu.:342.1
Mode :character Median :192.0 Median :3257 Median :354.1
Mean :196.3 Mean :3225 Mean :358.9
3rd Qu.:228.5 3rd Qu.:3612 3rd Qu.:369.1
Max. :372.0 Max. :4425 Max. :439.7
urban
Min. : 322.0
1st Qu.: 552.5
Median : 664.0
Mean : 664.5
3rd Qu.: 790.5
Max. :1000.0
but you can work out just the means of the variables you want if you prefer.
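A sketch of one way to do that, using across:

Anscombe %>% summarize(across(c(income, young, urban), mean))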
The values in the first row, from State A, are a bit off from the mean in each case: below the mean for income, and above the mean for the other two. (If you used summary, you also get to see that these values are close to the first, third, and third quartiles respectively). On the other hand, the values in the second row (State B) are all very close to their means, and so the prediction for this row should be much more accurate.
To summarize: remember that more accurate predictions go with explanatory variable values closer to their means, and then work out what the means are, so you can see which set of values is closer to its mean. The second set has the shorter CI and also the values are closer to their means, so all is as you expect.
Extra: earlier I mentioned that Anscombe’s book, from which our data came, used APL to do the calculations. Appendix 3 of the book contains the code of his “package” to do statistical analysis. (This is the earliest example I have seen of the word “package” to denote a collection of reusable functions to perform a task, although the term was already in use for statistical software that you can make do several different tasks, like Minitab or SPSS.) Anyway, I wanted to show you something small that demonstrates just how foreign APL looks: