Worksheet 12

Published

November 22, 2023

The questions are below. My solutions for each question appear below all of that question’s parts; scroll down if you get stuck. Below that there is extra discussion for some of the questions, which you might find interesting to read, maybe after tutorial.

For these worksheets, you will learn the most by spending a few minutes thinking about how you would answer each question before you look at my solution. There are no grades attached to these worksheets, so feel free to guess: it makes no difference at all how wrong your initial guess is!

This week’s questions, in order, are a multiple regression (Question 1), a regression with a categorical variable (Question 2), and writing a function (Question 3). I encourage you to start with the one where you think you need the most practice.

1 Behavioral Risk Factor Surveillance System 2

The Centers for Disease Control and Prevention (in the US) run a survey called the Behavioral Risk Factor Surveillance System, in which people submit information about health conditions and risk behaviours. The data we have are percentages of adults within each state for whom the item is true. The variables we are interested in here are:

  • FruitVeg5: the percent of respondents who eat at least 5 servings of fruits and vegetables every day (response)
  • SmokeEveryday: the percent of respondents who smoke every day.
  • EdCollege: the percent of respondents who have completed a college diploma or university degree.

The data are in http://ritsokiguess.site/datafiles/brfss_no_utah.csv. There are many other variables, which you can ignore. The state of Utah is omitted.

You might have seen these data before; if you have, you might recall that we omitted Utah because a lot of people there do not smoke (for religious reasons), which has nothing to do with eating fruits and vegetables.

  1. Read in and display (a little of) the data.

  2. Make a graph that shows the relationships, if any, between the response and the two explanatory variables.

  3. Describe any trends you see on your graph(s) (think form, direction, strength if you find it helpful).

  4. Run a regression predicting the percent of adults who eat 5 servings of fruits and vegetables daily from EdCollege and SmokeEveryday, and display the results.

  5. Draw a complete set of residual plots for this regression (three or four plots, depending how you count them). Do you have any concerns? Explain briefly.

Behavioral Risk Factor Surveillance System 2: my solutions

The Centers for Disease Control and Prevention (in the US) run a survey called the Behavioral Risk Factor Surveillance System, in which people submit information about health conditions and risk behaviours. The data we have are percentages of adults within each state for whom the item is true. The variables we are interested in here are:

  • FruitVeg5: the percent of respondents who eat at least 5 servings of fruits and vegetables every day (response)
  • SmokeEveryday: the percent of respondents who smoke every day.
  • EdCollege: the percent of respondents who have completed a college diploma or university degree.

The data are in http://ritsokiguess.site/datafiles/brfss_no_utah.csv. There are many other variables, which you can ignore. The state of Utah is omitted.

You might have seen these data before; if you have, you might recall that we omitted Utah because a lot of people there do not smoke (for religious reasons), which has nothing to do with eating fruits and vegetables.

  1. Read in and display (a little of) the data.

Solution

A simple read_csv:

my_url <- "http://ritsokiguess.site/datafiles/brfss_no_utah.csv"
brfss <- read_csv(my_url)
Rows: 49 Columns: 31
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): State
dbl (30): Age18_24, Age25_34, Age35_44, Age45_54, Age55_64, Age65orMore, EdL...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
brfss

There are 49 rows, one for each state apart from Utah. There are a lot of columns, 31 altogether. You can check for yourself, by looking or as below, that the columns I named above do actually exist. There is no need to “get rid” of the other columns, unless you have a strong desire to do so.
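One way to make that check (a sketch; select will complain if any of the named columns does not exist):

brfss %>% select(FruitVeg5, SmokeEveryday, EdCollege)

If this displays three columns without complaint, the columns are indeed there.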

\(\blacksquare\)

  2. Make a graph that shows the relationships, if any, between the response and the two explanatory variables.

Solution

This is best done as one of those facetted graphs, with each of the (here two) facets showing a scatterplot of the response against one of the explanatory variables. To do that, arrange the dataframe longer, with all the explanatory variable values in one column and a second column saying which ones they are:

brfss %>% 
  pivot_longer(c(SmokeEveryday, EdCollege), names_to = "xname", values_to = "x") %>% 
  ggplot(aes(x = x, y = FruitVeg5)) + geom_point() + geom_smooth(se = FALSE) +
  facet_wrap(~xname, scales = "free")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

That, I think, is the best way to do it. Some notes:

  • I can’t remember if the two \(x\)-variable columns are consecutive, and I only need to pick out these two, so c rather than :.
  • Give the column of values a name like x or xvals to remind you that this is going on the plot.
  • Plot the points and add a smooth trend (since the aim is to see what kind of relationship there is, and it may not be linear)
  • Use different scales on the facets because each one is a different \(x\)-variable and the typical values are different. (They are both percents, certainly, but one is about 30 and the other is less than 20).

A second-best way is to plot the two scatterplots separately. Do that if you can’t get the longer-plus-facets to work.
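If you do go that way, it would be something like this (a sketch):

ggplot(brfss, aes(x = EdCollege, y = FruitVeg5)) + geom_point() + geom_smooth(se = FALSE)
ggplot(brfss, aes(x = SmokeEveryday, y = FruitVeg5)) + geom_point() + geom_smooth(se = FALSE)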

\(\blacksquare\)

  3. Describe any trends you see on your graph(s) (think form, direction, strength if you find it helpful).

Solution

I see:

  • form: more or less linear in both cases (no real suggestion of a curve, especially given the (considerable) scatter of the points). See further comments below.
  • direction: an upward trend for EdCollege, a downward trend for SmokeEveryday.
  • strength: weak to moderate (after all, there clearly are trends; they just aren’t very strong because there is a lot of scatter).

“More or less linear” you might consider a bit of a stretch, especially on the EdCollege plot. But that “hook” shape on the left is not really anything to get excited about; it really only exists because of that one observation with EdCollege just less than 25 and FruitVeg5 greater than 25, which is a long way from the trend of the others. The other points are more or less on a straight trend.

The other thing I’d think about is that both trends have a “kink” in the middle where they sort of curve. My take is that these are “inconsequential wiggles” caused by how the data points happened to come out. There is really nothing in the data that suggests a curve in either case.

\(\blacksquare\)

  4. Run a regression predicting the percent of adults who eat 5 servings of fruits and vegetables daily from EdCollege and SmokeEveryday, and display the results.

Solution

This:

brfss.1 <- lm(FruitVeg5 ~ EdCollege + SmokeEveryday, data = brfss)
summary(brfss.1)

Call:
lm(formula = FruitVeg5 ~ EdCollege + SmokeEveryday, data = brfss)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.5881 -1.4681  0.3668  1.8217  6.5814 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    22.1165     5.9015   3.748 0.000497 ***
EdCollege       0.2351     0.1044   2.253 0.029066 *  
SmokeEveryday  -0.4359     0.2033  -2.144 0.037371 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.747 on 46 degrees of freedom
Multiple R-squared:  0.4266,    Adjusted R-squared:  0.4017 
F-statistic: 17.11 on 2 and 46 DF,  p-value: 2.783e-06

Use glance and tidy from broom if you like them better.
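If you do, it would look something like this (remembering to load broom first):

library(broom)
glance(brfss.1)
tidy(brfss.1)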

\(\blacksquare\)

  5. Draw a complete set of residual plots for this regression (three or four plots, depending how you count them). Do you have any concerns? Explain briefly.

Solution

Residuals against fitted values, normal quantile plot of residuals, residuals against each of the two explanatory variables.

Residuals against fitted values:

ggplot(brfss.1, aes(x = .fitted, y = .resid)) + geom_point()

Pretty much random. That point top left with the biggest (most positive) residual is the only one to be at all worried about.

Normal quantile plot of residuals:

ggplot(brfss.1, aes(sample = .resid)) + stat_qq() + stat_qq_line()

On the line, really, all the way along. I wouldn’t stress about those three most negative residuals; they are really not far below the line. Also: that most positive residual, which shows up on the other plots as “unusually large”, is indeed quite a bit bigger than the second largest one, but it is close to the line here, so it is pretty much the size you’d expect the largest residual to be (and thus ought not to be a big concern).

Residuals against explanatory variables. I would use augment from broom to create a dataframe with data and residuals in it:

library(broom)
brfss.1 %>% augment(brfss) -> d
d

and then do the same thing as in part 2:

d %>% pivot_longer(c(EdCollege, SmokeEveryday), names_to = "xname", values_to = "x") %>% 
  ggplot(aes(x = x, y = .resid)) + geom_point() + 
  facet_wrap(~xname, scales = "free")

Not bad. You could choose to be bothered by that point top left of EdCollege, and the one top right-ish on SmokeEveryday which seems to have an unusually large (positive) residual.

I would declare myself to be happy overall. There are no systematic problems, such as graphs that are completely the wrong shape, and the normal quantile plot indicates that the “large” positive residual is not actually too large.

\(\blacksquare\)

2 Prison stress

You may have previously seen data on stress in prisoners, some of whom took a physical training program and some of whom did not. The introduction to the question is below. Here, we analyze the data in a different way than before.

Being in prison is stressful, for anybody. 26 prisoners took part in a study where their stress was measured at the start and end of the study. Some of the prisoners, chosen at random, completed a physical training program (for these prisoners, the Group column is Sport) and some did not (Group is Control). The researchers’ main aim was to see whether the physical training program reduced stress on average in the population of prisoners. The data are in http://www.ritsokiguess.site/datafiles/PrisonStress.csv, in four columns, respectively an identifier for the prisoner, whether or not they did physical training, their stress score at the start of the study, and their stress score at the end.

  1. Read in and display (some of) the data.

  2. Plot the stress scores after against the stress scores before, labelling the points by treatment group. Add regression lines for each group.

  3. Comment on the effects of (i) stress score before and (ii) treatment group on stress score after.

  4. Fit a regression predicting stress score after from stress score before and from the (categorical) treatment. Display the results.

  5. Interpret the Estimate values for PSSbefore and Group, in the context of the data.

Prison stress: my solutions

You may have previously seen data on stress in prisoners, some of whom took a physical training program and some of whom did not. The introduction to the question is below.

Being in prison is stressful, for anybody. 26 prisoners took part in a study where their stress was measured at the start and end of the study. Some of the prisoners, chosen at random, completed a physical training program (for these prisoners, the Group column is Sport) and some did not (Group is Control). The researchers’ main aim was to see whether the physical training program reduced stress on average in the population of prisoners. The data are in http://www.ritsokiguess.site/datafiles/PrisonStress.csv, in four columns, respectively an identifier for the prisoner, whether or not they did physical training, their stress score at the start of the study, and their stress score at the end.

(a) Read in and display (some of) the data.

Very much the usual. Give the dataframe a name of your choosing; stress is good as a name because none of the columns are actually called stress:

my_url <- "http://www.ritsokiguess.site/datafiles/PrisonStress.csv"
stress <- read_csv(my_url)
Rows: 26 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Subject, Group
dbl (2): PSSbefore, PSSafter

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
stress

As you see, the columns of stress scores are called PSSbefore and PSSafter, which is because PSS was the name of the stress scale the researchers used.

(b) Plot the stress scores after against the stress scores before, labelling the points by treatment group. Add regression lines for each group.

The justification for going straight to regression lines rather than smooth trends is (you might imagine) that you first plot just the points without the lines and say to yourself that there are “no obvious curves” in either the red points or the blue ones (try it and see).
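If you want to make that check, it might look something like this (a sketch: the same plot, but with a smooth trend for each group instead of a line):

ggplot(stress, aes(x = PSSbefore, y = PSSafter, colour = Group)) +
  geom_point() + geom_smooth(se = FALSE)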

The plot is based on thinking of the after stress value as a response and the before one as explanatory (both quantitative), with the treatment group as a categorical explanatory variable:

ggplot(stress, aes(x = PSSbefore, y = PSSafter, colour = Group)) +
  geom_point() + geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'

(c) Comment on the effects of (i) stress score before and (ii) treatment group on stress score after.

There’s still a good bit of scatter, but:

  • as stress score before goes up, stress score after goes up as well (a general upward trend in the points). This is easier to see if you focus first on just the red points and then just the blue ones. There is actually a treatment effect (which we get to in a moment) that confuses the issue if you look at all the points together.

  • most of the red points are above and most of the blue points are below, particularly if you compare subjects with similar before scores. The lines help us judge that: the red line is above the blue one, meaning that for any before score, the after score is predicted to be higher for someone in the control group and lower for someone in the Sport group. That is, once you allow for how much stress a subject had before, they will have less stress after doing the physical training than they would if they were in the control group.

The second thing is what is really of interest here: what effect does the treatment have, once you allow for the effects of other variables? (Seeing that high stress before tends to go with high stress after is neither surprising nor very interesting; the value in measuring stress before is to use it in accounting for the effect of treatment.)

(d) Fit a regression predicting stress score after from stress score before and from the (categorical) treatment. Display the results.

This is easier, almost, to do than it is to say. You don’t have to do anything special with the categorical variable; just add it on the right side of the squiggle:

stress.1 <- lm(PSSafter ~ PSSbefore + Group, data = stress)
summary(stress.1)

Call:
lm(formula = PSSafter ~ PSSbefore + Group, data = stress)

Residuals:
    Min      1Q  Median      3Q     Max 
-11.564  -2.759   0.139   3.569   9.309 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  16.0909     2.7358   5.882 5.39e-06 ***
PSSbefore     0.4667     0.1298   3.595  0.00153 ** 
GroupSport   -7.2598     2.4731  -2.935  0.00743 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.717 on 23 degrees of freedom
Multiple R-squared:  0.4044,    Adjusted R-squared:  0.3526 
F-statistic: 7.809 on 2 and 23 DF,  p-value: 0.00258

(e) Interpret the Estimate values for PSSbefore and Group, in the context of the data.

The significantly positive number beside PSSbefore is an ordinary regression slope.1 This says that if your stress score before is higher by 1, and the treatment group is the same, your stress score after is expected to be higher by 0.47 units on this scale. This is not terribly surprising, and also not really what we care about. The significantly negative number next to GroupSport says that if you were in the physical training group, your stress score afterwards is on average lower by over 7 points, compared to if you were in the “baseline” Control group, even if your stress score before was the same.
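If you want to see the Group estimate in action, you could predict the after score for two hypothetical prisoners with the same before score, one in each group (a sketch; the before score of 20 is made up):

new <- tibble(PSSbefore = c(20, 20), Group = c("Control", "Sport"))
predict(stress.1, new)

The two predictions should differ by about 7.26, the GroupSport estimate.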

This is really quantifying the effect of the treatment while also allowing for or adjusting for other reasons the groups might have been different (like, the Sport group had higher stress before compared to the control group). When we saw these data before, the two-sample t-test you did on the After stress scores did not account for how the groups might have been different before (it assumed they were the same up to randomization, which it seems they were not), while this analysis accounts for everything that is going on. I hope you are feeling that this more sophisticated analysis is rather more satisfactory than the two-sample t-test you did before.
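For comparison, that earlier analysis was an ordinary two-sample test on the after scores only, something like this (a sketch; your earlier test may have differed in details such as the alternative):

t.test(PSSafter ~ Group, data = stress)

This ignores PSSbefore entirely, which is exactly the problem.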

3 Writing a function to do a wide \(t\)-test

The way we know how to run a two-sample \(t\)-test is to arrange the data “long”: that is, to have one column with all the data in it, and a second column saying which of the two groups each data value belongs to. However, sometimes we get data in two separate columns, and we would like to make a function that will run the \(t\)-test as we know how to do it, but on data in this form.

As an example, suppose we have data on student test scores for some students who took a course online (asynchronously), and some other students who took the same course in-person. All the students completed an assignment afterwards to see how much they had learned. (The assignment is out of 40.) The students who took the course online scored 32, 37, 35, 28; the students who took the course in-person scored 35, 31, 29, 25. (There were actually a lot more students than this, but these will do to build our function with.)

  1. Enter these data into two vectors called online and classroom respectively.

  2. Using the two vectors you just made, make a dataframe with those two vectors as columns, and rearrange it to be a suitable input for t.test. (Hint: you might find everything() useful in your rearrangement). When you have everything suitably rearranged, run a two-sample \(t\)-test (two-sided, Welch) as we did it in lecture. What P-value do you get?

  3. Write a function that takes two vectors as input. Call them x and y. The function should run a two-sample (two-sided, Welch) \(t\)-test on the two vectors as input, running the \(t\)-test in the way we learned in lecture, and return all the output from the \(t\)-test. Test your function on the same data you used earlier, and verify that you get the same P-value.

  4. Modify your function to return just the P-value from the \(t\)-test, as a number. Test your modified function and demonstrate that it does indeed return only the P-value (and not the rest of the \(t\)-test output).

  5. Modify your function to allow any inputs that t.test accepts. Demonstrate that your modified function works by obtaining the P-value for a pooled (two-sided) \(t\)-test for the test score data. (This should be very slightly different from the P-value you had before.)

Writing a function to do a wide \(t\)-test: my solutions

The way we know how to run a two-sample \(t\)-test is to arrange the data “long”: that is, to have one column with all the data in it, and a second column saying which of the two groups each data value belongs to. However, sometimes we get data in two separate columns, and we would like to make a function that will run the \(t\)-test as we know how to do it, but on data in this form.

As an example, suppose we have data on student test scores for some students who took a course online (asynchronously), and some other students who took the same course in-person. All the students completed an assignment afterwards to see how much they had learned. (The assignment is out of 40.) The students who took the course online scored 32, 37, 35, 28; the students who took the course in-person scored 35, 31, 29, 25. (There were actually a lot more students than this, but these will do to build our function with.)

  1. Enter these data into two vectors called online and classroom respectively.

Solution

Use c:

online <- c(32, 37, 35, 28)
classroom <- c(35, 31, 29, 25)
online
[1] 32 37 35 28
classroom
[1] 35 31 29 25

If you are careful you can copy and paste my values rather than typing them in again.

If for some reason c doesn’t come to mind, put them in a dataframe with tribble:

thing <- tribble(
  ~online, ~classroom,
  32, 35,
  37, 31,
  35, 29,
  28, 25
)
thing

This seems to require you to type the values in again, because tribble lays the data out in a different shape. (Or, type the data values into a file, and then read that in. Find a way.)

The question asked for vectors not a dataframe, though, so if you go this way, pull the columns out and save them. You could also use pull (or pluck) here:

online <- thing$online
online
[1] 32 37 35 28
classroom <- thing$classroom
classroom
[1] 35 31 29 25
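or, if pull is more your style, something like this (the same idea, tidyverse-flavoured):

online <- thing %>% pull(online)
classroom <- thing %>% pull(classroom)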

The function we are about to write needs vectors as input, so we need to produce vectors somehow here. It may seem backwards to do it this way, since in the next part we make a dataframe, but the point is that the function we write will take vectors as input and so we need to work with that. (If you were consulting with someone, they would say something like “this is what our other process produces, and we need you to write a function to turn this into that”.)

\(\blacksquare\)

  2. Using the two vectors you just made, make a dataframe with those two vectors as columns, and rearrange it to be a suitable input for t.test. (Hint: you might find everything() useful in your rearrangement). When you have everything suitably rearranged, run a two-sample \(t\)-test (two-sided, Welch) as we did it in lecture. What P-value do you get?

Solution

tibble to make the dataframe, and pivot_longer to rearrange the values (they start out as “wide” and you want to make them longer):

tibble(online, classroom) %>% 
  pivot_longer(everything(), names_to = "location", values_to = "score") -> d
d

This is the right layout now. And now to run the \(t\)-test on it. Two-sided and Welch are both the defaults, so we don’t need any extra input (I am making it easier for you):

t.test(score ~ location, data = d)

    Welch Two Sample t-test

data:  score by location
t = -1.0498, df = 5.9776, p-value = 0.3344
alternative hypothesis: true difference in means between group classroom and group online is not equal to 0
95 percent confidence interval:
 -9.998992  3.998992
sample estimates:
mean in group classroom    mean in group online 
                     30                      33 

The P-value is 0.3344. From these tiny samples, it is not at all impressive, but that’s not the point: it looks like the right sort of P-value from data like this.

The reason for the everything() is that it will pivot things longer no matter what they are called (or however many of them there are, but in this question there will only be two columns), and that will be helpful when you write a function and your columns have different names than here: you can still use everything().

It’s easier to save the output from pivot_longer and use that as input to the \(t\)-test (you will probably realize, as I did, that it would be helpful to call it something), but you can also do it all in one shot, by using the notation . for “the thing that came out of the previous step”:2

tibble(online, classroom) %>% 
  pivot_longer(everything(), names_to = "location", values_to = "score") %>% 
  t.test(score ~ location, data = .)

    Welch Two Sample t-test

data:  score by location
t = -1.0498, df = 5.9776, p-value = 0.3344
alternative hypothesis: true difference in means between group classroom and group online is not equal to 0
95 percent confidence interval:
 -9.998992  3.998992
sample estimates:
mean in group classroom    mean in group online 
                     30                      33 

The result is the same either way. But unless you remember what the “dot” thing does, it is easier to save the output from pivot_longer in something, and use that something as input to t.test.

Extra: instead of using tibble, you could use the base R data.frame, but I haven’t taught you that, so if you use it, you must (if this were on an assignment) say where you got it from.

\(\blacksquare\)

  3. Write a function that takes two vectors as input. Call them x and y. The function should run a two-sample (two-sided, Welch) \(t\)-test on the two vectors as input, running the \(t\)-test in the way we learned in lecture, and return all the output from the \(t\)-test. Test your function on the same data you used earlier, and verify that you get the same P-value.

Solution

The reason I had you do the previous part was so that you have written everything that you need to put into your function. (This is a good strategy: get it working in ordinary code first. Then, when you write your function, there’s less to go wrong.)

t_test_wide <- function(x, y) {
  tibble(x, y) %>% 
    pivot_longer(everything(), names_to = "group", values_to = "value") -> d
  t.test(value ~ group, data = d)
}

Change your inputs to tibble to be the inputs to the function, x and y. It’s better also to change the names_to and values_to to something more generic than location and score (or whatever you used before), because the function you are writing will apply to any two columns of input, not just test scores in two different classes, and future-you will appreciate knowing the roles of things in your function, ready for when future-you needs to change something about how the function works.

Also, whatever names you put in names_to and values_to need to be in t.test as well, otherwise when you come to test it (in a moment) you won’t get the same results as you had before.

R style, and thus best practice here, is to have the last line of the function be the value that is returned from the function (here, the output from t.test, all of it). There is no need to wrap it in return; it works (unlike in Python) to do it without.

Also, if you are thinking about Python, you might be wondering about the indentation. Python requires the indentation, but R doesn’t (because of the curly brackets around the function). Having said that, though, it’s good style to do the indentation as I did it (and it also benefits future-you when you come back to your code in six months and have forgotten how you did it): indent each line of the function once (two spaces), and indent the pivot_longer part of the pipeline once more to make it clear that it belongs to the pipeline and the t.test does not.
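If an explicit return makes you feel better, this sketch is equivalent (the usual R style is just to leave return out):

t_test_wide <- function(x, y) {
  tibble(x, y) %>% 
    pivot_longer(everything(), names_to = "group", values_to = "value") -> d
  # explicit return of the t.test output: same behaviour as before
  return(t.test(value ~ group, data = d))
}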

To test, use classroom and online as the two inputs to the function. They can be either way around, since the test is two-sided:

t_test_wide(classroom, online)

    Welch Two Sample t-test

data:  value by group
t = -1.0498, df = 5.9776, p-value = 0.3344
alternative hypothesis: true difference in means between group x and group y is not equal to 0
95 percent confidence interval:
 -9.998992  3.998992
sample estimates:
mean in group x mean in group y 
             30              33 

The same P-value indeed, and everything else the same also.

If the P-value is not the same (or you get an error from your function), you will need to go back and fix up your function until it works. Debugging is a big part of coding. Most of us don’t get things right the first time. Within limits,3 it doesn’t matter how long it takes, as long as you get there by the time your work is due.

Extra: this, perhaps surprisingly, does in fact work:

t.test(online, classroom)

    Welch Two Sample t-test

data:  online and classroom
t = 1.0498, df = 5.9776, p-value = 0.3344
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.998992  9.998992
sample estimates:
mean of x mean of y 
       33        30 

but I don’t want you to do it this way. (If you find out about it, you can use it as a way to check your result, but it is not acceptable as part of an answer to this question.)

\(\blacksquare\)

  4. Modify your function to return just the P-value from the \(t\)-test, as a number. Test your modified function and demonstrate that it does indeed return only the P-value (and not the rest of the \(t\)-test output).

Solution

This means finding out how to get just the P-value out of the t.test output. It’s there on the output, but how do we get it out?

There are two approaches I would consider.

One is to find out all the things that are output from t.test. To do that, run your initial t.test again, but this time save the result:

d_t <- t.test(score ~ location, data = d)

and then take a look at the names of the result:

names(d_t)
 [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
 [6] "null.value"  "stderr"      "alternative" "method"      "data.name"  

The third of those is the one we want. Get it out with a dollar sign, as if it were a column of a dataframe:4

d_t$p.value
[1] 0.3343957

This, I think, also works (in a rather more tidyverse fashion):5

d_t %>% pluck("p.value")
[1] 0.3343957

This function, names, has quietly shown up a couple of times in this course (once in combination with power.t.test, and once when I showed you what augment did in regression).

I said there was a second way. We used this idea in regression, but you might think to use it in a \(t\)-test context to see what it does. The idea comes from broom: what do glance and tidy do in the context of a \(t\)-test? Let’s find out:

glance(d_t)

and

tidy(d_t)

They are actually identical, and they both have a thing called p.value that is what we want. Another way to run these is in a pipeline, piping the output from t.test into (say) tidy, thus:

t.test(score ~ location, data = d) %>% 
  tidy()

This is a dataframe (the reason why broom is useful), so get just the P-value out of this the way you would get anything out of a dataframe, with a dollar sign (which looks a little odd):

t.test(score ~ location, data = d) %>% 
  tidy() %>% .$p.value
[1] 0.3343957

or with pluck, or for that matter pull:

t.test(score ~ location, data = d) %>% 
  tidy() %>% pluck("p.value")
[1] 0.3343957

select doesn’t quite work here, because it gives you back a dataframe and we want a number.
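You can see the problem by trying it (a sketch):

t.test(score ~ location, data = d) %>% 
  tidy() %>% select(p.value)

What comes back is a dataframe with one row and one column, not the number inside it.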

I also need to say some more about this business of the dollar sign in a pipeline. tidy() produces a dataframe, and we can use the “dot” to refer to “the dataframe that came out of the previous step” (the output from the tidy). So the odd-looking code .$p.value means, in English, “the column called p.value in the dataframe that came out of the previous step”.

All right, with all of that in mind, there are (at least) two ways to modify your function to return just the P-value.

The first is to use the dollar-sign plus p.value (that we found using names):

t_test_wide <- function(x, y) {
  tibble(x, y) %>% 
    pivot_longer(everything(), names_to = "group", values_to = "value") -> d
  t_d <- t.test(value ~ group, data = d)
  t_d$p.value
}

The second is to use tidy (or glance) from broom, something like this:

t_test_wide_x <- function(x, y) {
  tibble(x, y) %>% 
    pivot_longer(everything(), names_to = "group", values_to = "value") -> d
  t.test(value ~ group, data = d) %>% 
    tidy() %>% pluck("p.value")
}

I have two functions, so I have two tests (but you will only have one):

t_test_wide(classroom, online)
[1] 0.3343957
t_test_wide_x(classroom, online)
[1] 0.3343957

Success.

Extra (there is one more part to the actual question, so you can skip to that and read this later):

I originally wanted you to investigate what happens if you input two vectors of different lengths to your function. (I took that part out because the question was then too long and complicated.) It continued: Explain briefly what happened. Does it still make sense to do a \(t\)-test with input vectors of different lengths? Explain briefly. Hint: make sure the top line of your code chunk says {r, error=TRUE}, or, as the first line inside your code chunk, you have #| error: true.

The solution was:

Make up something, anything, e.g. this:

t_test_wide(c(1,2), c(4,5,6))
Error in `tibble()`:
! Tibble columns must have compatible sizes.
• Size 2: Existing data.
• Size 3: Column at position 2.
ℹ Only values of size one are recycled.

If we input two columns of different lengths, we get an error. You should get the same error, possibly with different numbers in it (depending how long your input vectors were).

This is one of the new tidyverse error messages that tells you exactly what happened:

  • the error happened when we used tibble
  • the error was that all the columns in a dataframe must be the same length (with an exception that is mentioned later)
  • my first input had two numbers in it
  • my second input had three numbers in it
  • the exception is that if one of the inputs is of length 1 (has just one number in it), that one number will be repeated to make it the length of the other input, and in that case we won’t get an error:

tibble(x = 3, y = c(4,5,6))

The 3 has been repeated twice more to give us something of length 3, so there is no error here. This doesn’t make sense in the context of a \(t\)-test, but it’s not actually a code error.

The reason for the hint is that if you are not careful (and the hint tells you how to be careful), you will end up not being able to render your document (because of the error). The hint tells you how to let your document continue rendering if there’s an error, instead of bailing out at the first error it sees. A more Quarto way of writing the code chunk is to put #| error: true inside the code chunk as its first line. The effect is the same either way: your document renders, the error message is shown, but Quarto keeps going with the rest of it.

Aside: if you used data.frame above, you get a different error, but one with the same meaning:

data.frame(x = c(1,2), y = c(4, 5, 6))
Error in data.frame(x = c(1, 2), y = c(4, 5, 6)): arguments imply differing number of rows: 2, 3

The second part of the question is whether it makes sense to do a two-sample \(t\)-test with input columns of different lengths. The two-sample \(t\)-test works with two independent samples on two different groups of individuals, so there is no reason why the samples have to be the same size. (Think of our kids-learning-to-read example from class: the sample sizes there were 23 and 21, not the same.) So our function really ought to work for input vectors of different lengths.

I am not going to make you explore this further in the question, but if you want to think about how to do it: the problem is that tibble needs two vectors the same length, but logically the \(t\)-test does not. One way around this is to put NA values on the end of the shorter vector. You have probably run into length to find how many entries are in a vector:

u <- c(1, 2)
v <- c(4, 5, 6)
length(u)
[1] 2
length(v)
[1] 3

No surprises there. But you can also assign the length of a vector. Let’s do that in a way that we can use in our function:

n <- max(length(u), length(v))
n
[1] 3
length(u) <- n
length(v) <- n

What does it mean to “make u have length 3”, when we know it has length 2? Well, let’s see what happened to it:

u
[1]  1  2 NA

It is now of length 3, because it got an NA glued onto the end!

So does our function now work with this u and v?

t_test_wide(u, v)
[1] 0.02127341

It looks as if it does. Indeed, if you check it out, you’ll find that t.test ignores any missing values (according to the help, “Missing values are silently removed”), so this is a very smooth way around the problem of input vectors being of different lengths: put NAs on the end of the shorter one to make them the same length. We can put this at the top of our function, thus, with some comments to help future-me:

t_test_wide_q <- function(x, y) {
  # make x and y have the same length
  n <- max(length(x), length(y))
  length(x) <- n
  length(y) <- n
  # assemble into dataframe and run t-test
  tibble(x, y) %>% 
    pivot_longer(everything(), names_to = "group", values_to = "value") -> d
  t_d <- t.test(value ~ group, data = d)
  t_d$p.value
}

and then

t_test_wide_q(c(1, 2, 3, 4), c(4, 5, 6))
[1] 0.03464755

which works and looks plausible.

\(\blacksquare\)

  5. Modify your function to allow any inputs that t.test accepts. Demonstrate that your modified function works by obtaining the P-value for a pooled (two-sided) \(t\)-test for the test score data. (This should be very slightly different from the P-value you had before.)

Solution

This is the business about passing input arguments on elsewhere without having to intercept them one by one, using ...:

t_test_wide <- function(x, y, ...) {
  tibble(x, y) %>% 
    pivot_longer(everything(), names_to = "group", values_to = "value") -> d
  t.test(value ~ group, data = d, ...) %>% 
    tidy() %>% pluck("p.value")
}

Put the ... at the end of the inputs to the function, and again at the end of the inputs to t.test (since that’s where anything unrecognized needs to be sent). When you use ..., think about where the other input is coming from (the inputs to your function), and where it’s going to (t.test in this case).

To test:

t_test_wide(classroom, online, var.equal = TRUE)
[1] 0.3342512

Our function does not have an input var.equal, so it gets passed on to t.test which does. The P-value is only very slightly different to the one we had before, though it is not exactly the same, which suggests that the spreads of scores in each class are about the same:

t_test_wide(classroom, online)
[1] 0.3343957

and

sd(classroom)
[1] 4.163332
sd(online)
[1] 3.91578

Extra: having thought about equality of spreads, I now realize that I would like to have a function that draws a boxplot with input of the same form. But a boxplot requires longer data the same way that t.test does, and if I add a boxplot function, it would have to do pivot_longer as well, in just the same way that t_test_wide does. Having the exact same code in two different places is (i) wasteful, and (ii) dangerous, because if we later decide to change the way the pivot-longer works, we have to remember to change it in both places.

The solution is what coding people call “refactoring”: pull out the unit that our other functions have in common, and make that into a separate function of its own. I’ll call mine make_longer_df:

make_longer_df <- function(x, y) {
  tibble(x, y) %>% 
    pivot_longer(everything(), names_to = "group", values_to = "value")
}

Note that we don’t need to save the result now, because it gets returned to the outside world. To test:

make_longer_df(classroom, online)

That looks all right. Now, we rewrite our t_test_wide function to itself use what we just wrote:

t_test_wide <- function(x, y, ...) {
  d <- make_longer_df(x, y)
  t.test(value ~ group, data = d, ...) %>% 
    tidy() %>% pluck("p.value")
}

This has now simplified t_test_wide by abstracting away the business of exactly how you make a longer dataframe out of the input vectors: you can now read the function as “make a longer dataframe out of the input, and then run a t-test on that”. Exactly how the input is made into a longer dataframe is not important; if you want to know how it works, you look at the code in make_longer_df. This is rather like writing a report with an Appendix;6 if you have some details that are not critical to most people reading the report, you put them in an Appendix so as not to disrupt the reading flow of most of your readers, and anyone who wants to know the details can read the Appendix.

Let’s check that this works:

t_test_wide(classroom, online)
[1] 0.3343957

and so it does. Now to write a boxplot function, which is also going to use make_longer_df:

boxplot_wide <- function(x, y) {
  d <- make_longer_df(x, y)
  ggplot(d, aes(x = group, y = value)) + geom_boxplot()
}

To make the ggplot work, we had to peek back into the code for make_longer_df to recall what the two columns of the longer dataframe were called, so that we could use them in making the boxplot. Does it work?

boxplot_wide(classroom, online)

Yes, and the two groups do indeed have about the same spread.

If we decided that we wanted to do the pivot-longer in a different way (an example might be to allow the names for the new columns to be input as well, rather than calling them group and value always), we would modify make_longer_df as needed, and then both t_test_wide and boxplot_wide will work properly without needing to change either of them.7
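For example, a version of make_longer_df that lets the caller choose the column names might be something like this (a sketch, with group and value as the defaults so that the old behaviour is preserved):

make_longer_df <- function(x, y, names_to = "group", values_to = "value") {
  tibble(x, y) %>% 
    pivot_longer(everything(), names_to = names_to, values_to = values_to)
}

With the defaults, t_test_wide and boxplot_wide work unchanged; as the footnote says, if a caller chose different names, boxplot_wide would also need to be told about them.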

The point of writing a function is to turn a complicated task (like pivoting-longer a dataframe in a particular way) into something with a simple name, which not only simplifies the coding, but also simplifies thinking about the code, because now you only have to think about what a function does, and not about how it does it. This is how coders keep on top of a project with thousands of lines of code: lots of functions with simple names that do complicated tasks, and the higher-level functions (our t_test_wide and boxplot_wide) call the lower-level ones (our make_longer_df).

\(\blacksquare\)

Footnotes

  1. The P-values are in the last column.↩︎

  2. The reason you have to do it this way is that the first input to t.test is a “model formula” saying what depends on what, not a dataframe. If the dataframe were first, there would be no problems omitting the dataframe as usual, to mean “whatever came out of the previous step”.↩︎

  3. There is never actually an infinite amount of time to do something, so you do have to know enough to get reasonably close reasonably quickly. The time limit on assignments is there for that reason.↩︎

  4. It is actually an element of a list, but dataframes at their core are actually lists of columns, so the extraction process is the same.↩︎

  5. pluck pulls anything out of anything else; it doesn’t have to be a dataframe.↩︎

  6. Or my footnotes.↩︎

  7. This is actually not quite true in this case: boxplot_wide would also need to know the names of the columns in what I called d, so that would need tweaking. But the general idea is that changes at the lower level will not affect the higher level much if at all.↩︎