Worksheet 10

Published

March 28, 2024

Questions are below. My solutions are below all the question parts for a question; scroll down if you get stuck. There might be extra discussion below that for some of the questions; you might find that interesting to read, maybe after tutorial.

For these worksheets, you will learn the most by spending a few minutes thinking about how you would answer each question before you look at my solution. There are no grades attached to these worksheets, so feel free to guess: it makes no difference at all how wrong your initial guess is!

(Note: the “wide t” question uses the ... idea that I skipped over in lecture. You can skip that part, or read the lecture slides or my solutions and figure out how it goes. Either way, the Extra in my solution, about refactoring, is worth your while reading.)

1 Corn yields

Two varieties of corn, imaginatively labelled A and B, were randomly assigned to 8 plots of land. Four amounts of fertilizer were used, randomly assigned to plots so that each variety and each amount of fertilizer was tested in exactly one plot. The yield of corn (the amount grown per acre) in each plot was recorded. This is actually a randomized block design, but we are going to treat the amount of corn grown as quantitative. The data are in http://ritsokiguess.site/datafiles/cornyield.csv.

  1. Read in and display the data (this time, you should see it all).

  2. Make a suitable graph of the three variables. Do the relationships between fertilizer and yield look more or less linear for each variety?

  3. Fit a regression predicting yield from fertilizer and variety, and display the output.

  4. Interpret the Estimate values (ignoring for now that one of them is not significant).

  5. Can you remove either explanatory variable from your regression? Explain briefly.

  6. Redraw your graph with the regression lines on it. The model we fitted above assumes that the two lines have the same slope. Does that appear reasonable? Explain briefly.

  7. The effect of fertilizer seems to be different for the two varieties. This suggests adding an interaction to your model, which you can do by replacing + in your regression with *. Fit a model with interaction, and display the results. How do you know that this model fits much better than your previous one?

Corn yields: my solutions

Two varieties of corn, imaginatively labelled A and B, were randomly assigned to 8 plots of land. Four amounts of fertilizer were used, randomly assigned to plots so that each variety and each amount of fertilizer was tested in exactly one plot. The yield of corn (the amount grown per acre) in each plot was recorded. This is actually a randomized block design, but we are going to treat the amount of corn grown as quantitative. The data are in http://ritsokiguess.site/datafiles/cornyield.csv.

(a) Read in and display the data (this time, you should see it all).

As expected:

my_url <- "http://ritsokiguess.site/datafiles/cornyield.csv"
cornyield <- read_csv(my_url)
Rows: 8 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): variety
dbl (2): yield, fertilizer

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cornyield

The three columns as promised (and you can check that each variety-fertilizer combination appears exactly once).

(b) Make a suitable graph of the three variables. Do the relationships between fertilizer and yield look more or less linear for each variety?

Two quantitative variables, of which yield is the response, and one categorical variable, variety, which you can use colour for:

ggplot(cornyield, aes(x = fertilizer, y = yield, colour = variety)) + geom_point()

The relationship between yield and fertilizer looks more or less linear for each variety, and for these amounts of fertilizer, more fertilizer means more corn. Add the regression lines to your graph if you like:

ggplot(cornyield, aes(x = fertilizer, y = yield, colour = variety)) +
  geom_point() + geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

What you see now is that the lines fit very well (as much as you can detect this with only four points for each line), but they have different slopes. This will have an impact later on.

(c) Fit a regression predicting yield from fertilizer and variety, and display the output.

Fitting the regression has no surprises:

cornyield.1 <- lm(yield ~ fertilizer + variety, data = cornyield)
summary(cornyield.1)

Call:
lm(formula = yield ~ fertilizer + variety, data = cornyield)

Residuals:
     1      2      3      4      5      6      7      8 
-2.275 -0.925  0.425  2.775  1.975  1.325 -0.325 -2.975 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 131.6250     2.2343  58.912 2.67e-08 ***
fertilizer    0.9300     0.1511   6.156  0.00164 ** 
varietyB     -0.2500     1.6889  -0.148  0.88811    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.389 on 5 degrees of freedom
Multiple R-squared:  0.8835,    Adjusted R-squared:  0.8369 
F-statistic: 18.96 on 2 and 5 DF,  p-value: 0.004631

(d) Interpret the Estimate values (ignoring for now that one of them is not significant).

The Estimate for fertilizer is a regular regression slope. Its value is 0.93, meaning that if the amount of fertilizer is increased by 1, the yield is increased by 0.93. To confirm this, look back at your graph; the fertilizer increases in steps of 5, and the yield increases by either a bit more or a bit less than 5 each time, depending on which variety you are looking at. So a slope of close to 1 makes sense. This is, according to this model, the same for both varieties, which the graph doesn’t really support, but that’s the interpretation in this model.

variety is a categorical variable, so the Estimate says how much the predicted yield changes for the variety shown compared to the baseline variety A. This one says that the yield for variety B is 0.25 less on average than for variety A. It is not clear from the graph that this is meaningful; sometimes variety A is better (with more fertilizer) and sometimes it is worse (with less fertilizer.) We address this shortly.

I didn’t actually interpret the intercept (or ask you not to). This is the predicted yield when fertilizer is zero, as you would expect, but for the baseline variety A. (Don’t look back at your graph yet, because that will confuse you.) The predicted yield for fertilizer 0 and variety B is less than that by 0.25 (the Estimate for variety B). Because we are assuming that the slopes for both varieties are the same, this says that B is predicted to be less than A by 0.25 all the way along.
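The "less by 0.25 all the way along" claim can be checked directly from the Estimate column. This is just a sketch, not part of the solution, and the fertilizer amounts below are invented for illustration (the question only says they go up in steps of 5):

```r
# Reproduce the additive model's predictions by hand from the
# Estimates in the summary output above.
b0 <- 131.6250   # intercept: predicted yield for variety A at fertilizer 0
b1 <- 0.9300     # common slope for fertilizer
b2 <- -0.2500    # shift for variety B relative to baseline A
fert <- c(0, 5, 10, 15)        # hypothetical amounts, in steps of 5
pred_A <- b0 + b1 * fert       # line for variety A
pred_B <- b0 + b1 * fert + b2  # line for variety B: same slope, shifted
pred_A - pred_B                # a constant 0.25 all the way along
```

The gap between the two predicted lines is the same at every fertilizer amount, which is exactly what "parallel lines" means.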

Extra: the graph is telling a different story than all this: it says that the two lines have different slopes. But we don’t know how to fit this yet.


(e) Can you remove either explanatory variable from your regression? Explain briefly.

One of our explanatory variables is categorical (variety), so we have to think about whether we can remove it as a whole, and thus we should use drop1:

drop1(cornyield.1, test = "F")

There is a significant effect of fertilizer, but not of variety, so the latter could be removed. (We won’t do that, because we have a better analysis coming later.)

You’ll see that the P-value for variety here is the same as in the summary output. This is because there were only two varieties, and so testing for an overall effect of variety was the same thing as testing whether variety B differed from variety A. If we had had three varieties, it would have been a different story, though, and only drop1 would have given us the result we wanted. My suggestion is to get in the habit of using drop1 if your regression has categorical explanatory variables.
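To see why drop1 is the safer habit, here is a sketch on made-up data (the grouping g and response y below are invented purely for illustration). With three levels, summary would give a separate t-test for each non-baseline level, but drop1 tests the factor as a whole with a single F-test:

```r
library(tidyverse)

# Made-up data: a three-level factor g and a response y.
set.seed(1)
toy <- tibble(g = rep(c("A", "B", "C"), each = 4),
              y = rnorm(12))
toy.1 <- lm(y ~ g, data = toy)
drop1(toy.1, test = "F")   # one line for g: the overall test of the factor
```

The drop1 table has one line for g, rather than one line per non-baseline level, which is the test you actually want when deciding whether the factor as a whole can go.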

(f) Redraw your graph with the regression lines on it. The model we fitted above assumes that the two lines have the same slope. Does that appear reasonable? Explain briefly.

I already drew that graph, copied and pasted here:

ggplot(cornyield, aes(x = fertilizer, y = yield, colour = variety)) +
  geom_point() + geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

The slope for variety A seems larger than the one for variety B, and the points are close enough to the lines to suggest that this difference is going to be significant.

(g) The effect of fertilizer seems to be different for the two varieties. This suggests adding an interaction to your model, which you can do by replacing + in your regression with *. Fit a model with interaction, and display the results. How do you know that this model fits much better than your previous one?

Like this:

cornyield.2 <- lm(yield ~ fertilizer * variety, data = cornyield)
summary(cornyield.2)

Call:
lm(formula = yield ~ fertilizer * variety, data = cornyield)

Residuals:
   1    2    3    4    5    6    7    8 
 0.2 -0.1 -0.4  0.3 -0.5  0.5  0.5 -0.5 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         127.50000    0.69821 182.609 5.39e-09 ***
fertilizer            1.26000    0.05099  24.711 1.59e-05 ***
varietyB              8.00000    0.98742   8.102 0.001262 ** 
fertilizer:varietyB  -0.66000    0.07211  -9.153 0.000791 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5701 on 4 degrees of freedom
Multiple R-squared:  0.9947,    Adjusted R-squared:  0.9907 
F-statistic: 249.8 on 3 and 4 DF,  p-value: 5.275e-05

The term with a : in its name is the interaction.

We know that this model fits much better because the R-squared has gone way up, from 88% to over 99% (and thus this model fits almost perfectly).

You could also see this by testing the interaction term, which you would do again with drop1:

drop1(cornyield.2, test = "F")

This is again safe (because variety is categorical), though the P-value is the same as the one in the summary table (because there are only two varieties; if there were three, it would again be different).

Extra thoughts:

The Estimate for the interaction term tells you how much the slopes are different for the two varieties. Specifically, the value 1.26 for fertilizer says that the slope of the line for the baseline variety A is 1.26. The Estimate for fertilizer:varietyB, -0.66, says how much the slope changes as you go from the baseline variety A to variety B: it goes from 1.26 to \(1.26 - 0.66 = 0.60\), reflecting the idea that the line for variety B is much less steep compared to the one for variety A.

Likewise, the Intercept (127.5) is actually the intercept of the line for variety A (baseline), and the Estimate for varietyB, 8, is the change in intercept as you go from variety A to B (so the intercept for B is \(127.5 + 8 = 135.5\)). This confirms what you saw in the graph of (f): the slope for variety A is larger, but the intercept is larger for variety B.
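A quick sketch, not part of the solution, that turns the Estimates from the interaction model above into one (intercept, slope) pair per variety:

```r
# Estimates from the interaction model's summary output above.
b0  <- 127.50   # intercept for baseline variety A
b1  <- 1.26     # fertilizer slope for baseline variety A
b2  <- 8.00     # change in intercept going from A to B
b12 <- -0.66    # change in slope going from A to B (the interaction)
c(intercept_A = b0,      slope_A = b1)
c(intercept_B = b0 + b2, slope_B = b1 + b12)
```

Variety B's line starts higher (135.5 vs 127.5) but rises more slowly (0.60 vs 1.26), which is the crossing-lines picture in the graph.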

I gave the drop1 output for this model just above. Note that this only includes the interaction term, because in a model with an interaction term, you have to test the interaction first: if it is significant, as here, that is your finding. (To understand it, you go back to your graph: the lines have significantly different slopes. This is unlike the crickets example in lecture, where the two lines had almost identical slopes.)

Thinking of what this dataset is actually about, my intuition for the way fertilizer works is that there ought to be an optimal amount to use (such that the yield is maximum). This might show up as something like a parabola, where the yield goes down again if you apply too much fertilizer, or it might be that the yield stays the same once a certain amount of fertilizer is reached, and more fertilizer does not increase the yield (so adding more is a waste rather than being actually harmful). This is something we could imagine consulting an expert about. In any case, for the data we have, that maximum has by no means been reached yet. I suspect that the reason these data came to us is that separate straight lines fit very well and it’s a nice example for regression with categorical variables.

2 Greeting people at a conference

You are volunteering for an R conference, and one of your tasks is to write an R function that greets a person and asks them whether they come from a certain place. The function will take two inputs, a name and a place. For example, if the name is Professor and the place is England, the function needs to return Hello Professor, are you from England? One of the conference organizers tells you about the paste function, which accepts any number of pieces of text, or variables containing text, and returns all those pieces of text in one string, with spaces in between, like this:

day <- "Tuesday"
paste("Happy", day)
[1] "Happy Tuesday"
  1. Write a function called greet to greet a person as described in the question.

  2. How would you use your function to greet someone you think is from Montreal called Celine?

  3. What happens when you run this code? Explain briefly why what happened actually happened.

greet(place = "Beijing", name = "Ding")
  4. Suppose you have a lot of people at the conference who come from Toronto, and you don’t want to have to specify that every time. How can you modify your function to save yourself some work when running it? Test your revised function by greeting Drake.

  5. The conference organizer tells you that there are a lot of people coming who are all from Toronto. Their names are in a dataframe called people, as defined below (you should be able to copy-paste the two lines of code). How can you create a second column called greeting in this dataframe, containing an appropriate greeting for each of these people, without using a loop?

names <- c("Mahinda", "Sotirios", "Sohee", "Mike", "Anya", "Paco", "Nick")
people <- tibble(person = names)

Greeting people at a conference - my solutions

You are volunteering for an R conference, and one of your tasks is to write an R function that greets a person and asks them whether they come from a certain place. The function will take two inputs, a name and a place. For example, if the name is Professor and the place is England, the function needs to return Hello Professor, are you from England? One of the conference organizers tells you about the paste function, which accepts any number of pieces of text, or variables containing text, and returns all those pieces of text in one string, with spaces in between, like this:

day <- "Tuesday"
paste("Happy", day)
[1] "Happy Tuesday"
(a) Write a function called greet to greet a person as described in the question.

Solution

Make sure you get all the punctuation, but otherwise it’s a one-liner:

greet <- function(name, place) {
  paste("Hello", name, ",", "are you from", place, "?")
}

Have the name and place the other way around if you prefer. Stick to those names, though, or else (c) won’t make much sense.
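Not required by the question, but if the spaces before the comma and question mark bother you, paste0 (which joins with no separator at all) lets you place the punctuation exactly. The name greet2 here is mine, to keep it separate from the answer above:

```r
# A variant of greet with tidier punctuation, using paste0:
# paste0 joins its pieces with nothing in between, so we supply
# the spaces ourselves where we want them.
greet2 <- function(name, place) {
  paste0("Hello ", name, ", are you from ", place, "?")
}
greet2("Celine", "Montreal")
```

This returns "Hello Celine, are you from Montreal?" with no stray spaces. For this question, though, the plain paste version is fine.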

\(\blacksquare\)

(b) How would you use your function to greet someone you think is from Montreal called Celine?

Solution

Name first, then place, so

greet("Celine", "Montreal")
[1] "Hello Celine , are you from Montreal ?"

\(\blacksquare\)

(c) What happens when you run this code? Explain briefly why what happened actually happened.
greet(place = "Beijing", name = "Ding")

Solution

Try it and see. It will work:

greet(place = "Beijing", name = "Ding")
[1] "Hello Ding , are you from Beijing ?"

This is because named inputs can be in any order: R will match them by their name rather than their position.

(If you called the inputs something other than name and place, then this won’t work. Match it up to whatever you called the inputs, though, and it will.)
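A small sketch (not part of the solution) of how R's argument matching works: named inputs are matched by name first, and any unnamed inputs fill the remaining slots in order. The throwaway function f below is mine:

```r
# R matches named arguments by name, then fills the rest by position.
f <- function(name, place) paste(name, place)
f("Ding", "Beijing")                  # by position
f(place = "Beijing", name = "Ding")   # by name, in any order
f("Ding", place = "Beijing")          # mixed: "Ding" fills name
```

All three calls produce the same result, "Ding Beijing".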

\(\blacksquare\)

(d) Suppose you have a lot of people at the conference who come from Toronto, and you don’t want to have to specify that every time. How can you modify your function to save yourself some work when running it? Test your revised function by greeting Drake.

Solution

Give the place input a default of Toronto, thus:

greet <- function(name, place = "Toronto") {
  paste("Hello", name, ",", "are you from", place, "?")
}

and hence, without specifying a place:

greet("Drake")
[1] "Hello Drake , are you from Toronto ?"

\(\blacksquare\)

(e) The conference organizer tells you that there are a lot of people coming who are all from Toronto. Their names are in a dataframe called people, as defined below (you should be able to copy-paste the two lines of code). How can you create a second column called greeting in this dataframe, containing an appropriate greeting for each of these people, without using a loop?
names <- c("Mahinda", "Sotirios", "Sohee", "Mike", "Anya", "Paco", "Nick")
people <- tibble(person = names)

Solution

The way I showed you (or will show you, depending on when you read this) is mutate with a map. The appropriate one here is map_chr, because greet returns text:

people %>% 
  mutate(greeting = map_chr(person, \(p) greet(p)))

The map can be read in English as “for each thing in person, run greet on it”, where p stands for “it”: the person you are looking at at the moment.

Another way that will work is to use rowwise, which will work out the greeting for each person one at a time and then glue them back together:

people %>% 
  rowwise() %>% 
  mutate(greeting = greet(person))

I like the code for rowwise better, but map is more general: it will work on the original names like this, read as “for each person in names, run greet on them”:

map_chr(names, \(name) greet(name))
[1] "Hello Mahinda , are you from Toronto ?" 
[2] "Hello Sotirios , are you from Toronto ?"
[3] "Hello Sohee , are you from Toronto ?"   
[4] "Hello Mike , are you from Toronto ?"    
[5] "Hello Anya , are you from Toronto ?"    
[6] "Hello Paco , are you from Toronto ?"    
[7] "Hello Nick , are you from Toronto ?"    

What it returns is a thing like names: a vector, rather than a dataframe column as above.

\(\blacksquare\)

3 Writing a function to do wide \(t\)-test

The way we know how to run a two-sample \(t\)-test is to arrange the data “long”: that is, to have one column with all the data in it, and a second column saying which of the two groups each data value belongs to. However, sometimes we get data in two separate columns, and we would like to make a function that will run the \(t\)-test as we know how to do it, but on data in this form.

As an example, suppose we have data on student test scores for some students who took a course online (asynchronously), and some other students who took the same course in-person. All the students completed an assignment afterwards to see how much they had learned. (The assignment is out of 40.) The students who took the course online scored 32, 37, 35, 28; the students who took the course in-person scored 35, 31, 29, 25. (There were actually a lot more students than this, but these will do to build our function with.)

  1. Enter these data into two vectors called online and classroom respectively.

  2. Using the two vectors you just made, make a dataframe with those two vectors as columns, and rearrange it to be a suitable input for t.test. (Hint: you might find everything() useful in your rearrangement). When you have everything suitably rearranged, run a two-sample \(t\)-test (two-sided, Welch) as we did it in lecture. What P-value do you get?

  3. Write a function that takes two vectors as input. Call them x and y. The function should run a two-sample (two-sided, Welch) \(t\)-test on the two vectors as input, running the \(t\)-test in the way we learned in lecture, and return all the output from the \(t\)-test. Test your function on the same data you used earlier, and verify that you get the same P-value.

  4. Modify your function to return just the P-value from the \(t\)-test, as a number. Test your modified function and demonstrate that it does indeed return only the P-value (and not the rest of the \(t\)-test output).

  5. Modify your function to allow any inputs that t.test accepts. Demonstrate that your modified function works by obtaining the P-value for a pooled (two-sided) \(t\)-test for the test score data. (This should be very slightly different from the P-value you had before.)

Writing a function to do wide \(t\)-test: my solutions

The way we know how to run a two-sample \(t\)-test is to arrange the data “long”: that is, to have one column with all the data in it, and a second column saying which of the two groups each data value belongs to. However, sometimes we get data in two separate columns, and we would like to make a function that will run the \(t\)-test as we know how to do it, but on data in this form.

As an example, suppose we have data on student test scores for some students who took a course online (asynchronously), and some other students who took the same course in-person. All the students completed an assignment afterwards to see how much they had learned. (The assignment is out of 40.) The students who took the course online scored 32, 37, 35, 28; the students who took the course in-person scored 35, 31, 29, 25. (There were actually a lot more students than this, but these will do to build our function with.)

(a) Enter these data into two vectors called online and classroom respectively.

Solution

Use c:

online <- c(32, 37, 35, 28)
classroom <- c(35, 31, 29, 25)
online
[1] 32 37 35 28
classroom
[1] 35 31 29 25

If you are careful you can copy and paste my values rather than typing them in again.

If for some reason c doesn’t come to mind, put them in a dataframe with tribble:

thing <- tribble(
  ~online, ~classroom,
  32, 35,
  37, 31,
  35, 29,
  28, 25
)
thing

This seems to require you to type the values again, because tribble lays the data out in a different shape. (Or, type the data values into a file, and then read that in. Find a way.)

The question asked for vectors not a dataframe, though, so if you go this way, pull the columns out and save them. You could also use pull (or pluck) here:

online <- thing$online
online
[1] 32 37 35 28
classroom <- thing$classroom
classroom
[1] 35 31 29 25

The function we are about to write needs vectors as input, so we need to produce vectors somehow here. It may seem backwards to do it this way, since in the next part we make a dataframe, but the point is that the function we write will take vectors as input and so we need to work with that. (If you were consulting with someone, they would say something like “this is what our other process produces, and we need you to write a function to turn this into that”.)

\(\blacksquare\)

(b) Using the two vectors you just made, make a dataframe with those two vectors as columns, and rearrange it to be a suitable input for t.test. (Hint: you might find everything() useful in your rearrangement). When you have everything suitably rearranged, run a two-sample \(t\)-test (two-sided, Welch) as we did it in lecture. What P-value do you get?

Solution

tibble to make the dataframe, and pivot_longer to rearrange the values (they start out as “wide” and you want to make them longer):

tibble(online, classroom) %>% 
  pivot_longer(everything(), names_to = "location", values_to = "score") -> d
d

This is the right layout now. And now to run the \(t\)-test on it. Two-sided and Welch are both the defaults, so we don’t need any extra input (I am making it easier for you):

t.test(score ~ location, data = d)

    Welch Two Sample t-test

data:  score by location
t = -1.0498, df = 5.9776, p-value = 0.3344
alternative hypothesis: true difference in means between group classroom and group online is not equal to 0
95 percent confidence interval:
 -9.998992  3.998992
sample estimates:
mean in group classroom    mean in group online 
                     30                      33 

The P-value is 0.3344. From these tiny samples, it is not at all impressive, but that’s not the point: it looks like the right sort of P-value from data like this.

The reason for the everything() is that it will pivot things longer no matter what they are called (or however many of them there are, but in this question there will only be two columns), and that will be helpful when you write a function and your columns have different names than here: you can still use everything().
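To see the point, here is a sketch (not part of the solution) where the columns have completely different names; the vectors a and b are made up for illustration, and the pivot_longer line is unchanged:

```r
library(tidyverse)

# The same pivot_longer call works whatever the columns are called,
# because everything() selects all the columns by position, not by name.
a <- c(1, 2)
b <- c(3, 4)
tibble(a, b) %>%
  pivot_longer(everything(), names_to = "group", values_to = "value")
```

The result has one row per value with the former column names in group, which is exactly the long layout t.test wants.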

It’s easier to save the output from pivot_longer and use that as input to the \(t\)-test (you will probably realize, as I did, that it would be helpful to call it something), but you can also do it all in one shot, by using the notation . for “the thing that came out of the previous step”:

tibble(online, classroom) %>% 
  pivot_longer(everything(), names_to = "location", values_to = "score") %>% 
  t.test(score ~ location, data = .)

    Welch Two Sample t-test

data:  score by location
t = -1.0498, df = 5.9776, p-value = 0.3344
alternative hypothesis: true difference in means between group classroom and group online is not equal to 0
95 percent confidence interval:
 -9.998992  3.998992
sample estimates:
mean in group classroom    mean in group online 
                     30                      33 

The result is the same either way. But unless you remember what the “dot” thing does, it is easier to save the output from pivot_longer in something, and use that something as input to t.test.

Extra: instead of using tibble, you could use the base R data.frame, but I haven’t taught you that, so if you use it, you must (if this were on an assignment) say where you got it from.

\(\blacksquare\)

(c) Write a function that takes two vectors as input. Call them x and y. The function should run a two-sample (two-sided, Welch) \(t\)-test on the two vectors as input, running the \(t\)-test in the way we learned in lecture, and return all the output from the \(t\)-test. Test your function on the same data you used earlier, and verify that you get the same P-value.

Solution

The reason I had you do the previous part was so that you have written everything that you need to put into your function. (This is a good strategy: get it working in ordinary code first. Then, when you write your function, there’s less to go wrong.)

t_test_wide <- function(x, y) {
  tibble(x, y) %>% 
    pivot_longer(everything(), names_to = "group", values_to = "value") -> d
  t.test(value ~ group, data = d)
}

Change your inputs to tibble to be the inputs to the function, x and y. It’s better also to change the names_to and values_to to something more generic than location and score (or whatever you used before), because the function you are writing will apply to any two columns of input, not just test scores in two different classes, and future-you will appreciate knowing the roles of things in your function, ready for when future-you needs to change something about how the function works.

Also, whatever names you put in names_to and values_to need to be in t.test as well, otherwise when you come to test it (in a moment) you won’t get the same results as you had before.

R style, and thus best practice here, is to have the last line of the function be the value that is returned from the function (here, the output from t.test, all of it). There is no need to wrap it in return; it works (unlike in Python) to do it without. Also, if you are thinking about Python, you might be wondering about the indentation. Python requires the indentation, but R doesn’t (because of the curly brackets around the function). Having said that, though, it’s good style to do the indentation as I did it (and also benefits future-you when you come back to your code in six months and have forgotten how you did it): indent each line of the function once (two spaces), and indent the pivot_longer part of the pipeline once more to make it clear that it belongs to the pipeline and the t.test does not.
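The last-line-is-the-value point can be seen in a tiny sketch (the function add_five below is mine, just for illustration):

```r
# A function's value is its last evaluated expression,
# so no explicit return() is needed (though return() also works).
add_five <- function(x) {
  x + 5
}
add_five(2)
```

This returns 7, the value of the last expression evaluated inside the function.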

To test, use classroom and online as the two inputs to the function. They can be either way around, since the test is two-sided:

t_test_wide(classroom, online)

    Welch Two Sample t-test

data:  value by group
t = -1.0498, df = 5.9776, p-value = 0.3344
alternative hypothesis: true difference in means between group x and group y is not equal to 0
95 percent confidence interval:
 -9.998992  3.998992
sample estimates:
mean in group x mean in group y 
             30              33 

The same P-value indeed, and everything else the same also.

If the P-value is not the same (or you get an error from your function), you will need to go back and fix up your function until it works. Debugging is a big part of coding. Most of us don’t get things right the first time. Within limits, it doesn’t matter how long it takes, as long as you get there by the time your work is due.

Extra: this, perhaps surprisingly, does in fact work:

t.test(online, classroom)

    Welch Two Sample t-test

data:  online and classroom
t = 1.0498, df = 5.9776, p-value = 0.3344
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.998992  9.998992
sample estimates:
mean of x mean of y 
       33        30 

but I don’t want you to do it this way. (If you find out about it, you can use it as a way to check your result, but it is not acceptable as part of an answer to this question.)

\(\blacksquare\)

(d) Modify your function to return just the P-value from the \(t\)-test, as a number. Test your modified function and demonstrate that it does indeed return only the P-value (and not the rest of the \(t\)-test output).

Solution

This means finding out how to get just the P-value out of the t.test output. It’s there on the output, but how do we get it out?

There are two approaches I would consider.

One is to find out all the things that are output from t.test. To do that, run your initial t.test again, but this time save the result:

d_t <- t.test(score ~ location, data = d)

and then take a look at the names of the result:

names(d_t)
 [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
 [6] "null.value"  "stderr"      "alternative" "method"      "data.name"  

The third of those is the one we want. Get it out with a dollar sign, as if it were a column of a dataframe:

d_t$p.value
[1] 0.3343957

This, I think, also works (in a rather more tidyverse fashion):

d_t %>% pluck("p.value")
[1] 0.3343957

The function names has quietly shown up a couple of times in this course (once in combination with power.t.test, and once when I showed you what augment did in regression).

I said there was a second way. We used this idea in regression, but you might think to use it in a \(t\)-test context to see what it does. The idea comes from broom: what do glance and tidy do in the context of a \(t\)-test? Let’s find out:

glance(d_t)

and

tidy(d_t)

They are actually identical, and they both have a thing called p.value that is what we want. Another way to run these is in a pipeline, piping the output from t.test into (say) tidy, thus:

t.test(score ~ location, data = d) %>% 
  tidy()

This is a dataframe (the reason why broom is useful), so get just the P-value out of this the way you would get anything out of a dataframe, with a dollar sign (which looks a little odd):

t.test(score ~ location, data = d) %>% 
  tidy() %>% .$p.value
[1] 0.3343957

or with pluck, or for that matter pull:

t.test(score ~ location, data = d) %>% 
  tidy() %>% pluck("p.value")
[1] 0.3343957

select doesn’t quite work here, because it gives you back a dataframe and we want a number.
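A sketch of that select-versus-pull distinction, using the same scores as above (this is an illustration, not part of the answer):

```r
library(tidyverse)
library(broom)

# Rebuild the tidied t-test output, then extract p.value two ways.
online <- c(32, 37, 35, 28)
classroom <- c(35, 31, 29, 25)
tibble(online, classroom) %>%
  pivot_longer(everything(), names_to = "group", values_to = "value") -> d
t.test(value ~ group, data = d) %>% tidy() -> t_tidy
t_tidy %>% select(p.value)   # a 1x1 dataframe, not a number
t_tidy %>% pull(p.value)     # a plain number: what we want
```

select hands back a one-row, one-column dataframe, while pull (or pluck) reaches inside and gives you the number itself.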

I also need to talk some more about the dollar sign in pipeline thing. The tidy() produces a dataframe, and we can use the “dot” to refer to “the dataframe that came out of the previous step” (the output from the tidy). So the odd code .$p.value means, in English, “the column called p.value in the dataframe that came out of the previous step”.

All right, with all of that in mind, there are (at least) two ways to modify your function to return just the P-value.

The first is to use the dollar-sign plus p.value (that we found using names):

t_test_wide <- function(x, y) {
  tibble(x, y) %>% 
    pivot_longer(everything(), names_to = "group", values_to = "value") -> d
  t_d <- t.test(value ~ group, data = d)
  t_d$p.value
}

The second is to use tidy (or glance) from broom, something like this:

t_test_wide_x <- function(x, y) {
  tibble(x, y) %>% 
    pivot_longer(everything(), names_to = "group", values_to = "value") -> d
  t.test(value ~ group, data = d) %>% 
    tidy() %>% pluck("p.value")
}

I have two functions, so I have two tests (but you will only have one):

t_test_wide(classroom, online)
[1] 0.3343957
t_test_wide_x(classroom, online)
[1] 0.3343957

Success.

Extra (there is one more part to the actual question, so you can skip to that and read this later):

I originally wanted you to investigate what happens if you input two vectors of different lengths to your function. (I took that part out because the question was then too long and complicated.) It continued: Explain briefly what happened. Does it still make sense to do a \(t\)-test with input vectors of different lengths? Explain briefly. Hint: make sure the top line of your code chunk says {r, error=TRUE}, or, as the first line inside your code chunk, you have #| error: true.

The solution was:

Make up something, anything, e.g. this:

t_test_wide(c(1,2), c(4,5,6))
Error in `tibble()`:
! Tibble columns must have compatible sizes.
• Size 2: Existing data.
• Size 3: Column at position 2.
ℹ Only values of size one are recycled.

If we input two vectors of different lengths, we get an error. You should get the same error, possibly with different numbers in it (depending on how long your input vectors were).

This is one of the new tidyverse error messages that tells you exactly what happened:

  • the error happened when we used tibble
  • the error was that all the columns in a dataframe must be the same length (with an exception that is mentioned later)
  • my first input had two numbers in it
  • my second input had three numbers in it
  • the exception is that if one of the inputs is of length 1 (has just one number in it), that one number will be repeated to make it the length of the other input, and in that case we won’t get an error:
tibble(x = 3, y = c(4,5,6))

The 3 has been repeated twice more to give us something of length 3, so there is no error here. This doesn’t make sense in the context of a t-test, but it’s not actually a code error.

The reason for the hint is that if you are not careful (and the hint tells you how to be careful), you will not be able to render your document (because of the error). The hint tells you how to allow your document to keep rendering when there is an error, rather than bailing out at the first error it sees. A more Quarto-like way of writing the code chunk is to put #| error: true inside the code chunk as its first line. The effect is the same either way: your document renders, the error message is shown, and Quarto keeps going with the rest of it.
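For concreteness, the whole code chunk, with the chunk option as its first line, would look something like this:

#| error: true
t_test_wide(c(1, 2), c(4, 5, 6))

The #| error: true line is read by Quarto rather than by R, and the rest of the chunk then runs as before: the error message is displayed, and the render continues.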

Aside: if you used data.frame above, you get a different error, but one with the same meaning:

data.frame(x = c(1,2), y = c(4, 5, 6))
Error in data.frame(x = c(1, 2), y = c(4, 5, 6)): arguments imply differing number of rows: 2, 3

The second part of the question is whether it makes sense to do a two-sample \(t\)-test with input columns of different lengths. The two-sample \(t\)-test works with two independent samples on two different groups of individuals, so there is no reason why the samples have to be the same size. (Think of our kids learning to read in class: the sample sizes there were 23 and 21, not the same.) So our function really ought to work for input vectors of different lengths.

I am not going to make you explore this further in the question, but if you want to think about how to do it: the problem is that tibble needs two vectors the same length, but logically the \(t\)-test does not. One way around this is to put NA values on the end of the shorter vector. You have probably run into length to find how many entries are in a vector:

u <- c(1, 2)
v <- c(4, 5, 6)
length(u)
[1] 2
length(v)
[1] 3

No surprises there. But you can also assign the length of a vector. Let’s do that in a way that we can use in our function:

n <- max(length(u), length(v))
n
[1] 3
length(u) <- n
length(v) <- n

What does it mean to “make u have length 3”, when we know it has length 2? Well, let’s see what happened to it:

u
[1]  1  2 NA

It is now of length 3, because it got an NA glued onto the end!

So does our function now work with this u and v?

t_test_wide(u, v)
[1] 0.02127341

It looks as if it does. Indeed, if you check it out, you’ll find that t.test ignores any missing values (according to the help, “Missing values are silently removed”), so this is a very smooth way around the problem of input vectors being of different lengths: put NAs on the end of the shorter one to make them the same length. We can put this at the top of our function, thus, with some comments to help future-me:

t_test_wide_q <- function(x, y) {
  # make x and y have the same length
  n <- max(length(x), length(y))
  length(x) <- n
  length(y) <- n
  # assemble into dataframe and run t-test
  tibble(x, y) %>% 
    pivot_longer(everything(), names_to = "group", values_to = "value") -> d
  t_d <- t.test(value ~ group, data = d)
  t_d$p.value
}

and then

t_test_wide_q(c(1, 2, 3, 4), c(4, 5, 6))
[1] 0.03464755

which works and looks plausible.

\(\blacksquare\)

  1. Modify your function to allow any inputs that t.test accepts. Demonstrate that your modified function works by obtaining the P-value for a pooled (two-sided) \(t\)-test for the test score data. (This should be very slightly different from the P-value you had before.)

Solution

This is the business about passing input arguments on elsewhere without having to intercept them one by one, using ...:

t_test_wide <- function(x, y, ...) {
  tibble(x, y) %>% 
    pivot_longer(everything(), names_to = "group", values_to = "value") -> d
  t.test(value ~ group, data = d, ...) %>% 
    tidy() %>% pluck("p.value")
}

Put the ... at the end of the inputs to the function, and again at the end of the inputs to t.test (since that’s where anything unrecognized needs to be sent). When you use ..., think about where the other input is coming from (the inputs to your function), and where it’s going to (t.test in this case).

To test:

t_test_wide(classroom, online, var.equal = TRUE)
[1] 0.3342512

Our function does not have an input var.equal, so it gets passed on to t.test which does. The P-value is only very slightly different to the one we had before, though it is not exactly the same, which suggests that the spreads of scores in each class are about the same:

t_test_wide(classroom, online)
[1] 0.3343957

and

sd(classroom)
[1] 4.163332
sd(online)
[1] 3.91578

Extra: having thought about equality of spreads, I now realize that I would like to have a function that draws a boxplot with input of the same form. But a boxplot requires longer data the same way that t.test does, and if I add a boxplot function, it would have to do pivot_longer as well, in just the same way that t_test_wide does. Having the exact same code in two different places is (i) wasteful, and (ii) dangerous, because if we later decide to change the way the pivot-longer works, we have to remember to change it in both places.

The solution is what coding people call “refactoring”: pull out the unit that our other functions have in common, and make that into a separate function of its own. I’ll call mine make_longer_df:

make_longer_df <- function(x, y) {
  tibble(x, y) %>% 
    pivot_longer(everything(), names_to = "group", values_to = "value")
}

Note that we don’t need to save the result now, because it gets returned to the outside world. To test:

make_longer_df(classroom, online)

That looks all right. Now, we rewrite our t_test_wide function to itself use what we just wrote:

t_test_wide <- function(x, y, ...) {
  d <- make_longer_df(x, y)
  t.test(value ~ group, data = d, ...) %>% 
    tidy() %>% pluck("p.value")
}

This has now simplified t_test_wide by abstracting away the business of exactly how you make a longer dataframe out of the input vectors: you can now read the function as “make a longer dataframe out of the input, and then run a t-test on that”. Exactly how the input is made into a longer dataframe is not important; if you want to know how it works, you look at the code in make_longer_df. This is rather like writing a report with an Appendix;5 if you have some details that are not critical to most people reading the report, you put them in an Appendix so as not to disrupt the reading flow of most of your readers, and anyone who wants to know the details can read the Appendix.

Let’s check that this works:

t_test_wide(classroom, online)
[1] 0.3343957

and so it does. Now to write a boxplot function, which is also going to use make_longer_df:

boxplot_wide <- function(x, y) {
  d <- make_longer_df(x, y)
  ggplot(d, aes(x = group, y = value)) + geom_boxplot()
}

To make the ggplot work, we had to peek back into the code for make_longer_df to recall what the two columns of the longer dataframe were called, so that we could use them in making the boxplot. Does it work?

boxplot_wide(classroom, online)

Yes, and the two groups do indeed have about the same spread.

If we decided that we wanted to do the pivot-longer in a different way (an example might be to allow the names for the new columns to be input as well, rather than calling them group and value always), we would modify make_longer_df as needed, and then both t_test_wide and boxplot_wide will work properly without needing to change either of them.6
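Here is a sketch of how that modification might go (the extra inputs names_to and values_to are my invention; they default to the names we used before, so t_test_wide and boxplot_wide carry on working without any changes):

make_longer_df <- function(x, y, names_to = "group", values_to = "value") {
  tibble(x, y) %>% 
    pivot_longer(everything(), names_to = names_to, values_to = values_to)
}

Anyone who wants different column names can now supply them, and anyone who doesn’t gets the old behaviour.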

The point of writing a function is to turn a complicated task (like pivoting-longer a dataframe in a particular way) into something with a simple name, which not only simplifies the coding, but also simplifies thinking about the code, because now you only have to think about what a function does, and not about how it does it. This is how coders keep on top of a project with thousands of lines of code: lots of functions with simple names that do complicated tasks, and the higher-level functions (our t_test_wide and boxplot_wide) call the lower-level ones (our make_longer_df).

\(\blacksquare\)

Footnotes

  1. The reason you have to do it this way is that the first input to t.test is a “model formula” saying what depends on what, not a dataframe. If the dataframe were first, there would be no problems omitting the dataframe as usual, to mean “whatever came out of the previous step”.↩︎

  2. There is never actually an infinite amount of time to do something, so you do have to know enough to get reasonably close reasonably quickly. The time limit on assignments is there for that reason.↩︎

  3. It is actually an element of a list, but dataframes at their core are actually lists of columns, so the extraction process is the same.↩︎

  4. pluck pulls anything out of anything else; it doesn’t have to be a dataframe.↩︎

  5. Or my footnotes.↩︎

  6. This is actually not quite true in this case: boxplot_wide would also need to know the names of the columns in what I called d, so that would need tweaking. But the general idea is that changes at the lower level will not affect the higher level much if at all.↩︎