Worksheet 11

Published

December 4, 2024

Questions are below. My solutions will be available after the tutorials are all finished. The whole point of these worksheets is for you to use your lecture notes to figure out what to do. In tutorial, the TAs are available to guide you if you get stuck. Once you have figured out how to do this worksheet, you will be prepared to tackle the assignment that depends on it.

If you are not able to finish in an hour, I encourage you to continue later with what you were unable to finish in tutorial. I wanted to give you some extra practice, so there are three multiple regression scenarios and a function-writing one. If you want to practice writing a function, skip ahead to question 21 and the preamble above it.

Packages

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(broom)
library(MASS, exclude = "select")
library(leaps)

Behavioral Risk Factor Surveillance System 2

The Centers for Disease Control and Prevention (in the US) run a survey called the Behavioral Risk Factor Surveillance System, in which people submit information about health conditions and risk behaviours. The data we have are percentages of adults within each state for whom the item is true. The variables we are interested in here are:

FruitVeg5: the percent of respondents who eat at least 5 servings of fruits and vegetables every day (response)
SmokeEveryday: the percent of respondents who smoke every day.
EdCollege: the percent of respondents who have completed a college diploma or university degree.

The data are in http://ritsokiguess.site/datafiles/brfss_no_utah.csv. There are many other variables, which you can ignore. The state of Utah is omitted.

You might have seen these data before; if you have, you might recall that we omitted Utah because, for religious reasons, a lot of people there do not smoke, and this has nothing to do with eating fruits and vegetables.

(This is, I suspect, a rather old question.)

Read in and display (a little of) the data.

Make a suitable plot of the percent of people eating five fruits and vegetables every day against the percent smoking every day. Add a smooth trend if appropriate.
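As a reminder of the general pattern (a minimal sketch on R's built-in mtcars data, not the BRFSS data, with variables chosen only for illustration), a scatterplot with a smooth trend can be built like this:

```r
library(tidyverse)

# Illustration only: built-in mtcars data, not the worksheet's data.
# Response (mpg) goes on the y-axis, explanatory variable (wt) on the x-axis.
p_scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth()   # default smoother; does not assume the trend is a straight line

p_scatter
```

Leaving geom_smooth at its default lets the data show you whether a straight line is reasonable, rather than assuming one from the start.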

Describe any trends you see on your picture (think form, direction, strength).

Draw a scatterplot of fruit and vegetable consumption vs rate of college graduation. Describe what you see on your scatterplot (form, direction, strength).

Run a regression predicting the percent of adults who eat 5 servings of fruits and vegetables daily from EdCollege and SmokeEveryday, and display the results.

Look at the two numbers in the Estimate column, not including the intercept. Do their signs, positive or negative, make sense in the context of the data? Explain briefly.

What does the last number in the rightmost column of the regression output tell us? Does this make sense in the light of graphs you have drawn? Explain briefly.

Draw a complete set of residual plots for this regression (four plots). Do you have any concerns? Explain briefly.

Veggie burgers

You like hamburgers, but you are a vegetarian. What to do? Today, there are many brands of hamburgers without any meat in them. Some of these are designed to taste like meat, and some have their own flavour. A magazine rated the flavour and texture of 12 different (numbered) brands of meatless hamburgers (to give a rating score between 0 and 100), along with the price (in cents), the number of calories, the grams of fat, and the milligrams of sodium. These measurements are per burger. Is it possible to predict the rating score of a brand of meatless hamburger from the other measurements, and if so, how? The data are in http://ritsokiguess.site/datafiles/veggie-burgers.txt, in aligned columns.

Read in and display (most of) the data.

Fit a suitable regression to predict score from the other measured variables. Display the results.

It looks as if both price and sodium can be removed from this regression. Do so, explain briefly why another test is necessary, and do that other test. What do you conclude?
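One general way to compare a regression with a smaller version of itself in R is anova with two fitted models. Here is a minimal sketch on the built-in mtcars data (not the burger data; the variables are an arbitrary choice of mine, and this is not meant as the solution to this question):

```r
# Illustration only: built-in mtcars data with an arbitrary set of variables.
cars_big <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# update() refits the model with the named terms taken out
cars_small <- update(cars_big, . ~ . - qsec - drat)

# F-test of the smaller model against the bigger one: it tests whether
# the removed terms, taken together, add anything to the fit
cars_compare <- anova(cars_small, cars_big)
cars_compare
```

The smaller model goes first in anova, so that the second row of the output describes what was gained by adding the extra terms back in.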

What happens if you do backward elimination from here, starting from the best model found so far? Does the result seem to make sense? Explain briefly.

Find the best model according to (your choice) AIC or adjusted R-squared. Bear in mind that the best model might have more than two explanatory variables in it. Hint: step finds the best model according to AIC.
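To illustrate the hint (a sketch on the built-in mtcars data rather than the burger data, with an arbitrary starting model), step can be run backward from a bigger model like this:

```r
# Illustration only: built-in mtcars data, not the worksheet's burger data.
# The starting set of explanatory variables is arbitrary.
cars_full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# step() tries removing terms one at a time, makes the change that lowers
# AIC the most, and stops when no single change lowers AIC any further.
# trace = 0 hides the step-by-step printout.
cars_best <- step(cars_full, direction = "backward", trace = 0)

# The model that step() settled on
formula(cars_best)

# AIC as step() computes it (second number) for the two models
extractAIC(cars_full)
extractAIC(cars_best)
```

Note that step searches one term at a time, so what it returns is the best model it found along its path, which need not be the same as the one you get by removing the non-significant terms.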

For the best model obtained in the previous part, do a residual analysis (this will be three or about six plots, depending on how you count them).

Do you see any problems in your residual plots, or not? Explain briefly.

Construction projects

How long does it take to complete a large construction project? This might depend on a number of things. In one study of this issue, fifteen construction projects were investigated (as a random sample of “all possible construction projects”). Five variables were measured:

time taken to complete the project (days)
size of the contract (thousands of dollars)
number of work days affected by bad weather
whether or not there was a workers’ strike during the construction (1 = yes, 0 = no)
number of subcontractors involved in the project.

Subcontractors are people like electricians, plumbers, and so on, that are hired by the company overseeing the whole project to complete specific jobs. A large project might have a number of subcontractors coming in at different times to do parts of the work.

The data are in http://ritsokiguess.site/datafiles/construction.csv.

Read in and display (some of) the data.

Fit a suitable regression, predicting the response variable from everything else, and display the results.

Build a good model for predicting completion time, using backward elimination with \(\alpha = 0.10\). Describe your process.
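Backward elimination is usually done by hand, one step at a time: look at the P-values, remove the largest one if it is bigger than \(\alpha\), refit, and repeat. The loop below is a sketch of that same logic on the built-in mtcars data (not the construction data). The use of drop1 and the variable names are my assumptions for illustration. For explanatory variables with one slope each, the F-tests from drop1 give the same P-values as the t-tests in summary.

```r
# Illustration only: built-in mtcars data with an arbitrary starting model.
alpha <- 0.10
model <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

repeat {
  # Stop if there is nothing left to remove
  if (length(attr(terms(model), "term.labels")) == 0) break

  # One F-test per term: is the model significantly worse without it?
  tests <- drop1(model, test = "F")
  p_values <- tests[["Pr(>F)"]][-1]   # first row is "<none>" and has no test
  candidates <- rownames(tests)[-1]

  # Stop when everything left is significant at alpha
  if (max(p_values) <= alpha) break

  # Otherwise remove the term with the largest P-value and refit
  worst <- candidates[which.max(p_values)]
  model <- update(model, as.formula(paste(". ~ . -", worst)))
}

summary(model)
```

Doing the steps by hand with summary and update has the advantage that you see (and can describe) what was removed at each stage and why.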

For your best model, obtain a full set of residual plots. This means:

residuals vs fitted values
normal quantile plot of residuals
residuals vs each of the explanatory variables (all of them, not just the ones in your final model).
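The three kinds of plot in this list can be sketched as follows, using the built-in mtcars data rather than the construction data. I am assuming here that the residuals and fitted values come from augment in broom; the model and the four variables are arbitrary choices for illustration.

```r
library(tidyverse)
library(broom)

# Illustration only: built-in mtcars data. The model uses two variables,
# but we plot residuals against four, including two not in the model.
cars_fit <- lm(mpg ~ wt + hp, data = mtcars)

# augment() adds .fitted and .resid; data = mtcars keeps every original
# column, so variables left out of the model are still available
cars_aug <- augment(cars_fit, data = mtcars)

# Residuals vs fitted values
p_fitted <- ggplot(cars_aug, aes(x = .fitted, y = .resid)) +
  geom_point()

# Normal quantile plot of residuals
p_qq <- ggplot(cars_aug, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line()

# Residuals vs each explanatory variable, one facet per variable
p_x <- cars_aug %>%
  pivot_longer(c(wt, hp, qsec, drat), names_to = "xname", values_to = "x") %>%
  ggplot(aes(x = x, y = .resid)) +
  geom_point() +
  facet_wrap(~ xname, scales = "free")

p_fitted
p_qq
p_x
```

Using scales = "free" gives each explanatory variable its own x-axis scale, which matters when the variables are measured in very different units.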

Comment briefly on your plots, and how they support the regression being basically satisfactory (or not, if you think it’s not). You may assume that all the data values were correctly recorded.

Writing a function to do a wide \(t\)-test

The way we know how to run a two-sample \(t\)-test is to arrange the data “long”: that is, to have one column with all the data in it, and a second column saying which of the two groups each data value belongs to. However, sometimes we get data in two separate columns, and we would like to make a function that will run the \(t\)-test on data in this form.
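For reference, this is what the “long” arrangement looks like. The values and group names below are invented purely for illustration (they are not the test score data used in the questions that follow):

```r
library(tidyverse)

# Hypothetical values, made up for illustration only
long_data <- tribble(
  ~group, ~value,
  "a",    10,
  "a",    12,
  "a",    15,
  "a",    11,
  "b",    14,
  "b",    16,
  "b",    13,
  "b",    18
)

# One column of data values, one column of group labels:
# the formula reads "value as it depends on group"
long_test <- t.test(value ~ group, data = long_data)
long_test
```

By default, t.test run this way is two-sided and does not assume equal variances in the two groups (the Welch test).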

As an example, suppose we have data on student test scores for some students who took a course online (asynchronously), and some other students who took the same course in-person. The students who took the course online scored 32, 37, 35, 28; the students who took the course in-person scored 35, 31, 29, 25. (There were actually a lot more students than this, but these will do to test with.)

Enter these data into two vectors called online and classroom respectively.

Using the two vectors you just made, make a dataframe with those two vectors as columns, and rearrange it to be a suitable input for t.test. Hint: you might find everything() useful in your rearrangement. When you have everything suitably rearranged, run a two-sample \(t\)-test (two-sided, Welch). What P-value do you get?

Write a function that takes two vectors as input. Call them x and y. The function should run a two-sample (two-sided, Welch) \(t\)-test on the two input vectors and return the output from the \(t\)-test.

Test your function on the same data you used earlier, and verify that you get the same P-value.

Modify your function to return just the P-value from the \(t\)-test, as a number.

Test your modified function and demonstrate that it does indeed return only the P-value (and not the rest of the \(t\)-test output).

What happens if you input two vectors of different lengths to your function? Explain briefly what happened. Does it still make sense to do a \(t\)-test with input vectors of different lengths? Explain briefly. Hint: if you plan to render your document, make sure the top line inside your code chunk says #| error: true.

Modify your function to allow any inputs that t.test accepts. Demonstrate that your modified function works by obtaining a pooled \(t\)-test for the test score data.
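The usual R mechanism for letting a function accept whatever another function accepts is the special argument `...`, which I am assuming is what is wanted here. A minimal sketch with a deliberately different (and hypothetical) function, so as not to give the game away:

```r
# Hypothetical helper, for illustration only: a wrapper around mean()
# that hands any extra arguments straight through via ...
my_mean <- function(x, ...) {
  mean(x, ...)
}

v <- c(1, 2, NA, 4)

# No extra arguments: mean() behaves as it does by default,
# so the missing value makes the result missing
my_mean(v)

# na.rm is not an argument of my_mean, but it is passed on to mean()
my_mean(v, na.rm = TRUE)
```

The wrapper does not need to know ahead of time which extra arguments will be supplied; anything it does not recognize itself is passed along unchanged.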