Questions are below. My solutions will be available after the tutorials are all finished. The whole point of these worksheets is for you to use your lecture notes to figure out what to do. In tutorial, the TAs are available to guide you if you get stuck. Once you have figured out how to do this worksheet, you will be prepared to tackle the assignment that depends on it.
If you are not able to finish in an hour, I encourage you to continue later with what you were unable to finish in tutorial. I wanted to give you some extra practice, so there are three multiple regression scenarios. There will be some function-writing practice on Worksheet 12, which is not attached to a tutorial.
Packages
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The Centers for Disease Control and Prevention (in the US) runs a survey called the Behavioral Risk Factor Surveillance System, in which people submit information about health conditions and risk behaviours. The data we have are percentages of adults within each state for whom the item is true. The variables we are interested in here are:
FruitVeg5: the percent of respondents who eat at least 5 servings of fruits and vegetables every day (response)
SmokeEveryday: the percent of respondents who smoke every day.
EdCollege: the percent of respondents who have completed a college diploma or university degree.
The data are in http://ritsokiguess.site/datafiles/brfss_no_utah.csv. There are many other variables, which you can ignore. The state of Utah is omitted because a lot of people there do not smoke for religious reasons, which have nothing to do with eating fruits and vegetables.
Read in and display (a little of) the data.
Make a suitable plot of the percent of people eating five fruits and vegetables every day against the percent smoking every day. Add a smooth trend if appropriate.
Describe any trends you see on your picture (think form, direction, strength).
Draw a scatterplot of fruit and vegetable consumption vs rate of college graduation. Describe what you see on your scatterplot (form, direction, strength).
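Both plots asked for above follow the same pattern: points, plus a smooth trend if one helps. Here is a minimal sketch of that pattern on made-up data. The data frame toy and its columns x and y are hypothetical names, not the survey data.

```r
library(tidyverse)

# Made-up data purely for illustration (not the BRFSS data)
set.seed(1)
toy <- tibble(
  x = runif(50, 10, 30),
  y = 40 - 0.5 * x + rnorm(50, sd = 2)
)

# Scatterplot with the response on the y-axis and a smooth trend on top;
# geom_smooth() with no arguments does not assume the trend is a straight line
trend_plot <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth()
trend_plot
```

The grey band around the smooth gives a rough idea of how well the trend is pinned down. Passing method = "lm" to geom_smooth() would draw a straight line instead, which only makes sense once you are happy that the relationship is straight.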
Run a regression predicting the percent of adults who eat 5 servings of fruits and vegetables daily from EdCollege and SmokeEveryday, and display the results.
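As a reminder of the mechanics, here is a sketch of a regression with two explanatory variables, fitted to made-up data. The names toy, x1, x2 and y are hypothetical; for the worksheet you would use your own data frame and the column names given above.

```r
library(tidyverse)

# Made-up data purely for illustration
set.seed(2)
toy <- tibble(
  x1 = rnorm(40, 25, 4),
  x2 = rnorm(40, 18, 3),
  y = 10 + 0.6 * x1 - 0.3 * x2 + rnorm(40, sd = 1.5)
)

# Response on the left of the ~, explanatory variables joined by + on the right
toy_fit <- lm(y ~ x1 + x2, data = toy)
summary(toy_fit)
```

In the coefficient table of the summary output, the Estimate column holds the intercept and slopes, and the rightmost column, Pr(>|t|), holds a P-value for each term given that the other terms are in the model.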
Look at the two numbers in the Estimate column, not including the intercept. Do their signs, positive or negative, make sense in the context of the data, based on what you know or can guess? Explain briefly.
What does the last number in the rightmost column of the regression output tell us? Does this make sense in light of the graphs you have drawn? Explain briefly.
Draw a complete set of residual plots for this regression (four plots). Do you have any concerns? Explain briefly.
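One possible way to get the four plots (residuals against fitted values, a normal quantile plot of residuals, and residuals against each of the two explanatory variables) is sketched below on made-up data. It assumes the broom package is installed; broom comes with the tidyverse but is not attached by library(tidyverse). The names toy, x1, x2 and y are hypothetical.

```r
library(tidyverse)
library(broom)  # assumed installed; provides augment()

# Made-up data purely for illustration
set.seed(3)
toy <- tibble(
  x1 = rnorm(40, 25, 4),
  x2 = rnorm(40, 18, 3),
  y = 10 + 0.6 * x1 - 0.3 * x2 + rnorm(40, sd = 1.5)
)
toy_fit <- lm(y ~ x1 + x2, data = toy)

# augment() returns the model's data with .fitted and .resid columns added
toy_aug <- augment(toy_fit)

# 1. Residuals against fitted values
ggplot(toy_aug, aes(x = .fitted, y = .resid)) + geom_point()

# 2. Normal quantile plot of the residuals
ggplot(toy_aug, aes(sample = .resid)) + stat_qq() + stat_qq_line()

# 3 and 4. Residuals against each explanatory variable in turn
ggplot(toy_aug, aes(x = x1, y = .resid)) + geom_point()
ggplot(toy_aug, aes(x = x2, y = .resid)) + geom_point()
```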
Construction projects
How long does it take to complete a large construction project? This might depend on a number of things. In one study of this issue, fifteen construction projects were investigated (as a random sample of “all possible construction projects”). Five variables were measured:
time taken to complete the project (days)
size of the contract (thousands of dollars)
number of work days affected by bad weather
whether or not there was a workers’ strike during the construction (1 = yes, 0 = no)
number of subcontractors involved in the project.
Subcontractors are people like electricians, plumbers, and so on, who are hired by the company overseeing the whole project to complete specific jobs. A large project might have a number of subcontractors coming in at different times to do parts of the work.
Fit a suitable regression, predicting the response variable from everything else, and display the results.
Build a good model for predicting completion time, using backward elimination with \(\alpha = 0.10\). Describe your process.
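The logic of backward elimination can be sketched as a loop: fit everything, remove the single least significant term if its P-value is above \(\alpha\), refit, and repeat until everything left is significant. The version below automates this on made-up data, and is an illustration rather than the expected way to answer the question; doing the same steps by hand with update(), one term at a time, follows the same logic and makes your process easier to describe. The names toy, x1, x2, x3, y, alpha and current are hypothetical, broom is assumed to be installed, and the loop assumes all explanatory variables are quantitative or 0/1 so that each has exactly one row in the coefficient table.

```r
library(tidyverse)
library(broom)  # assumed installed; provides tidy()

# Made-up data: only x1 is actually related to y
set.seed(4)
toy <- tibble(
  x1 = rnorm(60),
  x2 = rnorm(60),
  x3 = rnorm(60),
  y = 5 + 2 * x1 + rnorm(60)
)

alpha <- 0.10
current <- lm(y ~ x1 + x2 + x3, data = toy)

repeat {
  # P-values for the explanatory variables (the intercept always stays)
  p_values <- tidy(current) %>% filter(term != "(Intercept)")
  if (nrow(p_values) == 0) break
  worst <- p_values %>% slice_max(p.value, n = 1, with_ties = FALSE)
  # Stop as soon as the least significant term is itself significant
  if (worst$p.value <= alpha) break
  # Remove only that one term, then refit and look again
  current <- update(current, as.formula(paste(". ~ . -", worst$term)))
}

summary(current)
```

Only one term is removed per pass, because the P-values of the remaining terms can change, sometimes a lot, once something else has been taken out.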
For your best model, obtain a full set of residual plots. This means:
residuals vs fitted values
normal quantile plot of residuals
residuals vs each of the explanatory variables.
Comment briefly on your plots, and how they support the regression being basically satisfactory. You may assume that all the data values were correctly recorded.
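For the residuals against each explanatory variable, one approach that avoids drawing a separate plot per variable is to stack the explanatory variables into one column and facet. This is a sketch on made-up data; toy, x1, x2, y, xname and xval are hypothetical names, and broom is assumed to be installed.

```r
library(tidyverse)
library(broom)  # assumed installed; provides augment()

# Made-up data purely for illustration
set.seed(5)
toy <- tibble(
  x1 = runif(30, 0, 10),
  x2 = runif(30, 50, 100),
  y = 3 + 1.5 * x1 + 0.2 * x2 + rnorm(30)
)
toy_fit <- lm(y ~ x1 + x2, data = toy)

# One row per observation per explanatory variable:
# xname says which variable, xval holds its value
toy_long <- augment(toy_fit) %>%
  pivot_longer(c(x1, x2), names_to = "xname", values_to = "xval")

# One facet per explanatory variable; scales = "free" lets each
# facet have an x-axis suited to its own variable
ggplot(toy_long, aes(x = xval, y = .resid)) +
  geom_point() +
  facet_wrap(~ xname, scales = "free")
```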
Forced expiratory volume
One way to measure how well someone is breathing is to measure their “forced expiratory volume” (FEV), which is how much air (in litres) you can expel from your lungs in one second. If the FEV is too low, this may indicate difficulties in breathing. A doctor wanted to see whether children whose parents smoked at home had difficulties breathing (as measured by the FEV). The data are in http://ritsokiguess.site/datafiles/fev.csv. The doctor also recorded the child’s age (in years) and height (in cent), as well as the child’s gender (here male or female), along with whether or not the child had been exposed to smoking in the home (in the column Smoke). 654 children were observed.
Read in and display (some of) the data.
Fit a regression predicting forced expiratory volume from all the other variables. Why should you run drop1? Do so for this model.
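The mechanics of drop1 are sketched below on made-up data with one quantitative and one three-level categorical explanatory variable. The names toy, x, group and y are hypothetical and have nothing to do with the FEV data.

```r
library(tidyverse)

# Made-up data: group has three levels and no real effect on y
set.seed(6)
toy <- tibble(
  x = rnorm(45),
  group = rep(c("a", "b", "c"), each = 15),
  y = 2 + 1.5 * x + rnorm(45)
)
toy_fit <- lm(y ~ x + group, data = toy)

# summary() shows one row per coefficient, so group is split
# across two rows (groupb and groupc, each compared with level a)
summary(toy_fit)

# drop1() shows one row per term, with an F-test for removing
# that whole term from the model
toy_drop <- drop1(toy_fit, test = "F")
toy_drop
```

Without test = "F", drop1 displays only AIC values and no P-values.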
What does your drop1 output tell you? In particular, does it address the doctor’s concern?
Look at the summary output for your model, and interpret the Estimates for Ht and Gender.
Make a complete set of residual plots for this model. Do you have any concerns? (Hint: the procedure you know doesn’t like a mixture of quantitative and categorical variables, because it’s trying to put text and numbers into the same column. Do the quantitative variables first, and then do the categorical ones.)
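The idea in the hint can be sketched like this, on made-up data: stack the quantitative explanatory variables and draw scatterplots, then separately stack the categorical ones and draw boxplots. All the names here (toy, x1, x2, g1, g2, y, xname, xval) are hypothetical, and broom is assumed to be installed.

```r
library(tidyverse)
library(broom)  # assumed installed; provides augment()

# Made-up data: two quantitative and two categorical explanatory variables
set.seed(7)
toy <- tibble(
  x1 = rnorm(40),
  x2 = rnorm(40),
  g1 = rep(c("yes", "no"), times = 20),
  g2 = rep(c("left", "right"), each = 20),
  y = 1 + x1 + 0.5 * x2 + rnorm(40)
)
toy_fit <- lm(y ~ x1 + x2 + g1 + g2, data = toy)
toy_aug <- augment(toy_fit)

# Quantitative variables: all numbers, so they stack into one column
quant_long <- toy_aug %>%
  pivot_longer(c(x1, x2), names_to = "xname", values_to = "xval")
ggplot(quant_long, aes(x = xval, y = .resid)) +
  geom_point() +
  facet_wrap(~ xname, scales = "free")

# Categorical variables: all text, so they stack too, but separately;
# boxplots of residuals for each level replace the scatterplots
cat_long <- toy_aug %>%
  pivot_longer(c(g1, g2), names_to = "xname", values_to = "xval")
ggplot(cat_long, aes(x = xval, y = .resid)) +
  geom_boxplot() +
  facet_wrap(~ xname, scales = "free")
```

Putting, say, x1 and g1 into the same pivot_longer would fail, because the combined column would have to hold both numbers and text.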
Investigate a transformation of FEV. Are the results consistent with your residual plots?
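One common way to investigate a transformation of a response is Box-Cox, available as boxcox in the MASS package (which ships with R). That this is the intended method here is an assumption on my part for the purposes of the sketch. The example uses made-up data generated so that a log transformation is appropriate; toy, x, y, bc and lambda_hat are hypothetical names. Calling MASS::boxcox directly, rather than attaching MASS, avoids MASS's select masking the one from dplyr.

```r
library(tidyverse)

# Made-up data: log(y) is linear in x, so y itself is not
set.seed(8)
toy <- tibble(
  x = runif(100, 0, 10),
  y = exp(0.3 * x + rnorm(100, sd = 0.2))
)

# Draws the profile log-likelihood for the power lambda, with a 95%
# confidence interval marked; the values plotted are also returned
bc <- MASS::boxcox(y ~ x, data = toy)

# The value of lambda at the peak of the curve
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat
```

In reading the plot, a lambda near 1 suggests no transformation, near 0.5 a square root, and near 0 a log. It is usual to pick a "round" value inside the confidence interval rather than the exact peak.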
Fit a regression suggested by the results of your previous question, and display the drop1 and summary output. Has anything changed from your previous work? Explain briefly.