Worksheet 12

Published

November 22, 2023

Questions are below. My solutions to each question come after all of that question's parts; scroll down if you get stuck. For some of the questions there is extra discussion below the solutions; you might find that interesting to read, maybe after tutorial.

For these worksheets, you will learn the most by spending a few minutes thinking about how you would answer each question before you look at my solution. There are no grades attached to these worksheets, so feel free to guess: it makes no difference at all how wrong your initial guess is!

This week’s questions, in order, are a multiple regression (Question 1), a regression with a categorical variable (Question 2), and writing a function (Question 3). I encourage you to start with the one where you think you need the most practice.

1 Behavior Risk Factor Surveillance System 2

The Centers for Disease Control and Prevention (in the US) run a survey called the Behavior Risk Factor Surveillance System, in which people submit information about health conditions and risk behaviours. The data we have are percentages of adults within each state for whom the item is true. The variables we are interested in here are:

  • FruitVeg5: the percent of respondents who eat at least 5 servings of fruits and vegetables every day (response)
  • SmokeEveryday: the percent of respondents who smoke every day.
  • EdCollege: the percent of respondents who have completed a college diploma or university degree.

The data are in http://ritsokiguess.site/datafiles/brfss_no_utah.csv. There are many other variables, which you can ignore. The state of Utah is omitted.

You might have seen these data before; if you have, you might recall that we omitted Utah because a lot of people there do not smoke for religious reasons, which have nothing to do with eating fruits and vegetables.

  1. Read in and display (a little of) the data.

  2. Make a graph that shows the relationships, if any, between the response and the two explanatory variables.

  3. Describe any trends you see on your graph(s) (think form, direction, strength if you find it helpful).

  4. Run a regression predicting the percent of adults who eat 5 servings of fruits and vegetables daily from EdCollege and SmokeEveryday, and display the results.

  5. Draw a complete set of residual plots for this regression (three or four plots, depending on how you count them). Do you have any concerns? Explain briefly.

Behavior Risk Factor Surveillance System 2: my solutions

The Centers for Disease Control and Prevention (in the US) run a survey called the Behavior Risk Factor Surveillance System, in which people submit information about health conditions and risk behaviours. The data we have are percentages of adults within each state for whom the item is true. The variables we are interested in here are:

  • FruitVeg5: the percent of respondents who eat at least 5 servings of fruits and vegetables every day (response)
  • SmokeEveryday: the percent of respondents who smoke every day.
  • EdCollege: the percent of respondents who have completed a college diploma or university degree.

The data are in http://ritsokiguess.site/datafiles/brfss_no_utah.csv. There are many other variables, which you can ignore. The state of Utah is omitted.

You might have seen these data before; if you have, you might recall that we omitted Utah because a lot of people there do not smoke for religious reasons, which have nothing to do with eating fruits and vegetables.

  1. Read in and display (a little of) the data.

Solution

A simple read_csv:

library(tidyverse) # provides read_csv, the pipe, pivot_longer and ggplot
my_url <- "http://ritsokiguess.site/datafiles/brfss_no_utah.csv"
brfss <- read_csv(my_url)
Rows: 49 Columns: 31
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): State
dbl (30): Age18_24, Age25_34, Age35_44, Age45_54, Age55_64, Age65orMore, EdL...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
brfss

There are 49 rows, one for each state apart from Utah. There are a lot of columns, 31 altogether. You can check for yourself, by looking or as below, that the columns I named above do actually exist. There is no need to “get rid” of the other columns, unless you have a strong desire to do so.
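
One way to do the check, sketched on a small made-up data frame rather than the real data: `toy` and its values are hypothetical stand-ins for `brfss`, with `Age18_24` playing the role of the many columns we do not need. The same `select` call should work on `brfss` itself.

```r
library(tidyverse)

# Hypothetical stand-in for brfss: the values are made up for illustration.
toy <- tribble(
  ~State, ~Age18_24, ~FruitVeg5, ~SmokeEveryday, ~EdCollege,
  "A",    10,        20,         15,             30,
  "B",    12,        25,         12,             35,
  "C",    11,        18,         18,             25
)

# select() stops with an error if any named column is missing,
# so getting a result back confirms that all of them exist.
checked <- toy %>% select(State, FruitVeg5, SmokeEveryday, EdCollege)
checked
```

If one of the names were misspelled, `select` would fail with an error naming the column it could not find, which is usually the quickest way to spot the problem.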

\(\blacksquare\)

  2. Make a graph that shows the relationships, if any, between the response and the two explanatory variables.

Solution

This is best done as one of those facetted graphs, with each of the (here two) facets showing a scatterplot of the response against one of the explanatory variables. To do that, arrange the dataframe longer, with all the explanatory variable values in one column and a second column saying which ones they are:

brfss %>% 
  pivot_longer(c(SmokeEveryday, EdCollege), names_to = "xname", values_to = "x") %>% 
  ggplot(aes(x = x, y = FruitVeg5)) + geom_point() + geom_smooth(se = FALSE) +
  facet_wrap(~xname, scales = "free")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
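
To see what the `pivot_longer` step produces before the graph is drawn, here is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not the real data). Each original row becomes two rows, one per explanatory variable, with `xname` saying which variable the value in `x` came from; `FruitVeg5` is repeated alongside each.

```r
library(tidyverse)

# Hypothetical stand-in for brfss: the values are made up for illustration.
toy <- tribble(
  ~State, ~FruitVeg5, ~SmokeEveryday, ~EdCollege,
  "A",    20,         15,             30,
  "B",    25,         12,             35,
  "C",    18,         18,             25
)

# Stack the two explanatory variables into one column of values (x),
# with a second column (xname) recording which variable each came from.
long <- toy %>%
  pivot_longer(c(SmokeEveryday, EdCollege), names_to = "xname", values_to = "x")
long
```

With the data in this shape, `facet_wrap(~xname, scales = "free")` has one column to facet by, and each facet gets its own x-axis scale, which matters here because the two explanatory variables need not cover the same range of values.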