Worksheet 1

Published

January 5, 2025

Two problems to work through.

Packages you’ll need:

library(tidyverse)
library(marginaleffects)

Veggie burgers

You like hamburgers, but you are a vegetarian. What to do? Today, there are many brands of hamburgers without any meat in them. Some of these are designed to taste like meat, and some have their own flavour. A magazine rated the flavour and texture of 12 different (numbered) brands of meatless hamburgers, giving each one a rating score between 0 and 100, and also recorded the price (in cents), the number of calories, the grams of fat, and the milligrams of sodium. These measurements are per burger. Is it possible to predict the rating score of a brand of meatless hamburger from the other measurements, and if so, how? The data are in http://ritsokiguess.site/datafiles/veggie-burgers.txt, in aligned columns.

(Some of this question appeared on the C32 Worksheet 11. There is some duplication, but I have tried to remove most of it.)

Read in and display (most of) the data.
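A minimal sketch of the reading-in step. The URL is the one given above; the dataframe name `burgers` is just a name chosen here. Because the values are in aligned columns, `read_table` is a reasonable choice.

```r
library(tidyverse)

# URL from the worksheet; values are in aligned (whitespace-separated) columns
my_url <- "http://ritsokiguess.site/datafiles/veggie-burgers.txt"
burgers <- read_table(my_url)

# Printing a tibble displays the first ten rows, which is most of the 12
burgers
```
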

Fit a suitable regression to predict score from the other measured variables (that is to say, not brand: why?). Display the results.
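One way this might look, as a sketch. It assumes the columns in the data file are called `score`, `price`, `calories`, `fat` and `sodium`; check the names in your own display and adjust if they differ. The model name `burgers.1` is a choice made here.

```r
library(tidyverse)

my_url <- "http://ritsokiguess.site/datafiles/veggie-burgers.txt"
burgers <- read_table(my_url)

# Regression of score on the four measured variables
# (column names are assumed, not confirmed by the worksheet)
burgers.1 <- lm(score ~ price + calories + fat + sodium, data = burgers)
summary(burgers.1)
```
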

It looks as if both price and sodium could be removed from this regression. Remove them, explain briefly why another test is necessary, and do that other test. What do you conclude? (Note: if you display the output from your second regression, something rather odd will appear. You can safely ignore that.)
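A sketch of the mechanics, under the same assumed column names as before. The t-tests in the regression output each assess one variable with all the others still in the model, so taking out two variables at once is usually checked by comparing the smaller and larger models with `anova`. The names `burgers.1` and `burgers.2` are choices made here.

```r
library(tidyverse)

my_url <- "http://ritsokiguess.site/datafiles/veggie-burgers.txt"
burgers <- read_table(my_url)

# Full model, then the same model with price and sodium taken out
burgers.1 <- lm(score ~ price + calories + fat + sodium, data = burgers)
burgers.2 <- update(burgers.1, . ~ . - price - sodium)

# F-test comparing the two fits: smaller model first, larger second
model_comparison <- anova(burgers.2, burgers.1)
model_comparison
```
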

Another veggie burger (not in the original dataset) has the following values for the explanatory variables: price 91, calories 140, fat 5, sodium 450. What can you say about the likely score for a veggie burger with these values? Obtain a suitable interval, for each of your two models.
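The question is about one individual burger, so a prediction interval seems to be the kind of interval wanted. The sketch below uses base R's `predict` with `interval = "prediction"` at its default 95% level, and again assumes the column names `score`, `price`, `calories`, `fat` and `sodium`; the object names are choices made here.

```r
library(tidyverse)

my_url <- "http://ritsokiguess.site/datafiles/veggie-burgers.txt"
burgers <- read_table(my_url)
burgers.1 <- lm(score ~ price + calories + fat + sodium, data = burgers)
burgers.2 <- lm(score ~ calories + fat, data = burgers)

# The new burger's values, as given in the question
new <- tibble(price = 91, calories = 140, fat = 5, sodium = 450)

# Prediction intervals (default level 0.95) from each model;
# the second model simply ignores the columns it does not use
p1 <- predict(burgers.1, new, interval = "prediction")
p2 <- predict(burgers.2, new, interval = "prediction")
p1
p2

# Lengths of the two intervals, for comparing them afterwards
p1[, "upr"] - p1[, "lwr"]
p2[, "upr"] - p2[, "lwr"]
```
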

Compare the lengths of your two intervals. Does it make sense that your shorter one should be shorter? Explain briefly.

Using our second model (the one with only calories and fat in it), find a suitable interval for the mean score when (i) calories is 140 and fat is 5, (ii) calories is 120 and fat is 3. (You should have two intervals.)
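For a mean score, a confidence interval for the mean response is the natural candidate. The sketch below uses `predictions` from `marginaleffects`, which returns `estimate`, `conf.low` and `conf.high` columns for a linear model. The column names in the data and the object names are assumptions, as before.

```r
library(tidyverse)
library(marginaleffects)

my_url <- "http://ritsokiguess.site/datafiles/veggie-burgers.txt"
burgers <- read_table(my_url)
burgers.2 <- lm(score ~ calories + fat, data = burgers)

# The two combinations of values asked about, one per row
new <- tribble(
  ~calories, ~fat,
  140,       5,
  120,       3
)

# Confidence intervals for the mean score at each combination
ci_means <- as.data.frame(predictions(burgers.2, newdata = new)) %>%
  select(calories, fat, estimate, conf.low, conf.high)
ci_means
```
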

Explain briefly why the second interval is shorter than the first one. Make sure you justify your answer.

Blood pressure

Twenty people with high blood pressure had various other measurements taken. The aim was to see which of these were associated with blood pressure, in order to understand what causes high blood pressure. The variables observed were:

Pt: patient number (ignore)
BP: (diastolic) blood pressure, in mmHg
Age: age, in years
Weight: weight, in kg
BSA: body surface area, in m²
Dur: length of time since diagnosis of high blood pressure, in years
Pulse: pulse rate, in beats per minute
Stress: score on a questionnaire about stress levels (higher score is more stressed)

The data values, separated by tabs, are in https://ritsokiguess.site/datafiles/bloodpress.txt.

Read in and display (some of) the data.
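A minimal sketch of this step. The values are separated by tabs, so `read_tsv` is a reasonable choice; the dataframe name `bp` is just a name chosen here.

```r
library(tidyverse)

# URL from the worksheet; values are tab-separated
my_url <- "https://ritsokiguess.site/datafiles/bloodpress.txt"
bp <- read_tsv(my_url)

# Printing a tibble displays the first ten of the 20 rows
bp
```
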

Make a plot of the blood pressure against each of the measured explanatory variables. Hint: use the idea from C32 of making a suitable long dataframe and using facets in your graph.
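One way the hint might be carried out, as a sketch. It assumes the column names in the file match the variable list above (`Pt`, `BP`, `Age`, `Weight`, `BSA`, `Dur`, `Pulse`, `Stress`); the names `bp`, `bp_long`, `xname` and `x` are choices made here.

```r
library(tidyverse)

my_url <- "https://ritsokiguess.site/datafiles/bloodpress.txt"
bp <- read_tsv(my_url)

# Make a long dataframe: one row per patient per explanatory variable,
# keeping the patient number and the response as they are
bp_long <- bp %>%
  pivot_longer(-c(Pt, BP), names_to = "xname", values_to = "x")

# One scatterplot per explanatory variable; each facet gets its own scales
# because the variables are measured in different units
ggplot(bp_long, aes(x = x, y = BP)) +
  geom_point() +
  facet_wrap(~xname, scales = "free")
```
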

Which explanatory variables seem to have a moderate or strong linear relationship with blood pressure?

Run a regression predicting blood pressure from BSA and Weight, and display the output. Does the significance or lack of significance of each of your explanatory variables surprise you? Explain briefly.
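A sketch of the regression, assuming the column names `BP`, `BSA` and `Weight` from the variable list above. The model name `bp.1` is a choice made here.

```r
library(tidyverse)

my_url <- "https://ritsokiguess.site/datafiles/bloodpress.txt"
bp <- read_tsv(my_url)

# Blood pressure predicted from body surface area and weight
bp.1 <- lm(BP ~ BSA + Weight, data = bp)
summary(bp.1)
```
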

Explain briefly why it does in fact make sense that the regression results came out as they did. You may wish to draw another graph to support your explanation.
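One possible supporting graph, as a sketch: it assumes the explanation you have in mind involves how the two explanatory variables relate to each other, and plots one against the other. The column names are taken from the variable list above, and the name `bsa_weight_cor` is a choice made here.

```r
library(tidyverse)

my_url <- "https://ritsokiguess.site/datafiles/bloodpress.txt"
bp <- read_tsv(my_url)

# Scatterplot of the two explanatory variables against each other
ggplot(bp, aes(x = Weight, y = BSA)) +
  geom_point()

# A numerical summary of the same relationship
bsa_weight_cor <- with(bp, cor(Weight, BSA))
bsa_weight_cor
```
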