── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
Crabs
The crab species Leptograpsus variegatus has two colour forms, blue and orange. Fifty crabs of each colour form and sex were collected at Fremantle, Western Australia, and various measurements were taken on each crab. The variables of interest to us are:
sp species (B is blue and O is orange; that’s the letter O, not a zero)
sex Male or female
CL carapace length
index a number between 1 and 50, unique within each species-sex combination.
Rows: 200 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): sp, sex
dbl (6): index, FL, RW, CL, CW, BD
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
crabs
There are 200 crabs altogether, and two species (scroll down). A lot of other measurements were taken, in addition to the ones that will be of interest to us.
(2 points) Create, save, and display (some of) a dataframe with only the species and the carapace length (but all the observations for those two variables).
Making a dataframe with only some variables (columns) is select:
crabs %>% select(sp, CL) -> crabs1
crabs1
I called my new dataframe crabs1 (there will be a crabs2 shortly). A rather easy two points.
(3 points) Attempt to display the carapace lengths in two columns, one column for each species. What happens? Why did it happen? Explain briefly.
We have long data (such as you might use to do a two-sample test to compare the carapace lengths between the two species), and we want to make it wider:
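The attempt, a sketch assuming the crabs1 dataframe saved above:

```r
# Try to spread CL into one column per species:
crabs1 %>%
  pivot_wider(names_from = sp, values_from = CL)
```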
Warning: Values from `CL` are not uniquely identified; output will contain list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
{data} |>
dplyr::summarise(n = dplyr::n(), .by = c(sp)) |>
dplyr::filter(n > 1L)
A point for getting that code and output.
Then explain what happened: all of the carapace lengths for each species ended up crammed together into one cell of the dataframe. (If you like, contrast that with what you were expecting: 100 rows of carapace measurements in each column.) A point for that.
Then explain why it happened. The key thing is that both of the variables in crabs1 were used in the pivot_wider, so there is nothing left over to determine what row each observation goes into (so they all get crammed into one row).
Extra: The warning, if you read it carefully, gives some clues about that. You don’t need to investigate the options here, but you could use the values_fn option to pivot_wider in a couple of ways, one to say “I know this is going to happen and I’m not worried about it”, which kills the warning message:
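Sketches of both uses, assuming the crabs1 from earlier (the particular values_fn choices here follow the warning’s suggestions):

```r
# Say "I know about the list-columns and I'm not worried", which kills the warning:
crabs1 %>%
  pivot_wider(names_from = sp, values_from = CL, values_fn = list)
# Or replace each cell-full of duplicates with a summary, such as the median:
crabs1 %>%
  pivot_wider(names_from = sp, values_from = CL, values_fn = median)
```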
This works out the median carapace length for crabs of each species (and now each cell has only one thing in it because you replaced all the values with a summary). You can use any summary that you would use in summarize.
This lists all the cells with more than one observation crammed into them. I replaced {data} with the name of my dataframe, and the |> is called “the base R pipe” and does more or less the same thing as %>%. This code says “count the number of observations for each species, and display the ones that are bigger than 1”. 1L is “1 made to be an integer, not a decimal number”. The .by thing is an alternative way of doing group-by and summarize: you put the group-by right in the summarize.
(2 points) Starting from the dataframe you read in from the file, create, save, and display (some of) a dataframe with species, sex, carapace length, and the column index (with all the observations).
I called this one crabs2 (now you see why I called the previous one crabs1). The order of the columns is not important.
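The code, a sketch along the same lines as the earlier select:

```r
# Keep species, sex, carapace length, and index for all 200 crabs:
crabs %>%
  select(sp, sex, CL, index) -> crabs2
crabs2
```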
(3 points) Repeat your code to put the carapace lengths into two columns, one for each species, but starting from the dataframe you just created. Do you now get something that makes sense? Why is it different from your previous attempt? Explain briefly.
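The same pivot_wider, now starting from crabs2:

```r
# Same pivot as before, but with sex and index along for the ride:
crabs2 %>%
  pivot_wider(names_from = sp, values_from = CL)
```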
What we see is that we now have 100 rows, which matches up with the 100 crabs of each species (100 orange ones and 100 blue ones). This makes sense. One point.
What happened to make it work this time is the presence of the columns index and sex. There are 50 different values of index (1 through 50) and two different sexes (M and F), so there are \(50 \times 2 = 100\) combinations of these. Because index and sex were not named in the pivot_wider, they are left over to identify the rows, and so there are now 100 rows, one per combination. The third point for enough of that.
Extra: the values in index didn’t have to be the numbers 1 through 50. They could be any 50 different values. Here’s a similar but smaller example (that I made up):
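A sketch of such a dataframe; the columns greek, trt, and y match the description that follows, but the actual y values here are invented:

```r
# A small made-up example: greek identifies rows, trt will supply column
# names, and there is deliberately no (gamma, b) combination:
d <- tribble(
  ~greek,  ~trt, ~y,
  "alpha", "a",  10,
  "alpha", "b",  11,
  "beta",  "a",  12,
  "beta",  "b",  13,
  "gamma", "a",  14
)
d
```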
d %>% pivot_wider(names_from = trt, values_from = y)
The pivot-wider uses trt and y, leaving only greek. There are three different values in greek, so there are three rows. One of the cells in the result is empty (the NA) because in my d there was no value of y when greek was gamma and trt was b.
Consumption of natural gas
How does the price of natural gas affect how much of it people use? Consumption of natural gas was measured for 20 towns in Texas (in thousands of cubic feet per customer). In each town, the price was different (in cents per cubic feet). The data are in http://ritsokiguess.site/datafiles/texasgas.csv.
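Reading the file in with read_csv (the URL is the one above):

```r
# Read the Texas natural gas data from the course website:
texasgas <- read_csv("http://ritsokiguess.site/datafiles/texasgas.csv")
```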
Rows: 20 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): price, consumption
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
texasgas
(2 points) Make a suitable graph of the two variables in your dataframe.
Two things to consider:
two quantitative variables, so a scatterplot
the response is consumption. The thinking behind “how does price affect consumption?” is that price is explanatory and consumption is the response; we are thinking about price as being something we could change,1 which would have some sort of impact on how much was consumed, the consumption being something we cannot control.
Hence, a scatterplot with consumption on the \(y\)-axis:
ggplot(texasgas, aes(x = price, y = consumption)) +
  geom_point()
Add a smooth trend if you like (optional).
(2 points) What does your graph tell you about the relationship between price and consumption? (If you wish, use “form, direction, strength” as a guide.)
form: it seems to be a non-linear trend. Consumption goes steadily down up to a price of about 60, and then decreases less rapidly (or levels off, if that’s what you see).
direction: it’s overall a downward trend: as price goes up, consumption goes down.2
strength: I’d call this a moderately strong relationship, since the points appear to be reasonably close to whatever the right relationship is, without being very close to it (there is still going to be some variation around the best trend).
The two key things to observe are the first two: it’s a downward but non-linear trend. Those are what the two points are for.
(2 points) Fit a straight line relationship between the two variables, and display the output. (You may not think this is the best thing to do, but do it anyway for the moment.)
This is an ordinary lm. Give the model a number, because there will be a second model later:
texasgas.1 <- lm(consumption ~ price, data = texasgas)
summary(texasgas.1)
Call:
lm(formula = consumption ~ price, data = texasgas)
Residuals:
Min 1Q Median 3Q Max
-40.625 -10.719 -1.136 14.073 38.292
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 138.561 13.552 10.225 6.34e-09 ***
price -1.104 0.202 -5.467 3.42e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.86 on 18 degrees of freedom
Multiple R-squared: 0.6241, Adjusted R-squared: 0.6033
F-statistic: 29.89 on 1 and 18 DF, p-value: 3.417e-05
I named my model for the dataframe; you could have called it something like consumption.1 to name it for the response variable. Have a system that makes sense to you.
Alternatively, if you have broom loaded (I do), you could display the model output like this:
glance(texasgas.1)
tidy(texasgas.1)
which is equally good, if a bit more typing.
Note that there is nothing here indicating a problem with the model: R-squared is not too bad, and there is a significant negative relationship with price, as the economic theory3 predicts. To unearth the problem you are expecting to see, you will have to go beyond this output (which is the reason for the next part).
(4 points) Draw a graph that indicates a problem with the model you just fitted, and explain briefly what that problem is.
The way you unearth problems with a regression model is to look at residuals in some way. Two candidate plots are residuals vs. fitted values, and a normal quantile plot of residuals. Our suspicion is that the form of the fitted relationship is wrong (we fitted a line, and it is not linear), so the residuals vs. fitted values plot is the place to look first:
ggplot(texasgas.1, aes(x = .fitted, y = .resid)) +
  geom_point()
Add a smooth trend if you like.4 Two points for the graph with or without a smooth trend.
The problem here is a down and up trend, not the random scatter of points we were hoping for. This indicates that the original relationship was not a straight line but was non-linear (I would also accept “curved”). Two points, one each, for saying (i) what you saw, (ii) what it tells you.
Extra: the normal quantile plot of residuals is much less revealing:
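A sketch of that plot, assuming the texasgas.1 model from before (ggplot will treat the fitted model object as a dataframe):

```r
# Normal quantile plot of the residuals from the straight-line model:
ggplot(texasgas.1, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line()
```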
A very slightly long lower tail, but all that tells you is that the most negative residuals are a bit more negative than you’d expect. This plot doesn’t really tell you what is going on.
(2 points) Find out what R’s ifelse function does, and explain briefly. Cite your source in a way that the grader can check it (should they wish to).
Use your favourite search engine to find something that tells you about what ifelse does and how it works. I found this page.5 See if you can simplify the wording down from programmer-speak, maybe to something like this:
ifelse enables you to define a result according to the value of another variable
The first input is something that will be true or false (a logical condition)
The second input is the result if the first input is true
The third input is the result if the first input is false.
The example in my link shows how you can define a new variable that is Odd or Even according to whether the input number is odd or even. This is done by taking the remainder when you divide the input by 2 (that’s what a %% 2 does), and testing to see whether the remainder is zero (even) or not (odd).
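A minimal sketch of the same idea (the vector x here is made up):

```r
x <- c(1, 2, 3, 4, 5)
# Remainder on division by 2 is zero for even numbers:
ifelse(x %% 2 == 0, "Even", "Odd")
# [1] "Odd"  "Even" "Odd"  "Even" "Odd"
```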
Don’t use ChatGPT or the like for a problem like this. Using a website gives you a link that always points to the same place (as long as that website exists), and is thus reproducible: if somebody else follows the same link, they will get the same result. If you use AI, even if you state the prompt you used, somebody entering the same prompt could get a different answer from the one you got. This is not reproducible, and is thus bad science.
(2 points) Create a new column in your dataframe called gt that is the price minus 60 if price is greater than 60 and 0 otherwise. Save the resulting dataframe.
The unstated assumption is that this is going to use ifelse somehow (or else, why would I have you find out about it?). The logical condition is whether the price is greater than 60, the value if that is true is price - 60, and the value if false is zero:
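A sketch, assuming the texasgas dataframe read in earlier:

```r
# gt is price minus 60 for prices above 60, and 0 otherwise:
texasgas %>%
  mutate(gt = ifelse(price > 60, price - 60, 0)) -> texasgas
texasgas
```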
I saved the new dataframe on top of the old one. Maybe it’s better to give it a new name.
(2 points) Fit a regression predicting consumption from price and your new variable, and display the results.
Add gt to your regression thus:
texasgas.2 <- lm(consumption ~ price + gt, data = texasgas)
summary(texasgas.2)
Call:
lm(formula = consumption ~ price + gt, data = texasgas)
Residuals:
Min 1Q Median 3Q Max
-21.744 -5.084 1.722 9.442 22.749
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 218.8263 17.4986 12.505 5.34e-10 ***
price -2.8534 0.3560 -8.015 3.56e-07 ***
gt 2.7092 0.5144 5.266 6.30e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.23 on 17 degrees of freedom
Multiple R-squared: 0.8572, Adjusted R-squared: 0.8404
F-statistic: 51.01 on 2 and 17 DF, p-value: 6.55e-08
Extra: this second model, whatever it is doing, fits a lot better than the first one (R-squared of 0.86 vs. 0.62).
(4 points) Plot the data and fitted relationship from the model you just fitted, on one graph. Join the fitted values with lines. Describe the form of the relationship you just fitted.
There are different ways to tackle this one. augment from the broom package is one way (to create a dataframe with the original data and the regression stuff in it, which you can then plot). Easier, though, is to use what I did in lecture to plot the windmill data along with the fitted line and curve. First, plotting the data is just what you did before:
ggplot(texasgas, aes(x = price, y = consumption)) +
  geom_point()
To that you need to add the fitted values joined by lines. The problem is that (i) they come from a different dataframe,6 the one I called texasgas.2, and (ii) the things we want to plot on the \(y\)-axis are now called .fitted rather than consumption. This means using a geom_line with two extra things in it: a data to say which dataframe to get things from, and an aes with a y = to say what y to use, like this:
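A sketch of that, assuming the texasgas.2 model from before (ggplot quietly turns the fitted model object into a dataframe, one that contains price and .fitted):

```r
# Data points, plus fitted values from texasgas.2 joined by lines;
# the geom_line overrides data and y, keeping x = price from the ggplot:
ggplot(texasgas, aes(x = price, y = consumption)) +
  geom_point() +
  geom_line(data = texasgas.2, aes(y = .fitted))
```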
the one in lecture also has a geom_smooth which plots the fitted line we had there. We don’t need that here, because we already know one straight line fitted to the whole dataset is no good.
what we are doing in the geom_line is called “overriding”. Usually when you add things to a graph you want to use the same things that are in the aes in the ggplot. But if you want to change them, as we do here, you “override” them by putting, inside the thing you are adding, a data = (if necessary) and another aes to say what you want to change. Anything not mentioned is kept the same as in the original ggplot. Here that’s the thing that’s x, which is still price. The label on the \(y\)-axis is still consumption because the geom_point that used it was done first.
The model fits one line for prices below 60, and a second, less steep, line for prices above 60. These lines join up at 60. (This is sometimes called a “broken stick model”, because it looks like a tree branch that is still joined up but has been snapped in the middle.)
Three points for the graph (including the use of a second aes inside the geom_line), and one for comment on it. If you find another way to make the graph that is not too far off what we’ve done in class, that is also good.
Extra: now we see why this model fits a lot better than the first one: it conforms to the actual shape of the relationship much better, with a steep downward trend for prices less than 60 and a much less steep downward trend for prices greater than 60.
The broken stick model is better than a quadratic curve here, because that would go up again and this trend doesn’t. Another model you might consider is something like our asymptote model from the windmill data, but upside down because consumption might be going down to a limit here. That limit might be zero, or there might be an amount of natural gas that people will always use (say, for cooking or heating) no matter how expensive it gets.
I don’t know whether you were expecting to see the broken-stick thing. Sometimes it’s hard to look at a model and see what it is doing. The secret is then to look at predictions and see what they are doing. Then you can describe what’s going on.7
Let’s take a look at the model output again:
summary(texasgas.2)
Call:
lm(formula = consumption ~ price + gt, data = texasgas)
Residuals:
Min 1Q Median 3Q Max
-21.744 -5.084 1.722 9.442 22.749
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 218.8263 17.4986 12.505 5.34e-10 ***
price -2.8534 0.3560 -8.015 3.56e-07 ***
gt 2.7092 0.5144 5.266 6.30e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.23 on 17 degrees of freedom
Multiple R-squared: 0.8572, Adjusted R-squared: 0.8404
F-statistic: 51.01 on 2 and 17 DF, p-value: 6.55e-08
That slope on price applies for any price less than 60 (where gt is zero): a significantly negative downward trend. The slope for gt is actually the difference in slopes between “less than 60” and “greater than 60”, so the P-value on the gt line is saying that the slope is significantly different between a price less than 60 and a price greater. As to what that second slope actually is:
-2.8534 + 2.7092
[1] -0.1442
It’s still negative, but a lot closer to zero (the line is a lot closer to being flat when price is greater than 60).
Footnotes
If we worked for the natural gas company, that is.↩︎
I think this is what microeconomics would predict, but my one economics course was forty years ago, so don’t hold me to that.↩︎
The problem with doing so in general is that it makes you infer a trend that is not really there, but in this case there really is a trend.↩︎
To insert a link, assuming you are in the visual editor, type / and some letters of the word “link”, then hit Enter, which pops up a box in which you can insert the URL you would like to link to, and the text that is clickable.↩︎
Strictly speaking, a “fitted model object” rather than a dataframe, but for our purposes it behaves like a dataframe.↩︎
This strategy is one we use a fair bit in STAD29, when the models get more complicated and the estimates don’t make sense by themselves.↩︎