Worksheet 6

Published

February 22, 2024

Questions are below. My solutions are below all the question parts for a question; scroll down if you get stuck. There might be extra discussion below that for some of the questions; you might find that interesting to read, maybe after tutorial.

For these worksheets, you will learn the most by spending a few minutes thinking about how you would answer each question before you look at my solution. There are no grades attached to these worksheets, so feel free to guess: it makes no difference at all how wrong your initial guess is!

1 Home prices revisited

A realtor kept track of the asking prices of 37 homes for sale in West Lafayette, Indiana, in a particular year. The asking prices are in http://ritsokiguess.site/datafiles/homes.csv. There are two columns, the asking price (in $) and the number of bedrooms that home has (either 3 or 4, in this dataset). The realtor was interested in whether the mean asking price for 4-bedroom homes was bigger than for 3-bedroom homes.

You have seen these data before (on Worksheet 4).

  1. Read in and display (some of) the data.

  2. Draw a suitable graph of these data. Bear in mind that even though the number of bedrooms is a number, it should really be treated as categorical. (This asks for the same graph that you drew before.)

  3. Create and save a dataframe that has just the 4-bedroom houses in it.

  4. Obtain and plot the bootstrap sampling distribution of the sample mean asking price for the 4-bedroom houses. (Your plot should probably be a histogram with 10 bins if you have 1000 simulations, more bins if you have more simulations.)

  5. Comment briefly on the shape of your plot.

  6. Repeat parts (c) through (e) for the 3-bedroom houses.

  7. Running the two-sample \(t\)-test on the logs of the asking prices, the P-value for a one-sided test (for an alternative that 4-bedroom houses had a higher mean asking price) was \(6.5 \times 10^{-6}\). Run the same two-sample \(t\)-test, but on the asking prices themselves. How does the P-value, and thus your conclusion, differ?

2 Prices of cheese slices

The prices per ounce of 13 different brands of individually wrapped cheese slices, at a certain supermarket at a certain time, were recorded. The prices are in cents. The data are in http://ritsokiguess.site/datafiles/cheese_prices.txt, in one column.

  1. Read in and display (most of) the data.

  2. Make a one-sample boxplot of your dataset. What feature of your graph suggests that a \(t\)-procedure may not be appropriate?

  3. What is a 95% confidence interval for the median price per ounce of all individually-wrapped slices of cheese (as sold by supermarkets like the one sampled, at about that time)?

  4. This store’s own brand of cheese slices was not included in the data. The store manager wants to sell the store’s brand at 21 cents per ounce. Is the median price of all other brands (of which the ones taken are a sample) significantly greater than this?

  5. How could you have known that your test of the previous part would give a significant result? Explain briefly.

  6. Obtain a 95% confidence interval for the mean price of all cheese slices (of which the ones observed are a sample). Compare this with the confidence interval for the median that you found earlier. What does this tell you about the appropriateness of \(t\) procedures for these data? Explain briefly.

3 Fuel efficiency comparison

Some cars have on-board computers that calculate quantities related to the car’s performance. One of the things measured is the fuel efficiency, that is, how much gasoline the car uses. On an American car, this is measured in miles per (US) gallon. On one type of vehicle equipped with such a computer, the fuel efficiency was measured each time the gas tank was filled up, and the computer was then reset. Twenty observations were made, and are in http://ritsokiguess.site/datafiles/mpgcomparison.txt. The computer’s values are in the column Computer. The driver also calculated the fuel efficiency by hand, by noting the number of miles driven between fill-ups, and the number of gallons of gas required to fill the tank each time. The driver’s values are in Driver. The final column Diff is the difference between the computer’s value and the driver’s value for each fill-up. The data values are separated by tabs.

  1. Read in and display (some of) the data.

  2. What is it that makes this paired data? Explain briefly.

  3. Draw a suitable graph of these data, bearing in mind what you might want to learn from your graph.

  4. Is there any difference between the average results of the driver and the computer? (Average could be mean or median, whichever you think is best). Do an appropriate test.

  5. The people who designed the car’s computer are interested in whether the values calculated by the computer and by the driver on the same fill-up are usually close together. Explain briefly why it is that looking at the average (mean or median) difference is not enough. Describe what you would look at in addition, and how that would help.

My solutions:

Home prices revisited

  1. Read in and display (some of) the data.

Solution

The exact usual:

set.seed(457299)
my_url <- "http://ritsokiguess.site/datafiles/homes.csv"
asking <- read_csv(my_url)
Rows: 37 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): price, bdrms

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
asking

There are indeed 37 homes; the first few of them have 4 bedrooms, and the ones further down, if you scroll, have 3. The price column does indeed look like asking prices of homes for sale.1

\(\blacksquare\)

  1. Draw a suitable graph of these data. Bear in mind that even though the number of bedrooms is a number, it should really be treated as categorical. (This asks for the same graph that you drew before.)

Solution

Two groups of prices to compare, or one quantitative column and one column that appears to be categorical (it’s actually a number, but it’s playing the role of a categorical or grouping variable). So a boxplot.

This is how we figured it out before:

ggplot(asking, aes(x = factor(bdrms), y = price)) + geom_boxplot()

\(\blacksquare\)

  1. Create and save a dataframe that has just the 4-bedroom houses in it.

Solution

This is filter, to grab the rows that you want:

asking %>% 
  filter(bdrms == 4) -> bed4
bed4

There are 14 4-bedroom houses. I called my dataframe bed4 because (i) you can’t start a name with a number (not without some care, anyway) and (ii) we’re going to do the 3-bedroom houses shortly.
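(If you really wanted a name like 4bed, you could get it by putting backticks around it, but then you would need the backticks every time you used the name, which gets tiresome. Purely as an illustration, with a made-up name:

asking %>% filter(bdrms == 4) -> `4bed`
`4bed`

Calling it bed4 avoids all of that.)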

\(\blacksquare\)

  1. Obtain and plot the bootstrap sampling distribution of the sample mean asking price for the 4-bedroom houses. (Your plot should probably be a histogram with 10 bins if you have 1000 simulations, more bins if you have more simulations.)

Solution

Via the usual procedure:

tibble(sim = 1:1000) %>%
  rowwise() %>% 
  mutate(my_sample = list(sample(bed4$price, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(x = my_mean)) + geom_histogram(bins = 10)

\(\blacksquare\)

  1. Comment briefly on the shape of your plot.

Solution

To me, this looks very close to normal. That is to say, even though the distribution of asking prices of 4-bedroom houses was somewhat skewed to the right (look at your boxplot), even a sample size of 14 was big enough to overcome the non-normality in the asking prices for the 4-bedroom houses.
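If you want a number to go with “somewhat skewed to the right”, one quick check (my idea, not something the question asks for) is to compare the mean and median of the 4-bedroom asking prices; a mean noticeably bigger than the median is consistent with right-skewness:

bed4 %>% summarize(mean_price = mean(price), med_price = median(price))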

Before you start running away with things, this is only for one of our two groups, and (presumably) the better one normality-wise. To do a two-sample \(t\)-test, we need both groups to behave themselves, which brings us to the next part.

\(\blacksquare\)

  1. Repeat parts (c) through (e) for the 3-bedroom houses.

Solution

Same ideas exactly:

asking %>% filter(bdrms == 3) -> bed3
bed3

There are 23 of these, so we can expect a bit more help from the Central Limit Theorem, but will it be enough?

tibble(sim = 1:1000) %>%
  rowwise() %>% 
  mutate(my_sample = list(sample(bed3$price, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) -> d
ggplot(d, aes(x = my_mean)) + geom_histogram(bins = 10)

For me, this is actually not bad, perhaps a little right-skewed (which you would expect because of those upper-end outliers). So overall, a two-sample \(t\)-test on the asking prices is not as bad as you might think.
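If you want a closer look at that slight right-skewness, and you have met normal quantile plots (they come up again in the next question), one of those on the saved bootstrap means in d is a reasonable follow-up:

ggplot(d, aes(sample = my_mean)) + stat_qq() + stat_qq_line()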

\(\blacksquare\)

  1. Running the two-sample \(t\)-test on the logs of the asking prices, the P-value for a one-sided test (for an alternative that 4-bedroom houses had a higher mean asking price) was \(6.5 \times 10^{-6}\). Run the same two-sample \(t\)-test, but on the asking prices themselves. How does the P-value, and thus your conclusion, differ?

Solution

You might not be completely happy running the \(t\)-test now, but we are doing it anyway to see how the results compare:

t.test(price~bdrms, data = asking, alternative = "less")

    Welch Two Sample t-test

data:  price by bdrms
t = -4.4753, df = 20.976, p-value = 0.0001045
alternative hypothesis: true difference in means between group 3 and group 4 is less than 0
95 percent confidence interval:
      -Inf -73385.25
sample estimates:
mean in group 3 mean in group 4 
       147560.8        266792.9 

The P-value of \(1 \times 10^{-4}\) is actually quite a bit bigger than the one we had before, but the conclusion is the same: we still have strong evidence that the mean asking price for 4-bedroom houses is greater than for 3-bedroom houses.

Brief extra: my suspicion is that the P-value is larger, and thus the evidence is a bit weaker, because the three outliers in the 3-bedroom group are inflating the SD of that group, thus making the \(t\)-statistic a bit smaller and the P-value a bit larger.
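If you want to investigate that suspicion, one way (a sketch; the summary column names are mine) is to compare the spread of each group on both the raw and the log scale:

asking %>% 
  group_by(bdrms) %>% 
  summarize(n = n(), sd_price = sd(price), sd_log_price = sd(log(price)))

If the outliers are behaving the way I suspect, the 3-bedroom group’s SD will look relatively large on the raw scale, with the two groups more comparable once the prices are logged.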

\(\blacksquare\)

Prices of cheese slices

  1. Read in and display (most of) the data.

Solution

The data are in one column, so you can pretend the values are delimited by anything:

my_url <- "http://ritsokiguess.site/datafiles/cheese_prices.txt"
cheese <- read_delim(my_url, " ")
Rows: 13 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
dbl (1): price

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cheese

The one column of data is called price.

\(\blacksquare\)

  1. Make a one-sample boxplot of your dataset. What feature of your graph suggests that a \(t\)-procedure may not be appropriate?

Solution

One quantitative variable would suggest a histogram, but the appearance of your histogram depends very much on the number of bins you choose. A clearer picture in this case comes from a one-sample boxplot, for which you set x to 1 (a constant) and y to what you want to plot:

ggplot(cheese, aes(x = 1, y = price)) + geom_boxplot()

There is a long lower tail with an outlier shown, so the distribution is skewed to the left. (The outlier is only just extreme enough to be an outlier as far as the boxplot is concerned, so I think the best description is that the distribution is skewed to the left.)

Another reasonable plot (if you have been reading ahead, both in this course and in this question) is a normal quantile plot:

ggplot(cheese, aes(sample = price)) + stat_qq() + stat_qq_line()

The most evident problem here is that the two lowest values are noticeably too low: that there are two outliers at the low end. Note that the conclusion here (two low outliers) is different from the conclusion from the boxplot (left-skewed). You could make the case for left-skewed here by saying that the points form something like a curved pattern: too low on the left, on the line in the middle, too low on the right.

None of this discussion has talked about sample size, which really ought to be part of the issue. The sample size here is 13, not (apparently) very big, so there won’t (apparently) be much help from the Central Limit Theorem (but see the bootstrap below).
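About that bootstrap: the same recipe as in the home-prices question shows what the sampling distribution of the sample mean price looks like with only 13 observations (your histogram will of course differ in detail from mine):

tibble(sim = 1:1000) %>%
  rowwise() %>% 
  mutate(my_sample = list(sample(cheese$price, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(x = my_mean)) + geom_histogram(bins = 10)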

\(\blacksquare\)

  1. What is a 95% confidence interval for the median price per ounce of all individually-wrapped slices of cheese (as sold by supermarkets like the one sampled, at about that time)?

Solution

You did read the word “median”, didn’t you?

This uses ci_median from package smmr, which you will need to have installed and loaded. (I didn’t say anything like “build it yourself”, so there is no need to find the interval by trial and error over possible medians; ci_median does that work for you.) The level of 95% is the default, so specify the dataframe and column:

ci_median(cheese, price)
[1] 21.60244 27.99492

From 21.6 to 28.0 cents per ounce.

State your interval and round it off suitably. You’ll recall that the ends of a confidence interval for a median (based on the sign test) are data values, and the data values here are to one decimal, so that is the best number of decimals to quote. You could use two decimals, the second one of which would be zero here. Any greater number of decimals is failing to consider your reader.
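If you want to see where those endpoints come from, sort the prices; the ends of the interval should be at, or very close to, two of the data values:

cheese %>% arrange(price) %>% pull(price)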

\(\blacksquare\)

  1. This store’s own brand of cheese slices was not included in the data. The store manager wants to sell the store’s brand at 21 cents per ounce. Is the median price of all other brands (of which the ones taken are a sample) significantly greater than this?

Solution

A sign test, therefore, testing a null median of 21:

sign_test(cheese, price, 21)
$above_below
below above 
    2    11 

$p_values
  alternative    p_value
1       lower 0.99829102
2       upper 0.01123047
3   two-sided 0.02246094

We want to know whether the median price of all the (other) cheese slices is greater than 21 cents, so the right P-value is the upper-tailed one, 0.011.

That is to say, the cheese slices sampled have a median price per ounce significantly greater than the proposed price of the store brand slices (or, to put it the other way around, the store brand cheese slices will be priced significantly below the median price of the other brands).

\(\blacksquare\)

  1. How could you have known that your test of the previous part would give a significant result? Explain briefly.

Solution

The null median of 21 was outside the 95% confidence interval for the median obtained earlier, so we know that the two-sided P-value would be less than 0.05. In addition, we know that the proposed price for the store brand cheese slices is below the lower end of the interval, so we know we are on the correct side, and therefore the one-sided P-value will be less than \(0.05/2 = 0.025\).

If you did a two-sided test above, you were wrong there, but in this part, something like my first sentence is all you need, since that would answer the question from that point of view.

\(\blacksquare\)

  1. Obtain a 95% confidence interval for the mean price of all cheese slices (of which the ones observed are a sample). Compare this with the confidence interval for the median that you found earlier. What does this tell you about the appropriateness of \(t\) procedures for these data? Explain briefly.

Solution

The \(t\)-interval for the mean is what we did earlier in the course:

with(cheese, t.test(price))

    One Sample t-test

data:  price
t = 24.44, df = 12, p-value = 1.327e-11
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 22.42796 26.81819
sample estimates:
mean of x 
 24.62308 

The interval is 22.43 to 26.82 cents per ounce. (Here you can justify two decimals, since the data had one, and we are now talking about a mean.)

(Coding wise, there is no need to specify a null mean, since we are not doing a test, and there is no need to specify a confidence level, since we are using the default 95%. Saying this shows that you understand what you are doing.)

The interval for the median was 21.6 to 28.0 cents per ounce. This is longer than the one for the mean, but otherwise in about the same place. Make a call about whether you think this is similar to or different from the \(t\) interval for the mean. My take is that the principal difference is in the length of the interval rather than where it is, so that there is no substantial difference between the two intervals, and therefore we should prefer the \(t\)-interval for the mean because it is shorter (and that is because it makes better use of the data). If you want to argue that the intervals are different, then you need to follow it up by saying that we should prefer the sign test, which you can support by looking back at your plot and noting the non-normality. I am happy with either direction, properly argued.
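To make “longer” concrete, you can work out the two lengths from the output above: about 4.4 cents for the \(t\) interval against about 6.4 cents for the interval for the median:

26.81819 - 22.42796
27.99492 - 21.60244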

\(\blacksquare\)

Fuel efficiency comparison

Some cars have on-board computers that calculate quantities related to the car’s performance. One of the things measured is the fuel efficiency, that is, how much gasoline the car uses. On an American car, this is measured in miles per (US) gallon. On one type of vehicle equipped with such a computer, the fuel efficiency was measured each time the gas tank was filled up, and the computer was then reset. Twenty observations were made, and are in http://ritsokiguess.site/datafiles/mpgcomparison.txt. The computer’s values are in the column Computer. The driver also calculated the fuel efficiency by hand, by noting the number of miles driven between fill-ups, and the number of gallons of gas required to fill the tank each time. The driver’s values are in Driver. The final column Diff is the difference between the computer’s value and the driver’s value for each fill-up. The data values are separated by tabs.

  1. Read in and display (some of) the data.

Solution

library(tidyverse)

This is like the athletes data, so read_tsv:

my_url <- "http://ritsokiguess.site/datafiles/mpgcomparison.txt"
fuel <- read_tsv(my_url)
Rows: 20 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
dbl (4): Fill-up, Computer, Driver, Diff

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fuel

20 observations, with columns as promised. The first column Fill-up labels the fill-ups from 1 to 20, in case we needed to identify them. (This is an additional clue that we have paired data, or repeated measures in general.2)

See the Extra for another way to read in these data (and why it works).

\(\blacksquare\)

  1. What is it that makes this paired data? Explain briefly.

Solution

There are two measurements for each fill-up: the computer’s calculation of gas mileage, and the driver’s. These go together because they are two calculations of what is supposed to be the same thing. (Another hint is that we have differences, with the veiled suggestion that they might be useful for something.)

If you want to have a general way of tackling this kind of problem, ask yourself “is there more than one observation of the same thing per individual?” Then go on and talk about what the individuals and observations are for the data you’re looking at. In this case, the individuals are fill-ups, and there are two observations per fill-up: one by the driver and one by the computer. Compare this with the children learning to read (from lecture): there were \(23+21 = 44\) children altogether (individuals), and each child had one reading score, based on the reading method they happened to be assigned to. So those are 44 independent observations, and a two-sample test is the way to go for them.

\(\blacksquare\)

  1. Draw a suitable graph of these data, bearing in mind what you might want to learn from your graph.

Solution

In a matched pairs situation, what matters is whether the differences have enough of a normal distribution. The separate distributions of the computer’s and driver’s results are of no importance. So make a graph of the differences. We are specifically interested in normality, so a normal quantile plot is best:

ggplot(fuel, aes(sample = Diff)) + stat_qq() + stat_qq_line()

The next best plot is a histogram of the differences:

ggplot(fuel, aes(x = Diff)) + geom_histogram(bins = 6)

This one actually looks left-skewed, or has an outlier at the bottom. The appearance may very well depend on the number of bins you choose.
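To see that dependence for yourself, try a different number of bins and watch the shape change (8 here is just an arbitrary alternative to my 6):

ggplot(fuel, aes(x = Diff)) + geom_histogram(bins = 8)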

\(\blacksquare\)

  1. Is there any difference between the average results of the driver and the computer? (Average could be mean or median, whichever you think is best). Do an appropriate test. (2024: you haven’t seen the matched pairs sign test yet, so do the \(t\)-test and think about whether that’s the one you really wanted to do.)

Solution

The choices are a matched-pairs \(t\)-test, or a matched-pairs sign test (on the differences). To choose between those, look back at your graph. My take is that the only possible problem is the smallest (most negative) difference, but that is not very much smaller than expected. This is especially so given the sample size (20), which means that the Central Limit Theorem will help us enough to take care of the small outlier.

I think, therefore, that a paired \(t\)-test is the way to go, to test the null that the mean difference is zero (against the two-sided alternative that it is not zero, since we were looking for any difference). There are two ways you could do this: as literal matched pairs:

with(fuel, t.test(Computer, Driver, paired = TRUE))

    Paired t-test

data:  Computer and Driver
t = 4.358, df = 19, p-value = 0.0003386
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 1.418847 4.041153
sample estimates:
mean difference 
           2.73 

or, as a one-sample test on the differences:

with(fuel, t.test(Diff, mu = 0))

    One Sample t-test

data:  Diff
t = 4.358, df = 19, p-value = 0.0003386
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 1.418847 4.041153
sample estimates:
mean of x 
     2.73 

with the same result.

If you thought that low value was too much of an outlier, the right thing would have been a sign test that the median difference is zero (vs. not zero), thus (using the smmr package):

sign_test(fuel, Diff, 0)
$above_below
below above 
    3    17 

$p_values
  alternative     p_value
1       lower 0.999798775
2       upper 0.001288414
3   two-sided 0.002576828

In any of those cases, we conclude that the average difference is not zero, since the P-values are less than 0.05. (The right one for the sign test is 0.0026.)

I don’t mind which test you do, as long as it is one of the ways of doing matched pairs, and you justify your choice by going back to your graph. Doing one of the tests without saying why is being a STAB22 or STAB57-level statistician (in fact, it may not even be that), not a professional-level applied statistician.

\(\blacksquare\)

  1. The people who designed the car’s computer are interested in whether the values calculated by the computer and by the driver on the same fill-up are usually close together. Explain briefly why it is that looking at the average (mean or median) difference is not enough. Describe what you would look at in addition, and how that would help.

Solution

The mean (or median) difference can be close to zero without the individual differences being close to zero. The driver could be sometimes way higher and sometimes way lower, and the average difference could come out close to zero, even though each individual pair of measurements is not that close together. Thus looking only at the average difference is not enough.
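A tiny made-up example of the problem (these numbers are invented purely for illustration):

demo_diffs <- c(5, -5, 4, -4)
mean(demo_diffs)
sd(demo_diffs)

The mean difference is exactly zero, which looks fine, but the SD is over 5, so the two measurements in each pair are actually far apart.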

There are (at least) two (related) things you might choose from to look at, in addition to the mean or median:

  • the spread of the differences (the SD or inter-quartile range). If the average difference and the spread of differences are both close to zero, that would mean that the driver’s and computer’s values are consistently similar.
  • a suitable confidence interval here (for the mean or median difference, as appropriate for what you did) would also get at this point.

To think about the spread: use the SD if you did the \(t\)-test for the mean, and use the IQR if you did the sign test for the median, that is, one of these:

fuel %>% summarize(diff_sd = sd(Diff))

or

fuel %>% summarize(diff_iqr = IQR(Diff))

as appropriate. Make a call about whether you think your measure of spread is small or large. If you think it’s small, you can say that the differences are consistent with each other, and therefore that the computer’s and driver’s values are typically different by about the same amount (we concluded that they are different). If you think your measure of spread is large, then the driver’s and computer’s values are inconsistently different from each other (which would be bad news for the people who designed the car’s computer, as long as you think the driver was being reasonably careful in their record-keeping).

If you said that the confidence interval for the mean/median was the thing to look at, then pull the interval out of the \(t\)-test output or run ci_median, as appropriate. The computer’s miles per gallon is consistently between 1.4 and 4.0 miles per gallon higher (mean) or 1.1 and 4.5 miles per gallon higher (median). Make a call about whether you consider this interval short or long, bearing in mind it’s based on the difference between two numbers that are about 40, and then say that the computer’s measurement is larger than the driver’s by a consistent amount (if you think the interval is short) or by an amount that varies quite a bit (if you think the interval is long).
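If you went the sign-test route, the matching interval comes from ci_median on the differences, which ought to land somewhere near the 1.1 to 4.5 quoted above:

ci_median(fuel, Diff)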

\(\blacksquare\)

Footnotes

  1. At least, for somewhere that is not Toronto!↩︎

  2. If we had 40 independent miles-per-gallon measurements, there would be no reason to have the Fill-up column, because there would be no basis to link a Computer value with a particular Driver one.↩︎