STAC32 Assignment 5

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

Shrimp

The data in http://ritsokiguess.site/datafiles/shrimp.csv are 18 determinations of the amount (percentage of the declared total weight) of shrimp in shrimp cocktail.

  1. (1 point) Read in the data and display at least some of it.

As ever:

my_url <- "http://ritsokiguess.site/datafiles/shrimp.csv"
shrimp <- read_csv(my_url)
Rows: 18 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): percent

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
shrimp

There are indeed 18 observations, and the one column is called percent.

  1. (3 points) Draw a normal quantile plot of these data.

Thus:

ggplot(shrimp, aes(sample = percent)) + stat_qq() + stat_qq_line()

  1. (2 points) What do you learn from your normal quantile plot? Explain briefly.

There are two possible answers here:

  • the distribution has two outliers, one at the upper end and one at the lower.
  • the distribution has long tails compared to the normal.

I don’t think there is enough data to say that one of these answers is definitely better than the other. The way to (try to) decide is to ask yourself whether those two extreme values seem definitely different from the rest of the distribution (that is, the high one is much higher than the other values and the low one is much lower), or whether they are part of a long tail. My take (from which I have no problem with you disagreeing) is that the low value is definitely lower than the others (the ones next to it are close to the line), but the high value might be part of a long upper tail because the next few values are also noticeably above the line, and they are getting further from the line as you move rightward. That is to say, there is a trend of observations getting less like what you would expect from a normal distribution.

I don’t think you can say that the top right observation is close to the line, even if you think the next three or so observations are, so that you have to say something (at least) about the top right and bottom left observations.

Extra: I obtained the bootstrap sampling distribution of sample mean, to assess whether I should be concerned about those outliers/long tails with a sample size of 18:

tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(shrimp$percent, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()

This is a little bit skewed to the left, to reflect that the lower outlier is more low than the high outlier is high. In fact, the high outlier is not really a problem: if it were, there would also be a long upper tail to the sampling distribution of the sample mean, which there really isn’t.

Shrimp continued

For the shrimp data (that you just drew a normal quantile plot of), do the following questions. You may use functions from the smmr package as needed.

  1. (2 points) Explain briefly why the median would be a better summary statistic than the mean for these data.

The distribution of percent values has (depending on how you saw it) upper and lower outliers or long tails. The mean can be distorted by unusual values; the difficulty here is that it could be pulled up or down depending on whichever outlier/tail turns out to have the bigger influence. You might say that the mean could be pulled “out” from where it should be near the middle of the data.

The median is not affected by outliers or long tails, so it would be a trustworthy measure of “centre”.

I am not asking about whether we should be doing a \(t\)-test; this is like the IRS example from lecture where we said that the median is a better summary of that right-skewed data than the mean is, without thinking about \(t\)-tests or sample sizes at all.

  1. (3 points) The process by which the shrimp cocktail is made is supposed to produce shrimp cocktail with a median of 30.5 percent shrimp. Test whether the process is working correctly or not. What do you conclude, in the context of the data?

To test a median, use the sign test, and use the one from smmr (most easily):

sign_test(shrimp, percent, 30.5)
$above_below
below above 
    2    16 

$p_values
  alternative      p_value
1       lower 0.9999275208
2       upper 0.0006561279
3   two-sided 0.0013122559

The null hypothesis is that the (population) median is 30.5, and the alternative is that the median is different from 30.5. (A two-sided alternative is correct because we want to know about any departure from 30.5.) The two-sided P-value is 0.0013 (say what it is, so that it is clear that you have the correct one). This is smaller than 0.05, so we reject the null in favour of the alternative, and thus conclude that the median percentage of shrimp in the shrimp cocktail is not equal to 30.5, and hence that the process is not working correctly.

Extra: we are not interested in doing a \(t\)-test here because we are not interested in saying anything about the population mean, but if we were:

with(shrimp, t.test(percent, mu = 30.5))

    One Sample t-test

data:  percent
t = 2.9792, df = 17, p-value = 0.00842
alternative hypothesis: true mean is not equal to 30.5
95 percent confidence interval:
 30.87773 32.71116
sample estimates:
mean of x 
 31.79444 

The P-value of 0.0084 is small, but not as small as the one from the sign test. The reason for that difference is probably that the standard deviation is inflated by the outliers/long tails, so that we are less certain about what the mean is, compared to the median. (That is to say, if you think about the sampling distribution of the sample mean, it will get pulled up or down further by the extreme values.)

  1. (2 points) How is something else in your test output consistent with the results of the test you just did? Explain briefly.

The other thing in the output is the count of data values above and below the null median. Here, 16 of them are above 30.5 and only 2 of them are below. This is very far from an even split, and so is consistent with the population median being something other than 30.5 (presumably, something bigger).

  1. (2 points) Obtain a 90% confidence interval for the median percent of shrimp in this shrimp cocktail.

This is ci_median, but with a non-default confidence level, so you have to say what it is:

ci_median(shrimp, percent, conf.level = 0.90)
[1] 31.20586 32.39531

31.2 to 32.4. Minus a half point if you don’t make some attempt to round it off. The best thing here is to note that the ends of a confidence interval for the median are either data points or (occasionally) halfway between data points, so you can round off to the same accuracy as the data (one decimal place, here). Or you can give a second decimal, which will (here, and usually) be zero.

Extra: comparing the one for the mean:

with(shrimp, t.test(percent, conf.level = 0.90))

    One Sample t-test

data:  percent
t = 73.175, df = 17, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
 31.03859 32.55030
sample estimates:
mean of x 
 31.79444 

This is a little longer than the one for the median, which is consistent with what we said before: the mean is estimated less accurately than the median is.

  1. (1 point) How are your confidence interval and your hypothesis test consistent with each other? Explain (very) briefly.

30.5 was rejected and is also outside the confidence interval. (“30.5 is outside the confidence interval” is enough. One-point questions are only looking for one small thing.)

Extra: this was bound to happen, because of the way we defined the confidence interval for the median: if a value like 30.5 is rejected, it must be outside the confidence interval.