STAC32 Assignment 5

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

1 Concrete

The data in http://ritsokiguess.site/datafiles/ex14.23.csv are measurements of 7-day flexural strength of nonbloated burned clay aggregate concrete samples (psi). I don’t know any more than you do what that is, except to say that it is strength of concrete, in pounds per square inch.

(a) (1 point) Read in and display (some of) the data.

Nothing at all surprising here:

library(tidyverse) # read_csv lives here; load once per session
my_url <- "http://ritsokiguess.site/datafiles/ex14.23.csv"
strengths <- read_csv(my_url)
Rows: 30 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): strength

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
strengths

Note that I called the dataframe strengths, and the one column in it is called strength (singular), so that it is clear whether I am referring to the dataframe or the column within it. It might have been less confusing to call the dataframe concrete or something like that. Use a name that tells you what is in the dataframe.

(b) (3 points) Draw a suitable graph of the data, and comment briefly on why you might have doubts about running a t-procedure (test or confidence interval) here.

A histogram is the obvious first choice (but there are others; see below):

ggplot(strengths, aes(x = strength)) + geom_histogram(bins = 7)

Is that one unusually large value, or is it indicative of a shorter right tail and we happened not to observe any values around 600? Also, for the rest of the values, are they skewed to the left apart from the (apparent) outlier?

The key thing is to observe that the distribution is not normal in shape in some fashion: the upper outlier, or the otherwise left-skewed shape, or both. The observation you make may depend on the number of bins you choose for your histogram. I chose 7 bins after some experimentation; 5 looks like this:

ggplot(strengths, aes(x = strength)) + geom_histogram(bins = 5)

I think the shape is less clear here; the outlier we saw before has gotten swallowed up into what looks like a long right tail. This, to my mind, is too few bins to really get a good sense of the shape. (5 bins is less than what Sturges’ rule says, and that is for bell-shaped data; here, we have something non-normal happening, and we need more bins to see what it is.)
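Extra: if you want to check what Sturges' rule actually says, base R has it built in as `nclass.Sturges`, which computes ceiling(log2(n) + 1); for n = 30 that comes out to 6 bins:

```r
# Sturges' rule depends only on the sample size, so any numeric vector
# of length 30 gives the same answer as the strength data would.
nclass.Sturges(1:30)
# 6
```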

I thought this would be too many bins:

ggplot(strengths, aes(x = strength)) + geom_histogram(bins = 10)

but actually it shows off the shape very clearly: something bell-shaped, with possibly a slightly longer left tail, and a clear outlier. So you can go up to about this many bins. The reason this works, I think, is that when you have an outlier, which is by its nature separate from the rest of the distribution, you are going to get some empty bins. Not counting the bin with the single outlier and the two empty bins next to it, there are seven bins for the other 29 observations, which is a reasonable number to see the shape of their distribution.

One point for a reasonable graph; some other possibilities are below:

A one-sample boxplot:

ggplot(strengths, aes(x = 1, y = strength)) + geom_boxplot()

which does a better job of indicating a slightly left-skewed shape plus an outlier at the other end.

A normal quantile plot is also a good choice, now that you know how to draw one:

ggplot(strengths, aes(sample = strength)) + stat_qq() + stat_qq_line()

The impression from this one is that the main problem is with the high outlier; the rest of the points are close to the line. If anything, the tails are slightly longer than a normal, but the outlier is the problem.

So: (i) make a sensible choice of graph, with a sensible number of bins (to show the shape) if you drew a histogram, (ii) make some kind of comment about non-normality, supported by your graph. The outlier is actually the biggest problem; the long left tail will be taken care of by the sample size. Hence, (iii), you should add a comment about sample size here (30): you can argue that the sample size may not be large enough to overcome the effect of the outlier.

(c) (3 points) The median 7-day flexural strength of nonbloated burned clay aggregate concrete is supposed to be 450 (psi), and it is a problem if the (population) median is less than this. Is there any evidence of a problem here?

This is implying a test for the population median, which means a sign test. The null hypothesis is that the median is 450, and the alternative is that the median is less, which means a lower-tailed test. The way I coded the sign test, you run the test the same way, but read the output according to the sidedness you want:

sign_test(strengths, strength, 450)
$above_below
below above 
   19    10 

$p_values
  alternative    p_value
1       lower 0.06802297
2       upper 0.96928583
3   two-sided 0.13604595

The P-value is 0.068, which is not quite less than 0.05, so we do not reject the null hypothesis, and therefore do not have evidence of a problem. Make sure you give the P-value you are using, so that we can be sure you have the right one.

The aboves and belows look rather unbalanced: 19 values below 450 and only 10 above,¹ but this is not unbalanced enough to reject the null hypothesis.
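Extra: if you want to see where that lower-tailed P-value comes from, the sign test is just a binomial test on the counts: under a null median of 450, each non-tied observation is above 450 with probability one-half, so the P-value is P(X ≤ 10) where X has a binomial distribution with n = 29 and p = 0.5. Base R's binom.test reproduces it:

```r
# 10 values above 450, out of the 29 observations not exactly equal to 450.
# Lower-tailed P-value: P(X <= 10), X ~ Binomial(29, 0.5).
binom.test(10, 29, p = 0.5, alternative = "less")$p.value
# 0.06802297, matching the lower P-value from sign_test
```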

If I were advising on this project, I would give this conclusion, but also offer a warning that the P-value is rather small, so that they should keep an eye on things in case it gets any worse; the result is on the low end of acceptable.

(d) (2 points) Obtain a 99% confidence interval for the population median 7-day flexural strength of nonbloated burned clay aggregate concrete. Explain briefly how your confidence interval is consistent with the test result.

This is ci_median. Don’t forget to specify the confidence level, since it is not the default 95%:

ci_median(strengths, strength, conf.level = 0.99)
[1] 374.0084 459.9969

The interval is from 374 to 460 psi (rounded off: the data are integers, so give at most one decimal place, or remember that the ends of a CI for the median are data values; any more decimals are meaningless noise). This interval is consistent with the test result because it contains the value 450, so that both procedures are saying that a median of 450 is consistent with the data.

Extra: I slid something by you: the level of the confidence interval does not correspond to the \(\alpha\) of the test, and also the test is one-sided, so it isn’t easily comparable to the interval anyway. To be more careful: a confidence interval is a two-sided thing, so we should use the two-sided P-value of 0.136. At \(\alpha = 0.01\), which is what goes with a 99% CI, we would be nowhere near rejecting the null median of 450, and therefore 450 should be clearly within a 99% CI, as it is.

Extra extra: if you do a confidence interval at level \(1-0.136 = 0.874\), therefore, the null median 450 should be right on the edge of the interval:

ci_median(strengths, strength, conf.level = 0.874)
[1] 386.0067 449.9952

and so it is; the upper end of the interval is exactly 450 to within rounding error.

(e) (3 points) Obtain and plot a bootstrap sampling distribution for the sample mean for these data. Does the result make you wonder about the need to run a sign test? Explain briefly.

Use the usual procedure to obtain the bootstrap distribution, as below. The plot we know of so far is a histogram:

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(strengths$strength, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(x = my_mean)) + geom_histogram(bins = 10)

I think that looks pretty much bell-shaped, and therefore the t-test would have been all right here. In a situation where the t-test and the sign test can both be done, the t-test is more powerful and is therefore preferred.

Make a call based on the shape of your plot. You might see a little right-skewness (the residual effect of the outlier), and if you see enough to bother you, then say that our decision to do a sign test² was a good one.
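Extra: if you want to see the power comparison in action, here is a small simulation sketch (my own aside; the true mean of 430 and SD of 60 are made-up values for illustration): generate normal samples whose mean really is less than the null 450, and count how often each test rejects at \(\alpha = 0.05\):

```r
# Power comparison sketch: normal data with true mean 430, below the null 450.
# The mean and SD here are made up purely for illustration.
set.seed(457)
rejections <- replicate(1000, {
  x <- rnorm(30, mean = 430, sd = 60)
  t_p <- t.test(x, mu = 450, alternative = "less")$p.value
  s_p <- pbinom(sum(x > 450), length(x), 0.5)  # lower-tailed sign test P-value
  c(t_test = t_p < 0.05, sign_test = s_p < 0.05)
})
rowMeans(rejections)  # rejection rates for each test
```

With these settings the t-test's rejection rate should come out higher than the sign test's, which is what "more powerful" means here: when the null really is wrong and the data really are normal, the t-test detects it more often.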

Extra 1: I was quite happy with a histogram here, since the “official” normal quantile plot question is the next one. But if you want to give it a try, below is how it goes. The tails of the distribution are rather variable (and the tails are what we care about), so I think 10,000 simulations (or more) is better:

tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(strengths$strength, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()

The distribution is a tiny bit skewed to the right still (or has a long right tail), but my take is that this is a tiny enough deviation from normality not to be worried about.

Extra 2: what would happen if we did a t-test for a mean of 450 against a one-sided alternative?

with(strengths, t.test(strength, mu = 450, alternative = "less"))

    One Sample t-test

data:  strength
t = -1.8284, df = 29, p-value = 0.0389
alternative hypothesis: true mean is less than 450
95 percent confidence interval:
     -Inf 447.9636
sample estimates:
mean of x 
    421.2 

The mean is significantly less than 450, at \(\alpha = 0.05\). This, depending on your point of view, means either that we should have done a t-test in the first place (because the sampling distribution of the sample mean was normal), or is complete nonsense (because the sampling distribution of the sample mean was not close enough to normal).

The confidence interval looks like this (re-doing the test two-sided):

with(strengths, t.test(strength, conf.level = 0.99))

    One Sample t-test

data:  strength
t = 26.74, df = 29, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
99 percent confidence interval:
 377.7831 464.6169
sample estimates:
mean of x 
    421.2 

This does include 450, which seems inconsistent with the test.³ But the confidence level does not match up with the \(\alpha\) we used for the test. For our one-sided test, we put all our 0.05 at one end, so to make a two-sided confidence interval, we have to have that much \(\alpha\) at both ends, and thus the corresponding CI is a 90% one:

with(strengths, t.test(strength, conf.level = 0.90))

    One Sample t-test

data:  strength
t = 26.74, df = 29, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
 394.4364 447.9636
sample estimates:
mean of x 
    421.2 

This doesn’t quite go up to 450, and thus is consistent with the t-test.

2 Blood viscosity

The Institute of Nutrition of Central America and Panama (INCAP) has carried out extensive dietary studies and research projects in Central America. In one study reported in the November 1964 issue of the American Journal of Clinical Nutrition (“The Blood Viscosity of Various Socioeconomic Groups in Guatemala”), serum total cholesterol measurements for a sample of 49 low-income rural indigenous people were reported. The data are in http://ritsokiguess.site/datafiles/xmp14.10.csv.

(a) (1 point) Read in and display (some of) the data.

As usual:

my_url <- "http://ritsokiguess.site/datafiles/xmp14.10.csv"
viscosity <- read_csv(my_url)
Rows: 49 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): cholesterol

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
viscosity

There is one column called cholesterol, and 49 observations of it as promised.

(b) (2 points) Draw a histogram of the data.

You will probably have to experiment with the number of bins. Find one that shows whatever you think the shape is.

ggplot(viscosity, aes(x = cholesterol)) + geom_histogram(bins = 7)

This is not completely normal: a bit skewed to the right, with maybe too many observations above 200 compared to a normal distribution. (You don’t need to say this: comment is coming below. But it’s worth your while to think it now, so that you have an idea of what to say later.)

(c) (2 points) Draw a normal quantile plot of the data.

Thus:

ggplot(viscosity, aes(sample = cholesterol)) + stat_qq() + stat_qq_line()

This is actually not far off normal, just a slightly longer right tail (to ponder in a moment).

(d) (2 points) Comment briefly on how any non-normality shows up in each of your two plots.

My histogram seems to be showing right skewness, while the normal quantile plot is showing more of a long right tail. These are both consistent with a right tail that is longer than you would expect from a normal distribution.

Extra: It is perhaps reasonable to trust the normal quantile plot a bit more when assessing normality (since that is its main job). This one says that the highest values are a bit too high (that is, the right tail is too long), but, compared to the normal, the left tail is pretty much as you would expect. Looking at the histogram, the lower tail looks too short for a normal, but that depends on things like the standard deviation of the normal distribution we’re comparing it with, and is hard to judge on a histogram. (Also, the large number of data values around 140 shows up on the normal quantile plot as the points dipping below the line between 140 and 160, which you might also take as evidence of skewness (a curve) on that plot. If the values below 120 were more bunched up, this would definitely be skewness.)
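Extra extra: one way to calibrate your eye for normal quantile plots (an aside, not part of the question; the mean of 160 and SD of 30 below are made-up values, not estimated from the data) is to draw the same plot for a sample of the same size that really is normal, and see how much wobble about the line is typical:

```r
library(tidyverse) # if not already loaded
set.seed(1) # arbitrary seed, so the picture is reproducible
# made-up mean and SD, purely for illustration
fake <- tibble(y = rnorm(49, mean = 160, sd = 30))
ggplot(fake, aes(sample = y)) + stat_qq() + stat_qq_line()
```

Even a genuinely normal sample of size 49 strays a little from the line, especially in the tails, which helps you judge how much deviation on the real plot is worth worrying about.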

Footnotes

  1. and evidently one value exactly equal to 450, which was discarded.

  2. I asked you to do one, so it was really my decision rather than ours, but anyway.

  3. The interval is very like the one for the median, just shifted up a bit.