STAC32 Assignment 5

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

1 Concrete

The data in http://ritsokiguess.site/datafiles/ex14.23.csv are measurements of 7-day flexural strength of nonbloated burned clay aggregate concrete samples (psi). I don’t know any more than you do what that is, except to say that it is strength of concrete, in pounds per square inch.

(a) (1 point) Read in and display (some of) the data.

Nothing at all surprising here:

library(tidyverse)
my_url <- "http://ritsokiguess.site/datafiles/ex14.23.csv"
strengths <- read_csv(my_url)
Rows: 30 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): strength

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
strengths

Note that I called the dataframe strengths, and the one column in it is called strength (singular), so that it is clear whether I am referring to the dataframe or the column within it. It might have been less confusing to call the dataframe concrete or something like that. Use a name that tells you what is in the dataframe.

(b) (3 points) Draw a suitable graph of the data, and comment briefly on why you might have doubts about running a t-procedure (test or confidence interval) here.

A histogram is the obvious first choice (but there are others; see below):

ggplot(strengths, aes(x = strength)) + geom_histogram(bins = 7)

Is that one unusually large value, or is it indicative of a longer right tail where we happened not to observe any values around 600? Also, are the rest of the values skewed to the left, apart from the (apparent) outlier?
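One other way to look at these questions is a normal quantile plot, where an upper outlier shows up as a point well above the line at the right-hand end. This is only a sketch, not necessarily the graph intended here: it reads the data in again so that it stands on its own, and the name qq_plot is my own choice for illustration.

```r
library(tidyverse)

# Read the data again so that this chunk runs by itself
my_url <- "http://ritsokiguess.site/datafiles/ex14.23.csv"
strengths <- read_csv(my_url, show_col_types = FALSE)

# Normal quantile plot: points are the data, the line is where
# they would fall if the distribution were normal
qq_plot <- ggplot(strengths, aes(sample = strength)) +
  stat_qq() +
  stat_qq_line()
qq_plot
```

Points that curve away from the line at either end suggest a tail that is longer or shorter than a normal distribution would have.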

The key thing is an observation that the distribution is not normal in shape in some fashion: the upper outlier, or the otherwise left-skewed shape, or both. The observation you make may depend on the number of bins you choose for your histogram. I chose 7 bins after some experimentation; 5 looks like this:

ggplot(strengths, aes(x = strength)) + geom_histogram(bins = 5)

I think the shape is less clear here; the outlier we saw before has been swallowed up into what looks like a long right tail. This, to my mind, is too few bins to get a good sense of the shape. (5 bins is fewer than Sturges’ rule suggests, and that rule is meant for bell-shaped data; here, we have something non-normal happening, and we need more bins to see what it is.)
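To see where Sturges’ rule lands for these data, here is a small sketch of the arithmetic. The rule takes the number of bins to be the base-2 log of the sample size, plus 1, rounded up; the sample size of 30 comes from the row count reported when the data were read in, and the name sturges_bins is my own.

```r
# Sturges' rule: number of bins = ceiling(log2(n) + 1)
n <- 30  # number of rows reported by read_csv
sturges_bins <- ceiling(log2(n) + 1)
sturges_bins

# Base R has a helper that does the same calculation,
# using only the length of the vector it is given
grDevices::nclass.Sturges(seq_len(n))
```

Both calculations give 6, so 5 bins is one fewer than the rule suggests, and the 7 I used is one more.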

I thought this would be too many bins:

ggplot(strengths, aes(x = strength)) + geom_histogram(bins = 10)