STAC33 Assignment 4

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

One- and two-sample \(t\)

1 Wind turbine blades

The vibratory stress of wind turbine blades under certain conditions should be no greater than 14.5 (in the units used in our data), and it is desirable if it is less than that. Ten observations of vibratory stress are in the file http://ritsokiguess.site/datafiles/ex06.15.csv.

(a) (2 points) Read in and display the data.

Nothing surprising here:

my_url <- "http://ritsokiguess.site/datafiles/ex06.15.csv"
turbine <- read_csv(my_url)
Rows: 10 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): vib_stress

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
turbine

The column of vibratory stresses is called vib_stress.

(b) (2 points) Make a one-sample boxplot of the data (hint: use x = 1 in the code for your plot).

A boxplot needs an x and a y, and we only have one variable, so follow the hint and use x = 1 for the other one:

ggplot(turbine, aes(x = 1, y = vib_stress)) + geom_boxplot()

Extra: it’s difficult to make a histogram with only ten observations. The best I could do was this:

ggplot(turbine, aes(x = vib_stress)) + geom_histogram(bins = 6)

which is probably too many bins really, but if you use fewer, you get even less of a sense of the shape. (I think this plot exaggerates the right-skewness, also.)
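
If you want to see what I mean, try something like this:

ggplot(turbine, aes(x = vib_stress)) + geom_histogram(bins = 4)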

(c) (2 points) The statistician on this project decided to run a \(t\)-test. Why is that a sensible idea? Explain briefly.

This time, I have told you which side of the decision to take, and it’s your job to come up with a reason for the statistician’s decision. The usual two considerations:

  • The distribution of vibratory stresses is slightly skewed to the right (the upper whisker is longer than the lower one, but there are no outliers indicated).
  • The sample size of 10 is not large, but may be large enough to overcome this small amount of skewness.

Some sort of discussion like that. You should get to the point that the sample size, even though it is not large, is (or may be) large enough (because the Central Limit Theorem is actually rather powerful).

(d) (3 points) Is there evidence that the mean vibratory stress is less than 14.5? What do you conclude, in the context of the data?

The “evidence” thing means to do a hypothesis test. The thing we are looking for evidence of goes in the alternative hypothesis. Let \(\mu\) denote the population mean vibratory stress; then, in symbols:

  • \(H_0: \mu = 14.5\) (or if you prefer, \(H_0: \mu \ge 14.5\))
  • \(H_a: \mu < 14.5\).

The null must have an equals sign in it somewhere, and the alternative must have a strict inequality. You can write the hypotheses in words if you prefer, but if you use a symbol like \(\mu\), you must say what that symbol represents. The last thing your reader wants to be doing is wondering “what is \(\mu\)?” while trying to make sense of what you have written.

Then do the test, using with as below, or alternatively the dollar-sign notation (as in turbine$vib_stress, where the first thing is the name you gave your dataframe, and the second is the name of the (one) column in it).

with(turbine, t.test(vib_stress, mu = 14.5, alternative = "less"))

    One Sample t-test

data:  vib_stress
t = -2.0791, df = 9, p-value = 0.03368
alternative hypothesis: true mean is less than 14.5
95 percent confidence interval:
     -Inf 14.12142
sample estimates:
mean of x 
     11.3 
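
If you like the dollar-sign notation better, this should give identical output:

t.test(turbine$vib_stress, mu = 14.5, alternative = "less")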

The P-value is 0.034, less than 0.05, so we reject the null in favour of the alternative, and conclude that the mean vibratory stress is less than 14.5.

As I said in the worksheet, the key things here are the P-value,1 and a conclusion in the context of the data, so that the people who collected the data know what to do next. In this case, the vibratory stress is a “desirable” value, so they may not have to do anything. You also need to make it clear what the hypotheses are, either by saying what they are, or by drawing a conclusion that shows you know what null hypothesis you are rejecting.

Extra: By the time you get to this, you may already have seen in class the bootstrap sampling distribution of the sample mean. Here, it looks like this:

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(turbine$vib_stress, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) -> d
d
ggplot(d, aes(x = my_mean)) + geom_histogram(bins = 10)

That looks close to normal, indicating that the sample size was big enough to overcome the mild skewness. (The Central Limit Theorem often works better than you expect.)
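
If you have already met normal quantile plots (there is one in the last question below), you could look at the bootstrap means that way too:

ggplot(d, aes(sample = my_mean)) + stat_qq() + stat_qq_line()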

(e) (2 points) Obtain a 90% confidence interval for the mean vibratory stress. Ensure that you state your confidence interval suitably for your reader.

Redo t.test, but this time two-sided, and with a conf.level because the confidence level is not the default 95%:

with(turbine, t.test(vib_stress, conf.level = 0.90))

    One Sample t-test

data:  vib_stress
t = 7.3418, df = 9, p-value = 4.365e-05
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
  8.478582 14.121418
sample estimates:
mean of x 
     11.3 

(or use the dollar sign again). The last sentence of the question means that you need to pull the confidence interval out of the output and round it off suitably. Go back and look to see that the data values are given to two decimal places, so you can use two or three in the confidence interval. Hence the mean vibratory stress is between 8.48 and 14.12, or between 8.479 and 14.121.
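
If you would rather pull the interval out in code than read it off the screen, the object that t.test returns has a component called conf.int that you can grab and round, something like this:

ci <- with(turbine, t.test(vib_stress, conf.level = 0.90))$conf.int
round(ci, 2)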

Extra 1: note that the null mean of 14.5 is outside this interval, which is consistent with our test result, which said we should reject 14.5 as a mean. We should perhaps be careful about this, though, because the rejection was at \(\alpha = 0.05\) in a one-sided test, and the confidence level here is 90% (and our confidence intervals in this course are two-sided).

Extra 2: a rather long digression that extends ideas from B57 (that you might care to read through, maybe later):

In the original problem, the data were asserted to have come from a Rayleigh distribution, with density function \(f(x) = (x/\theta) \exp(-x^2/(2\theta))\) for \(x > 0\), and in that case, the goal is to estimate \(\theta\). There is (inevitably) a Wikipedia page about it, but our parameter \(\theta\) is called \(\sigma^2\) there. Showing that it is a density function (i.e., that it integrates to 1) is not too bad: substitute \(u = x^2 / (2\theta)\) and note that \(du\) will cancel out the term in front of the exponential. Finding \(E(X)\) and \(E(X^2)\) is more of a challenge, but I’m guessing that the same substitution and some integration by parts should crack it.

It turns out that \(E(X^2) = 2\theta\). This leads to a method of moments style estimator of \(\theta\), which is presumably going to be based on \(\sum_{i = 1}^n x_i^2\), where the \(x_i\) are the data I gave you and thus \(n = 10\). Let’s work out \(E(\sum x_i^2)\) and see where that leads:

\(E(\sum x_i^2) = \sum E(x_i^2) = 2n\theta\)

and hence we can make an unbiased estimate of \(\theta\) as \(\hat\theta = (1/2n) \sum_{i = 1}^n x_i^2\), because this by design has \(E(\hat\theta) = \theta\). For our data \(\hat\theta\) takes the value

turbine %>% 
  summarize(theta_hat = sum(vib_stress^2) / 2 / length(vib_stress))

According to Wikipedia, the mean of the distribution should be \(\sqrt{\pi/2}\) times the square root of this:

turbine %>% 
  summarize(theta_hat = sum(vib_stress^2) / 2 / length(vib_stress)) %>% 
  mutate(mean = sqrt(pi/2) * sqrt(theta_hat))

and that is pretty close to the middle of the confidence interval we got before for the mean, so it makes sense. (I didn’t have another good intuition for whether that value of theta_hat was reasonable.)

This is not necessarily the same as the maximum likelihood estimate of \(\theta\). You can try to get hold of it via calculus if you like. It seems to come out all right; my shaky calculus suggests that the moment estimator based on \(\sum x_i^2\) actually is the MLE, but I wanted to show you how to get hold of the MLE numerically using R, no calculus required.

I haven’t shown you functions in R yet, but the idea, though crucial here, is not very different from the same idea in Python. Let’s write a function that evaluates the density function at a given x and \(\theta\) first:

rayleigh <- function(x, theta) {
  (x/theta) * exp(-x^2/2/theta)
}

In R, the name of the function goes on the left, the inputs to the function go after the special word function, and the body of the function is enclosed in curly brackets (that would be indented in Python). I actually indented mine in R as well; that serves no syntactic purpose (unlike in Python), except that it makes it easier to see what’s going on. R actually does have a return function that works like Python’s, but it’s better R style to evaluate something on the last line of the function, not save it anywhere, and that is what gets returned. Thus, this function is really a one-liner (and could have been written on one line), but my functions tend to get more complex as they develop,2 so I prepare for them to have more than one line from the start.
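
Before worrying about the shape, a quick numerical check of the earlier claim that this is a density, that is, that it integrates to 1 (using our estimated theta of 74.5, though any positive value ought to work):

integrate(rayleigh, lower = 0, upper = Inf, theta = 74.5)

which should come out as 1, up to numerical error.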

I should test this function. A sensible check is to plug in the estimate of theta that we got earlier (74.5) and see that the result looks like the right kind of thing over the range of our data, which runs from near 0 up to

turbine %>% 
  summarize(stress_max = max(vib_stress))

So let’s evaluate our function at a bunch of values of x over about that range, and draw a graph of it:

tibble(x = seq(0, 20, 0.5), theta = 74.5)

which shows you what the seq does, and then

tibble(x = seq(0, 20, 0.5), theta = 74.5) %>% 
  mutate(density = rayleigh(x, theta)) %>% 
  ggplot(aes(x = x, y = density)) + geom_line()

which looks a bit right-skewed with a peak a bit less than 10, so I pronounce myself satisfied.

Aside: that looks like a nice smooth curve, but what I actually did was to join the adjacent points together with lines, so that what looks like a curve is actually lots of little straight lines joined together.

Now, how are we going to use this? We have to do two things: we have to use our function rayleigh to work out the log likelihood, and then we have to maximize our log likelihood for theta, given the data we observed. The first part is not quite obvious, so let’s play around a bit first and then write our function:

Start with our data:

turbine

We want the density evaluated at each of those values of vib_stress, for the given theta (for now, we’ll use 74.5 again):

turbine %>% 
  mutate(density = rayleigh(vib_stress, 74.5))

The likelihood is the product of those density values, which as you can see will be very small and may even “underflow” on the computer (be indistinguishable from zero). We are going to take the log of the likelihood anyway, so instead of taking the product and then logging at the end, let’s log each of those density values first and then add up the log-densities. This will be numerically a lot more stable:

turbine %>% 
  mutate(density = rayleigh(vib_stress, 74.5)) %>% 
  summarize(loglik = sum(log(density)))

The log-likelihood is seriously negative, as it generally is.
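
To see the numerical point, compare the product of the densities with the sum of their logs. With only ten observations the product does not actually underflow here, but you can see how small it already is:

turbine %>% 
  mutate(density = rayleigh(vib_stress, 74.5)) %>% 
  summarize(likelihood = prod(density), loglik = sum(log(density)))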

Two more things, before we turn this into a function:

  • our function optimizer is actually a minimizer, so let’s return the negative of the log likelihood from our function instead. Minimizing that is the same as maximizing the likelihood itself.
  • our code above still has the log-likelihood as a dataframe, but we need it as a single number, so use pull to get it out.

That brings us to this function, which uses the function rayleigh that we wrote earlier:

negloglik <- function(theta, d) {
  d %>% 
    mutate(density = rayleigh(vib_stress, theta)) %>% 
    summarize(mloglik = -sum(log(density))) %>% 
    pull(mloglik)
}

This takes two inputs: a value of theta at which to evaluate the (negative) log likelihood, and a dataframe with a column in it called vib_stress. I had to make a couple of edits to my earlier code: the input dataframe is now called d, and the value 74.5 that I had before is replaced by the input theta. The log-likelihood is a function primarily of the parameter, so I have made theta the first input here (contrast with what I did in rayleigh), which turns out to make the optimization (below) go more smoothly.
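
As a quick spot check, running the function at our estimate should reproduce the value we calculated a moment ago, with the sign flipped:

negloglik(74.5, turbine)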

We can also draw a graph of this to test it. Since this is the negative log likelihood, it should have a minimum in the middle. Our estimate of theta from before was 74.5, so let’s use some values of theta between 50 and 100:

tibble(theta = seq(50, 100, 1)) %>% 
  rowwise() %>% 
  mutate(mloglik = negloglik(theta, turbine)) %>% 
  ggplot(aes(x = theta, y = mloglik)) + geom_line()

which does indeed have a minimum, apparently at about 75. The graph also suggests that the minimum is unique, with no local minima to get stuck in. Numerical minimization routines tend to be based on the local behaviour of the function, so they can “converge” on a local minimum: the function is not lower anywhere nearby, and the routine can easily conclude that it is not lower anywhere at all. Remember in calculus class where you had to find the absolute maximum and minimum of an inevitably rather wiggly function, which meant finding all the local maxima and minima and then seeing which of them were the biggest and smallest overall? That is the issue here, except that (i) you are smarter than a computer’s minimization routine, and (ii) this function seems to have a unique minimum, so the computer should get the right answer anyway.

Now, to find exactly where the minimum is, we need to minimize the (negative) log-likelihood over theta. R’s function minimizer is called optimize. This takes, at a minimum:

  • an interval containing the solution (which could be quite wide)
  • the name of the function you are trying to minimize (negloglik here)
  • any additional inputs to your function

We are pretty sure that the maximum likelihood estimate is between 50 and 100, based on what we’ve seen so far:

optimize(negloglik, c(50, 100), d = turbine)
$minimum
[1] 74.50529

$objective
[1] 29.76936

Boom. The maximum likelihood estimate is 74.50529, and the log-likelihood at that maximum is \(-29.76936\).

For those of you who know your algorithms: this uses golden section search combined with parabolic interpolation, and has superlinear convergence. This is better than Newton’s method (what you might know of as Newton-Raphson) for a couple of reasons: (i) Newton’s method can go jumping off to the back of beyond if the derivative is close to zero at some point other than the minimum, and (ii) you need the derivative as well as the function, which is extra work here that we would like to avoid. (In this one, the derivative of the log-likelihood is easy enough to write down, but if your log-likelihood is messy, its derivative will be even more messy, and those are just the cases where you will want a numerical answer.3) If you want to learn more, the Numerical Recipes books give a great description of this and many other algorithms. The books have code in several different languages, including C, but the real feature of the books is the clear description of the algorithms. For this, you can go back to the 1992 books there, which are free.
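
If you are curious what golden-section search actually does, here is a minimal sketch of the plain version, without the parabolic interpolation that optimize adds. The function golden and its details are mine, purely for illustration:

golden <- function(f, lo, hi, tol = 1e-6, ...) {
  phi <- (sqrt(5) - 1) / 2   # about 0.618, the golden ratio minus 1
  # two interior points dividing the interval in golden-ratio proportions
  a <- hi - phi * (hi - lo)
  b <- lo + phi * (hi - lo)
  while (hi - lo > tol) {
    if (f(a, ...) < f(b, ...)) {
      hi <- b   # for a unimodal f, the minimum must be in [lo, b]
    } else {
      lo <- a   # otherwise it must be in [a, hi]
    }
    a <- hi - phi * (hi - lo)
    b <- lo + phi * (hi - lo)
  }
  (lo + hi) / 2
}
golden(negloglik, 50, 100, d = turbine)

This should land very close to the answer from optimize. (The real algorithm re-uses one of the two interior function values at each step instead of recomputing both, which is part of what makes it efficient.)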

How does this compare with the moments estimator we found before?

turbine %>% 
  summarize(theta_hat = sum(vib_stress^2) / 2 / length(vib_stress)) %>% 
  pull(theta_hat)
[1] 74.50529

It looks as if it’s exactly the same, and that therefore, on this occasion, my shaky calculus is to be trusted.

If you write down the log-likelihood, you might suspect this, because the only piece of it that depends on both \(\theta\) and the data is \(-(1/2\theta) \sum x_i^2\), so that \(\sum x_i^2\) is the sufficient statistic and the MLE does not depend on the data in any other way. Maximum likelihood estimators are not always unbiased, but they often are, and the estimator I got at the start of this lengthy Extra was the only way to make an unbiased estimator out of \(\sum x_i^2\).
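
For completeness, the calculus really is short: the log-likelihood from observations \(x_1, \ldots, x_n\) is \(\ell(\theta) = \sum \log x_i - n \log \theta - (1/2\theta) \sum x_i^2\), so \(\ell'(\theta) = -n/\theta + (1/2\theta^2) \sum x_i^2\), and setting this to zero gives exactly \(\hat\theta = (1/2n) \sum x_i^2\), the estimator we started with.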

2 Soil samples

Soil samples were gathered using both a pitcher tube method and a block method, resulting in observed values of dry density (in pounds per cubic foot). The data are in http://ritsokiguess.site/datafiles/liquefaction.csv. The scientists are interested in whether there is any difference in the mean dry density of soil samples collected using the two sampling methods.

  1. (6 points) Analyze the data, bearing in mind what the scientists are interested in. Your analysis should include the following, and read as a coherent narrative:
  • an appropriate graph
  • the most appropriate \(t\)-test, with explanation of why you think it is most appropriate
  • a conclusion that will help the scientists decide what to do next.

(solution)

This one is more open-ended, so you will have to decide how to organize things to make a story that reads smoothly. The grading scale is two marks for each of the three things above. Below is how I would do it. You can add relevant extra narrative to make the story clearer and to show that you know what you are doing.

I didn’t say explicitly, but you will need to read in the data before you can do anything else:

my_url <- "http://ritsokiguess.site/datafiles/liquefaction.csv"
soil <- read_csv(my_url)
Rows: 35 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): sampling
dbl (1): dry_density

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
soil

As you look at the data, note that the columns are called sampling (with the names of the sampling methods; categorical) and dry_density, the dry density of the soil samples (quantitative). Thus the most appropriate graph is a boxplot:

ggplot(soil, aes(x = sampling, y = dry_density)) + geom_boxplot()

The dry density for the soil samples obtained using pitcher sampling is slightly higher on average than for block sampling (comparing the medians), but there is a lot of overlap, so it is not clear whether that difference will be statistically significant.

Two points for getting this far.

We were asked to choose the “most appropriate \(t\)-test”, so we are taking it for granted that the distributions in the boxplot are close enough to being normally distributed, given the sample sizes. (I am not giving you the opportunity to refuse to run a \(t\)-test!) Later in the course, we learn about Mood’s median test, which you can use when the distributions are not normal enough given the sample sizes, but we are not there yet.

So, the choice is between the two two-sample \(t\)-tests. For the pooled test, we need the two distributions to have about the same spread, but it looks here as if the block sampling measurements have a larger spread (interquartile range) on the boxplot, and so we should use the Welch test:

t.test(dry_density ~ sampling, data = soil)

    Welch Two Sample t-test

data:  dry_density by sampling
t = -1.9202, df = 20.162, p-value = 0.06909
alternative hypothesis: true difference in means between group Block.sampling and group Pitcher.sampling is not equal to 0
95 percent confidence interval:
 -5.3171334  0.2186485
sample estimates:
  mean in group Block.sampling mean in group Pitcher.sampling 
                      101.1091                       103.6583 

I’m not going to object if you describe the spreads as “about the same”, but then you need to be consistent and follow it by doing a pooled \(t\)-test:

t.test(dry_density ~ sampling, data = soil, var.equal = TRUE)

    Two Sample t-test

data:  dry_density by sampling
t = -1.8936, df = 33, p-value = 0.06708
alternative hypothesis: true difference in means between group Block.sampling and group Pitcher.sampling is not equal to 0
95 percent confidence interval:
 -5.2882630  0.1897781
sample estimates:
  mean in group Block.sampling mean in group Pitcher.sampling 
                      101.1091                       103.6583 

I don’t mind if you disagree with me, but I do need you to be consistent with yourself.

Two more points: one for running a two-sample \(t\)-test and displaying the results, and one for having a good reason to run the one you did (whichever one it was).

Aside: you’ll remember from B57 that the assumption behind the pooled \(t\)-test (which was the one you derived) is that the two populations have equal variances. You might feel, then, that you should compare the sample SDs rather than IQRs, and if you want to do that, you know how to do it:

soil %>% 
  group_by(sampling) %>% 
  summarize(n = n(), mean_density = mean(dry_density), sd_density = sd(dry_density))

Those SDs are actually very close, and that would be a good reason to run the pooled test.4

The last two points are for a conclusion that the scientists can use, that is, in the context of the data:

  • The null hypothesis is that the two sampling methods have the same (population) mean dry density
  • The alternative hypothesis is that the two sampling methods have different mean dry density (two-sided, because the scientists were interested in any difference)
  • The P-value is 0.06909 (Welch) or 0.06708 (pooled). State the appropriate one for your test.
  • The P-value is greater than 0.05, so we cannot (quite) reject the null hypothesis.
  • Therefore, there is not evidence that the (population) mean dry density differs between the two sampling methods.

The key things are the P-value (the appropriate one for the test you did), and the final conclusion in words. It is a good idea to go through the process as I did, because it shows you know what you are doing, but if it is clear that you know what the hypotheses are by the way you stated your final conclusion, that is enough.

You can state your hypotheses using symbols like \(\mu_1\) and \(\mu_2\) if you like, but you need to define your symbols before you use them, for example by saying “let \(\mu_1\) be the population mean dry density for samples taken by the block method, and let \(\mu_2\) be the same taken by the pitcher method. Then our null hypothesis is \(H_0: \mu_1 = \mu_2\) and our alternative is \(H_a: \mu_1 \ne \mu_2\).”

The reason the scientists might be interested in your results might be that one of these two sampling methods is much easier or cheaper than the other, and if they know that the two methods give the same results,5 then the cheaper method is just as good as the more expensive one. (On the other hand, if the results are different, the scientists would need to investigate why that is.)

Extra 1: the two tests above give almost identical P-values, so that it literally didn’t matter which one of the two you chose. In other cases, though, it might matter, so that it is important to have a good reason for doing the test you chose to do. The place where it makes a difference is when you have samples of very different sizes and the smaller sample has the larger spread, so that the mean of the population with the smaller sample is estimated very imprecisely. In that case, the P-values from the Welch and pooled tests can be very different, and the one from the pooled test is wrong because the population variance that goes with the smaller sample is larger than the other one. From that point of view, if you had to pick one two-sample test and use it always, you would pick the Welch test. You might lose a tiny bit of power if the population variances really are equal, but the usual situation when both tests apply is that the P-values will be almost identical.
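
If you want to see this for yourself, here is a small simulation sketch along those lines; the sample sizes and SDs are invented for illustration. The population means are equal, so a test that is behaving itself should reject about 5% of the time:

set.seed(1)
pvals <- replicate(1000, {
  x <- rnorm(10, 50, 10)    # small sample from the population with the big spread
  y <- rnorm(100, 50, 2)    # big sample from the population with the small spread
  c(welch = t.test(x, y)$p.value,
    pooled = t.test(x, y, var.equal = TRUE)$p.value)
})
rowMeans(pvals < 0.05)

You should see the Welch rejection rate close to 0.05 and the pooled one a good deal larger.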

Extra 2: the summary statistics I did above show that there were only 11 observations collected by block sampling. The boxplot for these looks skewed to the right, and there is some question as to whether the sample size is large enough to overcome this skewness. A bootstrap sampling distribution of the sample mean will shed some light on this. The first thing we need to do is to grab just the 11 block sampling measurements:

soil %>% filter(sampling == "Block.sampling") -> block
block

and then as usual:

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(block$dry_density, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) -> d
ggplot(d, aes(x = my_mean)) + geom_histogram(bins = 10)

You might have run into a normal quantile plot (probably not in this course yet) as a tool for assessing normality, and you could reasonably argue that it is a better tool than a histogram here. Here’s how it goes in R:6

ggplot(d, aes(sample = my_mean)) + stat_qq() + stat_qq_line()

That really looks very much like a normal distribution, if anything slightly short-tailed, and so that right-skewness we saw in the distribution of dry densities for block sampling is not actually a problem: even a sample size of 11 was large enough to overcome it. (If the sample size had not been large enough, we would have seen an obvious right skewness here as well.)

This is another example of the Central Limit Theorem helping more than you might expect.

Footnotes

  1. In this case, your reader might prefer to use \(\alpha = 0.01\), and they would end up not rejecting \(H_0\). By giving the P-value, you allow them to make a decision on their terms rather than yours.↩︎

  2. Example: checking that the inputs are legitimate, in this case both positive.↩︎

  3. When I was learning numerical analysis in grad school, from this guy, he said that any minimization algorithm that used derivatives was troublesome, because the first place you’d screw up was in coding the derivative function. We were doing minimization of functions of several variables, so the derivative was actually a vector, which of course made the problem worse. Also, it seems to be a thing with numerical analysis people that they look as if they were in an 80s heavy metal band.↩︎

  4. Sometimes with small samples, the exact appearance of the boxplot can depend on exactly what the data values near the quartiles are.↩︎

  5. Failing to reject a null is a weak conclusion, though, so they would need replications of this study, or larger sample sizes, to provide convincing evidence that the two methods have the same population mean dry density.↩︎

  6. If this were being graded, you would say where you got the code from.↩︎