STAC33 Assignment 5

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

1 Arsenic

Arsenic occurs naturally in rocks, soils, and waters that come in contact with these rocks and soils. Arsenic in solution is odorless and tasteless. Exposure to arsenic at high levels poses potential serious health effects as it is a known carcinogen. It has been associated with the development of diabetes. Exposure to arsenic in drinking water has been identified as a health concern in regions of the United States where bedrock contains unusually high levels of arsenic, such as areas of New Hampshire and Maine. This is especially a concern for people whose water supply is a private well (rather than a municipal water supply).

Levels of arsenic were measured in the toenails of 19 subjects from New Hampshire, all with private wells as their main water source. The data are in http://ritsokiguess.site/datafiles/arsenic.csv. The one column of data is called Arsenic with an uppercase A, so you can call the dataframe arsenic with a lowercase a.

(a) (2 points) Read in and display (some of) the data.

As ever, and following the suggestion in the question:

my_url <- "http://ritsokiguess.site/datafiles/arsenic.csv"
arsenic <- read_csv(my_url)
Rows: 19 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): Arsenic

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
arsenic

(b) (2 points) Make a suitable plot of the column of arsenic levels. What do you notice about the shape of the distribution?

One quantitative variable suggests a histogram. Since we are most interested in the overall shape of the distribution (rather than fine details), we should use a smallish number of bins — enough to show the shape clearly, but not so many that we get a fuzzy impression. I found 8 bins to be pretty good:

ggplot(arsenic, aes(x = Arsenic)) + geom_histogram(bins = 8)

The distribution is very skewed to the right.

Another plot that works pretty well here (though see the Extra below) is a one-sample boxplot:

ggplot(arsenic, aes(x = 1, y = Arsenic)) + geom_boxplot()

This says that the distribution is skewed to the right but also that there are two high outliers. Those values above 0.8 seem to be much higher than the rest, so based on this graph, I would call them outliers. The outliers don’t show up so well on my histogram because they don’t look so separate from the rest of the distribution. The impression, I think, will depend on how many bins you choose for your histogram. Here’s 10 bins:

ggplot(arsenic, aes(x = Arsenic)) + geom_histogram(bins = 10)

and here is 12:

ggplot(arsenic, aes(x = Arsenic)) + geom_histogram(bins = 12)

I think 12 bins is too many because there is that funny “hole” in the distribution just below 0.25 that is probably not indicative of anything real (because the frequencies around it are pretty small: only 3 observations on the right). The highest values look more like outliers the more bins you have, though. I think 10 bins is a good compromise, because it shows the shape very clearly: a right skew plus two outliers.

The value of having R open is that you can try as many numbers of bins as you like, until you find one that tells the story of what you think is going on.

Extra: The one-sample boxplot told a pretty clear story here, but that was because our data had one clear mode. Boxplots assume that the distribution is unimodal, in that they don’t have a mechanism for showing two modes (they assume that the median is a sensible measure of centre, which it won’t be if you have two modes, because the median is likely to be between the modes). To illustrate, the easiest way I know to make bimodal data is to combine two different distributions together. Let’s generate some random normals with different means:1

set.seed(457299)
x1 <- rnorm(50, 100, 12)
x2 <- rnorm(40, 150, 12)

Our first sample has 50 observations, a mean of 100, and an SD of 12; the second has 40 observations with a mean of 150 and an SD of 12. (The SDs are small enough that the distributions should not overlap much.) Next, glue them together so that the column z has all \(50+40 = 90\) observations mixed together:2

d <- tibble(z = c(x1, x2))

and make a histogram of them:

ggplot(d, aes(x = z)) + geom_histogram(bins = 12)

A one-sample boxplot of the same data looks like this:

ggplot(d, aes(x = 1, y = z)) + geom_boxplot()

The histogram shows the bimodal shape, with peaks around 100 and 150, but the boxplot does not. The median is around 115, which is in between the two peaks. All the boxplot really shows is a distribution with short tails (short whiskers and no outliers), but it says nothing beyond that about the shape. To see the shape, you need a histogram.

(c) (2 points) To obtain a confidence interval for the mean arsenic level (of all New Hampshire residents getting their water from a private well), would you feel comfortable using t.test, based on what you have seen so far? Explain briefly.

My reaction is that the distribution of arsenic levels is very right-skewed (or that the outliers are too far out), and the sample size of 19 is not very big, so I wouldn't expect the Central Limit Theorem to help much; hence I wouldn't feel comfortable using t.test.

I don’t mind so much what you say, as long as it is logical and based on both the shape of the distribution of arsenic levels and the sample size. Another view you might take is that the Central Limit Theorem actually helps a lot, and so a sample size of 19 might be big enough to overcome even that considerable skewness.

(d) (4 points) Obtain a bootstrap sampling distribution of the sample mean for these data. What do you think about using t.test now (to obtain a confidence interval for the mean arsenic level)? Explain briefly.

The steps in the code are:

  • set up 1000 (or more) simulations
  • work rowwise
  • for each simulation, draw a bootstrap sample of the original data
  • work out the mean of the bootstrap sample
  • make a histogram of them.

So, something like this:

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(arsenic$Arsenic, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(x = my_mean)) + geom_histogram(bins = 10)

The exact shape you get will depend on the random number seed you used (if any), or the peculiarities of the random numbers you obtained (if not).

Three points for that. One for making an intelligent comment about the normality here and what it implies about what you would do, either:

  • I think this looks close to a normal distribution, and therefore I would be happy to use t.test to obtain the confidence interval.
  • I think this is still a little skewed to the right, so I would prefer not to use t.test to obtain the confidence interval (or, I would use it but be cautious about the result, or something like that). I would favour this explanation for the picture I got, but your graph may be different.

I have to say that this is a lot more bell-curve-like than I was expecting, and so the sample size of 19 is actually a lot more helpful (because of the Central Limit Theorem) than I would have guessed. Hence it is also reasonable to say, even if you don’t like the normality here either, that it’s a lot better than it was.
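Not asked for in this part, but in case you do decide the normality is acceptable: the interval itself would then come from running t.test on the column of arsenic levels, something like this (your interval will depend on your data, so no output is shown here):

```r
# the confidence interval for the mean arsenic level, if you trust t.test here
t.test(arsenic$Arsenic)
```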

Extra: there are a couple of ways to sharpen up your conclusion about whether to use t.test or not. One is to use more bootstrap samples (than 1000), for example:

tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(arsenic$Arsenic, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) -> dd

This time, I saved my simulations, so that I can draw a couple of plots. For the histogram, you can use some more bins (because there are now 10,000 bootstrap sample means rather than 1000):

ggplot(dd, aes(x = my_mean)) + geom_histogram(bins = 15)

Once again, slightly skewed to the right. (Maybe the slight skewness is a bit clearer here.)

Another plot that is useful if normality is what you are specifically concerned about (which we are here) is a normal quantile plot. You may not have seen this “officially” in class yet, so a preview here:

ggplot(dd, aes(sample = my_mean)) + stat_qq() + stat_qq_line()

The thing that the plot needs is called sample.3

The idea here is that if the data come from a normal distribution, the points will follow the line (to within randomness). If they don’t, the points will deviate from the line in some clear way. A curve, for example, indicates skewness; the graph can also indicate long or short tails compared to normal, or outliers. In this one, the curved shape indicates that the lowest values (on the \(y\)-axis) don’t go down low enough, and the highest values go a bit too high, so the skewness is to the right. That said, though, none of the points are very far from the line, so the skewness is not severe.4

I guess I should say that if the original distribution is skewed to the right (as ours was), the bootstrap sampling distribution of the sample mean will also be skewed to the right, but less so (and how much less so is determined by the sample size). So the real consideration is whether the distribution here is “close enough” to normal, whatever you think that is. The message from this one is that even if you think the sample size of 19 is not big enough, it wouldn’t take a much bigger sample size for the normality to be entirely acceptable. Such is the power of the Central Limit Theorem.
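To illustrate that last point with made-up data (a sketch, using an exponential distribution with mean 1 as a stand-in for a right-skewed population, since we don't know the real population here), compare the simulated sampling distribution of the mean for our sample size of 19 with that for a somewhat larger sample size:

```r
# exponential population is right-skewed; the sampling distribution of the
# mean should look noticeably less skewed at the larger sample size
tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(mean_n19 = mean(rexp(19, 1)),
         mean_n80 = mean(rexp(80, 1))) %>% 
  ungroup() %>% 
  pivot_longer(mean_n19:mean_n80, names_to = "n", values_to = "xbar") %>% 
  ggplot(aes(x = xbar)) + geom_histogram(bins = 12) +
    facet_wrap(~ n, scales = "free")
```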

2 Poisson power

The Poisson distribution is a discrete distribution defined on the nonnegative integers, with \(P(X=x) = e^{-\lambda} \lambda^x / x!\), where \(\lambda >0\) is a parameter. The distribution has mean \(\lambda\) and variance \(\lambda\). Let \(X_1, \ldots, X_n\) be a random sample of independent values from a Poisson distribution with mean \(\lambda\); then the sample mean of these values, \(\bar{X}\), has a sampling distribution that is approximately normal if either \(n\) or \(\lambda\) (or both) is large. The R function rpois generates random samples from a Poisson distribution. It has two inputs, the sample size and the value of \(\lambda\), in that order.
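You can check the formula against R's built-in Poisson probability function dpois; for example, with \(\lambda = 2\) and \(x = 3\):

```r
# P(X = 3) for a Poisson with mean 2: from the formula, then from dpois.
# The two answers should agree (both about 0.180).
lambda <- 2
x <- 3
exp(-lambda) * lambda^x / factorial(x)
dpois(x, lambda)
```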

(a) (2 points) Draw a random sample of size 20 from a Poisson distribution with mean 2.

Following the instructions:

rpois(20, 2)
 [1] 3 2 1 3 3 2 1 0 1 2 3 1 3 3 2 3 0 2 2 0

Your sample will be different from mine (unless you happen to have the same random number seed that I do), but you should expect to see a few zeros and some big values like 4 or 5.⁵ The grader will be looking at your code and the reasonableness of your output.

(b) (3 points) For samples of size 30 from a Poisson distribution with mean 5, simulate the sampling distribution of the sample mean (hint: borrow ideas from the bootstrap, but you don’t have a sample to sample from, so sample from the distribution instead.) Does your distribution look close to normal in shape?

The difference from the bootstrap idea is in the sample line: instead of sampling with replacement from data (which you don't have here), you use rpois to sample from the distribution that your data would be coming from:

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rpois(30, 5))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(x = my_mean)) + geom_histogram(bins = 10)

This looks pretty normal, so either the sample size or the population mean (I'm guessing the sample size here) is “large” enough.
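As a check (not part of the question), the simulated sample means should average close to \(\lambda = 5\), with an SD near the theoretical \(\sqrt{\lambda/n} = \sqrt{5/30} \approx 0.41\). Saving the simulations instead of plotting them lets you look:

```r
# mean and SD of the simulated sample means, to compare with theory
tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_mean = mean(rpois(30, 5))) %>% 
  ungroup() %>% 
  summarize(mean_of_means = mean(my_mean), sd_of_means = sd(my_mean))
```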

(c) (3 points) By simulation, estimate the power of the \(t\)-test to reject the null hypothesis that the population mean is 5, in favour of a two-sided alternative, when the population has a Poisson distribution and its mean is actually 6.5, using a sample size of 30.

Sample from the truth, and test the given null hypothesis:

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rpois(30, 6.5))) %>% 
  mutate(my_t_test = list(t.test(my_sample, mu = 5))) %>% 
  mutate(my_p = my_t_test$p.value) %>% 
  count(my_p <= 0.05)

The power is estimated as 0.896. Or whatever proportion of times you correctly rejected the null hypothesis, which will vary a bit from mine because randomness.6
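As a rough cross-check (an approximation, not part of the question): if you pretend the population is normal with SD \(\sqrt{6.5}\), which is what the Poisson variance would give, power.t.test produces an answer in the same ballpark as the simulation:

```r
# approximate power, treating the Poisson(6.5) as if it were normal
# with sd = sqrt(6.5); one-sample two-sided t-test of mu = 5
power.t.test(n = 30, delta = 6.5 - 5, sd = sqrt(6.5), type = "one.sample")
```

This ignores the skewness of the Poisson, so you shouldn't expect it to match the simulation exactly, just to be close.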

Extra: the best way to test for a Poisson mean is actually not to use a \(t\)-test. To see what to do instead, assume that we are in the position of having a normal sampling distribution of the sample mean by virtue of a large sample size or something. Then we can work out the mean and variance of \(\bar{X}\):

\(E(\bar{X}) = (1/n) E(\sum X_i) = (1/n) n \lambda = \lambda\)

so that, unsurprisingly, the sample mean is an unbiased estimator of the population mean, and:

\(var(\bar{X}) = (1/n^2) var(\sum X_i) = (1/n^2) n\lambda = \lambda / n\)

The way you would make a test statistic for testing \(H_0: \lambda = \lambda_0\) is to standardize \(\bar{X}\) and use the null-hypothesis value of the parameter, substituting for the parameter everywhere you see it:

\(z = { \bar{X} - \lambda_0 \over \sqrt{\lambda_0 / n} }\)

The reason you usually have a \(t\)-test for a mean is that the variance on the bottom is estimated from the data, which introduces some extra chi-squared-flavoured variability and makes the whole thing \(t\) rather than normal. But here, you will know \(\lambda_0\) because it’s given in \(H_0\), and hence the variance of \(\bar{X}\) is known. This means that the test statistic is a \(z\) rather than a \(t\), in just the same way that the test for the mean with normal data is \(z\) rather than \(t\) when you know the population variance. In that situation, it seemed rather far-fetched that you would mysteriously know the population variance even if you had a hypothesized value for the population mean, but here is a case when you do.7
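A quick simulation supports the variance calculation above: the variance of a large number of simulated sample means from a Poisson should be close to \(\lambda/n\). For example, with \(\lambda = 5\) and \(n = 30\):

```r
# var(xbar) should come out near 5/30 = 0.167 (to within randomness)
means <- replicate(2000, mean(rpois(30, 5)))
var(means)
```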

To do an example, let’s make up some data from a Poisson whose mean is actually 6.5:

my_sample <- rpois(30, 6.5)
my_sample
 [1]  6  3  1  5  4  5 11  7  7  9  4  6  7  8  6  1  4 11  3  7  9  4  9  3  7
[26]  7  3  7  6  8

and then work out the test statistic for testing that it came from a population with mean 5:

z <- (mean(my_sample) - 5) / sqrt(5/30)
z
[1] 2.28619

because \(H_0\) says that \(\lambda\) is 5 and the sample size is 30.

To get a (two-sided) P-value, note that this is in the upper tail, so find the probability of being above it (as one minus the probability of being below it), and double it, as if you were using tables. pnorm is the cumulative normal distribution function in R:

2 * (1 - pnorm(z))
[1] 0.02224312

and so, for our sample, we would (correctly) reject that the population mean was 5 when it was actually 6.5. Using the same data but a \(t\)-test gives something slightly different. pt is the cumulative \(t\)-distribution, with the second input being the df:

t_stat <- (mean(my_sample) - 5) / (sd(my_sample) / sqrt(30))
t_stat
[1] 1.967053
2 * (1 - pt(t_stat, 30 - 1))
[1] 0.05880986

On this occasion, we would actually just fail to reject \(H_0\),8 and make a type II error. This happened because the sample variance was a bit bigger than the variance we would have expected from a Poisson distribution of mean 5, and hence the test statistic was a bit smaller (and less significant). (A Poisson distribution with a bigger mean is also more variable, and so doing a \(t\)-test rather than a \(z\)-test is costing us some power if the population mean is bigger than 5. But probably not very much, as our simulation earlier showed that the power was already pretty high.)
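To put a number on that last claim (a sketch, reusing the simulation idea from the power question), you can estimate the power of the \(z\)-test the same way we estimated the power of the \(t\)-test, and compare:

```r
# estimated power of the z-test of H0: lambda = 5 when lambda is really 6.5;
# the denominator uses the null-hypothesis variance 5/30, as derived above
tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rpois(30, 6.5))) %>% 
  mutate(z = (mean(my_sample) - 5) / sqrt(5 / 30)) %>% 
  mutate(my_p = 2 * (1 - pnorm(abs(z)))) %>% 
  count(my_p <= 0.05)
```

If the reasoning above is right, the proportion of rejections here should be a little higher than the \(t\)-test's, but not by much.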

Footnotes

  1. I’m using set.seed, like I did in lecture, so that the random numbers generated are the same every time I run this.↩︎

  2. What is often aptly called a “mixture distribution”.↩︎

  3. Technical detail: plots of the actual data start with geom; plots of something calculated from the data start with stat.↩︎

  4. You might be curious about how this graph works. The idea is that the data are on the \(y\)-axis, and on the \(x\)-axis are the \(z\)-scores you would expect to see if the data were normal. If the data really are normal, the observed (\(y\)) and expected (\(x\)) should agree to within randomness and the points will be close to the line. The \(x\)-coordinates of the points on the plot are called “normal order statistics”; for example, the one way over on the left says that if you took a sample of size 10,000 from a standard normal distribution, the expected value of the smallest one is near \(-4\). That might seem extreme for a \(z\)-score, but you just took a huge sample and picked the very smallest value. The line on the plot goes through the first and third quartiles, of the data and of the normal order statistics, so as not to be influenced by outliers.↩︎

  5. I was unlucky enough to get no values bigger than 3.↩︎

  6. If yours seems too different from mine, expect the grader to scrutinize your code.↩︎

  7. This will happen if you have a one-parameter distribution where the one parameter is related to the mean; the population variance will also depend on that one parameter. Other examples would be the exponential distribution (use the form where the parameter equals the mean), or the binomial distribution, treating \(n\) as known and \(p\) as the only parameter.↩︎

  8. The intuition here is that the test statistic is very close to the 1.96 that would give you a P-value of exactly 0.05 if it were a \(z\), but because it’s a \(t\), it needs to be a little bigger to be significant, like nearer to 2, and hence the P-value is a bit bigger than 0.05.↩︎