STAC32 Assignment 4

Packages

library(tidyverse)

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete the assignment, for example because a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

Absences

In a particular company, 15 employees were sampled, and the number of days each employee was absent from work over a certain time period was recorded. The data are in http://ritsokiguess.site/datafiles/absent.csv, with one column called days.

  1. (1 point) Read in and display some of the data.

As ever, no surprises:

my_url <- "http://ritsokiguess.site/datafiles/absent.csv"
absent <- read_csv(my_url)
Rows: 15 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): days

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
absent

Indeed, one column called days, and the promised 15 rows. absent or absence or something like that is a good name for the dataframe.

  2. (2 points) Draw a normal quantile plot of the days of absence. What do you learn from the plot?

This:

ggplot(absent, aes(sample = days)) + stat_qq() + stat_qq_line()

The upward-opening curve shape tells you that the distribution is skewed to the right (or you can reason that the lower values are too bunched up and the higher ones are too spread out). I think this is a better answer than “there are upper outliers” (with the rest of the points being close to the line), because there seems to be a curved pattern in all the points, and therefore the non-normality has to do with the entire distribution rather than with just those two highest observations.

Extra: a histogram is kind of difficult to make successfully with only 15 observations: you don’t want to use too many bins, but the distribution (as we now know) is very skewed, so we need more bins than we would if the distribution were normal. Try 8:

ggplot(absent, aes(x = days)) + geom_histogram(bins = 8)

Even with 8 bins, you don’t get much of a sense of the left side of the distribution, but the impression here is also that we have a long tail, rather than those two observations on the right being outliers. (If we had had something more like a bell shape on the left and then those two distant observations, it would have been a different story, but as it is, those two observations on the right look more as if they are part of a long tail.)

  3. (3 points) Obtain and plot a bootstrap sampling distribution of the sample mean (using a normal quantile plot).

Find some relevant code (from your lecture notes or the worksheet), copy, paste, and edit:

tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(absent$days, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()

The steps are:

  • set up a new dataframe with space for 10000 (ideally) simulations
  • work rowwise
  • generate one random sample from the data with replacement for each simulation, using list to store the whole sample in one cell of the dataframe each time
  • work out the mean of each sample
  • make a normal quantile plot of those means

Extra: if you want a better sense of what is going on (rather than treating the whole chunk of code as black magic), run it one line at a time. To do this, move the %>% at the end of a line down to the beginning of the next line. Running the code (e.g. by going to the top line and hitting Ctrl-Enter) will then run only the lines above the stranded %>% and display the result. When you are happy with what you see, move the %>% back to where it was, then move the %>% at the end of the next line down, and so on. In these long pipelines, it is good to know what each line is doing, and if you are building one yourself, it is smart to get each line working before you move on to the next.
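
For example, stopping the pipeline of Question 3 after the sampling step displays the dataframe with the list column my_sample in it, one whole bootstrap sample per cell:

tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(absent$days, replace = TRUE)))  # stop here and inspect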

  4. (2 points) Use your plot of Question 3 to argue against using a \(t\) procedure to obtain a confidence interval for the mean number of days absent (across the whole company).

The bootstrap sampling distribution of the sample mean is not normal: it is still skewed to the right (it is still curved upwards), and so the sample size of \(n = 15\) is not large enough to overcome the non-normality in the sample. Or, with this sample size of 15, the Central Limit Theorem will not be powerful enough to enable us to trust the result of a \(t\)-procedure.

Extra 1: remember that there is no absolute sample size that will always be big enough for any data; it depends on the particular data you have. These data were fairly severely skewed to the right, so we would need a big sample before we would be happy with a \(t\) confidence interval. If the data were less skewed, a sample size of 15 might be big enough. It depends.

I asked you to make an argument for not using a \(t\) confidence interval so that the grader would have a specific argument to grade. If I had left it up to you, you might have said “the points are close to the line, therefore using \(t\) is OK”, which I think is only borderline defensible here, but it is admittedly a judgement call.

Extra 2: having said that \(t\) is no good, the question is then one of what we do instead. We don’t have an answer to that yet. The answer that we see later in the course is to obtain a confidence interval for the median number of days absent; the procedure we use then does not require any assumption of normality, so it will work no matter how skewed the distribution is. But the company might be interested specifically in the mean or total days absent over the entire company (for example, for budgeting purposes, or for assessing whether there is a problem with the mental health of the employees), and for that we don’t have an answer. If you are taking C53, you might have learned about estimating things like a population total (which you do by estimating the population mean and then scaling up by the population size); the actual estimate of the population total is fine no matter what the shape of the distribution, but estimating the uncertainty in that estimated total is where the normality comes in. In this case, if we were estimating the total number of days absent over the entire company, estimating the population mean and scaling it up (by the total number of employees) would be fine, but using a \(t\) or normal distribution to estimate the uncertainty in that estimated total would be questionable, based on what we have seen here.
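
To make that last point concrete, here is a minimal sketch of the scaling-up, where the company size of 200 employees is entirely made up for illustration:

N <- 200                  # hypothetical number of employees (made up)
N * mean(absent$days)     # estimated total days absent across the company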

Extra 3: I wrote a package called bstrap to simplify the process of obtaining a bootstrap sampling distribution of the sample mean:

library(bstrap)
plot_dist(absent$days)

which is about what we got above.

To install this, install the devtools package (via install.packages("devtools"), in the console), and then, still in the console, run this:

library(devtools)
install_git("http://worktree.ca/nxskok/bstrap")

Power from Poisson populations

The Poisson distribution is a discrete distribution, taking values on the whole numbers 0 and greater. It is often used to model the number of events happening over time or space, where events happen at a constant rate independently of other events (such as flaws in a length of cloth, or goals in a hockey game, or calls received by a call centre). It has one parameter, the mean (usually denoted \(\lambda\), “lambda”). In R, the function rpois draws a random sample from a Poisson distribution. It has two inputs: the first is the size of the random sample, and the second is the mean of the Poisson distribution.
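
For example, a quick look at what rpois produces (the sample size of 5 and the seed are just for illustration):

set.seed(457)   # any seed will do; this just makes the output reproducible
rpois(5, 1.5)   # a random sample of size 5 from a Poisson distribution with mean 1.5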

  5. (3 points) Take a random sample of size 1000 from a Poisson distribution with mean 1.5, and make a normal quantile plot of the result. Hint: save the random sample in the column of a dataframe, perhaps created with tibble.

This is the most direct way:

tibble(x = rpois(1000, 1.5)) %>% 
  ggplot(aes(sample = x)) + stat_qq() + stat_qq_line()

Something more in line with bootstrapping or power simulation would go like this:

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_x = rpois(1, 1.5)) %>% 
  ggplot(aes(sample = my_x)) + stat_qq() + stat_qq_line()

The annoyance here is that by setting up for 1000 simulated values, you have to make sure you generate 1000 instances of 1 Poisson random value, which you probably find unnecessarily confusing. But the result is the same either way.

  6. (1 point) Describe the shape of your Poisson distribution.

It is skewed to the right (a curved pattern on the normal quantile plot), or has a long upper tail (some unusually high values), even after accounting for the discreteness. Values coming out of a Poisson distribution must be smallish whole numbers, so there will be a lot of repeated values (the horizontal streaks on the graph), but the non-normal pattern should be clear on your plot anyway. (Your plot won’t be exactly the same as mine because randomness, but you should see the same kind of pattern.)

  7. (3 points) Estimate by simulation the power of a \(t\)-test to reject a mean of 2, against a two-sided alternative, when the data come from a Poisson distribution with mean 3 and the sample size is 15.

This one actually has to be done by simulation, because the true population is not normally distributed (we get to that later). The process is:

  • set up the simulation and work rowwise (you can do more than 1000 simulations if you want, but 1000 is enough)
  • draw a bunch of samples from the truth
  • run the desired \(t\)-test on each random sample, using the null mean, and save all the results
  • pull out the P-value from each one
  • count how many of those P-values would lead to (correct) rejection of the (wrong) null hypothesis

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rpois(15, 3))) %>% 
  mutate(t_test = list(t.test(my_sample, mu = 2))) %>%
  mutate(p_val = t_test$p.value) %>%
  count(p_val <= 0.05)

My power is estimated to be 0.572 (\(572/1000\)). Yours probably won’t be exactly the same, because randomness, but it should be in the same ballpark. Expect the grader to scrutinize your code more carefully if your answer isn’t a bit above 0.5.

Extra: the rule of thumb is that if the power is not too far from 0.5, and you do 1000 simulations, you are likely to estimate it to within about 0.03 (3 percentage points), so I would expect your answer to be between about 0.54 and 0.60, unless you are unusually unlucky. The background is that you are estimating a proportion (of times you correctly reject the null), and each one of your simulations either says to reject (“success”) or not (“failure”), like a binomial distribution. So you can use your simulation result to get a confidence interval for the true power, which is done using prop.test:

prop.test(572, 1000)

    1-sample proportions test with continuity correction

data:  572 out of 1000, null probability 0.5
X-squared = 20.449, df = 1, p-value = 6.124e-06
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5406126 0.6028273
sample estimates:
    p 
0.572 

indeed, up and down 3 percentage points from my estimate. If you wanted to be more accurate, you would do more simulations; 10,000 of them would reduce the uncertainty by a factor of \(\sqrt{10} \simeq 3\), so you would expect to be within about 1 percentage point of the true power 95% of the time.
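
As a quick check on that, here is the margin of error you would get from 10,000 simulations, using the formula coming up next (with my estimated power plugged in):

1.96 * sqrt(0.572 * (1 - 0.572) / 10000)   # margin of error: just under 0.01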

You might have seen a formula for a confidence interval for a population proportion, along these lines:

\[ \hat{p} \pm z^* \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \]

where \(\hat{p}\) is the sample proportion, \(z^*\) is the relevant number from a normal table (e.g. 1.96 for a 95% confidence interval), and \(n\) is the sample size. prop.test actually uses a slightly fancier interval (a score interval with a continuity correction), but it gives very similar answers to this formula when \(n\) is large. There are actually two reasons why the 0.03 rule of thumb is usually pretty good:

  • if \(n = 1000\) and \(\hat{p} = 0.5\), the margin of error part of the confidence interval is close to 0.03.
  • the margin of error changes only slowly as \(\hat{p}\) changes, so the rule of thumb is still good as long as \(\hat{p}\) is not too close to 0 or 1:

tibble(p_hat = seq(0.2, 0.8, 0.1)) %>% 
  mutate(margin = 1.96 * sqrt(p_hat * (1 - p_hat) / 1000))

  8. (3 points) By doing at most two more simulations, what can you say about the sample size necessary to achieve power 0.8? (Hint: your goal is not to get the most accurate answer, but to show that you know what to do and why.)

Our current power is smaller than 0.8, so we need to make the sample size bigger. I don’t know about you, but I don’t have much intuition about how much bigger, so all you can do is to try something. I’ll try a sample size of 30:

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rpois(30, 3))) %>% 
  mutate(t_test = list(t.test(my_sample, mu = 2))) %>%
  mutate(p_val = t_test$p.value) %>%
  count(p_val <= 0.05)

Copy and paste your simulation code from the previous question, and change only one thing: the sample size on the my_sample line from 15 to (in my case) 30.

Now my power is too big (0.886, bigger than 0.8), so for my final simulation, I need a smaller sample size than 30. The power I want is closer to the 0.886 for \(n = 30\) than the 0.572 for \(n = 15\), so my next guess should be closer to 30 than to 15. How about 25?

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rpois(25, 3))) %>% 
  mutate(t_test = list(t.test(my_sample, mu = 2))) %>%
  mutate(p_val = t_test$p.value) %>%
  count(p_val <= 0.05)

This is still a bit too big, but closer than either of the other two guesses, so my best guess at the sample size I need is “a bit less than 25”, and with my permitted two simulations, that’s as close as I can get.

Your process probably won’t be the same as mine, but you need to do something like this:

  • run a first simulation with a sample size bigger than 15. It doesn’t really matter what, but 20 or bigger is probably a good idea, bearing in mind that the power in the previous question was a long way short of 0.8, so the sample size you need is probably noticeably bigger than 15.
  • if that simulation gave you a power bigger than 0.8, run a second one with sample size between 15 and the one in your first simulation.
  • if the first simulation also gave you a power smaller than 0.8, run a second one with sample size bigger than the one in your first simulation.

You now have three simulation results (the one with \(n = 15\) in the previous question and the two here). Use those to say what you now know about the required sample size: between two values, bigger than something, or approximately equal to something. (In my case, I think “approximately 25” and “between 15 and 25” are both reasonable answers, because \(n = 25\) happened to get me pretty close.)
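
If you want something a little more principled than guessing for your second simulation, you could interpolate linearly between the two powers you already have. This is only a rough guide, because power is not actually a linear function of sample size, but with my numbers it points at a sample size of about 26:

15 + (0.8 - 0.572) / (0.886 - 0.572) * (30 - 15)   # roughly 26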

Remember that your goal is to make an honest effort to get reasonably close, but how close you end up getting is not important as long as you follow a reasonable procedure. If your first simulation is something like this, with \(n = 24\):

tibble(sim = 1:1000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rpois(24, 3))) %>% 
  mutate(t_test = list(t.test(my_sample, mu = 2))) %>%
  mutate(p_val = t_test$p.value) %>%
  count(p_val <= 0.05)

that raises some red flags, because why would you choose a sample size like 24, when you could have chosen a round number like 25 or 30 or even 40? Jumping straight to \(n = 24\) needs a good explanation of why you went to exactly that number, and suggests that you have done some experimentation behind the scenes: that is to say, not doing “at most two more simulations”, but doing several more and showing only the one(s) that got you close; in other words, cheating. Particularly for the first of the two simulations here, you will have very little idea about the appropriate sample size, and so your only hope is to pick a sample size noticeably bigger than 15 and see where it leads you. Anything other than a round number like 30 or 40 suggests an accuracy that you cannot have yet. (If you happen to be lucky enough to pick 25 the first time, then you can either stop there and call it close enough, or you can say “I need a sample size just less than 25”, and then proceed to try \(n = 24\).)

  9. (2 points) Explain briefly why you cannot use power.t.test to obtain the power in the situation described above.

power.t.test assumes that the population being sampled from is normal, and the population here actually has a Poisson distribution, which is not normal in shape (we saw above that it is right-skewed).

Two things to say:

  • power.t.test requires a normal population
  • this population is not normal.

Say them both for full credit. If you say only the second one, expect to receive a comment along the lines of “why does this matter?”.
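
For comparison only (this is not a valid answer to the question), you could see what power.t.test says if you pretend the population is normal. A Poisson distribution’s variance is equal to its mean, so the SD that goes with a mean of 3 is \(\sqrt{3}\); everything else comes from the question:

power.t.test(n = 15, delta = 3 - 2, sd = sqrt(3), type = "one.sample")

The answer will probably be in the same ballpark as the simulated 0.572, but you cannot take it at face value here, for exactly the reason above.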

Extra: you might then reasonably ask why we were doing a \(t\)-test if the population has a right-skewed shape. The answer is the Central Limit Theorem, as usual. If you had actual data, you could do a bootstrap sampling distribution of the sample mean, and see whether that looked normal. Here, you don’t have data, but you know the population the data would have come from, so instead of using sample, you use the same mechanism, rpois, that you used above, to draw samples. Here’s how it goes for samples of size 15 from a Poisson distribution with mean 3:

tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(rpois(15, 3))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()

The above shows you what sample means might look like when you repeatedly sample 15 observations from a Poisson distribution with mean 3.

This is close to normal, although with a short lower tail, so that the \(t\)-test will be off, but only by a little bit. With bigger samples, such as the 25 or so we would need to get power 0.80 against \(H_0: \mu = 2\), the normality will be better, and there will then be no problems with the \(t\)-test at all. (Note that this distribution also has stair-steps, but much smaller ones than we saw before: the data values in the samples must be whole numbers, so the sample mean when \(n = 15\) must be some whole number of fifteenths, which is what you see on this graph.)
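
If you want to convince yourself about the fifteenths: the mean of 15 whole numbers, multiplied by 15, is the sample total, which has to be a whole number. A quick check:

x <- rpois(15, 3)   # one sample of 15 whole numbers
mean(x)             # some whole number of fifteenths
mean(x) * 15        # the sample total: always a whole number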