STAC33 Assignment 6

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

1 Interspike

A study of the membrane potential for neurons from guinea pigs was carried out. The data consist of 312 measurements of interspike intervals; that is, the lengths of the time periods between spontaneous firings from a neuron. The one column of data is in http://ritsokiguess.site/datafiles/interspike.csv.

(a) (2 points) Read in and display some of the data.

By now, two gimme points:

my_url <- "http://ritsokiguess.site/datafiles/interspike.csv"
interspike <- read_csv(my_url)
Rows: 312 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): interval

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
interspike

My dataframe appears to be called interspike. Pick a name that works for you, maybe firings or something like that.

(b) (2 points) Draw a normal quantile plot of the interspike intervals.

Thus:

ggplot(interspike, aes(sample = interval)) + stat_qq() + stat_qq_line()

(c) (2 points) Describe the shape of the distribution of interspike intervals. How do you know that this is the shape of the distribution?

It is skewed to the right, because there is a curved pattern with a short tail at the lower end (bunched up) and a long tail at the upper end (spread out). One point for saying the shape, one for saying how you know it’s that shape. Saying that it is an upward-opening curve is only a half point for the “how you know”, because this is something that you have memorized without knowing why a curve this way up indicates a right skew.

Extra: for yourself, if you want to convince yourself that you have the right answer, draw a histogram as well:

ggplot(interspike, aes(x = interval)) + geom_histogram(bins = 8)

This number of bins seems to be good. It’s very obviously right-skewed; in fact, it looks something like an exponential or a gamma distribution. These two are often used to model the time between events; you might have learned about the Poisson Process, where the number of events in a time period has a Poisson distribution and the time between events has an exponential distribution.
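To see the Poisson Process connection for yourself, here is a small simulation sketch (not part of the assignment): if the times between events are exponential with some rate, the counts of events per unit time interval should behave like a Poisson with that rate, for which the mean and variance are equal. The rate 2 and the seed are arbitrary choices for illustration:

```r
# simulate a Poisson process: exponential gaps between events
set.seed(457)                       # arbitrary seed, for reproducibility
rate <- 2
gaps <- rexp(10000, rate)           # times between events: exponential
times <- cumsum(gaps)               # the event times themselves
# count the events in each unit-length time interval
counts <- table(cut(times, breaks = 0:floor(max(times))))
mean(counts)                        # should be close to the rate, 2
var(as.numeric(counts))             # Poisson: variance close to the mean
```

Both numbers should come out close to 2, which is the Poisson signature.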

You can make a quantile plot for any distribution, not just normal:

ggplot(interspike, aes(sample = interval)) + stat_qq(distribution = qexp) + 
  stat_qq_line(distribution = qexp)

The nature of the exponential distribution is that you get a lot of small values at the bottom left and only a few large ones at the top right. There are only a few large observed values, but the ones we did observe are close to the line, and therefore an exponential distribution is a good fit.

Looking at the code, if the distribution you want a quantile plot for is not normal, you specify the distribution as distribution inside stat_qq and stat_qq_line. You have to do it in both places because the points and the line are drawn independently.

R knows, for any distribution such as exp (exponential), the functions dexp (density function), rexp (random sample), pexp (CDF, “probability of less than”), and qexp (inverse CDF). The probability of an exponential random variable with rate1 2 being less than 1.5 is

pexp(1.5, 2)
[1] 0.9502129

and hence

qexp(0.95, 2)
[1] 1.497866

because both the above are different ways of saying that the probability of being less than 1.5 is about 0.95. The quantile plot uses the inverse CDF to approximate what the lowest, second lowest, … highest values in a sample from an exponential distribution should be.
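If you want to see that last step done by hand, here is a sketch of what stat_qq is doing behind the scenes: ppoints(n) gives n evenly-spaced probabilities, and the inverse CDF turns those into the values the ordered sample "should" be near. The small simulated sample here is just for illustration:

```r
# the theoretical quantiles behind a quantile plot, done by hand
set.seed(457)
x <- rexp(8, 1)                  # a small exponential sample, rate 1
theoretical <- qexp(ppoints(8))  # where the order statistics "should" be
observed <- sort(x)              # the actual order statistics
cbind(theoretical, observed)     # these pairs are the plotted points
```

If the sample really is exponential, the two columns track each other, which is why the points hug the line.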

(d) (2 points) These data are times between neuron firings. How could you have guessed without looking at any data that the distribution shape would have been as you found? Explain briefly.

Times between events have a lower limit of 0 but no upper limit (you might wait a long time for the next event to occur). So you would expect to see a short tail on the left (especially if a lot of the times are close to zero) and a long tail on the right corresponding to the few occasions you had to wait a long time for the next firing.

As it turns out, a lot of the values of interval are close to zero, hence the strong right-skewness that you see in our data.

(e) (3 points) Test whether the median time between neuron firings is 1 second, or whether it is different from that. What do you conclude?

This is a test of the median (you might imagine that the researchers chose to use the median because it was a better “centre” for a distribution of this shape). Thus, you need a sign test, which means loading smmr and then:

sign_test(interspike, interval, 1)
$above_below
below above 
  208   104 

$p_values
  alternative      p_value
1       lower 1.993022e-09
2       upper 1.000000e+00
3   two-sided 3.986043e-09

The two-sided P-value is \(3.99 \times 10^{-9}\) (make sure you say which P-value out of the three you are using). This is extremely small, so we can reject the null hypothesis that the median is 1 (second) in favour of the alternative that it differs from 1 second. Thus, the median time between neuron firings is different from 1 second.

You might find it clearer to write down your hypotheses first, then get the right P-value, then reject the null, then write your conclusion in the context of the data. In any case, make sure that you distinguish your hypotheses from your conclusion, so that it is clear to your reader which is which. You need to at least make it clear that you know what the hypotheses are, then give the correct P-value, then your conclusion about time between neuron firings.

Extra 1: We have over 300 observations, and about two-thirds of them are below the null median. With this big a sample, this is very unbalanced above and below, and so it is no surprise that the P-value is so small.

Extra 2: with Extra 1 in mind, we’d expect the median time between neuron firings to be less than 1 second. With so much data, we would expect even a 99% CI to be short:

ci_median(interspike, interval, conf.level = 0.99)
[1] 0.4852029 0.7243470

and so it is, from 0.49 to 0.72 seconds, a long way below 1 second.

(There isn’t actually anything special about 1 second for these data, but I wanted to give you a null median to test, and my histogram suggested that 1 second was at least vaguely plausible for the median.)

Extra 3: to pursue the exponential distribution idea a bit further: the density function (in terms of the “rate”) is \(\lambda e^{- \lambda x}\) for \(x \ge 0\). From your previous work with this distribution, you’ll recall that the mean is \(1/\lambda\). To find the median, recall that the median \(m\) satisfies \(\int_0^m \lambda e^{- \lambda x}\; dx = 0.5\) (“the value that has probability 0.5 of being below it”). Solve that to find that \(m = -\ln 0.5 / \lambda = \ln 2 / \lambda\). This means that, in an exponential distribution, the median is always about 70% of the mean, since

log(2)
[1] 0.6931472

In other words, if you have exponential data, a test for the median is entirely equivalent to one for the mean, because they are both equivalent to a test for a value of \(\lambda\). In our case, a median of 1 goes with \(\lambda = \ln(2)\) and hence a mean of

1/log(2)
[1] 1.442695
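
You can check this median–mean relationship numerically with qexp, rather than trusting my calculus:

```r
# checking the exponential median-mean relationship with qexp
lambda <- log(2)         # the rate that should give median exactly 1
qexp(0.5, lambda)        # median of this exponential: should be 1
qexp(0.5, 2) / (1 / 2)   # median over mean at rate 2: should be log(2)
```

The first line comes out as exactly 1, and the second as 0.693, confirming that the median is always log(2) times the mean, whatever the rate.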

This says that, for exponential data, you can make a test for either the median or the mean by using a test for \(\lambda\), which you can obtain by standard likelihood theory, for example by writing down the likelihood function and recognizing the distribution of the sufficient statistic, to get an exact result, or by using likelihood ratio to get an approximate one. These ought to be better than the sign test, because they use the actual data values, and it looks as if our data really does come from an exponential distribution.

All right, likelihood time:

\[ L(\lambda) = \prod_{i = 1}^n \lambda e^{- \lambda x_i}\]

or, more easily, the log-likelihood:

\[ l(\lambda) = \sum_{i = 1}^n (\ln \lambda - \lambda x_i) = n \ln \lambda - n \lambda \bar{x}\]

The sufficient statistic is therefore \(\bar{x}\), and if you want to, you can run with that to get your test (it has a gamma distribution as I recall). But I’m a fan of likelihood ratio, because that needs no more than our formula for the log-likelihood, our data, and some calculation.
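If you did want to run with the sufficient statistic, here is a sketch of the exact test, using simulated data (with rate 1.147, like our estimate) in place of the real intervals. The sum of \(n\) independent Exponential(\(\lambda\)) values has a Gamma distribution with shape \(n\) and rate \(\lambda\), so under the null \(\lambda = \ln 2\) we can get an exact P-value from pgamma:

```r
# exact test via the sufficient statistic (simulated data for illustration)
set.seed(457)
x <- rexp(312, 1.147)    # stand-in for the real interspike intervals
n <- length(x)
s <- sum(x)              # sum of n iid Exp(lambda) ~ Gamma(n, rate = lambda)
p_lower <- pgamma(s, shape = n, rate = log(2))
p_upper <- pgamma(s, shape = n, rate = log(2), lower.tail = FALSE)
2 * min(p_lower, p_upper)   # two-sided: double the smaller tail
```

With data like ours, this P-value is tiny, in line with what the likelihood ratio test below gives.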

First, we define a function to calculate the log-likelihood for any value of \(\lambda\) and any data:

loglik <- function(lambda, x) {
  xbar <- mean(x)
  n <- length(x)
  n * log(lambda) - n * lambda * xbar
}

We need the maximum likelihood estimate of \(\lambda\). In this case, the calculus is very simple (with the result \(\hat \lambda = 1 / \bar{x}\)), but in other cases, you might not be so lucky. In that case, you can find the MLE by literally maximizing the log-likelihood:

ans <- optimize(loglik, c(0, 5), interspike$interval, maximum = TRUE)
ans
$maximum
[1] 1.146897

$objective
[1] -269.2388

optimize finds the max or min of a function of one variable2 over an interval. The inputs are the function, the endpoints of an interval in which you think the answer lies, any additional inputs to the function (here, a value for what our function calls x), and finally maximum = TRUE to make sure that it finds a maximum (rather than the minimum, which is the default). The answer says that the MLE of \(\lambda\) is about 1.147, and the log-likelihood at that MLE is \(-269.24\).

Is that MLE right? It should be one over the sample mean:

1/mean(interspike$interval)
[1] 1.146891

It agrees to five decimal places, so I call that good.

To do a likelihood ratio test, we also need the log-likelihood at the null value of \(\lambda\). We were testing that the median was 1, and the median is \(\ln 2 / \lambda\) so that for a median of 1 (our null hypothesis), we should take \(\lambda = \ln 2\):

l0 <- loglik(log(2), interspike$interval)
l0
[1] -302.9156

To make the likelihood ratio test, we take the ratio of the likelihoods at the maximum and at the null, or, in practice, the difference in log-likelihoods:

stat <- ans$objective - l0
stat
[1] 33.6768

To get a P-value for that, we say that twice the test statistic we just calculated has3 a chi-squared distribution if the null is true. The degrees of freedom is how many parameters we estimated to find the maximum (one) minus how many we estimated under the null (none), so 1. If the null is true, the log-likelihoods under the null and at the maximum should be close together, so a large value of the statistic is evidence against the null, and we need the upper-tail probability:

pchisq(2 * stat, 1, lower.tail = FALSE)
[1] 2.269279e-16

This is even smaller than the P-value we got from the sign test, so we have even stronger evidence that the median is not 1. This ought not to be very surprising: the sign test makes no assumptions about the data, in return for which we might get weaker conclusions, while this test assumes a parametric model (exponential) for the data; if the parametric model is to be believed, which our second quantile plot said that it was, our conclusions can be stronger. The difficulty, which turned out to be not much of one here, was expressing our null hypothesis (about the population median) in terms of the parameter of the distribution.

This Extra is already too long, but I observe that you can use the likelihood ratio test to get a confidence interval, in the same way that we did with the sign test: the values of the parameter inside the interval are the ones for which the likelihood ratio test is not rejected.
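
A sketch of that inversion, again with simulated data standing in for the real intervals: a value \(\lambda_0\) belongs in the 95% interval exactly when \(2(l(\hat\lambda) - l(\lambda_0))\) is below the chi-squared critical value, and uniroot can find where the boundary falls on each side of the MLE:

```r
# inverting the likelihood ratio test to get a 95% CI for lambda
set.seed(457)
x <- rexp(312, 1.147)    # stand-in for the real interspike intervals
loglik <- function(lambda, x) {
  n <- length(x)
  n * log(lambda) - n * lambda * mean(x)
}
lambda_hat <- 1 / mean(x)
lmax <- loglik(lambda_hat, x)
# fail to reject lambda0 when 2*(lmax - l(lambda0)) < qchisq(0.95, 1)
crit <- function(lambda0) 2 * (lmax - loglik(lambda0, x)) - qchisq(0.95, 1)
lo <- uniroot(crit, c(0.01, lambda_hat))$root
hi <- uniroot(crit, c(lambda_hat, 5))$root
c(lo, hi)             # CI for lambda
log(2) / c(hi, lo)    # hence a CI for the median, log(2) / lambda
```

Note that the median is a decreasing function of \(\lambda\), so the ends of the interval swap over when you convert.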

2 Spelling

Nine randomly-selected eighth-grade students were given a spelling test. These students were then given a two-week course to help them improve their spelling, and then they were given another similar spelling test. The students’ scores before and after the two-week course, and the difference between them (called differ), are in http://ritsokiguess.site/datafiles/spelling.csv.

(a) (1 point) Read in and display (probably all of) the data.

Only one point for the same old thing this time:

my_url <- "http://ritsokiguess.site/datafiles/spelling.csv"
Spelling <- read_csv(my_url)
Rows: 9 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): before, after, differ

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Spelling

My dataframe has an uppercase S on its name, for some reason. Yours probably does not.

(b) (2 points) How do you know that this is matched pairs data?

There are two measurements on the same individuals (the same nine students were measured both before and after). Get at the idea that the same individuals gave us two measurements each.

“Before” and “after” is a classic example of matched pairs: you measure the individuals beforehand, do something to them, and measure the same individuals again afterwards.

Extra: This is different from the kids learning to read in lecture. There, we only measured the kids once, so we had two independent samples. With the benefit of hindsight, we now see that it would have been smart to measure those kids both before and after so that we could see the improvement attributable to the two learning methods.

(c) (3 points) Make a suitable normal quantile plot. How do you know that what should be approximately normal actually is?

For a matched pairs \(t\)-test, the thing that should be approximately normal is the difference between the two test scores. We already have this in the dataframe, in the column differ, so we don’t even need to calculate it first:

ggplot(Spelling, aes(sample = differ)) + stat_qq() + stat_qq_line()

With such a small sample, pretty much all we can say is that these points are more or less close to the line, and therefore that the differences are close enough to normal.

You might say that the highest value is a bit too high (an upper outlier), or there is a little evidence of a curved shape (some right-skewness), but read the wording of the question carefully: I am saying that you need to get to a conclusion of “approximately normal”, and if you find one of these problems, you need to finish by saying something like “but it’s not too bad and the points are reasonably close to the line”.

Extra: you can do a bootstrap sampling distribution of the sample mean difference, in the usual way. I’m drawing a normal quantile plot of the results:

tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(Spelling$differ, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()

This is not quite normal, but it fails by having a short lower tail compared to the normal, which is about the best way it can fail. Nothing crazy should be happening to the sample mean, so our \(t\)-test should be all right.

The stair-step pattern on the plot is because the test scores, and hence the differences, were all integers, and so there is a limited number of different values the bootstrap sample mean can take: it must be an integer plus some number of ninths. On the other hand, if we’d done this with our interspike intervals from the other question, which were recorded to about five decimals, we would have had an essentially infinite number of possible sample means, and this pattern would not have shown up.
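
A quick sketch of why the bootstrap means are a whole number of ninths, using some made-up integer differences (these are not the actual data):

```r
# a bootstrap sample of 9 integers has mean (sum of 9 integers) / 9
set.seed(457)
diffs <- c(3, 5, -2, 0, 4, 6, 1, 2, 8)   # made-up integer differences
m <- mean(sample(diffs, replace = TRUE)) # one bootstrap sample mean
m * 9                                    # always a whole number
```

Nine times the bootstrap mean is the sum of nine integers, so the means can only take values on a grid of ninths, which is what produces the stairs.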

(d) (3 points) Run a suitable matched pairs \(t\)-test. What do you conclude from it, in the context of the data?

This is the standard way to do it. Remember that the two-week course is supposed to improve spelling, so a one-sided alternative is the right thing to do. The alternative is stated in terms of how the first column named in t.test compares to the second one if the alternative is true; before should be less than after, hence:

with(Spelling, t.test(before, after, paired = TRUE, alternative = "less"))

    Paired t-test

data:  before and after
t = -2.286, df = 8, p-value = 0.02579
alternative hypothesis: true mean difference is less than 0
95 percent confidence interval:
       -Inf -0.5596507
sample estimates:
mean difference 
             -3 

Note that the order here is the order you name the columns in t.test, not alphabetical order: if you switch before and after around, the alternative becomes the "greater" that you might have been expecting (because in that case after is listed first, not because it is alphabetically first).

Alternatively, you can use the differences. Those are after minus before, so if the alternative is correct, they should be greater than zero, and hence:

t.test(Spelling$differ, mu = 0, alternative = "greater")

    One Sample t-test

data:  Spelling$differ
t = 2.286, df = 8, p-value = 0.02579
alternative hypothesis: true mean is greater than 0
95 percent confidence interval:
 0.5596507       Inf
sample estimates:
mean of x 
        3 

You should probably get used to doing the test both ways.

The P-value, either way, is 0.026, which is less than 0.05, so we reject the null (the two-week course has no effect) in favour of the alternative (it has a positive effect), and thus the course helps students with their spelling on average (they do better afterwards on average than they did before).

Footnotes

  1. The rate of an exponential is one over the mean.↩︎

  2. If you have more than one variable, you use optim instead.↩︎

  3. Asymptotically, meaning in the limit as the sample size increases.↩︎