Worksheet 11

Published April 4, 2024

The questions are below. My solutions come after all the parts of a question; scroll down if you get stuck. For some of the questions there is extra discussion below my solution, which you might find interesting to read, maybe after tutorial.

For these worksheets, you will learn the most by spending a few minutes thinking about how you would answer each question before you look at my solution. There are no grades attached to these worksheets, so feel free to guess: it makes no difference at all how wrong your initial guess is!

Note for this one: Stan on r.datatools may not work very well. It goes a lot better if you have R running on your computer. (As I write, install_cmdstan is slowly compiling a lot of C++ code that is part of the internals of Stan.)

There are a lot of parts to this one question, but I wanted to lead you through everything.
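My solutions use functions from several packages. A typical setup chunk (assuming you have these installed) would cover everything below; read_csv, tibble and ggplot come from the tidyverse, and the other three packages supply the Stan-related functions:

library(tidyverse)
library(cmdstanr)   # cmdstan_model, $sample
library(posterior)  # as_draws_df
library(bayesplot)  # mcmc_hist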

Bayesian estimation for the exponential distribution: my solutions

The data in http://ritsokiguess.site/datafiles/wait-times.csv are the times to failure of a certain type of electronic component. It is believed that the failure times, in column wait, come from an exponential distribution with rate parameter \(\beta\) (the mean of the distribution is \(1/\beta\)). From the data, what can we say about \(\beta\) using Bayesian methods?

Stan knows about the exponential distribution, with one parameter (the rate), and also about the gamma distribution with parameters u and v. As parametrized in Stan, the gamma distribution has mean \(u/v\) and variance \(u/v^2\).

  1. Read in and display (some of) the data.

  2. Someone tells you that components coming out of this process have a failure rate (value of \(\beta\)) “almost certainly between 0.1 and 0.5.” Find a way to convert that into a gamma prior distribution for \(\beta\). You don’t need to be very sophisticated.

  3. Write a Stan program to simulate the posterior distribution of \(\beta\) using the prior distribution you just made. Allow the sample size and the parameters of the prior distribution to be input to your program.

  4. Verify that your Stan code compiles, saving the result of the compilation. If it doesn’t, the message you get should give you some hints about what to fix up.

  5. Set up your data ready to pass into Stan.

  6. Run your compiled Stan model with your data, saving the result.

  7. What is the posterior mean for \(\beta\), and what is a 90% posterior interval for \(\beta\) (the middle 90% of the posterior distribution)?

  8. Obtain a histogram of the posterior distribution of \(\beta\).

  9. Obtain the posterior predictive distribution, and draw a histogram of it.

  10. Finally, compare the histogram of your data with the posterior predictive distribution.

1 Bayesian estimation for the exponential distribution

The data in http://ritsokiguess.site/datafiles/wait-times.csv are the times to failure of a certain type of electronic component. It is believed that the failure times, in column wait, come from an exponential distribution with rate parameter \(\beta\) (the mean of the distribution is \(1/\beta\)). From the data, what can we say about \(\beta\) using Bayesian methods?

Stan knows about the exponential distribution, with one parameter (the rate), and also about the gamma distribution with parameters u and v. As parametrized in Stan, the gamma distribution has mean \(u/v\) and variance \(u/v^2\).

(a) Read in and display (some of) the data.

As ever:

my_url <- "http://ritsokiguess.site/datafiles/wait-times.csv"
times <- read_csv(my_url)
Rows: 20 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): wait

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
times

The times to failure are decimal numbers (what Stan calls real) with two decimal places. There are 20 of them.

(b) Someone tells you that components coming out of this process have a failure rate (value of \(\beta\)) “almost certainly between 0.1 and 0.5.” Find a way to convert that into a gamma prior distribution for \(\beta\). You don’t need to be very sophisticated.

We could treat the 0.1 and 0.5 as the middle 95% of our desired gamma distribution. If a gamma distribution were like a normal distribution (which it isn't, but we ignore that for now), these would be roughly the mean \(\pm\) twice the SD, so the mean would be 0.3 and the SD would be 0.1. Now we need to turn these into values for \(u\) and \(v\): we know that \(u/v = 0.3\) (mean) and \(u/v^2 = 0.1^2 = 0.01\) (variance). Divide the first of these by the second to get \(v = 0.3/0.01 = 30\), and hence \(u = 0.3 \times 30 = 9\). (The algebra comes out nicely.)
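To spell that little calculation out in R (this is just the arithmetic from the previous paragraph):

m <- 0.3         # guessed mean of the prior
s <- 0.1         # guessed SD of the prior
v <- m / s^2     # since (u/v) / (u/v^2) = v
u <- m * v       # since u/v = m
c(u = u, v = v)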

R’s gamma density function has the same two parameters (called the shape and the rate), so we can check to see whether this is anywhere near reasonable:

tibble(x = seq(0, 1, 0.01)) %>% 
  mutate(y = dgamma(x, 9, 30)) %>% 
  ggplot(aes(x = x, y = y)) + geom_line()

Well, that’s not too bad, although there’s more of the distribution above 0.5 than below 0.1. If you wanted to be more sophisticated, you would tweak the values of shape and rate to get 2.5% of the distribution below 0.1 and 2.5% above 0.5, but I’m going to call this good.
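As a quick check on that, qgamma will tell you where the actual 2.5% and 97.5% points of our gamma(9, 30) distribution are, for comparison with the target 0.1 and 0.5:

qgamma(c(0.025, 0.975), shape = 9, rate = 30)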

Extra: if you wanted to go a bit further, you could try a grid of values for u and v and pick the pair that is best. Except that you first have to define “best”. Let’s define that as the pair for which the probability of being between 0.1 and 0.5 is closest to 0.95 (without worrying about putting 0.025 in each tail).

To calculate that, use pgamma, which is the cumulative distribution function for the gamma (like this for our 9 and 30):

pgamma(0.5, 9, 30) - pgamma(0.1, 9, 30)
[1] 0.9587505

which is actually already pretty close, indicating that 9 and 30 are not going to be bad. So let’s define a collection of values for u and v to try:

us <- c(5, 7, 9, 11, 13)
vs <- c(20, 25, 30, 35, 40)

and then make all the combinations of those with crossing:

crossing(u = us, v = vs)

and then work out the probability of being between 0.1 and 0.5 for all of those:

crossing(u = us, v = vs) %>% 
  rowwise() %>% 
  mutate(prob = pgamma(0.5, u, v) - pgamma(0.1, u, v))

and find the closest ones to 0.95:

crossing(u = us, v = vs) %>% 
  rowwise() %>% 
  mutate(prob = pgamma(0.5, u, v) - pgamma(0.1, u, v)) %>% 
  mutate(off = abs(prob - 0.95)) %>% 
  arrange(off)

The pair \(u = 7\) and \(v = 25\) is actually the winner, but 9 and 30 came pretty close. (You could now do a finer grid of values near those.)
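That finer grid might look something like this (a sketch; the ranges in seq are my arbitrary choices around the winning pair):

crossing(u = seq(6, 8, 0.25), v = seq(22, 28, 0.5)) %>% 
  rowwise() %>% 
  mutate(prob = pgamma(0.5, u, v) - pgamma(0.1, u, v)) %>% 
  mutate(off = abs(prob - 0.95)) %>% 
  arrange(off)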

(c) Write a Stan program to simulate the posterior distribution of \(\beta\) using the prior distribution you just made. Allow the sample size and the parameters of the prior distribution to be input to your program.

Create a new Stan file (File, New File, Stan File) and save it as something (mine is called exponential.stan). You’ll see a template for a simple normal model, which you can leave for inspiration (and delete later), or delete now.

I’m going to call the data y (you can call it whatever you like). I would start with the model section, first the likelihood and then the prior:

model {
  // prior
  beta ~ gamma(u, v);
  // likelihood
  y ~ exponential(beta);
}

The likelihood says that our data come from an exponential distribution with rate beta (which we are trying to estimate), and the prior above it says that beta has a gamma prior distribution with some parameters u and v that we will worry about later.

Next, the parameters. There is only one, beta, which is a real number bigger than zero, which you declare like this, above the model:

parameters {
  real<lower = 0> beta;
}

Everything else is data, declared at the top. That includes y, which is a real array of length, say, n (we don’t want to hard-code the sample size), the values of u and v (both real and nonnegative), and, now that we’ve said that y is of length n, we have to declare n too (an integer that is at least 1):

data {
  int<lower = 1> n;
  real<lower = 0> u;
  real<lower = 0> v;
  array[n] real y;
}

So now you have data, then parameters, then model, saved in a Stan file.
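Extra: since failure times cannot be negative, you could also put a lower limit on y when you declare it, the same way as for beta and the prior parameters. Stan will then check the data values as they come in:

data {
  int<lower = 1> n;
  real<lower = 0> u;
  real<lower = 0> v;
  array[n] real<lower = 0> y;
}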

(d) Verify that your Stan code compiles, saving the result of the compilation. If it doesn’t, the message you get should give you some hints about what to fix up.

expo <- cmdstan_model("exponential.stan")

There should be no output, but you can look at the result:

expo

data {
  int<lower = 1> n;
  real<lower = 0> u;
  real<lower = 0> v;
  array[n] real y;
}


parameters {
  real<lower = 0> beta;
}

model {
  // prior
  beta ~ gamma(u, v);
  // likelihood
  y ~ exponential(beta);
}

Warning: there used to be problems on r.datatools where running Stan models would run you out of memory. I should check whether that’s still the case. Give me a moment (you will need to install everything first, which takes some time).

(e) Set up your data ready to pass into Stan.

As a list, whose named elements are the ones in your data section:

expo_data <- list(n = 20, u = 9, v = 30, y = times$wait)
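If you would rather not hard-code the 20 (in case the data change), count the rows instead:

expo_data <- list(n = nrow(times), u = 9, v = 30, y = times$wait)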

(f) Run your compiled Stan model with your data, saving the result.

expo_fit <- expo$sample(data = expo_data)
Running MCMC with 4 sequential chains...

Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 1 Iteration:  100 / 2000 [  5%]  (Warmup) 
Chain 1 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 1 Iteration:  300 / 2000 [ 15%]  (Warmup) 
Chain 1 Iteration:  400 / 2000 [ 20%]  (Warmup) 
Chain 1 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 1 Iteration:  600 / 2000 [ 30%]  (Warmup) 
Chain 1 Iteration:  700 / 2000 [ 35%]  (Warmup) 
Chain 1 Iteration:  800 / 2000 [ 40%]  (Warmup) 
Chain 1 Iteration:  900 / 2000 [ 45%]  (Warmup) 
Chain 1 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 1 Iteration: 1100 / 2000 [ 55%]  (Sampling) 
Chain 1 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 1 Iteration: 1300 / 2000 [ 65%]  (Sampling) 
Chain 1 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 1 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 1 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 1 Iteration: 1700 / 2000 [ 85%]  (Sampling) 
Chain 1 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 1 Iteration: 1900 / 2000 [ 95%]  (Sampling) 
Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 1 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 1 Exception: gamma_lpdf: Random variable is 0, but must be positive finite! (in '/tmp/RtmpBEv7cr/model-1229586b4c89.stan', line 16, column 2 to column 21)
Chain 1 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 1 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 1 
Chain 1 finished in 0.0 seconds.
Chain 2 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 2 Iteration:  100 / 2000 [  5%]  (Warmup) 
Chain 2 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 2 Iteration:  300 / 2000 [ 15%]  (Warmup) 
Chain 2 Iteration:  400 / 2000 [ 20%]  (Warmup) 
Chain 2 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 2 Iteration:  600 / 2000 [ 30%]  (Warmup) 
Chain 2 Iteration:  700 / 2000 [ 35%]  (Warmup) 
Chain 2 Iteration:  800 / 2000 [ 40%]  (Warmup) 
Chain 2 Iteration:  900 / 2000 [ 45%]  (Warmup) 
Chain 2 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 2 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 2 Iteration: 1100 / 2000 [ 55%]  (Sampling) 
Chain 2 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 2 Iteration: 1300 / 2000 [ 65%]  (Sampling) 
Chain 2 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 2 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 2 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 2 Iteration: 1700 / 2000 [ 85%]  (Sampling) 
Chain 2 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 2 Iteration: 1900 / 2000 [ 95%]  (Sampling) 
Chain 2 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 2 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 2 Exception: gamma_lpdf: Random variable is inf, but must be positive finite! (in '/tmp/RtmpBEv7cr/model-1229586b4c89.stan', line 16, column 2 to column 21)
Chain 2 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 2 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 2 
Chain 2 finished in 0.0 seconds.
Chain 3 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 3 Iteration:  100 / 2000 [  5%]  (Warmup) 
Chain 3 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 3 Iteration:  300 / 2000 [ 15%]  (Warmup) 
Chain 3 Iteration:  400 / 2000 [ 20%]  (Warmup) 
Chain 3 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 3 Iteration:  600 / 2000 [ 30%]  (Warmup) 
Chain 3 Iteration:  700 / 2000 [ 35%]  (Warmup) 
Chain 3 Iteration:  800 / 2000 [ 40%]  (Warmup) 
Chain 3 Iteration:  900 / 2000 [ 45%]  (Warmup) 
Chain 3 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 3 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 3 Iteration: 1100 / 2000 [ 55%]  (Sampling) 
Chain 3 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 3 Iteration: 1300 / 2000 [ 65%]  (Sampling) 
Chain 3 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 3 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 3 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 3 Iteration: 1700 / 2000 [ 85%]  (Sampling) 
Chain 3 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 3 Iteration: 1900 / 2000 [ 95%]  (Sampling) 
Chain 3 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 3 finished in 0.0 seconds.
Chain 4 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 4 Iteration:  100 / 2000 [  5%]  (Warmup) 
Chain 4 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 4 Iteration:  300 / 2000 [ 15%]  (Warmup) 
Chain 4 Iteration:  400 / 2000 [ 20%]  (Warmup) 
Chain 4 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 4 Iteration:  600 / 2000 [ 30%]  (Warmup) 
Chain 4 Iteration:  700 / 2000 [ 35%]  (Warmup) 
Chain 4 Iteration:  800 / 2000 [ 40%]  (Warmup) 
Chain 4 Iteration:  900 / 2000 [ 45%]  (Warmup) 
Chain 4 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 4 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 4 Iteration: 1100 / 2000 [ 55%]  (Sampling) 
Chain 4 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 4 Iteration: 1300 / 2000 [ 65%]  (Sampling) 
Chain 4 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 4 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 4 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 4 Iteration: 1700 / 2000 [ 85%]  (Sampling) 
Chain 4 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 4 Iteration: 1900 / 2000 [ 95%]  (Sampling) 
Chain 4 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 4 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 4 Exception: gamma_lpdf: Random variable is 0, but must be positive finite! (in '/tmp/RtmpBEv7cr/model-1229586b4c89.stan', line 16, column 2 to column 21)
Chain 4 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 4 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 4 
Chain 4 finished in 0.0 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.0 seconds.
Total execution time: 0.6 seconds.

You should see 4 chains with 2000 iterations each speed by. You might get a (very) small number of warnings, like the informational messages above: early on, the sampler occasionally proposes a value of beta right at the edge of its allowed range (0 or infinity), which the gamma prior rejects. As the message itself says, if this only happens sporadically, it is nothing to worry about.
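Extra: the sampling is random, so your numbers will be slightly different from mine, and different again if you re-run. If you want reproducible results, $sample accepts a seed (the 457 here is an arbitrary choice of mine):

expo_fit <- expo$sample(data = expo_data, seed = 457)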

(g) What is the posterior mean for \(\beta\), and what is a 90% posterior interval for \(\beta\) (the middle 90% of the posterior distribution)?

Look at the output from the run:

expo_fit
 variable   mean median   sd  mad     q5    q95 rhat ess_bulk ess_tail
     lp__ -72.28 -72.00 0.71 0.33 -73.78 -71.76 1.00     1716     2739
     beta   0.23   0.22 0.04 0.04   0.16   0.30 1.00     1508     2072

For me, the posterior mean is 0.23, and the 90% posterior interval is 0.16 to 0.30.

(I have install_cmdstan going on r.datatools as I write, and it is either very slow or stuck; I’m not sure which.)
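Extra: if you want the posterior mean and interval programmatically, rather than reading them off the printout, the fitted model object has a summary method (the q5 and q95 columns are the two ends of our middle-90% interval):

expo_fit$summary("beta")

Also, this is a model where you can check the simulation against exact theory: a gamma prior is conjugate to an exponential likelihood, so the exact posterior is gamma with parameters \(u + n\) and \(v + \sum y_i\), and hence the exact posterior mean is

(9 + 20) / (30 + sum(times$wait))

which should come out very close to the 0.23 that Stan gave us.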

(h) Obtain a histogram of the posterior distribution of \(\beta\).

This is one of the things I sped through in lecture:

mcmc_hist(expo_fit$draws("beta"), binwidth = 0.02)

You’ll need to adjust the binwidth from the one on the lecture slides, since the values of beta are small. This, as you see, is slightly skewed to the right.

(i) Obtain the posterior predictive distribution, and draw a histogram of it.

This is a bit complicated. First, obtain and save (as a dataframe) the sampled values of beta:

as_draws_df(expo_fit$draws()) %>% 
  as_tibble() -> expo_draws
expo_draws

The numbers in the beta column are the posterior distribution of \(\beta\), as obtained by simulation.

Next, draw a random exponential for each beta value, using that beta as the rate:

expo_draws %>% 
  rowwise() %>% 
  mutate(predictive = rexp(1, beta)) -> ppd
ppd

Some of the values in predictive are quite large, because the exponential distribution is right-skewed.
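Extra: rexp is vectorized over its rate input, so you can get the same thing without rowwise, generating all the random exponentials in one call (a matter of taste; the rowwise way matches how we have done things elsewhere):

expo_draws %>% 
  mutate(predictive = rexp(n(), rate = beta)) -> ppd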

Now to plot them, via a histogram:

ggplot(ppd, aes(x = predictive)) + geom_histogram(bins = 15)

Evidently, it’s possible that we might observe some very big values.

(j) Finally, compare the histogram of your data with the posterior predictive distribution.

The idea is that none of the values we actually observed should be too unlikely in the posterior predictive distribution.

ggplot(times, aes(x = wait)) + geom_histogram(bins = 5)

We observed one value bigger than 20, which is at the upper end of what the predictive distribution considers plausible, but the others are all less than about 15, which looks entirely unremarkable according to the predictive distribution.

The idea here is that if the data and the posterior predictive distribution are not roughly similar, then something is wrong: either the model is wrong, or the prior distribution is wrong. But here, all looks reasonably good; the nature of the exponential distribution is that you will occasionally get some very large values even if the exponential is the right model.
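Extra: a sketch of one way to make the comparison in a single graph (the red is my arbitrary colour choice): put the posterior predictive distribution on a density scale, and draw a density curve of the observed data on top of it. If the red curve sits where the histogram has most of its height, the data are consistent with the predictive distribution:

ggplot(ppd, aes(x = predictive)) +
  geom_histogram(aes(y = after_stat(density)), bins = 15) +
  geom_density(data = times, aes(x = wait), colour = "red")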