STAC32 Assignment 5

Packages

library(tidyverse)
library(smmr)

Component lifetimes

The lifetimes of 20 electronic components were measured, in hours. These are recorded in http://ritsokiguess.site/datafiles/components.csv, in column C1. The engineers want to decide whether the “typical” lifetime is 20 hours, or whether it is longer than that. (“Typical” could be mean or median, as appropriate.)

  1. (1 point) Read in and display (some of) the data.

As usual. Give your dataframe a suitable name:

my_url <- "http://ritsokiguess.site/datafiles/components.csv"
lifetimes <- read_csv(my_url)
Rows: 20 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): C1

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
lifetimes

Extra: these data are ex15.33 from the Devore7 package.

  2. (3 points) Draw a suitable graph of these data, and comment on the distribution’s shape.

We haven’t said anything about normality yet, so it is just fine to draw a histogram, with a suitable number of bins, such as 6:

ggplot(lifetimes, aes(x = C1)) + geom_histogram(bins = 6)

or 7:

ggplot(lifetimes, aes(x = C1)) + geom_histogram(bins = 7)

These histograms suggest “bimodal” (with a hole in the distribution at around 60 hours and a second peak around 80).

If you drew a histogram, say something about shape that is consistent with your histogram.

I think you could also draw a normal quantile plot here, as long as you focus on the distributional shape it shows you:

ggplot(lifetimes, aes(sample = C1)) + stat_qq() + stat_qq_line()

This is either skewed to the right, or there are five upper outliers (the lower tail is actually slightly too short).

Extra 1: I did a bit of thinking about multimodality on normal quantile plots. The logic is that when the dots leap upwards very quickly, the data are less dense than a normal distribution (a “hole”), and where the dots are unexpectedly flat, the data are more dense than a normal distribution (possibly an extra peak). My take is that holes are usually easier to see than extra peaks on a normal quantile plot.

A normal quantile plot that goes up faster than the line in the middle and more slowly than it in the tails can also be the S-bend that is characteristic of short tails. As a result, multimodality and short tails can be difficult to tell apart. This one, though, more obviously has a hole in it, once you know how to look for that.

Note first that the data scale is the \(y\)-axis, so that is where you look for holes and extra peaks. On that axis, there is a big jump from about 40 to about 70, so there are no observations at all between these two values (this shows up on a six-bin histogram). On the six-bin histogram, there are peaks on either side of the hole, one around 20 and one around 80. On the normal quantile plot, there are three points in an almost horizontal line around 25, so that might be a peak, and the nearest thing to a peak above the hole is the pair of almost-horizontal points around 85, so that might be the other peak. As I say, though, it's easier to see the hole than the extra peaks; given the lack of observations between 40 and 70, you would expect a peak on each side of that, even if you can't pin down exactly where they are.
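
If you want to convince yourself that the hole is really there (also optional), you can sort the lifetimes and look for the jump in the sorted values:

with(lifetimes, sort(C1))

There should be a noticeable gap somewhere between about 40 and 70, with no values in between.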

Extra 2: I suspect the story you get from your histogram might depend on the number of bins you chose. There are only 20 observations, so more bins than six or seven is really too many:

ggplot(lifetimes, aes(x = C1)) + geom_histogram(bins = 9)

The second peak is even clearer (around 85), with a big gap between the two peaks, but there is also a suspicion of a third peak around 12 that I think is almost certainly just chance.

I suspect four bins is too few to see the shape very well:

ggplot(lifetimes, aes(x = C1)) + geom_histogram(bins = 4)

It’s difficult to tell whether that is a second peak around 100 or just a long right tail. I would say you definitely need more bins than this.

  3. (2 points) Argue in favour of using a sign test here rather than a \(t\)-test.

The usual two things: how non-normal the distribution is, and the sample size. Here, the distribution is clearly non-normal (skewed, or with outliers, or bimodal, as you said before), and the sample size of 20 (look at the number of rows you read in) is not especially large. These both point towards using a sign test for the median rather than a \(t\)-test for the mean.

I told you which decision you needed to come to, so make sure your argument actually is an argument in favour of that decision.

If you wish (optional), make a bootstrap sampling distribution of the sample mean:

tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(lifetimes$C1, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()

This actually doesn’t look so bad, but it is clearly skewed to the right: the sample size isn’t quite big enough to remove the influence of the long upper tail (five outliers, if that’s what you saw).

There isn’t any remaining influence of the two peaks and the hole (the sample size is at least large enough to smooth them out), but the non-normality of the data still persists here.

  4. (3 points) Run a suitable sign test. What do you conclude?

We wanted to know whether the median was 20 (from the data description) or greater than 20, so enter 20 as the null median, after the dataframe and column:

sign_test(lifetimes, C1, 20)
$above_below
below above 
    7    13 

$p_values
  alternative   p_value
1       lower 0.9423409
2       upper 0.1315880
3   two-sided 0.2631760

The appropriate P-value is the upper-tailed one, 0.1316, so we do not reject the null hypothesis, and conclude that there is no evidence of the median lifetime being greater than 20.
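
Extra (optional): if you are curious where that P-value comes from, the sign test is really a binomial calculation. If the median really were 20, each lifetime would be above 20 with probability 0.5, and the upper-tailed P-value is the probability of seeing 13 or more values above 20 out of the 20. Something like this should reproduce the value in the output:

# probability of 13 or more values above 20, out of 20, each with probability 0.5
sum(dbinom(13:20, size = 20, prob = 0.5))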

  5. (2 points) Run the (inappropriate) \(t\)-test. Why do you think the conclusion came out differently?

One-tailed again:

with(lifetimes, t.test(C1, mu = 20, alternative = "greater"))

    One Sample t-test

data:  C1
t = 2.5579, df = 19, p-value = 0.009617
alternative hypothesis: true mean is greater than 20
95 percent confidence interval:
 25.76245      Inf
sample estimates:
mean of x 
   37.785 

This time, the P-value is just less than 0.01, and so we would reject the null, and conclude that the mean is greater than 20.

The two tests came out differently because they are testing different parameters: the sign test is testing the median, but the \(t\)-test is testing the mean. The mean is likely to be made larger because of the long right tail (or upper outliers, or second mode), so that the evidence is more convincing that the mean is greater than 20, as compared to that the median is greater than 20.

The output from the sign test says that 7 of the observed lifetimes are less than 20 and 13 are greater. A 13–7 split is not that even, but it is also not far from a 10–10 split, so it is not convincing enough to conclude that the median is greater than 20.
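
If you want to see the two measures of centre side by side (optional), you can calculate them both; the mean ought to come out noticeably bigger than the median, for the reasons above:

lifetimes %>% summarize(mean_life = mean(C1), median_life = median(C1))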

Blood oxygen content

56 children had their blood oxygen content measured at the Children’s Hospital in Melbourne (Australia), both with a chemical method analysing gases in the blood (CO) and with a “pulse oximeter” measuring through the skin (pulse). Each measurement was actually taken three times in quick succession; we will analyze the third measurement in each case. The aim is to see whether there are systematic differences between the two methods (which are both supposed to be measuring the same thing).

The data are in http://ritsokiguess.site/datafiles/ox.csv. The columns are: an ID for each child (item), the measurement number (repl, always 3), and the measurements of blood oxygen content by the two methods as described above.

  1. (1 point) Read in and display (some of) the data.

As usual:

my_url <- "http://ritsokiguess.site/datafiles/ox.csv"
oxygen <- read_csv(my_url)
Rows: 56 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (4): item, repl, CO, pulse

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
oxygen

The columns as promised, and 56 rows (56 children).

Extra: The data came to me like this:

library(MethComp)
data("ox")
ox

so I had to do some reorganizing. The column y contains the oxygen saturation as measured by the method in meth, and the original data contained the results for all three replicates (we are only using the third one). So, grab only the third replicate and reshape:1

ox %>% 
  filter(repl == 3) %>% 
  pivot_wider(names_from = meth, values_from = y) -> oxygen0
oxygen0

The idea of the pivot_wider is to create new columns CO and pulse (from meth), filled with the values that were in y, matched up with the right method. The column item is used to ensure that the two measurements for each child do get paired up. (If we didn’t have that, we wouldn’t know which value of y to pair up with which.) We see pivot_longer and pivot_wider “officially” later in the course, so don’t worry about the details now; just try to get the idea of what pivot_wider is doing here.

  2. (2 points) How do you know that we should use some kind of matched-pairs test here? Explain briefly.

The situation in which you would use matched pairs is when you have matched-up measurements on each experimental unit. Here, the experimental units are the children, and the two things that are matched up are the two methods of measuring oxygen saturation.

Hence, these measurements are matched pairs because each child had their oxygen saturation measured by both methods.

Extra: note that the key word in the data description is “both”. If, instead of saying “both … and” it had said “either … or”, this would have meant that each child had their oxygen saturation measured by only one of the two methods (presumably chosen at random), and then we would have been in a two-sample situation. From a practical point of view, though, blood oxygen is something you would expect to be able to measure by different methods on the same child at the same time (or close to the same time), unlike the kids learning to read in lecture: learning to read is something you can only do once.

It is better to do matched pairs if you can, because then you have a comparison between the two measurements on the same individual, and the only reason those should be different is because they were measured using different methods. In the two-sample case, you randomize to get two samples that are as similar as possible, but the people in the two samples might end up being different for all sorts of reasons other than the treatment whose effect you are trying to assess. Thus, a two-sample experiment has extra sources of variability that might make it more difficult to detect the differences you really want to find.

  3. (3 points) Run a suitable \(t\)-test on these data. What do you conclude from it? Briefly justify any choices you make.

We have just said that this is matched pairs, so that is the kind of \(t\)-test to run. The data description says that interest is in “any systematic differences” between the two methods, so we should run a two-sided test. (That was the choice that you needed to briefly justify.)

A two-sided test is the default, so no alternative is needed:

with(oxygen, t.test(CO, pulse, paired = TRUE))

    Paired t-test

data:  CO and pulse
t = 2.3276, df = 55, p-value = 0.02364
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 0.3117392 4.1739750
sample estimates:
mean difference 
       2.242857 

The P-value is 0.024, less than 0.05, so we can conclude that the mean blood oxygen content values are different between the two methods. (The null hypothesis is that the two means are equal, and the alternative is that they are not equal.)

There is another, equally good, way to do this: you can do it as a one-sample \(t\)-test on the differences. In fact, if you read ahead, you’ll see that you will need the differences again later, so it is a good idea to calculate and save them now, and then you will have them for later when you need them:2

oxygen %>% 
  mutate(diff = CO - pulse) -> oxygen

Now that you have the differences, the \(t\)-test is this:3

with(oxygen, t.test(diff, mu = 0))

    One Sample t-test

data:  diff
t = 2.3276, df = 55, p-value = 0.02364
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.3117392 4.1739750
sample estimates:
mean of x 
 2.242857 

and the results and conclusion are exactly the same. The conclusion here is literally that the mean difference is not zero, which is the same thing as saying that the two methods have different means.

Extra: usually the reason for doing a study like this is that one of the two methods is cheaper or easier than the other. My understanding is that the CO method will involve taking a blood sample and sending it to the lab, while the pulse method is just clipping a device onto the end of the child’s finger and reading off the result, much easier. If the two methods give the same result, then there is no reason to use the more expensive or difficult one. However, we have just shown that the two methods do not give the same result on average (the output shows that the CO mean is significantly larger), and so for now the people behind this study will have to make a decision about which method is more accurate, and use that one.

This is one of those rare occasions where you do not want to reject the null hypothesis, and so you would need to have a large enough sample to get decent power to reject it if it is actually false.

  4. (3 points) Make a suitable graph to assess the assumptions for your test. What do you conclude?

The key assumption is that the differences are normally distributed, or to be more precise are close enough to normal given the sample size. I calculated the differences above, but if you didn’t, you’ll need to do so first (the same mutate).

Since we are interested in normality specifically, this is a good time to draw a normal quantile plot:

ggplot(oxygen, aes(sample = diff)) + stat_qq() + stat_qq_line()

The normality is not great: there are some outliers at the bottom (or possibly a long lower tail), and possibly an outlier at the top as well. That said, the sample size is 56, large, so we may not have to worry about normality too much.

This is a good place to remind you that there is nothing magic about a sample size of 30. The larger the sample size, the better the Central Limit Theorem works, but also it depends on how non-normal your data are. Depending on that, a sample size of 30 could be easily big enough, or it could be way too small.
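
If you want to see that in action (optional), here is a quick simulation sketch using made-up data rather than our differences: sample means of \(n = 30\) values from an exponential distribution, which is strongly skewed to the right. (The exponential distribution and its rate of 1 are just illustrative choices here.)

tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_mean = mean(rexp(30, rate = 1))) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()

You will most likely see some curvature left over at the top end, so \(n = 30\) is not automatically enough when the data are skewed enough.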

You could come to some different conclusions here, as long as your reasoning is sound:

  • the sample size is large and the distribution of differences is only moderately non-normal, therefore the \(t\)-test is appropriate.
  • the distribution of differences is clearly not normal, and the sample size, though large, is not large enough to overcome this non-normality, so the \(t\)-test is not appropriate.

Come to a conclusion like one of these for a good reason, and I am happy.

The second-best graph is a histogram, which you can then eyeball for bell-curvedness:

ggplot(oxygen, aes(x = diff)) + geom_histogram(bins = 8)

This shows a low outlier, and possibly a long left tail as well (the exact impression you get will probably depend on how many bins you use). The conclusion is about the same as from the normal quantile plot, but this is a second-best choice of plot because it is easier to assess normality from a normal quantile plot, and what we care about here is normality. (If I had asked you to draw a plot first, it would have been a different story, because then the general shape of the distribution is of interest.)

Another reasonable way to answer the question is a bootstrap sampling distribution of the sample mean. You could reasonably say that it is in fact the best way to answer the question, but I didn’t want to insist on you doing one: if you do, it can be full credit for this question, but you can also get full credit by drawing a normal quantile plot.

Here’s the bootstrap sampling distribution (based on 10,000 simulations), assessed with its own normal quantile plot:

tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(oxygen$diff, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()

I’d have to say that this is only very slightly skewed (to the left), and therefore my take is that the \(t\)-test is fine. Note that you do not discuss sample size when talking about this plot, because the sample size is built into the simulation; your call here is only “this is close to normal” or “this is not close to normal”.

Extra: another plot that you might see in this context (but that doesn’t address normality) is called a “spaghetti plot”. This plots the original data (not the differences) with the two treatments on the \(x\)-axis, the blood oxygen content on the \(y\), and the two observations for each individual joined by lines. This works most easily with the data in the original layout (as described in an earlier Extra):

ox %>% 
  filter(repl == 3) %>% 
  ggplot(aes(x = meth, y = y, group = item)) + geom_point() + geom_line()

The group says to only join the points by a line if they belong to the same item (child). The name of the plot comes from how these lines look like strands of spaghetti.

Most of the observations (the ones at the top of the graph) go slightly downhill, meaning that for most of the children, the CO measurement is a bit higher than the pulse one. There are some observations (mostly but not exclusively at the bottom of the graph) where the spaghetti strand goes uphill, and one of them goes substantially uphill (this is the very negative difference on your normal quantile plot). The story here seems to be that children low in blood oxygen content (by one or both methods) look very different from the others, including in comparison between CO and pulse.
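
If you want to quantify “most of the strands go downhill” (optional), you can count how many of the differences you saved earlier are positive, that is, how many children had a higher CO measurement than pulse:

oxygen %>% count(diff > 0)

The counts should line up with the above-and-below table from the sign test in the next part.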

This is an informative graph, but it is not informative about normality of differences, which is our main focus here.

  5. (3 points) Run a suitable test for comparing the two methods that does not assume normality. Is your conclusion the same as before?

The right test, not assuming normality, is a sign test that the differences have median 0 (vs. that they don’t). Load smmr, and:

sign_test(oxygen, diff, 0)
$above_below
below above 
   17    39 

$p_values
  alternative     p_value
1       lower 0.999079223
2       upper 0.002280766
3   two-sided 0.004561533

This test is two-sided, so the right P-value is 0.0046. This is much less than 0.05, so we conclude that the median blood oxygen content is not the same for the two methods. This is the same conclusion that we came to before about the means.

Extra: my take was that the \(t\)-test was all right (because of the large sample size), and therefore both tests seem to be valid. Usually when that happens, the P-value for the \(t\)-test will be smaller, because that test uses the actual data values rather than counting how many of them are above or below zero. But that’s not what happened here: the P-value for the sign test was quite a bit smaller. My guess as to why that is: maybe the long lower tail on the distribution of differences inflated the standard deviation of differences (even with the large sample), and hence the \(t\)-test came out less significant than it would have done otherwise.

If you think that the normality was definitely not good enough, then you should prefer the sign test, and then you get the unusual bonus of getting a smaller P-value as well.

Auto gears

A distributor of auto gears wishes to determine which of two manufacturers of gears they should do business with. The distributor receives the gears in batches of 100, and counts the number of defective (non-functional) gears in each batch. They then note the manufacturer (A or B) the batch of gears came from, and the number of defective gears in that batch. They continue until they have received 20 batches of gears from each manufacturer. The data are in http://ritsokiguess.site/datafiles/autogear.csv.

  1. (1 point) Read in and display (some of) the data.

No surprises:

my_url <- "http://ritsokiguess.site/datafiles/autogear.csv"
gears <- read_csv(my_url)
Rows: 40 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): manufacturer
dbl (1): defectives

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gears

The ones you see are all from manufacturer A, but if you scroll down two pages, you see manufacturer B. It should be pretty clear what the columns represent.
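
If you want to check that there really are 20 batches from each manufacturer (optional), a quick count will do it:

gears %>% count(manufacturer)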

Extra: these are the Autogear data from the BSDA package. For a change, we are using these as is.

  2. (2 points) Make a suitable graph of the two variables.

The obvious thing is a boxplot:

ggplot(gears, aes(x = manufacturer, y = defectives)) + geom_boxplot()

If you want to read ahead and see that normality is going to be relevant (or take a guess that this is a Mood’s median test question and that I am likely to ask at some point about the relevance of normality), you might prefer to draw a normal quantile plot, which is fine if you make the case for doing that instead of a boxplot:

ggplot(gears, aes(sample = defectives)) + stat_qq() +
  stat_qq_line() + facet_wrap(~ manufacturer)

Remember that you need a separate normal quantile plot for each group, going this way. The coding strategy is to pretend that you are making a normal quantile plot of all the numbers of defectives together, and then at the end say “oh, by the way, I want to do this separately for each manufacturer”.

  3. (2 points) Counts often have a non-normal distribution. Is there any evidence of that here? Explain briefly.

If you drew a boxplot:

The numbers of defectives from manufacturer B (right) have a symmetric distribution with no outliers, so there is no evidence of non-normality here.

From manufacturer A (left), the distribution is clearly right-skewed with an upper outlier (that is probably part of the long tail rather than being an isolated unusual observation). Make sure you mention the right-skewness, since that is the most important feature.

If you drew a normal quantile plot:

The numbers of defectives from manufacturer B (on the right) are very close to normally distributed. I don’t think you can quibble with this at all (it looks almost exactly as you would expect a sample of 20 observations from a normal distribution to look, except there are some repeated values).

The distribution of defectives from manufacturer A (on the left), on the other hand, is clearly skewed to the right. Note the curved shape (or the bunching up at the bottom and the extra spread at the top). Skewness is a better answer than “outliers at the upper end”, because there is something wrong with the whole distribution, not just a few isolated observations.

Discuss both groups, because you are looking for evidence in both places.

Extra: counts often have a thing called a Poisson distribution, which is skewed to the right. It has one parameter (the mean); if the mean is small, the distribution is very skewed, but as the mean gets larger, the distribution becomes more normal in shape.4 The logic behind this is that 0 defectives is the lower limit, and if the mean is small, you will observe a lot of zeros and a few values bigger than zero, hence skewed to the right. On the other hand, if the mean is large, most of the observations will be far above zero, like the ones from manufacturer B, and it basically does not matter that there is actually a lower limit, because none of the observations are close to it.
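
Here is an optional sketch of that, using simulated Poisson data (the sample size of 1000 and the means of 2 and 25 are just illustrative choices, nothing to do with the gears). With a small mean, the histogram is piled up near zero and skewed to the right; with a large mean, it looks much more bell-shaped:

tibble(small_mean = rpois(1000, lambda = 2), large_mean = rpois(1000, lambda = 25)) %>% 
  pivot_longer(everything(), names_to = "poisson_mean", values_to = "count") %>% 
  ggplot(aes(x = count)) + geom_histogram(bins = 15) + 
  facet_wrap(~ poisson_mean, scales = "free_x")

(This uses pivot_longer to get both sets of simulated counts into one column for plotting; as with pivot_wider, don’t worry about the details of that yet.)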

  4. (2 points) The distributor says to you “one of these groups is close to normal, so we will be all right running a \(t\)-test to compare the manufacturers”. How would you respond to that?

A two-sample \(t\)-test is assuming that the observations in both samples are close enough to normally distributed. This appears not to be true for the observations from Manufacturer A, and the sample size of \(n = 20\) may not be big enough to overcome this.

Make sure you address the distributor’s point: it is not enough for one sample to be close to normal; they both have to be.5

Extra: the ol’ bootstrap sampling distribution might shed some light on this. My strategy for a two-sample test is to pick the group that looks worse (manufacturer A here), and get a bootstrap sampling distribution for its sample mean. The idea is that if the worse one is actually OK, the \(t\)-test is also OK, but if it is not, neither is the \(t\)-test. Here we go:

gears %>% filter(manufacturer == "A") -> a
tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(a$defectives, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()

This is pretty close to normal, but it still definitely has a right-skewed shape. It looks as if a sample size of 20 is not quite big enough, but you wouldn’t need the sample size to be much bigger before you would be happy with a \(t\)-test.

  5. (4 points) Run a suitable Mood’s median test. What do you conclude, in the context of the data?

Two points for this code and output, having loaded the smmr package at some point, preferably at the top of your writeup:

library(smmr)
median_test(gears, defectives, manufacturer)
$grand_median
[1] 26

$table
     above
group above below
    A     5    12
    B    11     6

$test
       what      value
1 statistic 4.25000000
2        df 1.00000000
3   P-value 0.03925033

The other two points for the interpretation:

  • We are testing the null hypothesis that both manufacturers have the same median number of defectives.
  • The alternative hypothesis is that the medians are different (Mood’s median test is two-sided, which is good here because we didn’t have any suspicions ahead of time about which manufacturer might be better)
  • The P-value is 0.039
  • The P-value is less than 0.05, so we reject the null in favour of the alternative, and conclude that there is a difference in the median number of defective gears between the two manufacturers.

Make sure you state the hypotheses (or clearly imply that you know what they are), and say what the P-value is as well as whether it is greater or less than 0.05.

Extra: you might be curious about how the \(t\)-tests compare. First, Welch:

t.test(defectives ~ manufacturer, data = gears)

    Welch Two Sample t-test

data:  defectives by manufacturer
t = -2.2465, df = 37.587, p-value = 0.03063
alternative hypothesis: true difference in means between group A and group B is not equal to 0
95 percent confidence interval:
 -9.4122347 -0.4877653
sample estimates:
mean in group A mean in group B 
          23.80           28.75 

and second, pooled, which in fact seems just as justifiable here (look back at your boxplot):

t.test(defectives ~ manufacturer, data = gears, var.equal = TRUE)

    Two Sample t-test

data:  defectives by manufacturer
t = -2.2465, df = 38, p-value = 0.03056
alternative hypothesis: true difference in means between group A and group B is not equal to 0
95 percent confidence interval:
 -9.4106236 -0.4893764
sample estimates:
mean in group A mean in group B 
          23.80           28.75 

The three P-values are in fact all almost identical and the conclusion is the same each time, so in fact it didn’t matter which test we did.6 My take is that maybe the sample size of 20 in each group was large enough after all; the two-sample \(t\)-test tends to be a bit more forgiving than the worse of the two samples suggests, because the subtraction of the two sample means in the \(t\)-statistic tends to make things a bit more normal than they were before.
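
If you want to investigate that last idea (optional), here is a sketch of a bootstrap sampling distribution of the difference in sample means, re-using the dataframe a from above and making a matching one for manufacturer B. If the subtraction really does help, this should look closer to normal than the bootstrap distribution for manufacturer A alone did:

gears %>% filter(manufacturer == "B") -> b
tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(sample_a = list(sample(a$defectives, replace = TRUE))) %>% 
  mutate(sample_b = list(sample(b$defectives, replace = TRUE))) %>% 
  mutate(my_diff = mean(sample_a) - mean(sample_b)) %>% 
  ggplot(aes(sample = my_diff)) + stat_qq() + stat_qq_line()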

  6. (2 points) Is it possible to make a recommendation about which manufacturer the distributor should do business with in the future? Explain briefly.

The Mood’s median test said that there was a difference between the two manufacturers, so one of them must be better than the other in terms of the median number of defectives. The output says that the grand median number of defectives was 26; the majority of observations from manufacturer A were below that, and the majority from manufacturer B were above.

A smaller number of defectives is better, so manufacturer A is the one to recommend.

If you wanted to, you could literally work out the medians for each manufacturer:

gears %>% 
  group_by(manufacturer) %>% 
  summarize(med_def = median(defectives))

and come to the same conclusion, but if you go this way, you also have to mention the significant difference. Otherwise, these two medians could be not significantly different, and then you would have no basis for choosing one manufacturer over the other (any difference you happened to observe in that case is just chance).

Points: \((1+3+2+3+2)+(1+2+3+3+3)+(1+2+2+2+4+2) = 11+12+13 = 36\)

Footnotes

  1. If you haven’t seen pivot_wider in class yet: don’t worry, you will learn about it later.↩︎

  2. I saved them back into the same dataframe.↩︎

  3. Or use oxygen$diff if you prefer.↩︎

  4. There is a “normal approximation to the Poisson” in the same way that there is a normal approximation to the binomial, which is used in inference for proportions.↩︎

  5. One of the things you’ll face as a statistician is dealing with people who half-remember their statistics, and gently setting them straight.↩︎

  6. Of course, I wanted you to practice Mood’s median test, so that’s the one I had you do.↩︎