Worksheet 7

Published

October 19, 2025

Questions are below. My solutions will be available after the tutorials are all finished. The whole point of these worksheets is for you to use your lecture notes to figure out what to do. In tutorial, the TAs are available to guide you if you get stuck. Once you have figured out how to do this worksheet, you will be prepared to tackle the assignment that depends on it.

If you are not able to finish in an hour, I encourage you to continue later with what you were unable to finish in tutorial.

Packages

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(smmr)

Fuel efficiency comparison

Some cars have on-board computers that calculate quantities related to the car’s performance. One of the things measured is the fuel efficiency, that is, how much gasoline the car uses. On an American car, this is measured in miles per (US) gallon. On one type of vehicle equipped with such a computer, the fuel efficiency was measured each time the gas tank was filled up, and the computer was then reset. Twenty observations were made, and are in http://ritsokiguess.site/datafiles/mpgcomparison.txt. The computer’s values are in the column Computer. The driver also calculated the fuel efficiency by hand, by noting the number of miles driven between fill-ups, and the number of gallons of gas required to fill the tank each time. The driver’s values are in Driver. The final column Diff is the difference between the computer’s value and the driver’s value for each fill-up. The data values are separated by tabs.

  1. Read in and display (some of) the data.

Solution

library(tidyverse)

This is like the athletes data, so read_tsv:

my_url <- "http://ritsokiguess.site/datafiles/mpgcomparison.txt"
fuel <- read_tsv(my_url)
Rows: 20 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
dbl (4): Fill-up, Computer, Driver, Diff

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fuel

20 observations, with columns as promised. The first column Fill-up labels the fill-ups from 1 to 20, in case we needed to identify them. (This is an additional clue that we have paired data, or repeated measures in general.1)

Extra: another way to read the data, and why it works: since the “tab” is a delimiter, distinguishing one data value from the next, you might be wondering whether you could use read_delim somehow. It turns out that you can, but you have to figure out how to represent “tab” as something that read_delim can understand. One way is to search for it, of course, but another way is to see what happens when you use read_delim the way you already know (assuming the data values are separated by spaces):

my_url <- "http://ritsokiguess.site/datafiles/mpgcomparison.txt"
fuel0 <- read_delim(my_url, " ")
Rows: 20 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
chr (1): Fill-up    Computer    Driver  Diff

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fuel0

This fails, but how it fails is instructive.2 There is now one column, amusingly called Fill-up Computer Driver Diff, whose values are text, the four numbers glued together with \t in between. This is how R recognizes a tab character.3 So you can try read_delim by using this as the second input rather than a space.

You could get to this point with less insight by saying to yourself “this thing \t, whatever it is, is what separates the data values, so how about I try this as the second input to read_delim?”

Here’s how it goes:

my_url <- "http://ritsokiguess.site/datafiles/mpgcomparison.txt"
fuel0 <- read_delim(my_url, "\t")
Rows: 20 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
dbl (4): Fill-up, Computer, Driver, Diff

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fuel0

and it works, with four numeric columns as promised.

Another option is to use read_delim without a second input:

my_url <- "http://ritsokiguess.site/datafiles/mpgcomparison.txt"
fuel0 <- read_delim(my_url)
Rows: 20 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
dbl (4): Fill-up, Computer, Driver, Diff

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fuel0

This works, but you need to understand what is going on, something beyond “well, it seems to work”. One way is to look at the message from read_delim that says what got read in:

── Column specification ────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
dbl (4): Fill-up, Computer, Driver, Diff

The four quantitative columns, but look above that, where it says Delimiter: \t. The delimiter is what separates each data value from the next one on the same line. The symbol \t means “tab”, which you could assert, but it is better to offer some sort of evidence that this is correct. To do that, search for something like R \t.

The first hit for me was actually not R-related, but was a Stack Overflow question (scroll down to the two answers) that has the answer we need. The symbol \t comes from Unix (and is therefore also in Linux and in the C programming language, and, because R’s heritage is C, in R as well).
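If you want evidence from R itself rather than from a search, ask R how many characters are in the string "\t" (a quick check; nchar counts the characters in a string):

nchar("\t")
[1] 1

One character, not two: the backslash is an escape character and doesn’t count.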

\(\blacksquare\)

  2. What is it that makes this paired data? Explain briefly.

Solution

There are two measurements for each fill-up: the computer’s calculation of gas mileage, and the driver’s. These go together because they are two calculations of what is supposed to be the same thing. (Another hint is that we have differences, with the veiled suggestion that they might be useful for something.)

If you want to have a general way of tackling this kind of problem, ask yourself “is there more than one observation of the same thing per individual?” Then go on and talk about what the individuals and observations are for the data you’re looking at. In this case, the individuals are fill-ups, and there are two observations per fill-up: one by the driver and one by the computer. Compare this with the children learning to read (from lecture): there were \(23+21 = 44\) children altogether (individuals), and each child had one reading score, based on the reading method they happened to be assigned to. So those are 44 independent observations, and a two-sample test is the way to go for them.

Extra 1: To go back to our gas mileages, here is another way you might have received the data: in a “longer” layout, with one column of gas_mileage values and a second column indicating whether the Driver or Computer observed4 them.

That looks more like a two-sample layout, but each consecutive pair of rows (the computer’s value and the driver’s value) comes from the same fill-up, so we know that these are actually paired (by fill-up) and that layout of the data is wrong for our analysis. (Later in the course, we learn how to convert between the “wider” layout we read in and this “longer” one. I didn’t show you the code here, because I didn’t want to confuse you at this point.)
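If you are curious anyway, here is a sketch of how the conversion could go, using pivot_longer from tidyr (loaded with the tidyverse). The column names recorded_by and gas_mileage are my inventions for illustration:

fuel %>% 
  pivot_longer(c(Computer, Driver), 
               names_to = "recorded_by", 
               values_to = "gas_mileage")

This stacks each fill-up’s Computer and Driver values into a single gas_mileage column, with recorded_by saying where each value came from.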

Extra 2: if your individuals have more than two measurements each, you can’t shoehorn it into a one-sample model any more, because there are too many differences among5 measurements now. These are repeated measures data, and the modern way to analyze these is using mixed models. The student GPA example in the link shows how it goes; it is repeated measures because we have six observations (of GPA) for each individual (student).

\(\blacksquare\)

  3. Draw a suitable graph of these data, bearing in mind what you might want to learn from your graph.

Solution

In a matched pairs situation, what matters is whether the differences have enough of a normal distribution. The separate distributions of the computer’s and driver’s results are of no importance. So make a graph of the differences. We are specifically interested in normality, so a normal quantile plot is best:

ggplot(fuel, aes(sample = Diff)) + stat_qq() + stat_qq_line()

The next best plot is a histogram of the differences:

ggplot(fuel, aes(x = Diff)) + geom_histogram(bins = 6)

This one actually looks left-skewed, or has an outlier at the bottom. The appearance may very well depend on the number of bins you choose.
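The claim about bins is easy to check for yourself: re-draw the histogram with a different bins value and see whether your impression changes, for example:

ggplot(fuel, aes(x = Diff)) + geom_histogram(bins = 10)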

Extra: the distributions of the computer’s and driver’s gas mileage measurements themselves are totally irrelevant to whether you would run a matched pairs test of any kind; the only thing that matters is the distribution of the differences.

The only other graph you could justify (and even then it is a bit of a stretch) would be a scatterplot of the two measurements at each fill-up:

ggplot(fuel, aes(x = Driver, y = Computer)) + geom_point() + 
  geom_abline(intercept = 0, slope = 1)

By adding the line \(y = x\) (which has intercept 0 and slope 1) to the graph, you can see how the pairs of measurements compare to each other; for most of them, the computer’s measurement is larger than the corresponding driver’s measurement (the point is above the line), and for only a few is it smaller. If you had decided to do a matched-pairs sign test, this graph would be interesting (with the line), because the sign test is based on how many points are above the line vs. how many are below. But what this graph does not do is tell you whether you need to do a matched-pairs sign test in the first place, because it does not tell you about the distribution of the differences (in particular, whether that is normal enough). Thus, the scatterplot, even with the line, is inferior to even the histogram of differences.
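If you would rather count than eyeball, the points above the line are exactly the fill-ups where the difference is positive, so counting the signs of Diff does the job (this is the same count a sign test would use):

fuel %>% count(Diff > 0)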

\(\blacksquare\)

  4. Is there any difference between the average results of the driver and the computer? (Average could be mean or median, whichever you think is best). Do an appropriate test.

Solution

The choices are a matched-pairs \(t\)-test, or a matched-pairs sign test (on the differences). To choose between those, look back at your graph. My take is that the only possible problem is the smallest (most negative) difference, but that is not very much smaller than expected. This is especially so given the sample size (20), which means that the Central Limit Theorem will help us enough to take care of the small outlier.

I think, therefore, that a paired \(t\)-test is the way to go, to test the null that the mean difference is zero (against the two-sided alternative that it is not zero, since we were looking for any difference). There are two ways you could do this: as literal matched pairs:

with(fuel, t.test(Computer, Driver, paired = TRUE))

    Paired t-test

data:  Computer and Driver
t = 4.358, df = 19, p-value = 0.0003386
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 1.418847 4.041153
sample estimates:
mean difference 
           2.73 

or, as a one-sample test on the differences:

with(fuel, t.test(Diff, mu = 0))

    One Sample t-test

data:  Diff
t = 4.358, df = 19, p-value = 0.0003386
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 1.418847 4.041153
sample estimates:
mean of x 
     2.73 

with the same result.

If you thought that low value was too much of an outlier, the right thing would have been a sign test that the median difference is zero (vs. not zero), thus (using the smmr package):

sign_test(fuel, Diff, 0)
$above_below
below above 
    3    17 

$p_values
  alternative     p_value
1       lower 0.999798775
2       upper 0.001288414
3   two-sided 0.002576828

In any of those cases, we conclude that the average difference is not zero, since the P-values are less than 0.05. (The right one for the sign test is 0.0026.)

I don’t mind which test you do, as long as it is one of the ways of doing matched pairs, and you justify your choice by going back to your graph. Doing one of the tests without saying why is being a STAB22 or STAB57-level statistician (in fact, it may not even be that), not a professional-level applied statistician.

Extra 1: you could find out how the average difference is not zero by looking at a confidence interval; the one for the mean difference is from 1.4 to 4.0. If you did a sign test, you would do this to get a confidence interval:

ci_median(fuel, Diff)
[1] 1.102148 4.500000

1.1 to 4.5, a little wider,6 but otherwise very similar.

So we think the computer was estimating the gas mileage to be that much higher than the driver was, on average.

Extra 2: The P-value for the \(t\)-test was almost 10 times smaller than for the sign test. This is because the \(t\)-test was using the actual differences, not just whether they were greater or less than zero as the sign test did (17 out of the 20 differences were positive). If you thought, as I did, that it was all right to use the \(t\)-test, then the smaller P-value is trustworthy. (The usual thing, if two tests give similar results, as they do here, is that the one with the smaller P-value is better, but that is not always something to rely on. You still need to make an assessment.)

Extra 3: A bootstrapped sampling distribution of the sample mean would be a better way to decide how happy you are with the \(t\)-test. This will incorporate the sample size into your consideration. It is really the same thing as the one for the one-sample \(t\), bearing in mind that it’s the differences you need to work with:

tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(fuel$Diff, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()

This is a tiny bit left-skewed still (the lowest values are very slightly too low, and the highest ones are slightly bunched up). The apparent low-end outlier in the distribution of differences barely even shows up in the sampling distribution. (If we had a smaller sample size, like 10 or especially 5, we would see it here more clearly, and then we would know that a sample size of 5 or 10 is not big enough.)

In other words, despite what you might think from your earlier plot, the matched pairs \(t\)-test would be very much defensible here.

\(\blacksquare\)

  5. The people who designed the car’s computer are interested in whether the values calculated by the computer and by the driver on the same fill-up are usually close together. Explain briefly why it is that looking at the average (mean or median) difference is not enough. Describe what you would look at in addition, and how that would help.

Solution

The mean (or median) difference can be close to zero without the individual differences being close to zero. The driver could be sometimes way higher and sometimes way lower, and the average difference could come out close to zero, even though each individual pair of measurements is not that close together. Thus looking only at the average difference is not enough.
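A toy example makes the point. These four differences have a mean of exactly zero, but every individual difference is large:

toy_diffs <- c(-10, 12, -11, 9)
mean(toy_diffs)
[1] 0

The SD of these is over 12, so the two measurements in each pair are typically far apart, even though they agree “on average”.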

There are (at least) two (related) things you might choose from to look at, in addition to the mean or median:

  • the spread of the differences (the SD or inter-quartile range). If the average difference and the spread of differences are both close to zero, that would mean that the driver’s and computer’s values are consistently similar.
  • a suitable confidence interval here (for the mean or median difference, as appropriate for what you did) would also get at this point.

To think about the spread: use the SD if you did the \(t\)-test for the mean, and use the IQR if you did the sign test for the median, that is, one of these:

fuel %>% summarize(diff_sd = sd(Diff))

or

fuel %>% summarize(diff_iqr = IQR(Diff))

as appropriate. Make a call about whether you think your measure of spread is small or large. If you think it’s small, you can say that the differences are consistent with each other, and therefore that the computer’s and driver’s values are typically different by about the same amount (we concluded that they are different). If you think your measure of spread is large, then the driver’s and computer’s values are inconsistently different from each other (which would be bad news for the people who designed the car’s computer, as long as you think the driver was being reasonably careful in their record-keeping).

If you said that the confidence interval for the mean/median was the thing to look at, then pull the interval out of the \(t\)-test output or run ci_median, as appropriate. The computer’s miles per gallon is consistently between 1.4 and 4.0 miles per gallon higher (mean) or 1.1 and 4.5 miles per gallon higher (median). Make a call about whether you consider this interval short or long, bearing in mind it’s based on the difference between two numbers that are about 40, and then say that the computer’s measurement is larger than the driver’s by a consistent amount (if you think the interval is short) or by an amount that varies quite a bit (if you think the interval is long).

Extra: in a situation like the one here, where there is a significant difference between the means, knowing also that the standard deviation is small would tell us that the computer is consistently giving a value that is bigger than the driver’s by about the same amount every time. Another way to think about this is to plot the driver’s and computer’s values against each other with a (solid) reference line7 like this:

ggplot(fuel, aes(x = Driver, y = Computer)) + geom_point() +
  geom_abline(intercept = 0, slope = 1) +
  geom_abline(intercept = 3, slope = 1, linetype = "dashed")

The solid reference line is where a point would be if the driver and computer had the same value. Here, there seems to be some consistency in the points being up and to the left of, but parallel to, the reference line. This would be saying that the computer’s value would be fairly consistently more than the driver’s by about 3 miles per gallon (eyeballing the plot; the dashed line), with some exceptions.

\(\blacksquare\)

Canned tuna

Light tuna is sold in 170-gram cans. The tuna can be canned in either water or oil. Is there a difference in the typical selling price of tuna canned in water or in oil? To find out, 25 supermarkets were sampled. In 14 of them (randomly chosen), the price of a (randomly chosen) brand of tuna in water was recorded, and in the other 11 supermarkets, the price of a (randomly chosen) brand of tuna in oil was recorded. The data are in http://ritsokiguess.site/datafiles/tuna.csv, with the column canned_in saying whether the tuna was canned in oil or water, and the price being in cents.

  1. Read in and display (some of) the data.

Solution

Making it easy for you:

my_url <- "http://ritsokiguess.site/datafiles/tuna.csv"
tuna <- read_csv(my_url)
Rows: 25 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): canned_in
dbl (1): price

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
tuna

Scrolling down reveals that the cans of tuna canned in oil are at the bottom.
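If you would rather not scroll, something like slice_tail displays the last few rows directly:

tuna %>% slice_tail(n = 4)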

\(\blacksquare\)

  2. Make a suitable graph of these data. (There are actually two graphs that would be suitable; see if you can make both.) Comment on the shape of the distributions of prices.

Solution

You are probably expecting a boxplot, and with one quantitative variable (price) and one categorical one (canned_in) that is definitely reasonable:

ggplot(tuna, aes(x = canned_in, y = price)) + geom_boxplot()

The other graph you might consider is a normal quantile plot. You might have looked ahead, and realized that normality of each of the two samples is something we should be concerned about. There are two samples, so a facetted normal quantile plot is the thing:

ggplot(tuna, aes(sample = price)) + stat_qq() + stat_qq_line() +
  facet_wrap(~ canned_in)

Both plots clearly show that the two distributions of prices, for the tuna canned in oil and for the tuna canned in water alike, are skewed to the right.

\(\blacksquare\)

  3. Why might it be a bad idea to run a two-sample \(t\)-test on these data? Explain briefly.

Solution

For a two-sample \(t\)-test, we assume that both sets of price values are drawn from a normal distribution. These ones are very definitely not, since both distributions are skewed to the right, as you said just now.

The second thing to consider is the sample sizes. You could count the dots on your normal quantile plot, or get R to count them for you:

tuna %>% count(canned_in)

These are 14 and 11, not large, and with this amount of skewness the Central Limit Theorem will not help enough (at least, that’s my guess). You could get hold of the bootstrap sampling distributions of the sample means for each group if you wanted to investigate this further.
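If you did want to investigate, it goes the same way as the bootstrap for the paired differences earlier, one group at a time. A sketch for the tuna canned in oil (oils is just my name for that group’s data; swap in "water" to see the other group):

tuna %>% filter(canned_in == "oil") -> oils
tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(oils$price, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()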

So, considering these two things, a two-sample \(t\)-test is unwise.

Looking ahead a bit:

The other thing that might have been relevant (but turned out not to be) was that the prices of tuna canned in oil were much more spread out than those canned in water. This shows up on the facetted normal quantile plot as the line for oil being steeper than the one for water. (The two facets are on the same scales, so the group whose points fill more of its facet, and whose line is steeper, is more variable.) This would also show up in the usual way on a boxplot, if you drew one: a taller box for oil.
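To put numbers on the difference in spread, you could compare the SD or the IQR by group:

tuna %>% 
  group_by(canned_in) %>% 
  summarize(price_sd = sd(price), price_iqr = IQR(price))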

This turns out not to be relevant because once we have rejected normality, as we have, it doesn’t matter whether one of the groups is more spread out than the other; the test we’ll see, the Mood median test, will cope perfectly well. You might have run into another nonparametric test for this situation called the rank sum test, or the Mann-Whitney test, or the Wilcoxon test (these are all the same test). For the rank sum test, it matters that both groups have the same spread; the usual way it is phrased is that the two groups have to have the same distribution apart possibly from a difference in medians. I don’t like the rank sum test for this reason; if we can handle a difference in spreads in the \(t\)-test (by using the Welch test), we ought to be able to handle a difference in spreads in the test we use that does not assume normality. I don’t know of a Welch-like version of the rank sum test, and suspect there isn’t one (because of the way ranks work).
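For the record, the rank sum test lives in base R as wilcox.test; if you wanted to see what it says here (not my recommendation, for the reasons above), it would be run like this:

wilcox.test(price ~ canned_in, data = tuna)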

\(\blacksquare\)

  4. We just decided that running a \(t\)-test was a bad idea here. Run Mood’s median test for these data to determine whether the median selling price differs between tuna canned in water and in oil. What do you conclude, in the context of the data? Hint: use something from the smmr package.

Solution

For Mood’s median test, use smmr (easier). I didn’t say anything about building it yourself:

median_test(tuna, price, canned_in)
$grand_median
[1] 67

$table
       above
group   above below
  oil       5     5
  water     7     6

$test
       what      value
1 statistic 0.03350816
2        df 1.00000000
3   P-value 0.85475695

Our null hypothesis is that the median price of tuna canned in water and tuna canned in oil is the same. Our alternative is two-sided, since we are looking for any difference: that is, that the two median prices are different. With a P-value of 0.855, much larger than 0.05, we cannot reject the null hypothesis (we “retain” the null, if you like that word better). There is no evidence of a difference in the median price of tuna canned in water and tuna canned in oil.

Extra: you could perhaps make a case for a one-sided alternative (though I specifically said “differs”, implying a two-sided test). One way to support that would be to hypothesize that, if anything, the tuna canned in oil would be more expensive on average than tuna canned in water (because oil is more expensive than water). Then you would have to see whether you are on the correct side: in actual fact, a greater proportion of the cans of tuna canned in water have a price above the grand median (just over 50%, vs. exactly 50% for oil), so we are in fact on the wrong side for that alternative and therefore we cannot hope to reject the null.

\(\blacksquare\)

  5. Explain briefly why the counts of values above and below the overall median (in the previous part) are entirely consistent with the P-value that you found.

Solution

We failed to reject that the two group medians are equal, so we would expect to see a more or less even split of values above and below the overall median in both the groups. Specifically, that means we would expect close to half of the values in each group to be below the overall median and close to half of them to be above. (It has to be “half” because the overall median has to have half of all the data values below and half above).

To see how close that was to happening, look at the table of aboves and belows in the output from the previous part. It turns out that for the tuna canned in oil, 5 prices were above the overall median and 5 were below, an exactly even split, and for the tuna canned in water, 7 were above and 6 below, very close to an even split. (This is, in fact, exactly the sort of thing we would have expected to see if the null hypothesis were actually true.) These are extremely close to being 50-50 splits, and so not only would we not expect to reject the null hypothesis, but we would also expect to get the very large P-value that we did.

(In case you were wondering where the other two values went, since there were 11 oil prices and 14 water ones, they were exactly equal to the overall median and so got counted as neither above nor below.)
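You can reconstruct smmr’s table of aboves and belows yourself, ties included; the grand median here is 67:

tuna %>% 
  mutate(vs_median = case_when(price > 67 ~ "above",
                               price < 67 ~ "below",
                               TRUE ~ "equal")) %>% 
  count(canned_in, vs_median)

You should see the 5-5 and 7-6 splits from before, plus one price in each group exactly equal to 67.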

\(\blacksquare\)

Neuropathy and pain relief

The data in http://ritsokiguess.site/datafiles/exactRankTests_neuropathy.csv are the results of a study of pain relief in diabetic patients. The patients were randomly assigned to a standard treatment (“control”) or to a new treatment (“treat”), and for each patient a pain score was recorded, with a higher score indicating better pain relief. There were 58 patients altogether, 30 in the control group and 28 in the treatment group.

  1. Read the data and display at least some of it.

As ever:

my_url <- 'http://ritsokiguess.site/datafiles/exactRankTests_neuropathy.csv'
neuropathy <- read_csv(my_url)  
Rows: 58 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): group
dbl (1): pain

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
neuropathy

There are indeed 58 patients, in two groups control and treat.8

Extra: Precisely, a pain score was measured twice for each patient, once before the study started and once after four weeks, and the variable pain you see here is the (natural) log of the baseline score divided by the final score.9 This is the researchers’ way of summarizing pain relief by one number, instead of requiring us to deal with matched pairs (in a two-sample context, which would be rather confusing).
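To see how the sign of pain works, a couple of made-up values: a baseline score bigger than the final score (pain went down) gives a ratio bigger than 1 and hence a positive log, and the other way around gives a negative one:

log(10 / 4)
log(4 / 10)

The first of these is positive and the second negative (equal in size, opposite in sign).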

  2. Draw a suitable graph of the two variables.

The obvious thing, given that we are not explicitly looking for normality (yet), but just getting a general sense of centre and spread and shape, is a boxplot:

ggplot(neuropathy, aes(x = group, y = pain)) + geom_boxplot()

Interpretation coming up.

You could also make a facetted normal quantile plot, thus:

ggplot(neuropathy, aes(sample = pain)) + stat_qq() +
  stat_qq_line() + facet_wrap(~ group)

  3. Does your plot suggest that there will be a significant treatment effect, or not? Explain briefly.

Looking at the boxplot, the treatment median is a bit higher than the control median, but the treatment scores (especially) are very variable, so I would guess that the smallish difference in medians is not going to be significant: the difference in medians is not very big compared to how much variability there is.

Note that this is not really a question you can answer with a normal quantile plot, since that plot is specifically designed to answer the question “are my data normal?”, and for a question like this, you really need a more general-purpose graph such as a boxplot.

  4. The researchers decided to run a Mood’s median test to compare the pain scores for the treatment and control groups. Why do you think they decided to do that, on the evidence you have so far?

For this question, normality is very much the issue, so you can use either a boxplot or (perhaps better) a normal quantile plot to answer it.

From the boxplot, it looks as if neither distribution is very close to normal, because both groups have an extreme low outlier, and it also looks as if both groups have an upper outlier as well. (I would call these outliers rather than long tails because their values seem to be a long way away from the other observations, quite a bit lower or higher).

From my normal quantile plot, the control group on the left appears to have a right-skewed distribution, except for that very unusual low observation (an outlier). The treatment group on the right also has a long upper tail coupled with a low outlier, though not as extreme a one as in the control group. If you want something that summarizes both groups, you could say that they both have a long upper tail and a low outlier, though there is a little bit of a curved pattern that would point towards skewness.

Note that my interpretations of the plots disagree about whether the high values are a long upper tail or outliers. Different plots tend to emphasize different things.

The other thing to think about is sample size. We have 30 observations in the control group and 28 in the treatment group; these are not small samples, but the outliers, especially the low one in each group, seem more extreme than you would expect samples of this size to be able to handle.

Extra: that was as far as I was expecting you to go, but I seem to have the bootstrap sampling distribution of the sample mean here, at least for the control group:

neuropathy %>% filter(group == "control") -> controls
controls
tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(controls$pain, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()

This, for me, is surprisingly good. I guess a little copying and pasting will reveal whether the treatment group is as good:

neuropathy %>% filter(group == "treat") -> treats
treats
tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(treats$pain, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()

Pretty much, just a bit of funny stuff at the upper end. So maybe a \(t\)-test would have been all right in actual fact.

  5. Run Mood’s median test on these data. What do you conclude?

Having run library(smmr) somewhere:

median_test(neuropathy, pain, group)
$grand_median
[1] 0.1635

$table
         above
group     above below
  control    13    17
  treat      16    12

$test
       what     value
1 statistic 1.1047619
2        df 1.0000000
3   P-value 0.2932234

With a (two-sided) P-value of 0.293, there is no evidence of any difference in median pain scores between the treatment and control groups. That is, there is no evidence of any treatment effect on the pain score (positive or negative).

I didn’t say anything about one-sided or two-sided, which means that you can make your own decision about that. You can take the attitude that nowhere here does it say anything about looking for evidence in a particular direction, which favours a two-sided test. Or you can say that a higher score indicates better pain relief, and a new treatment is supposed to be better (as with the kids learning to read from lecture), so we should do a one-sided test. As ever, have a reason for doing what you did.

To do this test one-sided, there is a bit more work to do:

  • are we on the correct side? Yes, because a small majority of the control patients were below the grand median and a small majority of the treatment patients were above, as you would expect from a treatment that is better than control (if only a little). This comes from the table of aboves and belows in the median_test output.
  • we can therefore halve the two-sided P-value to get a one-sided one:
0.2932 / 2
[1] 0.1466

This is still not small enough to be significant, so there is not strong enough evidence that the new treatment offers better pain relief than the standard one (control).

Extra 1: if you look at the table of aboves and belows, you see that they are close to 50-50 above and below the grand median, certainly not as unbalanced as you would expect to see if there really was a treatment effect. (A Welch two-sample \(t\)-test has a smaller but still non-significant P-value, and is not significant one-sided either.)
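In case you want to check that last claim for yourself, this is how the Welch test would be run (t.test with a formula does the Welch version by default; output not shown):

t.test(pain ~ group, data = neuropathy)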

Extra 2: in my source for these data, it was believed that only some of the patients would respond positively to the treatment. This would be consistent with not very compelling evidence for a treatment effect (for the treatment group as a whole), and also with the treatment group having a larger spread. According to the boxplot, it does seem to be true that some of the patients in the treatment group had a better pain score than those in the control group, but not by any means everybody.

This would be a tough thing to test, because we don’t know which patients in the treatment group are the ones that responded positively to the treatment. We might guess that they are the ones with the highest pain scores (best pain relief), but there is no certainty of this, because a patient might happen to do very well with the new treatment just by chance. A second issue here is that if the new treatment works for some patients, how do you tell whether a particular patient is going to respond well to it before they start on it?

Footnotes

  1. If we had 40 independent miles-per-gallon measurements, there would be no reason to have the Fill-up column, because there would be no basis to link a Computer value with a particular Driver one.↩︎

  2. Actually, it works, but not the way we wanted it to: there is no error here, just not the kind of data we wanted.↩︎

  3. It may look like two characters, but the backslash is called an “escape character” and doesn’t count.↩︎

  4. I seem to be attributing sentience to the Computer here. Maybe recorded_by would have been a better name.↩︎

  5. To be precise linguistically, there are “differences between” two things, as in matched pairs, and “differences among” three or more things, as in ANOVA.↩︎

  6. This is very similar to the thing with the P-values in Extra 1; the sign test doesn’t use the data as efficiently, so its interval is a little wider.↩︎

  7. This is the line \(y = x\), which has intercept 0 and slope 1.↩︎

  8. Studies of this kind are usually designed to have the same number of patients in each group. The patients in the study will have had to sign a consent form that explains any risks involved in taking part in the study, along with a statement that they can withdraw from the study at any time. What can happen is that the groups start off the same size, but a few people drop out, and so the group sizes end up slightly different from each other. My guess is that this is what happened here.↩︎

  9. The log of a number less than 1 is negative, which means that a pain value that is negative comes from a baseline score that is less than the final score. The idea is supposed to be less pain after the treatment than before, so a negative pain value indicates that the (standard or new) treatment is actually making the pain worse! Fortunately, there are not very many negative pain values.↩︎