STAC32 Assignment 6

Packages

library(tidyverse)
library(PMCMRplus)

Skin cleansing

An experiment was designed to compare three different cleansing agents. Forty-five subjects with similar skin conditions were randomly assigned to three groups of 15 subjects each. A patch of skin on each individual was exposed to a contaminant and then cleansed with one of the cleansing agents. After 8 hours, the residual contaminants were rated on a (whole-number) scale of 0 to 10, with a lower rating being better. The ratings are in column clean. The data are in http://ritsokiguess.site/datafiles/Clean.csv.

  1. (1 point) Read in and display (some of) the data.

As ever:

my_url <- "http://ritsokiguess.site/datafiles/Clean.csv"
cleansing <- read_csv(my_url)
Rows: 45 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): agent
dbl (1): clean

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cleansing

The columns are called agent and clean, which makes it challenging to find a good name for the dataframe. I went with cleansing; you might also choose something like skin.

Extra: These are the Clean data from the BSDA package, except for this:

library(BSDA)
Loading required package: lattice

Attaching package: 'BSDA'
The following object is masked from 'package:datasets':

    Orange
Clean

The data description says that there are 45 observations altogether (15 in each group), but the dataframe as it comes has \(3 \times 45 = 135\) rows. This is why:

summary(Clean)
    agent               clean      
 Length:135         Min.   :2.000  
 Class :character   1st Qu.:4.000  
 Mode  :character   Median :6.000  
                    Mean   :5.333  
                    3rd Qu.:7.000  
                    Max.   :9.000  
                    NA's   :90     

There are 90 missing values! So I got rid of those, then saved the data for you:

Clean %>% drop_na() -> clean0
clean0
summary(clean0)
    agent               clean      
 Length:45          Min.   :2.000  
 Class :character   1st Qu.:4.000  
 Mode  :character   Median :6.000  
                    Mean   :5.333  
                    3rd Qu.:7.000  
                    Max.   :9.000  

45 rows, and no missing values any more.

  2. (2 points) Draw an appropriate normal quantile plot of these data.

What is of interest is normality of data within each cleansing agent, so facet the normal quantile plot by agent:

ggplot(cleansing, aes(sample = clean)) + stat_qq() +
  stat_qq_line() + facet_wrap(~ agent)

This plot is best done without scales = "free", so that you can compare the typical values and spreads later.

  3. (3 points) Use your normal quantile plot to argue in favour of using regular ANOVA, as run by aov, to compare the cleansing agents.

There are actually three things you need to argue:

  • sufficient normality in each group
  • sample sizes that are large enough in each group
  • spreads that are close to equal across groups. (Remember that this assumption is needed for the aov ANOVA but not for the Welch ANOVA.)

To take those three points in order:

  • The points are already close to the line within each group (so the values within each group are close to normal). You might mention the groups of horizontal points on the plots, which are because some of the ratings are equal (they are all whole numbers less than 10, so some of them are bound to be repeats).
  • The sample sizes are 15 within each group, which will be large enough to take care of any (small) non-normality you see.
  • On this plot, the lines are not far from having equal slopes, so the spreads are close to equal.

Remember you are arguing for the use of aov, so you need (sufficient) normality and equal spreads, and therefore you need to consider all of these. You could argue that the values within each group are already close to normal and therefore that you don’t need to consider sample size, but you need to say that, or else the grader has no idea about your thought process.
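
Extra: if you want to confirm that there really are 15 observations in each group, rather than taking that from the description of the data, a quick count will do (a sketch, using my dataframe name cleansing):

# how many observations of each cleansing agent? (should be 15 each)
cleansing %>% count(agent)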

If you forgot about the line-slope thing, draw a boxplot to assess the spreads:

ggplot(cleansing, aes(x = agent, y = clean)) + geom_boxplot()

and (remembering what you are arguing for) say that these boxes are of about equal height.

  4. (3 points) Run a suitable analysis of variance, including any followup if necessary. What do you conclude, in the context of the data?

First, the aov:

cleansing.1 <- aov(clean ~ agent, data = cleansing)
summary(cleansing.1)
            Df Sum Sq Mean Sq F value   Pr(>F)    
agent        2 108.13   54.07   47.44 1.68e-11 ***
Residuals   42  47.87    1.14                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The P-value of \(1.7 \times 10^{-11}\) is extremely small, much less than 0.05, so the agents are not all equal in terms of removing contaminants (or, there are differences among the agents, or at least one agent is different from the others).

To find out how the agents differ, Tukey:

TukeyHSD(cleansing.1)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = clean ~ agent, data = cleansing)

$agent
         diff       lwr        upr     p adj
B-A  3.733333  2.786274  4.6803925 0.0000000
C-A  2.466667  1.519608  3.4137258 0.0000004
C-B -1.266667 -2.213726 -0.3196075 0.0063206

All three of the agents differ from each other in terms of removing contaminants, because all three P-values are less than 0.05 (two of them very much less).

Optionally (we will be coming back to this later), Agent A is better than the other two agents because its mean rating is significantly less than those of Agents B and C. Or, look back at your normal quantile plot and see that the points for Agent A are further down the graph than those for the other agents. Or, again, draw a boxplot and take a look at that.

  5. (3 points) Repeat your analysis of the previous question, but this time using Welch ANOVA.

The corresponding two steps: the ANOVA thing, and the followup if necessary:

oneway.test(clean ~ agent, data = cleansing)

    One-way analysis of means (not assuming equal variances)

data:  clean and agent
F = 46.061, num df = 2.000, denom df = 27.949, p-value = 1.423e-09

This is once again strongly significant. Hence, run Games-Howell (making sure that you have installed PMCMRplus using install.packages and loaded it with library, which I did earlier).
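
If you have not installed the package yet, the installing and loading go something like this (run the install line just once, in the Console, not in your document):

install.packages("PMCMRplus")   # once only, in the Console
library(PMCMRplus)              # in your document, before using gamesHowellTest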

One of these two variations should work for you:

gamesHowellTest(clean ~ agent, data = cleansing)
Error in gamesHowellTest.default(c(2, 4, 3, 3, 2, 4, 5, 3, 2, 4, 3, 2, : all group levels must be finite

The first way might work for you, but if you get an error like the one I got, you'll need to turn agent explicitly into a factor, like this:

gamesHowellTest(clean ~ factor(agent), data = cleansing)

    Pairwise comparisons using Games-Howell test
data: clean by factor(agent)
  A       B     
B 1.1e-09 -     
C 1.3e-06 0.0093

P value adjustment method: none
alternative hypothesis: two.sided

and once again, all three agents differ significantly in terms of removing contaminants.

  6. (2 points) Compare the regular and Welch ANOVA procedures in terms of conclusions and in terms of P-values. Do the two procedures give similar or different results?

The conclusions are identical: in both cases, the ANOVA is (strongly) significant, and all three agents differ significantly (in terms of removing contaminants).

To compare the P-values, it is perhaps clearest to make a table showing the overall P-value and the P-values for comparing each pair of agents, for each ANOVA. To do this, the tips here are helpful (cite your source). This was written as a “Markdown pipe table”:

| Comparison | aov                      | Welch                    |
|:-----------|:-------------------------|:-------------------------|
| ANOVA      | \(1.7 \times 10^{-11}\)  | \(1.4 \times 10^{-9}\)   |
| A vs B     | < \(1 \times 10^{-7}\)   | \(1.1 \times 10^{-9}\)   |
| A vs C     | \(4 \times 10^{-7}\)     | \(1.3 \times 10^{-6}\)   |
| B vs C     | \(6.3 \times 10^{-3}\)   | \(9.3 \times 10^{-3}\)   |

I converted the Tukey P-values to scientific notation for easier comparison. All of the P-values are very small; they are similar in size for the two analyses, and the two analyses agree in terms of which comparison is most strongly significant (A vs B) and which is least strongly significant (B vs C).

To make the scientific notation, I used text like this: $1.7 \times 10^{-11}$, which I think looks nicest when rendered. It is fine to use R’s “scientific notation”, 1.7e-11, or to convert all the P-values to actual numbers (making sure to count the zeros), as in 0.000000000017, though when you start doing this, you will become rapidly aware of why scientific notation exists. (You need one fewer zero than the (negative) power of 10.)
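
Extra: if you would rather have R extract and format the P-values than read them off the output, this kind of thing is one way (a sketch; the broom package is installed along with the tidyverse, but needs its own library call):

library(broom)
# one row per pairwise comparison, with the adjusted P-value in adj.p.value
tidy(TukeyHSD(cleansing.1))
# formatC turns a number into scientific-notation text, for example "1.7e-11":
formatC(1.68e-11, format = "e", digits = 1)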

  7. (2 points) Which cleansing agent would you recommend for removing contaminants, and why?

A lower rating is better, and agent A has the lowest mean (work it out, as in the sketch below, look at your normal quantile plot or boxplot if you drew one, or look back at the Tukey output).
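
One way to work the means out (a sketch, using my dataframe name cleansing; mean_rating is just a name I chose):

cleansing %>% 
  group_by(agent) %>% 
  summarize(mean_rating = mean(clean))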

The second thing you need to say is that agent A has a significantly lower mean than the other agents. This is the reason for doing the ANOVA + followup: we want to know whether the agent that is best in this study is reproducibly best.1 If any differences here had been just chance, we would have been completely unable to make a recommendation, because the next study could have ranked the agents in a completely different order.

Sunspots

Sunspots are temporary spots on the Sun’s surface that are darker than the surrounding area. They have been observed since the 18th century. The dataset in http://ritsokiguess.site/datafiles/datasets_sunspots.csv contains sunspot counts for every month between 1749 and 2013. (The data are monthly means, averaging the sunspot counts over the days within that month.)

(for later) Their number varies according to the approximately 11-year solar cycle.

  1. (2 points) Read in and display (some of) the data. Comment briefly on the layout of the data.

The first part is as you’d expect:

my_url <- "http://ritsokiguess.site/datafiles/datasets_sunspots.csv"
sunspots <- read_csv(my_url)
Rows: 265 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (13): year, Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sunspots

For the second part, the data are actually in “wide” format, with each month in a separate column and the numbers of sunspots as the values within each of those columns. These data are not tidy, and we will have to reorganize them in order to do something with them.

If you say “not tidy” or “wide”, make sure you say how you know, in such a way that you demonstrate your knowledge of what those terms mean. For example, “all the numbers in the dataframe (apart from year) are numbers of sunspots, so will need to be arranged in one column labelled by which year and month they are from”.

Extra: this is actually one of R’s built-in datasets, the one called sunspot.month. This is the way it displays, but it is actually a ts (or “time series”) object so that behind the scenes it is a collection of sunspot numbers and years and months. Time series analysis is a whole domain of statistics (that we don’t get into in this program), but we are going to be doing something simpler, namely drawing a time plot, so there is some organizing to do first.

The function ts_reshape from the TSstudio package seems to be a nice way of turning a time series into a dataframe. I actually turned the time series into a long data frame and then made it wider so that you would have something to do:

TSstudio::ts_reshape(sunspot.month, type = "long") %>% 
  mutate(date_str = str_c(year, "-", month, "-1")) %>% 
  mutate(date = as.Date(date_str)) %>% 
  mutate(month_name = month(date, label = TRUE)) %>% 
  select(year, month_name, sunspots = value) %>% 
  pivot_wider(names_from = month_name, values_from = sunspots) -> sunspots_wide
sunspots_wide

and that is your dataset.

  2. (3 points) Shortly, we are going to be making a time plot of these data. Rearrange the data into a format that will help us do that. Briefly discuss your thought process.

I hope by now you have realized that ggplot likes “long” data rather than the wide data we have here. If not, you might be able to see that the month columns all contain a number of sunspots (that is, the same thing measured at different times), which is untidy; to make the data tidy, we should get all the numbers of sunspots into one column. Either way, pivot_longer is called for:

sunspots %>% 
  pivot_longer(-year, names_to = "month", values_to = "sunspots") -> sunspots_long
sunspots_long

All those numbers in the dataframe are numbers of sunspots, so sunspots is probably a good name for your new column.

You’ll realize that it is probably a good idea to save this dataframe, because we are shortly going to be using it to make a graph, and you’ll need to display it somewhere, so that your reader can see what it has done. For the first input to pivot_longer, anything that correctly selects all the columns except for year is fine, so for example Jan:Dec will also work.
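
For the record, the Jan:Dec version would look like this (a sketch; it should give the same result as the pivot_longer above):

sunspots %>% 
  pivot_longer(Jan:Dec, names_to = "month", values_to = "sunspots")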

By “briefly discuss your thought process”, I mean something that justifies the code you ran, or that says what you were thinking as you put the code together. I want to know that you understand what you are doing, not just copying a template from somewhere without understanding what it is doing.

  3. (2 points) In order to make a time plot, we need a column of actual dates. Make such a column. This requires two things:
  • first, make a piece of text that is the year, month, and a 1 glued together in that order, separated by dashes (- characters). For this, use the function str_c, which accepts any number of inputs (columns or text) and outputs the inputs glued together as a piece of text.
  • second, turn that piece of text into an actual R date. For this, use the function ymd, which accepts a piece of text containing something that looks like a year, month, and day in that order, and outputs an R date corresponding to that year, month, and day.

Follow the instructions, creating a new column each time. The final one can be called date (this is going on our graph in a minute), so you’ll need a temporary name for date-as-text, for example like this:

sunspots_long %>% 
  mutate(date_text = str_c(year, "-", month, "-", 1)) %>%
  mutate(date = ymd(date_text)) -> d
d

It’s a good idea to save this one as well. You can either save it back into sunspots_long, or give it a new name. I’m using my “temporary dataframe name” d because all I’m using this dataframe for is to make the graph, and then I won’t need it again.

The reason for the 1 is to pretend that these observations were made on the first day of the month, for the purpose of making dates. At the scale we are working at, you could use any valid day of the month here.

Note, in your display of your dataframe, date_text is chr (text), but date is a special R format called date. Even though it displays like a piece of text, it has the advantage that you can use a date as the time axis on a graph and it will display properly.
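
If you want to convince yourself, checking the classes of the two columns is one way (a sketch; d is my dataframe name from above):

class(d$date_text)   # "character": plain text
class(d$date)        # "Date": an actual R date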

  4. (2 points) Make a time plot of the numbers of sunspots, joining the points by lines.

A time plot has the time variable (here date, an actual date) on the \(x\) axis and your variable of interest (that I called sunspots) on the \(y\)-axis. It is otherwise a scatterplot, except by tradition you join each time point to the next one, which is what geom_line does:

ggplot(d, aes(x = date, y = sunspots)) + geom_point() + geom_line()
Warning: Removed 3 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 3 rows containing missing values or values outside the scale range
(`geom_line()`).

This is only two points because you ought to be able to do this yourself. The only new thing is using a date on the \(x\)-axis; what ggplot does is to format the time scale with the appropriate time unit. Here, that is years (we have monthly observations over a large number of years). If you had, say, daily observations but only for one year, you would expect the \(x\)-axis to be labelled with months.

  5. (2 points) According to Wikipedia, sunspots run on an approximately 11-year cycle. Is there any evidence of that here?

Look for where the graph spikes upwards (to what calculus people would call a local maximum). There seems to be a lot of regularity about where these are; the gap between them always seems to be about the same. How far apart? See if you can judge where a typical pair of neighbouring maxima are on the time scale.

For example, there is a maximum just after 1900, then four more before 1950, and one just after. It seems that there are about 5 time periods in 50 years, so the cycle is about 10 years (which is close to 11). Some of the 50-year periods have four cycles and some have 5, which would (correctly) point to the cycle being between \(50/5 =10\) and \(50/4 = 12.5\) years.

If you can do it, you could also pick a pair of neighbouring maxima and try to judge which years they are, but the scale of the graph makes that difficult to judge. If you try to go this way, you will need to be convincing about how you are reading the graph. One way is to “zoom in” on a part of the \(x\)-axis.

I can think of two ways to zoom in on a scale:

  • make the same graph, but only of the observations for certain years.
  • tweak the \(x\)-scale of the graph.

We have seen the first of these before: it is a filter, but one requiring a little care. I'll use 1900 through 1950, but any range of dates that shows a number of peaks (maxima) and troughs (minima) that is not too big and not too small will do:

d %>% 
  filter(date_text > "1900", date_text <= "1950") %>%
  ggplot(aes(x = date, y = sunspots)) + geom_point() + geom_line()

Dates can only be compared with dates, but text can be compared with other text (alphabetically). If you check, there are now \(50 \times 12 = 600\) rows, as there should be.
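
One way to do that check (a sketch, repeating the filter and counting rows):

d %>% 
  filter(date_text > "1900", date_text <= "1950") %>% 
  nrow()   # should be 50 * 12 = 600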

The peaks (or the troughs) are a bit more indistinct now, but at least you can read on the scale where they are. The peaks are around 1906, 1918, 1928 (hard to see), 1939, 1948 which are not far from 11 years apart. If you find it easier, see if you can judge the troughs: 1902, 1913, 1923, 1934, 1944 or something like that, likewise not far from 11 years apart.

The other way to go here (that you will need to find out about: cite your (reliable) source, as ever) is to use the whole of d, but to limit the \(x\)-scale. This can be done with xlim, to which you give the two limits as a vector. There is some trickery here because the \(x\)-scale is dates, so you have to give xlim two dates, which you can make using ymd as you did before:

ggplot(d, aes(x = date, y = sunspots)) + geom_point() + geom_line() +
  xlim(c(ymd("1900-01-01"), ymd("1950-01-01"))) 
Warning: Removed 2579 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 2579 rows containing missing values or values outside the scale range
(`geom_line()`).

which gives you more or less the same graph as above, from which you again see whether the peaks or troughs are about 11 years apart. The warnings are telling you that all the observations outside these years were removed (of course).

Points: \((1+2+3+3+3+2+2) + (2+3+2+2+2) = 16+11 = 27\).

Footnotes

  1. Going back to Fisher’s words: “a properly designed study rarely fails to yield this level of significance”. (Italics in original.)↩︎