library(tidyverse)STAC32 Assignment 3
Packages
You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.
If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)
You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.
Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.
Organic matter in soil
A random sample of soil specimens was obtained, and the percent of organic matter in the soil was determined for each specimen. The data are in http://ritsokiguess.site/datafiles/soil.csv, with the column percorg containing the percent of organic matter in each soil specimen.
- (1 point) Read in and display (some of) the data.
A .csv file, so read_csv:
my_url <- "http://ritsokiguess.site/datafiles/soil.csv"
soil <- read_csv(my_url)Rows: 30 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): percorg
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
soilGive your dataframe a name that says something about what is in it (that is not the same name as the column). Something like soil_samples will also do.
Source: ex08.56 from the Devore7 package.
- (2 points) Make a suitable graph of the data. Justify any choices you make.
One quantitative variable, so a histogram is the obvious choice. The number of bins should be at least 4 and not more than about 7, to show the shape clearly:
ggplot(soil, aes(x = percorg)) + geom_histogram(bins = 5)A histogram is a good general-purpose graph. You might have read ahead and come to the conclusion that normality is going to be key, and on those grounds (which you must state), you might conclude that a normal quantile plot is the thing:
ggplot(soil, aes(sample = percorg)) + stat_qq() + stat_qq_line()- (2 points) The soil scientists are considering running a \(t\)-test on these data. What are two reasons that this will be a sensible idea?
The two considerations are normality and sample size:
from the histogram, the distribution appears to be somewhat non-normal: it is bimodal (two peaks), but there are no outliers. Make sure you say what makes you think it’s not normal.
the sample size is 30 (the number of rows in the data you read in), which is large in Central Limit Theorem terms, so the sample size will offer a lot of help in overcoming that non-normality. Remember, there is nothing magic about \(n = 30\); the larger the sample size is, the better a \(t\)-test will be, but whether it will be satisfactory depends on how non-normal the data are as well as the sample size.
If you drew a normal quantile plot, the impression you got from there is that the distribution is short-tailed,1 and then you can either go via the “somewhat non-normal” route, or you can say that the short tails won’t have too big an effect on the mean, and therefore the sample size is easily big enough.
Extra: later, we learn about the bootstrap sampling distribution of the sample mean: if that is normal, the sample size is big enough to overcome any non-normality:
tibble(sim = 1:10000) %>%
rowwise() %>%
mutate(my_sample = list(sample(soil$percorg, replace = TRUE))) %>%
mutate(my_mean = mean(my_sample)) %>%
ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()and that is very much normal, so the sample size is large enough to overcome the non-normality we had.
- (3 points) The soil scientists know that historically the soil in the area from which the specimens were taken has had 3.5 percent organic matter. They are concerned that the organic matter content might have decreased from historical levels. Carry out a suitable \(t\)-test to address their concern. State your conclusion clearly in the context of the data.
The soil scientists need to have evidence for their concern (if, for example, they are going to write a paper about it). Let \(\mu\) be the (current) mean percent of organic matter in the soil (in the area); then the alternative hypothesis will be \(H_a: \mu < 3.5\), and therefore the null will be \(H_0: \mu = 3.5\) (or \(H_0: \mu \ge 3.5\) if you prefer, but there must be an equals sign in the null somewhere). This is a one-sided (lower-tail) alternative, so your code needs to be
with(soil, t.test(percorg, mu = 3.5, alternative = "less"))
One Sample t-test
data: percorg
t = -3.4534, df = 29, p-value = 0.0008613
alternative hypothesis: true mean is less than 3.5
95 percent confidence interval:
-Inf 2.982532
sample estimates:
mean of x
2.481333
(or the equivalent without with but including soil$percorg). The P-value is 0.00086, much less than 0.05, so we reject the null hypothesis in favour of the alternative, and hence conclude that the mean organic matter content of the soil is now less than 3.5 percent. (That is to say, we have evidence that it has decreased since the historical value.)
Ingredients of the best answer:
- state the null and alternative hypotheses, defining \(\mu\) if you use it.
- put the correct
alternativein yourt.testcode (leaving it out does a two-sided test, which is not what we want here) - state the P-value from your output
- compare it with 0.05 (or your \(\alpha\), but there doesn’t seem to be a reason for using a different one here)
- make a decision about rejecting your null hypothesis
- make a conclusion about mean organic material content of soil.
In your words, distinguish your decision about the null hypothesis from your conclusion about organic material in soil, perhaps by using the word “reject” in your statement of your decision, and “conclude” in your statement of your conclusion, so that it is clear which is which. Stating the P-value allows your reader to make their own decision if their chosen \(\alpha\) differs from yours (in this case, your reader would still reject even if their \(\alpha\) was 0.001, but if you don’t say what the P-value is, they wouldn’t know this).
Bike commute
Dr. Jeremy Groves, a British doctor, often uses a bicycle for his 27-mile round-trip commute to work. He bought an expensive, lightweight, carbon bike but also had an older, heavier, steel bike, so he decided to do an experiment. On each day he biked to work he flipped a coin to determine which bike he would ride. He used a bicycle computer to accurately record the commute time each day as well as his maximum and average speed for the day. His data for 56 days are in http://ritsokiguess.site/datafiles/bikecommute.csv. The type of bicycle (carbon or steel) is in the Bike variable and his time (in minutes) is stored in Minutes. The dataset also contains some other variables.
- (1 point) Read in and display (some of) the data.
library(tidyverse)The usual thing:
my_url <- "http://ritsokiguess.site/datafiles/bikecommute.csv"
commute <- read_csv(my_url)Rows: 56 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Bike, Date, Month
dbl (5): Distance, Minutes, AvgSpeed, TopSpeed, Seconds
time (1): Time
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
commuteGive the dataframe a sensible name, not bike since that is the name of one of the columns, and you will get confused. You might get away with a lowercase bike at the cost of some confusion, but Bike will not work because it is then unclear whether you mean the whole dataframe, or the one column in it with that name.
Note that the two variables listed in the question are there. The other variables, should you be interested, are:
- the
Dateof the ride, in day-month-year (British format!), as text. Note that this was not read in as a date (it is text as read in), because it is not in YYYY-MM-DD format. Distanceof the ride, in milesTimeto complete the ride, in hours, minutes, and seconds. R has an internal way of storing times that we learn about later.Minutes: time to complete the ride, expressed as a fractional number of minutesAvgSpeed: average speed in miles per hourTopSpeed: maximum speed, dittoSeconds: time to complete the ride in a whole number of secondsMonth: month in which the ride was done, as a number followed by the name of the month.
Source: data BikeCommute from Lock5Data package.
- (2 points) Make a suitable graph of the two variables of interest. What do you conclude from your graph?
Minutes is quantitative and Bike is categorical, so a boxplot:
ggplot(commute, aes(x = Bike, y = Minutes)) + geom_boxplot()The median time taken to complete the ride is almost the same regardless of the bike chosen. This is the major conclusion, so make sure you say it. If you like, comment on the variability (the variability in Minutes is a bit larger for the Carbon bike) and shape (the distribution of Minutes for the Carbon bike is skewed to the right, while the distribution for the Steel bike is more or less symmetric).
Extra: the comparison of medians is rather a surprise given that the steel bike is older and heavier, and so you would expect rides on that bike to be slower and thus to take longer.
- (3 points) Carry out a suitable \(t\)-test. What do you conclude, in the context of the data? (What would be a suitable alternative hypothesis in this case?)
We would expect, without looking at any data, that a more expensive and lighter-weight bike would be easier to ride quickly, and so Minutes would if anything be less for the Carbon bike. So make this our alternative. The Carbon bike is first alphabetically, so our alternative is therefore "less". I think, from looking at the boxplot, that the spreads are definitely different, so a Welch test is called for:
t.test(Minutes ~ Bike, data = commute, alternative = "less")
Welch Two Sample t-test
data: Minutes by Bike
t = 0.36556, df = 46.962, p-value = 0.6418
alternative hypothesis: true difference in means between group Carbon and group Steel is less than 0
95 percent confidence interval:
-Inf 3.091164
sample estimates:
mean in group Carbon mean in group Steel
108.3423 107.7893
The P-value of 0.64 is not less than 0.05, and so we do not reject the null hypothesis, and conclude that the mean number of Minutes was the same for both types of Bike. (Or, there is “no evidence of any difference in mean Minutes for the two types of bike.”)
In the light of our thinking about the alternative hypothesis, this is a surprise (we expected the rides on the more expensive lighter bike to be quicker), but it is entirely consistent with the boxplot.
- (3 points) Obtain and suitably state a 90% confidence interval for the difference in mean
Minutesbetween the two bikes.
A confidence interval is two-sided, so take out the alternative, and it is a non-default level, so add a conf.level:
t.test(Minutes ~ Bike, data = commute, conf.level = 0.90)
Welch Two Sample t-test
data: Minutes by Bike
t = 0.36556, df = 46.962, p-value = 0.7163
alternative hypothesis: true difference in means between group Carbon and group Steel is not equal to 0
90 percent confidence interval:
-1.985216 3.091164
sample estimates:
mean in group Carbon mean in group Steel
108.3423 107.7893
With 90% confidence, the difference between the mean time riding the Carbon bike and the Steel bike is from \(-1.98\) to 3.09 minutes.
The values of Minutes are given in the data to two decimals (they are probably measured to the nearest second), so you can justifiably give a third decimal, but not more. 1 decimal place is also acceptable (on the grounds that this much accuracy is all your reader cares about for this kind of thing).
Extra:
Interpretation (that I didn’t ask for, so you don’t need to supply): the interval goes from about \(-2\) to 3, so the mean difference in Minutes is somewhere between 2 minutes quicker on the Carbon bike to 3 minutes quicker on the Steel bike.
The difference could go either way, which is entirely consistent with the non-significant difference.
- (2 points) The doctor rode several different routes of varying lengths, which are in the column
Distance. Make a suitable graph of this variable along with the two variables of interest. Does this explain the significance or lack of significance of your test? Explain briefly.
We now have three variables:
Minutes, quantitative, responseBike, categorical, explanatoryDistance, quantitative, explanatory.
With two quantitative variables, we should draw a scatterplot with the points labelled by colour; Minutes is the response, so it should go on the \(y\)-axis:
ggplot(commute, aes(x = Distance, y = Minutes, colour = Bike)) + geom_point()This shows that almost all of the shorter rides were on the Steel bike, and almost all of the longer ones were on the Carbon bike. (There are some outliers on the left, but that is the pattern.) The shorter rides would be expected to take less time (because they are shorter), and so the comparison of Minutes between the two bikes will actually be biased in favour of the Steel bike (that is to say, the Carbon bike is actually better than it looks).
Extra: One of the other columns in the dataframe is Dr Groves’W average speed, which might give a better comparison between the bikes:
ggplot(commute, aes(x = Bike, y = AvgSpeed)) + geom_boxplot()and now you see that the average speed is indeed a bit higher on the expensive bike, which is a better way of comparing the rides by allowing for how long they are. Is that difference significant? We can justify a one-sided alternative again, but now the other way around because a higher speed is better. Once again, the spreads seem different, so Welch again:
t.test(AvgSpeed ~ Bike, data = commute, alternative = "greater")
Welch Two Sample t-test
data: AvgSpeed by Bike
t = 0.76982, df = 47.649, p-value = 0.2226
alternative hypothesis: true difference in means between group Carbon and group Steel is greater than 0
95 percent confidence interval:
-0.1795752 Inf
sample estimates:
mean in group Carbon mean in group Steel
15.18731 15.03500
This difference is still not significant (the means are not very different). How wide is a confidence interval for the difference in means?
t.test(AvgSpeed ~ Bike, data = commute)
Welch Two Sample t-test
data: AvgSpeed by Bike
t = 0.76982, df = 47.649, p-value = 0.4452
alternative hypothesis: true difference in means between group Carbon and group Steel is not equal to 0
95 percent confidence interval:
-0.2455673 0.5501827
sample estimates:
mean in group Carbon mean in group Steel
15.18731 15.03500
The confidence interval says that the difference in mean speeds is between 0.25 mph in favour of the steel bike and 0.5 mph in favour of the carbon bike. This seems to be pretty accurately estimated, so we would expect to have enough power to detect a reasonably small difference if there were one.
Footnotes
This kind of normal quantile plot can actually come from short tails or from bimodality, and it is difficult to tell the difference.↩︎