STAC32 Assignment 2

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

Baseball cards

In the game of baseball, players can be classified as “pitchers” (who only throw the ball) and “fielders” (who both field and bat). A baseball fan has 59 cards that contain information about baseball players, specifically:

height: Height in inches
weight: Weight in pounds
bat: L, R, or S (respectively, bats left-handed, right-handed, switch-hitter. The last means that the batter bats left-handed or right-handed according to whether the pitcher they are facing throws left-handed or right-handed.)
throw: L or R (throws left-handed or right-handed; applies whether the player is a pitcher or a fielder).
field: 0 or 1 (respectively, a pitcher or a fielder)
average: Earned run average if the player is a pitcher (lower is better) and batting average if the player is a fielder (higher is better).

The data are in http://ritsokiguess.site/datafiles/Rfit_baseball.csv.

(2 points) Read in and display (some of) the data.

Two points is very generous for this now that you know how to do it:

my_url <- 'http://ritsokiguess.site/datafiles/Rfit_baseball.csv'
baseball <- read_csv(my_url)

Rows: 59 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): bat, throw
dbl (4): height, weight, field, average

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

baseball

(2 points) What are the mean and SD of weights of all the players taken together?

summarize, which you can also spell summarise if you prefer:

baseball %>% 
  summarize(mean_wt = mean(weight), 
            sd_wt = sd(weight))

(3 points) Which are heavier on average, pitchers or fielders? To find out, find the mean weight of each.

baseball %>% 
  group_by(field) %>% 
  summarize(mean_wt = mean(weight))

The pitchers are heavier on average. (Finish with a statement that answers the question, since your reader won’t want to look up what 0 and 1 mean in this context. This is work that you should be doing for your reader.)

Extra: how convincingly heavier on average are they?

ggplot(baseball, aes(x = field, y = weight)) + geom_boxplot()

Warning: Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?

Oops: field is a yes-no variable that has been coded as a number, so ggplot thinks it is a number. But we know how to fix that (or can read the warning message):

ggplot(baseball, aes(x = field, y = weight, group = field)) + 
  geom_boxplot()

This seems to me like a convincing difference (you are welcome to disagree) since the variability is relatively small, apart from that upper outlier among the fielders.

I found out that most of the 10 heaviest baseball players of all time appear to have been pitchers. This might be because the pitcher doesn’t really need to do anything but pitch (which requires a strong arm); there are other fielders around to run and field the ball. The others seem to have been renowed for their big hitting rather than their fielding.

(3 points) Work out the median and inter-quartile range of the variable average for the pitchers and fielders. Why are the median values so different?

baseball %>% 
  group_by(field) %>% 
  summarize(median_avg = median(average), iqr_avg = IQR(average))

The medians are very different because they are averages of different things! The fielders’ average is their batting average, a number like 0.3, and the pitchers’ average is their earned run average, a number like 3 (both bigger and more variable).

(3 points) Calculate the mean batting averages of fielders who bat left-handed, right-handed, and who switch-hit, along with the number of players in each category. (Hint: is the relevant variable bat or throw?)

Careful: you need to exclude the pitchers (or include only the fielders) first. The relevant variable is bat, since what we care about is whether they bat left- or right-handed; they might throw with the other hand, but that is not relevant here.

baseball %>% 
  filter(field == 1) %>% 
  group_by(bat) %>% 
  summarize(n = n(), mean_avg = mean(average))

I didn’t need any explanation because the goal of this question was for you to calculate the right thing. Batting averages don’t vary very much, and once you separate out the fielders by batting-handedness, there are not many of each, so it is not clear that these small differences are meaningful.

Extra 1: we’ll see later that you might run a hypothesis test to see whether these differences in mean are significant (under the assumption that the players the baseball fan had cards for are a random sample of “all possible players” or something like that). But for now, a graph might offer some insight. Use the first two lines of your previous code, and then pipe the result into a boxplot:

baseball %>% 
  filter(field == 1) %>% 
  ggplot(aes(x = bat, y = average)) + geom_boxplot()

Apart from that low outlier, the left-handed batters seem to have a higher batting average than the other fielders. The switch-hitters have the lowest median, but the right-handers have a lot of variability, so it is not clear whether the difference between right-handers and switch-hitters is meaningful or just chance.

Extra 2: the same but for pitchers (the key variable is now throw):

baseball %>% 
  filter(field == 0) %>% 
  group_by(throw) %>% 
  summarize(n = n(), mean_avg = mean(average), sd_avg = sd(average))

the left-handed pitchers have a slightly higher mean ERA, but there is very little difference:

baseball %>% 
  filter(field == 0) %>%
  ggplot(aes(x = throw, y = average)) + geom_boxplot()

evidently a very tiny difference, and the medians might even be the other way around.

Acidity of lakes

An acidity index was measured in a sample of 155 lakes in the Northeastern United States. Following standard procedures, the data are expressed as \(\log(ANC+50)\), where ANC is the acidity neutralising capacity value. See https://en.wikipedia.org/wiki/Acid_neutralizing_capacity for more details. The data are in http://ritsokiguess.site/datafiles/acidity.csv, with one column called lg_anc.

(1 point) Read in and display (some of) the data.

To no-one’s great surprise:

my_url <- "http://ritsokiguess.site/datafiles/acidity.csv"
acidity <- read_csv(my_url)

Rows: 155 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): lg_anc

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

acidity

One column called lg_anc and 155 lakes (observations). Note for later that the data values are given to two decimal places.

(2 points) Draw a histogram that clearly shows the shape of the distribution.

The wording of the question suggests that there is something interesting about the shape, and the flexibility you have with a histogram is the number of bins you choose.

Extra: Sturges’ rule gives a number of bins that is often good as a first approximation. What you do is to find the next power of 2 at least as high as the number of observations, find out what power of 2 it is, and then add 1 to that. Here, the rule says 9 bins since there are 155 observations and the next power of 2 is \(256 = 2^8\):

ggplot(acidity, aes(x = lg_anc)) + geom_histogram(bins = 9)

This is definitely not normal in shape. It sort of looks as if there is one big peak and one smaller one (the fancy word is “bimodal”, two modes), but it is not clear yet whether that bin between about 5 and 5.5 is really a gap between the two peaks or just by chance a small lg_anc that is really part of a long right tail. The thing about Sturges’ rule is that it really only works for bell-shaped distributions and gives too few bins otherwise. So let’s try more bins. How about 12?

ggplot(acidity, aes(x = lg_anc)) + geom_histogram(bins = 12)

That to me now looks like a definite gap between the right tail of the tall peak near 4 and the left tail of a flatter-topped peak between 6 and 7. There is also possibly an outlier on the left.

What about more bins?

ggplot(acidity, aes(x = lg_anc)) + geom_histogram(bins = 20)

That’s definitely too many: that irregular high frequency around 6.25 and the odd shape of the right side of the second peak seem to be just chance.

ggplot(acidity, aes(x = lg_anc)) + geom_histogram(bins = 16)

This is as many bins as you can reasonably have, with the two peaks clearly separated, but with a bit of irregularity that is probably not meaningful on the second one.

So: choose a sensible number of bins, bigger than 9 but less than 16, that clearly shows two peaks. Expect to do some trial and error, especially here since what is happening is probably not what you expected to see.

(2 points) Assuming that you are interested in the mean lg_anc, why might it still be reasonable to use a \(t\)-test for these data?

The key thing is that we have 155 observations, which is a large sample, so that the distribution of the lg_anc values themselves can be quite a long way from normal (as it is) because the Central Limit Theorem will help us a lot.

Say that we have a large sample, and how large, and say why that will help. Your answer needs to talk about these data, and not be broad generalities that might (or might not) apply to any data.

Extra 1: You might reasonably ask whether the mean is the right thing to estimate at all, when the distribution seems to consist of two subpopulations mixed together. Maybe we should be estimating the means of the two subpopulations, for example using 5.5 as the dividing line between “belongs to the left peak” and “belongs to the right peak”?:

acidity %>% 
  mutate(subpop = (lg_anc < 5.5)) %>% 
  nest_by(subpop) %>% 
  mutate(t_test = list(t.test(data$lg_anc))) %>% 
  mutate(ci = list(t_test$conf.int)) %>% 
  unnest_wider(ci, names_sep = "_") %>% 
  select(subpop, starts_with("ci"))

This divides the data into two sub-samples (the lg_ancs below 5.5 and the lg_ancs above), then works out a confidence interval for the mean for each one separately. What nest_by does is to create a list-column containing everything except subpop (that is, just lg_anc here) for the low lg_ancs and the high lg_ancs separately; then we run t.test for each one, then pull out just the 95% confidence intervals, then display just what we want to see. As ever with these long pipelines, run it one line at a time to see what it does. You can do this by putting your cursor just before a %>% and then hitting Enter to move the pipe symbol down to the next line. This temporarily ends the pipeline here, and you can see what it does up to this point. When you are happy, move the pipe symbol back to where it was (put your cursor in front of it at the beginning of the line and hit Backspace) and investigate the next line.

Extra 2: as you see a lot with problems like these, we can obtain a bootstrap sampling distribution of the sample mean, and see how normal it looks:

tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(acidity$lg_anc, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()

and, as is often the case, it is as normal as you could wish for. Assuming that the overall mean is what we care about, there is no problem at all in running a \(t\)-test to estimate it.

(3 points) Obtain a 99% confidence interval for the population mean lg_anc. State your interval appropriately.

This is a standard use of t.test to obtain a confidence interval. The only non-standard thing is that it is not a 95% CI, so you need to specify the confidence level as a proportion less than 1:

t.test(acidity$lg_anc, conf.level = 0.99)


    One Sample t-test

data:  acidity$lg_anc
t = 60.996, df = 154, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
99 percent confidence interval:
 4.886499 5.323050
sample estimates:
mean of x 
 5.104774

This much is two points; the third point is for pulling out the confidence interval and rounding it off suitably. The original data were recorded to two decimals (go back and look at the display of the data you read in), so you can give your confidence interval to anywhere between the original accuracy (2 decimals) and two more decimals than are in the data (because you have over 100 observations). Any more decimals than that are random noise. So I would say that the 99% confidence interval for the population mean lg_anc is from 4.886 to 5.323.

This is an example of thinking of your reader. Don’t make them search in the output for the result they need; find what they want and tell them what it is, to an accuracy that makes sense for them.

Extra: this also works, and gives the same answer:

with(acidity, t.test(lg_anc, conf.level = 0.99))


    One Sample t-test

data:  lg_anc
t = 60.996, df = 154, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
99 percent confidence interval:
 4.886499 5.323050
sample estimates:
mean of x 
 5.104774

For myself, I like this one better, but I have no objection to your using the dollar sign. The place where with scores is if you want to refer to two different columns within a dataframe, and then you only have to name the dataframe once.

(3 points) Test whether the population mean lg_anc is 5.32 against an alternative that it is different from this. What do you conclude, in the context of the data?

This is a two-sided test (the alternative is “different from” or “not equal”, as opposed to being “greater” or “less”). So there should be no alternative in your code, but there should be a mu which is your null mean:

t.test(acidity$lg_anc, mu = 5.32)


    One Sample t-test

data:  acidity$lg_anc
t = -2.5717, df = 154, p-value = 0.01107
alternative hypothesis: true mean is not equal to 5.32
95 percent confidence interval:
 4.939445 5.270103
sample estimates:
mean of x 
 5.104774

1.5 points for producing the above output (you can use with again if you prefer).

The P-value is 0.011, which is less than our default \(\alpha\) of 0.05 (I didn’t specify an \(\alpha\) in the question), so we reject our null hypothesis and conclude that the population mean lg_anc is not equal to 5.32. This last is clear enough because you can assume that your reader is interested enough in the acidity of lakes to know that the log of ANC is a standard measure. The second 1.5 points are 0.5 for each of:

stating the actual P-value
stating what you conclude about the null hypothesis (reject it in this case)
saying what you conclude in the context of the problem.

You should be in the habit of stating all three of those, especially while you are still learning to get the procedure right.

The reason for stating the actual P-value is that your reader might want to use a different \(\alpha\) than you did, and they will want to know whether their conclusion should be the same as yours or different. If all you say is that “the P-value is less than 0.05” and they want to use \(\alpha = 0.01\), they won’t know what to do. In this case, the P-value is less than 0.05 but greater than 0.01, so they should fail to reject, but they will only know to do that if you tell them what the P-value is.

The last thing is the usual consideration of your reader. You, as the statistician, are responsible for advising your reader what to do. In science, the usual reason for running an experiment like this in the first place is to see whether there has been a change from previous knowledge of the problem; the experimenter might have some additional insight, for example that acidity of water in lakes has been changing over time, and that value of 5.32 came from an experiment done five years ago.

Extra: you might have been confused by the fact that we rejected a mean of 5.32, but 5.32 was (just) inside our confidence interval. This seems to be inconsistent, because a value inside a confidence interval should not be rejected. But our procedures were not comparing apples with apples: our test used \(\alpha = 0.05\), which would correspond to a 95% confidence interval, not the 99% one we got. So the results of our confidence interval and our test were not comparable. If you want to make the comparison properly, do one of these:

5.32 is inside the 99% confidence interval, and our P-value is greater than 0.01, so the results are consistent.
5.32 is rejected at \(\alpha = 0.05\), so it should be outside a 95% CI:

t.test(acidity$lg_anc)


    One Sample t-test

data:  acidity$lg_anc
t = 60.996, df = 154, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 4.939445 5.270103
sample estimates:
mean of x 
 5.104774

which it is, and these results are consistent as well.

Hand in only your answers to the questions above (about baseball cards and acidity of lakes). The questions below are for extra practice; you can do them if you like, but don’t hand in your answers, since they will not be graded. My solutions to the above questions will include solutions to the ones below.

US Regional Mortality

Description These datasets record mortality rates across all ages in the USA by cause of death, sex, and rural/urban status, 2011–2013. The data were collected by the Department of Health and Human Services (HHS). The US is divided into ten HHS regions, labelled by a city (the location of the HHS regional offices), and include data from states and territories close to the regional office.

The region-wise data give estimated rates separately for each of 10 HHS regions. The location of the regional offices and their coverage area, available from https://www.hhs.gov/about/agencies/iea/regional-offices/index.html, is given below.

HHS Region 01 - Boston: Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, and Vermont
HHS Region 02 - New York: New Jersey, New York, Puerto Rico, and the Virgin Islands
HHS Region 03 - Philadelphia: Delaware, District of Columbia, Maryland, Pennsylvania, Virginia, and West Virginia
HHS Region 04 - Atlanta: Alabama, Florida, Georgia, Kentucky, Mississippi, North Carolina, South Carolina, and Tennessee
HHS Region 05 - Chicago: Illinois, Indiana, Michigan, Minnesota, Ohio, and Wisconsin
HHS Region 06 - Dallas: Arkansas, Louisiana, New Mexico, Oklahoma, and Texas
HHS Region 07 - Kansas City: Iowa, Kansas, Missouri, and Nebraska
HHS Region 08 - Denver: Colorado, Montana, North Dakota, South Dakota, Utah, and Wyoming
HHS Region 09 - San Francisco: Arizona, California, Hawaii, Nevada, American Samoa, Commonwealth of the Northern Mariana Islands, Federated States of Micronesia, Guam, Marshall Islands, and Republic of Palau
HHS Region 10 - Seattle: Alaska, Idaho, Oregon, and Washington

The data are in http://ritsokiguess.site/datafiles/USRegionalMortality.csv.

There is a lot to do here, but once you work out what to do, it should go pretty fast (except perhaps for the last part, where you may have to exercise some judgment).

(2 points) Read in and display (some of) the data.

I’m being generous again:

my_url <- "http://ritsokiguess.site/datafiles/USRegionalMortality.csv"
USRegionalMortality <- read_csv(my_url)

Rows: 400 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Region, Status, Sex, Cause
dbl (2): Rate, SE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

USRegionalMortality

(2 points) Display the columns for health region, cause of death and death rate.

USRegionalMortality %>% select(Region, Cause, Rate)

If this somehow displays all 400 rows for you, rather than the top 10 with a button to click to display the rest, you should find a way of displaying only some of the rows (eg with slice). If this had been on an exam, giving code to display all the rows would have been fine here.

(3 points) Display the columns whose names begin with S, without naming the columns.

Use the select-helper starts_with:

USRegionalMortality %>% select(starts_with("S"))

Having the S as uppercase is optional; it will work the same with a lowercase S because starts_with is not case-sensitive.

(2 points) Display the columns whose names have the letter A in them somewhere (uppercase or lowercase).

USRegionalMortality %>% select(contains("A"))

Again, the A can be uppercase or lowercase.

(3 points) Select the columns that contain text, as opposed to numbers.

USRegionalMortality %>% select(where(is.character))

(2 points) Display the rows that are for health region 04.

filter to display rows rather than columns:

USRegionalMortality %>% 
  filter(Region == "HHS Region 04")

(3 points) For each of the health regions, display the median death rates from heart disease (but not the death rates from any other cause).

Grab the heart disease rows first, throwing away the others, then do a group-by over regions:

USRegionalMortality %>% 
  filter(Cause == "Heart disease") %>% 
  group_by(Region) %>% 
  summarize(median_death_rate = median(Rate))

(3 points) Display any death rates that are either over 230 or are for Cancer (or both), along with the cause of death.

USRegionalMortality %>% 
  filter(Rate > 230 | Cause == "Cancer") %>% 
  select(Cause, Rate)

(4 points) Make a graph that displays the urban and rural death rates for heart disease for males and females in each region as bars on a bar chart for each region. Hint: use geom_col to display bars whose height you already have. The height goes in y and the categorical variable in x. You can use fill and position the same way as for a regular bar chart.

Take a breath, and sort out what you have:

heart disease rate (in Rate, quantitative), which will be the y of the plot.
urban and rural in Status (categorical)
male and female is Sex (also categorical)
region in Region (also categorical, with 10 categories)

There is one quantitative variable, Rate, and we already know what we are doing with that. But we have three categorical variables, which is one more than we know how to deal with. If we had only two, we could make grouped bar charts, so the idea is to make one of them facets. Which one? Well, it seemed to me that you would want to compare male vs female and rural vs urban within each region, and that would be easiest to do if the regions were facets. So I came up with this (you could have Sex and Status the other way around, which would be equally good):

USRegionalMortality %>% 
  filter(Cause == "Heart disease") %>% 
  ggplot(aes(x = Status, fill = Sex, y = Rate)) + geom_col(position = "dodge") +
    facet_wrap(~ Region)

which is the sort of graph you would actually put into a report. Expect to do some experimentation before you get a graph you are happy with, and for an assignment, you would do well to have some narrative explaining the choices you made. (In a report or presentation for a manager, you would relegate your code and discussion of design decisions to an appendix, but you should be ready to answer questions about your choices.)

On this chart, females have a consistently lower death rate from heart disease than males, no matter which region, and urban residents have a lower death rate from heart disease than rural ones, though in some regions the difference is very small. Region 4 seems to have the biggest difference between urban and rural (this is easiest to see by looking at the males where the numbers are bigger), and regions 5 and 9 seem to have the smallest difference between urban and rural.

Code-wise, the second categorical variable is added using fill, as with a grouped bar chart (which this looks like but isn’t quite, because the bar heights are based on data numbers rather than counts of observations, hence the geom_col). The same position = "dodge" idea is used to put the bars side by side rather than the default of stacking them, which makes no sense here because you don’t add rates (at best, you average them or avoid combining them at all).

Another way you might reasonably think of using facets here is that you could imagine having two categorical variables too many. Then you could have an array of ungrouped bar charts and have the facets both across and up-and-down on the page. That would go like this. There seems to be nothing to choose between using Status or Sex for the second facetting variable. Note the use of facet_grid now:

USRegionalMortality %>% 
  filter(Cause == "Heart disease") %>% 
  ggplot(aes(x = Status, y = Rate)) + geom_col(position = "dodge") +
    facet_grid(Sex ~ Region)

Aesthetically, I don’t completely like this graph because the vertical facets are too narrow to display the numbers of the regions (presumably they are numbered 1 through 10 left to right) and the words “rural” and “urban” overlap, but I think this graph is good as an answer to this question. The way you would fix this is to create new columns with the region numbers (only) and with Status abbreviated to something like Rur and Urb, and then use those in the plot instead of the original Region and Status. This also applies if you has Sex as x instead of Status. If you don’t like the all-dark bars, you could add a fill = Status to distinguish them by colour as well as position.

I also tried drawing this graph with Sex and Region the other way around, but I didn’t like that as much because the point of a graph like this is to compare the heights of the bars, and arranging the regions up and down means that the bars are all stubby and it’s much less easy to tell which heights differ from which.

Extra: with one quantitative variable and a bunch of categorical ones, you were probably thinking about facetted grouped boxplots here, or something similar. If you try that, it comes out like this:

USRegionalMortality %>% 
  filter(Cause == "Heart disease") %>% 
  ggplot(aes(x = Status, fill = Sex, y = Rate)) + geom_boxplot() +
    facet_wrap(~ Region)

There are no actual boxes here, and hence you cannot tell which ones are filled red and which ones filled blue!

The problem is this:

USRegionalMortality %>% 
  filter(Cause == "Heart disease")

There are 10 regions, two sexes and two statuses, so there are \(10 \times 2 \times 2 = 40\) combinations to show on our plot, but there are only 40 observations, because there is only one heart disease rate for each combination. These are observations from only one time period, so there is only one Rate each time. Even though the original data frame has 400 rows, there were 10 different causes of death, so there was still only one observation per combination of region, status, sex, and cause of death. The time when a boxplot would score would be if you had data for several time periods, so you would have several observations per region-status-sex-cause combination, and a boxplot would show how they varied, along with the median rate over all time periods for which you had data.

When you only have one observation per “group”, a boxplot is not informative, because the one observation is also the median and it’s hard to compare. A standard thing to do is to use the one observation as the height of a bar on a bar chart and use geom_col, as I have directed you to do here. I think the heights of my red and blue bars are easier to compare than the not-boxes on my attempted boxplot. As we have defined it, though, it’s not a real bar chart, because we use bar charts to display counts of a categorical variable, and these are calculated death rates. I don’t think it is worth being dogmatic about this, however, because your audience will be used to seeing and interpreting graphs like the bar charts here, and geom_col is as happy to accept a measured variable for its y as it is to accept a count that you previously summarized.

Car repairs

In 1969, cars were much less reliable than they are now. A car magazine kept track of the service records of 33 car models that were popular in 1969. The data are in http://ritsokiguess.site/datafiles/cars69_long.csv. There are three columns: the model of the car, the repair category under consideration (see below), and, in compare_avg, how that model of car compared to average for repairs in that repair category. A + in the dataframe means that this model of car needed repair for the item shown more often than average, and a - means that repair was needed less often than average.

Some of the car model names end in a number (which is often the number of cylinders in the engine), and some of them end in Full, which is short for “Full Size”. These last were big cars that consumed a lot of gas (which in those days was not considered to matter very much).

These are the repair categories:

BR: brake system
FU: fuel system
EL: electrical
EX: exhaust
ST: steering
EM: engine, mechanical
RS: rattles and squeaks
RA: rear axle
RU: rust
SA: shock absorbers
TC: transmission, clutch
WA: wheel alignment
OT: other

(2 points) Read in and display (some of) the data.

A gimme:

my_url <- "http://ritsokiguess.site/datafiles/cars69_long.csv"
cars <- read_csv(my_url)

Rows: 429 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): model, repair, compare_avg

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

cars

The columns as promised. (The AMC Ambassador 8 needed repairs less often than average for most things, but more often than average for brakes and rear axle.) There were 33 cars and 13 repair categories, so there should be this many rows in the dataframe:

33 * 13

[1] 429

and so there are.

(3 points) For each car model, make a table showing how many repair categories that car model was above average on. Save your resulting dataframe (we are going to use it again later).

This is going to be count in some fashion. The most obvious approach is to group by model and then count what is in the compare_avg column:

cars %>% 
  group_by(model) %>% 
  count(compare_avg)

But this also shows the number of items that are below average for each model, which we don’t want to see (and indeed can work out: for each model, the number of plusses and minuses must add up to 13 because there are 13 repair categories). So, don’t waste your reader’s time by showing them stuff they don’t want or need to see. (2 points for getting this far.)

The above is a dataframe just like any other, so what you can do is to grab only the rows where compare_avg is +. In the resulting dataframe, compare_avg is + all the way down, so there is then no need to keep it any more (it has served its purpose), so you can get rid of it now. Run this one line at a time to see what you need to do next:

cars %>% 
  group_by(model) %>% 
  count(compare_avg) %>% 
  filter(compare_avg == "+") %>% 
  select(-compare_avg) -> cars_count
cars_count

and that’s what I was after. 3 points for that, 2.5 if you leave compare_avg in your final answer.

You see that some of the car models were above average on very few items, but some, like the Chevrolet Full, were above average on most of them (that is, had a bad repair record).

Another way you might approach this is to grab only the above-average repair records first, since we don’t need the below-average ones at all:

cars %>% 
  filter(compare_avg == "+")

and from here it is simply a matter of counting how many times each model appears:

cars %>% 
  filter(compare_avg == "+") %>% 
  count(model)

which is a faster way to three points.

Extra: a subtlety here (which we have to take into account later) is that there were 33 car models, but only 32 rows here. What happened to the other one? We will find out.

(1 point) An analysis (that we will see in STAD29) called a cluster analysis was run on these data, dividing the car models into four clusters, labelled A through D. The idea is that the cars in the same cluster should have a similar repair record. The cluster information is in http://ritsokiguess.site/datafiles/cars69_clusters.csv. Read in the cluster information and display at least some of it.

This, like the one on the worksheet, is the same again, so even one point is generous. We will need two dataframes to join on:

my_url <- "http://ritsokiguess.site/datafiles/cars69_clusters.csv"
cars_cluster <- read_csv(my_url)

Rows: 33 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): model, cluster

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

cars_cluster

You might notice that there are 33 rows this time, and you certainly should notice that the cars are in a different order, this time arranged by cluster.

(3 points) Create a dataframe that has, for each car model, both the total number of above-average repair categories and the cluster membership. Your final dataframe should have 33 rows. Save it.

This means to combine the dataframe that you made earlier (that I called cars_count) with the one you just read in (that I called car_cluster). Simply trying to place the dataframes side by side will not work (for one, because they have different numbers of rows, and for two, because the car models are in a different order). So you need some kind of join. My first attempt was this:

cars_count %>% left_join(cars_cluster)

Joining with `by = join_by(model)`

which worked, but only has 32 rows. This is because left_join matches all the rows in the first dataframe that were also in the second, and there were only 32 rows in cars_count. So what happens if you try it the other way around?

cars_cluster %>% left_join(cars_count) -> cars_combined

Joining with `by = join_by(model)`

cars_combined

Now we have 33 rows, so we are good.

Extra: If you scroll down, you see why: the Porsche has a missing value of n, and if you go back to the original data, you can find out why that is:

cars %>% filter(model == "Porsche")

The Porsche had fewer than average repairs on all 13 categories, so it was the most reliable car model in the entire dataset, and its value of n really should be zero rather than missing. If you go through what I called cars_count, you’ll see that the Porsche is not there at all. count does not include zero counts.

(3 points) Make a graph showing how the number of above-average repair categories differs by cluster membership. What do you conclude from your graph? Explain briefly.

Now that you have done all the work, the graph should be straightforward. It uses the dataframe we just made, the one I called cars_combined, and the two variables of interest are cluster (categorical) and n (quantitative), so the graph we need is a boxplot:

ggplot(cars_combined, aes(x = cluster, y = n)) + geom_boxplot()

Warning: Removed 1 row containing non-finite outside the scale range
(`stat_boxplot()`).

(The “non-finite” value was, we now know, the Porsche in cluster B that should really be a zero.)

Say something about how the clusters compare in terms of repair record, for example:

cluster B has the smallest median number of above-average repair categories: that is, it contains the most reliable cars.
cluster D (or clusters C and D) have the largest median number of above-average repair categories; it contains the least reliable cars.

2 points for a sensible graph, the last point for saying something sensible about what you see. Finish with a statement about whether your cars of interest are most or least reliable. The most important thing is this; comment about outliers should be secondary to this. For example, you might say that the cars in clusters C and D have a comparable repair record (almost equally bad) once you take the outlier in cluster C into account.

Extra: the cluster analysis did something more sophisticated than look at total number of above-average repair categories. It also looked at which particular categories were above or below average. Thus, even though clusters C and D are fairly similar in terms of overall repair record, they might differ in terms of the kinds of repair that cars in each cluster tended to need.

To do this, let’s go back to our original dataframe, and add the cluster memberships to it:

cars %>% left_join(cars_cluster)

Joining with `by = join_by(model)`

Now we have three categorical variables of interest: repair type, the comparison with the average, and the cluster membership. This suggests grouped bar charts, facetted by something. I experimented and found that facetting by cluster seemed to work best (least badly). Being above or below average seems to be an outcome, so I made that fill, and the proportion of plusses seemed to be the key, so I used position = "fill" on the bar chart.¹

cars %>% left_join(cars_cluster) %>% 
  ggplot(aes(x = repair, fill = compare_avg)) + geom_bar(position = "fill") +
  facet_wrap(~cluster)

Joining with `by = join_by(model)`

This is rather a busy graph. Within each facet, red is below average (good) and blue is above. So cars in cluster B (top right) have the best overall repair record. Cars in cluster A (top left) seem to be similar to those in cluster B in most categories, but they are much worse on a few: I see BR (brakes), EM (engine), EX (exhaust), and TC (transmission) as being noticeably worse for the cars in cluster A as compared to B.

Clusters C and D (bottom row) are clearly worse overall than clusters A and B. But presumably the cluster analysis put cars in one of these clusters rather than the other for a reason, so they ought to be different from each other somehow. The way to figure this out is to go from left to right and see which of the 13 bars differ substantially in height between clusters C and D. There is judgement here, of course, and your call might differ from mine (which is fine as long as yours is defensible):

BR (brakes): cluster D is worse
EM (exhaust): C is worse
FU (fuel system): D is worse
OT (other): C is a lot worse
RS (rattles): D is worse
RU (rust): D is worse
ST (steering): C is a bit worse

I don’t really see anything that the ones where cluster D is worse have in common.

Now that I think about it, there is a place for focusing on the cluster C vs D comparison, and putting the bars for the same repair category next to each other there:

cars %>% left_join(cars_cluster) %>% 
  filter(cluster == "C" | cluster == "D") %>% 
  ggplot(aes(x = cluster, fill = compare_avg)) + geom_bar(position = "fill") +
  facet_wrap(~repair)

Joining with `by = join_by(model)`

You might say that it is now much easier to compare each repair type between the clusters because the repair types are side by side, now that we have rearranged things to have cluster as x and repair type as facets. You can see at a glance where the two clusters are about the same and where they are a lot different. I originally made this graph but with all four clusters (left to right within facets):

cars %>% left_join(cars_cluster) %>% 
  ggplot(aes(x = cluster, fill = compare_avg)) + geom_bar(position = "fill") +
  facet_wrap(~repair)

Joining with `by = join_by(model)`

but I think it is less easy to tell from this graph that cluster B is best overall and cluster A is second best. The focus in this kind of graph is the question of which repair types distinguish the clusters: RA does not at all, and EX only does to a certain extent.

Footnotes

You see that there is a lot of deciding to be done, and you should expect to need to try several different graphs before you find one you like.↩︎