Worksheet 9

Published

November 3, 2023

Questions are below. My solutions are below all the question parts for a question; scroll down if you get stuck. There is extra discussion below that for some of the questions; you might find that interesting to read, maybe after tutorial.

For these worksheets, you will learn the most by spending a few minutes thinking about how you would answer each question before you look at my solution. There are no grades attached to these worksheets, so feel free to guess: it makes no difference at all how wrong your initial guess is!

1 Cuckoo eggs

The cuckoo is a European bird that lays its eggs in the nests of other birds (rather than building nests itself). The other bird, known as a “host”, raises and cares for the newly hatched cuckoo chick as if it were its own. Each cuckoo returns to the same territory year after year and lays its eggs in a nest of the same host species. Thus, cuckoos are actually several sub-species, each of which lays its eggs in the nests of a different host bird. In a study, 120 cuckoo eggs were found in the nests of six other bird species: hedge sparrow, meadow pipit, pied wagtail, robin, tree pipit, and wren. These are birds of different sizes, so researchers were interested in whether the cuckoo eggs laid in the nests of different host birds differed in size as well. (For example, wrens are small birds, so you might be interested in whether cuckoo eggs laid in wren nests are smaller than cuckoo eggs laid in the nests of other birds. If this is the case, the cuckoo eggs will look less different from the wren eggs in the same nest.)

The data are in http://ritsokiguess.site/datafiles/cuckoo.txt.

  1. Read in and display (some of) the data. Note that some of the host bird names are misspelled. (You do not need to correct the misspellings.)

  2. Bearing in mind that we will be interested in running some kind of ANOVA shortly, explain briefly why a normal quantile plot, for each host species separately, will be useful.

  3. Draw a suitable normal quantile plot. Based on what you see, what would you recommend as a suitable test to compare the egg lengths in the nests of the different host species? Explain briefly.

  4. Run an (ordinary) analysis of variance, including any follow-up if warranted. What do you conclude, in the context of the data? (Run this analysis even if you don’t think it’s the best thing to do.)

  5. Run a Mood’s median test, and, if appropriate, follow-up tests. What do you now conclude, in the context of the data?

  6. Compare all your significant results from the previous two parts. Are the results substantially different? Explain briefly.

Cuckoo eggs: my solutions

The cuckoo is a European bird that lays its eggs in the nests of other birds (rather than building nests itself). The other bird, known as a “host”, raises and cares for the newly hatched cuckoo chick as if it were its own. Each cuckoo returns to the same territory year after year and lays its eggs in a nest of the same host species. Thus, cuckoos are actually several sub-species, each of which lays its eggs in the nests of a different host bird. In a study, 120 cuckoo eggs were found in the nests of six other bird species: hedge sparrow, meadow pipit, pied wagtail, robin, tree pipit, and wren. These are birds of different sizes, so researchers were interested in whether the cuckoo eggs laid in the nests of different host birds differed in size as well. (For example, wrens are small birds, so you might be interested in whether cuckoo eggs laid in wren nests are smaller than cuckoo eggs laid in the nests of other birds. If this is the case, the cuckoo eggs will look less different from the wren eggs in the same nest.)

The data are in http://ritsokiguess.site/datafiles/cuckoo.txt.

  1. Read in and display (some of) the data. Note that some of the host bird names are misspelled. (You do not need to correct the misspellings.)

Solution

The values are separated by single spaces, so:

my_url <- "http://ritsokiguess.site/datafiles/cuckoo.txt"
eggs <- read_delim(my_url, " ")
Rows: 120 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
chr (1): bird_species
dbl (1): egg_length

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
eggs

There are indeed 120 eggs, with lengths (evidently measured in millimetres) and with the two “pipets” misspelled. The column called bird_species is the host bird; the eggs themselves are all cuckoo eggs.

\(\blacksquare\)

  2. Bearing in mind that we will be interested in running some kind of ANOVA shortly, explain briefly why a normal quantile plot, for each host species separately, will be useful.

Solution

The major assumption of ANOVA is that the observations within each group are (approximately) normal in shape. Since we are specifically interested in normality, rather than shape generally, a normal quantile plot will be better than a boxplot.

\(\blacksquare\)

  3. Draw a suitable normal quantile plot. Based on what you see, what would you recommend as a suitable test to compare the egg lengths in the nests of the different host species? Explain briefly.

Solution

Do the normal quantile plot facetted by groups (bird_species), since we are interested in normality within each group, not of all the observations taken together:

ggplot(eggs, aes(sample = egg_length)) + stat_qq() +
  stat_qq_line() + facet_wrap(~ bird_species)

We are looking for all the distributions to be roughly normal. Commentary (which you don’t need to write but you certainly need to think):

  • the Hedge Sparrow distribution has two or three mild outliers at the bottom. (This, to me, is better than calling it “long tails” because the high observation is (i) only one and (ii) not really too high.)
  • the Meadow Pipit has long tails (I say this because the pattern seems to be a feature of the whole distribution, rather than of a few observations. Having said that, those four lowest values do seem to be noticeably lower than the rest, so you can sell the idea of those four lowest values being outliers).
  • the Pied Wagtail is if anything short-tailed, so this is not a problem. You could also call this acceptably normal.
  • I cannot see any problems at all with the Robin or the Wren.
  • The Tree Pipit distribution is, as you see it, either acceptably normal, or very slightly skewed to the left (is that a curved shape?)

The second part of the thought process is sample size, because the larger the sample size within a group, the less picky you have to be about normality. You can guess this by looking at how many dots there appear to be on the normal quantile plots, or you can get the exact sample sizes (better):

eggs %>% count(bird_species)

So I would actually say that the Hedge Sparrow is the one that might be a problem. The (much) larger sample size of the Meadow Pipits ought to be large enough to take care of the long tails, but it is not clear whether a sample size of 14 Hedge Sparrows is large enough to take care of the low-end outliers.

So, in terms of what you actually need to write:

  • find one or more distributions whose normality you are concerned about (e.g. Hedge Sparrow and Meadow Pipit)
  • think about whether the sample sizes for the ones you are concerned about are small enough that the non-normality you found is still a problem. I would say that the large sample of Meadow Pipits (45) is large enough that I don’t need to worry: the long tails there are not extreme. But I am less sure about the Hedge Sparrows: the small sample size (14) might not be enough to accommodate the low-end outliers.
  • if there are any distributions that you are concerned about, express that you don’t think an ANOVA will be appropriate (and that a Mood’s median test will be better).

If you were happy enough with the normality, the best thing is then to think about whether the spreads are equal. You can do this by calculating SDs:

eggs %>% 
  group_by(bird_species) %>% 
  summarize(length_sd = sd(egg_length))

The SDs for the robins and wrens are the smallest of these, but it’s up to you whether you think they are enough smaller than the others to be worth worrying about. I’m inclined to think not, but if you think they are, then you would recommend a Welch ANOVA. A couple of other options for assessing spread are:

  • to draw boxplots, purely for the purpose of assessing spread (because we used the normal quantile plots for the stuff about normality)
  • to use the normal quantile plots to assess spread.

Boxplots look like this:

ggplot(eggs, aes(x = bird_species, y = egg_length)) + geom_boxplot()

The pied wagtail box looks noticeably taller than the others, so the spreads do not all appear to be the same.

To use the normal quantile plot, assess the slope of each line in each plot. Do the slopes look about the same? I think they more or less do, though you could say that the line on the Pied Wagtail plot is steeper than the others.

In terms of a recommendation:

  • if you thought the distributions were not normal enough, recommend Mood’s median test.
  • if you thought that normality was OK, but equal spreads was not, recommend the Welch ANOVA.
  • if you thought that both normality and equal spreads were good enough, recommend a regular ANOVA.
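The Welch ANOVA recommended above is not run anywhere in this worksheet, so here is a minimal sketch of one way to do it, using base R's oneway.test. I am using R's built-in chickwts data as a stand-in (it also happens to have six groups), so that the code runs without downloading anything; on the cuckoo data, the corresponding call would presumably be oneway.test(egg_length ~ bird_species, data = eggs).

```r
# Stand-in data: chickwts is built into R (chick weights for six feed types).
# This is NOT the cuckoo data; it is only here to make the sketch runnable.

# Welch ANOVA: does not assume equal spreads (var.equal = FALSE is the default)
welch <- oneway.test(weight ~ feed, data = chickwts)
welch

# For comparison: var.equal = TRUE gives the same F-test as aov()
classic <- oneway.test(weight ~ feed, data = chickwts, var.equal = TRUE)
classic

# The denominator degrees of freedom differ: a whole number for the
# ordinary ANOVA, usually fractional for Welch
welch$parameter
classic$parameter
```

The two analyses ask the same question (are all the group means equal?), but the Welch version adjusts the denominator degrees of freedom to allow for unequal spreads.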

\(\blacksquare\)

  4. Run an (ordinary) analysis of variance, including any follow-up if warranted. What do you conclude, in the context of the data? (Run this analysis even if you don’t think it’s the best thing to do.)

Solution

I wanted to get you some practice at doing this, hence the last sentence of the question:

eggs.1 <- aov(egg_length ~ bird_species, data = eggs)
summary(eggs.1)
              Df Sum Sq Mean Sq F value   Pr(>F)    
bird_species   5  42.94   8.588   10.39 3.15e-08 ***
Residuals    114  94.25   0.827                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This P-value of \(3 \times 10^{-8}\) is much smaller than 0.05, so the mean cuckoo egg lengths are definitely not all the same for each host species (or, if you like, cuckoo egg length depends somehow on host species).

This is enough to say for now. To find out which host species differ from which in terms of mean cuckoo egg length, we need to fire up Tukey:

TukeyHSD(eggs.1)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = egg_length ~ bird_species, data = eggs)

$bird_species
                                diff          lwr         upr     p adj
MeadowPipet-HedgeSparrow -0.82253968 -1.629133605 -0.01594576 0.0428621
PiedWagtail-HedgeSparrow -0.21809524 -1.197559436  0.76136896 0.9872190
Robin-HedgeSparrow       -0.54642857 -1.511003196  0.41814605 0.5726153
TreePipet-HedgeSparrow   -0.03142857 -1.010892769  0.94803563 0.9999990
Wren-HedgeSparrow        -1.99142857 -2.970892769 -1.01196437 0.0000006
PiedWagtail-MeadowPipet   0.60444444 -0.181375330  1.39026422 0.2324603
Robin-MeadowPipet         0.27611111 -0.491069969  1.04329219 0.9021876
TreePipet-MeadowPipet     0.79111111  0.005291337  1.57693089 0.0474619
Wren-MeadowPipet         -1.16888889 -1.954708663 -0.38306911 0.0004861
Robin-PiedWagtail        -0.32833333 -1.275604766  0.61893810 0.9155004
TreePipet-PiedWagtail     0.18666667 -0.775762072  1.14909541 0.9932186
Wren-PiedWagtail         -1.77333333 -2.735762072 -0.81090459 0.0000070
TreePipet-Robin           0.51500000 -0.432271433  1.46227143 0.6159630
Wren-Robin               -1.44500000 -2.392271433 -0.49772857 0.0003183
Wren-TreePipet           -1.96000000 -2.922428738 -0.99757126 0.0000006

This is hard to make sense of because there are 6 groups and 15 comparisons. Seven of the comparisons are significant. The five very small P-values are all Wren and something else; if you look at those carefully, the eggs in the Wren nests are smaller on average each time. The other two P-values are only just less than 0.05; in these two cases, the eggs in the Meadow Pipit nests are shorter on average than those in the Hedge Sparrow or Tree Pipit nests.

Try to summarize the seven significant differences in a way that is easy to read and understand (thinking of your reader, again). Simply listing the significant ones doesn’t offer any insight about what they have in common.

\(\blacksquare\)

  5. Run a Mood’s median test, and, if appropriate, follow-up tests. What do you now conclude, in the context of the data?

Solution

Make sure you have this somewhere:

library(smmr)

and then:

median_test(eggs, egg_length, bird_species)
$grand_median
[1] 22.35

$table
              above
group          above below
  HedgeSparrow    11     3
  MeadowPipet     17    28
  PiedWagtail     10     5
  Robin           10     6
  TreePipet       12     3
  Wren             0    15

$test
       what        value
1 statistic 3.032698e+01
2        df 5.000000e+00
3   P-value 1.271619e-05

This is also strongly significant, and indicates that the median cuckoo egg lengths in the nests of the six different species are not all the same. (Or that one or more of the median egg lengths is different, etc etc.) So, to find out which ones are different according to this procedure, fire up pairwise median tests:

pairwise_median_test(eggs, egg_length, bird_species)

There are fifteen of these (page down to see the other five). This actually is a dataframe, so you can sort the (adjusted) P-values into order without too much trouble. Or you can use filter to show only the ones that are less than 0.05. I like the sorting idea better, because then you can see whether there are any others whose P-value is close to 0.05, or, as in this case, confirm that there are not:

pairwise_median_test(eggs, egg_length, bird_species) %>% 
  arrange(adj_p_value)

There are only four significant differences here: Wren with everything except Pied Wagtail. If you look at the table of aboves and belows in the output from Mood’s median test, this is evidently because the Wren eggs are shorter than the others. (Presumably Wren vs. Pied Wagtail is not unbalanced enough to be significant.)
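To see where the test statistic in the median_test output comes from, here is a sketch that rebuilds it from the table of aboves and belows, using base R's chisq.test. I am assuming that the test is an ordinary chi-squared test of association on that table (the numbers agree with the output above); the name counts is mine, and the counts are copied from the $table part of the output.

```r
# Counts above and below the grand median (22.35), copied from the
# $table part of the median_test output
counts <- matrix(
  c(11,  3,
    17, 28,
    10,  5,
    10,  6,
    12,  3,
     0, 15),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    group = c("HedgeSparrow", "MeadowPipet", "PiedWagtail",
              "Robin", "TreePipet", "Wren"),
    position = c("above", "below")
  )
)

# Chi-squared test of association between host species and above/below
result <- chisq.test(counts, correct = FALSE)
result$statistic   # about 30.33, as in the median_test output
result$parameter   # 5 degrees of freedom
result$p.value
```

If host species had nothing to do with egg length, each species would have about half its eggs above the grand median and half below. The Wren row (0 above, 15 below) is the furthest from that, and contributes the most to the test statistic.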

\(\blacksquare\)

  6. Compare all your significant results from the previous two parts. Are the results substantially different? Explain briefly.

Solution

For the ANOVA and the Mood’s median test themselves, both P-values are very small (the one for the ANOVA is smaller), so there is no substantial difference there.

For the followup comparisons, the only ones that differ in significance between the two tests are Wren vs Pied Wagtail, and Meadow Pipit vs Hedge Sparrow and Tree Pipit. These are not significant in the pairwise median tests, but were in Tukey. (In addition, the last two comparisons were only just significant in the Tukey.) Given this, I would say that there is not a substantial difference between the results from the two procedures.

\(\blacksquare\)

2 Tidy homes

Earlier, we dealt with some asking prices of homes that had three or four bedrooms. This time, we will handle the data as they were originally laid out, and make something suitable for the two-group boxplot that we drew before. The original data are in http://ritsokiguess.site/datafiles/homes_wide.csv.

  1. Read in and display the original data. What do you see when you scroll down?

  2. Rearrange your dataframe to be in suitable format for a two-sample \(t\)-test or a boxplot. As a bonus, see if you can figure out how to get rid of the missing values. (There is one more part, for which it will help to save the dataframe that comes out of this part.)

  3. Using your rearranged dataframe, make a boxplot of price according to the number of bedrooms.

Tidy homes - my solutions

Earlier, we dealt with some asking prices of homes that had three or four bedrooms. This time, we will handle the data as they were originally laid out, and make something suitable for the two-group boxplot that we drew before. The original data are in http://ritsokiguess.site/datafiles/homes_wide.csv.

  1. Read in and display the original data. What do you see when you scroll down?

Solution

The usual one-pointer:

my_url <- "http://ritsokiguess.site/datafiles/homes_wide.csv"
asking <- read_csv(my_url)
Rows: 23 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): beds4, beds3

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
asking

This looks like two columns of house prices, but if you scroll down, you’ll see the bottom few asking prices for 4-bedroom houses are missing. This is not because those actual asking prices are missing, but because in the original dataset there were only 14 4-bedroom houses. All the columns in a dataframe have to be the same length, so, with things laid out this way, the bottom values in the shorter column have to be, in R terms, missing.

\(\blacksquare\)

  2. Rearrange your dataframe to be in suitable format for a two-sample \(t\)-test or a boxplot. As a bonus, see if you can figure out how to get rid of the missing values. (There is one more part, for which it will help to save the dataframe that comes out of this part.)

Solution

We want all the asking prices in one column, with a second column labelling which type of house it is an asking price for. This is a standard pivot_longer, like the pig feed data in lecture. everything() is a select-helper meaning “all the columns”, or you can use starts_with("beds") (with the quotes, because it is a piece of text to match), or you can name the columns (there are only two):

asking %>% 
  pivot_longer(everything(), names_to = "bedrooms", values_to = "price")

You get to choose the names for the new columns; remember to put them in quotes because they don’t exist yet.

The “missing” prices for the 4-bedroom houses (caused by there being fewer of them) are at the bottom. There are two ways to get rid of the missings. The direct way, which I think you have seen before, is drop_na, which says to drop the rows where the variable named is missing:

asking %>% 
  pivot_longer(everything(), names_to = "bedrooms", values_to = "price") %>% 
  drop_na(price)

The missings have gone.

This is a common enough thing to ask for that pivot_longer provides a way to do it in one step. To find out how, in the console ask for the help for pivot_longer by typing ?pivot_longer. It will appear bottom right. Look at the Usage section to see all the things you can ask for, and look down below in Arguments to find out what they do. You might guess that values_drop_na is the thing you want, and reading further down reveals that it is indeed:

asking %>% 
  pivot_longer(everything(), names_to = "bedrooms", values_to = "price",
               values_drop_na = TRUE) -> asking_longer
asking_longer

This I saved for the next part.
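If you want to convince yourself that the two approaches do the same thing, here is a small sketch on a made-up dataframe. The dataframe toy and its prices are invented for illustration (they are not the real asking prices), and the names via_drop_na and via_argument are mine.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Made-up wide data: the beds4 column is one value short, so it ends in NA
toy <- tibble(
  beds4 = c(250, 310, NA),
  beds3 = c(180, 200, 220)
)

# Way 1: pivot longer, then drop the rows with a missing price
via_drop_na <- toy %>%
  pivot_longer(everything(), names_to = "bedrooms", values_to = "price") %>%
  drop_na(price)

# Way 2: ask pivot_longer to drop the missing values as it goes
via_argument <- toy %>%
  pivot_longer(everything(), names_to = "bedrooms", values_to = "price",
               values_drop_na = TRUE)

via_drop_na
via_argument
```

Both results should have five rows (six values, minus the one that was missing only because of the layout).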

In the help for values_drop_na, it says “… and should generally only be used when missing values in data were created by its structure”. What does that mean? Well, according to my comment at the end of (a), the missing values in what I called asking were not actually missing, but were displayed as missing only because a dataframe has to have the same number of rows in all its columns: that is, those missing values were indeed created by the structure. If there was a way of organizing the asking prices so that different parts of it could have different lengths,1 then there would be no missing values.

We’re going to use this in the next worksheet as well, so I’m going to save it for you to use there.

\(\blacksquare\)

  3. Using your rearranged dataframe, make a boxplot of price according to the number of bedrooms.

Solution

This should be easy now:

ggplot(asking_longer, aes(x = bedrooms, y = price)) + geom_boxplot()

which is the boxplot you got on Worksheet 4. The \(x\)-axis is labelled differently, however, for reasons that you will discover later.

I gave you this part partly as a reward for your hard work, and partly as a suggestion that this is how real-life analyses work. Much of your time in data analysis is spent getting the data into the right format, and the analysis, when it comes, often seems very easy in comparison.

\(\blacksquare\)

3 Cuckoo eggs: extras

(a):

Extra 1: note that you can measure the length of an egg without disturbing anything, using something like Vernier calipers, so no birds or yet-to-be-born birds were harmed in this study. Also note that this is an observational study; an experiment would do something like randomly assigning cuckoo eggs to nests of other birds, which would defeat the purpose of this study. (This means that some of the cuckoos may have laid eggs in the nest of the “wrong” species, that is to say, not the species that they customarily lay eggs in, and we have no way of knowing. We need to assume that most of the cuckoos laid their eggs in the nest of the “correct” host bird, which seems to be a reasonable assumption.)

The data, and the story about it, came from here.

Extra 2: if you happen to be interested in how to fix the misspellings: there is a package stringr (part of the tidyverse) whose purpose is to deal with text: searching, replacing and so on. I use this package infrequently enough that I have to look up most of what I do with it, but the key thing is to remember that stringr exists so that you know where to look things up.2 The help page you want is this one. There is a lot of talk in stringr about “regular expressions”, which are a fancy way of looking for text, but we don’t need to worry about that here: our job is to replace Pipet by Pipit wherever it appears in bird_species. We are only making one replacement on each row, so str_replace is fine:3

eggs %>% 
  mutate(correct_species = str_replace(bird_species, "Pipet", "Pipit"))

Having sorted that out, you would probably go back and do it again, overwriting bird_species with the correct spellings, since you don’t need the misspelled names for anything.

The three inputs to str_replace are: the column to look in, the text to find, and the text to replace it by. This replaces the first instance on each row, but we know that there is only one.4
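Here is a small sketch of the difference between replacing the first match and replacing all of them. The vector species is a made-up stand-in for the bird_species column, and the second pair of lines is a deliberately silly replacement, only there to show what happens when there is more than one match.

```r
library(stringr)

# A stand-in for the bird_species column
species <- c("MeadowPipet", "TreePipet", "Wren")

# One match (at most) per name, so str_replace is enough
str_replace(species, "Pipet", "Pipit")
# "MeadowPipit" "TreePipit" "Wren"

# Two lowercase e's in HedgeSparrow: str_replace changes only the first,
# str_replace_all changes both
first_only <- str_replace("HedgeSparrow", "e", "E")      # "HEdgeSparrow"
every_one <- str_replace_all("HedgeSparrow", "e", "E")   # "HEdgESparrow"
```

Names without the text you are looking for (like Wren) come back unchanged.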

Extra 3: there were originally some extra blank spaces on the end of some of the lines of data in the file. This caused some “parsing errors” in read_delim (it thought there were three columns of data in some rows and not two). I didn’t want to confuse you, so I tidied up the data before saving it for you. (I’m not sure whether the extra spaces were in the original data file or were caused by my copying it.)
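If you meet a file like that yourself, one possible way around it (a sketch, not what I did here) is base R's read.table, which by default treats any amount of whitespace as a separator and so does not mind extra spaces at the ends of lines. The text in raw_text is made up to imitate the problem; it is not the actual file.

```r
# Made-up lines imitating the problem: note the trailing spaces on some rows
raw_text <- "bird_species egg_length
Wren 21.0 
Robin 22.5
Wren 20.3  
"

# read.table splits on runs of whitespace, so the trailing spaces
# do not create an extra column
toy <- read.table(text = raw_text, header = TRUE)
toy
str(toy)
```

This gives a base R data frame rather than a tibble; tibble::as_tibble would turn it into one if you wanted to carry on with tidyverse tools.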

(c):

Extra: one way of assessing this further is to look at bootstrap sampling distributions of the sample means for any problematic groups. This is done by filtering the observations you want and then proceeding as if you have one sample. I’m setting the random number seed for reproducibility (if I edit this document, I don’t want my simulations to change):

set.seed(457299)

Here are the Meadow Pipits (which you have to remember to misspell):

eggs %>% filter(bird_species == "MeadowPipet") -> meadow_pipits
tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(meadow_pipits$egg_length, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()

Absolutely no problems there. The sample size is definitely big enough.

What about the Hedge Sparrow distribution, which was more nearly normal but came from a smaller sample? Same idea exactly:

eggs %>% filter(bird_species == "HedgeSparrow") -> hedge_sparrows
tibble(sim = 1:10000) %>% 
  rowwise() %>% 
  mutate(my_sample = list(sample(hedge_sparrows$egg_length, replace = TRUE))) %>% 
  mutate(my_mean = mean(my_sample)) %>% 
  ggplot(aes(sample = my_mean)) + stat_qq() + stat_qq_line()

This is a little bit worse (showing the importance of sample size in your considerations), but there is no evidence of any problems. (I did 10,000 simulations for these two, because with only 1,000 it was not so clear to me what was happening.) If anything, the hedge sparrow distribution has a lower tail that is a tiny bit long (caused by those lower outliers in the data distribution), but nothing at all to worry about.
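For comparison, the same bootstrap idea can be written in base R with replicate. This is only a sketch of an alternative to the rowwise code above, and it uses one small group from R's built-in chickwts data (the ten horsebean chicks) as a stand-in for a small sample, so that it runs without the cuckoo data.

```r
set.seed(457299)

# Stand-in small sample: weights of the 10 chicks fed horsebean
horsebean <- chickwts$weight[chickwts$feed == "horsebean"]

# Resample with replacement (same size as the original sample),
# take the mean each time, and repeat 10,000 times
boot_means <- replicate(10000, mean(sample(horsebean, replace = TRUE)))

# Normal quantile plot of the bootstrapped sample means
qqnorm(boot_means)
qqline(boot_means)
```

As before, you are looking at whether the simulated sample means follow the line, not at whether the original observations do.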

(d):

Extra: using R to list the small P-values for you might make it easier to see what they look like. The output from TukeyHSD is a list (the clue is the $bird_species at the top of the output), so pull that off first:

TukeyHSD(eggs.1)$bird_species
                                diff          lwr         upr        p adj
MeadowPipet-HedgeSparrow -0.82253968 -1.629133605 -0.01594576 4.286214e-02
PiedWagtail-HedgeSparrow -0.21809524 -1.197559436  0.76136896 9.872190e-01
Robin-HedgeSparrow       -0.54642857 -1.511003196  0.41814605 5.726153e-01
TreePipet-HedgeSparrow   -0.03142857 -1.010892769  0.94803563 9.999990e-01
Wren-HedgeSparrow        -1.99142857 -2.970892769 -1.01196437 5.810093e-07
PiedWagtail-MeadowPipet   0.60444444 -0.181375330  1.39026422 2.324603e-01
Robin-MeadowPipet         0.27611111 -0.491069969  1.04329219 9.021876e-01
TreePipet-MeadowPipet     0.79111111  0.005291337  1.57693089 4.746193e-02
Wren-MeadowPipet         -1.16888889 -1.954708663 -0.38306911 4.861345e-04
Robin-PiedWagtail        -0.32833333 -1.275604766  0.61893810 9.155004e-01
TreePipet-PiedWagtail     0.18666667 -0.775762072  1.14909541 9.932186e-01
Wren-PiedWagtail         -1.77333333 -2.735762072 -0.81090459 7.026044e-06
TreePipet-Robin           0.51500000 -0.432271433  1.46227143 6.159630e-01
Wren-Robin               -1.44500000 -2.392271433 -0.49772857 3.182821e-04
Wren-TreePipet           -1.96000000 -2.922428738 -0.99757126 5.555995e-07

This is (you can tell from the way it displays) a matrix rather than a dataframe. In addition, the two species being compared each time don’t have a column name, so they are what R terms “row names”. When you use something from the tidyverse, row names disappear:

TukeyHSD(eggs.1)$bird_species %>% 
  as_tibble()

so what you have to do is to use something from base R that keeps the row names,5 and then turn them into a column. The last thing is to put the smallest P-values at the top, so you can see them:

TukeyHSD(eggs.1)$bird_species %>% 
  as.data.frame() %>% 
  rownames_to_column("comparison") %>% 
  arrange(`p adj`)

The p adj with a space is not a legal column name, so put backticks around it when you refer to it. The five smallest P-values are all Wren with something else, and the next two, that are only just significant (P-values 0.04-something), are Meadow Pipit and something else, with the eggs in the meadow pipit nest being shorter.
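If you would rather show only the significant comparisons, filter works in the same pipeline. Here is a sketch of that, again on R's built-in chickwts data as a stand-in (it also has six groups and hence fifteen comparisons); with the cuckoo data you would start from TukeyHSD(eggs.1)$bird_species instead. The names chick.1, tukey_df and significant are mine.

```r
library(dplyr)
library(tibble)

# Stand-in ANOVA on built-in data (six feed types)
chick.1 <- aov(weight ~ feed, data = chickwts)

# Matrix -> data frame (keeping row names) -> row names into a column
tukey_df <- TukeyHSD(chick.1)$feed %>%
  as.data.frame() %>%
  rownames_to_column("comparison")

# Keep only the comparisons with adjusted P-value below 0.05,
# smallest first; backticks because of the space in the name
significant <- tukey_df %>%
  filter(`p adj` < 0.05) %>%
  arrange(`p adj`)
significant
```

The drawback of filtering, compared with sorting, is that you no longer see any comparisons that only just missed being significant.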

I didn’t ask you to make a boxplot, but I made one for myself:

ggplot(eggs, aes(x = bird_species, y = egg_length)) + geom_boxplot()

It’s not at all surprising that the eggs in Wren nests are significantly shorter than the others, and if anything else is going to be significant, it looks likely to involve the Meadow Pipit nests compared with the top three. (Meadow Pipit and Pied Wagtail is the next smallest P-value in the Tukey, but at 0.23 it is nowhere near significant.)

(f):

Extra 1: the lack of substantial difference between the results of the two procedures suggests that you could reasonably run either one. In situations like that, it is better to run the ANOVA and Tukey, because these make better use of the data (that is, they are based on the actual data values rather than counts above or below something). Therefore I think it is fair to conclude that the cuckoo eggs in Wren nests are smaller than those in other nests, and there are not any other strongly significant differences. (This would be consistent with wren eggs being smaller and the cuckoo eggs laid in wren nests being correspondingly smaller as well.) Whether there are differences in size between eggs of the other birds is another matter; if there are, they do not seem to be matched by differing sizes of eggs the cuckoos lay in those nests.

Extra 2: I know I’ve been overdoing it with Extras in this question, but I mentioned before that I was wondering whether host species with smaller eggs go with cuckoo eggs that are also smaller. To assess that, we need some data about the typical sizes of eggs of the other species; the lengths are what concern us. I found some information here (this is the page for the Tree Pipit). These are European birds, so the British Trust for Ornithology seems like a good source. I want to make a little dataframe with the average egg length for each species, so tribble seems to be the way to go. I’m using the same misspellings as in our dataset:

egg_size <- tribble(
~bird_species, ~average_egg_length,
"TreePipet", 20,
"MeadowPipet", 20,
"HedgeSparrow", 19,
"Robin", 20,
"PiedWagtail", 20,
"Wren", 16
)
egg_size

Well, there I think you have your answer. Wrens have smaller eggs than the others, which are about the same size, and the cuckoo eggs found in the nests have about the same relationship. A graph:

eggs %>% 
  left_join(egg_size) %>% 
  ggplot(aes(x = average_egg_length, y = egg_length)) +
  geom_jitter()
Joining with `by = join_by(bird_species)`

The cuckoo eggs are generally bigger than those of the host species, but the cuckoo eggs in wren nests (on the left) are generally smaller than the cuckoo eggs found in other nests.

Code notes:

  • left_join looks up the average egg length for each bird_species found in eggs. In my little data frame, I used the same column name bird_species (and the same misspellings!) so that they would be easy to look up.6
  • there were a lot of eggs that would have to plot in the same place on a scatterplot, so I used geom_jitter instead of geom_point. This moves the points around a little bit. The default is to move each point by up to “40% of the resolution of the data”, horizontally and vertically. This means that the average egg lengths, which are really all integers, are plotted near the integer they actually are, but spread out so you can see them all.
  • as ever, feel free to run the code one line at a time so that you can see what it does.
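To illustrate the point about matching names in the lookup, here is a sketch of what a mismatch looks like and one way to track it down. The two little dataframes are made up for illustration (the egg lengths in eggs_toy are invented), with a space deliberately put into Pied Wagtail in the lookup table.

```r
library(dplyr)
library(tibble)

# Made-up mini version of the eggs data
eggs_toy <- tribble(
  ~bird_species, ~egg_length,
  "Wren",        21.0,
  "PiedWagtail", 23.1,
  "Robin",       22.4
)

# Lookup table with a deliberate mismatch: "Pied Wagtail" has a space
egg_size_typo <- tribble(
  ~bird_species,  ~average_egg_length,
  "Wren",         16,
  "Pied Wagtail", 20,
  "Robin",        20
)

# left_join keeps every row of eggs_toy; a species that is not found
# in the lookup table gets a missing average_egg_length
joined <- left_join(eggs_toy, egg_size_typo, by = "bird_species")
joined

# anti_join shows the rows of eggs_toy that had no match
unmatched <- anti_join(eggs_toy, egg_size_typo, by = "bird_species")
unmatched
```

Unexpected missing values after a join are usually a sign that something like this has happened, so they are worth checking for before you draw the graph.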

Footnotes

  1. An R list would allow this, because a list can contain anything. But that’s beyond our scope now.

  2. This is more efficient than searching for something like “how to replace text in R”, and hoping to land on a Stack Overflow question where somebody has asked the same thing and gotten an answer using tidyverse ideas. Besides which, it is better to find the help files in the right package rather than relying on someone else’s interpretation of them.

  3. You would use str_replace_all if you wanted to do something like replacing all the letter e in each bird_species by the word “hello”, and there might be more than one replacement per bird.

  4. If there might be several replacements to make on the same row, then you would use str_replace_all.

  5. The as.data.frame does the same thing as as_tibble, except that (i) it keeps the row names as row names, and (ii) it is a bit more forgiving about other things that don’t concern us here.

  6. I put a space in PiedWagtail the first time, and wondered why it was not finding it!