STAC32 Assignment 2

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

1 The Cherry Blossom Run

The Cherry Blossom Run is a running race that takes place every year on the streets of Washington, DC. In 2012, over sixteen thousand runners participated. Information about each of the runners is in http://ritsokiguess.site/datafiles/run12.csv. In this question, we are concerned with these variables:

  • gender M or F (for male or female, in terms of eligibility to run in running races for that gender)

  • time time taken to complete the race, in fractional minutes.

(a) (2) Read in and display (a little of) the data.

The file is a .csv, so read_csv will do it:

my_url <- "http://ritsokiguess.site/datafiles/run12.csv"
run12 <- read_csv(my_url)
Rows: 16924 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): gender, location, state
dbl (6): place, time, pace, age, div_place, div_tot

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
run12

Give your dataframe a name that suggests it is information about a running race. On mine, at the top it says there are 16924 rows (“over sixteen thousand”) and nine columns, including time and gender that we will be working with. The display of the data is only 10 rows and something like 8 columns, so we really are seeing only a tiny fraction of it. (These are the 10 fastest runners).

Extra: Quarto’s pager has a maximum of 1000 “pages” of 10 rows each, so the most it can display is 10,000 rows. This data set is so big that the pager cannot even display it all! (If you use View, you will see all of it, but you’ll have to scroll down a ways.)

(b) (3) Draw a histogram of the time taken to run the race, for all the runners together. Explain briefly why the median would be a better measure of centre (location) than the mean.

The coding is straightforward, though you might have to think about a good number of bins:

ggplot(run12, aes(x = time)) + geom_histogram(bins = 20)

You can have more bins than usual because there are so many runners (over 16 thousand). Experiment to find a number of bins that shows the shape. The default is still too many, even with this much data.

The distribution is skewed to the right, so the median would be a better measure of centre than the mean. That is because the mean will be “pulled” into the long tail and here will be larger than the median and, you might say, be larger than it should be.

(c) (3) Draw above-and-below histograms for time for male and female runners on separate histograms within one plot. (Hint: how many variables are there on a single histogram? How many variables do you have?) Do they both have the same shape?

You might not be so familiar with above-and -below histograms, but the idea is that a histogram requires one quantitative variable, but this plot has gender as well, which is “too many categorical variables”, and you therefore need to use facets:

ggplot(run12, aes(x = time)) + geom_histogram(bins = 15) +
  facet_wrap(~ gender, ncol = 1)

Both distributions have the same right-skewed shape as before.

It is better to have the ncol = 1 to put the histograms above and below.1 It is equally good to give each histogram its own scale with scales = "free" since here we only want to compare shape. The advantage to doing it the way I did is that I can compare the medians as well.

You should use a smaller number of bins because each distribution has a smaller number of observations than in the previous part. If you draw your histograms with the same scale as I did, the number of bins is shared between the two histograms. Having a similar number of bins to before is defensible, but you should defend it.

Extra: I can also say, since I used the same scale for both graphs, that the males are on average slightly faster than the females, but there is not much difference, and there are certainly some females as fast as the fastest men:2

run12 %>% arrange(time)

The top 34 finishers were all males, but there were several females ranked in the 40s and 50s. Bear in mind that there were over 16 thousand runners overall, so finishing in the top 50 is a big achievement whoever you are.

(d) (2) Find the median and interquartile range of times for all runners together.

This is a summarize without a group_by (the latter is coming in a moment):

run12 %>% summarize(med_time = median(time), iqr_time = IQR(time))

(e) (3) Find the number of runners, the median, and the interquartile range of times for male and female runners separately. How do the male and female runners compare?

This one has a group_by, and an n() to count the number of runners in each group:

run12 %>% group_by(gender) %>% 
  summarize(n = n(), med_time = median(time), iqr_time = IQR(time))

This confirms what I said above: the males are a bit faster on average, but there is a lot of overlap (or, the males are not much faster relative to how much spread there is).

Extra: the median for all the runners is in between the median for the males and the females (as it should be), but it is closer to the median for the females because, as you see, there were more female runners than male ones.

2 Nepali child heath study

A public health study was carried out on children in Nepal. There are several variables; the ones of interest to us are:

  • id Identifier for the child. The first four digits identify the “ward” (small area) in Nepal where the mother lives, the fifth digit identifies the mother within the ward, and the last digit identifies the child within the family.
  • wt Child’s weight in kilograms
  • ht Child’s height in centimetres
  • age Child’s age in months
  • sex Child’s sex, spelled out
  • mage Mother’s age, in years
  • lit Whether the mother is literate or illiterate.

The data are in http://ritsokiguess.site/datafiles/nepali.csv.

In this question, when I ask you to display something for “all the children” or for “all the data for which something is true”, it is enough to display ten rows or as many of the columns as will fit, along with something that enables the reader to see the rest of the results if they wish.

(a) (2) Read and display some of the data.

A .csv again, so no difficulty here:

my_url <- "http://ritsokiguess.site/datafiles/nepali.csv"
nepali <- read_csv(my_url)
Rows: 1000 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): sex, lit
dbl (7): id, wt, ht, mage, died, alive, age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nepali

As ever, give your dataframe a name that says something about what it contains. nepal would also be good, as would children or even children_survey. You might get fed up with typing the last one, but if you type at least three characters of the name and pause, R Studio will offer you some completions and you can pick one.

Extra: you see that the id values are the same for several rows consecutively, which means that there are several (typically five) measurements on each child at different ages, and the two children that make up the first ten rows of the dataframe are siblings (have the same mother).

(b) (2) Display the weights and heights for all the children.

This is select:

nepali %>% 
  select(wt, ht)

This also works, because the child’s weight and height are consecutive columns:

nepali %>% 
  select(wt:ht)

There are, as you see, some missing values. Ending the pipeline without saving the result displays what you have, and does so in a way according to the specifications in the question, so doing this the rest of the way is what I would expect to see. At least, it does as long as you have df-print: paged at the top of your document.

(c) (3) Display the weights and heights, and mother’s literacy for all the children, without naming any of the columns. Describe your thought process briefly.

This means thinking of a property of the columns that will pick out these three and no others. This can be a property of the column names, or of the column contents. In this case, the column contents are mixed, two quantitative and one categorical, so it has to have something to do with the column names.

They all end in t, so ends_with is the way:

nepali %>% select(ends_with("t"))

You need to say that all the required columns have names that end in the letter t.

(d) (2) Display all the data for which the child is female.

Selecting rows, so filter, with two equals signs (logical condition) and the text it has to be equal to in quotes:

nepali %>% 
  filter(sex == "female")

(e) (3) Display the child’s age and weight for the children born of illiterate mothers. Do not display anything for children of literate mothers.

A select and a filter both, but you have to get the data for the illiterate mothers first, because the lit column is not part of the data you are going to keep:

nepali %>% filter(lit == "illiterate") %>% 
  select(age, wt)

There are only 45 measurements from children born to literate mothers (\(1000-955 = 45\)).

(f) (4) Make a graph that allows a comparison of the weights of the children born to illiterate vs. literate mothers, allowing for the fact that the children were measured at different ages. Assume a linear relationship between weight and age whether the mother is literate or not. What do you conclude? Explain briefly.

There are three variables to consider, weight and age (quantitative), and mother’s literacy (categorical). This suggests a scatterplot of weight and age, with the literacy of the mother indicated by colour. We are assessing the difference in weight “allowing for age”, so weight is the response and age is explanatory. If you draw your graph with just the points, it is very hard to interpret; the bit about assuming a linear relationship was meant to encourage you to add regression lines:

ggplot(nepali, aes(x = age, y = wt, colour = lit)) + geom_point() +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 123 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 123 rows containing missing values (`geom_point()`).

There is a lot of scatter around the regression lines, especially the red points for the illiterate mothers, but the blue line is slightly above the red line all the way up, which says that the children born to literate mothers were slightly heavier on average than those born to illiterate mothers, for any given age. With so much scatter (particularly if you didn’t add the regression lines), you can also make the case that there is no real difference between literate and illiterate mothers (in terms of the age-weight relationship). That is to say, you can go either way with your conclusion, as long as you have a reason that supports the conclusion you drew. Having a conclusion without a reason does not help your reader at all, and therefore you cannot expect any credit for the “what do you conclude” if you don’t provide a reason for your conclusion. (Think of your reader as needing to be persuaded of the correctness of your conclusion.)

The warning message is that there are 123 measurements with missing age or weight or mother’s literacy or more than one of those. These are omitted from the plot.

You might have started reading the question and thought that some kind of boxplot was needed. If you were not considering age at all, that would be reasonable, but as you see from the graph above, older children are generally heavier than younger ones, and there is a danger of biasing the comparison between literate and illiterate mothers if you fail to consider age (if the literate mothers’ children were weighed at older ages, for example).

Extra 1: Having an extra categorical variable might make you think of facets, but the extra variable we have here (age) is quantitative, and so you cannot facet by age. Not directly, anyway.

One not usually recommended approach to this kind of thing is to make the quantitative variable age categorical,3 by chopping it up into intervals: 4

nepali %>% 
  mutate(age_cat = 
         cut(age, breaks = c(0, 20, 40, 60, 80))) -> nepali1
nepali1

Age is broken up into four categories: less than 20 months, 20-40 months, etc. Now we can make a plot of weight vs age category and mother’s literacy. The textbook graph of this, with one quantitative and now two categorical variables, is a grouped boxplot. I usually like having the categorical variable with more categories be x and the other one be fill:

ggplot(nepali1, aes(x = age_cat, y = wt, fill = lit)) +
  geom_boxplot()
Warning: Removed 123 rows containing non-finite values (`stat_boxplot()`).

The story from this graph is actually the same as from the scatterplot with regression lines: no matter which age (group) you look at, children born to literate mothers are slightly heavier on average than those born to literate mothers.

As I mentioned before, another way to do the same thing is with facets. For reasons I explain below, I am making the categorized age be the facets here:

ggplot(nepali1, aes(x = lit, y = wt)) + geom_boxplot() +
  facet_wrap(~ age_cat)
Warning: Removed 123 rows containing non-finite values (`stat_boxplot()`).

The story is the same, though perhaps not visually as compelling. What do you think?

I made the age group be the facets here because the comparison I wanted to make was between children born of literate and illiterate mothers, so I wanted the “literate” and “illiterate” boxes to be side by side. (The grouped boxplot came out the same way, but that was partly by chance. If I had used age group as fill, I don’t think I would have liked it nearly as much.)

Extra 2: spaghetti plot

Something we have ignored so far is that these are not 1000 different children, but a smaller number of children measured at several different ages:

nepali %>% 
  count(id)

There are only 200 actual different children, measured (by the look of it) five times each.5

This is what is known as “repeated measures”. One way of plotting these (which will probably come out very messy here) is known, aptly, as a “spaghetti plot” where the measurements for each child (identified by id) are joined by lines:

ggplot(nepali, aes(x = age, y = wt, colour = lit, group = id)) +
  geom_point() + geom_line()
Warning: Removed 123 rows containing missing values (`geom_point()`).
Warning: Removed 93 rows containing missing values (`geom_line()`).

This plot is really too messy to see many of the individual children’s growth profiles, but you see (i) some points joined by lines, (ii) an uphill and apparently linear trend of weight by age, (iii) just how many illiterate mothers there were.

Extra 3: normalization of tables

Something else that comes out of the repeated measures shows up if we look at the original data with what we just learned:

nepali

The first five rows are of the same child, so the information about that child’s mother (in the four columns mage, lit, died, and alive) is bound to be five copies of the same thing. We can therefore be rather more concise by making two dataframes, one of the children’s information:

nepali %>% select(id:ht, age) -> nepali_children
nepali_children

and one of the mothers’ information:

nepali %>% group_by(id) %>% 
  summarize(mage = first(mage), died = first(died),
            alive = first(alive)) -> nepali_mothers
nepali_mothers

In fact, we could even go further than that: if you look in the data description for how the id is made, the last digit is the number of the child within the family, so that (eg.) children 120011 and 120012 have the same mother, and the mothers could be uniquely identified by the first five digits of the id. (You would have to check the nepali_mothers table to make sure that when the same mother has several children, the values of the other variables are all the same6).

So we can try that:

nepali %>% 
  separate_wider_position(id, c("mother_id" = 5, "cwf" = 1)) -> nepali2
nepali2

This cuts off the first five characters of id to make mother_id, and the last one to make cwf, which is short for “child within family”. I didn’t want to use id in that name, because id implies that you can identify the child uniquely with it, and with cwf, you cannot.

Hence:

nepali2 %>% 
  select(mother_id:ht, age) -> nepali_children
nepali_children

and

nepali2 %>% 
  group_by(mother_id) %>% 
  summarize(mage = first(mage), lit = first(lit),
            died = first(died), alive = first(alive)) -> nepali_mothers
nepali_mothers

and now you see that even though we had 1000 observations, we only had 113 unique mothers.

These last two tables contain all the information from our original dataframe, but they do it by being smaller and containing no redundant information (like the repeated rows of mothers’ information). Note that they both contain a column called mother_id, so that if you wanted to look up the mother’s information for the child that weighed 12.8 kg at 41 months, you could do it with a left_join.

Database people say that when your data contains no redundant information, usually by making it into a larger number of smaller tables, it is in “normal form”. An advantage to normal form is if, say, information about one of the mothers changes (maybe they’re a year older than you thought they were). To correct that, you make one change in the dataframe nepali_mothers, and you’re done. To do the same in the original dataframe nepali, you have to find all the places that that mother appears, and make sure you correct them all, because if you miss one, you have now messed up your data. This is related to what we learn later on about “tidy data”; specifically, when it says “each type of observation unit is one table”, this translates here to not mixing up observations on children and mothers (and thus, to have a separate dataframe for each).

Marks: q1: \(2+3+3+2+3 = 13\); q2: \(2+2+3+2+3+4 = 16\); total \(13+16 = 29\).

Footnotes

  1. If you miss that, you get histograms side by side, which doesn’t answer the question as written.↩︎

  2. You may not have seen arrangeyet, but it sorts the rows of a dataframe into (ascending) order by the variable you give it, which is exactly what we want if we’re looking for the fastest runners.↩︎

  3. It is not usually recommended because there is more information in an actual numerical value than the interval it falls in, particularly if you use a small number of intervals.↩︎

  4. When you do a survey, your age is usually categorized off the top, because they ask you which age group you fall into. But because of the close relationship between age and weight here, we do better to measure the age accurately.↩︎

  5. It looks as if the plan was to measure them five times each, but that didn’t always happen, hence the missing values we saw before.↩︎

  6. Now that I think about it, the mother’s age will increase as she has more children. I don’t know how mage was measured in these data. When I make the mothers’ table, I am taking the values from the first appearance of that mother, hence the first appearing in the summaries for them. Hence the mage in my mother’s table is her age when her first child was born.↩︎