STAC33 Assignment 3

You are expected to complete this assignment on your own: that is, you may discuss general ideas with others, but the writeup of the work must be entirely your own. If your assignment is unreasonably similar to that of another student, you can expect to be asked to explain yourself.

If you run into problems on this assignment, it is up to you to figure out what to do. The only exception is if it is impossible for you to complete this assignment, for example a data file cannot be read. (There is a difference between you not knowing how to do something, which you have to figure out, and something being impossible, which you are allowed to contact me about.)

You must hand in a rendered document that shows your code, the output that the code produces, and your answers to the questions. This should be a file with .html on the end of its name. There is no credit for handing in your unrendered document (ending in .qmd), because the grader cannot then see whether the code in it runs properly. After you have handed in your file, you should be able to see (in Attempts) what file you handed in, and you should make a habit of checking that you did indeed hand in what you intended to, and that it displays as you expect.

Hint: render your document frequently, and solve any problems as they come up, rather than trying to do so at the end (when you may be close to the due date). If your document will not successfully render, it is because of an error in your code that you will have to find and fix. The error message will tell you where the problem is, but it is up to you to sort out what the problem is.

1 Math achievement

Over seven thousand students did a math achievement test. For each student, their score on the test was recorded, along with a number of other variables, as described below:

  • School: a number identifying the student’s school (thus, a label rather than a meaningful number).
  • Minority: Yes if the student is a member of a minority racial group, No otherwise.
  • Sex: whether the student identifies as Male or Female.
  • SES: Socio-economic status (on some scale).
  • MathAch: score on the math achievement test.

The data are in http://ritsokiguess.site/datafiles/math_achieve.csv.

(a) (2 points) Read in and display some of the data.

A gimme two points by now:

library(tidyverse)
my_url <- "http://ritsokiguess.site/datafiles/math_achieve.csv"
math_achieve <- read_csv(my_url)
Rows: 7185 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Minority, Sex
dbl (3): School, SES, MathAch

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
math_achieve

There are 7,185 rows, which is indeed over seven thousand, with the columns as described above. Give your dataframe a descriptive name. Mine is all right because the column of math achievement scores is called MathAch (and my name for the dataframe differs from that).

(b) (3 points) Obtain the median and inter-quartile range of math achievement scores for minority and non-minority students.

A group-by and summarize:

math_achieve %>% 
  group_by(Minority) %>% 
  summarize(med_ach = median(MathAch), iqr_ach = IQR(MathAch))

The minority students actually have a lower median score on math achievement than the non-minority students.
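If you want to see what IQR is working out, here is a minimal sketch on some made-up scores (the vector scores is invented for illustration and has nothing to do with the data file). It assumes R's default way of calculating quartiles, and checks that the inter-quartile range is the third quartile minus the first:

```r
# Toy scores, invented for illustration (not from math_achieve)
scores <- c(3, 7, 8, 12, 13, 14, 18, 21)

# First and third quartiles, using R's default quantile method
q <- quantile(scores, c(0.25, 0.75))

# The inter-quartile range is Q3 minus Q1
iqr_by_hand <- unname(q[2] - q[1])
iqr_builtin <- IQR(scores)

iqr_by_hand
iqr_builtin
```

The two values should agree, because IQR uses the same default quartile method as quantile.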

(c) (3 points) The socio-economic status value was obtained for each student by completing a questionnaire about their family. I claim that the values are \(z\)-scores. What summary or summaries could you calculate that would enable you to check my claim? Do you think I was right? Explain (very) briefly.

If the SES values are \(z\)-scores, they should have mean 0 and SD 1, so let’s work out the mean and SD (of all the SES values taken together):

math_achieve %>% 
  summarize(mean_ses = mean(SES), sd_ses = sd(SES))

The mean is (very) close to zero, but the standard deviation is clearly less than 1, so I was wrong about the values being \(z\)-scores.
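To see why mean 0 and SD 1 is the thing to check, here is a small sketch with made-up values (the dataframe toy and its column x are invented for illustration). Standardizing by subtracting the mean and dividing by the SD gives values whose mean is 0 and whose SD is 1, apart from rounding error:

```r
library(dplyr)

# Toy values, invented for illustration
toy <- tibble(x = c(2, 4, 4, 5, 7, 8))

z_check <- toy %>%
  mutate(z = (x - mean(x)) / sd(x)) %>%        # turn x into z-scores
  summarize(mean_z = mean(z), sd_z = sd(z))    # summarize the z-scores
z_check
```

This assumes the standardizing was done using the mean and SD of the same values you are summarizing; if the SES values were standardized against some larger population, the mean and SD in our sample would not have to be exactly 0 and 1.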

Extra: a histogram should have a bell-curve shape also. I have lots of data, so I can use lots of bins in my histogram:

ggplot(math_achieve, aes(x = SES)) + geom_histogram(bins = 15)

This looks reasonably bell-shaped until you get to the last bin, which seems to have way too few observations in it. You might say that the reason for the SD being too small is that there are not enough big (very positive) observations. If there were as many observations above about 1.5 as there are below -1.5, you could imagine this bringing the SD up towards 1.

(I don’t actually know whether the SES values are \(z\)-scores, but it seems at least plausible that they would scale the questionnaire scores by something like subtracting a mean and dividing by an SD.)

(d) (3 points) Find the mean SES value for each school. Save the results, in a dataframe called school_mean_ses, and display some of them.

This is also a group-by and summarize, but of a different thing:

math_achieve %>% 
  group_by(School) %>% 
  summarize(mean_ses = mean(SES)) -> school_mean_ses
school_mean_ses

I asked you to save this because we are going to use it again later. I think the cleanest way to save it is to use the right-arrow assignment at the end, but you could also use a regular left-arrow assignment right at the beginning, like this:

d <- math_achieve %>% 
  group_by(School) %>% 
  summarize(mean_ses = mean(SES)) 
d

I personally don’t like this as much because you have to read the pipes down to the end and then jump back to the beginning to see what happens to the result (at first glance, it looks as if all you are doing is displaying the result, but the first line reveals that you are saving something in d, and you have to read down to find out what the something is). Having said this, if you like it better, you can certainly do it this way; you may find that the benefit of having a familiar way of saving the result in d outweighs the jumping around you have to do to figure out what is being saved in d. I have no objection to you doing it this way.

Giving you the name to save it in is my attempt to make the grader’s task easier, for when we use this dataframe again.

(e) (3 points) Display the columns that are text, without naming any columns. Hint: is.character is TRUE for columns that are text, and FALSE otherwise.

This is select with a where and something that is TRUE for the columns you want:

math_achieve %>% select(where(is.character))

In R, “!” means “not”, so you could also (with some care) do this:

math_achieve %>% select(where(\(x) !is.numeric(x)))

since all the columns are either numbers or text. The reason for the care is that the thing inside where has to be a function, either the name of a function (like is.character), or a function you write with a name (see later in the course), or one of our disposable nameless functions. This one reads as “select the columns where it is true that the column is not numeric”, which is a bit wordier than the version above.
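The two versions agree here only because every column is a number or text. A sketch with a made-up dataframe (toy and its columns are invented for illustration) shows what can happen otherwise: a logical column is not text, but it is not numeric either, so the "not numeric" version picks it up as well:

```r
library(dplyr)

# Toy dataframe with a logical column; names and values are invented
toy <- tibble(
  id = c(1, 2, 3),
  name = c("a", "b", "c"),
  passed = c(TRUE, FALSE, TRUE)
)

text_cols <- toy %>% select(where(is.character))
non_numeric_cols <- toy %>% select(where(\(x) !is.numeric(x)))

names(text_cols)         # only the text column
names(non_numeric_cols)  # the text column and the logical column
```

So is.character is the safer choice when what you actually want is the text columns.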

(f) (2 points) Display the columns whose names have the letters “S” and “E” (consecutively, not case-sensitive) in them somewhere, without naming any columns.

This is precisely what contains does:

math_achieve %>% select(contains("se"))

This is not (as you see) case-sensitive, so it answers the question.

You might be used to thinking of regular expressions for this kind of task, which would suggest matches:

math_achieve %>% select(matches("se"))
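Here is a sketch of the difference between the two, using a made-up one-row dataframe that has the same column names as ours (the values are invented). Both contains and matches ignore case by default, but contains looks for literal text, while matches takes a regular expression and so can do things like "starts with s":

```r
library(dplyr)

# One-row toy dataframe with the same column names as math_achieve (values invented)
toy <- tibble(School = 1, Minority = "No", Sex = "Male", SES = 0.5, MathAch = 10)

by_contains <- toy %>% select(contains("se"))  # literal text, ignoring case
by_matches  <- toy %>% select(matches("se"))   # regular expression, ignoring case
starts_s    <- toy %>% select(matches("^s"))   # something contains cannot express

names(by_contains)
names(by_matches)
names(starts_s)
```

For a plain piece of text like "se" the two give the same columns, so either one answers the question.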

(g) (3 points) For the students with an SES greater than 1 (only), find the number of students, and the mean and standard deviation of math achievement scores.

This requires a filter first to select the students we want, and then a summarize with an n() as well as a mean and SD:

math_achieve %>% 
  filter(SES > 1) %>% 
  summarize(n = n(), mean_ach = mean(MathAch), sd_ach = sd(MathAch))

(h) (2 points) Repeat the previous part, but obtain the summary statistics for males and for females.

This means inserting a group_by(Sex) before the summarize. This can be before or after the filter; either is good. To me, it’s more logical after, since we are used to group_by and summarize going together:

math_achieve %>% 
  filter(SES > 1) %>% 
  group_by(Sex) %>% 
  summarize(n = n(), mean_ach = mean(MathAch), sd_ach = sd(MathAch))
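If you want to convince yourself that the order of filter and group_by does not matter here, this is a sketch on a small made-up dataframe (toy and its values are invented for illustration) that does it both ways and compares the results:

```r
library(dplyr)

# Toy data, invented for illustration
toy <- tibble(
  Sex = c("Female", "Male", "Female", "Male", "Female", "Male"),
  SES = c(1.2, 0.4, 1.5, 1.1, -0.3, 1.6),
  MathAch = c(15, 9, 18, 17, 6, 20)
)

# filter first, then group
filter_first <- toy %>%
  filter(SES > 1) %>%
  group_by(Sex) %>%
  summarize(n = n(), mean_ach = mean(MathAch))

# group first, then filter
group_first <- toy %>%
  group_by(Sex) %>%
  filter(SES > 1) %>%
  summarize(n = n(), mean_ach = mean(MathAch))

same_result <- isTRUE(all.equal(as.data.frame(filter_first), as.data.frame(group_first)))
same_result
```

The order is harmless because SES > 1 is checked one row at a time. If the condition involved a summary, such as SES > mean(SES), then grouping first would compare each student with the mean for their own group, and the two orders could give different answers.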

For these high-SES students, the males score a little higher on average than the females.

Only two points, because it’s a small addition to what you did before.

Extra: I was wondering about whether there is a male-female difference over other levels of SES as well. The problem is that SES is a quantitative variable (that our filter above has sort-of categorized), and so the kind of summary that we just did only works if SES is made categorical. Usually, if you have a quantitative variable, you want to keep it quantitative (or else you are throwing away information), but if you really want to cut up a quantitative variable into categories, you can do something like this (as in the question about the hurricanes):

math_achieve %>% mutate(ses_cat = cut(SES, breaks = c(-10, -1, 0, 1, 10))) %>% 
  group_by(ses_cat, Sex) %>% 
  summarize(n = n(), mean_ach = mean(MathAch), sd_ach = sd(MathAch))
`summarise()` has grouped output by 'ses_cat'. You can override using the
`.groups` argument.
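To see which category a value ends up in, here is a sketch using the same breaks on a few made-up SES values (ses_values is invented for illustration). By default, the intervals that cut makes include their upper endpoint but not their lower one:

```r
# A few SES values, invented for illustration, including two that sit exactly on a break
ses_values <- c(-1.5, -1, 0.5, 1, 2)

# Same breaks as above; intervals are of the form (lower, upper]
ses_cat <- cut(ses_values, breaks = c(-10, -1, 0, 1, 10))

data.frame(ses_values, ses_cat)
levels(ses_cat)
```

A student with SES exactly 1 goes in (0,1], so the top category (1,10] contains the same students as filter(SES > 1) did, as long as no SES value is bigger than 10.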

The males do in fact outscore the females on average over all my SES categories. Not only that, the mean achievement scores increase with SES: being in a family of high socio-economic status is associated with being better at math, on average. Those differences between males and females are small, but they might be significant (in the sense of statistical significance) because the sample sizes are so large. To think about whether those differences are worth getting excited about (in the sense of practical importance), I thought about going back to the actual data and drawing some graphs. With quantitative MathAch and SES and categorical Sex, a starting point is a scatterplot with the sexes coloured:

ggplot(math_achieve, aes(x = SES, y = MathAch, colour = Sex)) + geom_point()

This is a terrible graph, because there is so much data, but the males and the females seem to be very mixed up, and there doesn’t seem to be any indication that males score higher than females.1

With two groups, if you add “a” regression line to the plot, you get one regression line for each group:

ggplot(math_achieve, aes(x = SES, y = MathAch, colour = Sex)) + geom_point() +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

This is where the differences in our summary came from: if SES is higher, the math achievement score is also predicted to be higher, and for any SES level, males are predicted to score a little higher than females. Because there is so much data, the grey envelopes around the lines are small,2 and hence that difference between males and females is probably (statistically) significant, but look how small it is compared to the data values that are literally all over the place! This male-female difference is, I think, definitely not something to get excited about.
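With colour = Sex in the aes, geom_smooth fits a separate regression line to each group. A sketch of the same idea done by hand, on made-up data (toy and its values are invented, and chosen to lie exactly on two lines so the answers are easy to check), is to split the data by group and fit a regression to each piece:

```r
# Toy data, invented for illustration: each group lies exactly on a line
toy <- data.frame(
  SES = c(-1, 0, 1, 2, -1, 0, 1, 2),
  MathAch = c(8, 10, 12, 14, 7, 9, 11, 13),
  Sex = rep(c("Male", "Female"), each = 4)
)

# One regression per group: split the rows by Sex, fit lm to each piece,
# and keep the intercept and slope
fits <- lapply(split(toy, toy$Sex), \(d) coef(lm(MathAch ~ SES, data = d)))
fits
```

Each group gets its own intercept and slope. In this toy data the slopes are the same and the intercepts differ by 1, so the male line sits one point above the female line all the way along.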

(i) (3 points) Create a dataframe that has, for each student, the mean SES score for their school, together with all the other variables in the dataframe you read in from the file. Display (some of) this dataframe.

Here’s what we read in from the file:

math_achieve

and here’s the dataframe we made and saved earlier (in part (d)):

school_mean_ses

What we want to do is, for each student in the first dataframe, to look up their school’s mean SES in the second dataframe. “Look up” should suggest to you a left-join, like this:3

math_achieve %>% 
  left_join(school_mean_ses)
Joining with `by = join_by(School)`

If you scroll down to a student in the next school (which has ID 57), you’ll see that mean_ses changes, and then stays the same until you get to the next school after that.

This is the easier version of left_join because both dataframes have a column called School and no other columns with the same name, so School is what the join will match by. It doesn’t hurt, though, to specify what you are matching by explicitly, which goes like this:

math_achieve %>% 
  left_join(school_mean_ses, join_by(School))

This would protect you in the case that somebody changed the dataframe you read in from the file, or you changed what you saved in school_mean_ses (in (d)), and there ended up being more than one column with the same name.
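Here is a sketch of the lookup on a small made-up dataframe (students, school_means and the values in them are invented for illustration). It also shows a different way to get the same column without a join, namely a group_by followed by a mutate instead of a summarize; this is not what the question asked for, but it makes a handy check:

```r
library(dplyr)

# Toy data, invented for illustration: three students in school 1, two in school 2
students <- tibble(
  School = c(1, 1, 1, 2, 2),
  SES = c(-1, 0, 1, 2, 4)
)

# One row per school, as in part (d)
school_means <- students %>%
  group_by(School) %>%
  summarize(mean_ses = mean(SES))

# Look up each student's school mean
joined <- students %>%
  left_join(school_means, join_by(School))

# Alternative: grouped mutate keeps every row and repeats the group mean
via_mutate <- students %>%
  group_by(School) %>%
  mutate(mean_ses = mean(SES)) %>%
  ungroup()

joined
via_mutate
```

The join keeps one row per student, and every student in the same school gets the same value of mean_ses.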

Footnotes

  1. The plot also seems to show that there is an upper limit on the math achievement scores, and an upper limit with some exceptions to the SES values. The latter might explain why the SES scores had an SD that was too small for them to be \(z\)-scores; they might be \(z\)-scores but with an upper limit sometimes.

  2. That is to say, the lines are estimated accurately, in the C67 sense of having intercepts and slopes that have small SDs, or that the confidence intervals for the mean response are short all the way along. The latter is actually what the grey envelopes are.

  3. Seeing that a left-join will solve this problem is perhaps the hardest thing here.