Chapter 5 Working with dataframes
5.1 Tidying the Jays data
This question is about the Blue Jays data set (that I used in class).
The Blue Jays baseball data set is at link. Read it into R. Check that you have 25 rows and a bunch of variables.
Pick out only the games that were against the New York Yankees (the variable
opp
is equal toNYY
). Investigate all the columns. What do you notice about these games?From the whole data frame, pick out only the games where the attendance was more than 30,000, showing only the columns
attendance
andDaynight
. How many of them are there (just count them)? How many are day games and how many night games (just count those too)?Display the mean and standard deviation of attendances at all day and night games.
Make normal quantile plots of the day attendances and the night attendances, separately. Do you see any evidence of non-normality? (You would expect to on the night attendances because of the big opening-night value.)
5.2 Cars
My cars data file can be found at
link.
The values in
the data file are separated by commas; the car names are up to 29
characters long. Display your results for each part after (a). In R,
displaying a tibble
normally shows its first ten lines, which
is all you need here; there’s no need to display all the lines.
Read the data into R and list the values.
Display only the car names and the countries they come from.
Display everything except horsepower:
Display only the cars that have 8-cylinder engines (but display all the variables for those cars).
Display the cylinders and horsepower for the cars that have horsepower 70 or less.
Find the mean and SD of gas mileage of the cars with 4 cylinders.
5.3 Dolphins
Dolphins and other large marine mammals are at the top of the marine food chain, and so if there is any heavy metal pollution in the sea, it will find its way into the dolphins. The study we look at is of the concentration of mercury. This is expected to be different in males and females because the mercury in a female is transferred to her offspring during gestation and nursing. In this study, there were 28 males and 17 females of various ages. There are three columns in the data file:
mercury
, the concentration in the liver, in micrograms per gramage
in yearssex
of the dolphin, male or female.
The data are in http://ritsokiguess.site/datafiles/dolphins.csv as a CSV file. This question appears to have a lot of parts, but most of them ought not to take you too long.
Read in and display (some of) the data.
Display only the two columns
mercury
andsex
.Display all the columns whose names have exactly three characters, without naming any columns.
Display only the mercury levels for the females.
What is the mean mercury concentration for all the dolphins whose age is less than 15?
What is the mean mercury concentration for all the dolphins whose age is greater than 25?
Make a suitable graph of these data (all three columns).
Explain briefly how your graph and your calculations of mean mercury concentration are telling a similar story.
My solutions follow:
5.4 Tidying the Jays data
This question is about the Blue Jays data set (that I used in class).
- The Blue Jays baseball data set is at link. Read it into R. Check that you have 25 rows and a bunch of variables.
Solution
Save the URL into a variable and then read from the URL, using
read_csv
because it’s a .csv
file:
##
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## cols(
## .default = col_character(),
## row = col_double(),
## game = col_double(),
## venue = col_logical(),
## runs = col_double(),
## Oppruns = col_double(),
## innings = col_double(),
## position = col_double(),
## `game time` = col_time(format = ""),
## attendance = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
## # A tibble: 25 x 21
## row game date box team venue opp result runs Oppruns innings wl position gb winner loser save `game time` Daynight attendance streak
## <dbl> <dbl> <chr> <chr> <chr> <lgl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <chr> <chr> <chr> <time> <chr> <dbl> <chr>
## 1 82 7 Monday, Apr 13 boxscore TOR NA TBR L 1 2 NA 4-3 2 1 Odorizzi Dickey Boxberg… 02:30 N 48414 -
## 2 83 8 Tuesday, Apr 14 boxscore TOR NA TBR L 2 3 NA 4-4 3 2 Geltz Castro Jepsen 03:06 N 17264 --
## 3 84 9 Wednesday, Apr… boxscore TOR NA TBR W 12 7 NA 5-4 2 1 Buehrle Ramir… <NA> 03:02 N 15086 +
## 4 85 10 Thursday, Apr … boxscore TOR NA TBR L 2 4 NA 5-5 4 1.5 Archer Sanch… Boxberg… 03:00 N 14433 -
## 5 86 11 Friday, Apr 17 boxscore TOR NA ATL L 7 8 NA 5-6 4 2.5 Martin Cecil Grilli 03:09 N 21397 --
## 6 87 12 Saturday, Apr … boxscore TOR NA ATL W-wo 6 5 10 6-6 3 1.5 Cecil Marim… <NA> 02:41 D 34743 +
## 7 88 13 Sunday, Apr 19 boxscore TOR NA ATL L 2 5 NA 6-7 4 1.5 Miller Norris Grilli 02:41 D 44794 -
## 8 89 14 Tuesday, Apr 21 boxscore TOR NA BAL W 13 6 NA 7-7 2 2 Buehrle Norris <NA> 02:53 N 14184 +
## 9 90 15 Wednesday, Apr… boxscore TOR NA BAL W 4 2 NA 8-7 2 1 Sanchez Jimen… Castro 02:36 N 15606 ++
## 10 91 16 Thursday, Apr … boxscore TOR NA BAL W 7 6 NA 9-7 1 Tied Hutchison Tillm… Castro 02:36 N 18581 +++
## # … with 15 more rows
If you must, copy and paste the spreadsheet into R Studio, and read it
in with read_delim
(or possibly read_tsv
), but
this runs the risk of being defeated by spreadsheet cells that contain
spaces. I don’t think there are any here, but you might run into a
pitcher whose name has more than one word, like (Andy) Van Hekken, who
is in the Seattle Mariners farm system.1
Anyway, 25 rows and 21 columns. As usual, it’s a tibble, so you see 10
rows and as many columns as will fit. This is often enough to see
whether we have the right thing (as we appear to have, here). You can
run through all the columns and check that they’re the right kind of
thing; most of them are text with a few numbers and one time
,
which is game time
, the length of the game in hours and
minutes, which is turned into an R time
in hours, minutes and
seconds.
With all those columns, read_csv
doesn’t tell you what
column specification it inferred for all of them, but you can type
## cols(
## row = col_double(),
## game = col_double(),
## date = col_character(),
## box = col_character(),
## team = col_character(),
## venue = col_logical(),
## opp = col_character(),
## result = col_character(),
## runs = col_double(),
## Oppruns = col_double(),
## innings = col_double(),
## wl = col_character(),
## position = col_double(),
## gb = col_character(),
## winner = col_character(),
## loser = col_character(),
## save = col_character(),
## `game time` = col_time(format = ""),
## Daynight = col_character(),
## attendance = col_double(),
## streak = col_character()
## )
to find it all out.
\(\blacksquare\)
- Pick out only the games that were against the New York Yankees
(the variable
opp
is equal toNYY
). Investigate all the columns. What do you notice about these games?
Solution
This:
## # A tibble: 3 x 21
## row game date box team venue opp result runs Oppruns innings wl position gb winner loser save `game time` Daynight attendance streak
## <dbl> <dbl> <chr> <chr> <chr> <lgl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <chr> <chr> <chr> <time> <chr> <dbl> <chr>
## 1 92 27 Monday, May 4 boxscore TOR NA NYY W 3 1 NA 13-14 4 3.5 Dickey Martin Cecil 02:18 N 19217 +
## 2 93 28 Tuesday, May 5 boxscore TOR NA NYY L 3 6 NA 13-15 5 4.5 Pineda Estrada Miller 02:54 N 21519 -
## 3 94 29 Wednesday, May 6 boxscore TOR NA NYY W 5 1 NA 14-15 3 3.5 Buehrle Sabathia <NA> 02:30 N 21312 +
but you will probably need to click the little right-arrow at the top to see more columns.
What I notice is that these games are all on consecutive nights (against the same team). This is quite common, and goes back to the far-off days when teams travelled by train: teams play several games on one visit, rather than coming back many times.2 You might have noticed something else; that’s fine for this. For example, “each of the games lasted less than three hours”, or “the attendances were all small” (since we looked at all the attendances in class). I just want you to notice something meaningful that seems to be interesting about these games.
You could also print all the columns in two or more goes, using
select
, for example:
## # A tibble: 3 x 11
## row game date box team venue opp result runs Oppruns innings
## <dbl> <dbl> <chr> <chr> <chr> <lgl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 92 27 Monday, May 4 boxscore TOR NA NYY W 3 1 NA
## 2 93 28 Tuesday, May 5 boxscore TOR NA NYY L 3 6 NA
## 3 94 29 Wednesday, May 6 boxscore TOR NA NYY W 5 1 NA
## # A tibble: 3 x 10
## wl position gb winner loser save `game time` Daynight attendance streak
## <chr> <dbl> <chr> <chr> <chr> <chr> <time> <chr> <dbl> <chr>
## 1 13-14 4 3.5 Dickey Martin Cecil 02:18 N 19217 +
## 2 13-15 5 4.5 Pineda Estrada Miller 02:54 N 21519 -
## 3 14-15 3 3.5 Buehrle Sabathia <NA> 02:30 N 21312 +
\(\blacksquare\)
- From the whole data frame, pick out only the games where the
attendance was more than 30,000, showing only the columns
attendance
andDaynight
. How many of them are there (just count them)? How many are day games and how many night games (just count those too)?
Solution
Two steps, since we selecting rows and columns.
## # A tibble: 8 x 2
## attendance Daynight
## <dbl> <chr>
## 1 48414 N
## 2 34743 D
## 3 44794 D
## 4 30430 N
## 5 42917 D
## 6 42419 D
## 7 33086 D
## 8 37929 D
Or this way, since we are selecting consecutive columns:
## # A tibble: 8 x 2
## Daynight attendance
## <chr> <dbl>
## 1 N 48414
## 2 D 34743
## 3 D 44794
## 4 N 30430
## 5 D 42917
## 6 D 42419
## 7 D 33086
## 8 D 37929
There are eight games selected (see the eight rows in the result). Only two of them are night games, while the other six are day (weekend) games.
If you wanted to, you could automate the counting, like this:
## # A tibble: 2 x 2
## Daynight n
## <chr> <int>
## 1 D 6
## 2 N 2
Six day games and two night games.
\(\blacksquare\)
- Display the mean and standard deviation of attendances at all day and night games.
Solution
Two steps: the grouping according to what I want to group by, then summarizing according to what I want to summarize by. Since I am summarizing, only the summaries find their way into the final data frame, so I don’t need to “select out” the other variables:
## # A tibble: 2 x 3
## Daynight mean.att sd.att
## <chr> <dbl> <dbl>
## 1 D 37885. 5775.
## 2 N 20087. 8084.
The mean attendances are about 38 thousand and about 20 thousand. Note that the night games have much the larger SD, possibly because of the large outlier night attendance (opening night). Which we can also investigate.
jays %>%
group_by(Daynight) %>%
summarize(
median.att = median(attendance),
iqr.att = IQR(attendance)
)
## # A tibble: 2 x 3
## Daynight median.att iqr.att
## <chr> <dbl> <dbl>
## 1 D 37929 8754.
## 2 N 17928. 6005.
This time, the night attendances have a smaller spread and a noticeably smaller median (compared to the mean), so it must have been the outlier that made the difference. There was another high value that R marked as an outlier:
So when you take away those unusual values, the night game attendances are indeed less variable.
The right test, as you might have guessed, for comparing the medians of these non-normal data, is Mood’s median test:
## $table
## above
## group above below
## D 7 0
## N 5 12
##
## $test
## what value
## 1 statistic 9.882352941
## 2 df 1.000000000
## 3 P-value 0.001668714
There was one attendance exactly equal to the overall median (as you
would expect: with an odd number of data values, the median is one of
the data values). smmr
removed it; if you did the test by
hand, what happens to it depends on whether you counted aboves or
belows, and this will have a small effect on the P-value, though not
on the conclusion.
The overall median attendance was 21,000, and none of the day games had attendance less than that. With the small frequencies, the accuracy of the P-value is a bit questionable, but taking it at face value, there is a significant difference between median attendances at day and night games.3
\(\blacksquare\)
- Make normal quantile plots of the day attendances and the night attendances, separately. Do you see any evidence of non-normality? (You would expect to on the night attendances because of the big opening-night value.)
Solution
The best way to do this is facetted normal quantile plots. Remember that the facetting part goes right at the end:
ggplot(jays, aes(sample = attendance)) +
stat_qq() + stat_qq_line() +
facet_wrap(~Daynight, ncol = 1)
The day attendances are pretty normal, though it is hard to be sure with only 7 of them.
The night attendances are not normal. The lone point top right is the outlier. On top of that, the lowest attendances are not quite low enough and the second-highest attendance is a bit too high, so there is a bit of evidence of right-skewness as well as just the one outlier.
If you leave out the ncol=1
, you’ll get the two normal
quantile plots side by side (which means that each one is tall and
skinny, and thus hard to read). The ncol=1
displays all the
facets in one column, and though it would be nice to have the
graphs be about square, landscape mode is easier to read than portrait
mode.
One of the reasons for skewness is often a limit on the values of the variable. The Rogers Centre has a capacity around 55,000. The day game attendances don’t get especially close to that, which suggests that everyone who wants to go to the game can get a ticket. In that sort of situation, you’d expect attendances to vary around a “typical” value, with a random deviation that depends on things like the weather and the opposing team, which is the typical situation in which you get bell-shaped data. (If the Jays often sold out their stadium for day games, you’d see a lot of attendances close to the capacity, with a few lower: ie., a left skew.)
As for the night games, well, there seems to be a minimum attendance that the Blue Jays get, somewhere around 15,000: no matter who they’re playing or what the weather’s like, this many people will show up (season-ticket holders, for example). On special occasions, such as opening night, the attendance will be much bigger, which points to a right skew.
\(\blacksquare\)
5.5 Cars
My cars data file can be found at
link.
The values in
the data file are separated by commas; the car names are up to 29
characters long. Display your results for each part after (a). In R,
displaying a tibble
normally shows its first ten lines, which
is all you need here; there’s no need to display all the lines.
- Read the data into R and list the values.
Solution
read_csv
will do it:
##
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## cols(
## car = col_character(),
## MPG = col_double(),
## weight = col_double(),
## cylinders = col_double(),
## hp = col_double(),
## country = col_character()
## )
## # A tibble: 38 x 6
## car MPG weight cylinders hp country
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Buick Skylark 28.4 2.67 4 90 U.S.
## 2 Dodge Omni 30.9 2.23 4 75 U.S.
## 3 Mercury Zephyr 20.8 3.07 6 85 U.S.
## 4 Fiat Strada 37.3 2.13 4 69 Italy
## 5 Peugeot 694 SL 16.2 3.41 6 133 France
## 6 VW Rabbit 31.9 1.92 4 71 Germany
## 7 Plymouth Horizon 34.2 2.2 4 70 U.S.
## 8 Mazda GLC 34.1 1.98 4 65 Japan
## 9 Buick Estate Wagon 16.9 4.36 8 155 U.S.
## 10 Audi 5000 20.3 2.83 5 103 Germany
## # … with 28 more rows
\(\blacksquare\)
- Display only the car names and the countries they come from.
Solution
## # A tibble: 38 x 2
## car country
## <chr> <chr>
## 1 Buick Skylark U.S.
## 2 Dodge Omni U.S.
## 3 Mercury Zephyr U.S.
## 4 Fiat Strada Italy
## 5 Peugeot 694 SL France
## 6 VW Rabbit Germany
## 7 Plymouth Horizon U.S.
## 8 Mazda GLC Japan
## 9 Buick Estate Wagon U.S.
## 10 Audi 5000 Germany
## # … with 28 more rows
This almost works, but not quite:
## # A tibble: 38 x 3
## car cylinders country
## <chr> <dbl> <chr>
## 1 Buick Skylark 4 U.S.
## 2 Dodge Omni 4 U.S.
## 3 Mercury Zephyr 6 U.S.
## 4 Fiat Strada 4 Italy
## 5 Peugeot 694 SL 6 France
## 6 VW Rabbit 4 Germany
## 7 Plymouth Horizon 4 U.S.
## 8 Mazda GLC 4 Japan
## 9 Buick Estate Wagon 8 U.S.
## 10 Audi 5000 5 Germany
## # … with 28 more rows
It gets all the columns that start with c
, which
includes cylinders
as well.
\(\blacksquare\)
- Display everything except horsepower:
Solution
Naming what you don’t want is sometimes easier:
## # A tibble: 38 x 5
## car MPG weight cylinders country
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 Buick Skylark 28.4 2.67 4 U.S.
## 2 Dodge Omni 30.9 2.23 4 U.S.
## 3 Mercury Zephyr 20.8 3.07 6 U.S.
## 4 Fiat Strada 37.3 2.13 4 Italy
## 5 Peugeot 694 SL 16.2 3.41 6 France
## 6 VW Rabbit 31.9 1.92 4 Germany
## 7 Plymouth Horizon 34.2 2.2 4 U.S.
## 8 Mazda GLC 34.1 1.98 4 Japan
## 9 Buick Estate Wagon 16.9 4.36 8 U.S.
## 10 Audi 5000 20.3 2.83 5 Germany
## # … with 28 more rows
\(\blacksquare\)
- Display only the cars that have 8-cylinder engines (but display all the variables for those cars).
Solution
This:
## # A tibble: 8 x 6
## car MPG weight cylinders hp country
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Buick Estate Wagon 16.9 4.36 8 155 U.S.
## 2 Chevy Malibu Wagon 19.2 3.60 8 125 U.S.
## 3 Chrysler LeBaron Wagon 18.5 3.94 8 150 U.S.
## 4 Ford LTD 17.6 3.72 8 129 U.S.
## 5 Dodge St Regis 18.2 3.83 8 135 U.S.
## 6 Ford Country Squire Wagon 15.5 4.05 8 142 U.S.
## 7 Mercury Grand Marquis 16.5 3.96 8 138 U.S.
## 8 Chevy Caprice Classic 17 3.84 8 130 U.S.
8 of them, all from the US.
\(\blacksquare\)
- Display the cylinders and horsepower for the cars that have horsepower 70 or less.
Solution
This one is selecting some observations and some variables:
## # A tibble: 6 x 2
## cylinders hp
## <dbl> <dbl>
## 1 4 69
## 2 4 70
## 3 4 65
## 4 4 65
## 5 4 68
## 6 4 68
Cylinders and horsepower are consecutive columns, so we can select
them either using the colon :
or by
c(cylinders,hp)
.
You can also do the filter
and the
select
the other way around.
This one works because the rows you want to
choose are determined by a column you’re going to keep. If you wanted
to display the cylinders and horsepower of the cars with mpg
over 30, you would have to choose the rows first, because after you’ve
chosen the columns, there is no mpg
any more.
\(\blacksquare\)
- Find the mean and SD of gas mileage of the cars with 4 cylinders.
Solution
## # A tibble: 1 x 2
## m s
## <dbl> <dbl>
## 1 30.0 4.18
Or you can get the mean and SD of gas mileage for all numbers of cylinders, and pick out the one you want:
## # A tibble: 4 x 3
## cylinders m s
## <dbl> <dbl> <dbl>
## 1 4 30.0 4.18
## 2 5 20.3 NA
## 3 6 21.1 4.08
## 4 8 17.4 1.19
Top row is the same as before. And since the output is a data frame, you can do any of these things with that, for example:
## # A tibble: 1 x 3
## cylinders m s
## <dbl> <dbl> <dbl>
## 1 4 30.0 4.18
to pick out just the right row. This is a very easy kind of question to set on an exam. Just so you know.
\(\blacksquare\)
5.6 Dolphins
Dolphins and other large marine mammals are at the top of the marine food chain, and so if there is any heavy metal pollution in the sea, it will find its way into the dolphins. The study we look at is of the concentration of mercury. This is expected to be different in males and females because the mercury in a female is transferred to her offspring during gestation and nursing. In this study, there were 28 males and 17 females of various ages. There are three columns in the data file:
mercury
, the concentration in the liver, in micrograms per gramage
in yearssex
of the dolphin, male or female.
The data are in http://ritsokiguess.site/datafiles/dolphins.csv as a CSV file. This question appears to have a lot of parts, but most of them ought not to take you too long.
- Read in and display (some of) the data.
Solution
Exactly the usual:
##
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## cols(
## mercury = col_double(),
## age = col_double(),
## sex = col_character()
## )
## # A tibble: 45 x 3
## mercury age sex
## <dbl> <dbl> <chr>
## 1 1.7 0.21 male
## 2 1.72 0.33 male
## 3 8.8 2 male
## 4 5.9 2.2 male
## 5 101 8.5 male
## 6 85.4 11.5 male
## 7 118 11.5 male
## 8 183 13.5 male
## 9 168 16.5 male
## 10 218 16.5 male
## # … with 35 more rows
There are indeed 45 dolphins, with the males listed first and the females listed at the end (as you can check).
\(\blacksquare\)
- Display only the two columns
mercury
andsex
.
Solution
Two ways to do it: either list both columns:
## # A tibble: 45 x 2
## mercury sex
## <dbl> <chr>
## 1 1.7 male
## 2 1.72 male
## 3 8.8 male
## 4 5.9 male
## 5 101 male
## 6 85.4 male
## 7 118 male
## 8 183 male
## 9 168 male
## 10 218 male
## # … with 35 more rows
or say “everything but age”:
## # A tibble: 45 x 2
## mercury sex
## <dbl> <chr>
## 1 1.7 male
## 2 1.72 male
## 3 8.8 male
## 4 5.9 male
## 5 101 male
## 6 85.4 male
## 7 118 male
## 8 183 male
## 9 168 male
## 10 218 male
## # … with 35 more rows
The second one is easier coding, so is better.
Note that mercury:sex
won’t do it, because that will get you age
as well (“mercury
through sex
”).
\(\blacksquare\)
- Display all the columns whose names have exactly three characters, without naming any columns.
Solution
Evidently this is age
and sex
. We are selecting columns on the basis of their names, so we will need a select-helper. The one that works is matches
with a regular expression. This is like the one in the lecture notes where we selected height and weight by selecting columns whose names had two letters and the second one was “t”. In this case, the three letters can be anything, so it’s three dots between the start of text and end of text:
## # A tibble: 45 x 2
## age sex
## <dbl> <chr>
## 1 0.21 male
## 2 0.33 male
## 3 2 male
## 4 2.2 male
## 5 8.5 male
## 6 11.5 male
## 7 11.5 male
## 8 13.5 male
## 9 16.5 male
## 10 16.5 male
## # … with 35 more rows
\(\blacksquare\)
- Display only the mercury levels for the females.
Solution
This is selecting rows and columns, so you’ll have a select
and a filter
. If you do the select
first, you’ll only have the mercury levels left, and you won’t know which dolphins are female. So do the filter
first:
## # A tibble: 17 x 1
## mercury
## <dbl>
## 1 2.5
## 2 9.35
## 3 4.01
## 4 29.8
## 5 45.3
## 6 101
## 7 135
## 8 142
## 9 180
## 10 174
## 11 247
## 12 223
## 13 167
## 14 157
## 15 177
## 16 475
## 17 342
\(\blacksquare\)
- What is the mean mercury concentration for all the dolphins whose age is less than 15?
Solution
Grab only the dolphins with age less than 15 first, and then work out their mean mercury
with summarize
:
## # A tibble: 1 x 1
## m
## <dbl>
## 1 55.5
55.5 (micrograms per gram), as you should say.
Another way is to define a new column that indicates whether the age is less than 15 or not, and do group-by and summarize:
dolphins %>%
mutate(young = ifelse(age<15, "yes", "no")) %>%
group_by(young) %>%
summarize(m = mean(mercury))
## # A tibble: 2 x 2
## young m
## <chr> <dbl>
## 1 no 280.
## 2 yes 55.5
Here, you definitely need to extract the right number out of the output. You can also define groups by logical condition directly, which saves you a step:
## # A tibble: 2 x 2
## `age < 15` m
## <lgl> <dbl>
## 1 FALSE 280.
## 2 TRUE 55.5
Doing it this way, you are probably getting some suspicions about the relationship between age and mercury concentration, but that’s coming up.
\(\blacksquare\)
- What is the mean mercury concentration for all the dolphins whose age is greater than 25?
Solution
The same three choices of method:
## # A tibble: 1 x 1
## m
## <dbl>
## 1 309.
309.2 (micrograms per gram), as you should say.
Or define a new column:
dolphins %>%
mutate(young = ifelse(age>25, "yes", "no")) %>%
group_by(young) %>%
summarize(m = mean(mercury))
## # A tibble: 2 x 2
## young m
## <chr> <dbl>
## 1 no 142.
## 2 yes 309.
Or:
## # A tibble: 2 x 2
## `age > 25` m
## <lgl> <dbl>
## 1 FALSE 142.
## 2 TRUE 309.
In any of these ways, 309.2 micrograms per gram.
\(\blacksquare\)
- Make a suitable graph of these data (all three columns).
Solution
This one ought to be a gimme, but I am using the result in the next part. With two quantitative variables and one categorical, a scatterplot with the sexes indicated by colour:
I think this is clearer without regression lines; the upward trend is clear enough, and it is not clear that the trends are linear (look at the males of the largest ages). Add a smooth trend if you like:
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Mercury concentration increases with age for the females, and also does with age for the males up to about age 25, after which it levels off or decreases.
\(\blacksquare\)
- Explain briefly how your graph and your calculations of mean mercury concentration are telling a similar story.
Solution
I wanted to get to a point where we were saying something interesting about the data, rather than just playing with select
and friends.
The graph is showing an increasing trend of mercury concentration with age (at least, for all but the oldest males). The mean calculations are showing that the younger dolphins have a small mercury concentration on average, and the older dolphins have a much larger mean. These are both saying that mercury concentration is increasing with age, so they are consistent.
Extra:
I tried to make this fairly obvious for you by choice of age groups to compare, and this pattern holds for both males and females, so that the mean calculations didn’t need to depend on sex
. But there is no great difficulty in making it work by sex as well – insert a group_by
before the summarize
:
## # A tibble: 2 x 2
## sex m
## <chr> <dbl>
## 1 female 46.7
## 2 male 63.2
and
## # A tibble: 2 x 2
## sex m
## <chr> <dbl>
## 1 female 288.
## 2 male 316.
The male means are slightly bigger in the two cases, though the difference between the sexes is small compared to the age effect. (I ran a regression predicting mercury concentration from age and sex, and the sex effect wasn’t even significant.)
A second extra: what if you wanted to divide age up into age groups for comparison of mercury concentration? This isn’t always a good idea (for reasons I get into below), but if you want to do it, cut
is your friend:
## # A tibble: 45 x 4
## mercury age sex age_group
## <dbl> <dbl> <chr> <fct>
## 1 1.7 0.21 male (0,15]
## 2 1.72 0.33 male (0,15]
## 3 8.8 2 male (0,15]
## 4 5.9 2.2 male (0,15]
## 5 101 8.5 male (0,15]
## 6 85.4 11.5 male (0,15]
## 7 118 11.5 male (0,15]
## 8 183 13.5 male (0,15]
## 9 168 16.5 male (15,25]
## 10 218 16.5 male (15,25]
## # … with 35 more rows
Note that cut
needs input breaks
whose first value is lower than all the ages and whose last value is higher than all the ages. The age group intervals are what mathematicians would call “half-open”: the upper limit is included in the interval and the lower limit is not. The age of 15 is thus in the 0-15 interval and not in 15-25.
Now we can work out mean mercury concentration for each age group:
## # A tibble: 3 x 2
## age_group m
## <fct> <dbl>
## 1 (0,15] 55.5
## 2 (15,25] 242.
## 3 (25,60] 309.
or even do it by sex as well:
## `summarise()` has grouped output by 'age_group'. You can override using the `.groups` argument.
## # A tibble: 6 x 3
## # Groups: age_group [3]
## age_group sex m
## <fct> <chr> <dbl>
## 1 (0,15] female 46.7
## 2 (0,15] male 63.2
## 3 (15,25] female 189.
## 4 (15,25] male 288.
## 5 (25,60] female 288.
## 6 (25,60] male 316.
Note that there is still no evidence here of the levelling off of the male mercury levels, which happened after age 25, and there was quite a lot of variability there as well.
My take is that if you have numerical data, that’s what you should use, rather than breaking it up into categories. Those dolphins in the 25-60 age group, we don’t know much about their actual ages, and we’ve seen that age and mercury concentration are closely related, so we are throwing away information by categorizing. Some people break things up into categories because that’s what they know how to analyze, but that is never a good reason for you: it means that you need to go learn how to handle the original quantitative data!
\(\blacksquare\)
I found this by googling, after I had scrolled past all the pages of articles about the baseball pitcher who lives in a van.↩
Hockey is similar: teams go on road trips, playing several different teams before returning home. Hockey teams, though, tend to play each team only once on a road trip: for example, a west coast team like the Canucks might play a game in each of Toronto, Montreal, Boston and New York on a road trip. Well, maybe three games in the New York area: one each against the Rangers, Islanders and Devils.↩
If you do this by hand, you’ll get a warning about the chi-squared approximation being inaccurate. This is because of the small frequencies, and not because of the outliers. Those are not damaging the test at all.↩