Choosing things in dataframes

Packages

The usual:

library(tidyverse)

Doing things with data frames

Let’s go back to our Australian athletes:

athletes

Choosing a column

athletes %>% select(Sport)

Choosing several columns

athletes %>% select(Sport, Hg, BMI)

Choosing consecutive columns

athletes %>% select(Sex:WCC, BMI)

Choosing all-but some columns

athletes %>% select(-(RCC:LBM))

Select-helpers

Other ways to select columns, namely those whose name:

starts_with something
ends_with something
contains something
matches a “regular expression”

Also everything(), to select all the columns.

Columns whose names begin with S

athletes %>% select(starts_with("S"))

Columns whose names end with C

either uppercase or lowercase:

athletes %>% select(ends_with("c"))

Case-sensitive

Add ignore.case=FALSE. This works with any of the select-helpers that match on column names:

athletes %>% select(ends_with("C", ignore.case=FALSE))

Column names containing letter R

athletes %>% select(contains("r"))

Exactly two characters, ending with T

In regular expression terms, this is ^.t$:

^ means “start of text”
. means “exactly one character, but could be anything”
t means a literal letter t (uppercase or lowercase)
$ means “end of text”.

Matching a regular expression

athletes %>% select(matches("^.t$"))
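As a small self-contained check of what this pattern picks out, here is a sketch on a made-up data frame (the tibble toy and its column names are hypothetical, chosen only to exercise the pattern; they are not the athletes data):

```r
library(dplyr)
library(tibble)

# Hypothetical one-row data frame; only the column names matter here
toy <- tibble(Ht = 1.80, Wt = 75, XT = 1, Sport = "Tennis", hit = 3)

# "^.t$": start, any one character, a t, end.
# matches() ignores case by default, so XT is selected as well.
matched <- toy %>% select(matches("^.t$")) %>% names()
matched

# With ignore.case = FALSE, only a lowercase t counts
strict <- toy %>% select(matches("^.t$", ignore.case = FALSE)) %>% names()
strict
```

Sport and hit are not selected because their names are longer than two characters, even though they contain or end with t.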

Choosing columns by property

Use where, in the same way as when summarizing several columns,
e.g. to choose text columns:

athletes %>% select(where(is.character))
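To see the connection with summarizing several columns, here is a sketch on a made-up data frame (toy and its values are hypothetical). The same where(is.numeric) is used once inside select and once inside across:

```r
library(dplyr)
library(tibble)

# Hypothetical data: two text columns and two numeric ones
toy <- tibble(
  Sport = c("Tennis", "Row"),
  Sex = c("female", "male"),
  Ht = c(170, 188),
  Wt = c(60, 86)
)

# Choose the columns for which is.numeric() is TRUE
numeric_cols <- toy %>% select(where(is.numeric))
numeric_cols

# The same condition chooses which columns to summarize
means <- toy %>% summarize(across(where(is.numeric), mean))
means
```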

Choosing rows by number

athletes %>% slice(16:25)

Non-consecutive rows

athletes %>% 
  slice(10, 13, 17, 42)

A random sample of rows

athletes %>% slice_sample(n=8)

Rows for which something is true

athletes %>% filter(Sport == "Tennis")

More complicated selections

athletes %>% filter(Sport == "Tennis", RCC < 5)

Another way to do “and”

athletes %>% filter(Sport == "Tennis") %>% 
  filter(RCC < 5)

Either/Or

athletes %>% filter(Sport == "Tennis" | RCC > 5)

Sorting into order

athletes %>% arrange(RCC)

Breaking ties by another variable

athletes %>% arrange(RCC, BMI)

Descending order

athletes %>% arrange(desc(BMI))

“The top ones”

athletes %>%
  arrange(desc(Wt)) %>%
  slice(1:7) %>%
  select(Sport, Wt)

Another way

athletes %>% 
  slice_max(order_by = Wt, n=7) %>% 
  select(Sport, Wt)
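The two approaches can differ when there are ties, because slice_max keeps tied rows together by default (its with_ties argument is TRUE). A sketch on made-up data (toy and its values are hypothetical):

```r
library(dplyr)
library(tibble)

# Hypothetical weights, with a tie at 90
toy <- tibble(
  Sport = c("Row", "Field", "Tennis", "Swim", "Gym"),
  Wt = c(90, 95, 90, 80, 60)
)

# Sort and take exactly the first two rows
top_slice <- toy %>% arrange(desc(Wt)) %>% slice(1:2)

# slice_max keeps both rows tied for second place, so returns three rows
top_max <- toy %>% slice_max(order_by = Wt, n = 2)

# Ask for exactly two rows regardless of ties
top_max_exact <- toy %>% slice_max(order_by = Wt, n = 2, with_ties = FALSE)

nrow(top_slice)
nrow(top_max)
nrow(top_max_exact)
```

When there is no tie at the cutoff, the two approaches pick out the same rows.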

Create new variables from old ones

athletes %>%
  mutate(wt_lb = Wt * 2.2) %>%
  select(Sport, Sex, Wt, wt_lb) %>% 
  arrange(Wt)

Turning the result into a number

Output is always a data frame unless you explicitly turn it into something else, e.g. the weight of the heaviest athlete, as a number:

athletes %>% arrange(desc(Wt)) %>% 
  pluck("Wt", 1) -> heavy
heavy

[1] 123.2

Or the 20 heaviest weights in descending order:

athletes %>%
  arrange(desc(Wt)) %>%
  slice(1:20) %>%
  pluck("Wt")

 [1] 123.20 113.70 111.30 108.20 102.70 101.00 100.20  98.00  97.90  97.90
[11]  97.00  96.90  96.30  94.80  94.80  94.70  94.70  94.60  94.25  94.20

Another way to do the last one

athletes %>%
  arrange(desc(Wt)) %>%
  slice(1:20) %>%
  pull("Wt")

 [1] 123.20 113.70 111.30 108.20 102.70 101.00 100.20  98.00  97.90  97.90
[11]  97.00  96.90  96.30  94.80  94.80  94.70  94.70  94.60  94.25  94.20

pull grabs the column you name as a vector (of whatever it contains).
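A minimal sketch of the difference between the data frame and what pluck and pull give back, using made-up data (toy and its values are hypothetical):

```r
library(dplyr)
library(purrr)
library(tibble)

# Hypothetical data frame of three athletes
toy <- tibble(
  Sport = c("Row", "Field", "Tennis"),
  Wt = c(90.5, 123.2, 60.1)
)

sorted <- toy %>% arrange(desc(Wt))

# pluck: the first element of column Wt, as a single number
heavy <- sorted %>% pluck("Wt", 1)

# pull: the whole column Wt as a numeric vector
weights <- sorted %>% pull("Wt")

# pluck with only the column name gives the same vector as pull
same_vector <- identical(sorted %>% pluck("Wt"), weights)

is.data.frame(sorted)
is.numeric(weights)
```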

To find the mean height of the women athletes

Two ways: summarize by group (which gives the mean for the men as well), or keep only the women first:

athletes %>% group_by(Sex) %>% summarize(m = mean(Ht))

athletes %>%
  filter(Sex == "female") %>%
  summarize(m = mean(Ht))
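To check that the two ways agree, here is a sketch on made-up heights (toy and its values are hypothetical):

```r
library(dplyr)
library(tibble)

# Hypothetical heights for two women and two men
toy <- tibble(
  Sex = c("female", "female", "male", "male"),
  Ht = c(170, 174, 180, 190)
)

# Way 1: one row per group; the female row holds the answer
by_sex <- toy %>% group_by(Sex) %>% summarize(m = mean(Ht))
by_sex

# Way 2: keep only the women, then summarize; one row
women <- toy %>% filter(Sex == "female") %>% summarize(m = mean(Ht))
women
```

The first way has one row per group and the second has a single row; the female mean is the same in both.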

Summary of data selection/arrangement “verbs”

| Verb | Purpose |
|------|---------|
| `select` | Choose columns |
| `slice` | Choose rows by number |
| `slice_sample` | Choose random rows |
| `slice_max` | Choose rows with largest values on a variable (also `slice_min`) |
| `filter` | Choose rows satisfying conditions |
| `arrange` | Sort in order by column(s) |
| `mutate` | Create new variables |
| `group_by` | Create groups to work with |
| `summarize` | Calculate summary statistics (by groups if defined) |
| `pluck` | Extract items from data frame |
| `pull` | Extract a single column from a data frame as a vector |

Looking things up in another data frame

Suppose you are working in the nails department of a hardware store and you find that you have sold these items:

my_url <- "http://ritsokiguess.site/datafiles/nail_sales.csv"
sales <- read_csv(my_url)
sales

Product descriptions and prices

But you don’t remember what these product codes are, and you would like to know the total revenue from these sales.
Fortunately, you found a list of product descriptions and prices:

my_url <- "http://ritsokiguess.site/datafiles/nail_desc.csv"
desc <- read_csv(my_url)
desc

The lookup

How do you “look up” the product codes to find the product descriptions and prices?
left_join.

sales %>% left_join(desc)

What we have

this looks up each row of the first dataframe in the second, keeping all the rows of the first.
by default, matches on all columns with the same name in the two dataframes (product_code here).
get all the columns in both dataframes. The values added to each row are the ones for that product_code.

So now we can work out how much the total revenue was:

sales %>% left_join(desc) %>% 
  mutate(product_revenue = sales*price) %>% 
  summarize(total_revenue = sum(product_revenue))
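The same calculation can be followed by hand on a small made-up example. The data frames toy_sales and toy_desc, and all the codes, quantities and prices in them, are hypothetical stand-ins for the real files:

```r
library(dplyr)
library(tibble)

# Hypothetical sales: product code and number sold
toy_sales <- tibble(
  product_code = c(101, 102, 104),
  sales = c(10, 5, 2)
)

# Hypothetical lookup table: product code and price
toy_desc <- tibble(
  product_code = c(101, 102, 103, 104),
  price = c(2.50, 4.00, 7.25, 1.00)
)

# Look up the price for each product sold, then add up sales * price
total <- toy_sales %>%
  left_join(toy_desc, join_by(product_code)) %>%
  mutate(product_revenue = sales * price) %>%
  summarize(total_revenue = sum(product_revenue))
total
```

By hand, this is 10 × 2.50 + 5 × 4.00 + 2 × 1.00 = 47. Product 103 was not sold, so it contributes nothing.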

More comments

if any product codes are not matched, you get NA in the added columns
anything in the second dataframe that was not in the first does not appear (here, any products that were not sold)
other variations (examples follow):
- if there are two columns with the same name in the two dataframes, and you only want to match on one, use join_by with that one column name
- if the columns you want to look up have different names in the two dataframes, use join_by with == between the two names
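The first two comments can be seen on a small made-up example (toy_sales, toy_desc and their values are hypothetical): code 999 was sold but is not in the lookup table, and code 103 is in the lookup table but was not sold.

```r
library(dplyr)
library(tibble)

# Hypothetical sales, including a code missing from the lookup table
toy_sales <- tibble(
  product_code = c(101, 999),
  sales = c(10, 3)
)

# Hypothetical lookup table, including a product that was not sold
toy_desc <- tibble(
  product_code = c(101, 103),
  price = c(2.50, 7.25)
)

joined <- toy_sales %>% left_join(toy_desc, join_by(product_code))
joined
```

The result has the two rows of toy_sales: the price for 999 is NA, and 103 does not appear at all.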

Matching on only some matching names

Suppose the sales dataframe also had a column qty (which was the quantity sold):

sales %>% rename("qty"="sales") -> sales1
sales1

The qty in sales1 is the quantity sold, but the qty in desc is the number of nails in a package. These should not be matched: they are different things.

Matching only on product code

sales1 %>% 
  left_join(desc, join_by(product_code))

Get qty.x (from sales1) and qty.y (from desc).
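If the .x and .y names are hard to keep straight, left_join has a suffix argument that replaces them. A sketch on made-up data (the data frames and the suffixes _sold and _per_pack are hypothetical choices for illustration):

```r
library(dplyr)
library(tibble)

# Hypothetical sales, with quantity sold in qty
toy_sales1 <- tibble(
  product_code = c(101, 102),
  qty = c(10, 5)
)

# Hypothetical lookup table, with nails per package in qty
toy_desc <- tibble(
  product_code = c(101, 102),
  qty = c(100, 50),
  price = c(2.50, 4.00)
)

# Default: the two qty columns become qty.x and qty.y
default_names <- toy_sales1 %>%
  left_join(toy_desc, join_by(product_code)) %>%
  names()

# With suffix: more descriptive names for the two qty columns
labelled_names <- toy_sales1 %>%
  left_join(toy_desc, join_by(product_code),
            suffix = c("_sold", "_per_pack")) %>%
  names()

default_names
labelled_names
```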

Matching on different names 1/2

Suppose the product code in sales was just code:

sales %>% rename("code" = "product_code") -> sales2
sales2

How to match the two product codes that have different names?

Matching on different names 2/2

Use join_by, but like this:

sales2 %>% 
  left_join(desc, join_by(code == product_code))
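A sketch of what comes back, on made-up data (toy_sales2, toy_desc and their values are hypothetical). The key column keeps its name from the first dataframe. The older form, by with a named character vector, is shown for comparison and gives the same result:

```r
library(dplyr)
library(tibble)

# Hypothetical sales, with the product code called code
toy_sales2 <- tibble(
  code = c(101, 102),
  sales = c(10, 5)
)

# Hypothetical lookup table, with the product code called product_code
toy_desc <- tibble(
  product_code = c(101, 102, 103),
  price = c(2.50, 4.00, 7.25)
)

# Match code (first dataframe) against product_code (second dataframe)
joined <- toy_sales2 %>%
  left_join(toy_desc, join_by(code == product_code))
joined

# Older equivalent: a named character vector for by
joined_old <- toy_sales2 %>%
  left_join(toy_desc, by = c("code" = "product_code"))
```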

Other types of join

right_join: interchanges roles, looking up keys from second dataframe in first.
anti_join: give me all the rows in the first dataframe that are not in the second. (Use this e.g. to see whether the product descriptions are incomplete.)
full_join: give me all the rows in both dataframes, with missings as needed.
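As an illustration of right_join interchanging the roles, here is a sketch on made-up data (toy_sales, toy_desc and their values are hypothetical). The right join has the same rows as a left join with the two dataframes swapped; only the column order differs:

```r
library(dplyr)
library(tibble)

# Hypothetical sales and lookup table; product 103 was not sold
toy_sales <- tibble(
  product_code = c(101, 102),
  sales = c(10, 5)
)
toy_desc <- tibble(
  product_code = c(101, 102, 103),
  price = c(2.50, 4.00, 7.25)
)

# Every row of toy_desc is kept; sales is NA for product 103
by_desc <- toy_sales %>% right_join(toy_desc, join_by(product_code))
by_desc

# The same rows, obtained by swapping the dataframes in a left join
flipped <- toy_desc %>% left_join(toy_sales, join_by(product_code))

# Put rows and columns in a common order before comparing
a <- by_desc %>% arrange(product_code) %>% select(product_code, price, sales)
b <- flipped %>% arrange(product_code) %>% select(product_code, price, sales)
```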

Full join here

sales %>% full_join(desc)

The missing value of sales for “masonry nail” says that it was in the lookup table desc, but we didn’t sell any.

The same thing, but with `anti_join`

Anything in the first dataframe (here desc) but not in the second (here sales)?

desc %>% anti_join(sales)

Masonry nails are the only thing in our product description file that we did not sell any of.

The other way around

sales %>% anti_join(desc)

There was nothing we sold that was not in the description file.