Choosing things in dataframes

Packages

The usual:

library(tidyverse)

Doing things with data frames

Let’s go back to our Australian athletes:

athletes

Choosing a column

athletes %>% select(Sport)

Choosing several columns

athletes %>% select(Sport, Hg, BMI)

Choosing consecutive columns

athletes %>% select(Sex:WCC)

Choosing all-but some columns

athletes %>% select(-(RCC:LBM))

Select-helpers

Other ways to select columns: those whose name:

  • starts_with something
  • ends_with something
  • contains something
  • matches a “regular expression”
  • everything() select all the columns

Columns whose names begin with S

athletes %>% select(starts_with("S"))

Columns whose names end with C

either uppercase or lowercase:

athletes %>% select(ends_with("c"))

Case-sensitive

This works with any of the select-helpers:

athletes %>% select(ends_with("C", ignore.case=FALSE))

Column names containing letter R

athletes %>% select(contains("r"))

Exactly two characters, ending with T

In regular expression terms, this is ^.t$:

  • ^ means “start of text”
  • . means “exactly one character, but could be anything”
  • $ means “end of text”.
athletes %>% select(matches("^.t$"))

Choosing columns by property

  • Use where as with summarizing several columns
  • eg, to choose text columns:
athletes %>% select(where(is.character))

Choosing rows by number

athletes %>% slice(16:25)

Non-consecutive rows

athletes %>% 
  slice(10, 13, 17, 42)

A random sample of rows

athletes %>% slice_sample(n=8)

Rows for which something is true

athletes %>% filter(Sport == "Tennis")

More complicated selections

athletes %>% filter(Sport == "Tennis", RCC < 5)

Another way to do “and”

athletes %>% filter(Sport == "Tennis") %>% 
  filter(RCC < 5)

Either/Or

athletes %>% filter(Sport == "Tennis" | RCC > 5)

Sorting into order

athletes %>% arrange(RCC)

Breaking ties by another variable

athletes %>% arrange(RCC, BMI)

Descending order

athletes %>% arrange(desc(BMI))

“The top ones”

athletes %>%
  arrange(desc(Wt)) %>%
  slice(1:7) %>%
  select(Sport, Wt)

Another way

athletes %>% 
  slice_max(order_by = Wt, n=7) %>% 
  select(Sport, Wt)

Create new variables from old ones

athletes %>%
  mutate(wt_lb = Wt * 2.2) %>%
  select(Sport, Sex, Wt, wt_lb) %>% 
  arrange(Wt)

Turning the result into a number

Output is always data frame unless you explicitly turn it into something else, eg. the weight of the heaviest athlete, as a number:

athletes %>% arrange(desc(Wt)) %>% pluck("Wt", 1)
[1] 123.2

Or the 20 heaviest weights in descending order:

athletes %>%
  arrange(desc(Wt)) %>%
  slice(1:20) %>%
  pluck("Wt")
 [1] 123.20 113.70 111.30 108.20 102.70 101.00 100.20  98.00  97.90  97.90
[11]  97.00  96.90  96.30  94.80  94.80  94.70  94.70  94.60  94.25  94.20

Another way to do the last one

athletes %>%
  arrange(desc(Wt)) %>%
  slice(1:20) %>%
  pull("Wt")
 [1] 123.20 113.70 111.30 108.20 102.70 101.00 100.20  98.00  97.90  97.90
[11]  97.00  96.90  96.30  94.80  94.80  94.70  94.70  94.60  94.25  94.20

pull grabs the column you name as a vector (of whatever it contains).

To find the mean height of the women athletes

Two ways:

athletes %>% group_by(Sex) %>% summarize(m = mean(Ht))
athletes %>%
  filter(Sex == "female") %>%
  summarize(m = mean(Ht))

Summary of data selection/arrangement “verbs”

Verb Purpose
select Choose columns
slice Choose rows by number
slice_sample Choose random rows
slice_max Choose rows with largest values on a variable (also slice_min)
filter Choose rows satisfying conditions
arrange Sort in order by column(s)
mutate Create new variables
group_by Create groups to work with
summarize Calculate summary statistics (by groups if defined)
pluck Extract items from data frame
pull Extract a single column from a data frame as a vector

Looking things up in another data frame

  • Suppose you are working in the nails department of a hardware store and you find that you have sold these items:
my_url <- "http://ritsokiguess.site/datafiles/nail_sales.csv"
sales <- read_csv(my_url)
sales

Product descriptions and prices

  • but you don’t remember what these product codes are, and you would like to know the total revenue from these sales.

  • Fortunately you found a list of product descriptions and prices:

my_url <- "http://ritsokiguess.site/datafiles/nail_desc.csv"
desc <- read_csv(my_url)
desc

The lookup

  • How do you “look up” the product codes to find the product descriptions and prices?
  • left_join.
sales %>% left_join(desc)

What we have

  • this looks up all the rows in the first dataframe that are also in the second.
  • by default matches all columns with same name in two dataframes (product_code here)
  • get all columns in both dataframes. The rows are the ones for that product_code.

So now can work out how much the total revenue was:

sales %>% left_join(desc) %>% 
  mutate(product_revenue = sales*price) %>% 
  summarize(total_revenue = sum(product_revenue))

More comments

  • if any product codes are not matched, you get NA in the added columns
  • anything in the second dataframe that was not in the first does not appear (here, any products that were not sold)
  • other variations (examples follow):
    • if there are two columns with the same name in the two dataframes, and you only want to match on one, use by with one column name
    • if the columns you want to look up have different names in the two dataframes, use by with a “named list”

Matching on only some matching names

  • Suppose the sales dataframe also had a column qty (which was the quantity sold):
sales %>% rename("qty"="sales") -> sales1
sales1
  • The qty in sales1 is the quantity sold, but the qty in desc is the number of nails in a package. These should not be matched: they are different things.

Matching only on product code

sales1 %>% 
  left_join(desc, join_by(product_code))
  • Get qty.x (from sales1) and qty.y (from desc).

Matching on different names 1/2

  • Suppose the product code in sales was just code:
sales %>% rename("code" = "product_code") -> sales2
sales2
  • How to match the two product codes that have different names?

Matching on different names 2/2

  • Use join_by, but like this:
sales2 %>% 
  left_join(desc, join_by(code == product_code))

Other types of join

  • right_join: interchanges roles, looking up keys from second dataframe in first.
  • anti_join: give me all the rows in the first dataframe that are not in the second. (Use this eg. to see whether the product descriptions are incomplete.)
  • full_join: give me all the rows in both dataframes, with missings as needed.

Full join here

sales %>% full_join(desc)
  • The missing sales for “masonry nail” says that it was in the lookup table desc, but we didn’t sell any.

The same thing, but with anti_join

Anything in first df but not in second?

desc %>% anti_join(sales)
# sales %>% anti_join(desc)

Masonry nails are the only thing in our product description file that we did not sell any of.