Choosing things in dataframes
Doing things with data frames
Let’s go back to our Australian athletes:
Choosing a column
athletes %>% select (Sport)
Choosing several columns
athletes %>% select (Sport, Hg, BMI)
Choosing consecutive columns
athletes %>% select (Sex: WCC, BMI)
Choosing all-but some columns
athletes %>% select (- (RCC: LBM))
athletes %>% select (Sex: Sport, Ht: Wt)
Select-helpers
Other ways to select columns: those whose name:
starts_with
something
ends_with
something
contains
something
matches
a “regular expression”
everything()
select all the columns
Columns whose names begin with S
athletes %>% select (starts_with ("S" ))
Columns whose names end with C
either uppercase or lowercase:
athletes %>% select (ends_with ("c" ))
Case-sensitive
This works with any of the select-helpers:
athletes %>% select (ends_with ("C" , ignore.case= FALSE ))
Column names containing letter R
athletes %>% select (contains ("r" ))
Exactly two characters, ending with T
In regular expression terms, this is ^.t$
:
^
means “start of text”
.
means “exactly one character, but could be anything”
$
means “end of text”.
athletes %>% select (matches ("^.t$" ))
Choosing columns by property
Use where
as with summarizing several columns
eg, to choose text columns:
athletes %>% select (where (is.character))
Choosing rows by number
athletes %>% slice (16 : 25 )
Non-consecutive rows
athletes %>%
slice (10 , 13 , 17 , 42 )
A random sample of rows
athletes %>% slice_sample (n= 8 )
Rows for which something is true
athletes %>% filter (Sport == "Tennis" )
More complicated selections
athletes %>% filter (Sport == "Tennis" , RCC < 5 )
Another way to do “and”
athletes %>% filter (Sport == "Tennis" ) %>%
filter (RCC < 5 )
Either/Or
athletes %>% filter (Sport == "Tennis" | RCC > 5 )
Sorting into order
athletes %>% arrange (RCC)
Breaking ties by another variable
athletes %>% arrange (RCC, BMI)
Descending order
athletes %>% arrange (desc (BMI))
“The top ones”
athletes %>%
arrange (desc (Wt)) %>%
slice (1 : 7 ) %>%
select (Sport, Wt)
Another way
athletes %>%
slice_max (order_by = Wt, n= 7 ) %>%
select (Sport, Wt)
Create new variables from old ones
athletes %>%
mutate (wt_lb = Wt * 2.2 ) %>%
select (Sport, Sex, Wt, wt_lb) %>%
arrange (Wt)
Turning the result into a number
Output is always data frame unless you explicitly turn it into something else, eg. the weight of the heaviest athlete, as a number:
athletes %>% arrange (desc (Wt)) %>%
pluck ("Wt" , 1 ) -> heavy
heavy
Or the 20 heaviest weights in descending order:
athletes %>%
arrange (desc (Wt)) %>%
slice (1 : 20 ) %>%
pluck ("Wt" )
[1] 123.20 113.70 111.30 108.20 102.70 101.00 100.20 98.00 97.90 97.90
[11] 97.00 96.90 96.30 94.80 94.80 94.70 94.70 94.60 94.25 94.20
Another way to do the last one
athletes %>%
arrange (desc (Wt)) %>%
slice (1 : 20 ) %>%
pull ("Wt" )
[1] 123.20 113.70 111.30 108.20 102.70 101.00 100.20 98.00 97.90 97.90
[11] 97.00 96.90 96.30 94.80 94.80 94.70 94.70 94.60 94.25 94.20
pull
grabs the column you name as a vector (of whatever it contains).
To find the mean height of the women athletes
Two ways:
athletes %>% group_by (Sex) %>% summarize (m = mean (Ht))
athletes %>%
filter (Sex == "female" ) %>%
summarize (m = mean (Ht))
Summary of data selection/arrangement “verbs”
select
Choose columns
slice
Choose rows by number
slice_sample
Choose random rows
slice_max
Choose rows with largest values on a variable (also slice_min
)
filter
Choose rows satisfying conditions
arrange
Sort in order by column(s)
mutate
Create new variables
group_by
Create groups to work with
summarize
Calculate summary statistics (by groups if defined)
pluck
Extract items from data frame
pull
Extract a single column from a data frame as a vector
Looking things up in another data frame
Suppose you are working in the nails department of a hardware store and you find that you have sold these items:
my_url <- "http://ritsokiguess.site/datafiles/nail_sales.csv"
sales <- read_csv (my_url)
sales
Product descriptions and prices
but you don’t remember what these product codes are, and you would like to know the total revenue from these sales.
Fortunately you found a list of product descriptions and prices:
my_url <- "http://ritsokiguess.site/datafiles/nail_desc.csv"
desc <- read_csv (my_url)
desc
The lookup
How do you “look up” the product codes to find the product descriptions and prices?
left_join
.
sales %>% left_join (desc)
What we have
this looks up all the rows in the first dataframe that are also in the second .
by default matches all columns with same name in two dataframes (product_code
here)
get all columns in both dataframes. The rows are the ones for that product_code
.
So now can work out how much the total revenue was:
sales %>% left_join (desc) %>%
mutate (product_revenue = sales* price) %>%
summarize (total_revenue = sum (product_revenue))
Matching on only some matching names
Suppose the sales
dataframe also had a column qty
(which was the quantity sold):
sales %>% rename ("qty" = "sales" ) -> sales1
sales1
The qty
in sales1
is the quantity sold, but the qty
in desc
is the number of nails in a package. These should not be matched: they are different things.
Matching only on product code
sales1 %>%
left_join (desc, join_by (product_code))
Get qty.x
(from sales1
) and qty.y
(from desc
).
Matching on different names 1/2
Suppose the product code in sales
was just code
:
sales %>% rename ("code" = "product_code" ) -> sales2
sales2
How to match the two product codes that have different names?
Matching on different names 2/2
Use join_by
, but like this:
sales2 %>%
left_join (desc, join_by (code == product_code))
Other types of join
right_join
: interchanges roles, looking up keys from second dataframe in first.
anti_join
: give me all the rows in the first dataframe that are not in the second. (Use this eg. to see whether the product descriptions are incomplete.)
full_join
: give me all the rows in both dataframes, with missings as needed.
Full join here
sales %>% full_join (desc)
The missing sales
for “masonry nail” says that it was in the lookup table desc
, but we didn’t sell any.
The same thing, but with anti_join
Anything in first df but not in second?
desc %>% anti_join (sales)
# sales %>% anti_join(desc)
Masonry nails are the only thing in our product description file that we did not sell any of.