Ken's Blog: Correcting a dataframe

Ken Butler

Packages

library(tidyverse)
library(tmaptools)
library(leaflet)

Introduction

So I had a dataframe today, in which I wanted to make some small corrections. Specifically, I had this one:

my_url <- "http://ritsokiguess.site/datafiles/wisconsin.txt"
wisc <- read_table(my_url)
wisc %>% select(location)

# A tibble: 12 × 1
   location     
   <chr>        
 1 Appleton     
 2 Beloit       
 3 Fort.Atkinson
 4 Madison      
 5 Marshfield   
 6 Milwaukee    
 7 Monroe       
 8 Superior     
 9 Wausau       
10 Dubuque      
11 St.Paul      
12 Chicago

These are mostly, but not all, cities in Wisconsin, and I want to draw them on a map. To do that, though, I need to affix their states to them, and I thought a good starting point was to start by pretending that they were all in Wisconsin, and then correct the ones that aren’t:

wisc %>% select(location) %>% 
  mutate(state = "WI") -> wisc
wisc

# A tibble: 12 × 2
   location      state
   <chr>         <chr>
 1 Appleton      WI   
 2 Beloit        WI   
 3 Fort.Atkinson WI   
 4 Madison       WI   
 5 Marshfield    WI   
 6 Milwaukee     WI   
 7 Monroe        WI   
 8 Superior      WI   
 9 Wausau        WI   
10 Dubuque       WI   
11 St.Paul       WI   
12 Chicago       WI

The last three cities are in the wrong state: Dubuque is in Iowa (IA), St. Paul in Minnesota (MN), and Chicago is in Illinois (IL). I know how to fix this in base R: I write something like

wisc$state[12] <- "IL"

but how do you do this the Tidyverse way?

A better way

The first step is to make a small dataframe with the cities that need to be corrected, and the states they are actually in:

corrections <- tribble(
  ~location, ~state,
  "Dubuque", "IA",
  "St.Paul", "MN",
  "Chicago", "IL"
)
corrections

# A tibble: 3 × 2
  location state
  <chr>    <chr>
1 Dubuque  IA   
2 St.Paul  MN   
3 Chicago  IL

Note that the columns of this dataframe have the same names as the ones in the original dataframe wisc.

So, I was thinking, this is a lookup table (of a sort), and so joining this to wisc might yield something helpful. We want to look up locations and not match states, since we want to have these three cities have their correct state as a possibility. So what does this do?

wisc %>% 
  left_join(corrections, by = "location")

# A tibble: 12 × 3
   location      state.x state.y
   <chr>         <chr>   <chr>  
 1 Appleton      WI      <NA>   
 2 Beloit        WI      <NA>   
 3 Fort.Atkinson WI      <NA>   
 4 Madison       WI      <NA>   
 5 Marshfield    WI      <NA>   
 6 Milwaukee     WI      <NA>   
 7 Monroe        WI      <NA>   
 8 Superior      WI      <NA>   
 9 Wausau        WI      <NA>   
10 Dubuque       WI      IA     
11 St.Paul       WI      MN     
12 Chicago       WI      IL

Now, we have two states for each city. The first one is always Wisconsin, and the second one is usually missing, but where the state in state.y has a value, that is the true state of the city. So, the thought process is that the actual state should be:

if state.y is not missing, use that
else, use the value in state.x.

I had an idea that there was a function that would do exactly this, only I couldn’t remember its name, so I couldn’t really search for it. My first thought was na_if. What this does is every time it sees a certain value, it replaces it with NA. This, though, is the opposite way from what I wanted. So I looked at the See Also, and saw replace_na. This replaces NAs with a given value. Not quite right, but closer.

In the See Also for replace_na, I saw one more thing: coalesce, “replace NAs with values from other vectors”. Was that what I was thinking of? It was. The way it works is that you feed it several vectors, and the first one that is not missing gives its value to the result. Hence, what I needed was this:

wisc %>% 
  left_join(corrections, by = "location") %>% 
  mutate(state=coalesce(state.y, state.x))

# A tibble: 12 × 4
   location      state.x state.y state
   <chr>         <chr>   <chr>   <chr>
 1 Appleton      WI      <NA>    WI   
 2 Beloit        WI      <NA>    WI   
 3 Fort.Atkinson WI      <NA>    WI   
 4 Madison       WI      <NA>    WI   
 5 Marshfield    WI      <NA>    WI   
 6 Milwaukee     WI      <NA>    WI   
 7 Monroe        WI      <NA>    WI   
 8 Superior      WI      <NA>    WI   
 9 Wausau        WI      <NA>    WI   
10 Dubuque       WI      IA      IA   
11 St.Paul       WI      MN      MN   
12 Chicago       WI      IL      IL

Where state.y has a value, it is used; if it’s missing, the value in state.x is used instead.

The best way

I was quite pleased with myself for coming up with this, but I had missed the actual best way of doing this. In SQL, there is UPDATE, and what that does is to take a table of keys to look up and some new values for other columns to replace the ones in the original table. Because dplyr has a lot of things in common with SQL, it is perhaps no surprise that there is a rows_update, and for this job it is as simple as this:

wisc %>% 
  rows_update(corrections) -> wisc
wisc

# A tibble: 12 × 2
   location      state
   <chr>         <chr>
 1 Appleton      WI   
 2 Beloit        WI   
 3 Fort.Atkinson WI   
 4 Madison       WI   
 5 Marshfield    WI   
 6 Milwaukee     WI   
 7 Monroe        WI   
 8 Superior      WI   
 9 Wausau        WI   
10 Dubuque       IA   
11 St.Paul       MN   
12 Chicago       IL

The values to look up (the “keys”) are by default in the first column, which is where they are in corrections. If they had not been, I would have used a by in the same way as with a join.

Mind. Blown. (Well, my mind was, anyway.)

Geocoding

I said I wanted to draw a map with these cities on it. For that, I need to look up the longitude and latitude of these places, and for that, I need to glue the state onto the name of each city, to make sure I don’t look up the wrong one. It is perhaps easy to forget that unite is the cleanest way of doing this, particularly if you don’t want the individual columns any more:

wisc %>% unite(where, c(location, state), sep = " ") -> wisc
wisc

# A tibble: 12 × 1
   where           
   <chr>           
 1 Appleton WI     
 2 Beloit WI       
 3 Fort.Atkinson WI
 4 Madison WI      
 5 Marshfield WI   
 6 Milwaukee WI    
 7 Monroe WI       
 8 Superior WI     
 9 Wausau WI       
10 Dubuque IA      
11 St.Paul MN      
12 Chicago IL

The function geocode_OSM from tmaptools will find the longitude and latitude of a place. It expects one place as input, not a vector of placenames, so we will work rowwise to geocode one at a time. (Using map from purrr is also an option.) The geocoder returns a list, which contains, buried a little deeply, the longitudes and latitudes:

wisc %>% 
  rowwise() %>% 
  mutate(ll = list(geocode_OSM(where))) -> wisc
wisc

# A tibble: 12 × 2
# Rowwise: 
   where            ll              
   <chr>            <list>          
 1 Appleton WI      <named list [3]>
 2 Beloit WI        <named list [3]>
 3 Fort.Atkinson WI <named list [3]>
 4 Madison WI       <named list [3]>
 5 Marshfield WI    <named list [3]>
 6 Milwaukee WI     <named list [3]>
 7 Monroe WI        <named list [3]>
 8 Superior WI      <named list [3]>
 9 Wausau WI        <named list [3]>
10 Dubuque IA       <named list [3]>
11 St.Paul MN       <named list [3]>
12 Chicago IL       <named list [3]>

The column ll is a list-column, and the usual way to handle these is to unnest, but that isn’t quite right here:

wisc %>% unnest(ll)

# A tibble: 36 × 2
   where            ll          
   <chr>            <named list>
 1 Appleton WI      <chr [1]>   
 2 Appleton WI      <dbl [2]>   
 3 Appleton WI      <bbox [4]>  
 4 Beloit WI        <chr [1]>   
 5 Beloit WI        <dbl [2]>   
 6 Beloit WI        <bbox [4]>  
 7 Fort.Atkinson WI <chr [1]>   
 8 Fort.Atkinson WI <dbl [2]>   
 9 Fort.Atkinson WI <bbox [4]>  
10 Madison WI       <chr [1]>   
# … with 26 more rows

Unnesting a list of three things produces three rows for each city. It would make more sense to have the unnesting go to the right and produce a new column for each thing in the list. The new tidyr has a variant called unnest_wider that does this:

wisc %>% 
  unnest_wider(ll)

# A tibble: 12 × 4
   where            query            coords    bbox      
   <chr>            <chr>            <list>    <list>    
 1 Appleton WI      Appleton WI      <dbl [2]> <bbox [4]>
 2 Beloit WI        Beloit WI        <dbl [2]> <bbox [4]>
 3 Fort.Atkinson WI Fort.Atkinson WI <dbl [2]> <bbox [4]>
 4 Madison WI       Madison WI       <dbl [2]> <bbox [4]>
 5 Marshfield WI    Marshfield WI    <dbl [2]> <bbox [4]>
 6 Milwaukee WI     Milwaukee WI     <dbl [2]> <bbox [4]>
 7 Monroe WI        Monroe WI        <dbl [2]> <bbox [4]>
 8 Superior WI      Superior WI      <dbl [2]> <bbox [4]>
 9 Wausau WI        Wausau WI        <dbl [2]> <bbox [4]>
10 Dubuque IA       Dubuque IA       <dbl [2]> <bbox [4]>
11 St.Paul MN       St.Paul MN       <dbl [2]> <bbox [4]>
12 Chicago IL       Chicago IL       <dbl [2]> <bbox [4]>

The longitudes and latitudes we want are still hidden in a list-column, the one called coords, so with luck, if we unnest that wider as well, we should be in business:

wisc %>% 
  unnest_wider(ll) %>% 
  unnest_wider(coords) -> wisc
wisc

# A tibble: 12 × 5
   where            query                x     y bbox      
   <chr>            <chr>            <dbl> <dbl> <list>    
 1 Appleton WI      Appleton WI      -88.4  44.3 <bbox [4]>
 2 Beloit WI        Beloit WI        -89.0  42.5 <bbox [4]>
 3 Fort.Atkinson WI Fort.Atkinson WI -88.8  42.9 <bbox [4]>
 4 Madison WI       Madison WI       -89.4  43.1 <bbox [4]>
 5 Marshfield WI    Marshfield WI    -90.2  44.7 <bbox [4]>
 6 Milwaukee WI     Milwaukee WI     -87.9  43.0 <bbox [4]>
 7 Monroe WI        Monroe WI        -89.6  42.6 <bbox [4]>
 8 Superior WI      Superior WI      -92.1  46.6 <bbox [4]>
 9 Wausau WI        Wausau WI        -89.6  45.0 <bbox [4]>
10 Dubuque IA       Dubuque IA       -90.7  42.5 <bbox [4]>
11 St.Paul MN       St.Paul MN       -93.1  44.9 <bbox [4]>
12 Chicago IL       Chicago IL       -87.6  41.9 <bbox [4]>

And now we are. x contains the longitudes (negative for degrees west), and y the latitudes (positive for degrees north).

Making a map with these on them

The most enjoyable way to make a map in R is to use the leaflet package. Making a map is a three-step process:

leaflet() with the name of the dataframe
addTiles() to get map tiles to draw the map with
add some kind of markers to show where the points are. I use circle markers here; there are also markers (from addMarkers) that look like Google map pins. Here also you associate the longs and lats with the columns they are in in your dataframe:

leaflet(data = wisc) %>% 
  addTiles() %>% 
  addCircleMarkers(lng = ~x, lat = ~y)

The nice thing about Leaflet maps is that you can zoom, pan and generally move about in them. For example, you can zoom in to find out which city each circle represents.

Correcting a dataframe