The tidyverse way.
So I had a dataframe today, in which I wanted to make some small corrections. Specifically, I had this one:
my_url <- "http://ritsokiguess.site/datafiles/wisconsin.txt"
wisc <- read_table(my_url)
wisc %>% select(location)
# A tibble: 12 × 1
location
<chr>
1 Appleton
2 Beloit
3 Fort.Atkinson
4 Madison
5 Marshfield
6 Milwaukee
7 Monroe
8 Superior
9 Wausau
10 Dubuque
11 St.Paul
12 Chicago
These are mostly, but not all, cities in Wisconsin, and I want to draw them on a map. To do that, though, I need to affix their states to them, and I thought a good starting point was to start by pretending that they were all in Wisconsin, and then correct the ones that aren’t:
wisc %>% select(location) %>%
mutate(state = "WI") -> wisc
wisc
# A tibble: 12 × 2
location state
<chr> <chr>
1 Appleton WI
2 Beloit WI
3 Fort.Atkinson WI
4 Madison WI
5 Marshfield WI
6 Milwaukee WI
7 Monroe WI
8 Superior WI
9 Wausau WI
10 Dubuque WI
11 St.Paul WI
12 Chicago WI
The last three cities are in the wrong state: Dubuque is in Iowa (IA), St. Paul in Minnesota (MN), and Chicago is in Illinois (IL). I know how to fix this in base R: I write something like
wisc$state[12] <- "IL"
but how do you do this the Tidyverse way?
The first step is to make a small dataframe with the cities that need to be corrected, and the states they are actually in:
corrections <- tribble(
~location, ~state,
"Dubuque", "IA",
"St.Paul", "MN",
"Chicago", "IL"
)
corrections
# A tibble: 3 × 2
location state
<chr> <chr>
1 Dubuque IA
2 St.Paul MN
3 Chicago IL
Note that the columns of this dataframe have the same names as the ones in the original dataframe wisc
.
So, I was thinking, this is a lookup table (of a sort), and so joining this to wisc
might yield something helpful. We want to look up locations and not match states, since we want to have these three cities have their correct state as a possibility. So what does this do?
wisc %>%
left_join(corrections, by = "location")
# A tibble: 12 × 3
location state.x state.y
<chr> <chr> <chr>
1 Appleton WI <NA>
2 Beloit WI <NA>
3 Fort.Atkinson WI <NA>
4 Madison WI <NA>
5 Marshfield WI <NA>
6 Milwaukee WI <NA>
7 Monroe WI <NA>
8 Superior WI <NA>
9 Wausau WI <NA>
10 Dubuque WI IA
11 St.Paul WI MN
12 Chicago WI IL
Now, we have two states for each city. The first one is always Wisconsin, and the second one is usually missing, but where the state in state.y
has a value, that is the true state of the city. So, the thought process is that the actual state
should be:
state.y
is not missing, use thatstate.x
.I had an idea that there was a function that would do exactly this, only I couldn’t remember its name, so I couldn’t really search for it. My first thought was na_if
. What this does is every time it sees a certain value, it replaces it with NA. This, though, is the opposite way from what I wanted. So I looked at the See Also, and saw replace_na
. This replaces NAs with a given value. Not quite right, but closer.
In the See Also for replace_na
, I saw one more thing: coalesce
, “replace NAs with values from other vectors”. Was that what I was thinking of? It was. The way it works is that you feed it several vectors, and the first one that is not missing gives its value to the result. Hence, what I needed was this:
wisc %>%
left_join(corrections, by = "location") %>%
mutate(state=coalesce(state.y, state.x))
# A tibble: 12 × 4
location state.x state.y state
<chr> <chr> <chr> <chr>
1 Appleton WI <NA> WI
2 Beloit WI <NA> WI
3 Fort.Atkinson WI <NA> WI
4 Madison WI <NA> WI
5 Marshfield WI <NA> WI
6 Milwaukee WI <NA> WI
7 Monroe WI <NA> WI
8 Superior WI <NA> WI
9 Wausau WI <NA> WI
10 Dubuque WI IA IA
11 St.Paul WI MN MN
12 Chicago WI IL IL
Where state.y
has a value, it is used; if it’s missing, the value in state.x
is used instead.
I was quite pleased with myself for coming up with this, but I had missed the actual best way of doing this. In SQL, there is UPDATE, and what that does is to take a table of keys to look up and some new values for other columns to replace the ones in the original table. Because dplyr
has a lot of things in common with SQL, it is perhaps no surprise that there is a rows_update
, and for this job it is as simple as this:
wisc %>%
rows_update(corrections) -> wisc
wisc
# A tibble: 12 × 2
location state
<chr> <chr>
1 Appleton WI
2 Beloit WI
3 Fort.Atkinson WI
4 Madison WI
5 Marshfield WI
6 Milwaukee WI
7 Monroe WI
8 Superior WI
9 Wausau WI
10 Dubuque IA
11 St.Paul MN
12 Chicago IL
The values to look up (the “keys”) are by default in the first column, which is where they are in corrections
. If they had not been, I would have used a by
in the same way as with a join.
Mind. Blown. (Well, my mind was, anyway.)
I said I wanted to draw a map with these cities on it. For that, I need to look up the longitude and latitude of these places, and for that, I need to glue the state onto the name of each city, to make sure I don’t look up the wrong one. It is perhaps easy to forget that unite
is the cleanest way of doing this, particularly if you don’t want the individual columns any more:
wisc %>% unite(where, c(location, state), sep = " ") -> wisc
wisc
# A tibble: 12 × 1
where
<chr>
1 Appleton WI
2 Beloit WI
3 Fort.Atkinson WI
4 Madison WI
5 Marshfield WI
6 Milwaukee WI
7 Monroe WI
8 Superior WI
9 Wausau WI
10 Dubuque IA
11 St.Paul MN
12 Chicago IL
The function geocode_OSM
from tmaptools
will find the longitude and latitude of a place. It expects one place as input, not a vector of placenames, so we will work rowwise
to geocode one at a time. (Using map
from purrr
is also an option.) The geocoder returns a list, which contains, buried a little deeply, the longitudes and latitudes:
wisc %>%
rowwise() %>%
mutate(ll = list(geocode_OSM(where))) -> wisc
wisc
# A tibble: 12 × 2
# Rowwise:
where ll
<chr> <list>
1 Appleton WI <named list [3]>
2 Beloit WI <named list [3]>
3 Fort.Atkinson WI <named list [3]>
4 Madison WI <named list [3]>
5 Marshfield WI <named list [3]>
6 Milwaukee WI <named list [3]>
7 Monroe WI <named list [3]>
8 Superior WI <named list [3]>
9 Wausau WI <named list [3]>
10 Dubuque IA <named list [3]>
11 St.Paul MN <named list [3]>
12 Chicago IL <named list [3]>
The column ll
is a list-column, and the usual way to handle these is to unnest
, but that isn’t quite right here:
wisc %>% unnest(ll)
# A tibble: 36 × 2
where ll
<chr> <named list>
1 Appleton WI <chr [1]>
2 Appleton WI <dbl [2]>
3 Appleton WI <bbox [4]>
4 Beloit WI <chr [1]>
5 Beloit WI <dbl [2]>
6 Beloit WI <bbox [4]>
7 Fort.Atkinson WI <chr [1]>
8 Fort.Atkinson WI <dbl [2]>
9 Fort.Atkinson WI <bbox [4]>
10 Madison WI <chr [1]>
# … with 26 more rows
Unnesting a list of three things produces three rows for each city. It would make more sense to have the unnesting go to the right and produce a new column for each thing in the list. The new tidyr
has a variant called unnest_wider
that does this:
wisc %>%
unnest_wider(ll)
# A tibble: 12 × 4
where query coords bbox
<chr> <chr> <list> <list>
1 Appleton WI Appleton WI <dbl [2]> <bbox [4]>
2 Beloit WI Beloit WI <dbl [2]> <bbox [4]>
3 Fort.Atkinson WI Fort.Atkinson WI <dbl [2]> <bbox [4]>
4 Madison WI Madison WI <dbl [2]> <bbox [4]>
5 Marshfield WI Marshfield WI <dbl [2]> <bbox [4]>
6 Milwaukee WI Milwaukee WI <dbl [2]> <bbox [4]>
7 Monroe WI Monroe WI <dbl [2]> <bbox [4]>
8 Superior WI Superior WI <dbl [2]> <bbox [4]>
9 Wausau WI Wausau WI <dbl [2]> <bbox [4]>
10 Dubuque IA Dubuque IA <dbl [2]> <bbox [4]>
11 St.Paul MN St.Paul MN <dbl [2]> <bbox [4]>
12 Chicago IL Chicago IL <dbl [2]> <bbox [4]>
The longitudes and latitudes we want are still hidden in a list-column, the one called coords
, so with luck, if we unnest that wider as well, we should be in business:
wisc %>%
unnest_wider(ll) %>%
unnest_wider(coords) -> wisc
wisc
# A tibble: 12 × 5
where query x y bbox
<chr> <chr> <dbl> <dbl> <list>
1 Appleton WI Appleton WI -88.4 44.3 <bbox [4]>
2 Beloit WI Beloit WI -89.0 42.5 <bbox [4]>
3 Fort.Atkinson WI Fort.Atkinson WI -88.8 42.9 <bbox [4]>
4 Madison WI Madison WI -89.4 43.1 <bbox [4]>
5 Marshfield WI Marshfield WI -90.2 44.7 <bbox [4]>
6 Milwaukee WI Milwaukee WI -87.9 43.0 <bbox [4]>
7 Monroe WI Monroe WI -89.6 42.6 <bbox [4]>
8 Superior WI Superior WI -92.1 46.6 <bbox [4]>
9 Wausau WI Wausau WI -89.6 45.0 <bbox [4]>
10 Dubuque IA Dubuque IA -90.7 42.5 <bbox [4]>
11 St.Paul MN St.Paul MN -93.1 44.9 <bbox [4]>
12 Chicago IL Chicago IL -87.6 41.9 <bbox [4]>
And now we are. x
contains the longitudes (negative for degrees west), and y
the latitudes (positive for degrees north).
The most enjoyable way to make a map in R is to use the leaflet
package. Making a map is a three-step process:
leaflet()
with the name of the dataframeaddTiles()
to get map tiles to draw the map withaddMarkers
) that look like Google map pins. Here also you associate the longs and lats with the columns they are in in your dataframe:leaflet(data = wisc) %>%
addTiles() %>%
addCircleMarkers(lng = ~x, lat = ~y)
The nice thing about Leaflet maps is that you can zoom, pan and generally move about in them. For example, you can zoom in to find out which city each circle represents.
For attribution, please cite this work as
Butler (2021, April 26). Ken's Blog: Correcting a dataframe. Retrieved from http://ritsokiguess.site/blogg/posts/2021-11-07-correcting-a-dataframe/
BibTeX citation
@misc{butler2021correcting, author = {Butler, Ken}, title = {Ken's Blog: Correcting a dataframe}, url = {http://ritsokiguess.site/blogg/posts/2021-11-07-correcting-a-dataframe/}, year = {2021} }