Ken's Blog: Un-counting

Ken Butler

Packages

library(tidyverse)

Introduction

You probably know about count, which tells you how many observations you have in each group:

d <- tribble(
  ~g, ~y,
  "a", 10,
  "a", 13,
  "a", 14, 
  "a", 14,
  "b", 6,
  "b", 7,
  "b", 9
)

There are four observations in group a and three in group b:

d %>% count(g) -> counts
counts

# A tibble: 2 × 2
  g         n
  <chr> <int>
1 a         4
2 b         3

I didn’t know about this until fairly recently. Until then, I thought you had to do this:

d %>% group_by(g) %>% 
  summarize(count=n())

# A tibble: 2 × 2
  g     count
  <chr> <int>
1 a         4
2 b         3

which works, but is a lot more typing.

Going the other way

The other day, I had the opposite problem. I had a table of frequencies, and I wanted to get it back to one row per observation (I was fitting a model in Stan, and I didn’t know how to deal with frequencies). I had no idea how you might do that (without something ugly like loops), and I was almost embarrassed to stumble upon this:

counts %>% uncount(n)

# A tibble: 7 × 1
  g    
  <chr>
1 a    
2 a    
3 a    
4 a    
5 b    
6 b    
7 b
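As a quick sanity check (a sketch, not part of the original analysis), counting the un-counted rows should give back the original table of frequencies. Note that the y column does not return: count discarded it, and uncount can only repeat the columns it is given. The name round_trip is introduced here for illustration.

```r
library(tidyverse)

# Same toy data as above
d <- tribble(
  ~g, ~y,
  "a", 10,
  "a", 13,
  "a", 14,
  "a", 14,
  "b", 6,
  "b", 7,
  "b", 9
)

counts <- d %>% count(g)

# uncount() makes n copies of each row; count() collapses them again
round_trip <- counts %>% uncount(n) %>% count(g)

# TRUE if the round trip reproduces the table of counts
all.equal(counts, round_trip)
```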

My situation was a bit less trivial than that. I had disease category counts of coal miners with different exposures to coal dust:

my_url <- "https://www.utsc.utoronto.ca/~butler/d29/miners-tab.txt"
miners0 <- read_table(my_url)
miners0

# A tibble: 8 × 4
  Exposure  None Moderate Severe
     <dbl> <dbl>    <dbl>  <dbl>
1      5.8    98        0      0
2     15      51        2      1
3     21.5    34        6      3
4     27.5    35        5      8
5     33.5    32       10      9
6     39.5    23        7      8
7     46      12        6     10
8     51.5     4        2      5

This needs tidying to get the frequencies all into one column:

miners0 %>% 
  gather(disease, freq, -Exposure) -> miners
miners

# A tibble: 24 × 3
   Exposure disease   freq
      <dbl> <chr>    <dbl>
 1      5.8 None        98
 2     15   None        51
 3     21.5 None        34
 4     27.5 None        35
 5     33.5 None        32
 6     39.5 None        23
 7     46   None        12
 8     51.5 None         4
 9      5.8 Moderate     0
10     15   Moderate     2
# … with 14 more rows
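In current versions of tidyr, gather is superseded by pivot_longer, so the same tidying could be sketched as below. To keep the example self-contained, it uses only the first three rows of the table, retyped from the printout above; miners0_head and miners_long are names made up for this illustration. The rows come out in a different order (all three disease categories for one exposure, then the next exposure), but the contents are the same.

```r
library(tidyverse)

# First three rows of miners0, retyped from the printed table above
miners0_head <- tribble(
  ~Exposure, ~None, ~Moderate, ~Severe,
  5.8,  98, 0, 0,
  15,   51, 2, 1,
  21.5, 34, 6, 3
)

# Everything except Exposure is a frequency: the column names go into
# `disease` and the values into `freq`
miners_long <- miners0_head %>%
  pivot_longer(-Exposure, names_to = "disease", values_to = "freq")
miners_long
```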

So I wanted to fit an ordered logistic regression in Stan, predicting disease category from exposure, but I didn’t know how to handle the frequencies. If I had one row per miner, I thought…

miners %>% uncount(freq) %>% rmarkdown::paged_table()

… and so I do. (I scrolled down to check, and eventually got past the 98 miners with 5.8 years of exposure and no disease).
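Rather than scrolling, one way to check the result is to compare the number of rows with the total frequency. This is a minimal sketch on four rows retyped from the long table above, under the made-up names miners_small and one_per_miner. It also shows that a combination with frequency zero disappears entirely after uncount, which is what you want with one row per miner but means that counting again will not bring that row back.

```r
library(tidyverse)

# A few rows of the long-format miners table, retyped from the output above
miners_small <- tribble(
  ~Exposure, ~disease, ~freq,
  5.8, "None",     98,
  15,  "None",     51,
  5.8, "Moderate", 0,
  15,  "Moderate", 2
)

one_per_miner <- miners_small %>% uncount(freq)

# One row per miner, so the row count should equal the total frequency
nrow(one_per_miner) == sum(miners_small$freq)

# Counting again recovers the frequencies, minus the zero-frequency row
recounted <- one_per_miner %>% count(Exposure, disease, name = "freq")
recounted

# The .id argument of uncount() numbers the copies within each original row
miners_small %>% uncount(freq, .id = "miner") %>% slice_tail(n = 2)
```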

From there, you can use this to fit the model, though I would rather have weakly informative priors for their beta and c. The cutpoints c are tricky, since they have to be ordered, but I used the idea here (near the bottom) and it worked smoothly.
