Un-counting

Why you would want to do the opposite of counting

Ken Butler http://ritsokiguess.site/blog
07-13-2019

Packages

library(tidyverse)

Introduction

You probably know about count, which tells you how many observations you have in each group:

d <- tribble(
  ~g, ~y,
  "a", 10,
  "a", 13,
  "a", 14, 
  "a", 14,
  "b", 6,
  "b", 7,
  "b", 9
)

There are four observations in group a and three in group b:

d %>% count(g) -> counts
counts
# A tibble: 2 × 2
  g         n
  <chr> <int>
1 a         4
2 b         3

I didn’t know about this until fairly recently. Until then, I thought you had to do this:

d %>% group_by(g) %>% 
  summarize(count=n()) 
# A tibble: 2 × 2
  g     count
  <chr> <int>
1 a         4
2 b         3

which works, but is a lot more typing.

Going the other way

The other day, I had the opposite problem. I had a table of frequencies, and I wanted to get it back to one row per observation (I was fitting a model in Stan, and I didn’t know how to deal with frequencies). I had no idea how you might do that (without something ugly like loops), and I was almost embarrassed to stumble upon this:

counts %>% uncount(n)
# A tibble: 7 × 1
  g    
  <chr>
1 a    
2 a    
3 a    
4 a    
5 b    
6 b    
7 b    
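As a quick check that uncount really is the reverse of count, here is a minimal sketch of the round trip, along with a base R equivalent that repeats row indices with rep. The counts table is rebuilt by hand so that the example runs on its own; the names long, round_trip, idx and long_base are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

# the table of counts from above, typed in by hand
counts <- tribble(
  ~g, ~n,
  "a", 4L,
  "b", 3L
)

# one row per observation, then count again: back where we started
long <- counts %>% uncount(n)
round_trip <- long %>% count(g)
round_trip

# base R equivalent of uncount: repeat each row index n times
idx <- rep(seq_len(nrow(counts)), times = counts$n)
long_base <- counts[idx, "g"]
long_base
```

The rep version also makes it clear what uncount is doing: each row is copied as many times as its count says.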

My situation was a bit less trivial than that. I had disease category counts of coal miners with different exposures to coal dust:

my_url <- "https://www.utsc.utoronto.ca/~butler/d29/miners-tab.txt"
miners0 <- read_table(my_url)
miners0
# A tibble: 8 × 4
  Exposure  None Moderate Severe
     <dbl> <dbl>    <dbl>  <dbl>
1      5.8    98        0      0
2     15      51        2      1
3     21.5    34        6      3
4     27.5    35        5      8
5     33.5    32       10      9
6     39.5    23        7      8
7     46      12        6     10
8     51.5     4        2      5

This needs tidying to get the frequencies all into one column:

miners0 %>% 
  gather(disease, freq, -Exposure) -> miners
miners
# A tibble: 24 × 3
   Exposure disease   freq
      <dbl> <chr>    <dbl>
 1      5.8 None        98
 2     15   None        51
 3     21.5 None        34
 4     27.5 None        35
 5     33.5 None        32
 6     39.5 None        23
 7     46   None        12
 8     51.5 None         4
 9      5.8 Moderate     0
10     15   Moderate     2
# … with 14 more rows
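In tidyr 1.0.0 and later, pivot_longer is the successor to gather. Here is a sketch of the same tidying step, assuming such a version is installed. The table is typed in from the output above rather than read from the URL, so that the example is self-contained, and the name miners_long is made up for illustration.

```r
library(tidyr)
library(tibble)

# the miners table as displayed above, typed in by hand
miners0 <- tribble(
  ~Exposure, ~None, ~Moderate, ~Severe,
  5.8, 98, 0, 0,
  15, 51, 2, 1,
  21.5, 34, 6, 3,
  27.5, 35, 5, 8,
  33.5, 32, 10, 9,
  39.5, 23, 7, 8,
  46, 12, 6, 10,
  51.5, 4, 2, 5
)

# everything except Exposure goes into a disease / freq pair of columns
miners_long <- miners0 %>%
  pivot_longer(-Exposure, names_to = "disease", values_to = "freq")
miners_long
```

The rows come out in a different order from gather (all three disease categories for the first exposure, then the second, and so on), but the content is the same.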

So I wanted to fit an ordered logistic regression in Stan, predicting disease category from exposure, but I didn’t know how to handle the frequencies. If I had one row per miner, I thought…

miners %>% uncount(freq) %>% rmarkdown::paged_table()

… and so I do. (I scrolled down to check, and eventually got past the 98 miners with 5.8 years of exposure and no disease.)
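Scrolling is not the only way to check. A sketch of a programmatic version: the number of rows after uncount should equal the total of the freq column. The table is typed in from the output above so that the example stands alone. The .id argument of uncount is optional and adds a counter within each original row; the column name miner_in_group and the data frame name one_per_miner are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

# the miners table as displayed above, typed in by hand
miners0 <- tribble(
  ~Exposure, ~None, ~Moderate, ~Severe,
  5.8, 98, 0, 0,
  15, 51, 2, 1,
  21.5, 34, 6, 3,
  27.5, 35, 5, 8,
  33.5, 32, 10, 9,
  39.5, 23, 7, 8,
  46, 12, 6, 10,
  51.5, 4, 2, 5
)
miners <- miners0 %>% gather(disease, freq, -Exposure)

# one row per miner; rows with a frequency of zero disappear
one_per_miner <- miners %>% uncount(freq, .id = "miner_in_group")

# the number of rows should match the total frequency
nrow(one_per_miner) == sum(miners$freq)

# counting again recovers the non-zero frequencies
one_per_miner %>% count(Exposure, disease)
```

Note that combinations with a frequency of zero contribute no rows, so they do not reappear when you count again.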

From there, you can use this to fit the model, though I would rather have weakly informative priors for their beta and c. The cutpoints c are tricky, since they have to be ordered, but I used the idea here (near the bottom) and it worked smoothly.
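Stan's ordered logistic distribution expects the response as integer category codes 1, 2, …, K. Here is a sketch of how one-row-per-miner data might be packaged as a list to pass to Stan. To keep it short, it uses only the 15 and 21.5 exposure rows from the table above. The list element names N, K, x and y are assumptions for illustration, and would have to match whatever the data block of your Stan program declares; the ordering None < Moderate < Severe is also assumed here.

```r
library(dplyr)
library(tidyr)
library(tibble)

# two rows of the miners table from above, to keep the example short
miners0 <- tribble(
  ~Exposure, ~None, ~Moderate, ~Severe,
  15, 51, 2, 1,
  21.5, 34, 6, 3
)

# tidy, then un-count to get one row per miner
one_per_miner <- miners0 %>%
  gather(disease, freq, -Exposure) %>%
  uncount(freq)

# assumed ordering of the response categories, least to most severe
severity <- c("None", "Moderate", "Severe")

# hypothetical names; they must agree with the Stan data block
stan_data <- list(
  N = nrow(one_per_miner),
  K = length(severity),
  x = one_per_miner$Exposure,
  y = as.integer(factor(one_per_miner$disease, levels = severity))
)
str(stan_data)
```

Setting the factor levels explicitly matters here: left to itself, factor would sort the categories alphabetically, which is not the order of severity.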

Citation

For attribution, please cite this work as

Butler (2019, July 13). Ken's Blog: Un-counting. Retrieved from http://ritsokiguess.site/blogg/posts/2019-07-13-un-counting/

BibTeX citation

@misc{butler2019un-counting,
  author = {Butler, Ken},
  title = {Ken's Blog: Un-counting},
  url = {http://ritsokiguess.site/blogg/posts/2019-07-13-un-counting/},
  year = {2019}
}