Ken's Blog: Random sampling from groups

Ken Butler

Packages

library(tidyverse)

Introduction

In a previous post, I discussed how we might sample in groups, where each group was a sample from a different population. I introduced this function:

gen_sample <- function(n, mean, sd) {
  tibble(gp = c("x", "y"), n = n, mean = mean, sd = sd) %>% 
    rowwise() %>% 
    mutate(z = list(rnorm(n, mean, sd))) %>% 
    unnest(z) %>% 
    select(gp, z)
}

that samples from normal populations with possibly different means, SDs, and sample sizes in different groups.

Explanation

This is (more or less) the same explanation that appeared at the end of the previous post, so feel free to skip if you have read it before.

The first step is to make a data frame with one row for each sample that will be generated. This uses the inputs to the function above, so we will make some up:

n <- c(5, 3)
mean <- c(20, 10)
sd <- c(2, 1)
tibble(gp = c("x", "y"), n = n, mean = mean, sd = sd)

# A tibble: 2 × 4
  gp        n  mean    sd
  <chr> <dbl> <dbl> <dbl>
1 x         5    20     2
2 y         3    10     1

Evidently, in a function for public consumption, you would check that all the inputs are the same length, or you would rely on tibble telling you that only vectors of length 1 are recycled.¹ The groups are for no good reason called x and y.

The next two lines generate random samples, one for each group, according to the specifications, and store them each in one cell of the two-row spreadsheet:

  tibble(gp = c("x", "y"), n = n, mean = mean, sd = sd) %>% 
    rowwise() %>% 
    mutate(z = list(rnorm(n, mean, sd)))

# A tibble: 2 × 5
# Rowwise: 
  gp        n  mean    sd z        
  <chr> <dbl> <dbl> <dbl> <list>   
1 x         5    20     2 <dbl [5]>
2 y         3    10     1 <dbl [3]>

The new column z is a list column, since the top cell of the column is a vector of length 5, and the bottom cell is a vector of length 3. To actually see the values they contain, we unnest z:

  tibble(gp = c("x", "y"), n = n, mean = mean, sd = sd) %>% 
    rowwise() %>% 
    mutate(z = list(rnorm(n, mean, sd))) %>% 
    unnest(z)

# A tibble: 8 × 5
  gp        n  mean    sd     z
  <chr> <dbl> <dbl> <dbl> <dbl>
1 x         5    20     2 20.1 
2 x         5    20     2 19.4 
3 x         5    20     2 20.0 
4 x         5    20     2 21.5 
5 x         5    20     2 20.2 
6 y         3    10     1  9.71
7 y         3    10     1 12.2 
8 y         3    10     1  9.18

and, finally, the middle three columns were only used to generate the values in z, so they can be thrown away now by selecting only gp and z.

The rowwise is necessary:

  tibble(gp = c("x", "y"), n = n, mean = mean, sd = sd) %>% 
    mutate(z = list(rnorm(n, mean, sd))) %>% 
    unnest(z)

# A tibble: 4 × 5
  gp        n  mean    sd     z
  <chr> <dbl> <dbl> <dbl> <dbl>
1 x         5    20     2  20.2
2 x         5    20     2  10.8
3 y         3    10     1  20.2
4 y         3    10     1  10.8

because rnorm is vectorized, and for the x sample, R will draw one sampled value from each normal distribution, and then repeat the same values for the y sample. This is very much not what we want.

The same idea can be used to draw random chi-squared data (say):

tibble(df = c(2,6)) %>% 
  rowwise() %>% 
  mutate(z = list(rchisq(5, df))) %>% 
  unnest(z)

# A tibble: 10 × 2
      df     z
   <dbl> <dbl>
 1     2  3.43
 2     2  4.37
 3     2  3.91
 4     2  2.57
 5     2  6.28
 6     6  2.09
 7     6  7.43
 8     6 11.2 
 9     6 11.1 
10     6 12.3

(five values from \(\chi^2_2\), followed by five from \(\chi^2_6\).)

This suggests that I ought to be able to generalize my function gen_sample. Generalizing to any number of groups needs no extra work: the length of the input n determines the number of groups, and the values in n determine the size of each of those groups.

The interesting generalization is the distribution to sample from. The first parameter of the functions rnorm, rchisq etc. is always the number of random values to generate, but the remaining parameters are different for each distribution. This suggests that my generalized random sample generator ought to have the name of the random sampling function as input, followed by some means to allow any other inputs needed by that sampling function.

Generalizing

To generalize gen_sample to a new function sample_groups, we need to consider how to handle different distributions. The distribution itself is not the hard part; that can be specified by having the random sample generator function for the desired distribution as an input to the new function. The problem is that each distribution has different parameters, which need to be inputs to sample_groups.

The standard way of doing this kind of thing is to use the ... input to a function. If I had just one group, it would go like this:

sample_group <- function(n, dist, ...) {
  dist(n, ...)
}

and then

sample_group(5, rnorm, mean = 10, sd = 3)

[1]  7.252005  9.075549  7.826410  6.479892 14.855702

sample_group(6, rpois, lambda = 2)

[1] 1 2 1 0 0 1

The additional (named) inputs to sample_group are passed on unchanged to the random sample generator, to generate respectively normal random values with mean 10 and SD 3, and Poisson random values with mean 2.²

The random sample generators are vectorized, but the obvious thing for generating two samples from different distributions does not work:

sample_group(c(6, 3), rnorm, mean = c(10, 20), sd = c(3, 2))

[1] 11.42518 22.38107

We appear to have one value from each distribution, not six from the first and three from the second. This, I think, is what the help files say will happen.

To allow for the stuff at the end of the call to be different, another way is to use a list to pass each distribution’s parameters. This turns out to be what I do later.

A bad approach

Let’s suppose I ask the user to write the code to generate each sample as text (a vector of pieces of text, one for each sample). Here’s how my example above would look:

code <- c("rnorm(5, mean = 10, sd = 3)", 
          "rpois(6, lambda = 2)")
code

[1] "rnorm(5, mean = 10, sd = 3)" "rpois(6, lambda = 2)"

The problem is that this is text, not runnable code. One way to turn this into something useful is to parse it:

pc <- parse(text = code[1])
pc

expression(rnorm(5, mean = 10, sd = 3))

This has turned the text into an expression, something that can be evaluated, thus:

eval(pc)

[1] 13.037666  4.466370 11.621363 12.523617  8.222136

And now we have a strategy:

tibble(code) %>% 
  rowwise() %>% 
  mutate(expr = list(parse(text = code))) %>%
  mutate(z = list(eval(expr))) %>% 
  unnest(z)

# A tibble: 11 × 3
   code                        expr             z
   <chr>                       <list>       <dbl>
 1 rnorm(5, mean = 10, sd = 3) <expression>  9.61
 2 rnorm(5, mean = 10, sd = 3) <expression>  2.61
 3 rnorm(5, mean = 10, sd = 3) <expression>  7.55
 4 rnorm(5, mean = 10, sd = 3) <expression>  8.24
 5 rnorm(5, mean = 10, sd = 3) <expression>  9.97
 6 rpois(6, lambda = 2)        <expression>  1   
 7 rpois(6, lambda = 2)        <expression>  3   
 8 rpois(6, lambda = 2)        <expression>  0   
 9 rpois(6, lambda = 2)        <expression>  1   
10 rpois(6, lambda = 2)        <expression>  3   
11 rpois(6, lambda = 2)        <expression>  2

So now we have the ingredients for a version of sample_groups based on the user writing the random-sampling code for us. I added one extra thing: lettering the groups, since they otherwise have bad names:

sample_groups <- function(code) {
  n_pop <- length(code)
  tibble(code, gp = letters[1:n_pop]) %>% 
    rowwise() %>% 
    mutate(expr = list(parse(text = code))) %>%
    mutate(z = list(eval(expr))) %>% 
    unnest(z) %>% 
    select(gp, z)
}

and to test:

d <- sample_groups(code)
d

# A tibble: 11 × 2
   gp        z
   <chr> <dbl>
 1 a      9.93
 2 a      8.02
 3 a     11.1 
 4 a     11.7 
 5 a      7.08
 6 b      6   
 7 b      3   
 8 b      1   
 9 b      3   
10 b      1   
11 b      3

Using the eval(parse(text = something)) idea is not (apparently) very well regarded.³ One immediate problem is that the user could put any code at all (that evaluates into a vector of numbers) into the input code, which seems less than secure:

code <- c("1:3", "mtcars$mpg")
sample_groups(code)

# A tibble: 35 × 2
   gp        z
   <chr> <dbl>
 1 a       1  
 2 a       2  
 3 a       3  
 4 b      21  
 5 b      21  
 6 b      22.8
 7 b      21.4
 8 b      18.7
 9 b      18.1
10 b      14.3
# … with 25 more rows

A better way

I want to get back to the user inputting the desired random sample generators as functions, and then running those functions on the rest of the input. This is what do.call does:

do.call(rnorm, list(n = 5, mean = 10, sd = 3))

[1] 11.787958  8.725093  8.080057  8.224860  7.293048

do.call(rpois, list(n = 6, lambda = 2))

[1] 2 2 1 2 1 2

Having realized (i) that do.call is what I wanted, and (ii) that the input parameters to the functions need to be in a list, I packaged up those distribution parameters into a list of lists first. It is actually not necessary to make the list of distributions into a list, but it works if you do that too:

dist <- list(rnorm, rpois)
pars <- list(list(n = 5, mean = 10, sd = 3), 
             list(n = 6, lambda = 2))
d <- tibble(dist = dist, pars = pars)
d

# A tibble: 2 × 2
  dist   pars            
  <list> <list>          
1 <fn>   <named list [3]>
2 <fn>   <named list [2]>

and then we put the do.call in a rowwise mutate, wrapping the whole thing in a list to make a list-column:

d %>% 
  rowwise() %>% 
  mutate(z = list(do.call(dist, pars))) %>% 
  unnest(z)

# A tibble: 11 × 3
   dist   pars                 z
   <list> <list>           <dbl>
 1 <fn>   <named list [3]>  7.68
 2 <fn>   <named list [3]> 11.8 
 3 <fn>   <named list [3]>  9.27
 4 <fn>   <named list [3]>  5.88
 5 <fn>   <named list [3]> 10.4 
 6 <fn>   <named list [2]>  2   
 7 <fn>   <named list [2]>  1   
 8 <fn>   <named list [2]>  3   
 9 <fn>   <named list [2]>  0   
10 <fn>   <named list [2]>  0   
11 <fn>   <named list [2]>  0

it works!

And so, we can now make our function:

sample_groups <- function(dist, pars) {
  nr <- length(pars)
  tibble(dist = dist, pars = pars, gp = letters[1:nr]) %>% 
    rowwise() %>% 
    mutate(z = list(do.call(dist, pars))) %>% 
    unnest(z) %>% 
    select(gp, z)
}

and to test

dists <- list(rnorm, rpois)
pars <- list(list(n = 5, mean = 10, sd = 3), list(n = 6, lambda = 2))
sample_groups(dists, pars)

# A tibble: 11 × 2
   gp        z
   <chr> <dbl>
 1 a     16.4 
 2 a     12.0 
 3 a      9.95
 4 a     10.7 
 5 a     10.7 
 6 b      0   
 7 b      0   
 8 b      5   
 9 b      2   
10 b      3   
11 b      3

The only weirdness is that the user has to specify a list of lists for the parameters (because do.call needs a list for inputs to its function). But it definitely works.

One shortcut is that if you want all the samples to be from the same distribution, you specify only one thing in the input that I called dists:⁴

dist <- list(rnorm)
new_pars <- list(list(n = 3, mean = 5, sd = 1),
                 list(n = 2, mean = 10, sd = 2),
                 list(n = 4, mean = 20, sd = 3))
sample_groups(dist, pars = new_pars)

# A tibble: 9 × 2
  gp        z
  <chr> <dbl>
1 a      6.79
2 a      3.22
3 a      3.39
4 b     12.6 
5 b      9.58
6 c     24.4 
7 c     19.2 
8 c     22.2 
9 c     20.3

So, for example, if both your sample sizes are the same, you could define eg n <- 10 and it would get expanded to length 2 in the function.↩︎
It may take a trip to the help files to find out what R calls these parameters.↩︎
There seems to be a recurring theme on Stack Overflow that if eval(parse()) is the answer, you are asking the wrong question.↩︎
This is where the list() is important: there is no problem having a list-column of functions, but you cannot have a column which is just a function.↩︎

Random sampling from groups