3
votes

Say we have two tables that share a common group indicator. For every row in DT1, we want to randomly select 2 rows from DT2 within the same group.

One possible way to do this would be to randomly generate an integer column in DT2 that lines up with the rows of DT1, and then join the two tables. But the number of rows varies by group, and it's not clear how to condition on that length within each group.

Minimal working example:

library(data.table)

DT1 <- data.table(var1=1:20,
                  group=c(1,1,1,1,1,2,2,2,2,3,3,3,4,4,4,4,4,4,4,4))
DT2 <- data.table(obs=1:13,
                  group=c(1,1,1,2,2,2,3,3,3,4,4,4,5))

View:

DT1
    var1 group
 1:    1     1
 2:    2     1
 3:    3     1
 4:    4     1
 5:    5     1
 6:    6     2
 7:    7     2
 8:    8     2
 9:    9     2
10:   10     3
11:   11     3
12:   12     3
13:   13     4
14:   14     4
15:   15     4
16:   16     4
17:   17     4
18:   18     4
19:   19     4
20:   20     4

DT2
    obs group
 1:   1     1
 2:   2     1
 3:   3     1
 4:   4     2
 5:   5     2
 6:   6     2
 7:   7     3
 8:   8     3
 9:   9     3
10:  10     4
11:  11     4
12:  12     4
13:  13     5

So for every row in DT1, I would like to pair it with 2 randomly selected rows from DT2 by group. The expected result might look something like:

DT3
    var1 group obs
 1:    1     1   1
 2:    1     1   3
 3:    2     1   2
 4:    2     1   3
...
37:   19     4  10
38:   19     4  11
39:   20     4  10
40:   20     4  12

In case it needs to be said, the actual application has 400 million rows in DT1 and 10 million in DT2.

2 Answers

3
votes

A direct translation of your requirements, sampling 2 distinct rows of DT2 for each row of DT1, is:

DT2[DT1, on=.(group), allow.cartesian=TRUE, .(var1, obs=obs[sample(.N, 2L)]), by=.EACHI]

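This might be faster. Note that it samples with replacement within each group, so a row of DT1 can be paired with the same obs twice, and it assumes DT1 is already sorted by group: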

gn <- DT1[, .(nsamp=2*.N), keyby=.(group)]
DT2[gn, on=.(group), .(obs=obs[sample(.N, nsamp, replace=TRUE)]), by=.EACHI][,
    var1 := rep(DT1$var1, each=2L)]

data:

set.seed(0L)
library(data.table)
DT1 <- data.table(var1=101:120, group=c(1,1,1,1,1,2,2,2,2,3,3,3,4,4,4,4,4,4,4,4))
DT2 <- data.table(obs=201:213, group=c(1,1,1,2,2,2,3,3,3,4,4,4,5))

sample output:

    group var1 obs
 1:     1  101 203
 2:     1  101 201
 3:     1  102 202
 4:     1  102 203
 5:     1  103 203
 6:     1  103 201
 7:     1  104 203
 8:     1  104 202
 9:     1  105 202
10:     1  105 203
11:     2  106 204
12:     2  106 206
13:     2  107 204
14:     2  107 205
15:     2  108 205
16:     2  108 206
17:     2  109 205
18:     2  109 206
19:     3  110 209
20:     3  110 207
21:     3  111 209
22:     3  111 208
23:     3  112 207
24:     3  112 208
25:     4  113 210
26:     4  113 212
27:     4  114 211
28:     4  114 210
29:     4  115 211
30:     4  115 212
31:     4  116 211
32:     4  116 210
33:     4  117 211
34:     4  117 210
35:     4  118 210
36:     4  118 211
37:     4  119 212
38:     4  119 211
39:     4  120 210
40:     4  120 211
    group var1 obs
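
For very large tables, the integer-column idea from the question can also be sketched as a plain equi-join with no per-row grouping. This is only a sketch under stated assumptions, not a benchmarked solution: it samples with replacement (so a row of DT1 may get the same obs twice), and the names k, DT2i, sizes, long, idx and n are made up for the illustration.

```r
library(data.table)
set.seed(42L)

DT1 <- data.table(var1 = 1:20,
                  group = c(1,1,1,1,1,2,2,2,2,3,3,3,4,4,4,4,4,4,4,4))
DT2 <- data.table(obs = 1:13,
                  group = c(1,1,1,2,2,2,3,3,3,4,4,4,5))

k <- 2L  # hypothetical parameter: rows of DT2 to draw per row of DT1

# Number the rows of DT2 within each group (on a copy, so DT2 is untouched)
# and record how many rows each group has.
DT2i  <- copy(DT2)[, idx := seq_len(.N), by = group]
sizes <- DT2i[, .(n = .N), by = group]

# Repeat every DT1 row k times and attach the size of its group in DT2.
long <- DT1[rep(seq_len(nrow(DT1)), each = k)]
long[sizes, on = .(group), n := i.n]

# Draw a random within-group row number in 1..n for every repeated row.
long[, idx := 1L + as.integer(floor(runif(.N) * n))]

# Look up the sampled DT2 rows with an ordinary join on (group, idx).
DT3 <- DT2i[long, on = .(group, idx), .(var1, group = i.group, obs)]
```

Groups that appear in DT1 but not in DT2 would end up with NA in obs here, since n and idx are NA for them.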
2
votes

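Here is one way using dplyr; I am sure it can be translated to data.table as well. The logic is: first count the rows in each group of DT1, then left_join DT2 onto DT1 by group, and finally use sample to randomly select n * 2 rows of the joined table within each group. Note that the draw is per group rather than per row of DT1, so a given var1 is not guaranteed to appear exactly twice (see var1 = 4 in the output below).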

library(dplyr)

DT1 %>%
   group_by(group) %>%
   mutate(n = n()) %>%
   left_join(DT2, by = "group") %>%
   group_by(group) %>%
   slice(sample(seq_len(n()), first(n) * 2, replace = TRUE)) %>%
   select(-n)

#    var1 group   obs
#   <int> <dbl> <int>
# 1     4     1     1
# 2     3     1     3
# 3     4     1     2
# 4     4     1     1
# 5     2     1     3
# 6     3     1     2
# 7     1     1     1
# 8     4     1     2
# 9     4     1     2
#10     3     1     2
# … with 30 more rows
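
If exactly two rows per row of DT1 are needed, one possible variant is to group by the DT1 row after the join and sample within it. This is a rough sketch that assumes var1 uniquely identifies a row of DT1 and that the joined table fits in memory, which is unlikely at the sizes mentioned in the question; res is a name made up for the example. Newer dplyr versions may warn about a many-to-many relationship in the join.

```r
library(dplyr)
set.seed(42L)

DT1 <- tibble(var1 = 1:20,
              group = c(1,1,1,1,1,2,2,2,2,3,3,3,4,4,4,4,4,4,4,4))
DT2 <- tibble(obs = 1:13,
              group = c(1,1,1,2,2,2,3,3,3,4,4,4,5))

res <- DT1 %>%
  # Pair every DT1 row with every DT2 row of the same group.
  inner_join(DT2, by = "group") %>%
  # Assumption: var1 uniquely identifies a row of DT1.
  group_by(var1) %>%
  # Keep 2 distinct DT2 rows for each DT1 row (sampling without replacement).
  slice_sample(n = 2) %>%
  ungroup()
```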