sample data.table rows with different conditions

Question

I have a data.table with multiple columns. One of these columns currently works as a 'key' (keyb for the example). Another column (let's say A), may or may not have data in it. I would like to supply a vector that randomly sample two rows per key, -if this key appears in the vector, where 1 row contains data in A, while the other does not.

MRE:

#data.table
trys <- structure(list(keyb = c("x", "x", "x", "x", "x", "y", "y", "y", 
"y", "y"), A = c("1", "", "1", "", "", "1", "", "", "1", "")), .Names = c("keyb", 
"A"), row.names = c(NA, -10L), class = c("data.table", "data.frame"
))
setkey(trys,keyb)

#list with keys
list_try <- structure(list(a = "x", b = c("r", "y","x")), .Names = c("a", "b"))

I could, for instance subset the data.table based on the elements that appear in list_try:

trys[keyb %in% list_try[[2]]]

My original (and probably inefficient idea), was to try to chain a sample of two rows per key, where the A column has data or no data, and then merge. But it does not work:

#here I was trying to sample rows based on whether A has data or not
#here for rows where A has no data
trys[keyb %in% list_try[[2]]][nchar(A)==0][sample(.N, 2), ,by = keyb]
#here for rows where A has data
trys[keyb %in% list_try[[2]]][nchar(A)==1][sample(.N, 2), ,by = keyb]

In this case, my expected output would be two data.tables (one for a and one for b in list_try), of two rows per appearing element: So the data.table from a would have two rows (one with and without data in A), and the one from b, four rows (two with and two without data in A).

Please let me know if I can make this post any clearer

Not sure, but maybe this trys[list_try[[2]], nomatch = 0L, sample(.I, 1L), by = .(keyb, A)]? V1 is the sampled row index — David Arenburg
For two data sets, maybe lapply(list_try, function(x) trys[x, nomatch = 0L, sample(.I, 1L), by = .(keyb, A)]) — David Arenburg
@DavidArenburg this is working for my sample dataset, I'll adapt the comment for my actual dataset. It's dropping all of the other columns (with relevant information), and for some reason, it gives me a sample of two rows for the column with data, instead of 1 — erasmortg
If you want to keep the other columns, just get $V1 and then subset your data according to that index as in trys[trys[list_try[[2]], nomatch = 0L, sample(.I, 1L), by = .(keyb, A)]$V1], for instance. Also, how many unique values you have in A in your real data? If you have more than 2 unique values in A, you could modify this to trys[list_try[[2]], nomatch = 0L, sample(.I, 1L), by = .(keyb, A != "")]$V1 — David Arenburg

David Arenburg David Arenburg · Accepted Answer · 2016-03-07T18:38:20

You could add A to the by statement too, while converting it to a binary vector by modifying to A != "", combine with a binary join (while adding nomatch = 0L in order to remove non-matches) you could then sample from the row index .I by those two aggregators and then subset from the original data set

For a single subset case

trys[trys[list_try[[2]], nomatch = 0L, sample(.I, 1L), by = .(keyb, A != "")]$V1]
#    keyb A
# 1:    y 1
# 2:    y  
# 3:    x 1
# 4:    x

For a more general case, when you want to create separate data sets according to a list of keys, you could easily embed this into lapply

lapply(list_try, 
       function(x) trys[trys[x, nomatch = 0L, sample(.I, 1L), by = .(keyb, A != "")]$V1]) 
# $a
# keyb A
# 1:    x 1
# 2:    x  
# 
# $b
# keyb A
# 1:    y 1
# 2:    y  
# 3:    x 1
# 4:    x

sample data.table rows with different conditions

1 Answers