I have a data.table with multiple columns. One of these columns currently works as a 'key' (keyb for the example). Another column (let's say A) may or may not have data in it. I would like to supply a vector of keys and randomly sample two rows per key, if that key appears in the vector: one row that contains data in A and one that does not.
MRE:
library(data.table)
#data.table
trys <- structure(list(keyb = c("x", "x", "x", "x", "x", "y", "y", "y",
"y", "y"), A = c("1", "", "1", "", "", "1", "", "", "1", "")), .Names = c("keyb",
"A"), row.names = c(NA, -10L), class = c("data.table", "data.frame"
))
setkey(trys,keyb)
#list with keys
list_try <- structure(list(a = "x", b = c("r", "y","x")), .Names = c("a", "b"))
I could, for instance, subset the data.table based on the elements that appear in list_try:
trys[keyb %in% list_try[[2]]]
My original (and probably inefficient) idea was to chain a sample of two rows per key, once for rows where the A column has data and once for rows where it does not, and then merge the results. But it does not work:
#here I was trying to sample rows based on whether A has data or not
#here for rows where A has no data
trys[keyb %in% list_try[[2]]][nchar(A)==0][sample(.N, 2), ,by = keyb]
#here for rows where A has data
trys[keyb %in% list_try[[2]]][nchar(A)==1][sample(.N, 2), ,by = keyb]
In this case, my expected output would be two data.tables (one for a and one for b in list_try), with two rows per matching key. So the data.table from a would have two rows (one with and one without data in A), and the one from b would have four rows (two with and two without data in A).
Please let me know if I can make this post any clearer.
trys[list_try[[2]], nomatch = 0L, sample(.I, 1L), by = .(keyb, A)] ? V1 is the sampled row index – David Arenburg

lapply(list_try, function(x) trys[x, nomatch = 0L, sample(.I, 1L), by = .(keyb, A)]) – David Arenburg

Extract $V1 and then subset your data according to that index, as in trys[trys[list_try[[2]], nomatch = 0L, sample(.I, 1L), by = .(keyb, A)]$V1], for instance. Also, how many unique values do you have in A in your real data? If you have more than 2 unique values in A, you could modify this to trys[list_try[[2]], nomatch = 0L, sample(.I, 1L), by = .(keyb, A != "")]$V1 – David Arenburg
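
Putting the comments together, here is a minimal sketch of one way to get the two data.tables described in the question. It is an illustration under stated assumptions, not a definitive answer: the helper name sample_pairs and the grouping column has_data are hypothetical, and "has data" is assumed to mean A != "". It subsets first and indexes the subset, and it uses .I[sample.int(.N, 1L)] instead of sample(.I, 1L), because sample() on a single number n draws from 1:n, which could return a wrong row when a group has only one row.

```r
library(data.table)

# Same data as the MRE, built with data.table() directly
trys <- data.table(
  keyb = rep(c("x", "y"), each = 5),
  A    = c("1", "", "1", "", "", "1", "", "", "1", "")
)
list_try <- list(a = "x", b = c("r", "y", "x"))

# Hypothetical helper: for every key in `keys` that exists in `dt`,
# return one random row with data in A and one random row without.
sample_pairs <- function(keys, dt) {
  matched <- dt[keyb %in% keys]  # keys absent from dt (e.g. "r") drop out
  # One row index per (key, has-data) group; indices refer to `matched`
  idx <- matched[, .I[sample.int(.N, 1L)],
                 by = .(keyb, has_data = A != "")]$V1
  matched[sort(idx)]
}

set.seed(1)  # only to make the illustration reproducible
res <- lapply(list_try, sample_pairs, dt = trys)
```

Here res$a should hold two rows for key x, and res$b four rows (two each for x and y); which rows are drawn depends on the seed.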