I have a bit of a specific problem of selecting rows in a data.table, and so far not managed to solve it. I have a dataset storing simulation results over a range of parameters. Columns in the dataset either contain parameters or result values, see code below ("p" for parameter columns and "v" for value columns.
# create dataset for demonstration
params <- expand.grid (seq(0,0.5,by=.1),
seq(1,10),
seq(100,105),
letters[1:4],
letters[10:14])
colnames(params) <- paste("p",1:5,sep="")
data <- data.table(cbind(params,runif(nrow(params)),rnorm(nrow(params))))
setnames(data, c(colnames(params),"v1","v2"))
I would now like to extract: for each p1, and for given values of p2 and p3,and for arbitrary values of p4, p5, the row where the value of v1 is minimal. Let np4 and np5 be the number of unique values of p4 and p5, for each unique p1 and given p2, p3, I would like to select among the np4*np5 rows where p1, p2, p3 match that one row where v1 is minimal. The desired output should then be a table with np1 rows selected from the original table, i.e. containing all variables the original did. I know how to select rows from a data.table, how to use expressions and "by", but I have not managed to put that all together to produce the desired result.
UPDATE: I found the answer. The trick was, how to select the optimal row within the subset created by "by? (Of course, there was already a built-in) solution:
np4 <- c("a", "b")
np5 <- c("m", "n")
ss2 <- data[ p4 %in% np4 & p5 %in% np5,
.SD[which(v1==min(v1)),],
by = "p1"]
From the data.table documentation:
.SD is a data.table containing the Subset of x's Data for each group, excluding any columns used in by (or keyby).