Select data.table columns based on condition, within by

Question

I want to extract data.table columns if their contents fulfill a criteria. And I need a method that will work with by (or in some other way within combinations of columns). I am not very experienced with data.table and have tried my best with .SDcol and what else I could think of.

Example: I often have datasets with observations at multiple time points for multiple subjects. They also contain covariates which do not vary within subjects.

dt1 <- data.table(
    id=c(1,1,2,2,3,3),
    time=c(1,2,1,2,1,2),
    meas=c(452,23,555,33,322,32),
    age=c(30,30,54,54,20,20),
    bw=c(75,75,81,81,69,70)
)

How do I (efficiently) select the columns that do not vary within id (in this case, id and age)? I'd like a function call that would return

And how do I select the columns that do vary within ID (so drop age)? The function call should return:

   id time meas bw
1:  1    1  452 75
2:  1    2   23 75
3:  2    1  555 81
4:  2    2   33 81
5:  3    1  322 69
6:  3    2   32 70

Of course, I am interested if you know of a function that addresses the specific example above, but I am even more curious on how to do this generally. Columns that contain more than two values > 1000 within any combinations of id and time in by=.(id,time), or whatever...

Thanks!

chinsoon12 chinsoon12 · Accepted Answer · 2019-10-15T01:58:52

How do I (efficiently) select the columns that do not vary within id (in this case, id and age)?

Maybe something like:

f <- function(DT, byChar) {
    cols <- Reduce(intersect, DT[, .(.(names(.SD)[sapply(.SD, uniqueN)==1])), byChar]$V1)
    unique(DT[, c(byChar, cols), with=FALSE])
}
f(dt1, "id")

output:

   id age
1:  1  30
2:  2  54
3:  3  20

And how do I select the columns that do vary within ID (so drop age)?

Similarly,

f2 <- function(DT, byChar, k) {
    cols <- Reduce(intersect, DT[, .(.(names(.SD)[sapply(.SD, uniqueN)>k])), byChar]$V1)
    unique(DT[, c(byChar, cols), with=FALSE])
}
f2(dt1, "id", 1)

output:

   id time meas
1:  1    1  452
2:  1    2   23
3:  2    1  555
4:  2    2   33
5:  3    1  322
6:  3    2   32

data:

library(data.table)
dt1 <- data.table(
    id=c(1,1,2,2,3,3),
    time=c(1,2,1,2,1,2),
    meas=c(452,23,555,33,322,32),
    age=c(30,30,54,54,20,20),
    bw=c(75,75,81,81,69,70)
)

Select data.table columns based on condition, within by

3 Answers