There is an alternative approach: compute min() for each group after filtering, then do an update join. Groups with no remaining rows are set to NA, which avoids the type conversion of the new column from integer to double.
# dummy data
dt <- data.table(A = rep(1:3, 3),
B = c(rep(c(FALSE, TRUE, TRUE), 2), rep(FALSE, 3)),
C = 10L * rep(1:3, each = 3))
dt[, A := A + C]
dt
A B C
1: 11 FALSE 10
2: 12 TRUE 10
3: 13 TRUE 10
4: 21 FALSE 20
5: 22 TRUE 20
6: 23 TRUE 20
7: 31 FALSE 30
8: 32 FALSE 30
9: 33 FALSE 30
# all variables are integer or logical
str(dt)
Classes ‘data.table’ and 'data.frame': 9 obs. of 3 variables:
$ A: int 11 12 13 21 22 23 31 32 33
$ B: logi FALSE TRUE TRUE FALSE TRUE TRUE ...
$ C: int 10 10 10 20 20 20 30 30 30
- attr(*, ".internal.selfref")=<externalptr>
dt[dt[(B), min(A), by = C], on = "C", D := V1][]
A B C D
1: 11 FALSE 10 12
2: 12 TRUE 10 12
3: 13 TRUE 10 12
4: 21 FALSE 20 22
5: 22 TRUE 20 22
6: 23 TRUE 20 22
7: 31 FALSE 30 NA
8: 32 FALSE 30 NA
9: 33 FALSE 30 NA
# all variables are still integer or logical
str(dt)
Classes ‘data.table’ and 'data.frame': 9 obs. of 4 variables:
$ A: int 11 12 13 21 22 23 31 32 33
$ B: logi FALSE TRUE TRUE FALSE TRUE TRUE ...
$ C: int 10 10 10 20 20 20 30 30 30
$ D: int 12 12 12 22 22 22 NA NA NA
- attr(*, ".internal.selfref")=<externalptr>
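For contrast, here is a sketch (reconstructing the same dt as above) of why the direct grouped min() coerces the new column to double: min() of an empty vector returns Inf, which is a double.

```r
library(data.table)

dt2 <- data.table(A = rep(1:3, 3) + 10L * rep(1:3, each = 3),
                  B = c(rep(c(FALSE, TRUE, TRUE), 2), rep(FALSE, 3)),
                  C = 10L * rep(1:3, each = 3))

# For C == 30 there is no TRUE value of B, so A[B] is integer(0) and
# min() returns Inf with a warning; Inf is double, so D becomes double
suppressWarnings(dt2[, D := min(A[B]), by = C])

class(dt2$D)  # "numeric" rather than "integer"
dt2$D         # 12 12 12 22 22 22 Inf Inf Inf
```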
The expression
dt[(B), min(A), by = C]
returns the minimum of A per group, but only for those groups which contain at least one row with B == TRUE:
C V1
1: 10 12
2: 20 22
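The same update join can be spelled out in two steps, which some may find easier to read. Naming the intermediate result agg and the aggregated column D are my own choices, not part of the original code:

```r
library(data.table)

dt <- data.table(A = rep(1:3, 3) + 10L * rep(1:3, each = 3),
                 B = c(rep(c(FALSE, TRUE, TRUE), 2), rep(FALSE, 3)),
                 C = 10L * rep(1:3, each = 3))

# step 1: group minima, computed only over rows where B is TRUE
agg <- dt[(B), .(D = min(A)), by = C]

# step 2: update join; groups absent from agg (here C == 30) remain NA
dt[agg, on = "C", D := i.D]

dt$D  # 12 12 12 22 22 22 NA NA NA (still integer)
```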
NB: I do not claim that this is faster than the OP's approach. I cannot test this for lack of a reproducible example which can be scaled up in problem size for benchmarking.
Comments:

dt[, D := min(A[B]), by = C] and I can't imagine this taking a long time or being slow for a few million rows tbh. – talat

dt[, D := min(A[B]), by = C][, D := as.integer(D)] – talat
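A benchmark could be sketched along the following lines; the sizes n_groups and rows_per_group are assumptions of mine, not taken from the original thread:

```r
library(data.table)

n_groups       <- 1e4L   # assumed problem size
rows_per_group <- 100L

set.seed(1)
n <- n_groups * rows_per_group
dt_big <- data.table(C = rep(seq_len(n_groups), each = rows_per_group),
                     A = sample.int(1e6L, n, replace = TRUE),
                     B = sample(c(TRUE, FALSE), n, replace = TRUE))

# update-join approach from this answer
t_join <- system.time(dt_big[dt_big[(B), min(A), by = C], on = "C", D := V1])

# OP's direct grouped min (suppressing warnings from any empty groups)
t_grp <- system.time(suppressWarnings(dt_big[, D2 := min(A[B]), by = C]))

print(t_join)
print(t_grp)

# both give the same minima wherever a group has at least one TRUE in B
all(dt_big$D == dt_big$D2, na.rm = TRUE)
```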