0
votes

If A, B and C are columns in a data.table, dt, what I would like to do is: create a new column (D), which is the minimum of each group in C, filtered so that I only consider rows in the group where B is true. So far what I have is...

dt[, D := min(A[B == TRUE]), by = C]

This works and I get the result that I'm after, but it's pretty slow (my data set is a few million rows). Is there a faster, more elegant way to do it? I'm only interested in results using data.table, no dplyr please :)

2
Thanks, that's a good point! There's actually only 3-4 entries (on average) per group in C. I just tried making a new table and it's not really that much faster.astro person
You could shorten it to dt[, D := min(A[B]), by = C] and I can't imagine this taking a long time or being slow for a few million rows tbh.talat
Do you need to do this several times? If so, consider setting a key (C) on dt. Once this is done, it reduces the run time to approximately a third of the original time.talat
Brilliant, thanks. It's not super slow, but I'm quite new to R and data.table, so I thought I may not be calculating it in the best way possible. My example is also simplified compared to the actual code... I'm grouping my multiple columns, not just C, and I'm converting the end result to integer format.astro person
If you're doing a conversion to integer, do it in a separate step, since this doesn't need to be done by group. For example: dt[, D := min(A[B]), by = C][, D := as.integer(D)]talat

2 Answers

1
votes

There is an alternative approach which computes min() for each group after filtering and then does an update join. Missing groups will be set to NA, thereby avoiding the type conversion from integer to double.

# dummy data
dt <- data.table(A = rep(1:3, 3),
                 B = c(rep(c(FALSE, TRUE, TRUE), 2), rep(FALSE, 3)),
                 C = 10L * rep(1:3, each = 3))
dt[, A := A + C]
dt
    A     B  C
1: 11 FALSE 10
2: 12  TRUE 10
3: 13  TRUE 10
4: 21 FALSE 20
5: 22  TRUE 20
6: 23  TRUE 20
7: 31 FALSE 30
8: 32 FALSE 30
9: 33 FALSE 30
# all variables are integer or logical
str(dt)
Classes ‘data.table’ and 'data.frame':    9 obs. of  3 variables:
 $ A: int  11 12 13 21 22 23 31 32 33
 $ B: logi  FALSE TRUE TRUE FALSE TRUE TRUE ...
 $ C: int  10 10 10 20 20 20 30 30 30
 - attr(*, ".internal.selfref")=<externalptr>
dt[dt[(B), min(A), by = C], on = "C", D := V1][]
    A     B  C  D
1: 11 FALSE 10 12
2: 12  TRUE 10 12
3: 13  TRUE 10 12
4: 21 FALSE 20 22
5: 22  TRUE 20 22
6: 23  TRUE 20 22
7: 31 FALSE 30 NA
8: 32 FALSE 30 NA
9: 33 FALSE 30 NA
# all variables are still integer or logical
str(dt)
Classes ‘data.table’ and 'data.frame':    9 obs. of  4 variables:
 $ A: int  11 12 13 21 22 23 31 32 33
 $ B: logi  FALSE TRUE TRUE FALSE TRUE TRUE ...
 $ C: int  10 10 10 20 20 20 30 30 30
 $ D: int  12 12 12 22 22 22 NA NA NA
 - attr(*, ".internal.selfref")=<externalptr>

The expression

dt[(B), min(A), by = C]

returns the min values by group (if available)

    C V1
1: 10 12
2: 20 22

NB: I do not claim that this is faster than OP's approach. I cannot test it due to lack of a reproducible example which can be scaled in problem size for benchmarking.

0
votes

I ended up converting column A to a numeric type (from integer), so that empty groups return a minimum of NA, and then used pmin.int() instead of min(). As it turns out, this is about 4 times faster than my original method!