Find repeating groups in R data.table

Question

I need to identify and de-duplicate groups of records in an R data.table (but I suppose the issue would be the same in any programming language), structured like the example table created by the code at the end of this question.

Groups are identified by the values in var1 and var2. Two groups are duplicates if they have the same size and contain the same values in var2 and var3 (the values in var3 are what bigger groups identified by var1 and var2 have in common).

So in the example the 2 red groups (var1 = value1_1 and value1_3 in the example code) are duplicates, but the pair (red, blue) and the pair (red, brown) are not.

My solution consists of transposing the table to wide format, then running unique(dt[,var1:=NULL]) and transposing back to long format (I will not need var1 any longer at this point).

The problem is that my real table has 165,391,868 records and it's not a one-off task but a weekly one with similarly sized tables and limited time to do it.

I have tried splitting the table into chunks, appending them and then doing the de-duplication, but the first transpose has now been running for more than 2 hours!

Is there an alternative, faster solution? Thank you very much!

Code to create the example table:

library(data.table)

dt <- data.table(
var1=c(
    "value1_1",
    "value1_1",
    "value1_1",
    "value1_2",
    "value1_2",
    "value1_2",
    "value1_2",
    "value1_3",
    "value1_3",
    "value1_3",
    "value1_4",
    "value1_4",
    "value1_4",
    "value1_5",
    "value1_5",
    "value1_5",
    "value1_5"),
var2=c(
    "value2_1",
    "value2_1",
    "value2_1",
    "value2_1",
    "value2_1",
    "value2_1",
    "value2_1",
    "value2_1",
    "value2_1",
    "value2_1",
    "value2_1",
    "value2_1",
    "value2_1",
    "value2_1",
    "value2_1",
    "value2_1",
    "value2_1"),
var3=c(
    "value3_1",
    "value3_2",
    "value3_3",
    "value3_2",
    "value3_4",
    "value3_5",
    "value3_6",
    "value3_1",
    "value3_2",
    "value3_3",
    "value3_1",
    "value3_2",
    "value3_4",
    "value3_1",
    "value3_2",
    "value3_3",
    "value3_5"))

What is your expected output? Keep rows only in green, blue and brown? — Ronak Shah
I expect lines in green, blue and brown and those in red but just once — user3645882
But the 2 red groups have different var1 values. How are they duplicates? — Ronak Shah
If you read the question he is saying the duplication is based on the var3 column and the size — Adam Waring
sorry, to be duplicates they also need the same value in var2 — user3645882

chinsoon12 · Accepted Answer · 2020-04-28T22:57:34

Here are 2 other options:

1) Collapsing var3 into a single value for joining

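# One row per (var1, var2) group, with its var3 values collapsed into a single string (column V1)
lu <- dt[, paste(var3, collapse=""), .(var1, var2)]

# Self-join on the collapsed string, then keep pairs of distinct var1 groups that share the same var2
samegrp <- lu[lu, on=.(V1)][
    var1!=i.var1 & var2==i.var2,
    .(var1=c(var11, var12), g=.GRP),
    .(var11=pmin(var1, i.var1), var12=pmax(var1, i.var1), var2)]

# Flag the rows of dt that belong to a repeated group (g stays NA for the others)
dt[samegrp, on=.(var1, var2), g := g]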

output:

        var1     var2     var3  g
 1: value1_1 value2_1 value3_1  1
 2: value1_1 value2_1 value3_2  1
 3: value1_1 value2_1 value3_3  1
 4: value1_2 value2_1 value3_2 NA
 5: value1_2 value2_1 value3_4 NA
 6: value1_2 value2_1 value3_5 NA
 7: value1_2 value2_1 value3_6 NA
 8: value1_3 value2_1 value3_1  1
 9: value1_3 value2_1 value3_2  1
10: value1_3 value2_1 value3_3  1
11: value1_4 value2_1 value3_1 NA
12: value1_4 value2_1 value3_2 NA
13: value1_4 value2_1 value3_4 NA
14: value1_5 value2_1 value3_1 NA
15: value1_5 value2_1 value3_2 NA
16: value1_5 value2_1 value3_3 NA
17: value1_5 value2_1 value3_5 NA
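
To go from flagging the repeated groups to actually dropping them, the same collapsed-key idea can be used to keep one group per signature. The following is a minimal sketch and not part of the original answer: it assumes that var3 never contains the separator "|" (an arbitrary choice), it sorts var3 within each group so that row order does not affect the comparison, and the names sig, sig_str, keep and res are illustrative.

```r
library(data.table)

# Example table from the question, built compactly
dt <- data.table(
    var1 = rep(paste0("value1_", 1:5), times = c(3, 4, 3, 3, 4)),
    var2 = "value2_1",
    var3 = paste0("value3_", c(1, 2, 3,  2, 4, 5, 6,  1, 2, 3,  1, 2, 4,  1, 2, 3, 5)))

# One signature per (var1, var2) group: its sorted var3 values, joined with a
# separator that is assumed never to occur inside var3
sig <- dt[, .(sig_str = paste(sort(var3), collapse = "|")), by = .(var1, var2)]

# Keep the first group for each (var2, signature) combination
keep <- unique(sig, by = c("var2", "sig_str"))

# Retain only the rows of dt that belong to a kept group
res <- dt[keep[, .(var1, var2)], on = .(var1, var2), nomatch = 0L]
```

On the example table this drops the value1_3 group and leaves 14 of the 17 rows. Groups with identical signatures necessarily have the same size, so no separate size check is needed.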

2) Matching counts:

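setkey(dt, var1, var2, var3)
# Size of each (var1, var2) group
count <- dt[, .N, .(var1, var2)]

# Self-join on (var2, var3); for each pair of distinct var1 groups, N is the number
# of var3 values they share (each pair appears twice in the join, hence .N / 2)
matches <- dt[dt, on=.(var2, var3), allow.cartesian=TRUE, nomatch=0L][
    var1!=i.var1,
    .(N=.N / 2, g=.GRP),
    .(var11=pmin(i.var1, var1), var12=pmax(i.var1, var1), var2)]

# Keep only the pairs whose shared count equals the size of both groups
matches[count, on=.(var11=var1, var2, N), nomatch=0L][
    count, on=.(var12=var1, var2, N), nomatch=0L]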

output:

      var11    var12     var2 N g
1: value1_1 value1_3 value2_1 3 1

The 2nd method is more memory-intensive and hence might be slower, but actual performance depends on the characteristics of the dataset, e.g. the data types of the columns, the number of unique pairs of var1 and var2, the number of unique values of var3, etc.
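
One way to gauge the memory cost of the 2nd method before running it is to estimate the size of its self-join: every (var2, var3) combination that occurs n times contributes n^2 rows to the join. The following is a minimal sketch on the example table, not part of the original answer; the name join_rows is illustrative.

```r
library(data.table)

# Example table from the question, built compactly
dt <- data.table(
    var1 = rep(paste0("value1_", 1:5), times = c(3, 4, 3, 3, 4)),
    var2 = "value2_1",
    var3 = paste0("value3_", c(1, 2, 3,  2, 4, 5, 6,  1, 2, 3,  1, 2, 4,  1, 2, 3, 5)))

# Each (var2, var3) combination occurring N times yields N^2 rows in the
# self-join on (var2, var3); as.numeric avoids integer overflow on large tables
join_rows <- dt[, .N, by = .(var2, var3)][, sum(as.numeric(N)^2)]
```

On the example this gives 59 joined rows for a 17-row table. If the estimate for the real table is far beyond available memory, the collapsed-key approach in option 1 is likely the safer choice.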
