I need to identify and de-duplicate groups of records in an r data table (but I suppose the issue would be the same in any programming language), structured like the following:
Groups are identified by the values in var1 and var2 and they are duplicates if they have the same size and contain the same values in var2 and var3 (the values in var3 are what bigger groups identified by var1 and var2 have in common).
So in the example the 2 red groups are duplicates, but the pair (red,blue) and the pair (red,brown) are not.
My solution consists in transposing the table to wide format
and then do unique(dt[,var1:=NULL])
and transpose back to long format (I will not need var1 any longer at this point).
The problem is that my real table has 165,391,868 records and it's not a one-off task but a weekly one with similarly sized tables and limited time to do it.
I have tried splitting the table into chunks, appending them and then do the de-duplication but the first transpose has now been running for more than 2h!
Any alternative and fastest solution? Thank you very much!
Code to create the example table:
dt <- data.table(
var1=c(
"value1_1",
"value1_1",
"value1_1",
"value1_2",
"value1_2",
"value1_2",
"value1_2",
"value1_3",
"value1_3",
"value1_3",
"value1_4",
"value1_4",
"value1_4",
"value1_5",
"value1_5",
"value1_5",
"value1_5"),
var2=c(
"value2_1",
"value2_1",
"value2_1",
"value2_1",
"value2_1",
"value2_1",
"value2_1",
"value2_1",
"value2_1",
"value2_1",
"value2_1",
"value2_1",
"value2_1",
"value2_1",
"value2_1",
"value2_1",
"value2_1"),
var1=c(
"value3_1",
"value3_2",
"value3_3",
"value3_2",
"value3_4",
"value3_5",
"value3_6",
"value3_1",
"value3_2",
"value3_3",
"value3_1",
"value3_2",
"value3_4",
"value3_1",
"value3_2",
"value3_3",
"value3_5"))
var1
values. How are they duplicates? – Ronak Shah