I have been working on matching values between two large data tables (around 25 million records each) based on unique and non-unique values. I posted a question yesterday on how to update values in a data.table using another data.table and the answers have made me wonder how I can improve the efficiency of this matching. In my example dt1
contains a non-unique area code and a type which can be from 1 to 10 (along with some blank columns). dt2
contains the same non-unique area codes (although not the same number as in dt1
), types from 1 to 10 and a unique ID (U_ID
).
What I would like to do is find the AREA_CD
and TYPE
rows in dt1
and dt2
that match and copy the U_ID
for those rows from dt2
to dt1
. The issue is that dt1 and dt2 do not have the same number of instances of each unique combination. For example AREA_CD
'A1' and TYPE
'1' occur 21 times in dt1
and only 20 times in dt2
. In these instances then the minimum number of rows (so 20 in this case) would have the match operation applied leaving 1 row in dt1
unmodified (if dt2
had more rows than dt1
then the number of rows in dt1
would be used).
Here is an example of my dataset. My actual datasets have around 25,000,000 rows and contain around 10,000 unique areas and types from 1 to 10.
require("data.table")
df1 <-data.frame(AREA_CD = c(rep("A1", 205), rep("A2", 145), rep("A3", 250), rep("A4", 100), rep("A5", 300)), TYPE = rep(1:10), ALLOCATED = 0, U_ID = 0, ID_CD = c(1:1000))
df1$ID_CD <- interaction( "ID", df1$ID_CD, sep = "")
df2 <-data.frame(U_ID = c(1:1000), AREA_CD = c(rep("A1", 200), rep("A2", 155), rep("A3", 245), rep("A4", 90), rep("A5", 310)), TYPE = rep(1:10), ASSIGNED = 0)
df2$U_ID <- interaction( "U", df2$U_ID, sep = "")
dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)
The output I am looking for would look something like this:
for(o in 1:5){
Ao <- paste("A",o,sep="")
for(i in 1:10){
R.Num <- min(nrow(with(df1, df1[AREA_CD == Ao & TYPE == i ,])), nrow(with(df2, df2[AREA_CD == Ao & TYPE == i ,])))
df1[df1$AREA_CD == Ao & df1$TYPE == i,][1:R.Num,"U_ID"] <- as.character(df2[df2$AREA_CD == Ao & df2$TYPE == i,][1:R.Num,"U_ID"])
}}
I hope that makes sense.