data.table assign value to unique observation

Question

Some example data:

library(data.table)

mydat <- data.table(id1=rep(c("A","B","C"),each=3),
                    id2=c("D","E","G", "D","E","F","G","E","D"),
                    val=c(1,2,4,1,2,3, 4,2,1))

Which gives

   id1 id2 val
1:   A   D   1
2:   A   E   2
3:   A   G   4
4:   B   D   1
5:   B   E   2
6:   B   F   3
7:   C   G   4
8:   C   E   2
9:   C   D   1

My goal is to get unique values of id2,val and then generate a variable that depends upon the unique values (e.g. the sum across unique observations as below). This variable should then be put into a column in the original data.table. I often find myself writing code like the following:

## This is the most obvious way
tmp <- unique(mydat[,.(id2,val)])
tmp[,weight:=val/sum(val)]
tmp[,val:=NULL]
mydat <- merge(mydat,tmp,by="id2",all.x=TRUE)

## A second option which doesn't require merging
mydat[,first:=FALSE]
mydat[mydat[,.I[1],by=.(id2)]$V1,first:=TRUE]
mydat[first==TRUE,weight2:=val/sum(val)]
mydat[,weight2:=max(weight,na.rm = TRUE),by=.(id2)]
mydat[,first:=NULL]

This gives

   id2 id1 val weight weight2
1:   D   A   1    0.1     0.1
2:   D   B   1    0.1     0.1
3:   D   C   1    0.1     0.1
4:   E   A   2    0.2     0.2
5:   E   B   2    0.2     0.2
6:   E   C   2    0.2     0.2
7:   F   B   3    0.3     0.3
8:   G   A   4    0.4     0.4
9:   G   C   4    0.4     0.4

Entirely out of curiosity, is there a cleaner (more data.table) way to do this? Perhaps with self joins? Performance is important because the actual data I'm working with tends to be quite large.

Can you elaborate slightly on "...and then generate a variable based upon these values"? Knowing the logic of what you are trying to calculate will help in making suggestions about the best way to do it. — thelatemail
Your first method is pretty clean really - you could keep chaining calls but I'm not sure it is any better mydat[unique(mydat[,.(id2,val)])[, weight := val / sum(val)], on=.(id2)] — thelatemail

Frank Frank · Accepted Answer · 2019-06-27T00:54:35

I agree with @thelatemail that the approaches in the OP are already pretty clean.

Performance is important because the actual data I'm working with tends to be quite large.

If you must use this structure, there's:

setorder(mydat, id2)
mydat[unique(id2), on=.(id2), mult="first", w2 := val/sum(val)]
mydat[, w2 := nafill(w2, type="locf")]

I'm just sorting because that's shown in the desired output. To keep the original sorting, drop setorder and change the last line to mydat[order(id2), w2 := nafill(w2, type="locf")].

The nafill function is available in 1.12.3+ (so not yet on CRAN).

I would suggest instead using a set of normalized/"tidy" tables: val is an attribute of id2, so you could have an id2 table containing such things.

# same as OP's tmp
id2DT = unique(mydat[, .(id2, val)])
setkey(id2DT, id2)
id2DT[, w := val/sum(val)]

# drop redundant repeating val unless you really need it there
# to save on space and improve readability
mydat[, val := NULL] 
# merge w in if/when needed
mydat[, w := id2DT[.SD, on=.(id2), x.w]]

data.table assign value to unique observation

2 Answers