My first impression upon reading your question was that an index matrix would be perfect here. I've devised a solution built around this idea, but I have to warn you, it ended up being much more involved than one would hope.
Here's the whole thing:
d[,new_col:=
as.matrix(d[,unique(d[,Ref_Col]),with=F])[
matrix(c(seq_len(nrow(d)),match(Ref_Col,unique(Ref_Col))),nrow(d))
]
];
## A B Ref_Col new_col
## 1: 1 11 A 1
## 2: 2 12 B 12
## 3: 3 13 A 3
## 4: 4 14 B 14
## 5: 5 15 A 5
## 6: 6 16 B 16
## 7: 7 17 A 7
## 8: 8 18 B 18
## 9: 9 19 A 9
## 10: 10 20 B 20
Let's break that down one piece at a time:
c(seq_len(nrow(d)),match(Ref_Col,unique(Ref_Col)))
First I construct the vector that will comprise the underlying data of the index matrix. The left column is the row subscripts, the right column is the column subscripts. Now, a limitation of matrices is that they cannot contain heterogeneous types. Thus, we must choose between integer indexes and character names. Since your data.table has no row names, we have to use integer indexes (and integer indexes will likely be faster). The row indexes are easy to construct; it's just the sequence from 1 to nrow(d). The column indexes must be match()ed from the Ref_Col values to the column names of the object we will be indexing. To jump ahead, we will not actually index d, but rather a matrix built from just the columns referenced at least once by the Ref_Col column. Thus the correct column indexes are based on the position of the column name within the vector unique(Ref_Col).
matrix(...,nrow(d))
The next step is obviously to form a matrix out of the underlying data vector.
as.matrix(d[,unique(d[,Ref_Col]),with=F])[...]
Unfortunately, data.table does not currently support indexing with an index matrix. To solve this problem, we must coerce to a data type that does support indexing with an index matrix. The two most logical choices are a data.frame or a matrix. I went with a matrix, since the new_col column will have to comprise a single vector type anyway, so coercing to a matrix (and therefore flattening all referent columns to a single type) is not a problem. I briefly considered that since a data.table is already a valid data.frame (i.e. it "inherits" from data.frame under R's pseudo-OOP paradigm), and since R uses a copy-on-modify optimization under-the-hood, it might be less costly to coerce to data.frame and hopefully avoid a copy, but running tracemem() reveals that R does in fact copy the entire data.table when coerced to data.frame. (Update: And I just discovered that R internally coerces a data.frame to a matrix anyway just before it is about to be indexed by an index matrix, so coercing to data.frame prior to indexing with an index matrix would not buy anything over and above coercing to matrix directly.) So I just went with as.matrix(). Yes, it still copies data in this case, but at least it will copy less data, since we can first index out just the referent columns with unique(d[,Ref_Col]) (requiring with=F).
d[,new_col:=...]
Finally, we can assign the new column from the result of applying the index matrix to the combined referent matrix.
Performance
I did some benchmarking:
library(data.table);
library(microbenchmark);
chinsoon <- function(d) { d[,id:=seq_along(Ref_Col)]; temp <- melt(d, meas=unique(d$Ref_Col), value.name="new_col")[Ref_Col==variable,]; setkey(d, id, Ref_Col); setkey(temp, id, Ref_Col); d[temp][ ,`:=`(id = NULL, variable = NULL)][]; };
bgoldst <- function(d) d[,new_col:=as.matrix(d[,unique(d[,Ref_Col]),with=F])[matrix(c(seq_len(nrow(d)),match(Ref_Col,unique(Ref_Col))),nrow(d))]];
symbolix <- function(d) { refs <- unique(d[, Ref_Col]); for(i in refs) d[ Ref_Col == i, eval(parse(text = paste0("new_col := ", i)))][ ]; d; };
arun <- function(d) d[,new_col:=get(Ref_Col),Ref_Col];
N <- 100L; d <- data.table(A=seq(1L,N%/%2),B=seq(N%/%2+1L,N),Ref_Col=rep(c('A','B'),N%/%2L));
identical(bgoldst(copy(d)),chinsoon(copy(d)));
## [1] TRUE
identical(bgoldst(copy(d)),{ x <- symbolix(copy(d)); attr(x,'index') <- NULL; x; }); ## irrelevant index attribute difference
## [1] TRUE
identical(bgoldst(copy(d)),arun(copy(d)));
## [1] TRUE
N <- 100L; d <- data.table(A=seq(1L,N%/%2),B=seq(N%/%2+1L,N),Ref_Col=rep(c('A','B'),N%/%2L));
microbenchmark(chinsoon(copy(d)),bgoldst(copy(d)),symbolix(copy(d)),arun(copy(d)));
## Unit: microseconds
## expr min lq mean median uq max neval
## chinsoon(copy(d)) 2444.896 2516.955 2941.2385 2597.1400 3501.8410 6343.812 100
## bgoldst(copy(d)) 1713.608 1790.799 2137.4168 1837.4135 2472.6930 4599.841 100
## symbolix(copy(d)) 2175.901 2275.972 2769.9504 2354.8740 3173.6170 13897.454 100
## arun(copy(d)) 635.921 685.743 862.7615 722.7345 951.5295 4414.667 100
N <- 1e4L; d <- data.table(A=seq(1L,N%/%2),B=seq(N%/%2+1L,N),Ref_Col=rep(c('A','B'),N%/%2L));
microbenchmark(chinsoon(copy(d)),bgoldst(copy(d)),symbolix(copy(d)),arun(copy(d)));
## Unit: microseconds
## expr min lq mean median uq max neval
## chinsoon(copy(d)) 4603.262 4999.6975 7194.594 6277.311 7162.555 49217.352 100
## bgoldst(copy(d)) 2511.609 2600.5610 3371.723 2682.029 3979.529 6738.964 100
## symbolix(copy(d)) 2645.893 2761.1450 3588.282 2959.789 4190.149 15062.810 100
## arun(copy(d)) 770.204 849.5345 1048.795 880.753 1126.653 2831.495 100
N <- 1e5L; d <- data.table(A=seq(1L,N%/%2),B=seq(N%/%2+1L,N),Ref_Col=rep(c('A','B'),N%/%2L));
microbenchmark(chinsoon(copy(d)),bgoldst(copy(d)),symbolix(copy(d)),arun(copy(d)));
## Unit: milliseconds
## expr min lq mean median uq max neval
## chinsoon(copy(d)) 27.114512 32.982772 59.00385 70.976359 78.864641 131.06167 100
## bgoldst(copy(d)) 9.732538 11.673015 19.02450 13.396672 16.624600 66.72976 100
## symbolix(copy(d)) 6.787716 8.509448 11.07309 9.057487 10.523269 55.60692 100
## arun(copy(d)) 2.127149 2.380748 3.32179 2.813746 3.930136 6.83604 100
So the conclusion is that my solution is somewhat faster for small data, but as the data gets progressively larger, Symbolix's solution becomes significantly faster. We can guess with a high degree of confidence that this is because my solution incurs the penalty of copying the referent columns in order to index them with an index matrix, while Symbolix uses the smarter approach of simply looping over the referent columns and indexing them one at a time. This is a case in R where looping is better than vectorization. +1 to Symbolix.
Update: Whoooooaaaaa! After adding Arun's solution, I am amazed how much faster it is, in addition to being shorter and more elegant than the rest. Game, set, and match to Arun.
I'm reminded of the line Furious activity is no substitute for understanding.
d[,new_col:=ifelse(Ref_Col=="A",A,B)]- nicolad[,new_col:=get(Ref_Col), by=Ref_Col]- Arun