data.table r create new column referring to column

Question

Apologies in advance if this is a duplicate, but I have spent a fair amount of time searching and haven't found anything that relates to my problem.

I have a column in a data.table that contains the name of a column that I want to use in order to create a new column. That is- for each row, I want to look up a different column based on the values in a column.

I tried using get() but this didn't work:

d<-data.table(A=1:10,B=11:20,Ref_Col=rep(c("A","B"),5))
d[,new_col:=get(Ref_Col)]

The below is the output I would like:

     A  B Ref_Col new_col
 1:  1 11       A       1
 2:  2 12       B      12
 3:  3 13       A       3
 4:  4 14       B      14
 5:  5 15       A       5
 6:  6 16       B      16
 7:  7 17       A       7
 8:  8 18       B      18
 9:  9 19       A       9
10: 10 20       B      20

Any help is much appreciated.

Thanks for the quick reply nicola. I should have mentioned in my question that there are actually more than just 2 columns in the data.table (~30). So doing ifelse with each of these wouldn't be practical. — mat
Perfect - thanks Arun. I'm a bit puzzled about why the 'group by' is necessary... Is the intuition behind why this works that it's pulling the right quoted column grouped by each quoted column? — mat

bgoldst bgoldst · Accepted Answer · 2016-04-11T06:36:55

My first impression upon reading your question was that an index matrix would be perfect here. I've devised a solution built around this idea, but I have to warn you, it ended up being much more involved than one would hope.

Here's the whole thing:

d[,new_col:=
    as.matrix(d[,unique(d[,Ref_Col]),with=F])[
        matrix(c(seq_len(nrow(d)),match(Ref_Col,unique(Ref_Col))),nrow(d))
    ]
];
##      A  B Ref_Col new_col
##  1:  1 11       A       1
##  2:  2 12       B      12
##  3:  3 13       A       3
##  4:  4 14       B      14
##  5:  5 15       A       5
##  6:  6 16       B      16
##  7:  7 17       A       7
##  8:  8 18       B      18
##  9:  9 19       A       9
## 10: 10 20       B      20

Let's break that down one piece at a time:

c(seq_len(nrow(d)),match(Ref_Col,unique(Ref_Col)))

First I construct the vector that will comprise the underlying data of the index matrix. The left column is the row subscripts, the right column is the column subscripts. Now, a limitation of matrices is that they cannot contain heterogeneous types. Thus, we must choose between integer indexes and character names. Since your data.table has no row names, we have to use integer indexes (and integer indexes will likely be faster). The row indexes are easy to construct; it's just the sequence from 1 to nrow(d). The column indexes must be match()ed from the Ref_Col values to the column names of the object we will be indexing. To jump ahead, we will not actually index d, but rather a matrix built from just the columns referenced at least once by the Ref_Col column. Thus the correct column indexes are based on the position of the column name within the vector unique(Ref_Col).

matrix(...,nrow(d))

The next step is obviously to form a matrix out of the underlying data vector.

as.matrix(d[,unique(d[,Ref_Col]),with=F])[...]

Unfortunately, data.table does not currently support indexing with an index matrix. To solve this problem, we must coerce to a data type that does support indexing with an index matrix. The two most logical choices are a data.frame or a matrix. I went with a matrix, since the new_col column will have to comprise a single vector type anyway, so coercing to a matrix (and therefore flattening all referent columns to a single type) is not a problem. I briefly considered that since a data.table is already a valid data.frame (i.e. it "inherits" from data.frame under R's pseudo-OOP paradigm), and since R uses a copy-on-modify optimization under-the-hood, it might be less costly to coerce to data.frame and hopefully avoid a copy, but running tracemem() reveals that R does in fact copy the entire data.table when coerced to data.frame. (Update: And I just discovered that R internally coerces a data.frame to a matrix anyway just before it is about to be indexed by an index matrix, so coercing to data.frame prior to indexing with an index matrix would not buy anything over and above coercing to matrix directly.) So I just went with as.matrix(). Yes, it still copies data in this case, but at least it will copy less data, since we can first index out just the referent columns with unique(d[,Ref_Col]) (requiring with=F).

d[,new_col:=...]

Finally, we can assign the new column from the result of applying the index matrix to the combined referent matrix.

Performance

I did some benchmarking:

library(data.table);
library(microbenchmark);

chinsoon <- function(d) { d[,id:=seq_along(Ref_Col)]; temp <- melt(d, meas=unique(d$Ref_Col), value.name="new_col")[Ref_Col==variable,]; setkey(d, id, Ref_Col); setkey(temp, id, Ref_Col); d[temp][ ,`:=`(id = NULL, variable = NULL)][]; };
bgoldst <- function(d) d[,new_col:=as.matrix(d[,unique(d[,Ref_Col]),with=F])[matrix(c(seq_len(nrow(d)),match(Ref_Col,unique(Ref_Col))),nrow(d))]];
symbolix <- function(d) { refs <- unique(d[, Ref_Col]); for(i in refs) d[ Ref_Col == i, eval(parse(text = paste0("new_col := ", i)))][ ]; d; };
arun <- function(d) d[,new_col:=get(Ref_Col),Ref_Col];

N <- 100L; d <- data.table(A=seq(1L,N%/%2),B=seq(N%/%2+1L,N),Ref_Col=rep(c('A','B'),N%/%2L));
identical(bgoldst(copy(d)),chinsoon(copy(d)));
## [1] TRUE
identical(bgoldst(copy(d)),{ x <- symbolix(copy(d)); attr(x,'index') <- NULL; x; }); ## irrelevant index attribute difference
## [1] TRUE
identical(bgoldst(copy(d)),arun(copy(d)));
## [1] TRUE

N <- 100L; d <- data.table(A=seq(1L,N%/%2),B=seq(N%/%2+1L,N),Ref_Col=rep(c('A','B'),N%/%2L));
microbenchmark(chinsoon(copy(d)),bgoldst(copy(d)),symbolix(copy(d)),arun(copy(d)));
## Unit: microseconds
##               expr      min       lq      mean    median        uq       max neval
##  chinsoon(copy(d)) 2444.896 2516.955 2941.2385 2597.1400 3501.8410  6343.812   100
##   bgoldst(copy(d)) 1713.608 1790.799 2137.4168 1837.4135 2472.6930  4599.841   100
##  symbolix(copy(d)) 2175.901 2275.972 2769.9504 2354.8740 3173.6170 13897.454   100
##      arun(copy(d))  635.921  685.743  862.7615  722.7345  951.5295  4414.667   100

N <- 1e4L; d <- data.table(A=seq(1L,N%/%2),B=seq(N%/%2+1L,N),Ref_Col=rep(c('A','B'),N%/%2L));
microbenchmark(chinsoon(copy(d)),bgoldst(copy(d)),symbolix(copy(d)),arun(copy(d)));
## Unit: microseconds
##               expr      min        lq     mean   median       uq       max neval
##  chinsoon(copy(d)) 4603.262 4999.6975 7194.594 6277.311 7162.555 49217.352   100
##   bgoldst(copy(d)) 2511.609 2600.5610 3371.723 2682.029 3979.529  6738.964   100
##  symbolix(copy(d)) 2645.893 2761.1450 3588.282 2959.789 4190.149 15062.810   100
##      arun(copy(d))  770.204  849.5345 1048.795  880.753 1126.653  2831.495   100

N <- 1e5L; d <- data.table(A=seq(1L,N%/%2),B=seq(N%/%2+1L,N),Ref_Col=rep(c('A','B'),N%/%2L));
microbenchmark(chinsoon(copy(d)),bgoldst(copy(d)),symbolix(copy(d)),arun(copy(d)));
## Unit: milliseconds
##               expr       min        lq     mean    median        uq       max neval
##  chinsoon(copy(d)) 27.114512 32.982772 59.00385 70.976359 78.864641 131.06167   100
##   bgoldst(copy(d))  9.732538 11.673015 19.02450 13.396672 16.624600  66.72976   100
##  symbolix(copy(d))  6.787716  8.509448 11.07309  9.057487 10.523269  55.60692   100
##      arun(copy(d))  2.127149  2.380748  3.32179  2.813746  3.930136   6.83604   100

So the conclusion is that my solution is somewhat faster for small data, but as the data gets progressively larger, Symbolix's solution becomes significantly faster. We can guess with a high degree of confidence that this is because my solution incurs the penalty of copying the referent columns in order to index them with an index matrix, while Symbolix uses the smarter approach of simply looping over the referent columns and indexing them one at a time. This is a case in R where looping is better than vectorization. +1 to Symbolix.

Update: Whoooooaaaaa! After adding Arun's solution, I am amazed how much faster it is, in addition to being shorter and more elegant than the rest. Game, set, and match to Arun.

I'm reminded of the line Furious activity is no substitute for understanding.

data.table r create new column referring to column

3 Answers

Performance