1
votes

Ok I have a complex function built using data.frames and in trying to speed it up I've turned to data.table. I'm totally new to this so I'm quite befuddled. Anyhow I've made a much much simpler toy example of what I want to do, but I cannot work out how to translate it into data.table format. Here is the example in data.frame form:

    rows <- 10
    data1 <- data.frame(   id =1:rows,
                    a = seq(0.2, 0.55, length.out = rows),
                  b = seq(0.35, 0.7, length.out = rows),
                  c = seq(0.4, 0.83, length.out = rows),
                  d = seq(0.6, 0.87, length.out = rows),
                  e = seq(0.7, 0.99, length.out = rows),
                  f = seq(0.52, 0.90, length.out = rows)             
    )
    DT1 <- data.table(data1) #for later

    data2 <- data.frame(   id =3:1,
                   a = rep(3, 3),
                   d = rep(2, 3),
                   f = rep(1, 3)
    )
    m.names <- c("a", "d", "f")

    data1[match(data2$id, data1$id),m.names] <- data1[match(data2$id, data1$id),m.names] + data2[match(data2$id, data1$id),m.names]

So note in the last step that I want perform addition between the pre-existing figures and the new data and its vectorised across several columns.

In a data.table format I've only gotten this far:

    DT1[id %in% data2$id, m.names, with=FALSE]

This selects the values I want to add to but after that I am lost. I would appreciate any help !

EDIT:

Ok I've figure out part of it - I can use the last line of code above to achieve the vectorised addition part using using data2 to store the added values as follows:

    data2[,m.names] <- data2[,m.names] + data.frame(DT1[id %in% data2$id, m.names, with=FALSE])

Even with 2.5million rows (in DT1) and 10,000 rows in data2 and 6 matching columns this only takes 0.004sec, but I still need to assign the new data2 to the appropriate dynamically assigned columns in data 1

3
You can do a simple binary join if you don't mind calling the column names manually, something like setkey(setDT(data1), id) ; data1[data2, ":="(a = a + i.a,d = d + i.d,f = f + i.f)]David Arenburg
@DavidArenburg Your solution is much better and compact.akrun
@akrun your solution is more general so I'd leave it there and let the OP decideDavid Arenburg
Thanks David , but doesn't work in my real data. I can't use a,d,f I need to use m.names[1] as the number of columns can vary - using m.names[] construct doesn't seem to work.. Second point is in the toy data the match version appears quicker (and in the real code I use fmatch)user2498193
Ok I've found a quick way to do the addition via vectorisation - see my edit - but I still need to pack the results back into data1 using the variable assigned column names ??user2498193

3 Answers

4
votes

Here's another way, using devel version data.table v1.9.5:

require(data.table) ## v1.9.5+
setDT(data1)        ## data1 is now a data.table

cols1 = c("a", "d", "f")
cols2 = paste0("i.", cols1)
setkey(data1, id)   ## setkey and prepare for join
data1[data2, (cols1) := mapply(`+`, mget(cols1), mget(cols2), SIMPLIFY=FALSE)]
#     id         a         b         c    d         e         f
#  1:  1 3.2000000 0.3500000 0.4000000 2.60 0.7000000 1.5200000
#  2:  2 3.2388889 0.3888889 0.4477778 2.63 0.7322222 1.5622222
#  3:  3 3.2777778 0.4277778 0.4955556 2.66 0.7644444 1.6044444
#  4:  4 0.3166667 0.4666667 0.5433333 0.69 0.7966667 0.6466667
#  5:  5 0.3555556 0.5055556 0.5911111 0.72 0.8288889 0.6888889
#  6:  6 0.3944444 0.5444444 0.6388889 0.75 0.8611111 0.7311111
#  7:  7 0.4333333 0.5833333 0.6866667 0.78 0.8933333 0.7733333
#  8:  8 0.4722222 0.6222222 0.7344444 0.81 0.9255556 0.8155556
#  9:  9 0.5111111 0.6611111 0.7822222 0.84 0.9577778 0.8577778
# 10: 10 0.5500000 0.7000000 0.8300000 0.87 0.9900000 0.9000000

The join of the form x[i] is performed on the key column id. For each row of data2's id column, the corresponding matching rows in data1 is found. For example, for id = 2 from data2, the matching row is the 2nd row in data1.

Once we've all the matching rows, we evaluate the expression in j, which updates the data1 columns provided in col1 by adding the values from mget(cols1) and mget(cols2).

cols2 was generated with a i. prefix, which fetches the values from the data.table i -- here data2.

HTH

1
votes

One way would be to use set in a for loop as this involves multiple columns. Convert the second dataset to 'data.table' (DT2), set the key with the 'id' column, and join with 'data1'. Create column index vectors for the 'm.names' column in the dataset ('indx1') and i. columns created after the join ('indx2'). Using a for loop, set the NA elements in 'm.names' columns to '0', and then sum the corresponding columns based on the 'indx1' and 'indx2'.

DT2 <- as.data.table(data2)
DTNew <- setkey(DT2, id)[data1]
indx1 <- match(m.names, names(DTNew))
indx2 <- grep('i\\.', names(DTNew))

for(k in seq_along(indx1)){
  set(DTNew, i=which(is.na(DTNew[[indx1[k]]])), j= indx1[k], value=0)
  set(DTNew, i=NULL, j= indx2[k], value = DTNew[[indx1[k]]]+
                                      DTNew[[indx2[k]]])
  }

 res <- DTNew[,2:4 := NULL]
 setnames(res, names(data1))

Checking with the modified 'data1'

  data1[match(data2$id, data1$id),m.names] <- data1[match(data2$id, 
       data1$id),m.names] + data2[match(data2$id, data1$id),m.names]
  all.equal(setDF(res), data1)
  #[1] TRUE

Benchmarks

On a 1e6 dataset, 

set.seed(24)
data1 <- cbind(id=1:1e6,as.data.frame(matrix(rnorm(1e6*10), ncol=10, 
         dimnames=list(NULL, letters[1:10])) ))
 set.seed(46)
 data2 <- data.frame(id= sample(1:1000, 100, replace=FALSE), 
             a= rnorm(100), d=rnorm(100), f= rnorm(100))
 m.names <- c("a", "d", "f")

 DT2 <- as.data.table(data2)
 system.time({
   DTNew <- setkey(DT2, id)[data1]
   indx1 <- match(m.names, names(DTNew))
   indx2 <- grep('i\\.', names(DTNew))

   for(k in seq_along(indx1)){
   set(DTNew, i=which(is.na(DTNew[[indx1[k]]])), j= indx1[k], value=0)
   set(DTNew, i=NULL, j= indx2[k], value = DTNew[[indx1[k]]]+
                                  DTNew[[indx2[k]]])
   }

  res <- DTNew[,2:4 := NULL]
  setnames(res, names(data1))
  })

 # user  system elapsed 
 # 0.082   0.005   0.086 
0
votes

Ok thanks to @David Arenburg for his suggestion. I've modified it slightly to arrive at the following for my preferred solution

    text <- NULL
    for(i in 1:length(m.names)){
        text <- paste0(text, m.names[i], " = ", m.names[i], " + i.", m.names[i], ", ")
    }
    expr <- parse(text = paste0("\":=\"(", substr(text, 1, nchar(text)-2), ")" ))

    res2 <- DT1[data2, eval(expr)]