3
votes

I have a data.table of a and b that I've partitioned into below with b < .5 and above with b > .5:

DT = data.table(a=as.integer(c(1,1,2,2,3,3)), b=c(0,0,0,1,1,1))
above = DT[DT$b > .5]
below = DT[DT$b < .5, list(a=a)]

I'd like to do a left outer join between above and below: for each a in above, count the number of rows in below. This is equivalent to the following in SQL:

with dt as (select 1 as a, 0 as b union select 1, 0 union select 2, 0 union select 2, 1 union select 3, 1 union select 3, 1),
  above as (select a, b from dt where b > .5),
  below as (select a, b from dt where b < .5)
select above.a, count(below.a) from above left outer join below on (above.a = below.a) group by above.a;
 a | count 
---+-------
 3 |     0
 2 |     1
(2 rows)

How do I accomplish the same thing with data.tables? This is what I tried so far:

> key(below) = 'a'
> below[above, list(count=length(b))]
     a count
[1,] 2     1
[2,] 3     1
[3,] 3     1
> below[above, list(count=length(b)), by=a]
Error in eval(expr, envir, enclos) : object 'b' not found
> below[, list(count=length(a)), by=a][above]
     a count b
[1,] 2     1 1
[2,] 3    NA 1
[3,] 3    NA 1

I should also be more specific in that I already tried merge but that blows through the memory on my system (and the dataset takes only about 20% of my memory).

4
Can you just write in human terms what you want to achieve?mbq
You could have tried looking for the answer first. a data.table is an extension of data.frame, and has a merge function.Joris Meys
Agree you should explain what you want in natural language, but in addition I get an error in both version 1.6 and (after updating) 1.6.4: Error in abs(j) : Non-numeric argument to mathematical function In addition: Warning message: In is.na(j) : is.na() applied to non-(list or vector) of type 'NULL'IRTFM
Sorry, it was late - hopefully the question is a gazillion times better now.Yang

4 Answers

4
votes

See if this is giving you something useful. Your example is too sparse to let me know what you want, but it appears it might be a tabulation of values of above$a that are also in below$a

table(above$a[above$a %in% below$a])

If you also want the converse ... values not in below, then this would do it:

table(above$a[!above$a %in% below$a])

And you can concatenate them:

> c(table(above$a[above$a %in% below$a]),table(above$a[!above$a %in% below$a]) )
2 3 
1 2

Generally table and %in% run in reasonably small footprints and are quick.

4
votes

Since you appear to be using package data.table: check ?merge.data.table. I haven't used it, but it appears this might do what you want:

merge(above, below, by="a", all.x=TRUE, all.y=FALSE)
2
votes

I think this is easier:

setkey(above,a)
setkey(below,a)

Left outer join:

above[below, .N]

regular join:

above[below, .N, nomatch=0]

full outer join with counts:

merge(above,below, all=T)[,.N, by=a]
1
votes

I eventually found a way to do this with data.table, which I felt is more natural for me to understand than DWin's table, though YMMV:

result = below[, list(count=length(b)), by=a]
key(result) = 'a'
result = result[J(unique(above$a))]
result$count[is.na(result$count)] = 0

I don't know if this could be more compact, though. I especially wanted to be able to do something like result = below[J(unique(above$a)), list(count=length(b))], but that doesn't work.