22
votes

If you have an R data.table that has missing values, how do you replace all of them with say, the value 0? E.g.

aa = data.table(V1=1:10,V2=c(1,2,2,3,3,3,4,4,4,4))
bb = data.table(V1=3:6,X=letters[1:4])
setkey(aa,V1)
setkey(bb,V1)
tt = bb[aa]

    V1  X V2
 1:  1 NA  1
 2:  2 NA  2
 3:  3  a  2
 4:  4  b  3
 5:  5  c  3
 6:  6  d  3
 7:  7 NA  4
 8:  8 NA  4
 9:  9 NA  4
10: 10 NA  4

Any way to do this in one line? If it were just a matrix, you could just do:

tt[is.na(tt)] = 0
4
tt[is.na(tt)] = 0 works for me.thelatemail
You're right, I don't know where I was looking. Feel free to answer I'll accept it.FBC
+1 for the nice question. But the answer you've chosen is not the idiomatic way. I'd advice you to reconsider.Arun
thinking of how I'd want this to work, tt[is.na(tt), .SD := 0] comes to mindeddi
@eddi, Hm, that seems nice and probably not that hard to implement... will give it some thought.Arun

4 Answers

31
votes

is.na (being a primitive) has relatively very less overhead and is usually quite fast. So, you can just loop through the columns and use set to replace NA with0`.

Using <- to assign will result in a copy of all the columns and this is not the idiomatic way using data.table.

First I'll illustrate as to how to do it and then show how slow this can get on huge data (due to the copy):

One way to do this efficiently:

for (i in seq_along(tt)) set(tt, i=which(is.na(tt[[i]])), j=i, value=0)

You'll get a warning here that "0" is being coerced to character to match the type of column. You can ignore it.

Why shouldn't you use <- here:

# by reference - idiomatic way
set.seed(45)
tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
tracemem(tt)
# modifies value by reference - no copy
system.time({
for (i in seq_along(tt)) 
    set(tt, i=which(is.na(tt[[i]])), j=i, value=0)
})
#   user  system elapsed 
#  0.284   0.083   0.386 

# by copy - NOT the idiomatic way
set.seed(45)
tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
tracemem(tt)
# makes copy
system.time({tt[is.na(tt)] <- 0})
# a bunch of "tracemem" output showing the copies being made
#   user  system elapsed 
#  4.110   0.976   5.187 
13
votes

Nothing unusual here:

tt[is.na(tt)] = 0

..will work.

This is somewhat confusing however given that:

tt[is.na(tt)]

...currently returns:

Error in [.data.table(tt, is.na(tt)) : i is invalid type (matrix). Perhaps in future a 2 column matrix could return a list of elements of DT (in the spirit of A[B] in FAQ 2.14). Please let datatable-help know if you'd like this, or add your comments to FR #1611.

0
votes

I would make use of data.table and lapply, namely:

tt[,lapply(.SD,function(kkk) ifelse(is.na(kkk),-666,kkk)),.SDcols=names(tt)]

yielding in:

V1    X V2
 1:  1 -666  1
 2:  2 -666  2
 3:  3    a  2
 4:  4    b  3
 5:  5    c  3
 6:  6    d  3
 7:  7 -666  4
 8:  8 -666  4
 9:  9 -666  4
10: 10 -666  4
0
votes

The specific problem OP is posting could also be solved by

tt[is.na(X), X := 0]