`is.na` (being a primitive) has relatively little overhead and is usually quite fast. So you can just loop through the columns and use `set` to replace `NA` with `0`.
Using `<-` to assign will result in a copy of all the columns, and this is not the idiomatic `data.table` way.
First I'll illustrate how to do it, and then show how slow this can get on huge data (due to the copy):
One way to do this efficiently:
for (i in seq_along(tt)) set(tt, i=which(is.na(tt[[i]])), j=i, value=0)
You'll get a warning here that "0" is being coerced to character to match the column's type. You can ignore it.
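As a quick sanity check, here is a minimal sketch of the same loop on a small toy table. The table `dt` and its columns are made up for illustration (they are not from the question), and all columns are numeric, so no coercion warning is expected in this case.

```r
library(data.table)

# Hypothetical toy table, for illustration only; all columns are numeric.
dt <- data.table(a = c(1, NA, 3), b = c(NA, NA, 6), c = c(7, 8, 9))

# For each column, find the rows holding NA and overwrite them in place.
# Column c has no NA, so which() returns integer(0) and nothing is changed.
for (j in seq_along(dt)) {
  set(dt, i = which(is.na(dt[[j]])), j = j, value = 0)
}

print(dt)
```

Note that there is no `dt <- ...` inside the loop: `set` updates the existing table directly.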
Why you shouldn't use `<-` here:
# by reference - idiomatic way
set.seed(45)
tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
tracemem(tt)
# modifies value by reference - no copy
system.time({
  for (i in seq_along(tt))
    set(tt, i=which(is.na(tt[[i]])), j=i, value=0)
})
# user system elapsed
# 0.284 0.083 0.386
# by copy - NOT the idiomatic way
set.seed(45)
tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
tracemem(tt)
# makes copy
system.time({tt[is.na(tt)] <- 0})
# a bunch of "tracemem" output showing the copies being made
# user system elapsed
# 4.110 0.976 5.187
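If you'd rather not rely on timings and `tracemem` output alone, one possible check is to compare memory addresses with `address()` from data.table. This is only a sketch on made-up tables (`dt` and `dt2` are hypothetical and far smaller than the benchmark data); it shows that `set` leaves the table at the same address, and prints the comparison for the `<-` form so you can inspect it yourself.

```r
library(data.table)

# Hypothetical toy table, used only to illustrate the by-reference behaviour.
dt <- data.table(a = c(1, NA, 3), b = c(NA, 5, 6))

before <- address(dt)
for (j in seq_along(dt)) {
  set(dt, i = which(is.na(dt[[j]])), j = j, value = 0)
}
# set() updated dt in place, so it is still the same object.
same_after_set <- identical(before, address(dt))
print(same_after_set)

# For comparison: the `<-` form on a fresh copy of the same toy data.
dt2 <- data.table(a = c(1, NA, 3), b = c(NA, 5, 6))
before2 <- address(dt2)
dt2[is.na(dt2)] <- 0
# A different address here would indicate that dt2 was replaced by a copy.
print(identical(before2, address(dt2)))
```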
`tt[is.na(tt)] = 0` works for me. – thelatemail
`tt[is.na(tt), .SD := 0]` comes to mind – eddi