
I have a data.table that I would like to process. As a first step, I'd like to create a new data.table that keeps the first six columns and adds the SNP columns initialised to NA or 0.
I loop over the names of the columns of interest and try to assign NA/0, but this either fails or behaves oddly, as explained below.

library(data.table)    
 
input_allele <- data.table(FID= paste0("gid",1:10),IID=paste0("IID",11:20),PAT=c(1:10),MAT=c(rep(0,10)),SEX=c(rep(1,10)),PHENOTYPE =c(rep(1,10)),
SNP1=(c(rep(1,5), rep(0,5))),SNP2=(c(rep(1,6),rep(0,3),NA)),SNP3=(c(rep(NA,6),rep(1,4))),SNP4=(c(rep(NA,6),rep(0,4))),SNP5=(c(rep(1,6),rep(0,4)))  )


multiplied_value<-input_allele[,c(1:6)]

for(temp_snp in (colnames(input_allele[,.SD,.SDcols=c(7:11)]))){
temp_snpquote<-quote(temp_snp)
multiplied_value[,(temp_snpquote):=0]
}

I get an error:

Error in `[.data.table`(multiplied_value, , `:=`((temp_snpquote), 0)) : LHS of := must be a symbol, or an atomic vector (column names or positions).

If I use eval, I run into a weird behavior: After completion of the loop I have to type multiplied_value twice before the data.table is printed on the console. This is startling to me.

for(temp_snp in (colnames(input_allele[,.SD,.SDcols=c(7:11)]))){

temp_snpquote<-quote(temp_snp)
multiplied_value[,eval(temp_snpquote):=0]
}

I would like to understand: 1) how to set a new column to NA or 0, and 2) why, when I use eval, I have to type multiplied_value twice before the data.table is printed.

R version 4.0.0 (2020-04-24), data.table_1.13.4 Unix debian distribution

Comments:

I'd use set for this rather than :=. Something like: for (i in colnames(input_allele[,.SD,.SDcols=c(7:11)])) set(multiplied_value, j = i, value = 0); multiplied_value[] - A5C1D2H2I1M1N2O1R2T1
But you could also do: for(temp_snp in (colnames(input_allele[,.SD,.SDcols=c(7:11)]))) multiplied_value[, (temp_snp):= 0]; multiplied_value[] - A5C1D2H2I1M1N2O1R2T1
I see. I should have used [] in the case where I had to type the variable name twice: for(temp_snp in (colnames(input_allele[,.SD,.SDcols=c(7:11)]))) multiplied_value[, (temp_snp):= 0][]. I couldn't follow your first code snippet because of the variables (j and i). How does set work there? - Death Metal
If you look at the help page for ?set (where := is also demonstrated), towards the end you'll see timings for different ways of adding multiple columns to a data.table. - A5C1D2H2I1M1N2O1R2T1
The last [] is there to print the table after any in-place modification. - A5C1D2H2I1M1N2O1R2T1

1 Answer


Consolidating some of the comments into an answer here...

According to ?set, the overhead of calling [.data.table repeatedly (for example, once per column in a loop) can add up. In those cases, you can use set instead, which avoids that overhead.
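Since one of the comments asked what i and j mean in set, here is a minimal sketch on a toy table. The table dt and the column names a and b are made up for illustration and are not part of the question's data.

```r
library(data.table)

# Toy table; `dt`, `a` and `b` are hypothetical names used only for this sketch
dt <- data.table(id = 1:3)

# set(x, i, j, value):
#   x     = the data.table to modify by reference
#   i     = row numbers to assign to (left out here, meaning all rows)
#   j     = column name (or position, for existing columns)
#   value = what to assign; a single value is recycled down the column
for (col in c("a", "b")) {
  set(dt, j = col, value = 0)
}

# Supplying i restricts the assignment to those rows
set(dt, i = 2L, j = "a", value = 5)

# Column a is now 0, 5, 0 and column b is 0, 0, 0
print(dt)
```

In the loop, the loop variable simply holds one column name per iteration, and it is passed to set as j.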

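Also, := and the set* functions modify the table by reference and return it invisibly, so follow the call with [] when you want the result printed. After := runs inside a loop or function, the next attempt to print the table at the console can be suppressed, which is most likely why multiplied_value had to be typed twice; eval is not the cause.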

With that, here are the two alternatives:

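# copy() gives three independent tables; chained `<-` would bind all three
# names to the same object, so a by-reference update would hit all of them.
copy1 <- input_allele[, 1:6]
copy2 <- copy(copy1)
copy3 <- copy(copy1)
new <- names(input_allele)[7:11]   # names of the SNP columns to add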

## Using `set` :

for (i in new) {
  set(copy1, j = i, value = 0)[]
}
head(copy1)
##     FID   IID PAT MAT SEX PHENOTYPE SNP1 SNP2 SNP3 SNP4 SNP5
## 1: gid1 IID11   1   0   1         1    0    0    0    0    0
## 2: gid2 IID12   2   0   1         1    0    0    0    0    0
## 3: gid3 IID13   3   0   1         1    0    0    0    0    0
## 4: gid4 IID14   4   0   1         1    0    0    0    0    0
## 5: gid5 IID15   5   0   1         1    0    0    0    0    0
## 6: gid6 IID16   6   0   1         1    0    0    0    0    0
   
## Using `:=` :

for (i in new) {
  copy2[, (i) := 0][]
}
head(copy2)
##     FID   IID PAT MAT SEX PHENOTYPE SNP1 SNP2 SNP3 SNP4 SNP5
## 1: gid1 IID11   1   0   1         1    0    0    0    0    0
## 2: gid2 IID12   2   0   1         1    0    0    0    0    0
## 3: gid3 IID13   3   0   1         1    0    0    0    0    0
## 4: gid4 IID14   4   0   1         1    0    0    0    0    0
## 5: gid5 IID15   5   0   1         1    0    0    0    0    0
## 6: gid6 IID16   6   0   1         1    0    0    0    0    0

You could also avoid the loop:

copy3[, (new) := as.list(rep(0, length(new)))][]
##       FID   IID PAT MAT SEX PHENOTYPE SNP1 SNP2 SNP3 SNP4 SNP5
##  1:  gid1 IID11   1   0   1         1    0    0    0    0    0
##  2:  gid2 IID12   2   0   1         1    0    0    0    0    0
##  3:  gid3 IID13   3   0   1         1    0    0    0    0    0
##  4:  gid4 IID14   4   0   1         1    0    0    0    0    0
##  5:  gid5 IID15   5   0   1         1    0    0    0    0    0
##  6:  gid6 IID16   6   0   1         1    0    0    0    0    0
##  7:  gid7 IID17   7   0   1         1    0    0    0    0    0
##  8:  gid8 IID18   8   0   1         1    0    0    0    0    0
##  9:  gid9 IID19   9   0   1         1    0    0    0    0    0
## 10: gid10 IID20  10   0   1         1    0    0    0    0    0

Note that quote and eval are not needed for these.
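The question also asked about NA. As a rough sketch, assuming numeric NA columns are wanted, the same parenthesised form works without a loop. The names dt and new_cols below are hypothetical and only for illustration.

```r
library(data.table)

# Toy table; `dt` and `new_cols` are hypothetical names for this sketch
dt <- data.table(id = 1:3)
new_cols <- c("SNP1", "SNP2")

# The parentheses tell data.table to use the character vector stored in
# new_cols as the column names, so quote() and eval() are not needed.
# NA_real_ makes the new columns numeric; a bare NA would make them logical.
dt[, (new_cols) := NA_real_]

# Trailing [] forces printing after the by-reference update
dt[]
```

Using a typed NA such as NA_real_ or NA_integer_ avoids a type coercion later when real values are assigned into the column.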

Even with this small dataset, the performance difference between set and using := in a loop is measurable:

fun1 <- function() { for (i in new) { set(copy1, j = i, value = 0)[] }; copy1 }
fun2 <- function() { for (i in new) { copy2[, (i) := 0][] } ; copy2 }
fun3 <- function() copy3[, (new) := as.list(rep(0, length(new)))][]

bench::mark(fun1(), fun2(), fun3())
## # A tibble: 3 x 13
##   expression     min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
##   <bch:expr> <bch:t> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
## 1 fun1()      64.9µs  69.63µs    13932.        0B     4.17  6689     2
## 2 fun2()       993µs   1.07ms      910.   377.6KB     4.23   430     2
## 3 fun3()     241.9µs 255.12µs     3793.    16.4KB     4.30  1763     2
## # … with 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
## #   time <list>, gc <list>