Im solving a problem for some time already. I would like to iterate through some columns of the data.table referenced by variable names and for each iteration id like to make new column based on some condition. To be specific i want to check the positions of NA in each column and make new column with variable counter that start from 1 every time the NA occurs.
I already tried for loop function to data.table with lapply, for cycle with data.table, even with set(DT...). I know .SD notation in data.table. I found rleid() function for relative indexes. But nothing seems to be working together.
Here i create data and get names for columns:
B <- data.table(c(2,NA,NA,2,2,1,2,3,4,NA,2,3,4,2,NA,NA,NA,1,1,1,NA,NA,NA,6,6,6),
c(2,NA,NA,2,"a",1,2,3,4,NA,2,"ah",4,2,NA,NA,NA,1,1,1,NA,NA,NA,6,6.6,6),
c(NA,2,2,2,2,1,2,3,4,NA,2,3,4,2,NA,NA,NA,1,1,1,NA,NA,NA,6,6,6))
nm <- letters[seq(1:3)]
nmx <- names(B)
Since i could not make it all in one statement i was experimenting with binary codes. I can create a column or vector of 0/1 for one column like this: as.numeric(!is.na(B[,V1]))
or like this: B[, a := ifelse(!is.na(V1), yes = 1, no = 0)]
with same result:
[1] 1 0 0 1 1 1 1 1 1 0 1 1 1 1 0 0 0 1 1 1 0 0 0 1 1 1
Thank i can use rleid() and make this:
[1] 1 2 2 3 3 3 3 3 3 4 5 5 5 5 6 6 6 7 7 7 8 8 8 9 9 9
which is good but not perfect, i would prefer to count only 1 and have NA on place of NA. But i dont care of group values since i want to find longest stream of rows without NA.
I can do it like this for each column separetly, but i have many columns and nothing is working for me. For example here. I want to create multiple columns with 0/1 but it is not woking.
B[,(nm) := ifelse(!is.na(nmx), yes = 1, no = 0)]
for (i in 1:3) B[, (nm[i]) := as.numeric(!is.na(B[,(nmx[i])]))]
If i can create all 0/1 columns i can use lapply() and than cbind() two datasets, original one with columns and new one with relative incexes.
A <– B[,lapply(.SD,rleid), .SDcols = (nmx)]
cbind(B,A)
To make long story short. Id like to have output like below and it need to be for any number of columns and need to be quite efficient because I have over 2.5mil row dataset. I see a big problem having column name stored on i-th place, data.table somehow cant evaluate the variable name hidden in the index. I feel like im missing some super basic stuff.
V1 V2 V3 a b c
1: 2 2 NA 1 1 NA
2: NA <NA> 2 NA NA 1
3: NA <NA> 2 NA NA 1
4: 2 2 2 2 2 1
5: 2 a 2 2 2 1
6: 1 1 1 2 2 1
7: 2 2 2 2 2 1
8: 3 3 3 2 2 1
9: 4 4 4 2 2 1
10: NA <NA> NA NA NA NA
11: 2 2 2 3 3 2
12: 3 ah 3 3 3 2
13: 4 4 4 3 3 2
14: 2 2 2 3 3 2
15: NA <NA> NA NA NA NA
16: NA <NA> NA NA NA NA
17: NA <NA> NA NA NA NA
18: 1 1 1 4 4 3
19: 1 1 1 4 4 3
20: 1 1 1 4 4 3
21: NA <NA> NA NA NA NA
22: NA <NA> NA NA NA NA
23: NA <NA> NA NA NA NA
24: 6 6 6 5 5 4
25: 6 6.6 6 5 5 4
26: 6 6 6 5 5 4
V1 V2 V3 a b c