2
votes

Im solving a problem for some time already. I would like to iterate through some columns of the data.table referenced by variable names and for each iteration id like to make new column based on some condition. To be specific i want to check the positions of NA in each column and make new column with variable counter that start from 1 every time the NA occurs.

I already tried for loop function to data.table with lapply, for cycle with data.table, even with set(DT...). I know .SD notation in data.table. I found rleid() function for relative indexes. But nothing seems to be working together.

Here i create data and get names for columns:

B <- data.table(c(2,NA,NA,2,2,1,2,3,4,NA,2,3,4,2,NA,NA,NA,1,1,1,NA,NA,NA,6,6,6),
                c(2,NA,NA,2,"a",1,2,3,4,NA,2,"ah",4,2,NA,NA,NA,1,1,1,NA,NA,NA,6,6.6,6),
                c(NA,2,2,2,2,1,2,3,4,NA,2,3,4,2,NA,NA,NA,1,1,1,NA,NA,NA,6,6,6))

nm <- letters[seq(1:3)]
nmx <- names(B)

Since i could not make it all in one statement i was experimenting with binary codes. I can create a column or vector of 0/1 for one column like this: as.numeric(!is.na(B[,V1])) or like this: B[, a := ifelse(!is.na(V1), yes = 1, no = 0)] with same result:

[1] 1 0 0 1 1 1 1 1 1 0 1 1 1 1 0 0 0 1 1 1 0 0 0 1 1 1

Thank i can use rleid() and make this:

[1] 1 2 2 3 3 3 3 3 3 4 5 5 5 5 6 6 6 7 7 7 8 8 8 9 9 9

which is good but not perfect, i would prefer to count only 1 and have NA on place of NA. But i dont care of group values since i want to find longest stream of rows without NA.

I can do it like this for each column separetly, but i have many columns and nothing is working for me. For example here. I want to create multiple columns with 0/1 but it is not woking.

B[,(nm) := ifelse(!is.na(nmx), yes = 1, no = 0)]
for (i in 1:3) B[, (nm[i]) := as.numeric(!is.na(B[,(nmx[i])]))]

If i can create all 0/1 columns i can use lapply() and than cbind() two datasets, original one with columns and new one with relative incexes.

A <– B[,lapply(.SD,rleid), .SDcols = (nmx)]
cbind(B,A)

To make long story short. Id like to have output like below and it need to be for any number of columns and need to be quite efficient because I have over 2.5mil row dataset. I see a big problem having column name stored on i-th place, data.table somehow cant evaluate the variable name hidden in the index. I feel like im missing some super basic stuff.

    V1   V2 V3  a  b  c
 1:  2    2 NA  1  1 NA
 2: NA <NA>  2 NA NA  1
 3: NA <NA>  2 NA NA  1
 4:  2    2  2  2  2  1
 5:  2    a  2  2  2  1
 6:  1    1  1  2  2  1
 7:  2    2  2  2  2  1
 8:  3    3  3  2  2  1
 9:  4    4  4  2  2  1
10: NA <NA> NA NA NA NA
11:  2    2  2  3  3  2
12:  3   ah  3  3  3  2
13:  4    4  4  3  3  2
14:  2    2  2  3  3  2
15: NA <NA> NA NA NA NA
16: NA <NA> NA NA NA NA
17: NA <NA> NA NA NA NA
18:  1    1  1  4  4  3
19:  1    1  1  4  4  3
20:  1    1  1  4  4  3
21: NA <NA> NA NA NA NA
22: NA <NA> NA NA NA NA
23: NA <NA> NA NA NA NA
24:  6    6  6  5  5  4
25:  6  6.6  6  5  5  4
26:  6    6  6  5  5  4
    V1   V2 V3  a  b  c
3
yes, thank you for correction :)Gray Jackal

3 Answers

3
votes

You can write a helper function using rle :

library(data.table)

assign_values <- function(x) {
 value <-  with(rle(!is.na(x)), rep(cumsum(values), lengths))
 value[is.na(x)] <- NA
 value
}

Now apply this function for nmx columns.

B[, (nm) := lapply(.SD, assign_values), .SDcols = nmx]
B

#    V1   V2 V3  a  b  c
# 1:  2    2 NA  1  1 NA
# 2: NA <NA>  2 NA NA  1
# 3: NA <NA>  2 NA NA  1
# 4:  2    2  2  2  2  1
# 5:  2    a  2  2  2  1
# 6:  1    1  1  2  2  1
# 7:  2    2  2  2  2  1
# 8:  3    3  3  2  2  1
# 9:  4    4  4  2  2  1
#10: NA <NA> NA NA NA NA
#11:  2    2  2  3  3  2
#12:  3   ah  3  3  3  2
#13:  4    4  4  3  3  2
#14:  2    2  2  3  3  2
#15: NA <NA> NA NA NA NA
#16: NA <NA> NA NA NA NA
#17: NA <NA> NA NA NA NA
#18:  1    1  1  4  4  3
#19:  1    1  1  4  4  3
#20:  1    1  1  4  4  3
#21: NA <NA> NA NA NA NA
#22: NA <NA> NA NA NA NA
#23: NA <NA> NA NA NA NA
#24:  6    6  6  5  5  4
#25:  6  6.6  6  5  5  4
#26:  6    6  6  5  5  4
#    V1   V2 V3  a  b  c
1
votes

Here is another option:

for (k in seq_along(B))
    eval(substitute(B[, v := rleid(is.na(COL))][!is.na(COL), VAL := rleid(v)],
        list(COL=as.name(nmx[k]), VAL=as.name(nm[k]))))
B[, v:=NULL][]
0
votes

I'm not sure about the NA part, but does using mutate_all and cumsum help you at all? It seems like you want to keep a running tally of how many NAs you have encountered so far in the data, which cumsum does, and mutate_all will do it across all columns. If you wanted to do the operation over some columns, you could use mutate(across)) and pass in which columns you want to apply the operation to. Here is a sample I was working on from your data:

library(data.table)
B <- data.table(c(2,NA,NA,2,2,1,2,3,4,NA,2,3,4,2,NA,NA,NA,1,1,1,NA,NA,NA,6,6,6),
                c(2,NA,NA,2,"a",1,2,3,4,NA,2,"ah",4,2,NA,NA,NA,1,1,1,NA,NA,NA,6,6.6,6),
                c(NA,2,2,2,2,1,2,3,4,NA,2,3,4,2,NA,NA,NA,1,1,1,NA,NA,NA,6,6,6))
C <- B %>% 
  mutate_all(compose(cumsum, is.na))
colnames(C) <- str_replace(colnames(C), "V", "X")
B <- cbind(B, C)
B %>% 
  group_by_at((ncol(B) / 2 + 1): ncol(B)) %>% 
  summarise(count = n() - 1) %>% 
  arrange(-count)

We obtain the following dataframe:

   V1   V2 V3 X1 X2 X3
 1:  2    2 NA  0  0  1
 2: NA <NA>  2  1  1  1
 3: NA <NA>  2  2  2  1
 4:  2    2  2  2  2  1
 5:  2    a  2  2  2  1
 6:  1    1  1  2  2  1
 7:  2    2  2  2  2  1
 8:  3    3  3  2  2  1
 9:  4    4  4  2  2  1
10: NA <NA> NA  3  3  2
11:  2    2  2  3  3  2
12:  3   ah  3  3  3  2
13:  4    4  4  3  3  2
14:  2    2  2  3  3  2
15: NA <NA> NA  4  4  3
16: NA <NA> NA  5  5  4
17: NA <NA> NA  6  6  5
18:  1    1  1  6  6  5
19:  1    1  1  6  6  5
20:  1    1  1  6  6  5
21: NA <NA> NA  7  7  6
22: NA <NA> NA  8  8  7
23: NA <NA> NA  9  9  8
24:  6    6  6  9  9  8
25:  6  6.6  6  9  9  8
26:  6    6  6  9  9  8
    V1   V2 V3 X1 X2 X3

and see the longest streak is 6 rows without any NAs:

 A tibble: 18 x 5
# Groups:   X1, X2 [10]
      X1    X2    X3 count
   <int> <int> <int> <dbl>
 1     2     2     1     6
 2     3     3     2     4
 3     6     6     5     3
 ....

Another alternative would be to use complete_cases and look for the longest streak in there, but this might be a bit messy: B %>% complete.cases() and then go from there.