4
votes

I have something like the following dataset:

myDT <- structure(list(domain = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), id = 2:22, L1 = 2:22), row.names = c(NA, 
-21L), class = c("data.table", "data.frame"))

and I would like to create a new column L2 that creates an index for every 2 rows within domain. However, if there is a remainder, like in the case for domain=2 and id=8,9,10, then those ids should be indexed together as long as its within the same domain. Please note that the specific id values in the toy dataset are made up and not always consecutive as shown. The output would be:

  structure(list(domain = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
                            3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), id = 2:22, L1 = 2:22, L2=c(1L,1L,2L,2L,3L,3L,4L,4L,4L,
                                                                                                        5L,5L,6L,6L,7L,7L,8L,8L,9L,9L,10L,10L)), 
            row.names = c(NA, -21L), class = c("data.table", "data.frame"))

Is there an efficient way to do this in data.table?

I've tried playing with .N/rowid and the integer division operator %/% (since every n rows should give the same value) inside the subset call but it got me nowhere. For example, I tried something like:

myDT[, L2 := rowid(domain)%/%2]

but clearly this doesn't address the requirements that the last 3 rows within in domain=2 have the same index and that the index should continue incrementing for domain=3.

EDIT Please see revised desired output data table and corresponding description.

EDIT 2

Here is an appended version of myDT:

myDT2 <- structure(list(domain = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L), id = 2:40, 
    L1 = 2:40), row.names = c(NA, -39L), class = c("data.table", 
"data.frame"))

When I ran @chinsoon12's code on the above, I get:

structure(list(domain = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L), id = 2:40, 
    L1 = 2:40, L2 = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 5L, 
    5L, 6L, 6L, 7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L, 11L, 11L, 11L, 
    11L, 12L, 12L, 13L, 13L, 14L, 14L, 15L, 15L, 16L, 16L, 17L, 
    17L, 18L, 18L)), row.names = c(NA, -39L), class = c("data.table", 
"data.frame"))

There appears to be 4 values of L2=11, when two of them should be 12 because they are in a different domain.

3
do you need to generalize to diff window size other than 2?chinsoon12

3 Answers

2
votes

An idea is to make a custom function that will create sequential vectors based on the length of each group and the remainder of that length when divided by two. The function is,

f1 <- function(x) {
    v1 <- length(x)
    i1 <- rep(seq(floor(v1 / 2)), each = 2)
    i2 <- c(i1, rep(max(i1), v1 %% 2))
    i2 + seq_along(i2)
}

I tried to apply it via data.table but I was getting an error about a bug so here it is with base R,

cumsum(c(TRUE, diff(with(myDT2, ave(id, domain, FUN = f1))) != 1))
#[1]  1  1  2  2  3  3  4  4  4  5  5  6  6  7  7  8  8  9  9 10 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 18 18 19 19
2
votes

Here is another approach updated for the edited question (inspired by @Sotos use of cumsum):

  1. For each domain id, create a repeated sequence 1, 0, 1, 0, 1, ..., setting the final sequence element to zero by default.
  2. Take the cumsum over the created sequence across domain id's.
library(data.table)

setDT(myDT2)

myDT2[, L2 := c(head(rep_len(c(1, 0), .N), -1), 0), by = domain][, L2 := cumsum(L2)][]
#>     domain id L1 L2
#>  1:      2  2  2  1
#>  2:      2  3  3  1
#>  3:      2  4  4  2
#>  4:      2  5  5  2
#>  5:      2  6  6  3
#>  6:      2  7  7  3
#>  7:      2  8  8  4
#>  8:      2  9  9  4
#>  9:      2 10 10  4
#> 10:      3 11 11  5
#> 11:      3 12 12  5
#> 12:      3 13 13  6
#> 13:      3 14 14  6
#> 14:      3 15 15  7
#> 15:      3 16 16  7
#> 16:      3 17 17  8
#> 17:      3 18 18  8
#> 18:      3 19 19  9
#> 19:      3 20 20  9
#> 20:      3 21 21 10
#> 21:      3 22 22 10
#> 22:      4 23 23 11
#> 23:      4 24 24 11
#> 24:      5 25 25 12
#> 25:      5 26 26 12
#> 26:      5 27 27 13
#> 27:      5 28 28 13
#> 28:      5 29 29 14
#> 29:      5 30 30 14
#> 30:      5 31 31 15
#> 31:      5 32 32 15
#> 32:      5 33 33 16
#> 33:      5 34 34 16
#> 34:      5 35 35 17
#> 35:      5 36 36 17
#> 36:      5 37 37 18
#> 37:      5 38 38 18
#> 38:      5 39 39 19
#> 39:      5 40 40 19
#>     domain id L1 L2
2
votes

Here is another option for variable number of repeats other than 2:

n <- 4
setDT(myDT)[, L2 := 
        myDT[, {
            x <- ceiling(seq_along(id)/n)
            if (sum(x==x[.N]) < n) x[x==x[.N]] <- floor(.N/n)
            x
        }, domain][, rleid(domain, V1)]
    ]

Or a recursive approach:

n <- 4
s <- 0
setDT(myDT)[, L2 := 
        myDT[, {
            x <- s + ceiling(seq_along(id)/n)
            if (sum(x==x[.N]) < n) x[x==x[.N]] <- s + floor(.N/n)
            s <- if (s<max(x)) max(x) else s + 1
            x
        }, domain]$V1
    ]

output for n=2:

    domain id L1 L2
 1:      2  2  2  1
 2:      2  3  3  1
 3:      2  4  4  2
 4:      2  5  5  2
 5:      2  6  6  3
 6:      2  7  7  3
 7:      2  8  8  4
 8:      2  9  9  4
 9:      2 10 10  4
10:      3 11 11  5
11:      3 12 12  5
12:      3 13 13  6
13:      3 14 14  6
14:      3 15 15  7
15:      3 16 16  7
16:      3 17 17  8
17:      3 18 18  8
18:      3 19 19  9
19:      3 20 20  9
20:      3 21 21 10
21:      3 22 22 10
22:      4 23 23 11
23:      4 24 24 11
24:      5 25 25 12
25:      5 26 26 12
26:      5 27 27 13
27:      5 28 28 13
28:      5 29 29 14
29:      5 30 30 14
30:      5 31 31 15
31:      5 32 32 15
32:      5 33 33 16
33:      5 34 34 16
34:      5 35 35 17
35:      5 36 36 17
36:      5 37 37 18
37:      5 38 38 18
38:      5 39 39 19
39:      5 40 40 19
    domain id L1 L2

output for n=4:

    domain id L1 L2
 1:      2  2  2  1
 2:      2  3  3  1
 3:      2  4  4  1
 4:      2  5  5  1
 5:      2  6  6  2
 6:      2  7  7  2
 7:      2  8  8  2
 8:      2  9  9  2
 9:      2 10 10  2
10:      3 11 11  3
11:      3 12 12  3
12:      3 13 13  3
13:      3 14 14  3
14:      3 15 15  4
15:      3 16 16  4
16:      3 17 17  4
17:      3 18 18  4
18:      3 19 19  5
19:      3 20 20  5
20:      3 21 21  5
21:      3 22 22  5
22:      4 23 23  6
23:      4 24 24  6
24:      5 25 25  7
25:      5 26 26  7
26:      5 27 27  7
27:      5 28 28  7
28:      5 29 29  8
29:      5 30 30  8
30:      5 31 31  8
31:      5 32 32  8
32:      5 33 33  9
33:      5 34 34  9
34:      5 35 35  9
35:      5 36 36  9
36:      5 37 37 10
37:      5 38 38 10
38:      5 39 39 10
39:      5 40 40 10
    domain id L1 L2