4
votes

This data.table shows the months of year attended by students.

DT = data.table(
 Student = c(1, 1, 1, 1, 1, 1, 1, 1, 1,
             2, 2, 2, 2, 2, 2, 2, 2,
             3, 3, 3, 3, 3, 3, 3, 3),
 Month   = c(1, 2, 3, 5, 6, 7, 8, 11, 12,
             2, 3, 4, 5, 7, 8, 9, 10,
             1, 2, 3, 5, 6, 7, 8, 9))

DT
    Student Month
 1:       1     1
 2:       1     2
 3:       1     3
 4:       1     5
 5:       1     6
 6:       1     7
 7:       1     8
 8:       1    11
 9:       1    12
10:       2     2
11:       2     3
12:       2     4
13:       2     5
14:       2     7
15:       2     8
16:       2     9
17:       2    10
18:       3     1
19:       3     2
20:       3     3
21:       3     5
22:       3     6
23:       3     7
24:       3     8
25:       3     9

I want to identify periods of three consecutive months (identified by the first month in the period). This is visualization of the data table and the eligible periods.

       1   2   3   4   5   6   7   8   9   10  11  12


1      *   *   *       *   *   *   *           *   *
       [-------]       [-------]
                           [-------]                           


2          *   *   *   *       *   *   *   *
           [-------]           [-------]
               [-------]           [-------]


3      *   *   *       *   *   *   *   *      
       [-------]       [-------]
                           [-------]
                               [-------]

Desired output:

id   First_month_in_the_period 

1    1
1    5
1    6
2    2
2    3
2    7
2    8
3    1
3    5
3    6
3    7

Looking for data.table (or dplyr) solutions.

6
Just a doubt, why is 1, 2, 3 not consecutive for the id 1akrun
They are: the desired output includes id = 1 with Month = 1. Or am I missing something?Orion
Sorry, I didn't get the logic correct. In the desired output, 2, 3 rows are removed, similarly 7 is removed fro 'id' 1. May be if you can describe more clearlyakrun
My apologies! I should've made it clearerOrion

6 Answers

7
votes

Use standard method (cumsum...diff...condition) to identify runs of consecutive values, which then is used as grouping variable together with 'Student'. Within each group, create sequence based on length of each run and add to first month.

DT[ , .(start = if(.N >= 3) Month[1] + 0:(.N - 3)),
    by = .(Student, r = cumsum(c(1L, diff(Month) > 1)))]
#     Student r start
#  1:       1 1     1
#  2:       1 2     5
#  3:       1 2     6
#  4:       2 3     2
#  5:       2 3     3
#  6:       2 4     7
#  7:       2 4     8
#  8:       3 4     1
#  9:       3 5     5
# 10:       3 5     6
# 11:       3 5     7

Equivalent dplyr alternative:

DT %>% 
  group_by(Student, r = cumsum(c(1L, diff(Month) > 1))) %>%
  summarise(list(data.frame(start = if(n() >= 3) Month[1] + 0:(n() - 3)))) %>%
  tidyr::unnest()

# # A tibble: 11 x 3
# # Groups:   Student [3]
#       Student     r start
#         <dbl> <int> <dbl>
#     1       1     1     1
#     2       1     2     5
#     3       1     2     6
#     4       2     3     2
#     5       2     3     3
#     6       2     4     7
#     7       2     4     8
#     8       3     4     1
#     9       3     5     5
#    10       3     5     6
#    11       3     5     7
6
votes

A solution using the tidyverse.

library(tidyverse)
library(data.table)

DT2 <- DT %>%
  arrange(Student, Month) %>%
  group_by(Student) %>%
  # Create sequence of 3
  mutate(Seq = map(Month, ~seq.int(.x, .x + 2L))) %>%
  # Create a flag to show if the sequence match completely with the Month column 
  mutate(Flag = map_lgl(Seq, ~all(.x %in% Month))) %>%
  # Filter the Flag for TRUE
  filter(Flag) %>%
  # Remove columns
  select(-Seq, -Flag) %>%
  ungroup()

DT2
# # A tibble: 11 x 2
#    Student Month
#      <dbl> <dbl>
#  1       1     1
#  2       1     5
#  3       1     6
#  4       2     2
#  5       2     3
#  6       2     7
#  7       2     8
#  8       3     1
#  9       3     5
# 10       3     6
# 11       3     7
3
votes

Here is one solution, its uses the group by that data.table provides,

seqfun <- function(month) {
    n <- length(month)
    tmp <- data.table(a=month[1:(n-2)],b=month[2:(n-1)],c=month[3:n])
    month[which(apply(tmp,1,function(x){all(c(1,1)==diff(x))}))]}

Result <- DT[,seqfun(Month), by=Student]
names(Result) <- c("Student","Month")
> Result
    Student Month
 1:       1     1
 2:       1     5
 3:       1     6
 4:       2     2
 5:       2     3
 6:       2     7
 7:       2     8
 8:       3     1
 9:       3     5
10:       3     6
11:       3     7

Essentially it takes the groups month vector, creates 3 vectors to compare diffs and checks if both diffs are a distance of 1. If so, the original month vector's index is returned.

A little bit of detail. Suppose we have,

month <- c(1,2,3,5,6,7,8,11,12)

and we compute the tmp data.table (Note: you can also use the rollapply function in zoo to create a similar table, I'll show this at the very bottom)

   a  b  c
1: 1  2  3
2: 2  3  5
3: 3  5  6
4: 5  6  7
5: 6  7  8
6: 7  8 11
7: 8 11 12

When we take the diff across rows, we get,

> apply(tmp,1,function(x){all(c(1,1)==diff(x))})
[1]  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE

The true values are the indices we are interested in.

As mentioned above, using the zoo library's rollapply we could have,

> apply(c(1,1)==rollapply(month,width=3,FUN=diff),1,all)
[1]  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE

to get a boolean vector of the indices we are interested for a specific Student.

2
votes

Here is a base R solution that creates a function that can be applied to a data.table:

cons3fun<-function(x,n){

              consec.list<-split(x,cumsum(c(1,diff(x)!=1))) #Splits into list based on consecutive numbers

              min.len.seq<-consec.list[which(sapply(consec.list,length)>(n-1))] #Selects only the list elements >= to n

              seq.start<-lapply(min.len.seq,function(i) i[1:(length(i)-(n-1))]) #Extracts the first number of each sequence of n

              return(as.vector(unlist(seq.start))) #Returns result as a vector
}

Note that this function will allow you to change the number of consecutive numbers you are looking for fairly easily. Here you would use n=3. Then you can apply this function using either data.table or dplyr. I will use data.table since you used one.

DT[,cons3fun(Month,3),by=.(Student)]

Hope you find this useful. Good luck!

1
votes

Here is my approach using tidyverse:

> as_tibble(DT) %>%
      arrange(Student, Month) %>%
      group_by(Student) %>%
      # create an identifier for the start of the sequence
      mutate(seq_id = ifelse(row_number() == 1 | Month - lag(Month) > 1,
                             letters[row_number()], NA)) %>%
      fill(seq_id) %>%
      # add another grouping level (sequence identifier)
      group_by(Student, seq_id) %>%
      # only keep data with attendance in 3 or more consecutive months 
      filter(length(seq_id) > 2) %>%
      # n consecutive months => n - 2 periods
      slice(1:(n() - 2)) %>%
      # clean up
      ungroup() %>%
      select(Student, Month)
# A tibble: 11 x 2
#   Student Month
#    <dbl> <dbl>
#1       1     1
#2       1     5
#3       1     6
#4       2     2
#5       2     3
#6       2     7
#7       2     8
#8       3     1
#9       3     5
#10      3     6
#11      3     7
1
votes

Another data.table approach...

#first, clculate the difference between months, by student.
ans <- DT[, diff := shift( Month, type = "lead" ) - Month ), by = .(Student)]
#then filter rows that are at the start of 2 consecutive differences of 1
#also, drop the temporary diff-column
ans[ diff == 1 & shift( diff, type = "lead" ) == 1,][, diff := NULL][]

voila

#    Student Month
# 1:       1     1
# 2:       1     5
# 3:       1     6
# 4:       2     2
# 5:       2     3
# 6:       2     7
# 7:       2     8
# 8:       3     1
# 9:       3     5
# 10:      3     6
# 11:      3     7