0
votes

I have a data frame with 3 columns. (https://pastebin.com/DFqUuuDp)

The first two columns ("Time1","Time2") contain datetime data and have both the posixct format: "%Y-%m-%d %H:%M:%S".

So what i need ultimately is a subselect of rows, where for a particular time in Time1, only the rows are selected where Time2

  • is one day before Time1 (here one day is not always 24 hours; one day means the calendar day before. basically "yesterday")
  • and Time2 is < 12:00 Am.

A correct example:

+---------------------+----------------------+
|        Time1        |        Time2         |
+---------------------+----------------------+
| 2016-11-01 00:00:00 | 2016-10-31 00:00:00  |
+---------------------+----------------------+

A wrong example:

+---------------------+----------------------+
|        Time1        |        Time2         |
+---------------------+----------------------+
| 2016-11-01 00:00:00 | 2016-10-31 12:00:00  |
+---------------------+----------------------+

In the uploaded file, I added manually the third column ("Value") as a guidance for rows which i want to have filtered at the end. The rows with a "True" are interesting me.

I solved it with two for loops, but it is very slow going through large tables.

3
First off, the code from pastebin is giving an error of "Error: unexpected '<' in " -3000L". What did you do in the for loop? This doesn't seem like a problem that needs to be solved with loops. Have you tried using dplyr and just putting in the conditions in a filter? i.e. df %>% filter(time2-time1 <= 1 day & time2 < 12am)? - Pawel
sorry, i reuploaded the file. It was a data.table before. Now i i have uploaded a data.frame. ---- I don't know how to formulate (time2-time1 <= 1 day & time2 < 12am). How would you code the 'time2-time1 <= 1 day' part? - UDE_Student
@UDE_Student Your data is very big. Why dont you just share dput(head(df,20))? It will be easier for everyone. - MKR
yes you are right, i updated my initial post with the shorter version: pastebin.com/DFqUuuDp - UDE_Student
I have added a possible solution in base-R using lubridate. Have a look, - MKR

3 Answers

2
votes

The question has been tagged data.table. In addition, the sample datasets provided by the OP are of class data.table. Therefore, I feel obliged to post a data.table solution:

library(data.table)
DT[as.IDate(Time1) - 1L == as.IDate(Time2) & hour(Time2) < 12]
                 Time1      Time2 Value
1: 2016-11-01 00:00:00 2016-10-31  TRUE
2: 2016-11-01 00:30:00 2016-10-31  TRUE
3: 2016-11-01 01:00:00 2016-10-31  TRUE
4: 2016-11-01 01:30:00 2016-10-31  TRUE
# check if result is correct
identical(DT[as.IDate(Time1) - 1L == as.IDate(Time2) & hour(Time2) < 12L],
          DT[Value == "TRUE"])
[1] TRUE

as.IDate() is a Date class with integer storage for fast sorting and grouping. So, we can use integer arithmetic to compute the previous day. hour() is also provided by the data.table package and returns the hours of the day as integer value.

Reproducible data

Data copied from the pastebin link provided by the OP on 2018-05-29 at 22:00 UTC but with the .internal.selfref pointer removed:

DT <- structure(
  list(
    Time1 = structure(c(1477958400, 1477958400, 1477958400,
                        1477958400, 1477958400, 1477958400, 1477958400, 1477960200, 1477960200,
                        1477960200, 1477960200, 1477960200, 1477960200, 1477960200, 1477962000,
                        1477962000, 1477962000, 1477962000, 1477962000, 1477962000, 1477962000,
                        1477963800, 1477963800, 1477963800, 1477963800, 1477963800, 1477963800,
                        1477963800), 
                      class = c("POSIXct", "POSIXt"), tzone = "UTC"),
    Time2 = structure(c(1477699200, 1477742400, 1477785600, 1477828800,
                        1477872000, 1477915200, 1477958400, 1477699200, 1477742400,
                        1477785600, 1477828800, 1477872000, 1477915200, 1477958400,
                        1477699200, 1477742400, 1477785600, 1477828800, 1477872000,
                        1477915200, 1477958400, 1477699200, 1477742400, 1477785600,
                        1477828800, 1477872000, 1477915200, 1477958400), 
                      class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    Value = c("FALSE", "FALSE", "FALSE",
              "FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE",
              "FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE",
              "FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE",
              "FALSE", "TRUE", "FALSE", "FALSE")), 
  .Names = c("Time1", "Time2", "Value"), 
  row.names = c(NA, -28L), 
  class = c("data.table", "data.frame"))
2
votes

The following solution works. The solution uses data shared by OP

library(dplyr)
library(lubridate)

df %>%
  filter((as.Date(Time2)+days(1)) == as.Date(Time1) & format(Time2, "%H") < 12)


#                   Time1      Time2 Value
# 1   2016-11-01 00:00:00 2016-10-31  TRUE
# 2   2016-11-01 00:30:00 2016-10-31  TRUE
# 3   2016-11-01 01:00:00 2016-10-31  TRUE
# 4   2016-11-01 01:30:00 2016-10-31  TRUE
# 5   2016-11-01 02:00:00 2016-10-31  TRUE
# 6   2016-11-01 02:30:00 2016-10-31  TRUE
# 7   2016-11-01 03:00:00 2016-10-31  TRUE
# 8   2016-11-01 03:30:00 2016-10-31  TRUE
# 9   2016-11-01 04:00:00 2016-10-31  TRUE
# 10  2016-11-01 04:30:00 2016-10-31  TRUE
# so on
1
votes

A solution using subset and lubridate package can be using an approach as:

  1. Add 1 day to Time2 and then check Time1 & Time2 belongs to same day.
  2. Format Time2 in HHMMSS format and then check it is less than 120000 (noon)

The code:

library(lubridate)

subset(df, format(Time1,"%Y%m%d") == format(Time2+days(1),"%Y%m%d") &
        as.integer(format(Time2, "%H%M%S")) < 120000 )

#                    Time1      Time2 Value
# 19   2016-11-01 00:00:00 2016-10-31  TRUE
# 39   2016-11-01 00:30:00 2016-10-31  TRUE
# 59   2016-11-01 01:00:00 2016-10-31  TRUE
# 79   2016-11-01 01:30:00 2016-10-31  TRUE
# 99   2016-11-01 02:00:00 2016-10-31  TRUE
# 119  2016-11-01 02:30:00 2016-10-31  TRUE
# 139  2016-11-01 03:00:00 2016-10-31  TRUE
# 159  2016-11-01 03:30:00 2016-10-31  TRUE
# 179  2016-11-01 04:00:00 2016-10-31  TRUE
# 
# so on

Note: Time2 for all rows as part of subset contains 00:00:00. Hence its not appearing in above print.

Data:

head(df, 20)
# Time1               Time2 Value
# 1  2016-11-01 2016-10-22 00:00:00 FALSE
# 2  2016-11-01 2016-10-22 12:00:00 FALSE
# 3  2016-11-01 2016-10-23 00:00:00 FALSE
# 4  2016-11-01 2016-10-23 12:00:00 FALSE
# 5  2016-11-01 2016-10-24 00:00:00 FALSE
# 6  2016-11-01 2016-10-24 12:00:00 FALSE
# 7  2016-11-01 2016-10-25 00:00:00 FALSE
# 8  2016-11-01 2016-10-25 12:00:00 FALSE
# 9  2016-11-01 2016-10-26 00:00:00 FALSE
# 10 2016-11-01 2016-10-26 12:00:00 FALSE
# 11 2016-11-01 2016-10-27 00:00:00 FALSE
# 12 2016-11-01 2016-10-27 12:00:00 FALSE
# 13 2016-11-01 2016-10-28 00:00:00 FALSE
# 14 2016-11-01 2016-10-28 12:00:00 FALSE
# 15 2016-11-01 2016-10-29 00:00:00 FALSE
# 16 2016-11-01 2016-10-29 12:00:00 FALSE
# 17 2016-11-01 2016-10-30 00:00:00 FALSE
# 18 2016-11-01 2016-10-30 12:00:00 FALSE
# 19 2016-11-01 2016-10-31 00:00:00  TRUE
# 20 2016-11-01 2016-10-31 12:00:00 FALSE