Filtering several date columns in R

Question

I have a data frame with 3 columns. (https://pastebin.com/DFqUuuDp)

The first two columns ("Time1","Time2") contain datetime data and have both the posixct format: "%Y-%m-%d %H:%M:%S".

So what i need ultimately is a subselect of rows, where for a particular time in Time1, only the rows are selected where Time2

is one day before Time1 (here one day is not always 24 hours; one day means the calendar day before. basically "yesterday")
and Time2 is < 12:00 Am.

A correct example:

+---------------------+----------------------+
|        Time1        |        Time2         |
+---------------------+----------------------+
| 2016-11-01 00:00:00 | 2016-10-31 00:00:00  |
+---------------------+----------------------+

A wrong example:

+---------------------+----------------------+
|        Time1        |        Time2         |
+---------------------+----------------------+
| 2016-11-01 00:00:00 | 2016-10-31 12:00:00  |
+---------------------+----------------------+

In the uploaded file, I added manually the third column ("Value") as a guidance for rows which i want to have filtered at the end. The rows with a "True" are interesting me.

I solved it with two for loops, but it is very slow going through large tables.

First off, the code from pastebin is giving an error of "Error: unexpected '<' in " -3000L". What did you do in the for loop? This doesn't seem like a problem that needs to be solved with loops. Have you tried using dplyr and just putting in the conditions in a filter? i.e. df %>% filter(time2-time1 <= 1 day & time2 < 12am)? — Pawel
sorry, i reuploaded the file. It was a data.table before. Now i i have uploaded a data.frame. ---- I don't know how to formulate (time2-time1 <= 1 day & time2 < 12am). How would you code the 'time2-time1 <= 1 day' part? — UDE_Student
@UDE_Student Your data is very big. Why dont you just share dput(head(df,20))? It will be easier for everyone. — MKR
yes you are right, i updated my initial post with the shorter version: pastebin.com/DFqUuuDp — UDE_Student
I have added a possible solution in base-R using lubridate. Have a look, — MKR

Uwe Uwe · Accepted Answer · 2018-05-29T23:03:24

The question has been tagged data.table. In addition, the sample datasets provided by the OP are of class data.table. Therefore, I feel obliged to post a data.table solution:

library(data.table)
DT[as.IDate(Time1) - 1L == as.IDate(Time2) & hour(Time2) < 12]

                 Time1      Time2 Value
1: 2016-11-01 00:00:00 2016-10-31  TRUE
2: 2016-11-01 00:30:00 2016-10-31  TRUE
3: 2016-11-01 01:00:00 2016-10-31  TRUE
4: 2016-11-01 01:30:00 2016-10-31  TRUE

# check if result is correct
identical(DT[as.IDate(Time1) - 1L == as.IDate(Time2) & hour(Time2) < 12L],
          DT[Value == "TRUE"])

[1] TRUE

as.IDate() is a Date class with integer storage for fast sorting and grouping. So, we can use integer arithmetic to compute the previous day. hour() is also provided by the data.table package and returns the hours of the day as integer value.

Reproducible data

Data copied from the pastebin link provided by the OP on 2018-05-29 at 22:00 UTC but with the .internal.selfref pointer removed:

DT <- structure(
  list(
    Time1 = structure(c(1477958400, 1477958400, 1477958400,
                        1477958400, 1477958400, 1477958400, 1477958400, 1477960200, 1477960200,
                        1477960200, 1477960200, 1477960200, 1477960200, 1477960200, 1477962000,
                        1477962000, 1477962000, 1477962000, 1477962000, 1477962000, 1477962000,
                        1477963800, 1477963800, 1477963800, 1477963800, 1477963800, 1477963800,
                        1477963800), 
                      class = c("POSIXct", "POSIXt"), tzone = "UTC"),
    Time2 = structure(c(1477699200, 1477742400, 1477785600, 1477828800,
                        1477872000, 1477915200, 1477958400, 1477699200, 1477742400,
                        1477785600, 1477828800, 1477872000, 1477915200, 1477958400,
                        1477699200, 1477742400, 1477785600, 1477828800, 1477872000,
                        1477915200, 1477958400, 1477699200, 1477742400, 1477785600,
                        1477828800, 1477872000, 1477915200, 1477958400), 
                      class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    Value = c("FALSE", "FALSE", "FALSE",
              "FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE",
              "FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE",
              "FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE",
              "FALSE", "TRUE", "FALSE", "FALSE")), 
  .Names = c("Time1", "Time2", "Value"), 
  row.names = c(NA, -28L), 
  class = c("data.table", "data.frame"))

Filtering several date columns in R

3 Answers

Reproducible data