1
votes

I have a large dataset of datetimes for almost a full year for every second of the year. I am trying to dplyr::left_join a second dataset that has a datetime column with values within the time range in the first dataset. When I join the dataset, only a small number of records join (about 100 from about 45k) and I know most records should be joining. The checks I'm doing to ensure the columns are the same include:

dput(df_all_dates$date_time[1])
dput(df_subset_dates$date_time[1])

Both of these produce the following:

structure(1485781200, class = c("POSIXct", "POSIXt"), tzone = "")

I've also done the following comparison (the 10 and the 4701 in the following code reflect the same dates in the data):

as.numeric(df_all_dates$date_time[10]) # produces value 1485785900
as.numeric(df_subset_dates$date_time[4701]) # produces value 1485785900

However, in the join, the data from the df_subset_dates does not join into the resulting dataset, even though the datetime values are the same. Is there something else about datetimes that would cause these not to join? Some values do join, but I don't see any pattern as to why those records are different from the ones that do not join.

Here is the code of the actual join, if helpful:

df_all_dates %>%
 left_join(df_subset_dates, by = 'date_time')
2
Please include a reproducible example - Moody_Mudskipper
I have had issues joining dates in the past... Not quite what you describe here, but I would just convert them to numeric, join, and convert back. And without a more detailed example it is difficult to diagnose. - Ian Wesley
If the OP could make a reproducible example, they would know the solution! I found the question and answer helpful. Changing the dates to numeric didn't show any immediate issues. I used lubridate::seconds() on both dates and found jstauss answer was helpful. - ARobertson

2 Answers

2
votes

It's a little silly, but I figured out the problem - some of the dttm's had milliseconds included in them. I couldn't see them anywhere than in a couple examples using dput(). Once I removed these, the data joined fine.

1
votes

On my separate dataset, checking both 'by' columns with lubridate::seconds(date_time) showed they were formatted differently due to milliseconds, though it didn't show up in most displays. "1522267608S" vs "1522267308.443S" (these aren't supposed to match, just to show formatting)

Wrapping one or both of the columns in the following to remove milliseconds did the trick for me:

library(lubridate)
as_datetime(floor(seconds(date_time)))

I haven't extensively tested other cases, but floor() worked perfectly for my data. It might be that round() or ceiling() works for others.