0
votes

I have a time series of hourly weather data from 2018. Two time points are flagged as duplicates even though they hold different values. The same points are flagged both before and after the series is converted to POSIXct, and in both cases they print as different numeric values. I am not sure how to create a reprex, since the values are no longer flagged as duplicates once I build a data frame from the printed numeric values. I am reading the data into R from an XLSX file. My guess is that the problem is somehow caused by time zones and daylight saving time, but I would expect duplicated points when an hour is "gained" in October rather than in June. Any advice would be greatly appreciated.

df$Time[which(duplicated(df$Time))]
[1] 43253.50 43253.54

class(df$Time)
[1] "numeric"


df$Time[which(duplicated(df$Time))]
[1] "2018-06-02 12:00:00 UTC" "2018-06-02 13:00:00 UTC"

class(df$Time)
[1] "POSIXct" "POSIXt"

unclass(df$Time[which(duplicated(df$Time))])
[1] 1527940800 1527944400
attr(,"tzone")
[1] "UTC"
2
I believe duplicated is telling you that both of those times appear twice. - SmokeyShakers
Would using lubridate::as_datetime on the times first help here? - gabesolomon10
@SmokeyShakers you're correct, at least partially. One value appears twice before conversion to POSIXct, but the other value does not exist: `which(df$Time==43253.50)` returns `[1] 3637 3661`, while `which(df$Time==43253.54)` returns `integer(0)`. - Texanum
Careful with your floating points - SmokeyShakers
Try which(df$Time %in% df$Time[which(duplicated(df$Time))]) - SmokeyShakers
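
The floating-point warning above can be illustrated with a small sketch. It assumes the numeric times are Excel serial date-times (days plus a fraction of a day), so 13:00 is stored as 43253 + 13/24 = 43253.541666..., which R prints rounded as 43253.54. The vector `x` is made-up illustration data, not the asker's `df$Time`.

```r
# Hypothetical values: Excel serial date-times for 2018-06-02 12:00, 13:00, 13:00
x <- c(43253 + 12/24, 43253 + 13/24, 43253 + 13/24)

# Default printing uses 7 significant digits, so the stored value is rounded
print(x)
print(x, digits = 15)

# Comparing against the rounded printout finds nothing
exact_hits <- which(x == 43253.54)            # integer(0)

# Comparing against the stored values themselves finds both occurrences
stored_hits <- which(x %in% x[duplicated(x)]) # 2 3

# A tolerance-based comparison is another option when only a printout is available
near_hits <- which(abs(x - 43253.54) < 0.005) # 2 3
```

This would explain why `which(df$Time==43253.54)` returns `integer(0)` even though `duplicated()` reports the value: the literal typed from the printout is not the number that is stored.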

2 Answers

1
votes

duplicated "returns a logical vector indicating which elements (rows) are duplicates". That is, the first occurrence (not yet a duplicate) returns FALSE.

So each of those two times appears twice in df$Time: the first occurrence is not flagged, and the second one is.
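
As a minimal sketch of this behaviour, using a made-up vector `x` rather than the asker's data: `duplicated()` skips the first occurrence, and combining it with `fromLast = TRUE` flags every member of a duplicated group.

```r
# Hypothetical example vector
x <- c(10, 20, 10, 30, 20)

# Only the later occurrences are flagged
later <- duplicated(x)                 # FALSE FALSE TRUE FALSE TRUE

# Scanning from both ends flags every occurrence of a repeated value
dup_all <- duplicated(x) | duplicated(x, fromLast = TRUE)
all_positions <- which(dup_all)        # 1 2 3 5
```

Applied to the question, `which(duplicated(df$Time) | duplicated(df$Time, fromLast = TRUE))` should list the row numbers of both copies of each repeated time.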

0
votes

duplicated() flags every occurrence after the first, so a value that appears twice is returned once and a value that appears three times is returned twice:

df <- data.frame(Time = c("Mon", "Mon", "Tue", "Wed", "Wed", "Wed", "Thu", "Fr"))

df$Time[duplicated(df$Time)]

# [1] "Mon" "Wed" "Wed"

You could alternatively use

janitor::get_dupes(df, Time)

#   Time dupe_count
# 1  Mon          2
# 2  Mon          2
# 3  Wed          3
# 4  Wed          3
# 5  Wed          3
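
If adding a package is not desirable, a similar listing can be sketched in base R. The data frame below is a made-up stand-in for the asker's data, and the conversion assumes the numeric times are Excel serial date-times in the 1900 date system (where 25569 is the serial number of 1970-01-01); the `Temp` column is purely illustrative.

```r
# Hypothetical stand-in for the XLSX import: hourly serial date-times with repeats
df <- data.frame(Time = 43253 + c(11, 12, 13, 12, 13, 14) / 24,
                 Temp = c(20.1, 21.3, 22.0, 21.4, 22.1, 22.5))

# Convert days since 1899-12-30 to seconds since 1970-01-01,
# rounding to whole seconds to remove floating-point noise
secs <- round((df$Time - 25569) * 86400)
df$Time <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

# Count how many rows share each timestamp
df$dupe_count <- ave(rep(1L, nrow(df)), as.numeric(df$Time), FUN = sum)

# Keep every row whose timestamp occurs more than once, grouped by time
dupes <- df[df$dupe_count > 1, ]
dupes <- dupes[order(dupes$Time), ]
print(dupes)
```

Seeing the duplicated rows side by side, including their other columns, makes it easier to decide whether they are true repeats or two different readings that were given the same timestamp.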