1
votes

Subject

I have two (simplified) datasets:

  • A dataset of 500 observations of some.value every hour (date.time variable as POSIXct)
  • A dataset of 10 daily temperatures (date variable as Date)

The objective is to add the temperature of the second dataset as a new variable to the first dataset where the variable date.time corresponds to the date variable.

I tried a data.table solution using setkey() and roll="nearest" according to : R – How to join two data frames by nearest time-date?

Unfortunately the temperature that gets merged is always the same value for the entire merged dataset.

A simplified example

Here is the exemple code that illustrates my problem and my solution attempt:

Setting random seed

set.seed(10)

Generating the two datasets

observations <- data.frame(date.time = seq(from=ymd_hms("2017-02-01 00:00:00"), length.out=500, by=60*60), some.value = runif(500,0.0,1.0))
daily.temperature <- data.frame(date = seq(from=as.Date("2017-02-01"), length.out = 10, by=1), temperature = runif(10,10,40))

Solution attempt using data.tables and roll="nearest"

# converting dataframes to datatables
library(data.table)
observations <- as.data.table(observations)
daily.temperature <- as.data.table(daily.temperature)

# setting the keys of the two datasets
setkey(observations,date.time)
setkey(daily.temperature,date)

# Combinding the datasets
combined <- daily.temperature[observations, roll = "nearest" ]
combined

Note that the temperature variable in the combined dataset is always the same regardless of date.

Notes regading the unsimplified (real) problem:

  • In my real problem the observations are recorded every minute instead of every hour.
  • In my real problem the daily.temperature dataset does not cover the entire range of observations. In that case, adding 'NA' or nothing at all as the temperature would be fine.
1

1 Answers

0
votes

Do you want something like this?

    set.seed(10)
library(dplyr)
    observations <- data.frame(date.time = seq(from=ymd_hms("2017-02-01 00:00:00"), length.out=500, by=60*60), some.value = runif(500,0.0,1.0))
    daily.temperature <- data.frame(date = seq(from=as.Date("2017-02-01"), length.out = 10, by=1), temperature = runif(10,10,40))
    observations$date<-as.Date(observations$date.time)
    combined<-left_join(observations,daily.temperature,by="date")
> head(combined)
            date.time some.value       date temperature
1 2017-02-01 00:00:00  0.8561467 2017-02-01    38.64702
2 2017-02-01 01:00:00  0.7820957 2017-02-01    38.64702
3 2017-02-01 02:00:00  0.2443390 2017-02-01    38.64702
4 2017-02-01 03:00:00  0.3138552 2017-02-01    38.64702
5 2017-02-01 04:00:00  0.1284753 2017-02-01    38.64702
6 2017-02-01 05:00:00  0.9299472 2017-02-01    38.64702