
I am pulling data from a SQL Server database where each row uses an ISO timestamp as its key. In the database, the timestamps are stored using the datetime2 data type.

When I run the following statement:

data <- dbGetQuery(con, "SELECT timestamp from table1")

it returns a data frame where the timestamp column is a chr type. Here is what str() returns on the data frame:

$ timestamp: chr  "2020-03-25 12:19:48.0000000" "2020-03-25 12:20:48.0000000"...

With this I can convert it back to a POSIXct object using the following code:

data$timestamp <- as.POSIXct(data$timestamp, format = '%Y-%m-%d %H:%M:%S')

However, the database is storing values every minute (and occasionally every second), so if I try to convert a few months or years worth of timestamps using the as.POSIXct() function, it can take a long time (in the range of minutes). The dbGetQuery function is relatively fast when pulling that much data (in the range of a few seconds).

Is there a parameter in dbGetQuery, or another method, that will interpret the timestamps directly as datetimes rather than strings?

Here is my connection command:

con <- dbConnect(odbc(),
                 Driver = "SQL Server",
                 Server = "server1",
                 Database = "db1",
                 UID = "user", 
                 PWD = "pwd")
If the database uses DATETIME types, then normally odbc recognizes them and will give you POSIXt. Unfortunately, I discourage DATETIME since it does not include a TZ, preferring DATETIMEOFFSET; odbc does not recognize that type, however, and will return it as character, so you are forced to process it post-fetch. – r2evans
For instance, DBI::dbGetQuery(con, "select cast ( SYSDATETIMEOFFSET() as datetime ) as now") returns POSIXt, whereas DBI::dbGetQuery(con, "select cast ( SYSDATETIMEOFFSET() as datetimeoffset ) as now") does not. – r2evans
I am using the odbc library, though when I tried the same code using DBI, I got the same results. Additionally, there is a second timestamp column that stores UTC timestamps; selecting from that column yields the same result, though. – setty
BTW: with decimal seconds, you might need %OS instead of %S, as in "%Y-%m-%d %H:%M:%OS". There are several packages that deal with converting timestamps to POSIXt; some claim ease of use, some claim speed. I can't vouch for them, but you can certainly check them out (using microbenchmark to confirm speed at scale), including lubridate and anytime. – r2evans
Unless I'm mistaken, dbGetQuery means you're using DBI and odbc. You should never need to do library(odbc); using DBI::dbConnect(odbc::odbc(), ...) usually suffices, and for the most part connection-specific drivers (including odbc) augment DBI's methods for most of the functions. So I'm not at all surprised that using DBI gave you the same results; it's likely using the same functions internally. – r2evans
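To illustrate the %OS point from the comments, here is a minimal base-R sketch (the timestamp values are invented for illustration, matching the format the query returns):

```r
x <- c("2020-03-25 12:19:48.0000000", "2020-03-25 12:20:48.5000000")

# %S would drop the decimal part; %OS also consumes the fractional seconds
ts <- as.POSIXct(x, format = "%Y-%m-%d %H:%M:%OS")

# print with 3 decimal places to show the fraction survived
format(ts, "%Y-%m-%d %H:%M:%OS3")
```

The default print method hides fractional seconds, so formatting with %OS3 (or setting options(digits.secs = 3)) is the easiest way to confirm they were parsed.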

1 Answer


TL;DR

(updated a little from my comment)

DBI::dbGetQuery(con, "select cast ( SYSDATETIMEOFFSET() at time zone 'UTC' as DATETIME ) as now")
#                       now
# 1 2020-03-25 20:30:33.026
Sys.time()
# [1] "2020-03-25 13:30:31.177 PDT"

(my laptop and the remote sql server are not synced)

Explanation

The odbc driver (via the nanodbc C++ library) will recognize data of SQL Server's DATETIME type. However, that type does not include a time zone, so down-converting DATETIMEOFFSET data to it can introduce error if two rows do not reference the same TZ.

DBI::dbExecute(con, "create table r2mt (id INTEGER, tm DATETIMEOFFSET)")
# [1] 0
DBI::dbExecute(con, "insert into r2mt (id,tm) values (1,'2020-03-23 12:34:56 +00:00'),(2,'2020-03-23 12:34:56.100 -04:00')")
# [1] 2

dat <- DBI::dbGetQuery(con, "select id, tm from r2mt")
str(dat)
# 'data.frame': 2 obs. of  2 variables:
#  $ id: int  1 2
#  $ tm: chr  "2020-03-23 12:34:56.0000000 +00:00" "2020-03-23 12:34:56.5000000 -04:00"
as.POSIXct(gsub("([-+]?[0-9]{2}):([0-9]{2})$", "\\1\\2", dat$tm),
           format = "%Y-%m-%d %H:%M:%OS %z")
# [1] "2020-03-23 05:34:56.0 PDT" "2020-03-23 09:34:56.5 PDT"
diff( as.POSIXct(gsub("([-+]?[0-9]{2}):([0-9]{2})$", "\\1\\2", dat$tm),
      format = "%Y-%m-%d %H:%M:%OS %z") )
# Time difference of 4.000139 hours

dat <- DBI::dbGetQuery(con, "select id, cast(tm as DATETIME) as tm from r2mt")
str(dat)
# 'data.frame': 2 obs. of  2 variables:
#  $ id: int  1 2
#  $ tm: POSIXct, format: "2020-03-23 12:34:56.0" "2020-03-23 12:34:56.5"
diff(dat$tm)
# Time difference of 0.5 secs

(In R, the time zone is an attribute of the vector, the whole column, so will not vary between different elements in that column.)
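A quick base-R sketch of that point: the tzone attribute belongs to the whole vector, and changing it only changes how the same instants are displayed.

```r
ts <- as.POSIXct(c("2020-03-23 12:34:56", "2020-03-23 16:34:56"), tz = "UTC")
attr(ts, "tzone")
# [1] "UTC"

# reassigning the attribute changes the display, not the underlying instants
attr(ts, "tzone") <- "America/Los_Angeles"
ts
diff(ts)   # still 4 hours apart
```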

Since you're trying to do as much in SQL as possible (a good idea), when you cast to DATETIME, make sure you force a common time zone for all rows so that at least all the times are comparable.

dat <- DBI::dbGetQuery(con, "select id, cast(tm at time zone 'UTC' as DATETIME) as tm from r2mt")
str(dat)
# 'data.frame': 2 obs. of  2 variables:
#  $ id: int  1 2
#  $ tm: POSIXct, format: "2020-03-23 12:34:56.0" "2020-03-23 16:34:56.5"

dat <- DBI::dbGetQuery(con, "select id, cast(tm at time zone 'Central European Standard Time' as datetime) as tm from r2mt")
str(dat)
# 'data.frame': 2 obs. of  2 variables:
#  $ id: int  1 2
#  $ tm: POSIXct, format: "2020-03-23 13:34:56.0" "2020-03-23 17:34:56.5"

(Unfortunately, the time zone names SQL Server accepts are not the same as R's Olson names. I tend to prefer 'UTC' for its lack of ambiguity, but that's up to you.)
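To see that naming mismatch from R's side, a small sketch (the exact contents of OlsonNames() vary slightly by system):

```r
# R expects Olson/IANA names...
"UTC" %in% OlsonNames()
"Europe/Berlin" %in% OlsonNames()

# ...while SQL Server uses Windows-style names, which R does not accept
"Central European Standard Time" %in% OlsonNames()
```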