4
votes

I'm using fwrite to write a csv in R.

I've noticed that the csv was written has different values in the POSIXct columns that seem to be off from the original values by one hour (though I'm unsure if it's consistently off by an hour or by random values).

For example, if I start with the following data (the datetimes are classed as POSIXct):
View(test)

Id | Date
---+--------------------------
01 | 2016-10-26 08:19:00
02 | 2016-10-26 09:05:00
03 | 2016-10-26 09:15:00
04 | 2016-10-26 09:15:00
05 | 2016-10-26 09:28:00

after
fwrite(test, file = "test.csv", quote = TRUE)

nexttest <- fread("M:\\test.csv", sep = ",", header = TRUE)
View(nexttest)
the return is

Id | Date
---+--------------------------
01 | 2016-10-26T07:19:00Z
02 | 2016-10-26T08:05:00Z
03 | 2016-10-26T08:15:00Z
04 | 2016-10-26T08:15:00Z
05 | 2016-10-26T08:28:00Z

etc etc
the outcome is the same using read.table. I presume I'm missing out an argument in the fwrite statement. Any ideas?

sessionInfo()    
R version 3.4.2 (2017-09-28)    
Platform: x86_64-w64-mingw32/x64 (64-bit)    
Running under: Windows 7 x64 (build 7601) Service Pack 1    

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United 
Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           

[5] LC_TIME=English_United Kingdom.1252    


dput(head(test))  
Date.Created = structure(c(1477919160, 1477915320, 1477936980, 
    1477919820, 1477917780, 1477909680), class = c("POSIXct", 
    "POSIXt"), tzone = "")    
2
done (only included the date column if that's ok) - Saleem Khan

2 Answers

3
votes

See the dateTimeAs argument of ?fwrite:

How Date/IDate, ITime and POSIXct items are written.

  • "ISO" (default) - 2016-09-12, 18:12:16 and 2016-09-12T18:12:16.999999Z. 0, 3 or 6 digits of fractional seconds are printed if and when present for convenience, regardless of any R options such as digits.secs. The idea being that if milli and microseconds are present then you most likely want to retain them. R's internal UTC representation is written faithfully to encourage ISO standards, stymie timezone ambiguity and for speed. An option to consider is to start R in the UTC timezone simply with "$ TZ='UTC' R" at the shell (NB: it must be one or more spaces between TZ='UTC' and R, anything else will be silently ignored; this TZ setting applies just to that R process) or Sys.setenv(TZ='UTC') at the R prompt and then continue as if UTC were local time.

[emphasis mine, see ISO 8061]

You'll want to be sure you control which timezone your POSIXct object is displayed in (see the tz argument of as.POSIXct) and proceed with caution.

Something about R that's come to bother me is that when tzone = '', implicitly your POSIXct objects have acquired your local timezone (for printing). To be fully explicit, you can examine your Sys.timezone() and add this to the tz argument when you declare the POSIXct object, then use attr(Date, 'tzone') = 'UTC' to convert it to UTC.

Compare the output of the following on your machine:

t = Sys.time()
t
attr(t, 'tzone') = 'UTC'
t
# [1] "2017-11-09 10:43:59 UTC"

So, if you do:

attr(test$Date, 'tzone') = 'UTC'
test$Date

You should see that this matches the output produced by fwrite.

You can examine the code for print.POSIXct and format.POSIXct and look at unclass(t) before and after setting attr(t, 'tzone') = 'UTC' to convince yourself that all of the use of timezones is superficial -- the internal representation of date-time objects in R is always "seconds since epoch", it's simply the print/format methods that apply timezone differences and affect the surface of the object that you see when printing.

1
votes

Have you tried to change the the argument dateTimeAs ?

See the help :

dateTimeAs
How Date/IDate, ITime and POSIXct items are written.

"ISO" (default) - 2016-09-12, 18:12:16 and 2016-09-12T18:12:16.999999Z. 0, 3 or 6 digits of fractional seconds are printed if and when present for convenience, regardless of any R options such as digits.secs. The idea being that if milli and microseconds are present then you most likely want to retain them. R's internal UTC representation is written faithfully to encourage ISO standards, stymie timezone ambiguity and for speed. An option to consider is to start R in the UTC timezone simply with "$ TZ='UTC' R" at the shell (NB: it must be one or more spaces between TZ='UTC' and R, anything else will be silently ignored; this TZ setting applies just to that R process) or Sys.setenv(TZ='UTC') at the R prompt and then continue as if UTC were local time.

"squash" - 20160912, 181216 and 20160912181216999. This option allows fast and simple extraction of yyyy, mm, dd and (most commonly to group by) yyyymm parts using integer div and mod operations. In R for example, one line helper functions could use %/%10000, %/%100%%100, %%100 and %/%100 respectively. POSIXct UTC is squashed to 17 digits (including 3 digits of milliseconds always, even if 000) which may be read comfortably as integer64 (automatically by fread()).

"epoch" - 17056, 65536 and 1473703936.999999. The underlying number of days or seconds since the relevant epoch (1970-01-01, 00:00:00 and 1970-01-01T00:00:00Z respectively), negative before that (see ?Date). 0, 3 or 6 digits of fractional seconds are printed if and when present.

"write.csv" - this currently affects POSIXct only. It is written as write.csv does by using the as.character method which heeds digits.secs and converts from R's internal UTC representation back to local time (or the "tzone" attribute) as of that historical date. Accordingly this can be slow. All other column types (including Date, IDate and ITime which are independent of timezone) are written as the "ISO" option using fast C code which is already consistent with write.csv.

The first three options are fast due to new specialized C code. The epoch to date-part conversion uses a fast approach by Howard Hinnant (see references) using a day-of-year starting on 1 March. You should not be able to notice any difference in write speed between those three options. The date range supported for Date and IDate is [0000-03-01, 9999-12-31]. Every one of these 3,652,365 dates have been tested and compared to base R including all 2,790 leap days in this range.

This option applies to vectors of date/time in list column cells, too.

A fully flexible format string (such as "%m/%d/%Y") is not supported. This is to encourage use of ISO standards and because that flexibility is not known how to make fast at C level. We may be able to support one or two more specific options if required.