2
votes

I'm getting a warning when I try to parse vector a, which contains many dates.

Here's the text file for vector a, which I made using write(a,"a.txt"). As it's pretty big, I've uploaded it to Google Drive for anyone to download. Basically, it contains the dates from 2012-01-01 to 2012-12-31, repeated many times.

https://drive.google.com/file/d/0B12dCpdCVHeSZjA4YmVXNmV6VUU/edit?usp=sharing

I tried to parse it with lubridate's ymd() and received a warning message.

> library(lubridate)
> head(ymd(a))
[1] "2012-01-01 UTC" "2012-01-01 UTC" "2012-01-01 UTC" "2012-01-01 UTC" "2012-01-01 UTC"
[6] "2012-01-01 UTC"
Warning message:
 7202 failed to parse. 

Looking at the warning message, it would be easy to assume that the date format is wrong. However, YYYY-MM-DD is a format that lubridate supports. When I do the same on only part of the vector, no warning is raised.

> head(ymd(a[1:50000]))
[1] "2012-01-01 UTC" "2012-01-01 UTC" "2012-01-01 UTC" "2012-01-01 UTC" "2012-01-01 UTC"
[6] "2012-01-01 UTC"

Using strptime and as.Date doesn't produce any error or warning either:

> head(strptime(a,format="%Y-%m-%d"))
[1] "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01"
> head(as.Date(a))
[1] "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01"

My question is, do I need to worry about the warning message or can I safely ignore it?

2
I am getting values like "2012-03-01(1)" from row number 122241 onwards. I may be wrong, but that may be why you are getting the warning. - PKumar
Ah, that's it. Thank you! - Wet Feet

2 Answers

2
votes
> b <- which( is.na( ymd(as.character(a[[1]]), tz="UTC") ) )
Warning message:
 7202 failed to parse. 
> head(b)
[1] 122241 122242 122243 122244 122245 122246
> as.character(a[[1]])[head(b)]
[1] "2012-03-01(1)" "2012-03-01(1)" "2012-03-01(1)" "2012-03-01(1)" "2012-03-01(1)"
[6] "2012-03-01(1)"
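
This also suggests why strptime and as.Date stayed silent in the question: with an explicit format they stop reading once the format is satisfied and ignore trailing characters, so "2012-03-01(1)" still yields a date. A minimal sketch on toy data (the vector x below is made up for illustration, and it assumes valid entries are exactly YYYY-MM-DD) that flags such entries with base R only:

```r
# Toy stand-in for the real data (hypothetical values, not taken from a.txt)
x <- c("2012-03-01", "2012-03-01(1)", "2012-03-02")

# as.Date with an explicit format ignores the trailing "(1)", so no NA appears
silent <- as.Date(x, format = "%Y-%m-%d")

# A strict pattern check flags anything that is not exactly YYYY-MM-DD
malformed <- !grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)

which(malformed)  # position(s) of the suffixed entries
```

A strict check like this is independent of which parser is used afterwards.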

Looks like the anomalous dates occur in runs of consecutive rows, separated by large gaps:

> rle( diff(b) )
Run Length Encoding
  lengths: int [1:15] 1183 1 831 1 928 1 605 1 639 1 ...
  values : int [1:15] 1 111657 1 29857 1 26065 1 25111 1 65729 ...
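
To read that output: diff(b) gives the distance between successive failing row numbers, and rle() collapses repeats, so a value of 1 with a large length is a block of consecutive bad rows, and each large value with length 1 is the jump to the next block. A small sketch on made-up indices (idx is hypothetical, not the real b) shows the same pattern:

```r
# Hypothetical failing row numbers: three blocks of consecutive rows
idx <- c(5, 6, 7, 20, 21, 50)

gaps <- diff(idx)   # 1 1 13 1 29
runs <- rle(gaps)
runs$lengths        # 2 1 1 1
runs$values         # 1 13 1 29

# First row number of each block: a block starts wherever the gap is not 1
block_starts <- idx[c(TRUE, gaps != 1)]
block_starts        # 5 20 50
```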

0
votes

A different solution, using anydate() from the anytime package, which seems to work.

library(anytime)
text <- anydate(a[, 1])
sum(is.na(text)) 

# 0
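
A zero NA count does not by itself show how the "(1)" entries were interpreted, so it may be safer to strip the suffix explicitly before parsing. A minimal base R sketch, assuming the only anomaly is a trailing parenthesised number (the toy vector dates_raw is hypothetical):

```r
# Hypothetical sample containing the kind of suffix reported in the comments
dates_raw <- c("2012-03-01", "2012-03-01(1)", "2012-12-31(12)")

# Remove a trailing "(digits)" suffix; untouched strings pass through unchanged
dates_clean <- sub("\\([0-9]+\\)$", "", dates_raw)

# Parse with an explicit format and confirm nothing was lost
dates <- as.Date(dates_clean, format = "%Y-%m-%d")
sum(is.na(dates))  # 0
```

Whether the suffixed rows should be kept at all depends on what the "(1)" meant in the original data, which the thread does not say.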