4
votes

Sometimes I am given data sets that has two different date formats but common variables that have to been joined into one dataframe. Over the years, I've tried various solutions to get around this workflow hassle. Now that I've been using lubridate, it seems like many of these problems are easily solved. However, I am encountering some behaviour that seems weird to me though I imagine there is a good explanation that is beyond me. Say I am given a data set with different date formats that I join into one data frame. This dataframe looks like this:

library(ludridate)
library(dplyr)

df<-data.frame(Lab=c("A","B"),DATE=c("12/15/15","12/15/2013")); df

I want to convert this data to a date format with lubridate. However the following does not format consistently:

df %>% 
  mutate(mdy(DATE))

...but rather creates a 0015 date. If I filter just for Lab "A":

df %>% 
  filter(Lab=="A") %>%
  mutate(mdy(DATE))

... or even group_by Lab:

df %>% 
  group_by(Lab) %>%
  mutate(mdy(DATE))

Then I get the desired year format. Is this the correct behaviour of the lubridate family of date formatting functions? Is there a better way to accomplish what I am doing? I am sure that multiple date formats in one column is a relatively common (and annoying) occurence.

Thanks in advance.

2
This SO question might be helpful.eipi10
I think my question is unique because the same lubridate command applies to both scenarios - the two digit year and the four digit year.boshek

2 Answers

2
votes

parse_date_time of lubridate package can help format multiple date formats in one go.

Syntax:

df$date = parse_date_time(df$date, c(format1, format2, format3))

You need to specify all the possible format types.

Since lubridate has some difficulty understanding (correctly) some format types, you need to make custom format.

In the help section , you will find the below illustration. You can recreate it to suit your requirement.

## ** how to use `select_formats` argument **
## By default %Y has precedence:
parse_date_time(c("27-09-13", "27-09-2013"), "dmy")
## [1] "13-09-27 UTC"   "2013-09-27 UTC"

## to give priority to %y format, define your own select_format function:

my_select <-   function(trained){
   n_fmts <- nchar(gsub("[^%]", "", names(trained))) + grepl("%y", names(trained))*1.5
   names(trained[ which.max(n_fmts) ])
}

parse_date_time(c("27-09-13", "27-09-2013"), "dmy", select_formats = my_select)
## '[1] "2013-09-27 UTC" "2013-09-27 UTC"
1
votes

From the help on parse_date_time:

## ** how to use select_formats **
## By default %Y has precedence:
parse_date_time(c("27-09-13", "27-09-2013"), "dmy")
## [1] "13-09-27 UTC"   "2013-09-27 UTC"

## to give priority to %y format, define your own select_format function:

my_select <-   function(trained){
  n_fmts <- nchar(gsub("[^%]", "", names(trained))) + grepl("%y",     names(trained))*1.5
  names(trained[ which.max(n_fmts) ])
}

parse_date_time(c("27-09-13", "27-09-2013"), "dmy", select_formats = my_select)
## '[1] "2013-09-27 UTC" "2013-09-27 UTC"