4
votes

I am trying to import an Excel spreadsheet in to R (via read.xlsx2()). The Excel data has a date column. That date column contains mixed types of date formats e.g. some rows are 42669, and some are in date format e.g. 26/10/2016.

read.xlsx2() reads it in as a factor, so I converted it to as.Date using the code below. This works for all the dates in numeric form (e.g. 42669) but R warns me that it added some NAs (for the ones in format 26/10/2016). My question is how can I import the excel data with proper dates for all the variable i.e. tell R that there is mixed data?

library(xlsx)
#Import excel file
df <- read.xlsx2(mydata, 1, header=true)

#Output = recd_date : Factor w/ 590 levels "", "26/10/2016", "42669" ...
levels(df$recd_date)

#Output = [1] "" "26/10/2016" "42669" ...

#This works for numeric dates: 

df$recd_date <- as.Date( as.numeric (as.character(df$recd_date) ),origin="1899-12-30")

#Output = recd_date : Date, format "2016-10-26" ...

#but it doesn't work for dd/mm/yyyy dates, R just replaces these with NA
3
Wouldn't it be easier to change all dates do the same format inside excel?Fino
I tried this first, but for some reason some fields just stayed in the wrong format and I have many date columns to go throughpurplealpha

3 Answers

2
votes

Try convert_to_date from the janitor package, specifying the character-to-date function from the lubridate package that matches your date format:

library(janitor)
x <- c("26/10/2016", "42669")
convert_to_date(x, character_fun = lubridate::dmy)
#> [1] "2016-10-26" "2016-10-26"

Self-promotion disclaimer: I maintain this package. I'm adding this answer as this function was created to address this exact problem of a mix of Excel date numbers and formatted dates in the same variable.

1
votes

We could apply a function to clean date if necessary, basically like this:

cleanDate <- function(x) {
  if (all(nchar(df2$date.mix) < 10)) {
    cd <- as.Date(x)
  } else {
    cd <- do.call(c, 
                  lapply(x, function(i)
                    if (nchar(i) < 10)  
                      as.Date(as.numeric(i), origin="1970-01-01")
                    else as.Date(i)))
  }
  return(cd)
}

Example

# generate test df
df1 <- data.frame(date.chr=as.character(as.Date(1:3, origin=Sys.Date())), 
                  date.num=as.numeric(as.Date(1:3, origin=Sys.Date())),
                  date.mix=as.character(as.Date(1:3, origin=Sys.Date())),
                  stringsAsFactors=FALSE)

df1[2, 3] <- as.character(as.numeric(as.Date(df1[2, 1])))
> df1
    date.chr date.num   date.mix
1 2019-02-01    17928 2019-02-01
2 2019-02-02    17929      17929
3 2019-02-03    17930 2019-02-03

# write it to working directory
library(xlsx)
write.xlsx2(df1, "df1.xlsx")

# read it
# we use opt. `stringsAsFactors=FALSE` to prevent generation of factors
df2 <- read.xlsx2("df1.xlsx", 1, stringsAsFactors=FALSE)
> df2
  X.   date.chr date.num   date.mix
1  1 2019-02-01    17928 2019-02-01
2  2 2019-02-02    17929      17929
3  3 2019-02-03    17930 2019-02-03

Now we apply the function using lapply().

date.cols <- c("date.chr", "date.num", "date.mix")  # select date columns
df2[date.cols] <- lapply(df2[date.cols], cleanDate)

Result

> df2
  X.   date.chr   date.num   date.mix
1  1 2019-02-01 2019-02-01 2019-02-01
2  2 2019-02-02 2019-02-02 2019-02-02
3  3 2019-02-03 2019-02-03 2019-02-03
0
votes

Here is a way to do this,

Once we read in the data we convert the date columns (df$recd_date) to class character and then create two lists, one with the dd/mm/YYYY dates, and the other with the numeric dates. Once that is done we independently convert to date class, and then merge the two to get a final product.

#Test Data, read in anyway you want
data<-c("26/10/2016","27/10/2016","42669","52673","28/10/2016")
Index<-c(1:5)
df<-data.frame(Index, date=data)

#Put entire date column into character format
df$date<-as.character(df$date)

#Create Date from Numeric Date, Create Date from Character Date
Date_N<-as.Date(as.numeric(df$date),origin="1899-12-30")
Date_C<-as.Date(as.character(df$date),format="%d/%m/%Y")

#Create DF from list
Date_N_df<-as.data.frame(Date_N)
Date_C_df<-as.data.frame(Date_C)

#Replace NA from Date_C_df with index from Date_N_df
Date_C_df[is.na(Date_C_df)] <- Date_N_df[is.na(Date_C_df)]
Final<-Date_C_df 
names(Final)<-"Date"

> Final
        Date
1 2016-10-26
2 2016-10-27
3 2016-10-26
4 2044-03-17
5 2016-10-28