2
votes

I have a data.table containing two date variables. The data set was read into R from a .csv file (was originally an .xlsx file) as a data.frame and the two variables then converted to date format using as.Date() so that they display as below:

df
  id   specdate    recdate
1  1 2014-08-12 2014-08-17
2  2 2014-08-15 2014-08-20
3  3 2014-08-21 2014-08-26
4  4       <NA> 2014-08-28
5  5 2014-08-25 2014-08-30
6  6       <NA>       <NA>

I then converted the data.frame to a data.table:

df <- data.table(df)

I then wanted to create a third variable, that would include "specdate" if present, but replace it with "recdate" if "specdate" was missing (NA). This is where I'm having some difficulty, as it seems that no matter how I approach this, data.table displays dates in date format only if a complete variable that is already in date format is copied. Otherwise, individual values are displayed as a number (even when using as.IDate) and I gather that an origin date is needed to correct this. Is there any way to avoid supplying an origin date but display the dates as dates in data.table?

Below is my attempt to fill the NAs of specdate with the recdate dates:

# Function to fill NAs:
fillnas <- function(dataref, lookupref, nacol, replacecol, replacelist=NULL) {
      nacol <- as.character(nacol)
      if(!is.null(replacelist)) nacol <- factor(ifelse(dataref==lookupref & (is.na(nacol) | nacol %in% replacelist), replacecol, nacol))
      else nacol <- factor(ifelse(dataref==lookupref & is.na(nacol), replacecol, nacol))
      nacol                
    }

# Fill the NAs in specdate with the function:
    df[, finaldate := fillnas(dataref=id, lookupref=id, nacol=specdate, replacecol=as.IDate(recdate, format="%Y-%m-%d"))]

Here is what happens:

    > df
   id   specdate    recdate  finaldate
1:  1 2014-08-12 2014-08-17 2014-08-12
2:  2 2014-08-15 2014-08-20 2014-08-15
3:  3 2014-08-21 2014-08-26 2014-08-21
4:  4       <NA> 2014-08-28      16310
5:  5 2014-08-25 2014-08-30 2014-08-25
6:  6       <NA>       <NA>         NA

The display problem is compounded if I create the new variable from scratch by using ifelse:

df[, finaldate := ifelse(!is.na(specdate), specdate, recdate)]

This gives:

> df
   id   specdate    recdate finaldate
1:  1 2014-08-12 2014-08-17     16294
2:  2 2014-08-15 2014-08-20     16297
3:  3 2014-08-21 2014-08-26     16303
4:  4       <NA> 2014-08-28     16310
5:  5 2014-08-25 2014-08-30     16307
6:  6       <NA>       <NA>        NA

Alternately if I try a find-and-replace type approach, I get an error about the number of items to replace not matching the replacement length (I'm guessing this is because that approach is not vectorised?), the values from recdate are recycled and end up in the wrong place:

> df$finaldate <- df$specdate
> df$finaldate[is.na(df$specdate)] <- df$recdate
Warning message:
In NextMethod(.Generic) :
  number of items to replace is not a multiple of replacement length
> df
   id   specdate    recdate  finaldate
1:  1 2014-08-12 2014-08-17 2014-08-12
2:  2 2014-08-15 2014-08-20 2014-08-15
3:  3 2014-08-21 2014-08-26 2014-08-21
4:  4       <NA> 2014-08-28 2014-08-17
5:  5 2014-08-25 2014-08-30 2014-08-25
6:  6       <NA>       <NA> 2014-08-20

So in conclusion - the function I applied gets me closest to what I want, except that where NAs have been replaced, the replacement value is displayed as a number and not in date format. Once displayed as a number, the origin is required to again display it as a date (and I would like to avoid supplying the origin since I usually don't know it and it seems unnecessarily repetitive to have to supply it when the date was originally in the correct format).

Any insights as to where I'm going wrong would be much appreciated.

2
ifelse strips attributeseddi

2 Answers

3
votes

I'd approach it like this, maybe :

DT <- data.table(df)
DT[, finaldate := specdata]
DT[is.na(specdata), finaldate := recdate]

It seems you want to add a new column so you can can retain the original columns as well. I do that as well a lot. Sometimes, I just update in place :

DT <- data.table(df)
DT[!is.na(specdate), specdate:=recdate]
setnames(DT, "specdate", "finaldate")

Using i like that avoids creating a whole new column's worth of data which might be very large. Depends on how important retaining the original columns is to you and how many of them there are and your data size. (Note that a whole column's worth of data is still created by the is.na() call and then again by ! but at least there isn't a third column's worth for the new finaldate. Would be great to optimize i=!is.na() in future (#1386) and if you use data.table this way now you won't need to change your code in future to benefit.)

It seems that you might have various "NA" strings that you're replacing. Note that fread in v1.9.6 on CRAN has a fix for that. From README :

  • correctly handles na.strings argument for all types of columns - it detect possible NA values without coercion to character, like in base read.table. fixes #504. Thanks to @dselivanov for the PR. Also closes #1314, which closes this issue completely, i.e., na.strings = c("-999", "FALSE") etc. also work.

Btw, you've made one of the top 3 mistakes mentioned here : https://github.com/Rdatatable/data.table/wiki/Support

1
votes

Works for me. You may want to test to be sure that your NA values are not strings or factors "<NA>"; they will look like real NA values:

dt[, finaldate := ifelse(is.na(specdate), recdate, specdate)][
  ,finaldate := as.POSIXct(finaldate*86400, origin="1970-01-01", tz="UTC")]
#    id   specdate    recdate  finaldate
# 1:  1 2014-08-12 2014-08-17 2014-08-12
# 2:  2 2014-08-15 2014-08-20 2014-08-15
# 3:  3 2014-08-21 2014-08-26 2014-08-21
# 4:  4         NA 2014-08-28 2014-08-28
# 5:  5 2014-08-25 2014-08-30 2014-08-25
# 6:  6         NA         NA         NA

Data

df <- read.table(text="  id   specdate    recdate
1  1 2014-08-12 2014-08-17
2  2 2014-08-15 2014-08-20
3  3 2014-08-21 2014-08-26
4  4         NA 2014-08-28
5  5 2014-08-25 2014-08-30
6  6         NA         NA", header=T, stringsAsFactors=F)

dt <- as.data.table(df)