2
votes

I want to check a data.table for all dates that have a given category, and set all those dates to TRUE in a separate data.table. I have a data.table that looks something like this:

library(data.table)
DT1 = fread(
'Date          Category
2010-01-01    A
2010-01-01    B
2010-01-02    A
2010-01-02    C
2010-01-02    D
2010-01-04    B
2010-01-04    "B OR D"')
DT1[, Date := as.IDate(Date) ]

and I'm trying to extract the values to a data frame that looks like this:

DT2 = fread('
Date       A     B     C     D
2010-01-01 FALSE FALSE FALSE FALSE
2010-01-02 FALSE FALSE FALSE FALSE
2010-01-03 FALSE FALSE FALSE FALSE
2010-01-04 FALSE FALSE FALSE FALSE
2010-01-05 FALSE FALSE FALSE FALSE')
DT2[, Date := as.IDate(Date) ]

If a particular category exists for a given date, I would like to set the cell with the row's date and the category's column to true. I know this will probably involve something like:

DT2 <- DT2[Date %in% DT1$Date, A := grep(something)]

I want this call to modify DT2 to look like this:

Date       A     B     C     D
2010-01-01 TRUE  FALSE FALSE FALSE
2010-01-02 TRUE  FALSE FALSE FALSE
2010-01-03 FALSE FALSE FALSE FALSE
2010-01-04 FALSE FALSE FALSE FALSE
2010-01-05 FALSE FALSE FALSE FALSE

Right now, my code uses any(grep()), which is a problem because it changes DT2$A[4] to TRUE, which I don't want. I would like to use a regular expression, since some of my category cells refer to multiple values. Because of the last cell, both my checks on the value B and the value D should set their dummy variables in 2010-01-04 to true. As the last 2 rows show, there may be one date which contains multiple references to a single category.

Is there a way to do this in data.table? I can use other packages if absolutely necessary.

Note that I can't just use dcast to transform D1 into the proper data.table, because it doesn't contain all the dates I need.

1
Is it the case that the dates you need are always from min(Date) to max(Date)?Frank
Good question! The dates actually go one past max(Date). I'll edit my question, but i know how to do it now.Pokey
Ok great. Fyi, usually if you edit a question after an answer has been posted, the etiquette here is to edit the answer to account for your change (if you know how) or to comment on the answer notifying them (assuming you can comment there.. I'm not sure). Anyway, the relevant edit seems to be max(Date) to max(Date) + 1L.Frank

1 Answers

2
votes

Here's one attempt created by looping over the possible categories (hat-tip to @Frank for adjustments):

DF1[
  .(seq(min(Date), max(Date) + 1L, by="day")),
  on=.(Date),
  lapply(sapply(c("A","B","C","D"), grepl, x=Category, USE.NAMES=TRUE, simplify=FALSE),any),
  by=.EACHI
]
#         Date     A     B     C     D
#1: 2010-01-01  TRUE  TRUE FALSE FALSE
#2: 2010-01-02  TRUE FALSE  TRUE  TRUE
#3: 2010-01-03 FALSE FALSE FALSE FALSE
#4: 2010-01-04 FALSE  TRUE FALSE  TRUE
#5: 2010-01-05 FALSE FALSE FALSE FALSE