10
votes

My data

I have a data.table DT with the current (F0YR) and the next (F1YR) fiscal year-end (FYE) encoded as integers. Since every next FYE will eventually become a current FYE, the integer will be both in the column F1YR and F0YR. Also, my data contains monthly observations so the same FYE will be in the data set multiple times:

library(data.table)
DT <- data.table(ID     = rep(c("A", "B"), each=9),
                 MONTH  = rep(100L:108L, times=2),
                 F0YR   = rep(c(1L, 4L, 7L), each=3, times=2),
                 F1YR   = rep(c(4L, 7L, 9L), each=3, times=2),
                 value  = c(rep(1:5, each=3), 6, 6, 7),
                 key    = "ID,F0YR")
DT
      ID MONTH F0YR F1YR value
 [1,]  A   100    1    4     1
 [2,]  A   101    1    4     1
 [3,]  A   102    1    4     1
 [4,]  A   103    4    7     2
 [5,]  A   104    4    7     2
 [6,]  A   105    4    7     2
 [7,]  A   106    7    9     3
 [8,]  A   107    7    9     3
 [9,]  A   108    7    9     3
[10,]  B   100    1    4     4
[11,]  B   101    1    4     4
...

What I want to do

For every ID and F1YR combination, I want to get the value for the ID and F0YR combination. As an example: Company A had a value of 2 for FOYR==4. Now, I want an additional column for all combinations with ID=="A" and F1YR==4 which is set to 2, next to the already existent value of 1.

What I tried

intDT <- DT[CJ(unique(ID), unique(F0YR)), list(ID, F0YR, valueNew = value), mult="last"]
setkey(intDT, ID, F0YR)
setkey(DT, ID, F1YR)
DT <- intDT[DT]
setnames(DT, c("F0YR.1", "F0YR"), c("F0YR", "F1YR"))
DT
      ID F1YR valueNew MONTH F0YR value
 [1,]  A    4        2   100    1     1
 [2,]  A    4        2   101    1     1
 [3,]  A    4        2   102    1     1
 [4,]  A    7        3   103    4     2
 [5,]  A    7        3   104    4     2
 [6,]  A    7        3   105    4     2
 [7,]  A    9       NA   106    7     3
 [8,]  A    9       NA   107    7     3
 [9,]  A    9       NA   108    7     3
[10,]  B    4        5   100    1     4
[11,]  B    4        5   101    1     4
...

(Note that I use mult="last" here because, although the values should only change with F0YR or F1YR changes, sometimes they don't and this is just my tie breaker).

What I want

This looks improvable. First of all, I have to make a copy of my DT. Second, since I join basically the same data.table, all the column names have the same name and I have to rename them. I thought that a self join would be the way forward, but I tried and tried and couldn't get a nice solution. I have the hope that there is something easy out there which I just don't see...Does anyone have a clue? Or is my data set up in such a way that it is actually hard (maybe because I have monthly observations, but want to join only quarterly or yearly changing values).

1
I don't think it's needed for this question but := by group is working now in 1.8.1, and could be used for this perhaps. R-Forge is building ok and the binary installs ok (in R 2.15.0 since R-Forge only buids for latest) using install.packages("data.table", repos="http://R-Forge.R-project.org"). Check that test.data.table() returns "653 tests completed ok" to be sure it's the latest version. See latest NEWS to see if any new features are useful for this one.Matt Dowle
@MatthewDowle -- Very nice! I just tried that out. Looks like it works for overwriting an existing column, but not yet for creating a new one. Is that right?Josh O'Brien
@Josh Great. No, it should add new columns fine. You can even subassign to a new column and it'll populate the rest of the column with NA for you. If neither works please file a bug report or new question. Make sure test.data.table() returns 653 tests ok to rule out somehow using an older revision of v1.8.1.Matt Dowle
@MatthewDowle That is indeed very nice. I didn't know how simple it is to install the package that way...Anyways, I also don't see how := by group can help here. Tried it for an hour and couldn't figure anything out. So I guess my solution is OK. Joining one data.table by two columns of that data.table is a very specific use case anyway.Christoph_J
No problem, have added an answer. Gone straight to := by group, grouping by i not by, and using join inherited scope (V1 comes from i scope) to boot. Didn't exactly plan to go that far in the first demo of := by group, but that's just the way it worked it out!Matt Dowle

1 Answers

6
votes

In use cases like this, the mantra "aggregate first, then join with that" often helps. So, starting with your DT, and using v1.8.1 :

> agg = DT[,last(value),by=list(ID,F0YR)]
> agg
   ID F0YR V1
1:  A    1  1
2:  A    4  2
3:  A    7  3
4:  B    1  4
5:  B    4  5
6:  B    7  7

I called it agg because I couldn't think of a better name. In this case you wanted last which isn't really an aggregate as such, but you know what I mean.

Then update DT by reference by group. Here we're grouping by i.

setkey(DT,ID,F1YR)
DT[agg,newcol:=V1]
    ID MONTH F0YR F1YR value newcol
 1:  A   100    1    4     1      2
 2:  A   101    1    4     1      2
 3:  A   102    1    4     1      2
 4:  A   103    4    7     2      3
 5:  A   104    4    7     2      3
 6:  A   105    4    7     2      3
 7:  A   106    7    9     3     NA
 8:  A   107    7    9     3     NA
 9:  A   108    7    9     3     NA
10:  B   100    1    4     4      5
11:  B   101    1    4     4      5
12:  B   102    1    4     4      5
13:  B   103    4    7     5      7
14:  B   104    4    7     5      7
15:  B   105    4    7     5      7
16:  B   106    7    9     6     NA
17:  B   107    7    9     6     NA
18:  B   108    7    9     7     NA

Is that right? Not sure I fully followed. Those ops should be very fast, without any copies, and should scale to large data. At least, that's the intention.