Rank most recent scores of students within a given date - 30 days window

Question

The following is what my dataframe/data.table looks like. The Rank column is my desired calculated field.

library(data.table)
df <- fread('
             Name   Score         Date              Rank
             John    42         1/1/2018              3   
             Rob     85         12/31/2017            2
             Rob     89         12/26/2017            1
             Rob     57         12/24/2017            1
             Rob     53         08/31/2017            1
             Rob     72         05/31/2017            2
             Kate    87         12/25/2017            1
             Kate    73         05/15/2017            1
             ')
df[,Date:= as.Date(Date, format="%m/%d/%Y")]

I am trying to calculate the rank of each student at every point in time in the data, within a 30-day window. For that, I need to fetch the most recent score of every student as of that date and then apply a rank function.

In the 1st row, as of 1/1/2018, John has two competitors in the past 30-day window: Rob, with a most recent score of 85 on 12/31/2017, and Kate, with a most recent score of 87 on 12/25/2017. Both of these dates fall within the (1/1/2018 - 30 days) window. John has the lowest score, 42, so he gets rank 3. If only one student falls within the (date of the given row - 30 days) window, then the rank is 1.

In the 3rd row the date is 12/26/2017, so Rob's score as of 12/26/2017 is 89. Only one other student falls in the time window of (12/26/2017 - 30 days): Kate, whose most recent score is 87 on 12/25/2017. Within that window, Rob's score of 89 is higher than Kate's score of 87, and therefore Rob gets rank 1.
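
The rule described above can be spelled out row by row as a slow cross-check. This is a brute-force sketch for small data only, not the efficient approach being asked for; `rank_at` and `Rank_check` are illustrative names, and the dates are typed in ISO format for brevity.

```r
library(data.table)

# Same rows as the sample data, without the desired Rank column
df <- data.table(
  Name  = c("John", "Rob", "Rob", "Rob", "Rob", "Rob", "Kate", "Kate"),
  Score = c(42, 85, 89, 57, 53, 72, 87, 73),
  Date  = as.Date(c("2018-01-01", "2017-12-31", "2017-12-26", "2017-12-24",
                    "2017-08-31", "2017-05-31", "2017-12-25", "2017-05-15"))
)

# Hypothetical helper: rank of student `nm` as of date `d`
rank_at <- function(nm, d) {
  # All rows dated within the 30 days up to and including d
  win <- df[Date >= d - 30 & Date <= d]
  # Most recent score per student inside that window
  latest <- win[order(Date), .(Score = last(Score)), by = Name]
  # Highest score gets rank 1
  latest[, r := as.integer(rank(-Score, ties.method = "min"))]
  latest[Name == nm, r]
}

# Evaluate the helper once per row, then attach the result
check <- vapply(seq_len(nrow(df)),
                function(k) rank_at(df$Name[k], df$Date[k]),
                integer(1))
df[, Rank_check := check]
print(df)
```

Rank_check should agree with the Rank column in the sample data above.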

I was thinking about using the framework from "Efficient way to perform running total in the last 365 day window", but I am struggling to think of a way to fetch the most recent score of every student at a given point in time before applying rank.

Rob's score as of 12/31 should also be 89 giving him rank 1 on row 2, right? — Frank
@Frank Hey Frank, I was thinking that as of 12/31, Rob's most recent score is 85 which is second to Kate's 87 on 12/25(which falls in the 12/31 - 30 day window). — gibbz00

Frank · Accepted Answer · 2018-04-03T12:41:26

This seems to work:

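# Non-equi self-join: for each row's Date, pull every row dated within the 30 days up to it
ranks = df[.(d_dn = Date - 30L, d_up = Date), on=.(Date >= d_dn, Date <= d_up), allow.cart=TRUE][, 
  .(LatestScore = last(Score)), by=.(Date = Date.1, Name)]  # Date.1 holds d_up, the window's end date

# Rank within each window end date, highest LatestScore first
setorder(ranks, Date, -LatestScore)
ranks[, r := rowid(Date)]

# Update join: copy r back onto the matching (Name, Date) rows of df
df[ranks, on=.(Name, Date), r := i.r]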

   Name Score       Date Rank r
1: John    42 2018-01-01    3 3
2:  Rob    85 2017-12-31    2 2
3:  Rob    89 2017-12-26    1 1
4:  Rob    57 2017-12-24    1 1
5:  Rob    53 2017-08-31    1 1
6:  Rob    72 2017-05-31    2 2
7: Kate    87 2017-12-25    1 1
8: Kate    73 2017-05-15    1 1

This uses last() because the Cartesian join seems to return rows sorted by date within each window, and we want the latest measurement.

How the update join works

The i. prefix means it's a column from i in the x[i, ...] join, and the assignment := is always in x. So it's looking up each row of i in x and where matches are found, copying values from i to x.
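
A minimal sketch of that update join on toy data (the tables `main` and `lookup` and their columns are made up for illustration and are not part of the question's data):

```r
library(data.table)

# Hypothetical toy tables
main   <- data.table(id = c("a", "b", "c"), val = 1:3)
lookup <- data.table(id = c("a", "c"), new = c(10L, 30L))

# main is x, lookup is i: each row of lookup is matched to main by id,
# and i.new (lookup's column) is written into main where a match exists
main[lookup, on = .(id), new := i.new]

print(main)  # id "b" has no match in lookup, so its `new` stays NA
```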

Another way that is sometimes useful is to look up x rows in i, something like df[, r := ranks[df, on=.(Name,Date), x.r]] in which case x.r is still from the ranks table (now in the x position relative to the join).
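
A minimal sketch of this second form on toy data (the tables `main` and `lookup` and their columns are hypothetical). In the inner join, `lookup` is in the x position, so `x.new` refers to its column, and the join returns one value per row of `main`:

```r
library(data.table)

# Hypothetical toy tables
main   <- data.table(id = c("a", "b", "c"), val = 1:3)
lookup <- data.table(id = c("a", "c"), new = c(10L, 30L))

# Inner join: look up every row of main in lookup; unmatched rows give NA.
# The resulting vector has one element per row of main and is assigned as a column.
main[, new := lookup[main, on = .(id), x.new]]

print(main)
```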

There's also...

ranks = df[CJ(Name = Name, Date = Date, unique=TRUE), on=.(Name, Date), roll=30, nomatch=0]
setnames(ranks, "Score", "LatestScore")

# and then use the same last three lines above

I'm not sure about the efficiency of one approach versus the other, but I guess it depends on the number of Names, the frequency of measurement and how often measurement days coincide.
