6
votes

Following is what my dataframe/data.table looks like. The rank column is my desired calculated field.

library(data.table)
df <- fread('
             Name   Score         Date              Rank
             John    42         1/1/2018              3   
             Rob     85         12/31/2017            2
             Rob     89         12/26/2017            1
             Rob     57         12/24/2017            1
             Rob     53         08/31/2017            1
             Rob     72         05/31/2017            2
             Kate    87         12/25/2017            1
             Kate    73         05/15/2017            1
             ')
df[,Date:= as.Date(Date, format="%m/%d/%Y")]

I am trying to calculate the rank of each student at every given point in time in the data within a 30 day windows. For that, I need to fetch the most recent scores of all students at a given point in time and then pass the rank function.

In the 1st row, as of 1/1/2018, John has two more competitors in a past 30 day window: Rob with the most recent score of 85 in 12/31/2017 AND Kate with the most recent score of 87 in 12/25/2017 and both of these dates fall within the 1/1/2018 - 30 Day Window. John gets a rank of 3 with the lowest score of 42. If only one students falls within date(at a given row) - 30 day window, then the rank is 1.

In the 3rd row the date is 12/26/2017. So Rob's score as of 12/26/2017 is 89. There is only one case of another student that falls in the time window of 12/26/2017 - 30 days and that is the most recent score(87) of kate on 12/25/2017. So within the time window of (12/26/2017) - 30 , Rob's score of 89 is higher than Kate's score of 87 and therefore Rob gets rank 1.

I was thinking about using the framework from here Efficient way to perform running total in the last 365 day window but struggling to think of a way to fetch all recent score of all students at a given point in time before using rank.

6
Rob's score as of 12/31 should also be 89 giving him rank 1 on row 2, right?Frank
@Frank Hey Frank, I was thinking that as of 12/31, Rob's most recent score is 85 which is second to Kate's 87 on 12/25(which falls in the 12/31 - 30 day window).gibbz00

6 Answers

5
votes

This seems to work:

ranks = df[.(d_dn = Date - 30L, d_up = Date), on=.(Date >= d_dn, Date <= d_up), allow.cart=TRUE][, 
  .(LatestScore = last(Score)), by=.(Date = Date.1, Name)]

setorder(ranks, Date, -LatestScore)
ranks[, r := rowid(Date)]

df[ranks, on=.(Name, Date), r := i.r]

   Name Score       Date Rank r
1: John    42 2018-01-01    3 3
2:  Rob    85 2017-12-31    2 2
3:  Rob    89 2017-12-26    1 1
4:  Rob    57 2017-12-24    1 1
5:  Rob    53 2017-08-31    1 1
6:  Rob    72 2017-05-31    2 2
7: Kate    87 2017-12-25    1 1
8: Kate    73 2017-05-15    1 1

... using last since the Cartesian join seems to sort and we want the latest measurement.

How the update join works

The i. prefix means it's a column from i in the x[i, ...] join, and the assignment := is always in x. So it's looking up each row of i in x and where matches are found, copying values from i to x.

Another way that is sometimes useful is to look up x rows in i, something like df[, r := ranks[df, on=.(Name,Date), x.r]] in which case x.r is still from the ranks table (now in the x position relative to the join).


There's also...

ranks = df[CJ(Name = Name, Date = Date, unique=TRUE), on=.(Name, Date), roll=30, nomatch=0]
setnames(ranks, "Score", "LatestScore")

# and then use the same last three lines above    

I'm not sure about efficiency of one vs another, but I guess it depends on number of Names, frequency of measurement and how often measurement days coincide.

2
votes

A solution that uses data.table though not sure if it is the most efficient usage:

df[.(iName=Name, iScore=Score, iDate=Date, StartDate=Date-30, EndDate=Date), 
    .(Rank=frank(-c(iScore[1L], .SD[Name != iName, max(Score), by=.(Name)]$V1), 
        ties.method="first")[1L]), 
    by=.EACHI, 
    on=.(Date >= StartDate, Date <= EndDate)]

Explanation:

1) The outer square brackets do a non-equi join within a date range (i.e. 30days ago and latest date for each row). Try studying the below output against the input data:

df[.(iName=Name, iScore=Score, iDate=Date, StartDate=Date-30, EndDate=Date),
    c(.(RowGroup=.GRP), 
        .SD[, .(Name, Score, Date, OrigDate, iName, iScore, iDate, StartDate, EndDate)]),
    by=.EACHI,
    on=.(Date >= StartDate, Date <= EndDate)]

2) .EACHI is to perform j calculations for each row of i.

3) Inside j, iScore[1L] is the score for the current row, .SD[Name != iName] means taking scores not corresponding to the student in the current row. Then, we use the max(Score) for each student of those students within the 30days window.

4) Concatenate all these scores and calculate the rank for the score of the current row while taking care of ties by taking the first tie.

Note:

see ?data.table to understand what i, j, by, on and .EACHI refers to.


EDIT after comments by OP:

I would add a OrigDate column and find those that matches the latest date.

df[, OrigDate := Date]

df[.(iName=Name, iScore=Score, iDate=Date, StartDate=Date-30, EndDate=Date), 
    .(Name=iName, Score=iScore, Date=iDate, 
        Rank=frank(-c(iScore[1L], 
                .SD[Name != iName, Score[OrigDate==max(OrigDate)], by=.(Name)]$V1), 
            ties.method="first")[1L]), 
    by=.EACHI, 
    on=.(Date >= StartDate, Date <= EndDate)]
1
votes

I came up with following partial solution, encountered however problem - is it possible that there will be two people occuring with the same date?

if not, have a look at following piece of code:

library(tidyverse) # easy manipulation
library(lubridate) # time handling
# This function can be added to 
get_top <- function(df, date_sel) {
  temp <- df %>% 
    filter(Date > date_sel - months(1)) %>% # look one month in the past from given date
    group_by(Name) %>% # and for each occuring name
    summarise(max_score = max(Score)) %>% # find the maximal score
    arrange(desc(max_score)) %>% # sort them
    mutate(Rank = 1:n()) # and rank them
  temp
}

Now, you have to find the name in the table, for given date and return its rank.

1
votes
library(data.table)
library(magrittr)

setorder(df, -Date)

fun <- function(i){
    df[i:nrow(df), head(.SD, 1), by = Name] %$% 
        rank(-Score[Date > df$Date[i] - 30])[1]
}
df[, rank := sapply(1:.N, fun)]
1
votes

This can be done by joining to df those rows of df that are within 30 days behind it or the same date and have higher or equal scores. Then for each original row and joined row Name get the joined row Name that is the most recent. The count of the remaining joined rows for each of the original df rows is the rank.

library(sqldf)

sqldf("with X as
  (select a.rowid r, a.*, max(b.Date) Date
  from df a join df b
  on b.Date between a.Date - 30 and a.Date and b.Score >= a.Score
  group by a.rowid, b.Name)

  select Name, Date, Score, count(*) Rank 
  from X
  group by r
  order by r")

giving:

  Name       Date Score Rank
1 John 2018-01-01    42    3
2  Rob 2017-12-31    85    2
3  Rob 2017-12-26    89    1
4  Rob 2017-12-24    57    1
5  Rob 2017-08-31    53    1
6  Rob 2017-05-31    72    2
7 Kate 2017-12-25    87    1
8 Kate 2017-05-15    73    1
1
votes

A tidyverse solution (dplyr + tidyr):

df %>%
  complete(Name,Date) %>%
  group_by(Name)      %>% 
  mutate(last_score_date = `is.na<-`(Date,is.na(Score))) %>%
  fill(Score,last_score_date) %>%
  filter(!is.na(Score) & Date-last_score_date <30) %>%
  group_by(Date) %>%
  mutate(Rank = rank(-Score)) %>%
  right_join(df)

# # A tibble: 8 x 5
# # Groups:   Date [?]
# Name       Date Score last_score_date  Rank
# <chr>     <date> <int>          <date> <dbl>
# 1  John 2018-01-01    42      2018-01-01     3
# 2   Rob 2017-12-31    85      2017-12-31     2
# 3   Rob 2017-12-26    89      2017-12-26     1
# 4   Rob 2017-12-24    57      2017-12-24     1
# 5   Rob 2017-08-31    53      2017-08-31     1
# 6   Rob 2017-05-31    72      2017-05-31     2
# 7  Kate 2017-12-25    87      2017-12-25     1
# 8  Kate 2017-05-15    73      2017-05-15     1
  • We add all missing combinations of Date and Name
  • then we create a column for the last_score_date, equal to Date when score isn't NA.
  • by filling NAs down Score has become the latest score
  • we filter out NAs and keep only scores that have < 30 days of age
  • That's our table of valid scores by dates
  • From there it's easy to add ranks
  • and a final right_join on the original table gives us the expected output

data

library(data.table)
df <- fread('
            Name   Score         Date   
            John    42         01/01/2018  
            Rob     85         12/31/2017
            Rob     89         12/26/2017
            Rob     57         12/24/2017
            Rob     53         08/31/2017
            Rob     72         05/31/2017
            Kate    87         12/25/2017
            Kate    73         05/15/2017
            ')
df[,Date:= as.Date(Date, format="%m/%d/%Y")]