I have a data.table with about 3 million rows and 40 columns. I would like to sort this table by descending order within groups like the following sql mock code:
sort by ascending Year, ascending MemberID, descending Month
Is there an equivalent way in data.table to do this? So far I have to break it down into 2 steps:
setkey(X, Year, MemberID)
This is very fast and takes only a few second.
X <- X[,.SD[order(-Month)],by=list(Year, MemberID)]
This step takes so much longer (5 minutes).
Update:
Someone made a comment to do X <- X[sort(Year, MemberID, -Month)]
and later deleted. This approach seems to be much faster:
user system elapsed
5.560 11.242 66.236
My approach: setkey() then order(-Month)
user system elapsed
816.144 9.648 848.798
My question is now: if I want to summarize by Year, MemberId and Month after sort(Year, MemberID, Month), does data.table recognize the sort order?
Update 2: to response to Matthew Dowle:
After setkey with Year, MemberID and Month, I still have multiple records per group. What I would like is to summarize for each of the groups. What I meant was: if I use X[order(Year, MemberID, Month)], does the summation utilizes binary search functionality of data.table:
monthly.X <- X[, lapply(.SD[], sum), by = list(Year, MemberID, Month)]
Update 3: Matthew D proposed several approaches. Run time for the first approach is faster than order() approach:
user system elapsed
7.910 7.750 53.916
Matthew: what surprised me was converting the sign of Month takes most of the time. Without it, setkey is blazing fast.