I am re-writing some R scripts which analyse large volumes of data (~17 million rows), and I thought I would try to improve it's memory efficiency by using the data.table
package (which I am only just learning!).
One part of the code has been puzzling me for a bit. I can't post my original solution because (1) it is crap(slow!), and (2) it's very nuanced with regards to the data, and will just complicate this question.
Instead, I have made this toy example (and it really is a toy example):
ds <- data.table(ID=c(1,1,1,1,2,2,2,3,3,3),
Obs=c(1.5,2.5,0.0,1.25,1.45,1.5,2.5,0.0,1.25,1.45),
Pos=c(1,3,5,6,2,3,5,2,3,4))
Which looks like this:
ID Obs Pos
1: 1 1.50 1
2: 1 2.50 3
3: 1 0.00 5
4: 1 1.25 6
5: 2 1.45 2
6: 2 1.50 3
7: 2 2.50 5
8: 3 0.00 2
9: 3 1.25 3
10: 3 1.45 4
For ease of explanation, I'll pretend that we are observing trains (each train having its own ID), moving across a linear one-way track, with observations (some value, not of import to the question) about the train being made at set positions (pos, here from 1-6) along the track. It is not expected that a train will make it the entire length of the track (maybe it exploded before making it to pos 6), and sometimes an observation is missed by the observer... The positions are consecutive (thus if we missed observing the train at position 4, but we observed it at position 5, we know it must have passed through position 4).
From the above data.table, I need to generate a table like this:
Pos Count
1: 1 3
2: 2 3
3: 3 3
4: 4 3
5: 5 2
6: 6 1
Where for each unique Pos in my data.table ds, I have a count of the number of trains that made it to that position on the track (or further), regardless of whether the observation was made at that position on the track.
If anyone has any ideas or suggestions as to how to tackle this, it would be much appreciated. Unfortunately, I'm not familiar enough with data.table to know if this could be done! Or it could be incredibly simple problem to solve and I'm just slow :)