In Hadoop I have a collection of datapoints, each including a "startTime" and "endTime" in milliseconds. I want to group on one field then identify each place in the bag where one datapoint overlaps another in the sense of start/end time. For example, here's some data:
0,A,0,1000
1,A,1500,2000
2,A,1900,3000
3,B,500,2000
4,B,3000,4000
5,B,3500,5000
6,B,7000,8000
which I load and group as follows:
inputdata = LOAD 'inputdata' USING PigStorage(',')
AS (id:long, where:chararray, start:long, end:long);
grouped = GROUP inputdata BY where;
The ideal result here would be
(1,2)
(4,5)
I have written some bad code to generate an individual tuple for each second with some rounding, then do a set intersection, but this seems hideously inefficient, and in fact it still doesn't quite work. Rather than debug a bad approach, I want to work on a good approach.
How can I reasonably efficiently get tuples like (id1,id2) for the overlapping datapoints?
I am thoroughly comfortable writing a Java UDF to do the work for me, but it seems as though Pig should be able to do this without needing to resort to a custom UDF.