My Apache Beam pipeline takes an infinite stream of messages. Each message fans out N elements (N is ~1000 and is different for each input). Then for each element produced by the previous stage there is a map operation that produces new N elements, which should be reduced using a top 1 operation (elements are grouped by the original message that was read from the queue). The results of top 1 is saved to an external storage. In Spark I can easily do it by reading messages from the stream and creating RDD for each message that does map + reduce. Since Apache Beam does not have nested pipelines, I can't see a way implementing it in Beam with an infinite stream input. Example:
Infinite stream elements: A, B
Step 1 (fan out, N = 3): A -> A1, A2, A3
(N = 2): B -> B1, B2
Step 2 (map): A1, A2, A3 -> A1', A2', A3'
B1, B2, B3 -> B1', B2'
Step 3 (top1): A1', A2', A3' -> A2'
B1', B2' -> B3'
Output: A2', B2'
There is no dependency between A and B elements. A2' and B2' are top elements withing their group. The stream is infinite. The map operation can take from a couple of seconds to a couple of minutes. Creating a window watermark for the maximum time it takes to do the map operation would make the overall pipeline time much slower for fast map operations. Nested pipeline would help because this way I could create a pipeline per message.