I have a requirement to process batches of tuples with Storm. My last bolt has to wait until the topology receives the entire batch, and only after that can it do some processing. To avoid confusion: a batch for me is a set of N messages that arrives in real time, and the term doesn't have to be connected with batch processing (Hadoop). Even 2 messages can be a batch.
Reading Storm's documentation, is it safe to say that Storm doesn't inherently support this kind of batch processing (batch = N messages in real time)?
I know that we have Trident, and while I am not using it, I did test it a little. The batchSpout concept is indeed what I am looking for, because you can build up a batch using the collector and the spout will emit those messages as a single batch. But Trident aside, what does plain Storm offer?
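For reference, this is roughly what I mean by the Trident batchSpout concept; a minimal sketch only, assuming the pre-1.0 `storm.trident.spout.IBatchSpout` interface, with `readNextMessage()` as a hypothetical stand-in for whatever source feeds the batch:

```java
import storm.trident.operation.TridentCollector;
import storm.trident.spout.IBatchSpout;
import backtype.storm.task.TopologyContext;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

import java.util.Map;

// Everything emitted inside emitBatch() belongs to the same batch id,
// which gives the "N messages as one unit" semantics I am after.
public class FixedSizeBatchSpout implements IBatchSpout {
    private static final int BATCH_SIZE = 2; // even 2 messages can be a batch

    @Override
    public void open(Map conf, TopologyContext context) { }

    @Override
    public void emitBatch(long batchId, TridentCollector collector) {
        for (int i = 0; i < BATCH_SIZE; i++) {
            collector.emit(new Values(readNextMessage())); // hypothetical source call
        }
    }

    @Override
    public void ack(long batchId) { } // called once the whole batch has been processed

    @Override
    public void close() { }

    @Override
    public Map getComponentConfiguration() { return null; }

    @Override
    public Fields getOutputFields() {
        return new Fields("payload");
    }

    private String readNextMessage() {
        return "msg"; // placeholder
    }
}
```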
How I approached this problem is by using tuple acknowledgement and some timeout hacks (maybe they are not hacks?). As a message broker I used RabbitMQ, and I made a spout that takes messages from a queue and sends them downstream as tuples until there are no more messages in the queue. These tuples pass through a couple of stages (3-4 stages, aka bolts), and when they reach the final bolt they stop there. I need to stop them (not emit anything) because, as I said, the last bolt needs to process the result of the entire batch and then emit only one tuple (the final resulting tuple).

So how does it know when it should process? I made the spout responsible for signalling. When the spout doesn't have any tuples to emit, it sleeps for 10 ms. After it has been sleeping for, let's say, 1000 ms it goes into a READY state (it is ready to emit an END-OF-BATCH or TIME-OUT signal). But another condition needs to be met: I can't send the signal until I am sure that all tuples have reached the final bolt. So I used tuple acknowledgement to keep track of this. When a tuple reaches the final bolt it gets acked. When all tuples have been acked and the spout has timed out, the spout sends the signal, and the final bolt is now happy and can process the result of that batch of tuples.
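The spout side of this approach looks roughly like the sketch below. It only illustrates the timeout/ack bookkeeping, assuming the pre-1.0 `backtype.storm` package names; `pollFromQueue()` is a hypothetical placeholder for the actual RabbitMQ consumer call, and the field names are made up:

```java
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.atomic.AtomicInteger;

public class BatchSignallingSpout extends BaseRichSpout {
    private static final long TIMEOUT_MS = 1000;

    private SpoutOutputCollector collector;
    private final AtomicInteger pending = new AtomicInteger(0); // emitted but not yet acked
    private long idleSince = -1;        // timestamp of the first empty poll
    private boolean signalSent = false; // one END-OF-BATCH signal per batch

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        String msg = pollFromQueue(); // hypothetical: one message from RabbitMQ, null if queue is empty
        if (msg != null) {
            pending.incrementAndGet();
            idleSince = -1;
            signalSent = false;
            // emit with a message id so the spout's ack() is called when the tuple tree completes
            collector.emit(new Values(msg, false), UUID.randomUUID());
            return;
        }

        // queue is empty: start (or continue) the time-out countdown
        if (idleSince < 0) {
            idleSince = System.currentTimeMillis();
        }
        boolean timedOut = System.currentTimeMillis() - idleSince >= TIMEOUT_MS;
        if (timedOut && pending.get() == 0 && !signalSent) {
            // every tuple of the batch has been acked by the final bolt: send the signal
            collector.emit(new Values("END-OF-BATCH", true));
            signalSent = true;
        }
        Utils.sleep(10); // sleep 10 ms between empty polls
    }

    @Override
    public void ack(Object msgId) {
        pending.decrementAndGet();
    }

    @Override
    public void fail(Object msgId) {
        pending.decrementAndGet(); // a real spout would re-emit the message instead
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("payload", "isSignal"));
    }

    private String pollFromQueue() {
        return null; // placeholder for the RabbitMQ consumer call
    }
}
```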
So my question to you, my dear Storm gurus: is this topology badly designed, and does it look like some sort of hack? Is there a better way to do this?
You could use some external storage (e.g. Redis) where each tuple might have a value that will indicate its state. Then, by using Storm's TICK_TUPLE, you can constantly check if all the tuples in a given batch were processed. – Vor
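If I understand that suggestion correctly, the bolt side would look something like the sketch below: register a tick tuple via `Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS` and, on each tick, check the external store. `markTupleProcessed()` and `allTuplesOfBatchProcessed()` are hypothetical placeholders for the Redis writes/reads:

```java
import backtype.storm.Config;
import backtype.storm.Constants;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

import java.util.HashMap;
import java.util.Map;

public class BatchCheckingBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public Map<String, Object> getComponentConfiguration() {
        // ask Storm to send this bolt a tick tuple every second
        Map<String, Object> conf = new HashMap<String, Object>();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 1);
        return conf;
    }

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        if (isTickTuple(tuple)) {
            // on every tick, consult the external store (e.g. Redis) and, if every
            // tuple of the batch is marked done, process the batch and emit the result
            if (allTuplesOfBatchProcessed()) { // hypothetical Redis lookup
                // ... process the whole batch and emit the single resulting tuple
            }
        } else {
            // normal tuple: record its state in the external store, then ack it
            markTupleProcessed(tuple);         // hypothetical Redis write
            collector.ack(tuple);
        }
    }

    private boolean isTickTuple(Tuple tuple) {
        return Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
                && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
    }

    private boolean allTuplesOfBatchProcessed() { return false; } // placeholder
    private void markTupleProcessed(Tuple tuple) { }              // placeholder

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) { }
}
```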