1 vote

I have a three-part topology that's having some serious latency issues, but I'm having trouble figuring out where they come from.

kafka -> db lookup -> write to cassandra

The numbers from the Storm UI look like this: [Storm UI screenshot]

(I see that the bolts are running at > 1.0 capacity)

If the process latency for the two bolts is ~65 ms, why is the 'complete latency' > 400 sec? I suspect the 'failed' tuples are coming from timeouts, as the latency value is steadily increasing.

The tuples are routed between the bolts via shuffleGrouping.

Cassandra lives on AWS, so there are likely network limitations en route.

The Storm cluster has 3 machines. There are 3 workers in the topology.
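
For reference, this is roughly how the topology is wired (a minimal sketch using Storm's Java TopologyBuilder; MyKafkaSpout, DbLookupBolt and CassandraWriterBolt are placeholders for my actual spout/bolt classes, and package names assume a recent Storm release):

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.topology.TopologyBuilder;

    public class KafkaToCassandraTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();

            // Placeholders: MyKafkaSpout, DbLookupBolt and CassandraWriterBolt
            // stand in for my real spout and bolt implementations.
            builder.setSpout("kafka_spout", new MyKafkaSpout(), 1);
            builder.setBolt("decode_bytes_1", new DbLookupBolt(), 3)
                   .shuffleGrouping("kafka_spout");
            builder.setBolt("save_to_cassandra", new CassandraWriterBolt(), 3)
                   .shuffleGrouping("decode_bytes_1");

            Config conf = new Config();
            conf.setNumWorkers(3); // one worker per machine in the 3-node cluster

            StormSubmitter.submitTopology("kafka-to-cassandra", conf, builder.createTopology());
        }
    }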


2 Answers

2 votes

Your topology has several problems:

  1. Look at the capacity of the decode_bytes_1 and save_to_cassandra bolts. Both are over 1 (capacity should stay below 1), which means you are using more resources than you have available; that is, the topology can't handle the load.
  2. TOPOLOGY_MAX_SPOUT_PENDING will solve your problem if the throughput of tuples varies during the day; that is, if you have peak hours and can catch up during the off-peak hours.
  3. You need to increase the number of worker machines or optimize the code in the bottleneck bolts (or maybe both). Otherwise you will not be able to process all the tuples.
  4. You can probably improve the Cassandra persister by inserting in batches instead of inserting tuples one by one (see the batching sketch after this list).
  5. I seriously recommend you always set TOPOLOGY_MAX_SPOUT_PENDING to a conservative value. Max spout pending is the maximum number of un-acked tuples inside the topology; remember this value is multiplied by the number of spout tasks, and tuples will time out (fail) if they are not acknowledged within 30 seconds of being emitted.
  6. And yes, your problem is tuples timing out; that is exactly what is happening.
  7. (EDIT) If you are running in a dev environment (or have just deployed the topology) you might see a spike in traffic from messages that were not yet consumed by the spout. It's important to prevent this case from hurting the topology, because you never know when you will need to restart the production topology or perform some maintenance. If this happens, treat it as a temporary spike in traffic: the spout needs to consume all the messages produced while the topology was offline, and after some (or many) minutes the rate of incoming tuples stabilizes. You can handle this with the max spout pending parameter (read item 2 again).
  8. Considering you have 3 nodes in your cluster and a CPU usage of 0.1, you can add more executors to the bolts (see the configuration sketch below).
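
To make items 3, 5 and 8 concrete, here is a minimal configuration sketch. It assumes Storm's Java Config API (the setters exist under both the backtype.storm and org.apache.storm packages, depending on your version), and the numbers are illustrative starting points rather than tuned values:

    import org.apache.storm.Config;

    public class TopologyTuning {
        /** Conservative settings along the lines of items 3 and 5. */
        public static Config conservativeConfig() {
            Config conf = new Config();
            conf.setNumWorkers(3);          // one worker per node in the 3-machine cluster
            conf.setMaxSpoutPending(500);   // cap on un-acked tuples per spout task (item 5)
            conf.setMessageTimeoutSecs(30); // Storm's default ack timeout, shown explicitly
            return conf;
        }
    }

More executors for the two bolts (item 8) come from the parallelism hint passed to setBolt when the topology is built, e.g. raising it from 3 to 6 and checking whether the capacity drops back below 1.0.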
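
For item 4, this is a sketch of what a batching persister bolt could look like. It is not your code: it assumes Storm 1.x package names and the DataStax Java driver 3.x, and the contact point, keyspace, table, field names and batch size are made up for illustration:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Tuple;

    import com.datastax.driver.core.BatchStatement;
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Session;

    /** Buffers incoming tuples and writes them to Cassandra in unlogged batches. */
    public class BatchingCassandraBolt extends BaseRichBolt {
        private static final int BATCH_SIZE = 100; // illustrative; keep it well below max spout pending

        private OutputCollector collector;
        private Cluster cluster;
        private Session session;
        private PreparedStatement insert;
        private List<Tuple> buffer;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
            this.buffer = new ArrayList<>();
            // Hypothetical contact point, keyspace and table.
            this.cluster = Cluster.builder().addContactPoint("cassandra.example.com").build();
            this.session = cluster.connect("my_keyspace");
            this.insert = session.prepare("INSERT INTO events (id, payload) VALUES (?, ?)");
        }

        @Override
        public void execute(Tuple tuple) {
            buffer.add(tuple);
            if (buffer.size() >= BATCH_SIZE) {
                flush();
            }
        }

        private void flush() {
            BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
            for (Tuple t : buffer) {
                batch.add(insert.bind(t.getStringByField("id"), t.getStringByField("payload")));
            }
            try {
                session.execute(batch);
                for (Tuple t : buffer) {
                    collector.ack(t);  // ack only after the batch is persisted
                }
            } catch (Exception e) {
                for (Tuple t : buffer) {
                    collector.fail(t); // let Storm replay the whole batch on error
                }
            }
            buffer.clear();
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt: nothing is emitted downstream
        }

        @Override
        public void cleanup() {
            if (!buffer.isEmpty()) {
                flush();
            }
            session.close();
            cluster.close();
        }
    }

In practice you would also flush on a timer or tick tuple, so a partially filled buffer does not hold un-acked tuples until they time out.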
0 votes

FWIW, it appears that the default value for TOPOLOGY_MAX_SPOUT_PENDING is unlimited. I added a call to stormConfig.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 500); and so far it appears that the problem has been alleviated. Possibly a 'thundering herd' issue?
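
In case it is useful, this is roughly how the change looks in context (a minimal sketch; the class and method names are just for illustration, and 500 is simply the value that worked for me so far):

    import org.apache.storm.Config;

    public class MaxSpoutPendingFix {
        static Config buildConfig() {
            Config stormConfig = new Config();
            // topology.max.spout.pending is unset (unlimited) by default, so this
            // one line is what introduces backpressure on the spout.
            stormConfig.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 500);
            // The typed setter stormConfig.setMaxSpoutPending(500) is equivalent.
            return stormConfig;
        }
    }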


After setting TOPOLOGY_MAX_SPOUT_PENDING to 500: [Storm UI screenshot]