Tuples taking more time in reaching from spout to last bolt(aka Complete Latency) is high

Question

Version Info: 
   "org.apache.storm" % "storm-core" % "1.2.1" 
   "org.apache.storm" % "storm-kafka-client" % "1.2.1"

I was creating and experimenting with a Storm topology I have created which has 4 bolts and one kafka spout.

I was trying to tune configs like parallelism of these bolts, max-spout-pending, etc to see how much scale I can get out of it. After some configuration config/results looks something like below:

max-spout-pending: 1200
Kafka Spout Executors: 10
Num Workers: 10
+----------+-------------+----------+----------------------+
| boltName | Parallelism | Capacity | Execute latency (ms) |
+----------+-------------+----------+----------------------+
| __acker  |          10 | 0.008    | 0.005                |
| bolt1    |          15 | 0.047    | 0.211                |
| bolt2    |         150 | 0.846    | 33.151               |
| bolt3    |        1500 | 0.765    | 289.679              |
| bolt4    |          48 | 0.768    | 10.451               |
+----------+-------------+----------+----------------------+

Process latency and Execute latency are almost same. There is an HTTP call involved in bolt 3 which is taking approximately that much time and bolt 2 and bolt 4 are also doing some I/O operation.

While I can see that each bolt can individually process more than 3k, (bolt3: 1500/289.679ms = 5.17k qps, bolt4: 48/10.451ms = 4.59k qps and so on), but overall this topology is processing tuples at only ~3k qps. I am running it on 10 boxes(so one worker per box), having 12 core CPU and 32GB RAM. I have given each worker process -xms 8Gb and -xmx 10Gb, so RAM should also not be constraint. I see GC also happening properly, 4 GC per minute taking around total time of 350ms in a minute(from flight recording of worker process of 1 minute).

I see Complete Latency for each tuple to be around 4 sec, which is something I am not able to understand, as If I compute all time taken by all bolts, it comes around 334 ms, but as mentioned here, tuples can be waiting in buffers, it suggests to increase dop(degree of parallelism), which I have done and reached above state.

I add some more metering and I see tuples are taking on average around 1.3sec to reach from bolt 2 to bolt 3 and 5 sec from bolt 3 to bolt 4. While I understand Storm might be keeping them in it's outbound or inbound buffer, My question is how do I reduce it as these bolts should be able to process more tuples in a second as par my earlier calculation, what is holding them from entering and being processed at faster rate?

Tom Cooper Tom Cooper · Accepted Answer · 2018-11-07T16:25:27

I think your issue may be due to ack tuples, that are used to start and stop the complete latency clock, being stuck waiting at the ackers.

You have a lot of bolts and presumably high throughput which will result in a lot of ack messages. Try increasing the number of ackers, using the topology.acker.executors config value which will hopefully reduce the queuing delay for the ack tuples.

If you are also using a custom metrics consumer you may also want to increase the parallelism of this component too, given the number of bolts you have.

Tuples taking more time in reaching from spout to last bolt(aka Complete Latency) is high

1 Answers