Version Info:
"org.apache.storm" % "storm-core" % "1.2.1"
"org.apache.storm" % "storm-kafka-client" % "1.2.1"
I was creating and experimenting with a Storm topology I have created which has 4 bolts and one kafka spout.
I was trying to tune configs like parallelism of these bolts, max-spout-pending, etc to see how much scale I can get out of it. After some configuration config/results looks something like below:
max-spout-pending: 1200
Kafka Spout Executors: 10
Num Workers: 10
+----------+-------------+----------+----------------------+
| boltName | Parallelism | Capacity | Execute latency (ms) |
+----------+-------------+----------+----------------------+
| __acker | 10 | 0.008 | 0.005 |
| bolt1 | 15 | 0.047 | 0.211 |
| bolt2 | 150 | 0.846 | 33.151 |
| bolt3 | 1500 | 0.765 | 289.679 |
| bolt4 | 48 | 0.768 | 10.451 |
+----------+-------------+----------+----------------------+
Process latency and Execute latency are almost same. There is an HTTP call involved in bolt 3 which is taking approximately that much time and bolt 2 and bolt 4 are also doing some I/O operation.
While I can see that each bolt can individually process more than 3k, (bolt3: 1500/289.679ms = 5.17k qps, bolt4: 48/10.451ms = 4.59k qps and so on), but overall this topology is processing tuples at only ~3k qps. I am running it on 10 boxes(so one worker per box), having 12 core CPU and 32GB RAM. I have given each worker process -xms 8Gb and -xmx 10Gb, so RAM should also not be constraint. I see GC also happening properly, 4 GC per minute taking around total time of 350ms in a minute(from flight recording of worker process of 1 minute).
I see Complete Latency
for each tuple to be around 4 sec, which is something I am not able to understand, as If I compute all time taken by all bolts, it comes around 334 ms, but as mentioned here, tuples can be waiting in buffers, it suggests to increase dop(degree of parallelism), which I have done and reached above state.
I add some more metering and I see tuples are taking on average around 1.3sec to reach from bolt 2 to bolt 3 and 5 sec from bolt 3 to bolt 4. While I understand Storm might be keeping them in it's outbound or inbound buffer, My question is how do I reduce it as these bolts should be able to process more tuples in a second as par my earlier calculation, what is holding them from entering and being processed at faster rate?