I've a pipeline where I'm applying transformation rules(from broadcast state) on a stream of events; when I run broadcast stream and original stream in parallel without connecting, stream performance is really good, but the moment I do broadcast performance goes down drastically. How can I achieve better performance. Data passed between operators are in byte[] and data footprint is small as well.
I've attached snapshots of both scenarios:
- Top row shows stream consuming events from Kafka and bottom row shows rules consumed from another topic. With this setup I could achieve throughput of upto ~20K msg/sec per task manager processing 12Gb of data in 4mins
2. I've connected the broadcast stream with the data stream for
processing in future . Note that only to measure performance of
broadcast I've made sure no records are consumed in the data
stream(top row). At the processing side of the broadcast state, i'm
only store received messages to MapState. With this setup I can get
throughput of upto ~1000 msg/sec per task manager processing 12Gb of
data in 18mins.
