
I have to write Spark Streaming (createDirectStream API) code. I will be receiving around 90K messages per second, so I thought of using 100 partitions for the Kafka topic to improve performance.

Could you please let me know how many executors I should use? Can I use 50 executors with 2 cores per executor?

Also, if the batch interval is 10 seconds and the Kafka topic has 100 partitions, will I receive 100 RDDs, i.e. one RDD from each Kafka partition? Will there be only one RDD per partition for the 10-second batch interval?

Thanks


1 Answer


There is no single good answer, really; it depends on how much executor memory and how many cores you have in your cluster.

The hard limit is that there is no point running more total executor cores than Kafka partitions (the extra consumers would sit idle), and you don't want to saturate your network or other I/O.

Therefore, first check whether you are capping the network, memory, or disks with a single executor; then run two and see whether throughput doubles and the per-machine network rate is cut in half. Then scale out cores and instances as needed.
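As a back-of-envelope check before any real benchmarking, you can sanity-check the proposed layout (50 executors × 2 cores against 100 partitions at 90K msg/s — the numbers from the question, not measured values):

```python
# Rough sizing arithmetic for the setup proposed in the question.
# These are assumed inputs, not benchmark results.
msgs_per_sec = 90_000
kafka_partitions = 100
executors = 50
cores_per_executor = 2

total_cores = executors * cores_per_executor            # 100 consumer cores
partitions_per_core = kafka_partitions / total_cores    # 1.0 -> no idle cores, no queued partitions
msgs_per_core_per_sec = msgs_per_sec / total_cores      # 900 msg/s each core must sustain

print(total_cores, partitions_per_core, msgs_per_core_per_sec)
```

With exactly one partition per core, no consumer sits idle and each core only has to keep up with ~900 messages per second, which is why 50×2 is a reasonable starting point — the real answer still depends on per-message processing cost and I/O, as described above.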

Dropbox recently wrote a blog post on their performance testing.

Regarding RDDs: assuming a 1:1 mapping of executor cores to Kafka partitions, each core sees 10 seconds' worth of data per interval for only one partition, so 100 partitions are processed in parallel per batch. IMO, the "number of RDDs" isn't the important metric, because the direct stream always gives you one RDD per batch interval — that RDD just has one Spark partition per Kafka partition.
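Concretely, for the question's numbers (10-second batches, 90K msg/s, 100 partitions — assumed inputs), the per-batch arithmetic looks like this:

```python
# One RDD per batch interval; Spark partitions mirror Kafka partitions 1:1
# with the direct stream. Inputs are the question's assumed numbers.
batch_interval_sec = 10
msgs_per_sec = 90_000
kafka_partitions = 100

rdds_per_batch = 1                                   # always one RDD per interval
spark_partitions_per_rdd = kafka_partitions          # 100 tasks per batch
records_per_batch = msgs_per_sec * batch_interval_sec        # 900,000 records
records_per_partition = records_per_batch // kafka_partitions  # 9,000 per task

print(rdds_per_batch, spark_partitions_per_rdd, records_per_partition)
```

So you do not get 100 RDDs per batch — you get one RDD of 100 partitions, and each of its 100 tasks handles roughly 9,000 records per 10-second interval.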