I am working on an Apache Spark use case where I need to read data from Kafka. I have a very basic question about the way Spark reads data from Kafka.
As per my understanding, if the data velocity and volume are high, I can create multiple partitions in the Kafka topic and read them in parallel in Spark. The number of partitions in the resulting DStream is then the same as the number of partitions in the Kafka topic.
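For context, here is a minimal sketch of the single-topic setup I have in mind, assuming the spark-streaming-kafka 0.8 direct stream API, a local broker at localhost:9092, and a hypothetical topic name topicWith4Partitions; I print rdd.partitions.length per batch to check the partition count:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("KafkaSingleTopic")
    val ssc = new StreamingContext(conf, Seconds(10))

    // broker address and topic name are placeholders for my actual setup
    val kafkaParams = Map[String, String]("metadata.broker.list" -> "localhost:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("topicWith4Partitions"))

    // each batch RDD should have one partition per Kafka partition
    stream.foreachRDD { rdd => println(s"partitions: ${rdd.partitions.length}") }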
Can I implement the same scenario by creating multiple Kafka topics with one partition each? I can configure my Kafka producer to push data to all the topics sequentially. This will create multiple DStreams in Spark, and I can then simply union all of them to create my unionedDstream.
Now my question is:
Will the unionedDstream created by the union of the other DStreams have the same number of partitions as the one created by reading a single topic with multiple partitions?
I will put an example below for clarity:
I have a single producer and a single consumer.
In the first scenario:
(1) 1 Kafka topic with 4 partitions --> 1 DStream with 4 partitions
In the second scenario:
(2) 4 Kafka topics with 1 partition each --> 4 DStreams with 1 partition each.
But here I can union all the DStreams to create a single DStream:
    unionedDstream = dstream1.union(dstream2).union(dstream3).union(dstream4)
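For reference, a fuller sketch of how I would build and union the four single-topic streams, reusing ssc and kafkaParams from the sketch above and with hypothetical topic names topic1..topic4:

    import org.apache.spark.streaming.dstream.DStream

    // one direct stream per single-partition topic
    val topics = Seq("topic1", "topic2", "topic3", "topic4")
    val dstreams: Seq[DStream[(String, String)]] = topics.map { t =>
      KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set(t))
    }

    // equivalent to chaining dstream1.union(dstream2)... as above
    val unionedDstream = ssc.union(dstreams)

    // check how many partitions each unioned batch actually has
    unionedDstream.foreachRDD { rdd => println(s"partitions after union: ${rdd.partitions.length}") }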
Now will "unionedDstream" becomes "1 Dstream with 4 partitions" (same as 1st scenario). If yes then which process will be more effective performance wise?