2
votes

I am working on an Apache Spark use case where I need to read data from Kafka. I have a very basic question about how Spark reads data from Kafka.

As I understand it, if the data velocity and volume are high, I can create multiple partitions in Kafka and read them in Spark. The number of partitions in the DStream is then the same as the number of partitions in Kafka.

Can I implement the same scenario by creating multiple Kafka topics with one partition each? I can configure my Kafka producer to push data to all the topics sequentially. This will create multiple DStreams in Spark. Then I can simply "union" all the DStreams to create my unionedDstream.

My question is:

Will the unionedDstream created by a union of other DStreams have the same number of partitions as a DStream created by reading a single topic with multiple partitions?

Here is an example for clarity:

I have a single producer and a single consumer.

In the first scenario:

(1) 1 Kafka topic with 4 partitions --> 1 DStream with 4 partitions

In the second scenario:

(2) 4 Kafka topics with 1 partition each --> 4 DStreams with 1 partition each

But here I can "union" all the DStreams to create a single DStream.

unionedDstream = dstream1.union(dstream2).union(dstream3).union(dstream4)

Will "unionedDstream" then become "1 DStream with 4 partitions" (the same as in the first scenario)? If so, which approach performs better?


1 Answer

4
votes

I presume the two setups are more or less equivalent in a single-node scenario, but you need multiple partitions if you want to use Kafka's clustering and load-balancing features.
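
To make the partition counts concrete, here is a toy model in plain Python. It does not call Spark or Kafka. It assumes the direct (receiver-less) Kafka integration, where each batch has one partition per Kafka topic-partition, and it assumes that a union simply concatenates the partitions of its inputs without merging them. The class, function, and topic names are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A Kafka topic-partition, e.g. ("events", 0). Topic names are hypothetical.
TopicPartition = Tuple[str, int]


@dataclass
class ModelStream:
    """Toy stand-in for a DStream: only tracks which Kafka partitions back it."""
    partitions: List[TopicPartition]

    def union(self, other: "ModelStream") -> "ModelStream":
        # Assumption: union concatenates the inputs' partitions, no merging.
        return ModelStream(self.partitions + other.partitions)

    def num_partitions(self) -> int:
        return len(self.partitions)


def direct_stream(topic: str, kafka_partitions: int) -> ModelStream:
    # Assumption: one stream partition per Kafka partition (direct approach).
    return ModelStream([(topic, p) for p in range(kafka_partitions)])


# Scenario 1: one topic with 4 partitions.
scenario1 = direct_stream("events", 4)

# Scenario 2: four topics with 1 partition each, unioned together.
streams = [direct_stream(f"events_{i}", 1) for i in range(1, 5)]
scenario2 = streams[0]
for s in streams[1:]:
    scenario2 = scenario2.union(s)

print(scenario1.num_partitions(), scenario2.num_partitions())
```

Under these assumptions both scenarios end up with the same partition count, which is why the difference shows up mainly once several machines are involved.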

Horizontal scaling in Kafka is achieved by spreading a consumer group across multiple machines and distributing the partitions amongst them. This only works if you have multiple partitions.
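
A rough sketch of why this needs multiple partitions: within a consumer group, each partition is read by at most one consumer, so the partition count caps how many consumers can do useful work. The function below is a simplified round-robin stand-in, not Kafka's actual assignment logic, and the machine names are invented.

```python
from typing import Dict, List


def assign_round_robin(num_partitions: int, consumers: List[str]) -> Dict[str, List[int]]:
    """Give each partition to exactly one consumer, cycling through the group."""
    assignment: Dict[str, List[int]] = {c: [] for c in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


group = ["machine-a", "machine-b", "machine-c", "machine-d"]  # hypothetical names

four_partitions = assign_round_robin(4, group)
one_partition = assign_round_robin(1, group)

# Consumers that received at least one partition in each layout.
busy_with_four = [c for c, parts in four_partitions.items() if parts]
busy_with_one = [c for c, parts in one_partition.items() if parts]

print(len(busy_with_four), "of", len(group), "consumers have work with 4 partitions")
print(len(busy_with_one), "of", len(group), "consumers have work with 1 partition")
```

With a single partition, only one member of the group receives anything, no matter how many machines join.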

You can probably achieve the same effect by distributing multiple topics across the machines instead. However, you would have to implement this yourself and could not rely on Kafka's built-in mechanism.