I'm playing around with Spark Streaming and Kafka together in Python, loosely following along with this post, but I'm a little confused about the KafkaUtils.createStream() function mentioned early on.
The documentation doesn't do much by way of explicitly explaining what the topics dictionary affects, but I have a suspicion that I only think so because my knowledge of how Kafka works is shaky and the answer is obvious.
I understand that it should be a dictionary like so: {"topic.name": 1}, and I can parrot the documentation and say that this means the stream created will consume from a single partition.
So I suppose I'm just looking for some clarification on both the usage of this particular function and my understanding of Kafka concepts. We'll use the following example:
Let's say I have defined a topic my.topic that has 3 partitions, and whose incoming messages are split on a key, let's just say a userid.
If I initialize a stream like so:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName='kafka-test')
ssc = StreamingContext(sc, 1)  # 1-second batch interval

kafkaStream = KafkaUtils.createStream(
    ssc,                    # the StreamingContext
    'kafka:2181',           # ZooKeeper quorum
    'consumer-group-name',  # Kafka consumer group id
    {'my.topic': 1}         # the topics dictionary in question
)
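For completeness, here's how I'm consuming the stream so far; as I understand the docs, createStream yields a DStream of (key, message) pairs, so I'm just printing each batch:

# Print a sample of (key, message) pairs from each batch, then run.
kafkaStream.pprint()
ssc.start()
ssc.awaitTermination()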
Am I right in thinking that this stream will only consume from a single partition, and so will not see every message coming into my.topic? In other words, will it only see messages from userids sent to one of the 3 partitions?
My questions then:
How do I properly set this parameter to consume all messages sent to my.topic? My intuition is that I would just set the topics parameter to {'my.topic': 3}, so then my question becomes: why would I ever use a number smaller than the total number of partitions?
My intuition here tells me that it's a matter of how "atomic" the work you're doing is. For example, if I were simply transforming data (say, from a CSV into a list of JSON documents), then having 3 of the above streams, each with {'my.topic': 1} set as their topics parameter and all part of the same consumer group, would be beneficial: it would enable parallel consumption from each of the partitions, because no information needs to be shared about each message consumed. Something like the sketch just below is what I have in mind.
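Here's a rough sketch of that parallel-consumption idea, reusing ssc and the connection details from above; the assumption that Kafka balances the 3 partitions across the 3 single-threaded receivers in the group is exactly the part I'm unsure about:

# Three receivers in the same consumer group, one thread each; my
# (unverified) assumption is that Kafka assigns one partition to each.
streams = [
    KafkaUtils.createStream(
        ssc, 'kafka:2181', 'consumer-group-name', {'my.topic': 1}
    )
    for _ in range(3)
]

# Union them into a single DStream, since the per-message work
# (e.g. CSV -> JSON) needs no coordination between partitions.
unified = ssc.union(*streams)
transformed = unified.map(lambda kv: kv[1].split(','))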
Meanwhile, if I was calculating live metrics meant to cover the entire topic, e.g. time-windowed averages with filters, I struggle to see a way to implement something like that without setting {'my.topic': 3}. Or, if it's something like a sum, then maybe by doing slightly more complicated downstream processing of each component signal within a consumer group, i.e. Sum1 + Sum2 + Sum3 = TotalSum. But again, my knowledge is squarely in the "fledgling" stage of playing around with Kafka and Spark.
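For the sum variant, the downstream combination I'm imagining looks roughly like this; the 60-second window, 10-second slide, and integer message payloads are all made-up stand-ins:

# Hypothetical: each message value is a string holding an integer.
# Sum each receiver's slice over a 60s window sliding every 10s...
window_sums = [
    s.map(lambda kv: int(kv[1]))
     .reduceByWindow(lambda a, b: a + b, None, 60, 10)
    for s in streams
]

# ...then combine the per-receiver sums: Sum1 + Sum2 + Sum3 = TotalSum.
total = window_sums[0]
for partial in window_sums[1:]:
    total = total.union(partial)
total = total.reduce(lambda a, b: a + b)
total.pprint()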
Is there a way to tell createStream() to consume from all partitions without knowing ahead of time how many there are? Something like {'my.topic': -1}?

Can there be more than one topic specified in a single stream, i.e. {'my.topic': 1, 'my.other.topic': 1}?
I would really hate for the answer to this question to simply be "Yes, your intuition is correct." The best-case scenario is that someone tells me I'm misunderstanding literally everything and sets me straight. So please... do that!