When you use groupByKey or join, you cause a shuffle. Here's what happens:

Assume you have a stream of incoming RDDs (called a DStream) whose elements are tuples of (String, Int). What you want is to group them by key (a word, in this example). But all the values for a given key aren't locally available in the same executor; they are potentially spread across many workers that previously did work on that RDD.
What Spark has to do now is say "Hey guys, all records whose key equals X need to go to worker 1, all records whose key equals Y need to go to worker 2, and so on", so that all values for a given key end up on a single worker node. That worker can then continue working on each RDD, which is now of type (String, Iterable[Int]) as a result of the grouping.
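For concreteness, here is a minimal sketch of that grouping on a DStream. The socket source, host/port, and batch interval are placeholder assumptions for illustration, not something from the question:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("GroupByKeyShuffle")
val ssc = new StreamingContext(conf, Seconds(5))

// Each batch is an RDD of text lines; turn them into (word, 1) pairs.
val words = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1))

// groupByKey forces a shuffle: every value for a given word has to end up
// in the same partition, producing a DStream of (String, Iterable[Int]).
val grouped = words.groupByKey()
grouped.print()

ssc.start()
ssc.awaitTermination()
```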
A join is similar in its behavior to a groupByKey: it needs all the values for each key, from both streams, to be co-located on the same worker before the keys can be matched.
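A minimal join sketch, using queueStream with made-up sample data purely so the example is self-contained (the keys and numbers are illustrative assumptions):

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("JoinShuffle")
val ssc = new StreamingContext(conf, Seconds(1))

// Two keyed DStreams backed by in-memory RDD queues (illustrative data only).
val counts = ssc.queueStream(mutable.Queue(
  ssc.sparkContext.parallelize(Seq(("spark", 3), ("shuffle", 1)))))
val lengths = ssc.queueStream(mutable.Queue(
  ssc.sparkContext.parallelize(Seq(("spark", 5), ("shuffle", 7)))))

// join shuffles both streams so that matching keys meet in one partition.
// Per batch the result is (key, (valueFromCounts, valueFromLengths)).
val joined = counts.join(lengths)
joined.print()

ssc.start()
ssc.awaitTermination()
```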
Behind the scenes, Spark has to do a couple of things for this to work:
- Repartitioning of the data: since all records for a given key may not be available on a single worker (a small sketch of how the target partition is chosen follows this list)
- Data serialization/deserialization and compression: since Spark may have to transfer data across nodes, the data has to be serialized and later deserialized
- Disk IO: caused by shuffle spill, since a single worker may not be able to hold all the data in memory
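To make the repartitioning point concrete: by default Spark picks the target partition (and hence the worker) with a HashPartitioner. A tiny sketch, with the number of partitions chosen arbitrarily:

```scala
import org.apache.spark.HashPartitioner

// getPartition is essentially a non-negative key.hashCode modulo numPartitions,
// so every occurrence of the same key maps to the same partition.
val partitioner = new HashPartitioner(4)
Seq("spark", "shuffle", "stream").foreach { key =>
  println(s"$key -> partition ${partitioner.getPartition(key)}")
}
```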
For more, see this introduction to shuffling.