When you use groupByKey or join, you cause a shuffle. Here's what happens:

Assume you have a stream of incoming RDDs (called a DStream) whose elements are tuples of (String, Int). What you want is to group them by key (a word, in this example). But all the values for a given key aren't locally available in the same executor; they are potentially spread across many workers that previously did work on that RDD.
What Spark has to do now is say "Hey guys, all records whose key equals X need to go to worker 1, all records whose key equals Y need to go to worker 2, and so on", so that all values for a given key end up on a single worker node. That worker can then continue working on each RDD, which is now of type (String, Iterable[Int]) as a result of the grouping.
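For concreteness, here is a minimal sketch of that grouping on a DStream. The socket source, host/port, and batch interval are placeholder assumptions for illustration, not something from the question:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("GroupByKeyShuffle")
val ssc = new StreamingContext(conf, Seconds(5))

// Each batch is an RDD of text lines; turn them into (word, 1) pairs.
val words = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1))

// groupByKey forces a shuffle: every value for a given word has to end up
// in the same partition, producing a DStream of (String, Iterable[Int]).
val grouped = words.groupByKey()
grouped.print()

ssc.start()
ssc.awaitTermination()
```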
A join is similar in its behavior to a groupByKey: it needs all the values for each key, from both streams, to be co-located on the same worker before the keys can be matched.
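A minimal join sketch, using queueStream with made-up sample data purely so the example is self-contained (the keys and numbers are illustrative assumptions):

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("JoinShuffle")
val ssc = new StreamingContext(conf, Seconds(1))

// Two keyed DStreams backed by in-memory RDD queues (illustrative data only).
val counts = ssc.queueStream(mutable.Queue(
  ssc.sparkContext.parallelize(Seq(("spark", 3), ("shuffle", 1)))))
val lengths = ssc.queueStream(mutable.Queue(
  ssc.sparkContext.parallelize(Seq(("spark", 5), ("shuffle", 7)))))

// join shuffles both streams so that matching keys meet in one partition.
// Per batch the result is (key, (valueFromCounts, valueFromLengths)).
val joined = counts.join(lengths)
joined.print()

ssc.start()
ssc.awaitTermination()
```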
Behind the scenes, Spark has to do a couple of things for this to work:
- Repartitioning of the data: since all records for a given key may not be available on a single worker (a small sketch of how the target partition is chosen follows this list)
- Data serialization/deserialization and compression: since Spark may have to transfer data across nodes, the data has to be serialized and later deserialized
- Disk IO: caused by shuffle spill, since a single worker may not be able to hold all the data in memory
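To make the repartitioning point concrete: by default Spark picks the target partition (and hence the worker) with a HashPartitioner. A tiny sketch, with the number of partitions chosen arbitrarily:

```scala
import org.apache.spark.HashPartitioner

// getPartition is essentially a non-negative key.hashCode modulo numPartitions,
// so every occurrence of the same key maps to the same partition.
val partitioner = new HashPartitioner(4)
Seq("spark", "shuffle", "stream").foreach { key =>
  println(s"$key -> partition ${partitioner.getPartition(key)}")
}
```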
For more, see this introduction to shuffling.