In particular, if I say rdd3 = rdd1.join(rdd2), then when I call rdd3.collect, depending on the Partitioner used, either data is moved between nodes, or the join is done locally on each partition (or, for all I know, something else entirely).
This depends on what the RDD paper calls "narrow" and "wide" dependencies, but who knows how good the optimizer is in practice.
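To make the two cases concrete, here is a toy model in plain Python. It is not Spark and not how Spark implements joins; partitions are just lists of (key, value) pairs, and every function name below is made up for illustration. It only shows why a join can stay local when both sides were placed by the same key-based partitioner, and why records have to be moved first when they were not.

```python
# Toy model only: plain Python lists stand in for RDD partitions.
# None of these helpers are Spark APIs; all names are hypothetical.

def hash_partition(records, num_partitions):
    """Place each (key, value) pair in partition hash(key) % num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

def round_robin_partition(records, num_partitions):
    """Spread records by position, ignoring the key (no key-based partitioner)."""
    partitions = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        partitions[i % num_partitions].append(record)
    return partitions

def local_join(left_parts, right_parts):
    """Join partition i of the left side against partition i of the right side only."""
    joined = []
    for left, right in zip(left_parts, right_parts):
        for left_key, left_value in left:
            for right_key, right_value in right:
                if left_key == right_key:
                    joined.append((left_key, (left_value, right_value)))
    return sorted(joined)

def shuffle_join(left_parts, right_parts, num_partitions):
    """Re-partition both sides by key (the data movement), then join locally."""
    left_flat = [record for part in left_parts for record in part]
    right_flat = [record for part in right_parts for record in part]
    return local_join(hash_partition(left_flat, num_partitions),
                      hash_partition(right_flat, num_partitions))

# Integer keys keep hash() deterministic across runs.
data1 = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
data2 = [(4, "w"), (3, "x"), (2, "y"), (1, "z")]
full_result = [(1, ("a", "z")), (2, ("b", "y")), (3, ("c", "x")), (4, ("d", "w"))]

# Case 1: both sides share the same partitioner, so matching keys already sit
# in the same partition index and the join needs no data movement.
co_partitioned = local_join(hash_partition(data1, 2), hash_partition(data2, 2))

# Case 2: no shared partitioner. A purely local join misses matches ...
unaligned_local = local_join(round_robin_partition(data1, 2),
                             round_robin_partition(data2, 2))

# ... so the records have to be moved by key before joining.
unaligned_shuffled = shuffle_join(round_robin_partition(data1, 2),
                                  round_robin_partition(data2, 2), 2)
```

In this model the first case corresponds to what I understand a "narrow" dependency to be, and the third to a "wide" one; what I can't tell is which of these Spark actually picked for a given rdd3.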
Anyways, I can kind of glean from the trace output which thing actually happened, but it would be nice to call rdd3.explain.
Does such a thing exist?