
I have 3 tables in Cassandra, distributed across several nodes, with a Spark worker running on each node. Let's call these tables A, B and C.

A and B are huge, but they share the same partition key, so data locality is maintained when I join them together.

Now I want to join in the third table C, which has a different partition key but is not as big as the other two. I am also willing to replicate this table to all my nodes if I have to.

How do I join them together while maintaining data locality and minimizing shuffle?

1 Answer


As you mentioned, the third table doesn't share a partition key with the other two, so you can't guarantee that the matching rows will all be on the same node.

This means you have two options. The first is to collect your third RDD and use sparkContext.broadcast on the result, then perform a map-side join against the other RDDs. This option will not trigger a shuffle, because a broadcast variable is replicated to every machine in your cluster. The one thing to check is that you don't broadcast a gigantic dataset (by gigantic I mean more than a few gigabytes, even though I have never found hard proof that broadcasting data of that size is harmful).
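To make the map-side join concrete, here is a minimal sketch in plain Python rather than Spark (the table contents and the `join_partition` helper are made up for illustration; in actual Spark code you would `collect()` table C into a dict, wrap it with `sc.broadcast()`, and reference the broadcast value inside a `mapPartitions` over the big RDD):

```python
# Plain-Python sketch of a broadcast (map-side) join.
# Spark equivalent (roughly):
#   small = sc.broadcast(dict(c_rdd.collect()))
#   big_rdd.mapPartitions(lambda part: join_partition(part, small.value))

# Small table C, keyed by its join key -- this is what gets broadcast.
table_c = {"k1": "c-row-1", "k2": "c-row-2"}

# One partition of the big (A join B) data: (key, value) pairs.
big_partition = [("k1", "ab-row-1"), ("k2", "ab-row-2"), ("k3", "ab-row-3")]

def join_partition(partition, small):
    """Join each record against the in-memory copy of the small table.

    No shuffle is needed: every worker already holds `small` locally.
    Records with no match are dropped (inner-join semantics).
    """
    for key, value in partition:
        if key in small:
            yield (key, (value, small[key]))

result = list(join_partition(big_partition, table_c))
# "k3" has no match in table C, so only k1 and k2 survive.
```

The join runs entirely inside each partition, which is exactly why no shuffle is triggered.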

The other option is to use a HashPartitioner on the parent RDDs. This is more flexible than a map-side join because you can use rightOuterJoin or leftOuterJoin from the Spark API. However, you have to partition all of your RDDs with the same partitioner, and you need to work out how many partitions give the best join performance. From my experience I usually aim for around 128 MB per partition, but nothing is more reliable than testing it yourself, because everything depends on your use case.
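The co-partitioning idea can also be sketched in plain Python (the datasets and the `hash_partition` helper are invented for illustration; in Spark you would call `partitionBy(HashPartitioner(n))` on each RDD and then `join`, `leftOuterJoin`, or `rightOuterJoin` them):

```python
# Plain-Python sketch of co-partitioning with a hash partitioner.
# Spark equivalent (roughly):
#   ab_rdd.partitionBy(n).join(c_rdd.partitionBy(n))

NUM_PARTITIONS = 4

def hash_partition(records, num_partitions):
    """Assign each (key, value) record to a partition by hashing its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

ab = [("k1", "ab-1"), ("k2", "ab-2"), ("k3", "ab-3")]
c = [("k2", "c-2"), ("k3", "c-3"), ("k4", "c-4")]

# Both datasets use the SAME partitioner, so any given key lands at the
# same partition index on both sides...
ab_parts = hash_partition(ab, NUM_PARTITIONS)
c_parts = hash_partition(c, NUM_PARTITIONS)

# ...which means each pair of partitions can be joined locally, with no
# further movement of data between partitions.
joined = []
for ab_part, c_part in zip(ab_parts, c_parts):
    c_map = dict(c_part)
    for key, value in ab_part:
        if key in c_map:
            joined.append((key, (value, c_map[key])))
```

Once both sides are partitioned identically, only the initial `partitionBy` moves data; the join itself stays local to each partition.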