I have two big tables partitioned by a date column. They are stored as Parquet files in HDFS. Every partition is split into 64 MB blocks, each replicated 3 times across the cluster machines. To optimize the join I want to place partitions with the same date on the same machines (any given join key value appears in only one date partition).
In Spark there is a Partitioner object which can help distribute the blocks of different RDDs across the cluster. That is pretty close to what I need, but I'm afraid that after these RDDs are saved, their file blocks may be shuffled around by the HDFS placement mechanism. To explain: an RDD is a Spark-side object, while the DataFrame method saveAsTable(...) calls (I suppose) some low-level functions that choose the data nodes and replicate the data, so Spark's partitioning may not survive as physical block placement.
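For reference, this is roughly how I write the two tables today (a minimal sketch; the table names, the column name `dt`, and the paths are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("co-locate-by-date").getOrCreate()

val dfA = spark.read.parquet("/data/table_a")
val dfB = spark.read.parquet("/data/table_b")

// Repartition by the date column so rows with the same date land in the same
// Spark partition, then write one HDFS directory per date value.
dfA.repartition(col("dt"))
  .write
  .format("parquet")
  .partitionBy("dt")
  .saveAsTable("table_a_by_date")

dfB.repartition(col("dt"))
  .write
  .format("parquet")
  .partitionBy("dt")
  .saveAsTable("table_b_by_date")
```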
Can anyone help me check whether the blocks of my tables end up distributed the way I want?
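The only way I have found to inspect the physical placement so far is to walk one partition directory with the standard Hadoop FileSystem API, roughly like this (the warehouse path is made up):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val partitionDir = new Path("/user/hive/warehouse/table_a_by_date/dt=2017-01-01")

// Print the datanode hosts holding each block of every file in the partition.
fs.listStatus(partitionDir).foreach { status =>
  fs.getFileBlockLocations(status, 0, status.getLen).foreach { block =>
    println(s"${status.getPath.getName} offset=${block.getOffset} hosts=${block.getHosts.mkString(",")}")
  }
}
```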