
I have two big tables partitioned by a date column. They are stored as Parquet files in HDFS. Every partition is split into 64 MB blocks and replicated three times across the cluster machines. To optimize the join operation, I want to place the matching date partitions of the two tables on the same machines (any join key value occurs in exactly one date partition).

In Spark there is the Partitioner object, which can help distribute the blocks of different RDDs across the cluster. That is pretty close to what I need, but I am afraid that after saving these RDDs the file blocks may be shuffled by the HDFS mechanism. To explain: an RDD is a Spark-level abstraction, and the DataFrame method saveAsTable(...) calls (I suppose) some low-level functions which choose the data nodes and replicate the data.
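
For illustration, the daily write is roughly of this shape (a spark-shell session is assumed; the path, table and column names below are placeholders, not the real ones):

    // Rough shape of the daily write; names are placeholders.
    val df = spark.read.parquet("hdfs:///staging/2019-01-01")  // one day of data

    df.write
      .mode("append")
      .partitionBy("date")        // Hive-style date=... directories on HDFS
      .format("parquet")
      .saveAsTable("big_table")   // Spark writes the files, HDFS chooses the block locations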

Can anyone help me figure out whether the blocks of my tables are distributed the right way?

Hi @eakotelnikov, welcome to SO! Can you repartition the two Parquet files by the date column before you save them to HDFS? If that is outside your control, I think you have no choice but to repartition after reading the files (which means you cannot avoid the shuffle). Also, I am not clear what you mean by 'after saving these RDDs the file blocks may be shuffled by the HDFS mechanism'? Could you edit your question to elaborate on that part? - suj1th
I think you mean you want to tell Hadoop / HDFS to put "same" partitions on same data nodes / workers. I am not convinced that can be done. - thebluephantom
Thank you for the attention! The tables I mentioned in the question already exist on HDFS, but a new partition is appended every day. I can control the writing process, but I could not find any information on how to put the same partitions on the same data nodes. - eakotelnikov
I would be glad to know if it is possible, but I am sure it is not for files. bucketBy may help you. - thebluephantom
@thebluephantom, thanks for the idea of bucketing; it will accelerate the join, but blocks of the same partition (even bucketed ones) will still end up on different data nodes. Is the choice of block location completely random (apart from the rack-awareness rules)? - eakotelnikov
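
To illustrate the repartition suggestion from the comments above: repartitioning by the date column before the write controls how Spark groups rows into files, but it still does not let you choose the data nodes; HDFS places the resulting blocks on its own. A minimal sketch (the path and column name are assumptions):

    import org.apache.spark.sql.functions.col

    // Repartition by the date column before writing.
    // This shapes the files Spark produces, not where HDFS puts their blocks.
    df.repartition(col("date"))
      .write
      .mode("append")
      .partitionBy("date")
      .parquet("hdfs:///warehouse/table_a")  // path is illustrative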

1 Answer


The answer to your question is that you cannot expressly control the placement of "like / similar" data blocks in terms of partitioning for logically related files / tables. That is, you cannot influence which data nodes HDFS places the data blocks on.

These partitions / chunks of data may coincidentally reside on the same data nodes / workers (due to replication by HDFS).

As an aside, with S3 such an approach would not work in any event, as the concept of data locality optimization does not exist there.
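
If the goal is to make the join cheaper rather than to pin blocks to particular machines, the bucketBy approach mentioned in the comments is the usual Spark-side workaround: bucket (and sort) both tables by the join key with the same number of buckets, and Spark can then run a sort-merge join without shuffling either side. A minimal sketch, assuming a join_key column; the table names, paths and bucket count are illustrative:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("bucketed-join-sketch")
      .enableHiveSupport()           // bucketBy + saveAsTable needs a metastore
      .getOrCreate()

    // Write both tables with the same bucketing spec.
    spark.read.parquet("hdfs:///warehouse/table_a")
      .write
      .partitionBy("date")
      .bucketBy(64, "join_key")
      .sortBy("join_key")
      .format("parquet")
      .saveAsTable("table_a_bucketed")

    spark.read.parquet("hdfs:///warehouse/table_b")
      .write
      .partitionBy("date")
      .bucketBy(64, "join_key")
      .sortBy("join_key")
      .format("parquet")
      .saveAsTable("table_b_bucketed")

    // With matching bucket specs the join can skip the shuffle on both sides.
    val joined = spark.table("table_a_bucketed")
      .join(spark.table("table_b_bucketed"), Seq("date", "join_key"))

Whether the shuffle is actually avoided depends on the bucket counts matching and on spark.sql.sources.bucketing.enabled (true by default); it does not change HDFS block placement, which was the original question.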