hdfs put/moveFromLocal not distributing data across data nodes?

Question

I found a similar question, "Hadoop HDFS is not distributing blocks of data evenly", but my question is about the case where the replication factor is 1.

I still want to understand why HDFS does not distribute file blocks evenly across the cluster nodes. This causes data skew from the start when I load such files and run DataFrame operations on them. Am I missing something?

OneCricketeer · Accepted Answer · 2019-12-17T01:38:39

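Even with a replication factor of 1, a file is still split into blocks of at most the HDFS block size (`dfs.blocksize`), and each block is placed independently. Block placement is best effort, AFAIK, not strictly balanced. With the default placement policy, the first replica goes to the local DataNode when the client writing the file runs on a DataNode, and to a random node otherwise. With a replication factor of 3, the second replica then goes to a node on a different rack and the third to another node on that same remote rack. So with a replication factor of 1, running `put` or `moveFromLocal` on a DataNode tends to leave every block on that one node, while running it from a machine outside the cluster spreads the blocks more randomly.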
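
To make the block arithmetic concrete, here is a minimal sketch of how a file is cut into blocks. It assumes the default block size of 128 MiB (`dfs.blocksize` in Hadoop 2 and later); the helper name `block_sizes` is made up for illustration and is not part of any Hadoop API.

```python
MIB = 1024 * 1024

def block_sizes(file_bytes, block_bytes=128 * MIB):
    """Return the sizes of the blocks a file of file_bytes would be cut into.

    Assumption: block_bytes matches the cluster's dfs.blocksize
    (128 MiB is the default in Hadoop 2 and later).
    """
    if file_bytes < 0 or block_bytes <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    full_blocks, remainder = divmod(file_bytes, block_bytes)
    sizes = [block_bytes] * full_blocks
    if remainder:
        # The last block only holds the leftover bytes.
        sizes.append(remainder)
    return sizes

# A 300 MiB file becomes three blocks: 128 MiB, 128 MiB and 44 MiB.
print([s // MIB for s in block_sizes(300 * MIB)])

# A 100 MiB file fits in a single block, so with a replication
# factor of 1 it can only ever live on one DataNode.
print(len(block_sizes(100 * MIB)))
```

A file smaller than one block cannot be spread across nodes at all, which is why the file size matters for this question.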

You'll need to clarify how large your files are and where you are looking to check whether the data is being split.
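
One place to look is the output of `hdfs fsck` with the `-files -blocks -locations` options, which lists each block of a file and the DataNodes holding it. The sketch below only wraps that command; the function names are made up for illustration, and it assumes the `hdfs` CLI is on the `PATH` of the machine where it runs.

```python
import subprocess

def fsck_command(path):
    """Build the hdfs fsck command that lists a file's blocks and their locations."""
    return ["hdfs", "fsck", path, "-files", "-blocks", "-locations"]

def run_fsck(path):
    """Run the command and return its raw text output.

    Assumption: the hdfs CLI is installed and configured for the cluster.
    The exit code is not checked here; inspect the output for the report.
    """
    result = subprocess.run(fsck_command(path), capture_output=True, text=True)
    return result.stdout
```

For example, `print(run_fsck("/user/me/data.csv"))` (a hypothetical path) would print the block report for that file, so you can see whether its blocks sit on one DataNode or several.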

Note: not all file formats are splittable.

