2 votes

Suppose we have an uncompressed file stored as 320 HDFS blocks on a 16-data-node cluster, so each node holds 20 blocks. We use Spark to read this file into an RDD without explicitly passing numPartitions:

textFile = sc.textFile("hdfs://input/war-and-peace.txt")

If we have 16 executors, one on each node, how many partitions will the Spark RDD create per executor? Will it create one partition per HDFS block, i.e. 20 partitions per executor?


1 Answer

3 votes

If your file occupies 320 HDFS blocks, then the following code will create an RDD with 320 partitions:

val textFile = sc.textFile("hdfs://input/war-and-peace.txt")

The textFile() method produces an RDD with one partition for each HDFS block of the file (the optional minPartitions argument can only raise that number, never lower it). So the answer to "how many per executor" is not a fixed setting: there are 320 partitions in total, and since Spark schedules tasks for data locality across your 16 executors, each executor will typically end up processing about 20 of them.
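As a quick sanity check (a minimal sketch, reusing the file path from the question), you can inspect an RDD's partition count via partitions.length:

// Default: one partition per HDFS block
val textFile = sc.textFile("hdfs://input/war-and-peace.txt")
println(textFile.partitions.length) // expected: 320 for a 320-block file

// minPartitions acts only as a lower bound: blocks may be split into
// more partitions, but never merged into fewer
val finer = sc.textFile("hdfs://input/war-and-peace.txt", minPartitions = 640)
println(finer.partitions.length) // typically at least 640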

You may also look into this related question, which may resolve your queries about partitioning.