Spark RDD partitions vs. Hadoop Splits

Question

I am having a hard time understanding the difference between RDD partitions and HDFS input splits. Consider what happens when you submit a Spark application:

When the Spark application wants to read from HDFS, the file on HDFS will have input splits (let's say 64 MB each, with each of these input splits residing on a different data node).

Now let's say the Spark application loads that file from HDFS using sc.textFile(PATH_IN_HDFS). The file is about 256 MB and has 4 input splits, where 2 of the splits are on data node 1 and the other 2 are on data node 2.

Now when Spark loads this 256 MB into its RDD abstraction, will it load each of the input splits (64 MB) into 4 separate RDDs (so that you have 2 RDDs with 64 MB of data on data node 1 and the other 2 RDDs with 64 MB of data on data node 2)? Or will the RDD further partition those Hadoop input splits? Also, how will these partitions be redistributed then? I do not understand whether there is a correlation between the RDD partitions and the HDFS input splits.

ozw1z5rd · Accepted Answer · 2016-10-08T20:36:47

I'm pretty new to Spark, but splits are strictly related to MapReduce jobs. Spark loads the data into memory in a distributed fashion, and which machines load the data can depend on where the data is (read: it depends somewhat on where the data blocks are, which is very close to the idea of a split). Spark's APIs let you think in terms of RDDs rather than splits. You work on the RDD; how the data is distributed within the RDD is no longer the programmer's problem. Under Spark, your whole dataset is called an RDD.
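To make the "very close to the split idea" point concrete for the 256 MB example in the question, here is a rough sketch in plain Python (no Spark needed) of the split-size arithmetic that Hadoop's FileInputFormat applies, which sc.textFile goes through when it reads from HDFS. Each resulting split then becomes one partition of a single RDD, not a separate RDD. The function name compute_split_sizes and its parameters are made up for illustration, and the sketch assumes one uncompressed, splittable file, a minimum split size of 1 byte, and the usual textFile default of 2 minimum partitions. It is a simplification, not Spark's actual code.

```python
MB = 1024 * 1024
SPLIT_SLOP = 1.1  # Hadoop lets the final split run up to 10% over the split size


def compute_split_sizes(file_size, block_size, min_partitions=2, min_split_size=1):
    """Return the byte length of each split for one splittable file.

    Hypothetical helper that mirrors the arithmetic only; the defaults
    are assumptions (2 minimum partitions, 1-byte minimum split size).
    """
    # The "goal" size is the file size divided by the requested partition count.
    goal_size = file_size // max(min_partitions, 1)
    # A split never exceeds one block unless the minimum split size forces it.
    split_size = max(min_split_size, min(goal_size, block_size))

    splits = []
    remaining = file_size
    while remaining / split_size > SPLIT_SLOP:
        splits.append(split_size)
        remaining -= split_size
    if remaining > 0:
        splits.append(remaining)  # whatever is left becomes the last split
    return splits


# The question's example: a 256 MB file stored in 64 MB blocks.
splits = compute_split_sizes(256 * MB, 64 * MB)
print(len(splits), [s // MB for s in splits])  # prints: 4 [64, 64, 64, 64]
```

Under these assumptions the file ends up as 4 partitions of 64 MB, one per HDFS block, and asking for more partitions (for example min_partitions=8) cuts the blocks further into 32 MB pieces. On a real cluster you can check the actual count with rdd.getNumPartitions() in PySpark; each partition also records which hosts hold its block, and the scheduler tries to run the corresponding task there, so nothing has to be redistributed just to load the file.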
