I have an HDFS folder with two 250 MB Parquet files. The HDFS block size is set to 128 MB. I have the following code:
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

JavaSparkContext sparkContext = new JavaSparkContext();
SQLContext sqlContext = new SQLContext(sparkContext);
// Read every Parquet file under the folder into one DataFrame
DataFrame dataFrame = sqlContext.read().parquet("hdfs:////user/test/parquet-folder");
LOGGER.info("Nr. of rdd partitions: {}", dataFrame.rdd().getNumPartitions());
sparkContext.close();
I run it on the cluster with spark.executor.instances=3 and spark.executor.cores=4. I can see that reading the Parquet files is split across 3 executors × 4 cores = 12 tasks:
spark.SparkContext: Starting job: parquet at VerySimpleJob.java:25
scheduler.DAGScheduler: Got job 0 (parquet at VerySimpleJob.java:25) with 12 output partitions
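For reference, the executor settings above are the only Spark options I change. In my job they are passed via spark-submit, but expressed programmatically they would look roughly like this (the app name is just a placeholder):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
        .setAppName("VerySimpleJob")           // placeholder app name
        .set("spark.executor.instances", "3")  // 3 executors
        .set("spark.executor.cores", "4");     // 4 cores per executor
JavaSparkContext sparkContext = new JavaSparkContext(conf);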
However, when I get the DataFrame's RDD via rdd() (or create one with toJavaRDD()), I get only 4 partitions. Is this controlled by the HDFS block size, i.e. 2 blocks per file, hence 4 partitions?
Why exactly doesn't this match the number of partitions from the parquet (parent?) operation?