
I have a question about the number of partitions of a Spark DataFrame.

Suppose I have a Hive table (employee) with columns (name, age, id, location):

CREATE TABLE employee (name String, age String, id Int) PARTITIONED BY (location String);

The employee table has 10 different locations, so the data will be split into 10 partition directories in HDFS.
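For example, the data in HDFS would be laid out in one subdirectory per location, something like this (the warehouse path and location values are hypothetical):

/user/hive/warehouse/employee/location=NY/000000_0
/user/hive/warehouse/employee/location=LA/000000_0
...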

If I create a Spark DataFrame (df) by reading all of the data from the Hive table (employee),

how many partitions will Spark create for the DataFrame (df)?

df.rdd.partitions.size = ??
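Here is a minimal sketch of what I mean (assuming a SparkSession named spark created with .enableHiveSupport(), and the employee table registered in the Hive metastore):

// Minimal sketch; `spark` is a SparkSession with Hive support enabled.
val df = spark.table("employee")   // read the whole Hive table

// Number of partitions of the DataFrame's underlying RDD
println(df.rdd.partitions.size)    // equivalently: df.rdd.getNumPartitions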


1 Answer


The number of DataFrame partitions is determined by the HDFS block size, not by the number of Hive partitions.

Imagine you read the 10 Hive partitions as a single RDD. If the HDFS block size is 128 MB, then roughly:

number of partitions = (total size of the 10 partitions in MB) / 128 MB

More precisely, input splits are computed per file, so for splittable formats the actual count is the sum of ceil(file size / 128 MB) over all of the table's data files.
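As a rough sanity check, you can estimate the count yourself by summing the sizes of the table's data files and dividing by the block size. A sketch, assuming a SparkSession named spark and a hypothetical warehouse path:

// Estimate the partition count from the HDFS file sizes.
// The warehouse path is hypothetical; adjust it for your cluster.
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val tableDir = new Path("/user/hive/warehouse/employee")

// One subdirectory per location partition; sum the data file sizes.
val totalBytes = fs.listStatus(tableDir)
  .flatMap(dir => fs.listStatus(dir.getPath))
  .map(_.getLen)
  .sum

val blockSize = 128L * 1024 * 1024               // assume 128 MB blocks
val estimate = math.ceil(totalBytes.toDouble / blockSize).toInt
println(estimate)   // compare with df.rdd.partitions.size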

Please refer to the following link:

http://www.bigsynapse.com/spark-input-output