9
votes

Can anyone explain how the number of partitions is determined when a Spark DataFrame is created?

I know that for an RDD, we can specify the number of partitions at creation time, like below.

val RDD1 = sc.textFile("path" , 6) 

But for a Spark DataFrame, it looks like there is no option to specify the number of partitions at creation time, as there is for an RDD.

The only possibility I can think of is to use the repartition API after creating the DataFrame:

df.repartition(4)

So, can anyone please let me know if we can specify the number of partitions while creating a DataFrame?


2 Answers

12
votes

You cannot, or at least not in the general case, but it is not that different from an RDD. For example, the textFile code you've provided sets only a lower bound on the number of partitions.
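
For illustration, a minimal sketch (the path is a placeholder): the second argument is only a lower bound, and the actual partition count depends on the input splits.

val rdd = sc.textFile("path", 6)   // at least 6 partitions
println(rdd.getNumPartitions)      // may be higher than 6 for a large input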

In general:

  • Datasets generated locally, using methods like range or toDF on a local collection, use spark.default.parallelism.
  • Datasets created from an RDD inherit the number of partitions from the parent RDD.
  • Datasets created using the data source API get their partitioning from the source (for file-based sources, the input split configuration).
  • Some data sources provide additional options that give more control over partitioning. For example, the JDBC source lets you set the partitioning column, the value range, and the desired number of partitions (see the sketch below).
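
For illustration, a minimal sketch of the JDBC case; the URL, table name, column, and bounds are placeholders, not from the original question:

val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")   // placeholder connection URL
  .option("dbtable", "schema.table")                 // placeholder table
  .option("partitionColumn", "id")                   // numeric column used to split the reads
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")                      // the resulting DataFrame is read in 8 partitions
  .load()

println(jdbcDF.rdd.getNumPartitions)                 // 8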
0
votes

The default number of shuffle partitions for a Spark DataFrame is 200 (spark.sql.shuffle.partitions).

The default number of partitions for an RDD depends on spark.default.parallelism (10 in my environment).
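
For illustration, a minimal sketch of how to inspect and override these defaults, assuming a SparkSession named spark:

// Shuffle partitions used by DataFrame/Dataset wide transformations (default: 200)
println(spark.conf.get("spark.sql.shuffle.partitions"))
spark.conf.set("spark.sql.shuffle.partitions", "50")

// Default parallelism used for RDDs and locally generated Datasets
println(spark.sparkContext.defaultParallelism)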