9
votes

Can anyone explain how the number of partitions is determined when a Spark DataFrame is created?

I know that for an RDD, we can specify the number of partitions at creation time, like below.

val RDD1 = sc.textFile("path" , 6) 

But for a Spark DataFrame, it looks like there is no option to specify the number of partitions at creation time, as there is for an RDD.

The only possibility I can think of is to use the repartition API after creating the DataFrame:

df.repartition(4)

So, can anyone please let me know if we can specify the number of partitions while creating a DataFrame?


2 Answers

12
votes

You cannot, or at least not in the general case, but it is not that different from an RDD. For example, the textFile code you've provided sets only a lower bound on the number of partitions.
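
For illustration, a minimal sketch (the path is a placeholder): the second argument is only a lower bound, and the actual partition count depends on the input splits.

val rdd = sc.textFile("path", 6)   // at least 6 partitions
println(rdd.getNumPartitions)      // may be higher than 6 for a large input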

In general:

  • Datasets generated locally, using methods like range or toDF on a local collection, use spark.default.parallelism.
  • Datasets created from an RDD inherit the number of partitions from the parent RDD.
  • Datasets created using the data source API get their partitioning from the source (for file-based sources, the input split configuration).
  • Some data sources provide additional options that give more control over partitioning. For example, the JDBC source lets you set the partitioning column, the value range, and the desired number of partitions (see the sketch below).
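
For illustration, a minimal sketch of the JDBC case; the URL, table name, column, and bounds are placeholders, not from the original question:

val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")   // placeholder connection URL
  .option("dbtable", "schema.table")                 // placeholder table
  .option("partitionColumn", "id")                   // numeric column used to split the reads
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")                      // the resulting DataFrame is read in 8 partitions
  .load()

println(jdbcDF.rdd.getNumPartitions)                 // 8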
0
votes

The default number of shuffle partitions for a Spark DataFrame is 200 (spark.sql.shuffle.partitions).

The default number of partitions for an RDD depends on spark.default.parallelism (10 in my environment).
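
For illustration, a minimal sketch of how to inspect and override these defaults, assuming a SparkSession named spark:

// Shuffle partitions used by DataFrame/Dataset wide transformations (default: 200)
println(spark.conf.get("spark.sql.shuffle.partitions"))
spark.conf.set("spark.sql.shuffle.partitions", "50")

// Default parallelism used for RDDs and locally generated Datasets
println(spark.sparkContext.defaultParallelism)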