21 votes

The Spark Programming Guide mentions slices as a feature of RDDs (whether parallelized collections or Hadoop datasets): "Spark will run one task for each slice of the cluster." But the section on RDD persistence uses the concept of partitions without introducing it. Also, the RDD docs mention only partitions, with no mention of slices, while the SparkContext docs mention slices for creating RDDs but partitions for running jobs on RDDs. Are these two concepts the same? If not, how do they differ?
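For concreteness, here is a minimal PySpark sketch of where the two terms show up (it assumes a throwaway local SparkContext, and the partition sizes in the comments are illustrative): `parallelize` asks for `numSlices`, yet the resulting RDD only reports partitions.

```python
from pyspark import SparkContext

# Throwaway local context just for this sketch.
sc = SparkContext("local[4]", "slices-vs-partitions")

# Creating the RDD talks about "slices"...
rdd = sc.parallelize(range(100), numSlices=8)

# ...but the RDD API only talks about "partitions", and the numbers match.
print(rdd.getNumPartitions())  # 8

# glom() groups elements by partition, so you can see how the slices were laid out.
print([len(part) for part in rdd.glom().collect()])  # e.g. [12, 13, 12, 13, 12, 13, 12, 13]

sc.stop()
```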

Tuning - Level of Parallelism states that "Spark automatically sets the number of “map” tasks to run on each file according to its size ... and for distributed “reduce” operations, such as groupByKey and reduceByKey, it uses the largest parent RDD’s number of partitions. You can pass the level of parallelism as a second argument...." Does this explain the difference between partitions and slices? That is, partitions relate to RDD storage, slices relate to the degree of parallelism, and by default slices are calculated from either the data size or the number of partitions?
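As a concrete check of that Tuning-guide behaviour, here is another small PySpark sketch (it assumes a local context with `spark.default.parallelism` left unset; the partition counts are illustrative):

```python
from pyspark import SparkContext

# Throwaway local context; spark.default.parallelism is deliberately left unset,
# which matters for the default partition count of reduceByKey below.
sc = SparkContext("local[4]", "level-of-parallelism")

pairs = sc.parallelize([(i % 10, i) for i in range(1000)], numSlices=16)

# With no explicit argument, reduceByKey keeps the parent RDD's partition count.
sums = pairs.reduceByKey(lambda a, b: a + b)
print(sums.getNumPartitions())  # 16

# The level of parallelism can be passed explicitly as the second argument.
sums4 = pairs.reduceByKey(lambda a, b: a + b, 4)
print(sums4.getNumPartitions())  # 4

sc.stop()
```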

I am pretty sure they are the same and it is just inconsistent naming. I filed a bug: issues.apache.org/jira/browse/SPARK-1701 – Daniel Darabos
@DanielDarabos PySpark's parallelize still refers to numSlices; is this an edge case? – Chris Snow
Changing code is trickier than changing documentation. There is probably a bunch of code containing sc.parallelize(c, numSlices=100). That would break if the argument were renamed. – Daniel Darabos

1 Answer

17 votes

They are the same thing. The documentation has been fixed for Spark 1.2 thanks to Matthew Farrellee. More details in the bug: https://issues.apache.org/jira/browse/SPARK-1701
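
To see the unified naming in practice, here is a small PySpark sketch (assuming a local SparkContext and a throwaway file under /tmp): on the Hadoop-dataset side, the hint you pass when reading a file is called minPartitions, and what you get back is, again, just the RDD's partitions.

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "partitions-from-a-file")

# Write a tiny input file so the sketch runs end to end.
with open("/tmp/slices_example.txt", "w") as f:
    f.write("\n".join(str(i) for i in range(1000)))

# The hint for file-based RDDs is called minPartitions (a minimum, not an exact count)...
lines = sc.textFile("/tmp/slices_example.txt", minPartitions=4)

# ...and the result is described in terms of partitions.
print(lines.getNumPartitions())  # 4 for a file this small

sc.stop()
```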