
I am creating an RDD (Spark 1.6) from a text file and specifying the number of partitions, but Spark gives me a different number of partitions than the one I specified.

Case 1

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 1)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[50] at textFile at <console>:27

scala> people.getNumPartitions
res36: Int = 1

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 2)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[52] at textFile at <console>:27

scala> people.getNumPartitions
res37: Int = 2

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 3)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[54] at textFile at <console>:27

scala> people.getNumPartitions
res38: Int = 3

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 4)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[56] at textFile at <console>:27

scala> people.getNumPartitions
res39: Int = 4

Case 2

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 0)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[72] at textFile at <console>:27

scala> people.getNumPartitions
res47: Int = 1

Case 3

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 5)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[58] at textFile at <console>:27

scala> people.getNumPartitions
res40: Int = 6

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 6)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[60] at textFile at <console>:27

scala> people.getNumPartitions
res41: Int = 7

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 7)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[62] at textFile at <console>:27

scala> people.getNumPartitions
res42: Int = 8

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 8)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[64] at textFile at <console>:27

scala> people.getNumPartitions
res43: Int = 9

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 10)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[68] at textFile at <console>:27

scala> people.getNumPartitions
res45: Int = 11

Case 4

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 9)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[66] at textFile at <console>:27

scala> people.getNumPartitions
res44: Int = 11

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 11)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[70] at textFile at <console>:27

scala> people.getNumPartitions
res46: Int = 13

Contents of the file /home/pvikash/data/test.txt are:

This is a test file. Will be used for the rdd partition

On the basis of the above cases, I have a few questions.

  1. For Case 2, the explicitly specified number of partitions is 0, but the actual number of partitions is 1 (and the default minimum number of partitions is 2). Why is the actual number of partitions 1?
  2. For Case 3, why is the actual number of partitions one more than the specified number of partitions?
  3. For Case 4, why is the actual number of partitions two more than the specified number of partitions?
  4. Why does Spark behave differently in Case 1, Case 2, Case 3, and Case 4?
  5. When the input data is small (and easily fits into a single partition), why does Spark create empty partitions?

Any explanation would be appreciated.

stackoverflow.com/questions/24871044/… Normally people ask the other way round: small-files issues with rounding. – thebluephantom

1 Answer


Not a full answer, but it might get you closer to it.

The number you are passing in is the minPartitions argument (called minSplits in older Spark releases). It only sets a lower bound on the number of partitions, that is all.

def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]

The actual number of splits is governed by Hadoop's FileInputFormat.getSplits method (docs), which Spark calls when it builds the HadoopRDD behind textFile.
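
To make that concrete, here is a rough Scala sketch (my own illustration, not code from Spark or Hadoop) of the split arithmetic that the old FileInputFormat.getSplits applies to a single uncompressed file. The goalSize/splitSize names and the 1.1 slop factor follow the Hadoop implementation as far as I know; the 57-byte file length is only an assumption, so the exact counts it prints may not match your transcript byte for byte.

object SplitArithmeticSketch {
  // A trailing chunk may be up to 10% larger than splitSize before it is split again.
  val SplitSlop = 1.1

  // Roughly how FileInputFormat.getSplits carves one small file into splits.
  // Ignores block locations, compression, multiple files, etc.
  def numSplits(fileLength: Long, requestedMinPartitions: Int,
                blockSize: Long = 128L * 1024 * 1024, minSize: Long = 1L): Int = {
    val goalSize  = fileLength / math.max(requestedMinPartitions, 1) // 0 behaves like 1
    val splitSize = math.max(minSize, math.min(goalSize, blockSize))
    var remaining = fileLength
    var splits    = 0
    while (remaining.toDouble / splitSize > SplitSlop) {
      splits += 1
      remaining -= splitSize
    }
    if (remaining != 0) splits += 1 // leftover bytes become one extra split
    splits
  }

  def main(args: Array[String]): Unit = {
    // Assuming the test file is about 57 bytes (one short line plus a newline).
    Seq(0, 1, 2, 3, 4, 5).foreach { n =>
      println(s"requested min partitions = $n -> splits = ${numSplits(57L, n)}")
    }
  }
}

Because goalSize is an integer division of the file length, a tiny remainder is often left over, and that remainder becomes an extra split; that is where the +1 (and, with the slop factor, sometimes +2) comes from.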

And this SO post should answer question 5.
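
Regarding the empty partitions, a quick way to see how the records actually land (my own check, not from the linked post) is to count the records per partition in the same shell:

// In spark-shell: number of records in each partition of the RDD above.
// Most entries will be 0 when a one-line file is forced into many partitions.
people.glom().map(_.length).collect()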