I am creating an RDD (Spark 1.6) from a text file, specifying the number of partitions, but in some cases the actual number of partitions differs from the one I specified.
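For reference, the overload I am calling looks like this in the Spark 1.6 SparkContext API (quoted from my reading of the docs; note that the second parameter is actually named minPartitions):

def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]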
Case 1
scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 1)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[50] at textFile at <console>:27
scala> people.getNumPartitions
res36: Int = 1
scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 2)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[52] at textFile at <console>:27
scala> people.getNumPartitions
res37: Int = 2
scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 3)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[54] at textFile at <console>:27
scala> people.getNumPartitions
res38: Int = 3
scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 4)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[56] at textFile at <console>:27
scala> people.getNumPartitions
res39: Int = 4
Case 2
scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 0)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[72] at textFile at <console>:27
scala> people.getNumPartitions
res47: Int = 1
Case 3
scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 5)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[58] at textFile at <console>:27
scala> people.getNumPartitions
res40: Int = 6
scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 6)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[60] at textFile at <console>:27
scala> people.getNumPartitions
res41: Int = 7
scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 7)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[62] at textFile at <console>:27
scala> people.getNumPartitions
res42: Int = 8
scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 8)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[64] at textFile at <console>:27
scala> people.getNumPartitions
res43: Int = 9
scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 10)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[68] at textFile at <console>:27
scala> people.getNumPartitions
res45: Int = 11
Case 4
scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 9)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[66] at textFile at <console>:27
scala> people.getNumPartitions
res44: Int = 11
scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 11)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[70] at textFile at <console>:27
scala> people.getNumPartitions
res46: Int = 13
The contents of the file /home/pvikash/data/test.txt are:
This is a test file. Will be used for the rdd partition
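For reference, the relevant defaults can be printed directly in the same shell; defaultParallelism and defaultMinPartitions are both standard SparkContext members (the second comment is my understanding of how that value is derived in 1.6):

// Inspect the shell's defaults before comparing against the cases above.
println(sc.defaultParallelism)   // depends on the master, e.g. number of local cores
println(sc.defaultMinPartitions) // defined as math.min(defaultParallelism, 2) in 1.6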
On the basis of the cases above, I have a few questions.
- In Case 2, the explicitly specified number of partitions is 0, but the actual number of partitions is 1 (even though the default minimum number of partitions is 2). Why is the actual number of partitions 1?
- In Case 3, why is the actual number of partitions one more than the specified number?
- In Case 4, why is the actual number of partitions two more than the specified number?
- Why does Spark behave differently in Case 1, Case 2, Case 3, and Case 4?
- When the input data is small (and could easily fit into a single partition), why does Spark create empty partitions? (I confirm that the extra partitions are empty with the check shown after this list.)
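For the last question, this is the check I use to confirm that some of the extra partitions are empty; mapPartitionsWithIndex is a standard RDD method, so the sketch below should print a per-partition record count:

// Recreate the 5-partition case and count the records in each partition.
val people = sc.textFile("file:///home/pvikash/data/test.txt", 5)
val counts = people
  .mapPartitionsWithIndex { (idx, iter) => Iterator((idx, iter.size)) }
  .collect()
// Any partition reporting 0 records is empty.
counts.foreach { case (idx, n) => println(s"partition $idx -> $n record(s)") }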
Any explanation would be appreciated.