
I have the following code:

val data = sc.textFile("log1.txt,log2.txt")
val s = Seq(data)
val par = sc.parallelize(s)

The result I obtained is as follows:

WARN ParallelCollectionRDD: Spark does not support nested RDDs (see SPARK-5063)
par: org.apache.spark.rdd.RDD[org.apache.spark.rdd.RDD[String]] = ParallelCollectionRDD[2] at parallelize at <console>:28

Question 1

How does a ParallelCollectionRDD work?

Question 2

Can I iterate through the nested RDDs and perform transformations on them?

Question 3

RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.

What does this mean?


1 Answer


(An interesting case indeed)

When in doubt, I always recommend following the types in Scala (after all, types are why we Scala developers use the language in the first place, aren't they?)

So, let's reveal the types:

scala> val data = sc.textFile("log1.txt,log2.txt")
data: org.apache.spark.rdd.RDD[String] = log1.txt,log2.txt MapPartitionsRDD[1] at textFile at <console>:24

scala> val s = Seq(data)
s: Seq[org.apache.spark.rdd.RDD[String]] = List(log1.txt,log2.txt MapPartitionsRDD[1] at textFile at <console>:24)

scala> val par = sc.parallelize(s)
WARN ParallelCollectionRDD: Spark does not support nested RDDs (see SPARK-5063)
par: org.apache.spark.rdd.RDD[org.apache.spark.rdd.RDD[String]] = ParallelCollectionRDD[3] at parallelize at <console>:28

As the warning told you, org.apache.spark.rdd.RDD[org.apache.spark.rdd.RDD[String]] is not a supported case in Spark. It was nevertheless accepted by the Scala compiler, because the call matches the signature of the SparkContext.parallelize method...unfortunately.
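Regarding Question 3, the SPARK-5063 rule means you cannot run an action (like count) on one RDD from inside a transformation on another, because the inner action would execute on an executor, where no SparkContext exists. The usual workaround is to materialize the inner value on the driver first and capture only that plain value in the closure. A minimal sketch of the pattern, using plain Scala collections as stand-ins for RDDs (the method names map, filter and so on exist on both; on a real RDD you would call rdd2.count() where size is used below):

```scala
// Invalid on a cluster: rdd1.map(x => rdd2.count() * x)
// — the inner count() would run inside an executor task.
// Valid: compute the inner result on the driver first.
val rdd1 = Seq(1, 2, 3)            // stands in for an RDD[Int]
val rdd2 = Seq("a", "b")           // stands in for an RDD[String]

val n = rdd2.size                  // on a real RDD: rdd2.count(), executed on the driver
val scaled = rdd1.map(x => n * x)  // the closure captures only the plain value n
```

Because only the Long result crosses into the closure, the task sent to the executors is self-contained and serializable.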

You don't really need val s = Seq(data): the records of the two files log1.txt,log2.txt are already "inside" the single RDD, and Spark will process all the records in both files in a distributed, parallel manner (which I believe is your use case).
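In other words, sc.textFile("log1.txt,log2.txt") already yields one flat RDD[String] containing every line of both files, so there is nothing to wrap. And if you ever do start from separate RDDs, the way to combine them is union, not nesting. A small sketch, with Lists standing in for the RDDs (the example log lines are made up):

```scala
// Stand-ins for sc.textFile("log1.txt") and sc.textFile("log2.txt")
val log1 = List("INFO start", "ERROR disk full")
val log2 = List("ERROR timeout")

// On real RDDs: log1.union(log2), or sc.union(Seq(log1, log2))
val all = log1 ++ log2

// Ordinary transformations then apply to the combined, flat dataset
val errors = all.filter(_.contains("ERROR"))
```

The result of union is again a flat RDD[String], so every transformation and action works on it directly, with no nested-RDD warning.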

I think I've answered all three questions; they rest on the same false expectation, and so the answers are pretty much alike :)