0
votes

I have a pair RDD[String, String] where the key is a string and the value is HTML. I want to split this RDD into n RDDs based on its n keys and store them in HDFS.

htmlRDD = [key1,html
           key2,html
           key3,html
           key4,html
           ........] 

Split this RDD based on keys and store the HTML from each resulting RDD individually on HDFS. Why do I want to do that? When I try to store the HTML from the main RDD to HDFS, it takes a lot of time because some tasks are denied committing by the output coordinator. I'm doing this in Scala.

htmlRDD.saveAsHadoopFile("hdfs:///Path/", classOf[String], classOf[String], classOf[Formatter])
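For reference, the per-key split described above could be sketched like this, assuming the set of distinct keys is small and known up front (the key list and output paths here are hypothetical):

```scala
import org.apache.spark.rdd.RDD

// Sketch: split a pair RDD into one RDD per key and save each separately.
// Assumes a small, known set of keys; for many keys this launches one
// Spark job per key, which can be slow.
def saveByKey(htmlRDD: RDD[(String, String)], keys: Seq[String]): Unit = {
  keys.foreach { k =>
    htmlRDD
      .filter { case (key, _) => key == k } // keep only this key's records
      .values                               // drop the key, keep the HTML
      .saveAsTextFile(s"hdfs:///Path/$k")   // one output directory per key
  }
}
```

Note that each filter pass rescans the whole RDD, so calling htmlRDD.cache() first avoids recomputing it for every key.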
2
Just a wild guess: how many partitions and executors does the htmlRDD have? It might be that Spark simply overwhelms HDFS with write requests, but that would only happen if you have a lot of Spark executors. - evgenii

2 Answers

0
votes

You can also try this instead of breaking up the RDD:

htmlRDD.saveAsTextFile("hdfs://HOST:PORT/path/");

I tried this and it worked for me. I had an RDD[JSONObject] and it wrote the toString() of each JSON object just fine.

0
votes

Spark saves each RDD partition into one HDFS file partition. So to achieve good parallelism, your source RDD should have many partitions (the right number actually depends on the size of the whole data set). So I think you want not to split your RDD into several RDDs, but rather to have one RDD with many partitions. You can do that with repartition() or coalesce().
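A minimal sketch of that suggestion (the partition count is an arbitrary assumption; tune it to your data size and cluster):

```scala
import org.apache.spark.rdd.RDD

// Sketch: write one RDD with many partitions instead of many RDDs.
// Each partition becomes one part-* file in the output directory,
// so increasing partitions increases write parallelism.
def saveRepartitioned(htmlRDD: RDD[(String, String)], numParts: Int): Unit = {
  htmlRDD
    .repartition(numParts) // full shuffle; use coalesce(n) only to shrink
    .saveAsTextFile("hdfs:///Path/")
}
```

repartition() always shuffles, which is what you want to increase the partition count; coalesce() avoids a shuffle but can only reduce it.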