0
votes

I have a pair RDD[String, String] where the key is a string and the value is HTML. I want to split this RDD into n RDDs based on its n keys and store them in HDFS.

htmlRDD = [key1,html
           key2,html
           key3,html
           key4,html
           ........] 

Split this RDD based on keys and store the HTML from each resulting RDD individually on HDFS. Why do I want to do that? When I try to store the HTML from the main RDD to HDFS, it takes a lot of time because some tasks are denied committing by the output coordinator. I'm doing this in Scala.

htmlRDD.saveAsHadoopFile("hdfs:///Path/", classOf[String], classOf[String], classOf[Formatter])
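For reference, the per-key split described above could be sketched like this, assuming the set of distinct keys is small and known up front (the key list and output paths here are hypothetical):

```scala
import org.apache.spark.rdd.RDD

// Sketch: split a pair RDD into one RDD per key and save each separately.
// Assumes a small, known set of keys; for many keys this launches one
// Spark job per key, which can be slow.
def saveByKey(htmlRDD: RDD[(String, String)], keys: Seq[String]): Unit = {
  keys.foreach { k =>
    htmlRDD
      .filter { case (key, _) => key == k } // keep only this key's records
      .values                               // drop the key, keep the HTML
      .saveAsTextFile(s"hdfs:///Path/$k")   // one output directory per key
  }
}
```

Note that each filter pass rescans the whole RDD, so calling htmlRDD.cache() first avoids recomputing it for every key.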
2
Just a wild guess: how many partitions and executors does the htmlRDD have? It might be that Spark simply overwhelms HDFS with write requests, but that would only happen if you have a lot of Spark executors. - evgenii

2 Answers

0
votes

You can also try this instead of breaking up the RDD:

htmlRDD.saveAsTextFile("hdfs://HOST:PORT/path/");

I tried this and it worked for me. I had an RDD[JSONObject] and it wrote the toString() of each JSON object just fine.

0
votes

Spark saves each RDD partition into one HDFS file partition. So to achieve good parallelism, your source RDD should have many partitions (the right number actually depends on the size of the whole data set). So I think you want not to split your RDD into several RDDs, but rather to have one RDD with many partitions. You can do that with repartition() or coalesce().
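A minimal sketch of that suggestion (the partition count is an arbitrary assumption; tune it to your data size and cluster):

```scala
import org.apache.spark.rdd.RDD

// Sketch: write one RDD with many partitions instead of many RDDs.
// Each partition becomes one part-* file in the output directory,
// so increasing partitions increases write parallelism.
def saveRepartitioned(htmlRDD: RDD[(String, String)], numParts: Int): Unit = {
  htmlRDD
    .repartition(numParts) // full shuffle; use coalesce(n) only to shrink
    .saveAsTextFile("hdfs:///Path/")
}
```

repartition() always shuffles, which is what you want to increase the partition count; coalesce() avoids a shuffle but can only reduce it.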