31 votes

I am using Spark 1.0.1 to process a large amount of data. Each row contains an ID number, some with duplicate IDs. I want to save all the rows with the same ID number in the same location, but I am having trouble doing it efficiently. I create an RDD[(String, String)] of (ID number, data row) pairs:

val mapRdd = rdd.map { x => (x.split("\\t+")(1), x) }

A way that works, but is not performant, is to collect the ID numbers, filter the RDD for each ID, and save the RDD of values with the same ID as a text file.

val ids = mapRdd.keys.distinct.collect // keys is available on the (ID, row) pair RDD, not the raw rdd
ids.foreach { id =>
    val dataRows = mapRdd.filter(_._1 == id).values
    dataRows.saveAsTextFile(id)
}

I also tried groupByKey or reduceByKey so that each tuple in the RDD contains a unique ID number as the key and a string of the combined data rows, separated by newlines, for that ID number. I want to iterate through the RDD only once using foreach to save the data, but I can't turn the values into an RDD inside foreach:

groupedRdd.foreach({ tup =>
  val data = sc.parallelize(List(tup._2)) //nested RDD does not work
  data.saveAsTextFile(tup._1)
})

Essentially, I want to split an RDD into multiple RDDs by ID number and save the values for each ID number into their own location.

What's wrong with saving the file after it's grouped by ID? They won't necessarily each be in separate files, but they won't be split across files, and you can control the number of partitions you create, which should correspond to the number of files created. - aaronman
@aaronman That doesn't work because I need to split the original data source and store the data in separate locations based on ID number. Eventually, the data will be requested on demand by ID number, and it is a very large dataset. - smli
If you save it in the fashion I suggested, an RDD can definitely re-read the data and get data by user ID. Would that be an acceptable solution? - aaronman
I had to perform this same operation a few days ago and ran into the same problems as you. As far as I can tell, there is no way to group an RDD and then persist the values of that grouping without bringing them into memory on the driver. Have you considered the mailing list? If you find something, please update this question so we can get the details. - jhappoldt
@jhappoldt This is most definitely not the case; I think I'll just answer the question. - aaronman

3 Answers

13 votes

I think this problem is similar to Write to multiple outputs by key Spark - one Spark job

Please refer to the answer there.

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

import org.apache.spark._
import org.apache.spark.SparkContext._

class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateActualKey(key: Any, value: Any): Any = 
    NullWritable.get()

  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String = 
    key.asInstanceOf[String]
}

object Split {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Split" + args(1))
    val sc = new SparkContext(conf)
    sc.textFile("input/path")
      .map(a => (k, v)) // Your own implementation: produce (ID, row) pairs
      .partitionBy(new HashPartitioner(num))
      .saveAsHadoopFile("output/path", classOf[String], classOf[String],
        classOf[RDDMultipleTextOutputFormat])
    sc.stop()
  }
}

I just saw a similar answer above, but we don't actually need custom partitioning. MultipleTextOutputFormat will create a file for each key, so it is fine for multiple records with the same key to fall into the same partition.

new HashPartitioner(num), where num is the number of partitions you want. If you have a large number of distinct keys, you can set num to a large value so that each partition does not have to open too many HDFS file handles.
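
To adapt this to the question's data, the placeholder map would produce (ID number, data row) pairs, for example by splitting on tabs. A minimal sketch, with an illustrative partition count (the input path, output path, and numPartitions value are assumptions):

val numPartitions = 100 // assumption: tune to the number of distinct IDs and cluster size

sc.textFile("input/path")
  .map { line => (line.split("\\t+")(1), line) } // key = ID number, value = full row
  .partitionBy(new HashPartitioner(numPartitions))
  .saveAsHadoopFile("output/path", classOf[String], classOf[String],
    classOf[RDDMultipleTextOutputFormat])

Each distinct ID then gets its own output file under output/path, regardless of which partition its records land in.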

0 votes

You can call saveAsTextFile directly on the grouped RDD; it saves the data based on partitions. That is, if you have 4 distinct IDs and you set the grouped RDD's number of partitions to 4, Spark stores each partition's data in one file (so you can have one file per ID), and you can even see the data as iterables for each ID in the filesystem. A minimal sketch of that idea is shown below.
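
The sketch uses the mapRdd from the question and an illustrative partition count of 4; note that saveAsTextFile writes each element's string representation, so each line looks like (id,CompactBuffer(row1, row2, ...)):

val grouped = mapRdd.groupByKey(4)       // 4 partitions for 4 distinct IDs (illustrative)
grouped.saveAsTextFile("output/grouped") // one part-file per partition

The rows for a given ID stay together in one part-file, but the tuple/iterable formatting is included in the output, so you may need to parse it back out when reading.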

0 votes

This will save the data per user ID:

rdd.map { x => (x.split("\\t+")(1), x) } // (ID number, data row) pairs
  .groupByKey(numPartitions)
  .saveAsObjectFile("file")

If you need to retrieve the data again based on user id you can do something like

val userIdLookupTable = sc.objectFile[(String, Iterable[String])]("file").cache() // could use persist() if the data is too big for memory
val data = userIdLookupTable.lookup(id) // note this returns a sequence; in this case you can just take the first one

Note that there is no particular reason to save to a file in this case; I just did it since the OP asked for it. That said, saving to a file does allow you to load the RDD at any time after the initial grouping has been done.

One last thing: lookup is faster than a filter-based approach for accessing IDs, but if you're willing to go off a pull request from Spark, you can check out this answer for a faster approach.