
I'm writing some code with Scala & Spark and want to produce a CSV file from an RDD or a List[Row].

I want to process the 'ListRDD' data in parallel, so I expect the output to be split across more than one file.

val conf = new SparkConf().setAppName("Csv Application").setMaster("local[2]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val logFile = "data.csv"
val rawdf = sqlContext.read.format("com.databricks.spark.csv")....
val rowRDD = rawdf.map { row =>
  Row(
    row.getAs(myMap.ID).toString,
    row.getAs(myMap.Dept)
    .....
  )
}
val df = sqlContext.createDataFrame(rowRDD, mySchema)
val MapRDD = df.map { x => (x.getAs[String](myMap.ID), List(x)) }
val ListRDD = MapRDD.reduceByKey { (a: List[Row], b: List[Row]) => a ++ b }

myClass.myFunction(ListRDD)

In myClass:

def myFunction(ListRDD: RDD[(String, List[Row])]) = {
    var rows: RDD[Row]
    ListRDD.foreach( row => { 
        rows.add? gather? ( make(row._2)) // make( row._2) will return List[Row]
    })
    rows.saveAsFile(" path") // it's my final goal
  }

def make(list: List[Row]): List[Row] = {
    // data processing from List[Row]
}

I tried to make RDD data from the List with sc.parallelize(list), but somehow nothing works. Any idea how to make RDD-typed data from the make function?
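For what it's worth, the usual Spark idiom avoids mutating an RDD inside foreach (which cannot work: RDDs are immutable, and foreach runs on the executors, not the driver). Instead, you flatMap the grouped lists back into individual rows, e.g. `ListRDD.flatMap { case (_, rows) => make(rows) }.saveAsTextFile(path)`. Below is a minimal sketch of that transformation using plain Scala collections, so the shape is visible without a SparkContext; here `Row` is just a stand-in type and `make` is a hypothetical per-group processing step, not the real spark-sql `Row` or your actual function:

```scala
// Stand-in for org.apache.spark.sql.Row: each "row" is just a Seq of values.
type Row = Seq[Any]

// Hypothetical per-group processing, mirroring the asker's make():
// here it simply appends a marker value to every row.
def make(list: List[Row]): List[Row] =
  list.map(row => row :+ "processed")

// Grouped data, shaped like ListRDD: (key, List[Row])
val grouped: List[(String, List[Row])] = List(
  "a" -> List(Seq("a", 1), Seq("a", 2)),
  "b" -> List(Seq("b", 3))
)

// Process each group and flatten the groups back into individual rows.
// On a real RDD this is the same call: ListRDD.flatMap { case (_, rows) => make(rows) }
val flattened: List[Row] = grouped.flatMap { case (_, rows) => make(rows) }
// flattened holds 3 rows; in Spark you would then call saveAsTextFile on the result
```

The key point is that flatMap both applies the per-group function and un-nests the resulting lists in one transformation, so no mutable accumulator is needed.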

The question is very hard to understand: what's the expected output? For example, if myFunction receives an RDD with two records, each containing two Rows in the list - what would you like the output to be? These 4 Rows? If so - what's the purpose of using reduceByKey - is it used to perform some kind of sorting? Try clarifying the input and expected output and what exactly goes wrong ("somehow nothing works" doesn't tell much). – Tzach Zohar

1 Answer


If you want to make an RDD from a List[Row], here is a way to do so:

// Assuming list is your List[Row]
val newRDD: RDD[Row] = sc.makeRDD(list)

Note that sc.parallelize(list) works the same way; since list is a List[Row], the result is already typed as RDD[Row], so there is no need for toArray() or an RDD[Object] annotation.