
Task:

I have quite big input files (assume 50GB each) on HDFS. I need to sort them, store them somewhere (driver program / HDFS / something else?), and then iterate over them until a specific condition is met.

Questions:

How can I implement this most effectively?

Where should I save the sorted files? If in HDFS, how can I stream them to Spark? Will they be loaded in blocks?


1 Answer


Since your file is already in HDFS, read it from there directly and sort it with the code below. I am not sure exactly what kind of sorting you want, but this code sorts the whole dataset by the values in each line.

val data = sc.textFile("hdfs://user/AppMetaDataPayload.csv").map(line => line.split(","))

// Use this if you want to keep the sorted data in memory and process it
// from there. Caching makes the repeated iteration afterwards much faster.

val d1 = data.sortBy(_.mkString(",")) // global sort of rows by their comma-joined values
d1.cache()

// Use this if you want to save the sorted result to an HDFS path instead.
// Note that saveAsTextFile writes a directory of part files, not a single file.
data.sortBy(_.mkString(",")).saveAsTextFile("hdfs://user/result6.csv")
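For the "iterate until a specific condition is met" part: Spark's `RDD.toLocalIterator` streams the sorted data to the driver one partition at a time (roughly one HDFS block each), so the whole 50GB never has to fit in driver memory. Below is a minimal sketch of that scan; `firstMatching` and the sample data are hypothetical names for illustration, with a plain `Iterator[String]` standing in for `d1.toLocalIterator`.

```scala
// Scan a sorted stream of lines and stop at the first line that
// satisfies the condition; the rest of the iterator is never consumed.
def firstMatching(lines: Iterator[String], condition: String => Boolean): Option[String] =
  lines.find(condition)

// Stand-in for d1.toLocalIterator on a sorted dataset:
val sortedLines = Iterator("apple,1", "banana,2", "cherry,3")
val hit = firstMatching(sortedLines, _.startsWith("banana"))
println(hit) // Some(banana,2)
```

Because the sort already grouped the data, stopping at the first match this way avoids pulling later partitions to the driver at all.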

Hope this helps.