Task:
I have quite large input files (let's assume 50 GB each) on HDFS. I need to sort them, store the sorted result somewhere (the driver program? HDFS? something else?), and then iterate over it until a specific condition is met.
Questions:
How can I implement this most efficiently?
Where should I save the sorted files? If on HDFS, how can I stream them back into Spark? Will they be loaded block by block?
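To make the intended control flow concrete, here is a toy single-machine simulation of the pipeline I have in mind. It is plain Python with no Spark or HDFS; `external_sort`, the spilled chunk files, and `iterate_until` are stand-ins for the distributed sort, the persisted sorted output, and the streamed iteration I am asking about:

```python
import heapq
import os
import tempfile


def external_sort(values, chunk_size):
    """Sort `values` in memory-sized chunks, spill each sorted chunk
    to its own file, then lazily k-way merge the chunk files.
    (Stand-in for a distributed sort whose output is persisted.)"""
    tmpdir = tempfile.mkdtemp()
    paths = []
    chunk = []
    for v in values:
        chunk.append(v)
        if len(chunk) == chunk_size:
            paths.append(_spill(sorted(chunk), tmpdir, len(paths)))
            chunk = []
    if chunk:
        paths.append(_spill(sorted(chunk), tmpdir, len(paths)))
    # Merge the sorted chunk files lazily, one record at a time,
    # analogous to reading persisted sorted partitions in order.
    streams = [(int(line) for line in open(p)) for p in paths]
    return heapq.merge(*streams)


def _spill(sorted_chunk, tmpdir, idx):
    """Write one sorted chunk to its own file and return the path."""
    path = os.path.join(tmpdir, f"chunk-{idx}.txt")
    with open(path, "w") as f:
        f.writelines(f"{v}\n" for v in sorted_chunk)
    return path


def iterate_until(sorted_stream, stop_condition):
    """Consume the sorted stream until the condition is met."""
    taken = []
    for v in sorted_stream:
        taken.append(v)
        if stop_condition(v):
            break
    return taken
```

For example, sorting a shuffled list in chunks of 3 and stopping at the first value >= 5 consumes only the prefix of the sorted data, which is the behavior I would like to get over 50 GB files without loading everything at once.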