I'm using the Cloudera VM, a Linux terminal, and Spark version 1.6.0.
Let's say I have the following dataset:
priority, qty, sales (I'm not importing headers):
low,6,261.54
high,44,1012
low,1,240
high,25,2500
I can load it: val inputFile = sc.textFile("file:///home/cloudera/stat.txt")
I can sort by qty: inputFile.sortBy(x => x.split(",")(1).toInt, true).collect (I have to split first, since x(1) on a String would index a character, not a field)
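The indexing pitfall above is easy to check in plain Scala, without Spark: each element of an RDD loaded with textFile is a whole line, so applying an index to it selects a character. A minimal sketch using one row from the dataset:

```scala
// Each RDD element from textFile is one full line of text.
val row = "low,6,261.54"

// Indexing the String gives the second *character* of the line, not the qty field.
println(row(1))                  // prints: o

// Splitting on commas first gives the fields; index 1 is qty as a String.
println(row.split(",")(1))       // prints: 6

// Converting to Int makes the sort numeric, so e.g. 25 sorts after 6
// instead of "25" sorting before "6" lexicographically.
println(row.split(",")(1).toInt)
```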
but I want to place the low- and high-priority rows into two separate files.
Would that be a filter, a reduceByKey, or partitioning? How best could I do that? If I can get help with that, I think I can wrap my head around creating RDDs of priority & sales and qty & sales.
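The core operation here is a filter on the first field. A minimal sketch of the split logic on a plain Scala List (so it runs without Spark); the output paths in the comments are assumptions, not from the question:

```scala
// The same four rows as in the question.
val lines = List("low,6,261.54", "high,44,1012", "low,1,240", "high,25,2500")

// partition splits one collection into (matching, non-matching) in a single pass.
val (lowPriority, highPriority) = lines.partition(_.split(",")(0) == "low")

// With the RDD from the question, the equivalent would be two filters,
// each written to its own directory (paths here are hypothetical):
// inputFile.filter(_.split(",")(0) == "low").saveAsTextFile("file:///home/cloudera/low")
// inputFile.filter(_.split(",")(0) == "high").saveAsTextFile("file:///home/cloudera/high")

println(lowPriority)   // the two "low" rows
println(highPriority)  // the two "high" rows
```

Note that saveAsTextFile writes a directory of part files rather than a single file; with a small dataset, coalesce(1) before saving collapses it to one part file.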