I'm using the Cloudera VM, a Linux terminal, and Spark version 1.6.0.
Let's say I have the following dataset:
priority, qty, sales (I'm not importing headers):
low,6,261.54
high,44,1012
low,1,240
high,25,2500
I can load it: val inputFile = sc.textFile("file:///home/cloudera/stat.txt")
I can sort by qty: inputFile.sortBy(x => x.split(",")(1).toInt, true).collect (I have to split first, since x(1) on a String would index a character, not a field)
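The indexing pitfall above is easy to check in plain Scala, without Spark: each element of an RDD loaded with textFile is a whole line, so applying an index to it selects a character. A minimal sketch using one row from the dataset:

```scala
// Each RDD element from textFile is one full line of text.
val row = "low,6,261.54"

// Indexing the String gives the second *character* of the line, not the qty field.
println(row(1))                  // prints: o

// Splitting on commas first gives the fields; index 1 is qty as a String.
println(row.split(",")(1))       // prints: 6

// Converting to Int makes the sort numeric, so e.g. 25 sorts after 6
// instead of "25" sorting before "6" lexicographically.
println(row.split(",")(1).toInt)
```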
but I want to place the low- and high-priority rows into two separate files.
Would that be a filter, a reduceByKey, or partitioning? How best could I do that? If I can get help with that, I think I can wrap my head around creating RDDs of priority & sales and qty & sales.
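The core operation here is a filter on the first field. A minimal sketch of the split logic on a plain Scala List (so it runs without Spark); the output paths in the comments are assumptions, not from the question:

```scala
// The same four rows as in the question.
val lines = List("low,6,261.54", "high,44,1012", "low,1,240", "high,25,2500")

// partition splits one collection into (matching, non-matching) in a single pass.
val (lowPriority, highPriority) = lines.partition(_.split(",")(0) == "low")

// With the RDD from the question, the equivalent would be two filters,
// each written to its own directory (paths here are hypothetical):
// inputFile.filter(_.split(",")(0) == "low").saveAsTextFile("file:///home/cloudera/low")
// inputFile.filter(_.split(",")(0) == "high").saveAsTextFile("file:///home/cloudera/high")

println(lowPriority)   // the two "low" rows
println(highPriority)  // the two "high" rows
```

Note that saveAsTextFile writes a directory of part files rather than a single file; with a small dataset, coalesce(1) before saving collapses it to one part file.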