I am trying to write a simple Java Spark application that does the following.
Input data (CSV format): key1,key2,data1,data2
Basically, what I am trying to do here is:
First I map each line by key1, and then I do a groupByKey operation on that RDD.
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

JavaRDD<String> viewRdd = sc.textFile("testfile.csv", 1);
JavaPairRDD<String, String> customerIdToRecordRDD = viewRdd
        .mapToPair(w -> new Tuple2<>(w.split(",")[0], w)); // key1 is the first column
JavaPairRDD<String, Iterable<String>> groupedByKey1RDD = customerIdToRecordRDD.groupByKey();
System.out.println(groupedByKey1RDD.count());
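For the sample input further below, a quick way to inspect what groupedByKey1RDD holds would be something like this (a debug-only sketch; collect() pulls all groups to the driver, which is fine here only because the test file is tiny):

for (Tuple2<String, Iterable<String>> group : groupedByKey1RDD.collect()) {
    System.out.println(group._1() + " -> " + group._2());
}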
Now my problem is that I need to do an aggregateByKey with key2 on each group from groupedByKey1RDD. Is there any way to convert an Iterable into an RDD, or am I missing something here? I am new to this; any help will be appreciated.
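To clarify what I mean by converting an Iterable into an RDD: on the driver I could collect a group and parallelize it, roughly as sketched below. As far as I understand, this cannot be done inside a transformation (RDDs cannot be nested), and collecting defeats the point of Spark, which is exactly why I am stuck.

import java.util.ArrayList;
import java.util.List;

// Driver-side only: turn one group's Iterable into a List, then into an RDD.
Tuple2<String, Iterable<String>> firstGroup = groupedByKey1RDD.collect().get(0);
List<String> records = new ArrayList<>();
firstGroup._2().forEach(records::add);
JavaRDD<String> groupAsRdd = sc.parallelize(records); // not possible inside map/mapToPair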
Example input and expected output:
id_1,time0,10,10
id_2,time1,0,10
id_1,time1,11,10
id_1,time0,1,10
id_2,time1,10,10
The output is grouped by the 1st column and then aggregated by the 2nd column (the aggregation logic is simply to add column3 and column4):
id_1 : time0 : { sum1 : 11, sum2 : 20 },
       time1 : { sum1 : 11, sum2 : 10 }
id_2 : time1 : { sum1 : 10, sum2 : 20 }
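In case it helps, here is a minimal, self-contained sketch of the alternative I am considering: skip groupByKey entirely, key every record by the composite (key1, key2) pair, and reduceByKey to add column3 and column4. The class name CompositeKeySum and the local[*] master are placeholders I made up for this example; is this the idiomatic way, or is there a way to make the per-group RDD approach work?

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class CompositeKeySum {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("CompositeKeySum").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> viewRdd = sc.textFile("testfile.csv", 1);

        // Key each record by the composite (key1, key2) pair; the value is (data1, data2).
        JavaPairRDD<Tuple2<String, String>, Tuple2<Integer, Integer>> byCompositeKey =
                viewRdd.mapToPair(line -> {
                    String[] cols = line.split(",");
                    Tuple2<String, String> key = new Tuple2<>(cols[0], cols[1]);
                    Tuple2<Integer, Integer> value =
                            new Tuple2<>(Integer.parseInt(cols[2]), Integer.parseInt(cols[3]));
                    return new Tuple2<>(key, value);
                });

        // Sum column3 and column4 per (key1, key2) pair in one pass, no groupByKey needed.
        JavaPairRDD<Tuple2<String, String>, Tuple2<Integer, Integer>> sums =
                byCompositeKey.reduceByKey((a, b) ->
                        new Tuple2<>(a._1() + b._1(), a._2() + b._2()));

        // Prints lines like: id_1 : time0 : { sum1 : 11, sum2 : 20 }
        for (Tuple2<Tuple2<String, String>, Tuple2<Integer, Integer>> t : sums.collect()) {
            System.out.println(t._1()._1() + " : " + t._1()._2()
                    + " : { sum1 : " + t._2()._1() + ", sum2 : " + t._2()._2() + " }");
        }

        sc.close();
    }
}

If the nested shape of the expected output is required, I suppose the (key1, key2) sums could be grouped once more by key1 at the end, but I am not sure that is the right approach either.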