I have a Spark Java program where a groupByKey followed by a mapValues step returns a PairRDD whose value is an Iterable of all the input RDD values.
I have read that replacing groupByKey-with-mapValues by reduceByKey can give a performance gain, but I don't know how to apply reduceByKey to my problem here.
Specifically, I have an input pair RDD whose value is of type Tuple5. After the groupByKey and mapValues transformations, I need a key-value pair RDD where the value is an Iterable of the input values.
JavaPairRDD<Long, Tuple5<...>> inputRDD;
...
...
...
JavaPairRDD<Long, Iterable<Tuple5<...>>> groupedRDD = inputRDD
    .groupByKey()
    .mapValues(
        new Function<Iterable<Tuple5<...>>, Iterable<Tuple5<...>>>() {
            @Override
            public Iterable<Tuple5<...>> call(
                    Iterable<Tuple5<...>> v1)
                    throws Exception {
                /*
                Some steps here..
                */
                return mappedValue;
            }
        });
Is there a way to get the above transformation using reduceByKey?
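For reference, one way to express the same grouping with reduceByKey is to first wrap each value in a single-element list with mapValues, then merge the lists pairwise with reduceByKey. Note that for pure grouping this usually does not beat groupByKey, since every value still crosses the shuffle. Below is a minimal plain-Java sketch of that merge logic (simulating the per-key reduction with a HashMap; the class and method names are illustrative, not Spark API):

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReduceByKeySketch {

    // The reduce function you would pass to reduceByKey:
    // merges two partial lists of values for the same key.
    static <V> List<V> mergeLists(List<V> a, List<V> b) {
        List<V> merged = new ArrayList<>(a);
        merged.addAll(b);
        return merged;
    }

    // Plain-Java simulation of:
    //   inputRDD.mapValues(v -> singletonList(v))
    //           .reduceByKey(ReduceByKeySketch::mergeLists)
    static <K, V> Map<K, List<V>> groupViaReduce(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> result = new HashMap<>();
        for (Map.Entry<K, V> p : pairs) {
            List<V> single = new ArrayList<>();
            single.add(p.getValue());                 // mapValues: v -> [v]
            result.merge(p.getKey(), single,
                    ReduceByKeySketch::mergeLists);   // reduceByKey: merge lists
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<Long, String>> input = Arrays.asList(
                new AbstractMap.SimpleEntry<>(1L, "a"),
                new AbstractMap.SimpleEntry<>(1L, "b"),
                new AbstractMap.SimpleEntry<>(2L, "c"));
        // Groups values per key: 1 -> [a, b], 2 -> [c]
        System.out.println(groupViaReduce(input));
    }
}
```

In Spark itself, combineByKey or aggregateByKey would be the more idiomatic fit for this pattern, since they let you create and grow the list directly instead of allocating a singleton list per record.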
Some steps here? You'll need a logic to reduce it with. - philantrovert
In the mapValues function I am actually sorting each value based on a key within Tuple5. I thought it wasn't relevant here, that's why I didn't include it. - Vishnu P N