I am trying to join two JavaPairRDDs using a full outer join. I also want to apply a filter (like a WHERE clause in SQL) and keep only one side of each joined record (either the left or the right value, based on some condition). I tried calling filter on the joined RDD, but filter cannot transform a record, so it cannot select just one side. mapToPair can do the transformation, but it does not let me drop records. Should I do a filter followed by a map (or vice versa), making two passes over the data? I would have expected the full outer join to expose filtering and mapping together.
JavaPairRDD<String, Tuple2<Optional<MyData>, Optional<MyData>>> fgrp = agrp.fullOuterJoin(agrp);
JavaPairRDD<String, MyData> outmap = fgrp.mapToPair(
    new PairFunction<Tuple2<String, Tuple2<Optional<MyData>, Optional<MyData>>>, String, MyData>()
    {
        @Override
        public Tuple2<String, MyData> call(Tuple2<String, Tuple2<Optional<MyData>, Optional<MyData>>> arg0) throws Exception
        {
            // obj1 and obj2 are placeholders for the key and the selected MyData value
            if ( /* based on some condition */ ) return new Tuple2<String, MyData>(obj1, obj2);
            else return null;
        }
    });
The null returned from mapToPair is still present in the resulting RDD. Is there a way to avoid this without doing an explicit filter?
Thanks, Srivatsan
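One common way to filter and map in a single function is a flat map that emits zero or one output per input record, so nothing (rather than null) is produced for records that should be dropped. JavaPairRDD has flatMapToPair for this; its PairFlatMapFunction returns an Iterable of Tuple2 in Spark 1.x and an Iterator in Spark 2.x. Note also that filter followed by mapToPair is normally pipelined within one stage, so chaining them should not by itself mean two passes over the data. The sketch below shows only the shape of the per-record logic using plain Java streams so that it runs without Spark. The Joined class, the String payload standing in for MyData, and the selection rule are all assumptions made for illustration, not part of the original code.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

class JoinSelectSketch {

    // Hypothetical stand-in for one full-outer-join record. In Spark this would be
    // Tuple2<String, Tuple2<Optional<MyData>, Optional<MyData>>>.
    static final class Joined {
        final String key;
        final Optional<String> left;
        final Optional<String> right;

        Joined(String key, Optional<String> left, Optional<String> right) {
            this.key = key;
            this.left = left;
            this.right = right;
        }
    }

    static Map.Entry<String, String> pair(String key, String value) {
        return new SimpleEntry<>(key, value);
    }

    // Emits zero or one pair per joined record.
    // Assumed rule (illustrative only): keep the left value if present, otherwise
    // keep a non-empty right value, otherwise drop the record.
    static List<Map.Entry<String, String>> selectOne(Joined j) {
        if (j.left.isPresent()) {
            return Collections.singletonList(pair(j.key, j.left.get()));
        }
        if (j.right.isPresent() && !j.right.get().isEmpty()) {
            return Collections.singletonList(pair(j.key, j.right.get()));
        }
        // An empty list means the record is filtered out; no null reaches the output.
        return Collections.emptyList();
    }

    // Stream.flatMap plays the role that flatMapToPair would play on the joined RDD.
    static List<Map.Entry<String, String>> run(List<Joined> joined) {
        return joined.stream()
                .flatMap(j -> selectOne(j).stream())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Joined> joined = Arrays.asList(
                new Joined("a", Optional.of("left-a"), Optional.of("right-a")),
                new Joined("b", Optional.<String>empty(), Optional.of("right-b")),
                new Joined("c", Optional.<String>empty(), Optional.of("")));
        System.out.println(run(joined)); // prints [a=left-a, b=right-b]
    }
}
```

In Spark the same selectOne-style logic would go inside the function passed to fgrp.flatMapToPair, returning the list itself (Spark 1.x) or its iterator() (Spark 2.x), with Tuple2 in place of Map.Entry.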