0
votes

This question has been asked before, but I could not understand the answers properly.

I have two RDDs with the same number of columns and the same number of records:

RDD1(col1,col2,col3)

and

RDD2(colA,colB,colC)

I need to join them as follows:

RDD_FINAL(col1,col2,col3,colA,colB,colC)

There is no key on which to join the records, but they are in order: the first record of RDD1 corresponds to the first record of RDD2.

2
An RDD doesn't preserve the internal order of rows. Frankly, your question is too broad to answer. Please make an effort to review your question! - eliasah
Also, please read how to ask a question on SO: stackoverflow.com/help/how-to-ask - eliasah
@eliasah Thank you for the response and the guidance. These two RDDs come from two different text files, and all I need is to combine their columns. One reason is to compare col1 with colA. I wanted to try a join operation, but my data set has no identifying key like a primary key in SQL. - Rouzbeh Zarandi

2 Answers

1
votes

You can use the zipWithIndex method to add each row's index as a key to both RDDs, and then join them on that key.
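To make the idea concrete without a Spark cluster, here is a minimal sketch in plain Java, where ordinary lists stand in for the two RDDs. The class name IndexJoinSketch and the helpers indexRows and joinByIndex are hypothetical names made up for this illustration; they are not part of Spark. It only shows the logic: give each row its position as a key, then inner-join on that key.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for the Spark approach: no Spark classes are used here.
class IndexJoinSketch {

    // Pair each row with its position, mimicking the (index, row) pairs
    // you would have after zipWithIndex plus a key/value swap.
    static <T> Map<Long, T> indexRows(List<T> rows) {
        Map<Long, T> indexed = new LinkedHashMap<>();
        long index = 0;
        for (T row : rows) {
            indexed.put(index++, row);
        }
        return indexed;
    }

    // Inner join on the index key, then flatten each matched pair into one wide row.
    static List<List<String>> joinByIndex(List<List<String>> left, List<List<String>> right) {
        Map<Long, List<String>> leftByIndex = indexRows(left);
        Map<Long, List<String>> rightByIndex = indexRows(right);
        List<List<String>> joined = new ArrayList<>();
        for (Map.Entry<Long, List<String>> entry : leftByIndex.entrySet()) {
            List<String> match = rightByIndex.get(entry.getKey());
            if (match != null) { // inner join: keep only indices present on both sides
                List<String> wide = new ArrayList<>(entry.getValue());
                wide.addAll(match);
                joined.add(wide);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows: (col1, col2, col3) and (colA, colB, colC)
        List<List<String>> rdd1 = List.of(List.of("a1", "a2", "a3"), List.of("b1", "b2", "b3"));
        List<List<String>> rdd2 = List.of(List.of("aA", "aB", "aC"), List.of("bA", "bB", "bC"));
        System.out.println(joinByIndex(rdd1, rdd2));
    }
}
```

Each output row holds the three columns of the first input followed by the three columns of the second, which matches the RDD_FINAL layout in the question. In Spark the join result is not guaranteed to come back in index order, so sort by the key afterwards if the order matters.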

1
votes

Adding code snippet for Alfilercio's example.

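import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;
import scala.Tuple3;

JavaRDD<Tuple3<col1,col2,col3>> rdd1 = ...
// zipWithIndex numbers rows by position; swap each pair so the index becomes the key
JavaPairRDD<Long, Tuple3<col1,col2,col3>> pairRdd1 = rdd1.zipWithIndex().mapToPair(pair -> new Tuple2<>(pair._2(), pair._1()));

JavaRDD<Tuple3<colA,colB,colC>> rdd2 = ...
JavaPairRDD<Long, Tuple3<colA,colB,colC>> pairRdd2 = rdd2.zipWithIndex().mapToPair(pair -> new Tuple2<>(pair._2(), pair._1()));

// Join on the index, then drop the key to keep only the paired rows
JavaRDD<Tuple2<Tuple3<col1, col2, col3>, Tuple3<colA,colB,colC>>> mappedRdd = pairRdd1.join(pairRdd2).map(pair -> pair._2());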