I am using Spark 1.5/1.6 and want to perform a reduceByKey-style operation on a DataFrame, without converting the DataFrame to an RDD.
Each row has the columns below, and there are multiple rows for each id1:
id1, id2, score, time
I want to have something like:
id1, [ (id21, score21, time21), (id22, score22, time22), (id23, score23, time23) ]
So, for each "id1", I want all of its records collected into a single list.
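To make the desired result concrete, here is a plain-Python sketch of the grouping I am after. This is not Spark code; the function name `group_by_id1` and the sample values are made up purely for illustration.

```python
from collections import defaultdict


def group_by_id1(rows):
    """Collect (id2, score, time) tuples into one list per id1."""
    grouped = defaultdict(list)
    for id1, id2, score, time in rows:
        grouped[id1].append((id2, score, time))
    return dict(grouped)


# Made-up sample rows in the form (id1, id2, score, time).
rows = [
    ("a", "x", 0.5, 100),
    ("a", "y", 0.7, 101),
    ("b", "z", 0.1, 102),
]

result = group_by_id1(rows)
# result == {"a": [("x", 0.5, 100), ("y", 0.7, 101)], "b": [("z", 0.1, 102)]}
```

What I am looking for is the DataFrame equivalent of this, producing one row per id1.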
By the way, the reason I don't want to convert the DataFrame to an RDD is that I have to join this (reduced) DataFrame to another DataFrame. I repartition on the join key, which makes the join faster, and I guess the same cannot be done with an RDD.
Any help will be appreciated.