I have a large RDD, call it RDD1, that is approximately 300 million rows after an initial filter. What I would like to do is take the ids from RDD1 and find all other instances of those ids in another large dataset, call it RDD2, which is approximately 3 billion rows. Both RDD1 and RDD2 are created by querying Parquet tables stored in Hive. The number of unique ids in RDD1 is approximately 10 million.
My current approach is to collect the ids, broadcast them, and then filter RDD2.
My question is: is there a more efficient way to do this, or is this best practice?
I have the following code -
hiveContext = HiveContext(sc)
RDD1 = hiveContext.sql("select * from table_1")
RDD2 = hiveContext.sql("select * from table_2")

ids = RDD1.map(lambda x: x[0]).distinct()  # This is approximately 10 million ids
ids = sc.broadcast(set(ids.collect()))

RDD2_filter = RDD2.rdd.filter(lambda x: x[0] in ids.value)
RDD1.map() operation? That's not very efficient. The two ids= statements are confusing. Why are you collecting it to the head-end, only to send it out as a broadcast variable? Have you tried it without the collect() and the sc.broadcast? – vy32