I have two Spark 1.4.1 PipelinedRDDs (I am not sure exactly what kind of object that is :-s):
1) a list of ids (the ids_alsaciens RDD)
2) a list of persons (the personnes RDD)
The personnes RDD contains records with 4 fields each, in a JSON-like format, the key being "id". There may be several lines for the same person in this RDD (i.e. with the same id).
I would like to fetch all the lines of the personnes RDD whose id is contained in ids_alsaciens.
How could I do that in Spark?
>type(ids_alsaciens)
pyspark.rdd.PipelinedRDD
>type(personnes)
pyspark.rdd.PipelinedRDD
>ids_alsaciens.take(10)
[u'1933992',
u'2705919',
u'2914684',
u'2915444',
u'11602833',
u'11801394',
u'10707371',
u'2018422',
u'2312432',
u'233375']
>personnes.take(3)
[{'date': '2013-06-03 00:00',
'field': 'WAID_INDIVIDU_WC_NUMNNI',
'id': '10000149',
'value': '2770278'},
{'date': '2013-05-15 00:00',
'field': 'WAID_INDIVIDU_WC_NUMNNI',
'id': '10009910',
'value': '2570631'},
{'date': '2013-03-01 00:00',
'field': 'WAID_INDIVIDU_WC_NUMNNI',
'id': '10014405',
'value': '1840288'}]
EDIT
Tried: personnes.filter(lambda x: x['id'] in ids_alsaciens)
Got this exception: Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
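For context, the SPARK-5063 error arises because the lambda captures the `ids_alsaciens` RDD itself, and one RDD cannot be referenced from inside another RDD's transformation. A common workaround (a sketch only, assuming the id list is small enough to fit on the driver) is to `collect()` the ids to the driver, turn them into a plain Python set, broadcast it, and filter on membership. The membership-filter logic itself is plain Python, shown here on sample data (the id `'10000149'` is a hypothetical overlap added for illustration), with the Spark equivalents in comments:

```python
# Sample data mirroring the RDD contents shown above.
# '10000149' is a hypothetical id added so the two datasets overlap.
ids_alsaciens = ['1933992', '2705919', '10000149']
personnes = [
    {'date': '2013-06-03 00:00', 'field': 'WAID_INDIVIDU_WC_NUMNNI',
     'id': '10000149', 'value': '2770278'},
    {'date': '2013-05-15 00:00', 'field': 'WAID_INDIVIDU_WC_NUMNNI',
     'id': '10009910', 'value': '2570631'},
]

# In Spark this would be:  ids_set = set(ids_alsaciens.collect())
# A set gives O(1) membership tests inside the filter.
ids_set = set(ids_alsaciens)

# In Spark:  ids_bc = sc.broadcast(ids_set)
#            matches = personnes.filter(lambda row: row['id'] in ids_bc.value)
# Broadcasting ships the set once per executor instead of once per task.
matches = [row for row in personnes if row['id'] in ids_set]

print([row['id'] for row in matches])  # → ['10000149']
```

If the id list is too large to collect to the driver, an alternative would be to key both RDDs by id (e.g. `personnes.keyBy(lambda row: row['id'])` against `ids_alsaciens.map(lambda i: (i, None))`) and use `join`, which avoids referencing one RDD inside another.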