My problem is as follows:
Table 1
ID1 ID2
1 2
3 4
Table 2
C1 VALUE
1 London
4 Texas
Table 3
C3 VALUE
2 Paris
3 Arizona
Table 1 holds the primary and secondary IDs (ID1 and ID2). I need to create a final output that aggregates the values from Table 2 and Table 3 based on the ID mapping in Table 1.
That is, if a value in Table 2 or Table 3 is mapped to either of the two IDs in a row of Table 1, it should be aggregated under a single entry.
My final output should look like this:
ID Aggregated
1 [2, London, Paris] // since Paris is mapped to 2 which in turn is mapped to 1
3 [4, Texas, Arizona] // Texas is mapped to 4 which in turn is mapped to 3
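To make the intended mapping concrete, here is a minimal plain-Python sketch (not PySpark) of the logic I am after. The names `to_primary` and `aggregated` are only illustrative, and the sketch assumes that every key in Table 2 and Table 3 appears somewhere in Table 1.

```python
from collections import defaultdict

# Sample data copied from the tables above.
table1 = [(1, 2), (3, 4)]                # (ID1, ID2)
table2 = [(1, "London"), (4, "Texas")]   # (C1, VALUE)
table3 = [(2, "Paris"), (3, "Arizona")]  # (C3, VALUE)

# Map every ID (primary or secondary) to the primary ID of its row.
to_primary = {}
for id1, id2 in table1:
    to_primary[id1] = id1
    to_primary[id2] = id1

# Start each group with the secondary ID, then append the mapped values.
aggregated = defaultdict(list)
for id1, id2 in table1:
    aggregated[id1].append(id2)
for key, value in table2 + table3:
    aggregated[to_primary[key]].append(value)

result = dict(aggregated)
print(result)
```

This only illustrates the grouping rule on the sample rows. What I am looking for is the equivalent done in a distributed way.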
Any suggestions on how to achieve this in PySpark?
I am not sure whether joining the tables is going to help with this problem.
I was thinking PairedRDD might help me here, but I have not been able to come up with a proper solution.
Thanks