I have a problem joining two DataFrames on columns containing arrays in PySpark. I want to join on those columns whenever the arrays hold the same elements, regardless of order.
So, I have one DataFrame containing itemsets and their frequencies in the following format:
+--------------------+----+
| items|freq|
+--------------------+----+
| [1828545, 1242385]| 4|
| [1828545, 2032007]| 4|
| [1137808]| 11|
| [1209448]| 5|
| [21002]| 5|
| [2793224]| 209|
| [2793224, 8590]| 7|
|[2793224, 8590, 8...| 4|
|[2793224, 8590, 8...| 4|
|[2793224, 8590, 8...| 5|
|[2793224, 8590, 1...| 4|
| [2793224, 2593971]| 20|
+--------------------+----+
And another DataFrame which contains information about the user and items in the following format:
+------------+-------------+--------------------+
| user_id| session_id| itemset |
+------------+-------------+--------------------+
|WLB2T1JWGTHH|0012c5936056e|[1828545, 1242385] |
|BZTAWYQ70C7N|00783934ea027|[2793224, 8590] |
|42L1RJL436ST|00c6821ed171e|[8590, 2793224] |
|HB348HWSJAOP|00fa9607ead50|[21002] |
|I9FOENUQL1F1|013f69b45bb58|[21002] |
+------------+-------------+--------------------+
Now I want to join the two DataFrames on `itemset` and `items` whenever the arrays contain the same elements, regardless of how they are ordered. My desired output would be:
+------------+-------------+--------------------+----+
| user_id| session_id| itemset |freq|
+------------+-------------+--------------------+----+
|WLB2T1JWGTHH|0012c5936056e|[1828545, 1242385] | 4|
|BZTAWYQ70C7N|00783934ea027|[2793224, 8590] | 7|
|42L1RJL436ST|00c6821ed171e|[8590, 2793224] | 7|
|HB348HWSJAOP|00fa9607ead50|[21002] | 5|
|I9FOENUQL1F1|013f69b45bb58|[21002] | 5|
+------------+-------------+--------------------+----+
I could not find any solution online; I only found answers for joining when a single item is contained in an array, not when two arrays have to match as sets.
Thank you very much! :)