I want to group one column by unique values drawn from two columns of a PySpark DataFrame. The output should be such that once a value has been used in a group (either as the group key or inside the collected list), it should not appear again as a group key of its own.
| fruit  | fruits    |
|--------|-----------|
| apple  | banana    |
| banana | apple     |
| apple  | mango     |
| orange | guava     |
| apple  | pineapple |
| mango  | apple     |
| banana | mango     |
| banana | pineapple |
I have tried grouping by a single column, but this needs to be modified, or some other logic is required:
from pyspark.sql.functions import collect_list

df9 = final_main.groupBy('fruit').agg(collect_list('fruits').alias('values'))
I am getting the following output from the query above:
| fruit  | values                         |
|--------|--------------------------------|
| apple  | ['banana','mango','pineapple'] |
| banana | ['apple']                      |
| orange | ['guava']                      |
| mango  | ['apple']                      |
But I want the following output:
| fruit  | values                         |
|--------|--------------------------------|
| apple  | ['banana','mango','pineapple'] |
| orange | ['guava']                      |
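To make the intended rule concrete, here is a minimal pure-Python sketch (not the PySpark answer) of the deduplication logic described above: a fruit that has already been collected into some key's value list is skipped when it later appears as a key. Note this sketch is order-dependent, matching the row order of the sample data; the `group_once` name and the hard-coded `pairs` list are illustrative assumptions, not part of the original question.

```python
# Sample rows from the question, as (fruit, fruits) pairs.
pairs = [
    ("apple", "banana"),
    ("banana", "apple"),
    ("apple", "mango"),
    ("orange", "guava"),
    ("apple", "pineapple"),
    ("mango", "apple"),
    ("banana", "mango"),
    ("banana", "pineapple"),
]

def group_once(pairs):
    """Group values by key, skipping keys already consumed as values.

    Illustrative sketch of the rule in the question; processes pairs
    in row order, so the result depends on that order.
    """
    result = {}
    consumed = set()  # fruits already emitted inside some value list
    for key, value in pairs:
        if key in consumed:
            continue  # this fruit already appeared as a value elsewhere
        bucket = result.setdefault(key, [])
        if value != key and value not in bucket:
            bucket.append(value)
            consumed.add(value)
    return result

print(group_once(pairs))
# {'apple': ['banana', 'mango', 'pineapple'], 'orange': ['guava']}
```

This reproduces the desired table: `banana` and `mango` are swallowed into `apple`'s list before they are seen as keys, so they never form groups of their own. Translating this to PySpark would need driver-side iteration or a graph-style approach, since the rule depends on rows already processed.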