I need some help with the following issue.
I have a PySpark dataframe with two columns:
+----------------------+------+
|col_list              |group |
+----------------------+------+
|[1, 2, 3, 4, 5, 6, 7] |group1|
|[6, 7, 8]             |group1|
|[1, 2, 3, 4]          |group2|
|[10, 11]              |group2|
+----------------------+------+
I want to group by the column "group" and collect only the unique values from "col_list" into a single list.
I have tried this:

import pyspark.sql.functions as F

df.groupby("group").agg(F.flatten(F.collect_set("col_list")))
and it returned this result:

+------+------------------------------+
|group |flatten(collect_set(col_list))|
+------+------------------------------+
|group1|[1, 2, 3, 4, 5, 6, 7, 6, 7, 8]|
|group2|[10, 11, 1, 2, 3, 4]          |
+------+------------------------------+
The flattened list for group1 still contains duplicates; I need help returning only unique values, like:
[1, 2, 3, 4, 5, 6, 7, 8]
Use F.array_distinct() to remove the duplicates from the flattened list. – jxc