
I need some help with the following issue.

I have a PySpark DataFrame with two columns:

+----------------------+------+
|             col_list | group|
+----------------------+------+
|[1, 2, 3, 4, 5, 6, 7] |group1|
|            [6, 7, 8] |group1|
|         [1, 2, 3, 4] |group2|
|             [10, 11] |group2|
+----------------------+------+

I want to group by the group column and collect only the unique values from col_list into a single list.

I have tried this:

df.groupby("group").agg(F.flatten(F.collect_set('col_list')))

which returned:

+------+-------------------------------+
| group|flatten(collect_set(col_list)) |
+------+-------------------------------+
|group1| [1, 2, 3, 4, 5, 6, 7, 6, 7, 8]|
|group2|           [10, 11, 1, 2, 3, 4]|
+------+-------------------------------+

The flattened list for group1 contains duplicates. How do I return only the unique values, like this:

[1, 2, 3, 4, 5, 6, 7, 8]

Just use F.array_distinct() to remove duplicates from the flattened list. – jxc
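
For example, applied to the attempt above (a sketch; the .alias() name is my own addition):

df.groupby("group").agg(
    F.array_distinct(F.flatten(F.collect_set("col_list"))).alias("col_list")
)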

1 Answer

This should do the trick - you need to explode() first, then collect_set():

df.select("group", F.explode(F.col("col_list")).alias("col_list")) \
  .groupby("group") \
  .agg(F.collect_set("col_list"))
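
A minimal end-to-end sketch (assuming an active SparkSession named spark; F.sort_array() is my addition, used only to make the output order deterministic, since collect_set() gives no ordering guarantee):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Recreate the example DataFrame from the question
df = spark.createDataFrame(
    [([1, 2, 3, 4, 5, 6, 7], "group1"),
     ([6, 7, 8], "group1"),
     ([1, 2, 3, 4], "group2"),
     ([10, 11], "group2")],
    ["col_list", "group"],
)

# Explode the arrays into rows, then collect the distinct values per group
result = (
    df.select("group", F.explode("col_list").alias("col_list"))
      .groupby("group")
      .agg(F.sort_array(F.collect_set("col_list")).alias("col_list"))
)
result.show(truncate=False)

# Expected output (order shown after sort_array):
# +------+------------------------+
# |group |col_list                |
# +------+------------------------+
# |group1|[1, 2, 3, 4, 5, 6, 7, 8]|
# |group2|[1, 2, 3, 4, 10, 11]    |
# +------+------------------------+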