
I have a PySpark dataframe-

df1 = spark.createDataFrame([
    ("u1", 1),
    ("u1", 2),
    ("u2", 1),
    ("u2", 1),
    ("u2", 1),
    ("u3", 3),
    ],
    ['user_id', 'var1'])

df1.printSchema()
df1.show(truncate=False)

Output-

root
 |-- user_id: string (nullable = true)
 |-- var1: long (nullable = true)

+-------+----+
|user_id|var1|
+-------+----+
|u1     |1   |
|u1     |2   |
|u2     |1   |
|u2     |1   |
|u2     |1   |
|u3     |3   |
+-------+----+

Now I want to group all the unique users and show the number of unique var1 values for each of them in a new column. The desired output would look like this-

+-------+---------------+
|user_id|num_unique_var1|
+-------+---------------+
|u1     |2              |
|u2     |1              |
|u3     |1              |
+-------+---------------+

I can use collect_set and make a UDF to find the set's length (something like the sketch below), but I think there must be a better way to do it. How do I achieve this in one line of code?
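
For reference, this is roughly the UDF approach I had in mind (the helper name count_unique is just illustrative):

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# UDF that returns the length of the collected set of values
count_unique = F.udf(lambda vals: len(vals), IntegerType())

df1.groupBy('user_id') \
   .agg(count_unique(F.collect_set('var1')).alias('num_unique_var1')) \
   .show()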


2 Answers

from pyspark.sql import functions as F

df1.groupBy('user_id').agg(F.countDistinct('var1').alias('num')).show()

countDistinct is exactly what I needed.

Output-

+-------+---+
|user_id|num|
+-------+---+
|     u3|  1|
|     u2|  1|
|     u1|  2|
+-------+---+

countDistinct is surely the best way to do it, but for the sake of completeness, the approach you mentioned in your question is also possible without a UDF. You can use size to get the length of the collect_set:

df1.groupBy('user_id').agg(F.size(F.collect_set('var1')).alias('num'))

This is helpful if you want to use it in a window function, because countDistinct is not supported over a window.

e.g.

df1.withColumn('num', F.countDistinct('var1').over(Window.partitionBy('user_id')))

would fail, but

df1.withColumn('num', F.size(F.collect_set('var1')).over(Window.partitionBy('user_id')))

would work.
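
For completeness, here is a self-contained sketch of the working variant (assuming the same df1 as in the question):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('user_id')

# keeps every original row and adds the per-user count of distinct var1 values
df1.withColumn('num', F.size(F.collect_set('var1')).over(w)).show()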