
I have a PySpark dataframe-

df = spark.createDataFrame([
    ("u1", 0),
    ("u2", 0),
    ("u3", 1),
    ("u4", 2),
    ("u5", 3),
    ("u6", 2),],
    ['user_id', 'medals'])

df.show()

Output-

+-------+------+
|user_id|medals|
+-------+------+
|     u1|     0|
|     u2|     0|
|     u3|     1|
|     u4|     2|
|     u5|     3|
|     u6|     2|
+-------+------+

I want to get the distribution of the medals column for all the users. So if there are n unique values in the medals column, I want n columns in the output dataframe with corresponding number of users who received that many medals.

The output for the data given above should look like-

+--------+--------+--------+--------+
|medals_0|medals_1|medals_2|medals_3|
+--------+--------+--------+--------+
|       2|       1|       2|       1|
+--------+--------+--------+--------+

How do I achieve this?


1 Answer


It's a simple pivot: group over the whole dataframe (an empty `groupBy()`), pivot on `medals`, and count the rows in each group:

df.groupBy().pivot("medals").count().show()

+---+---+---+---+
|  0|  1|  2|  3|
+---+---+---+---+
|  2|  1|  2|  1|
+---+---+---+---+

If you want the cosmetic touch of adding the word `medals` to each column name, you can rename the columns afterwards:

medals_df = df.groupBy().pivot("medals").count()

for col in medals_df.columns:
    medals_df = medals_df.withColumnRenamed(col, "medals_{}".format(col))

medals_df.show()

+--------+--------+--------+--------+
|medals_0|medals_1|medals_2|medals_3|
+--------+--------+--------+--------+
|       2|       1|       2|       1|
+--------+--------+--------+--------+