I have a PySpark dataframe-
df = spark.createDataFrame([
("u1", 0),
("u2", 0),
("u3", 1),
("u4", 2),
("u5", 3),
("u6", 2),],
['user_id', 'medals'])
df.show()
Output-
+-------+------+
|user_id|medals|
+-------+------+
| u1| 0|
| u2| 0|
| u3| 1|
| u4| 2|
| u5| 3|
| u6| 2|
+-------+------+
I want to get the distribution of the medals column for all the users. So if there are n unique values in the medals column, I want n columns in the output dataframe with corresponding number of users who received that many medals.
The output for the data given above should look like-
+------- +--------+--------+--------+
|medals_0|medals_1|medals_2|medals_3|
+--------+--------+--------+--------+
| 2| 1| 2| 1|
+--------+--------+--------+--------+
How do I achieve this?