I have a dataframe with two columns that looks as follows:
df = spark.createDataFrame([('A', 'Science'),
                            ('A', 'Math'),
                            ('A', 'Physics'),
                            ('B', 'Science'),
                            ('B', 'English'),
                            ('C', 'Math'),
                            ('C', 'English'),
                            ('C', 'Latin')],
                           ['Group', 'Subjects'])
Group Subjects
A Science
A Math
A Physics
B Science
B English
C Math
C English
C Latin
I need to iterate through this data for each unique value in the Group column and perform some processing. I'm thinking of creating a dictionary with each Group name as the key and its corresponding list of Subjects as the value.
So my expected output would look like this:
{'A': ['Science', 'Math', 'Physics'], 'B': ['Science', 'English'], 'C': ['Math', 'English', 'Latin']}
How can I achieve this in PySpark?
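To make the goal concrete, here is a rough sketch of the kind of thing I have in mind (assuming a groupBy with collect_list and then collecting the small aggregated result to the driver is acceptable); I'm not sure whether this is the right or most idiomatic way:

from pyspark.sql import functions as F

# Reusing df and the spark session created above.
# Collect each group's Subjects into a list, one row per Group.
grouped = df.groupBy('Group').agg(F.collect_list('Subjects').alias('Subjects'))

# Bring the aggregated rows back to the driver and build a plain Python dict.
# Note: collect_list does not guarantee any particular element order.
result = {row['Group']: row['Subjects'] for row in grouped.collect()}
print(result)
# e.g. {'A': ['Science', 'Math', 'Physics'], 'B': ['Science', 'English'], 'C': ['Math', 'English', 'Latin']}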