15
votes

I have a PySpark DataFrame like this:

DOCTOR | PATIENT
JOHN   | SAM
JOHN   | PETER
JOHN   | ROBIN
BEN    | ROSE
BEN    | GRAY

I need to concatenate the patient names for each doctor, so that the output looks like:

DOCTOR | PATIENT
JOHN   | SAM, PETER, ROBIN
BEN    | ROSE, GRAY

Can anybody help me create this DataFrame in PySpark?

Thanks in advance.

1 Answer

43
votes

The simplest way I can think of is to group by the doctor column, gather each group's values with collect_list, and join them with concat_ws:

import pyspark.sql.functions as f

# Group by doctor and join each group's patient names with ", "
df.groupby("DOCTOR").agg(f.concat_ws(", ", f.collect_list(df.PATIENT)).alias("PATIENT"))