
Consider the following PySpark DataFrame:

Col1 Col2 Col3
A D G
B E H
C F I

How can I create the following DataFrame, which adds one column for every pairwise combination of the original columns?

Col1 Col2 Col3 Col1_Col2_cross Col1_Col3_cross Col2_Col3_cross
A D G A,D A,G D,G
B E H B,E B,H E,H
C F I C,F C,I F,I
Does it need to work for any number of columns, or only 3? - ScootCork
Any number of columns. I actually have about 30 columns, but showed only 3 here to simplify the problem. - ShawnK

1 Answer


You can generate every pair of columns with itertools.combinations and join each pair with concat_ws:

import pyspark.sql.functions as F
import itertools

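# Keep every original column, then add one comma-joined column
# for each unordered pair produced by itertools.combinations.
df2 = df.select(
    '*',
    *[F.concat_ws(',', a, b).alias(f'{a}_{b}_cross')
      for a, b in itertools.combinations(df.columns, 2)]
)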

df2.show()
+----+----+----+---------------+---------------+---------------+
|Col1|Col2|Col3|Col1_Col2_cross|Col1_Col3_cross|Col2_Col3_cross|
+----+----+----+---------------+---------------+---------------+
|   A|   D|   G|            A,D|            A,G|            D,G|
|   B|   E|   H|            B,E|            B,H|            E,H|
|   C|   F|   I|            C,F|            C,I|            F,I|
+----+----+----+---------------+---------------+---------------+
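
Since the question mentions about 30 columns, it may help to check how many cross columns this produces before running it on the real data. The sketch below uses only itertools and plain column-name strings, with no Spark session. The helper name cross_column_names is hypothetical and the Col1..Col30 names are assumed for illustration.

```python
import itertools


def cross_column_names(columns):
    """Return (left, right, alias) for every unordered pair of columns."""
    return [
        (left, right, f"{left}_{right}_cross")
        for left, right in itertools.combinations(columns, 2)
    ]


# The three-column example from the question.
for left, right, alias in cross_column_names(["Col1", "Col2", "Col3"]):
    print(left, right, alias)

# Hypothetical 30-column case: 30 * 29 / 2 = 435 new columns.
wide = cross_column_names([f"Col{i}" for i in range(1, 31)])
print(len(wide))
```

The pair count grows quadratically with the number of columns, so with 30 input columns the select adds 435 cross columns on top of the originals.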