Consider the following PySpark dataframe

Col1 Col2 Col3
A, B D, G A, G
C, F C, D A, G
C, F C, D A, G

I'd like to create a new dataframe with two columns: the first containing each distinct combination, and the second the ratio, i.e. frequency of the combination divided by the total number of combinations. For example,

Combination Ratio
A, B 0.111 (1/9)
C, F 0.222 (2/9)
D, G 0.111 (1/9)
C, D 0.222 (2/9)
A, G 0.333 (3/9)
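The ratio logic itself can be checked outside Spark; a minimal pure-Python sketch over the nine cell values from the example dataframe (flattened row by row) looks like:

```python
from collections import Counter

# The nine cell values from the example dataframe, flattened row by row
values = ['A, B', 'D, G', 'A, G',
          'C, F', 'C, D', 'A, G',
          'C, F', 'C, D', 'A, G']

counts = Counter(values)                                        # frequency of each combination
ratios = {combo: n / len(values) for combo, n in counts.items()}  # frequency / total
# e.g. ratios['A, G'] == 3/9 and ratios['C, F'] == 2/9
```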

1 Answer

You can unpivot the columns with stack, then group by and count:

from pyspark.sql import functions as F, Window

# Unpivot all columns into a single 'combination' column, count each
# combination, then divide by the grand total (sum over an empty window
# partition, i.e. the whole dataframe)
df2 = df.selectExpr(
    'stack(' + str(len(df.columns)) + ', ' + ', '.join(df.columns) + ') as combination'
).groupBy('combination').count().withColumn(
    'ratio',
    F.col('count') / F.sum('count').over(Window.partitionBy())
).drop('count')

df2.show()
+-----------+------------------+
|combination|             ratio|
+-----------+------------------+
|       A, B|0.1111111111111111|
|       C, F|0.2222222222222222|
|       C, D|0.2222222222222222|
|       D, G|0.1111111111111111|
|       A, G|0.3333333333333333|
+-----------+------------------+
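For reference, the string passed to selectExpr is plain Spark SQL; for the three example columns it expands to the following (stack(3, a, b, c) here emits one row per column value):

```python
# Reproduce the expression string built in the answer, assuming the
# question's column names Col1..Col3
columns = ['Col1', 'Col2', 'Col3']
expr = 'stack(' + str(len(columns)) + ', ' + ', '.join(columns) + ') as combination'
print(expr)  # stack(3, Col1, Col2, Col3) as combination
```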