Consider the following PySpark dataframe

Col1 Col2 Col3
A, B D, G A, G
C, F C, D A, G
C, F C, D A, G

I'd like to create a new dataframe with two columns: the first containing each distinct combination, and the second the ratio, i.e. frequency of the combination divided by the total number of combinations. For example,

Combination Ratio
A, B 0.111 (1/9)
C, F 0.222 (2/9)
D, G 0.111 (1/9)
C, D 0.222 (2/9)
A, G 0.333 (3/9)
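The ratio logic itself can be checked outside Spark; a minimal pure-Python sketch over the nine cell values from the example dataframe (flattened row by row) looks like:

```python
from collections import Counter

# The nine cell values from the example dataframe, flattened row by row
values = ['A, B', 'D, G', 'A, G',
          'C, F', 'C, D', 'A, G',
          'C, F', 'C, D', 'A, G']

counts = Counter(values)                                        # frequency of each combination
ratios = {combo: n / len(values) for combo, n in counts.items()}  # frequency / total
# e.g. ratios['A, G'] == 3/9 and ratios['C, F'] == 2/9
```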

1 Answer

You can unpivot the columns with stack, then group by and count:

from pyspark.sql import functions as F, Window

# Unpivot all columns into a single 'combination' column, count each
# combination, then divide by the grand total (sum over an empty window
# partition, i.e. the whole dataframe)
df2 = df.selectExpr(
    'stack(' + str(len(df.columns)) + ', ' + ', '.join(df.columns) + ') as combination'
).groupBy('combination').count().withColumn(
    'ratio',
    F.col('count') / F.sum('count').over(Window.partitionBy())
).drop('count')

df2.show()
+-----------+------------------+
|combination|             ratio|
+-----------+------------------+
|       A, B|0.1111111111111111|
|       C, F|0.2222222222222222|
|       C, D|0.2222222222222222|
|       D, G|0.1111111111111111|
|       A, G|0.3333333333333333|
+-----------+------------------+
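For reference, the string passed to selectExpr is plain Spark SQL; for the three example columns it expands to the following (stack(3, a, b, c) here emits one row per column value):

```python
# Reproduce the expression string built in the answer, assuming the
# question's column names Col1..Col3
columns = ['Col1', 'Col2', 'Col3']
expr = 'stack(' + str(len(columns)) + ', ' + ', '.join(columns) + ') as combination'
print(expr)  # stack(3, Col1, Col2, Col3) as combination
```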