Create a group id over a window in Spark Dataframe

Question

I have a dataframe in which I want to assign an id to each window partition. For example, I have

id | col |
1  |  a  |
2  |  a  |
3  |  b  |
4  |  c  |
5  |  c  |

So I want the following (grouping on column col):

id | group |
1  |  1    |
2  |  1    |
3  |  2    |
4  |  3    |
5  |  3    |

I want to use a window function, but I cannot find any way to assign an id to each window. I need something like:

w = Window().partitionBy('col')
df = df.withColumn("group", id().over(w))

Is there any way to achieve something like that? (I cannot simply use col as the group id because I am interested in creating a window over multiple columns.)

Ramesh Maharjan · Accepted Answer · 2018-05-08T18:10:44

Applying the built-in dense_rank function over a window ordered by col should give you the desired result:

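from pyspark.sql import Window
import pyspark.sql.functions as f

# dense_rank gives rows with equal 'col' the same consecutive number.
# Note: orderBy without partitionBy pulls all rows into a single partition.
df.select('id', f.dense_rank().over(Window.orderBy('col')).alias('group')).show(truncate=False)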

which should give you

+---+-----+
|id |group|
+---+-----+
|1  |1    |
|2  |1    |
|3  |2    |
|4  |3    |
|5  |3    |
+---+-----+
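
To make it clearer why dense_rank works as a group id, here is a minimal plain-Python sketch of what it computes: every distinct key gets the next consecutive number in sorted order, and each row receives the number of its key. This is not Spark code; the helper name dense_rank_ids and the sample values in the two-column case are hypothetical, chosen only for illustration.

```python
# Plain-Python sketch of dense-rank semantics (not Spark code).
# `dense_rank_ids` and the sample values are hypothetical.

def dense_rank_ids(keys):
    """Return a 1-based dense rank for each key, in input order."""
    # Each distinct key gets the next consecutive number in sorted order.
    rank_of = {key: i for i, key in enumerate(sorted(set(keys)), start=1)}
    return [rank_of[key] for key in keys]


# Single grouping column, as in the question.
single = dense_rank_ids(["a", "a", "b", "c", "c"])

# Two grouping columns: each row's key is a tuple of both values.
multi = dense_rank_ids([("a", 1), ("a", 1), ("a", 2), ("b", 1), ("b", 1)])

print(single)
print(multi)
```

Because the keys can be tuples, the same idea covers the multi-column case from the question. In PySpark that should correspond to passing every grouping column to the window's ordering, for example Window.orderBy('col1', 'col2'), where col1 and col2 stand in for your own column names.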
