I'm trying to create a new column in a Spark DataFrame using Pyspark, which represents an autoincrement (or ID) based on groups of alternating boolean values. Lets say I have the following DataFrame:
df.show()
+-----+------------+-------------+
|id |par_id |is_on |
+-----+------------+-------------+
|40002|1 |true |
|40003|2 |true |
|40004|null |false |
|40005|17 |true |
|40006|2 |true |
|40007|17 |true |
|40008|240 |true |
|40009|1861 |true |
|40010|1862 |true |
|40011|2 |true |
|40012|null |false |
|40013|1863 |true |
|40014|626 |true |
|40016|208 |true |
|40017|2 |true |
|40018|null |false |
|40019|2 |true |
|40020|1863 |true |
|40021|2 |true |
|40022|2 |true |
+-----+------------+-------------+
I want to extend this DataFrame with an incremental id called id2
using the is_on
attribute. That is, each group of boolean values should get an increasing id. The resulting DataFrame should look like this:
df.show()
+-----+------------+-------------+-----+
|id |par_id |is_on |id2 |
+-----+------------+-------------+-----+
|40002|1 |true |1 |
|40003|2 |true |1 |
|40004|null |false |2 |
|40005|17 |true |3 |
|40006|2 |true |3 |
|40007|17 |true |3 |
|40008|240 |true |3 |
|40009|1861 |true |3 |
|40010|1862 |true |3 |
|40011|2 |true |3 |
|40012|null |false |4 |
|40013|1863 |true |5 |
|40014|626 |true |5 |
|40016|208 |true |5 |
|40017|2 |true |5 |
|40018|null |false |6 |
|40019|2 |true |7 |
|40020|1863 |true |7 |
|40021|2 |true |7 |
|40022|2 |true |7 |
+-----+------------+-------------+-----+
Do you have any suggestions to do that? How can I write a User Defined Function for this?