I have a pyspark dataframe which has three columns id, seq, seq_checker. I need to order by id and check for 4 consecutive 1's in seq_checker column.
I tried using window functions. I'm unable to change all values in a window based on a condition.
new_window = Window.partitionBy().orderBy("id").rangeBetween(0, 3)output = df.withColumn('check_sequence',F.when(F.min(df['seq_checker']).over(new_window) == 1, True))
original pyspark df:
+---+---+-----------+--------------+ | Id|seq|seq_checker|check_sequence| +---+---+-----------+--------------+ | 1| 1| 1| false| | 2| 2| 1| false| | 3| 3| 1| false| | 4| 4| 1| false| | 5| 10| 0| false| | 6| 14| 1| false| | 7| 13| 1| false| | 8| 18| 0| false| | 9| 23| 0| false| | 10| 5| 0| false| | 11| 56| 0| false| | 12| 66| 0| false| | 13| 34| 1| false| | 14| 35| 1| false| | 15| 36| 1| false| | 16| 37| 1| false| | 17| 39| 0| false| | 18| 54| 0| false| | 19| 68| 0| false| | 20| 90| 0| false| +---+---+-----------+--------------+
Required output:
+---+---+-----------+--------------+ | Id|seq|seq_checker|check_sequence| +---+---+-----------+--------------+ | 1| 1| 1| true| | 2| 2| 1| true| | 3| 3| 1| true| | 4| 4| 1| true| | 5| 10| 0| false| | 6| 14| 1| false| | 7| 13| 1| false| | 8| 18| 0| false| | 9| 23| 0| false| | 10| 5| 0| false| | 11| 56| 0| false| | 12| 66| 0| false| | 13| 34| 1| true| | 14| 35| 1| true| | 15| 36| 1| true| | 16| 37| 1| true| | 17| 39| 0| false| | 18| 54| 0| false| | 19| 68| 0| false| | 20| 90| 0| false| +---+---+-----------+--------------+
Based on the above code, my output is:
+---+---+-----------+--------------+ | Id|seq|seq_checker|check_sequence| +---+---+-----------+--------------+ | 1| 1| 1| true| | 2| 2| 1| null| | 3| 3| 1| null| | 4| 4| 1| null| | 5| 10| 0| null| | 6| 14| 1| null| | 7| 13| 1| null| | 8| 18| 0| null| | 9| 23| 0| null| | 10| 5| 0| null| | 11| 56| 0| null| | 12| 66| 0| null| | 13| 34| 1| true| | 14| 35| 1| null| | 15| 36| 1| null| | 16| 37| 1| null| | 17| 39| 0| null| | 18| 54| 0| null| | 19| 68| 0| null| | 20| 90| 0| null| +---+---+-----------+--------------+
Edit: 1. If we have more than 4 consecutive rows having 1's we need to change check_sequence flag for all the rows to True.
- My actual problem is to check for sequences of length greater than 4 in the 'seq' column. I was able to create seq_checker column using lead and lag functions.