2
votes

I have a dataframe that looks like this (call it df). It is already sorted by date, the col1 == 1 value is unique, and only 0 has duplicates in col1.

+----------+----+----+
|      date|col1|col2|
+----------+----+----+
|2020-08-01|   5|  -1|
|2020-08-02|   4|  -1|
|2020-08-03|   3|   3|
|2020-08-04|   2|   2|
|2020-08-05|   1|   4|
|2020-08-06|   0|   1|
|2020-08-07|   0|   2|
|2020-08-08|   0|   3|
|2020-08-09|   0|  -1|
+----------+----+----+

The condition: when col1 == 1, we start counting upward from that row's col2 value (here 4) going backwards in time (4, 5, 6, 7, 8, ...), and every row after the col1 == 1 row gets 0 (4, 0, 0, 0, 0, ...).

So my resulting df will look something like this:

    +----------+----+----+----+
    |      date|col1|col2|want|
    +----------+----+----+----+
    |2020-08-01|   5|  -1|   8|
    |2020-08-02|   4|  -1|   7|
    |2020-08-03|   3|   3|   6|
    |2020-08-04|   2|   2|   5|
    |2020-08-05|   1|   4|   4|
    |2020-08-06|   0|   1|   0|
    |2020-08-07|   0|   2|   0|
    |2020-08-08|   0|   3|   0|
    |2020-08-09|   0|  -1|   0|
    +----------+----+----+----+
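Before reaching for window functions, the rule can be pinned down in plain Python (a sketch for illustration only; the function name and list layout are mine, not part of the question):

```python
def base_want(col1_vals, col2_vals):
    """When col1 == 1, start from that row's col2 value and count
    upward going back in time; every later row gets 0."""
    anchor = col1_vals.index(1)   # position of the unique col1 == 1 row
    start = col2_vals[anchor]     # starting value (4 in the example)
    return [start + (anchor - i) if i <= anchor else 0
            for i in range(len(col1_vals))]

# Columns from the example table, top to bottom:
col1 = [5, 4, 3, 2, 1, 0, 0, 0, 0]
col2 = [-1, -1, 3, 2, 4, 1, 2, 3, -1]
print(base_want(col1, col2))  # [8, 7, 6, 5, 4, 0, 0, 0, 0]
```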

Enhancement: I want to add a condition for when col2 == -1 on the col1 == 1 row. If that -1 starts a consecutive run of -1s, I want to count the run and add its length to the next non -1 col2 value, and use that sum as the starting point. Here's an example to make it clear:

    +----------+----+----+----+
    |      date|col1|col2|want|
    +----------+----+----+----+
    |2020-08-01|   5|  -1|  11|
    |2020-08-02|   4|  -1|  10|
    |2020-08-03|   3|   3|   9|
    |2020-08-04|   2|   2|   8|
    |2020-08-05|   1|  -1|   7|
    |2020-08-06|   0|  -1|   0|
    |2020-08-07|   0|  -1|   0|
    |2020-08-08|   0|   4|   0|
    |2020-08-09|   0|  -1|   0|
    +----------+----+----+----+

So here we see 3 consecutive -1s (we only care about the first consecutive run of -1s), and after the run we have 4, so we would get 4 + 3 = 7 on the col1 == 1 row. Is it possible?

Any help, or a hint on how to begin this approach, would be much appreciated! - hellotherebj
What happens if col1 has multiple 1 values? I can already see duplicates for 0, can duplicates exist for other values as well? What is the correct sorting in that case? - Dusan Vasiljevic
The value 1 in col1 will be unique; after 1 it will be 0 thereafter. The only duplicate in this case would be 0. - hellotherebj
In terms of sorting, unfortunately we can't re-sort it; the values depend on the dates. - hellotherebj
Can you update the question with all relevant information? The problem above is 100% solvable using sorting and ranking, but if sorting cannot be used, maybe there is other approach that can be helpful - Dusan Vasiljevic

1 Answer

1
votes

Here is my try:

from pyspark.sql.functions import sum, when, col, rank, desc
from pyspark.sql import Window

# Ordered by date descending, so the cumulative sum "switches on" at the
# col1 == 1 row and carries its col2 value to all earlier dates.
w1 = Window.orderBy(desc('date'))
# Within each 'case' group, earlier dates get higher ranks.
w2 = Window.partitionBy('case').orderBy(desc('date'))

# case: col2 value of the col1 == 1 row, propagated to all earlier dates (0 elsewhere)
# rank: 0-based offset from the col1 == 1 row, going back in time
# want: case + rank, i.e. counting upward from the starting value
df.withColumn('case', sum(when(col('col1') == 1, col('col2')).otherwise(0)).over(w1)) \
  .withColumn('rank', when(col('case') != 0, rank().over(w2) - 1).otherwise(0)) \
  .withColumn('want', col('case') + col('rank')) \
  .orderBy('date') \
  .show(10, False)

+----------+----+----+----+----+----+
|date      |col1|col2|case|rank|want|
+----------+----+----+----+----+----+
|2020-08-01|5   |-1  |4   |4   |8   |
|2020-08-02|4   |-1  |4   |3   |7   |
|2020-08-03|3   |3   |4   |2   |6   |
|2020-08-04|2   |2   |4   |1   |5   |
|2020-08-05|1   |4   |4   |0   |4   |
|2020-08-06|0   |1   |0   |0   |0   |
|2020-08-07|0   |2   |0   |0   |0   |
|2020-08-08|0   |3   |0   |0   |0   |
|2020-08-09|0   |-1  |0   |0   |0   |
+----------+----+----+----+----+----+
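The code above covers the base case but not the enhancement, where col2 == -1 on the col1 == 1 row. One way to begin is to count the -1 run starting at that row and add its length to the first non -1 value after the run; here is a plain-Python sketch of that logic (not Spark code; `enhanced_want` is a hypothetical helper used only to verify the expected numbers):

```python
def enhanced_want(col1_vals, col2_vals):
    """Base rule, except: if col2 == -1 at the col1 == 1 row, the start
    is (length of the -1 run) + (first non -1 value after the run).
    Assumes a non -1 value eventually follows the run."""
    anchor = col1_vals.index(1)
    start = col2_vals[anchor]
    if start == -1:
        j = anchor
        while col2_vals[j] == -1:             # measure the consecutive -1 run
            j += 1
        start = (j - anchor) + col2_vals[j]   # e.g. 3 consecutive -1s, then 4 -> 7
    return [start + (anchor - i) if i <= anchor else 0
            for i in range(len(col1_vals))]

# Columns from the enhancement example:
col1 = [5, 4, 3, 2, 1, 0, 0, 0, 0]
col2 = [-1, -1, 3, 2, -1, -1, -1, 4, -1]
print(enhanced_want(col1, col2))  # [11, 10, 9, 8, 7, 0, 0, 0, 0]
```

In Spark, the run length could presumably be derived with the same descending-date window idea used above, but the sketch is enough to check the expected values against the question's second table.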