I have a Dataframe with the following schema:
root
|-- eventTimestamp: long
|-- trackingId: string
|-- voyageStatus: string
Here are some sample rows:
+--------------+----------+------------+
|eventTimestamp|trackingId|voyageStatus|
+--------------+----------+------------+
| 504 |78911c81 |COMPLETE |
| 504 |3b77a150 |ACTIVE |
| 390 |ece6c8d0 |ACTIVE |
| 390 |78911c81 |ACTIVE |
| 349 |3b77a150 |ACTIVE |
| 349 |ece6c8d0 |ACTIVE |
| 349 |78911c81 |ACTIVE |
| 350 |3b77a150 |ACTIVE |
| 350 |ece6c8d0 |ACTIVE |
| 350 |78911c81 |ACTIVE |
| 351 |3b77a150 |ACTIVE |
| 351 |ece6c8d0 |ACTIVE |
| 351 |78911c81 |ACTIVE |
| 352 |3b77a150 |ACTIVE |
| 352 |ece6c8d0 |ACTIVE |
| 352 |78911c81 |ACTIVE |
| 507 |3b77a150 |COMPLETE |
| 349 |ece6c8d0 |ACTIVE |
| 349 |78911c81 |ACTIVE |
| 349 |3b77a150 |ACTIVE |
+--------------+----------+------------+
I want to add a new column of type long called completionEventTimestamp. For each row, this column will have the following value:
- If there is a record with the same
trackingIdas the current row with the value ofvoyageStatusequal to"COMPLETE", then the value will be theeventTimestampof that record. - Otherwise, the value will be -1 (so the row can be filtered out later).
Here is what the transformation would yield for the example above:
+--------------+----------+------------+------------------------+
|eventTimestamp|trackingId|voyageStatus|completionEventTimestamp|
+--------------+----------+------------+------------------------+
| 504 |78911c81 |COMPLETE | 504|
| 504 |3b77a150 |ACTIVE | 507|
| 390 |ece6c8d0 |ACTIVE | -1|
| 390 |78911c81 |ACTIVE | 504|
| 349 |3b77a150 |ACTIVE | 507|
| 349 |ece6c8d0 |ACTIVE | -1|
| 349 |78911c81 |ACTIVE | 504|
| 350 |3b77a150 |ACTIVE | 507|
| 350 |ece6c8d0 |ACTIVE | -1|
| 350 |78911c81 |ACTIVE | 504|
| 351 |3b77a150 |ACTIVE | 507|
| 351 |ece6c8d0 |ACTIVE | -1|
| 351 |78911c81 |ACTIVE | 504|
| 352 |3b77a150 |ACTIVE | 507|
| 352 |ece6c8d0 |ACTIVE | -1|
| 352 |78911c81 |ACTIVE | 504|
| 507 |3b77a150 |COMPLETE | 507|
| 349 |ece6c8d0 |ACTIVE | -1|
| 349 |78911c81 |ACTIVE | 504|
| 349 |3b77a150 |ACTIVE | 507|
+--------------+----------+------------+------------------------+
If this helps at all, if a record with a given trackingId has a voyageStatus of "COMPLETE", then it will be the last such record with that trackingId (if you were to order by eventTimestamp), and there will only be one such record.