0
votes

I have a table where each row contains all the fields that changed during some event, and a flag associated with each field to flag if the field was updated. For simplicity I only show here the "status" field, but they are several other fields as well. In cases where a given field was not modified by the event, the field is set to null and so is the flag.

+----+---------------------+--------+---------------------+
| id |        date         | status | flag_changed_status |
+----+---------------------+--------+---------------------+
|  1 | 2020-01-03 19:32:17 | TODO   |                   1 |
|  1 | 2020-01-08 15:46:07 | WIP    |                   1 |
|  1 | 2020-01-08 15:53:53 |        |                     | //this line was generated because another field changed
|  1 | 2020-01-08 15:56:53 |        |                     | //this line was generated because another field changed
|  1 | 2020-01-08 16:02:31 | Done   |                   1 |
+----+---------------------+--------+---------------------+

My goal is to replace the field values for the rows where the field was not changed with the last value it had when the flag was equal to one, e.g get :

+----+---------------------+--------+---------------------+
| id |        date         | status | flag_changed_status |
+----+---------------------+--------+---------------------+
|  1 | 2020-01-03 19:32:17 | TODO   |                   1 |
|  1 | 2020-01-08 15:46:07 | WIP    |                   1 |
|  1 | 2020-01-08 15:53:53 | WIP    |                     |
|  1 | 2020-01-08 15:56:53 | WIP    |                     |
|  1 | 2020-01-08 16:02:31 | Done   |                   1 |
+----+---------------------+--------+---------------------+

I understand that I want to use the last_value analytical function in Bigquery, and I tried :

SELECT ID_DEMANDE, date, status, 
last_value(status) OVER (ORDER BY flag_changed_status, DATE ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as current_status, flag_changed_status 
FROM table ORDER BY  id, DATE

The idea was that by using the flag in the order by function, the rows where the flag was set to null would be put in first, and then the last_value(status) would be the last value where flag_changed_status was set to 1

But this can only work if I use ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, because the ORDER BY clause will be processed before the window frame clause (rows between ...), thus for the rows where flag_changed_status is null, after the order by is processed, the current row number is 0, so the last value between unbounded preceding and current row is always null.

Is there any way to first run the ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING and then the ORDER BY, so that last_value(status) will return the last value preceding the current row where the flag was set to one ? Or is there something much simpler, still using analytical functions to allow me to complete all the different fields in one query ?

Edit : I really want to copy the status that was set the last time the flag was set, even if this status is null, that is why I am trying to use the flag in the order by. That is if the initial table is :

+----+---------------------+--------+---------------------+
| id |        date         | status | flag_changed_status |
+----+---------------------+--------+---------------------+
|  1 | 2020-01-03 19:32:17 | TODO   |                   1 |
|  1 | 2020-01-08 15:46:07 | null   |                   1 |
|  1 | 2020-01-08 15:53:53 | null   |                null |
|  1 | 2020-01-08 15:56:53 | null   |                null |
|  1 | 2020-01-08 15:57:53 | WIP    |                   1 |
|  1 | 2020-01-08 15:58:53 | null   |                null |
|  1 | 2020-01-08 16:02:31 | Done   |                   1 |
+----+---------------------+--------+---------------------+

I would need:

+----+---------------------+--------+---------------------+
| id |        date         | status | flag_changed_status |
+----+---------------------+--------+---------------------+
|  1 | 2020-01-03 19:32:17 | TODO   |                   1 |
|  1 | 2020-01-08 15:46:07 | null   |                   1 |
|  1 | 2020-01-08 15:53:53 | null   |                null | // we copy the last status where the flag was 1, and it is null
|  1 | 2020-01-08 15:56:53 | null   |                null |
|  1 | 2020-01-08 15:57:53 | WIP    |                   1 |
|  1 | 2020-01-08 15:58:53 | WIP    |                null | //only this line changes
|  1 | 2020-01-08 16:02:31 | Done   |                   1 |
+----+---------------------+--------+---------------------+

But it seems to be too complicated, so I will just replace all the nulls where the flag is set to 1 with a custom status, and then a simple last_value(status IGNORE NULLS) as @gordon-linoff was suggesting will provide almost the desired result

2

2 Answers

1
votes

Below is for BigQuery Standard SQL

#standardSQL
SELECT * EXCEPT(grp),
  LAST_VALUE(status IGNORE NULLS) OVER (PARTITION BY grp ORDER BY date) AS updated_status
FROM (
  SELECT *,
    COUNTIF(flag_changed_status = 1) OVER(ORDER BY `date`) grp
  FROM `project.dataset.table`
)   

if to apply to sample data from your question - result is

Row id  date                status  flag_changed_status updated_status   
1   1   2020-01-03 19:32:17 TODO    1                   TODO     
2   1   2020-01-08 15:46:07 null    1                   null     
3   1   2020-01-08 15:53:53 null    null                null     
4   1   2020-01-08 15:56:53 null    null                null     
5   1   2020-01-08 15:57:53 WIP     1                   WIP  
6   1   2020-01-08 15:58:53 null    null                WIP  
7   1   2020-01-08 16:02:31 Done    1                   Done     
0
votes

I prefer lag(ignore nulls). But BigQuery doesn't support that. Instead, use first_value()/last_value():

with t as (
      select 1 as id, '2020-01-03 19:32:17' as date, 'TODO' as status, 1 as file_changed_status union all
      select 1 as id, '2020-01-08 15:46:07' as date, 'WIP ' as status, 1 as file_changed_status union all
      select 1 as id, '2020-01-08 15:53:53' as date, null as status, null  as file_changed_status union all
      select 1 as id, '2020-01-08 15:56:53' as date, null as status,  null as file_changed_status union all
      select 1 as id, '2020-01-08 16:02:31' as date, 'Done' as status, 1 as file_changed_status
     )
select t.*,
       last_value(status ignore nulls) over (order by date) as imputed_status
from t;