2
votes

I have a table that looks like the below:

Row | Fullvisitorid | Visitid | New_Session_Flag 
1   | A             | 111     | 1
2   | A             | 120     | 0
3   | A             | 128     | 0
4   | A             | 133     | 0
5   | A             | 745     | 1
6   | A             | 777     | 0
7   | B             | 388     | 1
8   | B             | 401     | 0
9   | B             | 420     | 0
10  | B             | 777     | 1
11  | B             | 784     | 0
12  | B             | 791     | 0
13  | B             | 900     | 1  
14  | B             | 904     | 0  

What I want to do is if it's the first row for a fullvisitorid then mark the field as 1, otherwise use the above row as the value, but if the new_session_flag = 1 then use the above row plus 1, example of output I'm looking for below:

Row | Fullvisitorid | Visitid | New_Session_Flag | Rank_Session_Order
1   | A             | 111     | 1                | 1
2   | A             | 120     | 0                | 1
3   | A             | 128     | 0                | 1
4   | A             | 133     | 0                | 1
5   | A             | 745     | 1                | 2
6   | A             | 777     | 0                | 2
7   | B             | 388     | 1                | 1
8   | B             | 401     | 0                | 1
9   | B             | 420     | 0                | 1
10  | B             | 777     | 1                | 2
11  | B             | 784     | 0                | 2
12  | B             | 791     | 0                | 2
13  | B             | 900     | 1                | 3
14  | B             | 904     | 0                | 3

As you can see:

  • Row 1 is 1 because it's the first time fullvisitorid A appears

  • Row 2 is 1 because it's not the first time fullvisitorid A appears and new_session_flag <> 1 therefore it uses the above row (i.e. 1)

  • Row 5 is 2 because it's not the first time fullvisitorid A appears and new_session_Flag = 1 therefore it uses the above row (i.e 1) plus 1

  • Row 7 is 1 because it's the first time fullvisitorid B appears

    etc.

I believe this can be done through a retain statement in SAS but is there an equivalent in Google BigQquery?

Hopefully the above makes sense, let me know if not.

Thanks in advance

2
(1) There is no retains. (2) There are very powerful constructs that are not available in proc SQL. (3) Your question is too difficult to follow. If you show the data as a text table with just the handful of columns you need and then explain what results you want to see, then it would be much simpler to answer. - Gordon Linoff
No. There is no such function. And I second @GordonLinoff - difficult to follow if even possible at all. I suggest you to reshape your question and focus on what you want to accomplish and presenting sample input data as well as expected output in text - not as images! Without doing this - right now your chances to get answer is close to zero. Not a zero though :o) - Mikhail Berlyant
There are full SQL solutions to this, I don't know BigQuery well enough to comment but search for 30 day readmission problem which is essentially what this is, except you're using a much smaller interval. - Reeza
@MikhailBerlyant, thanks for the feedback, I though the screenshots might make things more obvious but obviously not. I've tried to simplify my explanation, can you let me know if this is in a more useful format? - Thomas Chamberlain
absolutely! thank you for following suggestion. see the answer :o) - Mikhail Berlyant

2 Answers

3
votes

Below is for BigQuery Standard SQL

#standardSQL
SELECT *,
  COUNTIF(New_Session_Flag = 1) OVER(PARTITION BY Fullvisitorid  ORDER BY Visitid) Rank_Session_Order
FROM `project.dataset.table`
0
votes

The answer by Mikhail Berlyant using a conditional window count is corret and works. I am answering because I find that a window sum is even simpler (and possibly more efficient on a large dataset):

select
    t.*,
    sum(new_session_flag) over(partition by fullvisitorid order by visid_id) rank_session_order
from mytable t

This works because the new_session_flag contains 0s and 1s only; so counting the 1s is actually equivalent to suming all values.