Query to find records that where created one after another in bigquery

Question

I am playing around with bigquery. Following input is given:

+---------------+---------+---------+--------+----------------------+
|   customer    |  agent  |  value  |  city  |   timestamp          |
+---------------+---------+---------+--------+----------------------+
| 1             | 1       |  106    | LA     |  2019-02-12 03:05pm  |
| 1             | 1       |  251    | LA     |  2019-02-12 03:06pm  |
| 3             | 2       |  309    | NY     |  2019-02-12 06:41pm  |
| 1             | 1       |  654    | LA     |  2019-02-12 05:12pm  |
+---------------+---------+---------+--------+----------------------+

I want to find transactions that where issued one after another (say within 5 minutes) by one and the same agent. So the output for the above table should look like:

+---------------+---------+---------+--------+----------------------+
|   customer    |  agent  |  value  |  city  |   timestamp          |
+---------------+---------+---------+--------+----------------------+
| 1             | 1       |  106    | LA     |  2019-02-12 03:05pm  |
| 1             | 1       |  251    | LA     |  2019-02-12 03:06pm  |
+---------------+---------+---------+--------+----------------------+

The query should somehow group by agent and find such transactions. However the result is not really grouped as you can see from the output. My first thought was using the LEAD function, but I am not sure. Do you have any ideas?

Ideas for a query:

sort by agent and timestamp DESC
start with the first row, look at the following row (using LEAD?)
check if timestamp difference is less than 5 minutes
if so, this two rows should be in the output
continue with next (2nd) row

When the 2nd and 3rd row match the criteria, too, the 2nd row will get into the output, which would cause duplicate rows. I am not sure how to avoid that, yet.

I may be misinterpreting the data, but could you not simply sort by agent and timestamp? — vinoaj
Yes that might be the first step. Once sorted one have to look at the first row, look at the second row and see if the timestamp difference is say less than 5 minutes and the customer is the same. And this should be repeated for all rows. — Upvote

Bobbylank Bobbylank · Accepted Answer · 2019-03-01T08:33:35

There must be an easier way but does this achieve what you are after?

CTE2 AS (
SELECT customer, agent, value, city, timestamp,
  lead(timestamp,1) OVER (PARTITION BY agent ORDER BY timestamp) timestamp_lead,
  lead(customer,1) OVER (PARTITION BY agent ORDER BY timestamp) customer_lead,
  lead(value,1) OVER (PARTITION BY agent ORDER BY timestamp) value_lead,
  lead(city,1) OVER (PARTITION BY agent ORDER BY timestamp) city_lead,
  lag(timestamp,1) OVER (PARTITION BY agent ORDER BY timestamp) timestamp_lag
FROM CTE
)

SELECT agent, 
  if(timestamp_diff(timestamp_lead,timestamp,MINUTE)<5, concat(cast(customer as string),', ',cast(customer_lead as string)),cast(customer as string)) customer, 
  if(timestamp_diff(timestamp_lead,timestamp,MINUTE)<5, concat(cast(value as string),', ',cast(value_lead as string)),cast(value as string)) value,
  if(timestamp_diff(timestamp_lead,timestamp,MINUTE)<5, concat(cast(city as string),', ',cast(city_lead as string)),cast(city as string)) cities,
  if(timestamp_diff(timestamp_lead,timestamp,MINUTE)<5, concat(cast(timestamp as string),', ',cast(timestamp_lead as string)),cast(timestamp as string)) timestamps
FROM CTE2
WHERE (timestamp_diff(timestamp_lead,timestamp,MINUTE)<5 OR NOT timestamp_diff(timestamp,timestamp_lag,MINUTE)<5)

Query to find records that where created one after another in bigquery

2 Answers