1
votes

I am playing around with bigquery. Following input is given:

+---------------+---------+---------+--------+----------------------+
|   customer    |  agent  |  value  |  city  |   timestamp          |
+---------------+---------+---------+--------+----------------------+
| 1             | 1       |  106    | LA     |  2019-02-12 03:05pm  |
| 1             | 1       |  251    | LA     |  2019-02-12 03:06pm  |
| 3             | 2       |  309    | NY     |  2019-02-12 06:41pm  |
| 1             | 1       |  654    | LA     |  2019-02-12 05:12pm  |
+---------------+---------+---------+--------+----------------------+

I want to find transactions that where issued one after another (say within 5 minutes) by one and the same agent. So the output for the above table should look like:

+---------------+---------+---------+--------+----------------------+
|   customer    |  agent  |  value  |  city  |   timestamp          |
+---------------+---------+---------+--------+----------------------+
| 1             | 1       |  106    | LA     |  2019-02-12 03:05pm  |
| 1             | 1       |  251    | LA     |  2019-02-12 03:06pm  |
+---------------+---------+---------+--------+----------------------+

The query should somehow group by agent and find such transactions. However the result is not really grouped as you can see from the output. My first thought was using the LEAD function, but I am not sure. Do you have any ideas?

Ideas for a query:

  • sort by agent and timestamp DESC
  • start with the first row, look at the following row (using LEAD?)
  • check if timestamp difference is less than 5 minutes
  • if so, this two rows should be in the output
  • continue with next (2nd) row

When the 2nd and 3rd row match the criteria, too, the 2nd row will get into the output, which would cause duplicate rows. I am not sure how to avoid that, yet.

2
I may be misinterpreting the data, but could you not simply sort by agent and timestamp? - vinoaj
Yes that might be the first step. Once sorted one have to look at the first row, look at the second row and see if the timestamp difference is say less than 5 minutes and the customer is the same. And this should be repeated for all rows. - Upvote

2 Answers

2
votes

There must be an easier way but does this achieve what you are after?

CTE2 AS (
SELECT customer, agent, value, city, timestamp,
  lead(timestamp,1) OVER (PARTITION BY agent ORDER BY timestamp) timestamp_lead,
  lead(customer,1) OVER (PARTITION BY agent ORDER BY timestamp) customer_lead,
  lead(value,1) OVER (PARTITION BY agent ORDER BY timestamp) value_lead,
  lead(city,1) OVER (PARTITION BY agent ORDER BY timestamp) city_lead,
  lag(timestamp,1) OVER (PARTITION BY agent ORDER BY timestamp) timestamp_lag
FROM CTE
)

SELECT agent, 
  if(timestamp_diff(timestamp_lead,timestamp,MINUTE)<5, concat(cast(customer as string),', ',cast(customer_lead as string)),cast(customer as string)) customer, 
  if(timestamp_diff(timestamp_lead,timestamp,MINUTE)<5, concat(cast(value as string),', ',cast(value_lead as string)),cast(value as string)) value,
  if(timestamp_diff(timestamp_lead,timestamp,MINUTE)<5, concat(cast(city as string),', ',cast(city_lead as string)),cast(city as string)) cities,
  if(timestamp_diff(timestamp_lead,timestamp,MINUTE)<5, concat(cast(timestamp as string),', ',cast(timestamp_lead as string)),cast(timestamp as string)) timestamps
FROM CTE2
WHERE (timestamp_diff(timestamp_lead,timestamp,MINUTE)<5 OR NOT timestamp_diff(timestamp,timestamp_lag,MINUTE)<5)
0
votes

Below is for BigQuery Standard SQL

#standardSQL
SELECT * FROM (
  SELECT *, 
    IF(TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY agent ORDER BY ts), ts, MINUTE) < 5, 
      LEAD(STRUCT(customer AS next_customer, value AS next_value)) OVER(PARTITION BY agent ORDER BY ts), 
    NULL).* 
  FROM `project.dataset.yourtable`
)
WHERE NOT next_customer IS NULL

You can test, play with above using sample data from your question as in below example

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 customer, 1 agent, 106 value,'LA' city, '2019-02-12 03:05pm' ts UNION ALL
  SELECT 1, 1, 251,'LA', '2019-02-12 03:06pm' UNION ALL
  SELECT 3, 2, 309,'NY', '2019-02-12 06:41pm' UNION ALL
  SELECT 1, 1, 654,'LA', '2019-02-12 05:12pm' 
), temp AS (
  SELECT customer, agent, value, city, PARSE_TIMESTAMP('%Y-%m-%d %I:%M%p', ts) ts 
  FROM `project.dataset.table`
)
SELECT * FROM (
  SELECT *, 
    IF(TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY agent ORDER BY ts), ts, MINUTE) < 5, 
      LEAD(STRUCT(customer AS next_customer, value AS next_value)) OVER(PARTITION BY agent ORDER BY ts), 
    NULL).* 
  FROM temp
)
WHERE NOT next_customer IS NULL
-- ORDER BY ts

with result

Row customer    agent   value   city    ts                      next_customer   next_value   
1   1           1       106     LA      2019-02-12 15:05:00 UTC 1               251