0
votes

I'm currently working on SQL analysis. There are 3 tables, the first one contains name_id; timestamp; description. the 2nd and 3rd tables have only name_id; timestamp.

What I'd like to achieve is a table like Goal table below:

  • The first 3 columns of that table would be essentially table_1 ordered;
  • The column other_timestamp should be populated by taking name_id; timestamp from table_1 and searching through table_2 and table_3 by name_id and finding the next following chronological timestamp asociated with the name_id from table_1(could be in either one of the tables);
  • Column source_of_other_timestamp should be populated by showing from which table did the other_timestamp came from (either table_2 or table_3)

table_1

    name_id | timestamp   | description
    name_1  | timestamp_1 | description_1
    name_2  | timestamp_2 | description_3
                  ...
    name_n  | timestamp_n | description_n

table_2 and table_3

    name_id | other_timestamp   
    name_1  | other_timestamp_1 
    name_2  | other_timestamp_2  
                  ...
    name_n  | other_timestamp_n 

Goal table

    name_id | timestamp   | description   | other_timestamp   | source_of_other_timestamp
    name_1  | timestamp_1 | description_1 | other_timestamp_1 | table_2

    name_2  | timestamp_2 | description_2 | other_timestamp_2 | table_3
                  ...
    name_n  | timestamp_n | description_n | other_timestamp_n | table_2

Thank you!

2

2 Answers

1
votes

This can be quite complicated. But I think the most efficient method is a union all and window functions:

with t as (
      select name_id, timestamp, description, 1 as source
      from table1
      union all
      select name_id, timestamp, null, 2
      from table2
      select name_id, timestamp, null, 3
     )
select t.*
from (select t.*,
             max(source) over (partition by other_timestamp, nameid) as other_source
      from (select t.*,
                   min(case when source in (2, 3) then timestamp end) over (partition by nameid order by timetamp desc) as other_timestamp
            from t
           ) t
     ) t
where source = 1;
1
votes

Below is for BigQuery Standard SQL

#standardSQL
WITH tables_2_3 AS (
  SELECT *, 'table_2' source_of_other_timestamp FROM `project.dataset.table_2` UNION ALL
  SELECT *, 'table_3' FROM `project.dataset.table_3` 
)
SELECT t1.*,  
  ARRAY(
    SELECT AS STRUCT other_timestamp, source_of_other_timestamp
    FROM tables_2_3 t23
    WHERE t23.name_id = t1.name_id
    AND t23.other_timestamp > t1.timestamp
    ORDER BY t23.other_timestamp
    LIMIT 1
  )[OFFSET(0)].*
FROM `project.dataset.table_1` t1
-- ORDER BY name_id

with output like below

Row name_id timestamp       description     other_timestamp source_of_other_timestamp    
1   name_1  timestamp_11    description_1   timestamp_21    table_3  
2   name_2  timestamp_12    description_3   timestamp_22    table_2  
3   name_n  timestamp_1n    description_n   timestamp_2n    table_3    

Note: because output gets just one closest entry from either table_2 or table_3 - above can be further refactored / simplified to

#standardSQL
WITH tables_2_3 AS (
  SELECT *, 'table_2' source_of_other_timestamp FROM `project.dataset.table_2` UNION ALL
  SELECT *, 'table_3' FROM `project.dataset.table_3` 
)
SELECT t1.*,  
  (
    SELECT AS STRUCT other_timestamp, source_of_other_timestamp
    FROM tables_2_3 t23
    WHERE t23.name_id = t1.name_id
    AND t23.other_timestamp > t1.timestamp
    ORDER BY t23.other_timestamp
    LIMIT 1
  ).*
FROM `project.dataset.table_1` t1
-- ORDER BY name_id

obviously with same output