1
votes

I have a table in BigQuery which contains self referencing data like -

id  | name         | parent_id
----------------------------
1   | Gross Margin | null
2   | Revenue      | 1
3   | Sales A      | 2
4   | Sales B      | 2
5   | 1001         | 3
6   | 1002         | 4
7   | OPEX         | null
8   | Salaries     | 7
9   | Payroll      | 8
10  | Allowances   | 9
11  | Commissions  | 9

I want to write a query that returns the leaf rows of any row. For example if I give Gross Margin (or 1) as input, the query should return 1001 and 1002 (or 5 and 6) as output. Similarly if I give OPEX (or 7) as input, the query should return Allowances and Commissions (or 10 and 11) as output.

2
I am not seeing the link or relationship between Gross Margin and 1001/1002. What about Revenue, Sales A, and Sales B? You don't want those to be returned? - Korean_Of_the_Mountain
Gross margin is parent of Revenue, Revenue is parent of Sales A and Sales B, and Sales A and Sales B are parent of 1001 and 1002 respectively. This is how Revenue is related to 1001 and 1002. In my use case I just need the leaf rows of any row and not the rows that are in between. - abi_pat

2 Answers

2
votes

Updated script with improved convergence logic

DECLARE run_away_stop INT64 DEFAULT 0;
DECLARE flag BYTES;
   
CREATE TEMP TABLE ttt AS 
SELECT parent_id, ARRAY_AGG(id order by id) children FROM `project.dataset.table` WHERE NOT parent_id IS NULL GROUP BY parent_id;

LOOP
  SET (run_away_stop, flag) = (SELECT AS STRUCT run_away_stop + 1, md5(to_json_string(array_agg(t order by parent_id)))  FROM ttt t);

  CREATE OR REPLACE TEMP TABLE ttt1 AS
    SELECT parent_id, ARRAY(SELECT DISTINCT id FROM UNNEST(children) id order by id) children
    FROM (
      SELECT parent_id, ARRAY_CONCAT_AGG(children) children    
      FROM (
        SELECT t2.parent_id, ARRAY_CONCAT(t1.children, t2.children) children 
        FROM ttt t1, ttt t2 
        WHERE (SELECT COUNTIF(t1.parent_id = id) FROM UNNEST(t2.children) id) > 0
      ) GROUP BY parent_id
    );

  CREATE OR REPLACE TEMP TABLE ttt AS
    SELECT * FROM ttt1 UNION ALL
    SELECT * FROM ttt WHERE NOT parent_id IN (SELECT parent_id FROM ttt1);

  IF (flag = (SELECT md5(to_json_string(array_agg(t order by parent_id))) FROM ttt t)) OR run_away_stop > 20 THEN BREAK; END IF;
END LOOP;

CREATE OR REPLACE TEMP TABLE ttt AS
SELECT id, 
  (
    SELECT STRING_AGG(CAST(id AS STRING) order by id)
    FROM ttt.children id
  ) children_as_list
FROM (SELECT DISTINCT id FROM `project.dataset.table`) d
LEFT JOIN ttt ON id = parent_id;  

SELECT t1.id, STRING_AGG(child) leafs_list
FROM ttt t1, UNNEST(SPLIT(children_as_list)) child
JOIN (SELECT id FROM ttt WHERE children_as_list IS NULL) t2
ON t2.id = CAST(child AS INT64)
GROUP BY t1.id
ORDER BY id; 

When applied to sample data in your question - it took 3 iterations and output is

enter image description here

Also, when applied to bigger example from your comments - it took 5 iterations and output is

enter image description here

0
votes

BigQuery Team just introduced Recursive CTE! Hooray!!

With recursive cte you can use below approach

with recursive  iterations as (
  select id, parent_id from your_table
  where not id in (
    select parent_id from your_table
    where not parent_id is null
  )
  union all 
  select b.id, a.parent_id
  from your_table a join iterations b
  on b.parent_id = a.id 
)
select parent_id, string_agg('' || id order by id) as leafs_list
from iterations
where not parent_id is null
group by parent_id

If applied to sample data in your question - output is

enter image description here

Hope you agree it is more manageable and effective then when we were "forced" to use scripts for such logic!