0
votes

I have a database of family relations:

with example_data as(
  SELECT 'abc' as relative_1, 'def' as relative_2
  union all
  SELECT 'abc' as relative_1, '123' as relative_2
  union all
  SELECT 'def' as relative_1, '334' as relative_2
  union all
  SELECT 'fdc' as relative_1, '123' as relative_2
  union all
  SELECT 'fgl' as relative_1, '342' as relative_2
) 

How can I create base on this data full family clans, so that I receive this kind of result:

enter image description here

The code I wrote to create the desired output does not seem at all practical and indeed with a table of 100k rows of input my query hits the 6 hour limit after the 5th self-join. I'm not worried that might end up with the whole table connected in one long string -- I know there are only so many family members in a clan.

Also alternatively, it would be great, if it was possible to return the result as a nested table with list_all_relatives as a repeated field with minimum numbers of steps between relative_1 and the respective distant relative.

My inefficient code that returns the result in the image:

  SELECT 'abc' as relative_1, 'def' as relative_2, 'abc' as list_0
  union all
  SELECT 'abc' as relative_1, '123' as relative_2, 'abc' as list_0
  union all
  SELECT 'def' as relative_1, '334' as relative_2, 'def' as list_0
  union all
  SELECT 'fdc' as relative_1, '123' as relative_2, 'fdc' as list_0
  union all
  SELECT 'fgl' as relative_1, '342' as relative_2, 'fgl' as list_0
)
,
step_0 as (
 SELECT relative_1, relative_2,
        ARRAY_TO_STRING(ARRAY(SELECT DISTINCT x FROM UNNEST(SPLIT(concat(relative_1,',',relative_2,',',list_0), ',')) AS x ORDER BY x), ',') AS combined_list
  from raw_data
)
,
step_1 as (
SELECT relative_1, relative_2, list_1,
       ARRAY_TO_STRING(ARRAY(SELECT DISTINCT x FROM UNNEST(SPLIT(concat(relative_1,',',relative_2,',',list_1), ',')) AS x ORDER BY x), ',') AS combined_list
from 
    step_0
  left join
    (select relative_2,combined_list as list_1 from step_0)
  using(relative_2)
)
,
step_2 as (
SELECT distinct * except (combined_list,list_1), 
       ARRAY_TO_STRING(ARRAY(SELECT DISTINCT x FROM UNNEST(SPLIT(concat(combined_list,',',list_2), ',')) AS x ORDER BY x), ',') AS combined_list,
from 
    step_1
left join
    (select relative_1,combined_list as list_2 from step_1)
  using(relative_1) 
)
,
step_3 as (
SELECT distinct * except (combined_list,list_2), 
       ARRAY_TO_STRING(ARRAY(SELECT DISTINCT x FROM UNNEST(SPLIT(concat(combined_list,',',list_3), ',')) AS x ORDER BY x), ',') AS combined_list,
from 
    step_2
left join
    (select relative_2,combined_list as list_3 from step_2)
  using(relative_2) 
)
,
step_4 as (
SELECT distinct * except (combined_list,list_3), 
       ARRAY_TO_STRING(ARRAY(SELECT DISTINCT x FROM UNNEST(SPLIT(concat(combined_list,',',list_4), ',')) AS x ORDER BY x), ',') AS combined_list,
from 
    step_3
left join
    (select relative_1,combined_list as list_4 from step_3)
  using(relative_1) 
),

step_N as (
SELECT *
from step_4 
)
,
step_prefinal as (
SELECT distinct
        relative_1,list_4, combined_list, 
        1+length(REGEXP_REPLACE(REGEXP_REPLACE(LOWER(combined_list), '[a-z]', ''),'[0-9]','')) as n_elements_in_list,
        max(1+length(REGEXP_REPLACE(REGEXP_REPLACE(LOWER(combined_list), '[a-z]', ''),'[0-9]',''))) over (partition by relative_1) as longest_list_relatives
from step_N
)
,
step_final as (
SELECT relative_1, combined_list, n_elements_in_list,
       count(*) over (partition by relative_1) as cnt_lists_per_relative,
       max(n_elements_in_list) over (partition by relative_1) as max_elements
from 
  step_prefinal 
 where true
 and longest_list_relatives = n_elements_in_list
 group by 1,2,3
)
,
stats as (
SELECT cnt_lists_per_relative, count(distinct relative_1) as cnt,   
       max(max_elements) as max_elements
from 
step_final
group by 1
order by 1
)

SELECT relative_1, combined_list as list_all_relatives 
from step_final
where true
1
I agree that your code might take too much time to process. Would it work for u to have a table schema with a column "relative_1" and a nested column with all the relatives for the current relative (without any repeated fields in relative_1)? However without the number of steps between the relative_1 and the respective distance. - Alexandre Moraes

1 Answers

0
votes

I was able to create a code to reproduce your output in a much simpler way.

I have used the provided example_data to reproduce it. I have achieved the desired output using UDF and JavaScript. Below is the code:

#Custom UDF to return a list of strings from an array
CREATE TEMP FUNCTION
  rel (relatives ARRAY<String>)
  RETURNS string
  LANGUAGE js AS '''
  return relatives;
''';

#provided data
WITH example_data AS(
  SELECT 'abc' AS relative_1, 'def' AS relative_2
  UNION ALL
  SELECT 'abc' AS relative_1, '123' AS relative_2
  UNION ALL
  SELECT 'def' AS relative_1, '334' AS relative_2
  UNION ALL
  SELECT 'fdc' AS relative_1, '123' AS relative_2
  UNION ALL
  SELECT 'fgl' AS relative_1, '342' AS relative_2
),

#Manipulating the data
data AS 
(
  SELECT
    t2.key,
    ARRAY_AGG(CAST(relative_2 AS string) ) AS relatives 
  FROM example_data t1
  LEFT JOIN ( SELECT DISTINCT relative_1 AS key FROM example_data
    GROUP BY relative_1) t2
  ON key=relative_1
  GROUP BY 1
  ORDER BY key 
)

#selecting the desired fields and using the UDF
SELECT key, rel(relatives) AS list_of_relatives FROM data
ORDER BY key

As you can see, the first step was to create a UDF, which receives the nested field of Strings and simply return them in a list of strings for each "Key". After declaring the example_data, in step 2, some data manipulation had to be performed. In order to achieve a table as following:

enter image description here

As you can see, the table has the relative_1 (named as key) and the relatives in a nested column.

After that, the desired fields are selected. That means, key and the UDF as list_of_relatives, which was written in the first step and the output is as below:

enter image description here

Finally, notice that the list_of_relatives is not a nested field anymore. Instead,it is a String with each value separated by comma. As shown below: enter image description here