1
votes

I am trying to get unique values from multiple columns but since the datastructure is an array I can't directly do DISTINCT on all columns. I am using UNNEST() for each column and performing a UNION ALL for each column.

My idea is to create a UDF so that I can simply give the column name each time instead of performing the select every time.

I would like to replace this Query with a UDF since there are many feature columns and I need to do many UNION ALL.


SELECT DISTINCT user_log as unique_value,
    'user_log' as feature        
FROM `my_table`
left join UNNEST(user_Log) AS user_log
union all 
SELECT DISTINCT page_name as unique_value,
    'user_login_page_name' as feature        
FROM `my_table`
left join UNNEST(PageName) AS page_name
order by feature;

My UDF

CREATE TEMP FUNCTION get_uniques(feature_name ARRAY<STRING>, feature STRING)
AS (
(SELECT DISTINCT feature as unique_value,
    'feature' as feature        
FROM `my_table`
left join UNNEST(feature_name) AS feature));

SELECT get_uniques(user_Log, log_feature);

However the UDF to select the column doesnt really work and gives the error Scalar subquery cannot have more than one column unless using SELECT AS STRUCT to build STRUCT values; failed to parse CREATE [TEMP] FUNCTION statement at [8:1]

There is probably a better way of doing this. Appreciate your help.

1
Do you want this in a single select statement OR you are OK to have script statements too? - Mr.Batra
Do you want to get a function that prints the columns of your struct values or the data of those columns? Also, what you expect to achieve by creating such function? ie: select (id) from get_unique(user_log,id) union all select (page_name) from get_getunique(page,page_name) - Betjens

1 Answers

0
votes

By reading what are you trying to achieve, which is:

My idea is to create a UDF so that i can simply give the column name each time instead of performing the select every time.

One approach could be to use format in combination with execution immediate to create your custom query and get the desirable output.

Below example shows the function using format to return a custom query and execute immediate to retrieve the final query output from the final table. I'm using a public data set so you can also try it out on your side:

CREATE TEMP FUNCTION GetUniqueValues(table_name STRING, col_name STRING, nest_col_name STRING) 
AS (format("SELECT DISTINCT %s.%s as unique_val,'%s' as featured  FROM %s ", col_name,nest_col_name,col_name,table_name));


EXECUTE IMMEDIATE (
select CONCAT(
    (SELECT GetUniqueValues('bigquery-public-data.github_repos.commits','Author','name'))
    ,' union all '
    ,(SELECT GetUniqueValues('bigquery-public-data.github_repos.commits','Committer','name'))
    ,' limit 100'))

output

Row  | unique_val            | featured 
1    | Sergio Garcia Murillo | Committer
2    | klimek                | Committer
3    | [email protected] | Committer
4    | acoul                 | Committer
5    | knghtbrd              | Committer
...  | ...                   | ...
100  | Gustavo Narea         | Committer