Working on a user-defined function on BigQuery to extract emails from messy data sets, I'm facing an issued with ARRAY_AGG() not being allowed in the body of a temp user defined-function (UDF).
CREATE TEMP FUNCTION GET_EMAIL(emails ARRAY<STRING>, index INT64) AS (
ARRAY_AGG(
DISTINCT
(SELECT * FROM
UNNEST(
SPLIT(
REPLACE(
LOWER(
ARRAY_TO_STRING(emails, ",")
)," ", ""
)
)
) AS e where e like '%@%'
) IGNORE NULLS
)[SAFE_OFFSET(index)]
);
SELECT GET_EMAIL(["bob@hotmail.com,test@gmail.com", "12345", "bon@yahoo.com"],1) as email_1
I've tried to bypass the ARRAY_AGG by selecting from UNNEST with OFFSET and then WHERE the offset would be the index.
However, now there's a column limitation (not more than one column in inside a scalar sub-query SELECT clause) suggesting to use a SELECT AS STRUCT instead.
I gave a try to the SELECT AS STRUCT:
CREATE TEMP FUNCTION GET_EMAIL(emails ARRAY<STRING>, index INT64) AS (
(SELECT AS STRUCT DISTINCT list.e, list.o FROM
UNNEST(
SPLIT(
REPLACE(
LOWER(
ARRAY_TO_STRING(emails, ", ")
)," ", ""
)
)
) AS list
WITH OFFSET as list.o
WHERE list.e like '%@%' AND list.o = index)
);
SELECT GET_EMAIL(["bob@hotmail.com,test@gmail.com", "12345", "bob@yahoo.com"],1) as email_1
But it doesn't like my DISTINCT and then even removing it, it will complain about parsing e and o.
So I'm out of ideas here, I probably made a knot. Can anyone suggest how to do this job inside a UDF? Thanks.