0
votes

Working on a user-defined function on BigQuery to extract emails from messy data sets, I'm facing an issued with ARRAY_AGG() not being allowed in the body of a temp user defined-function (UDF).

CREATE TEMP FUNCTION GET_EMAIL(emails ARRAY<STRING>, index INT64) AS (
    ARRAY_AGG(
        DISTINCT 
        (SELECT * FROM 
            UNNEST(
                SPLIT(
                    REPLACE(
                        LOWER(
                            ARRAY_TO_STRING(emails, ",")
                        )," ", ""
                    )
                )
            ) AS e where e like '%@%'
        ) IGNORE NULLS
    )[SAFE_OFFSET(index)]
);

SELECT GET_EMAIL(["bob@hotmail.com,test@gmail.com", "12345", "bon@yahoo.com"],1) as email_1

I've tried to bypass the ARRAY_AGG by selecting from UNNEST with OFFSET and then WHERE the offset would be the index.

However, now there's a column limitation (not more than one column in inside a scalar sub-query SELECT clause) suggesting to use a SELECT AS STRUCT instead.

I gave a try to the SELECT AS STRUCT:

CREATE TEMP FUNCTION GET_EMAIL(emails ARRAY<STRING>, index INT64) AS (
   
    (SELECT AS STRUCT DISTINCT list.e, list.o FROM 
        UNNEST(
            SPLIT(
                REPLACE(
                    LOWER(
                        ARRAY_TO_STRING(emails, ", ")
                    )," ", ""
                )
            )
        ) AS list
        WITH OFFSET as list.o
        WHERE list.e like '%@%' AND list.o = index)
);

SELECT GET_EMAIL(["bob@hotmail.com,test@gmail.com", "12345", "bob@yahoo.com"],1) as email_1

But it doesn't like my DISTINCT and then even removing it, it will complain about parsing e and o.

So I'm out of ideas here, I probably made a knot. Can anyone suggest how to do this job inside a UDF? Thanks.

1

1 Answers

1
votes

Below version works

CREATE TEMP FUNCTION GET_EMAIL(emails ARRAY<STRING>, index INT64) AS ((
    SELECT ARRAY(
        SELECT * 
          FROM UNNEST(
                SPLIT(
                    REPLACE(
                        LOWER(
                            ARRAY_TO_STRING(emails, ",")
                        )," ", ""
                    )
                )
            ) AS e WHERE e LIKE '%@%'
    )[SAFE_OFFSET(index)]
));
SELECT GET_EMAIL(["bob@hotmail.com,test@gmail.com", "12345", "bon@yahoo.com"], 1) AS email_1

with result

Row email_1  
1   test@gmail.com   

Or below version (which is just slight correction of your original query)

CREATE TEMP FUNCTION GET_EMAIL(emails ARRAY<STRING>, index INT64) AS ((
  SELECT ARRAY_AGG(e)[SAFE_OFFSET(index)] 
  FROM UNNEST(
        SPLIT(
            REPLACE(
                LOWER(
                    ARRAY_TO_STRING(emails, ",")
                )," ", ""
            )
        )
    ) AS e WHERE e LIKE '%@%'
));
SELECT GET_EMAIL(["bob@hotmail.com,test@gmail.com", "12345", "bon@yahoo.com"], 1) AS email_1     

obviously with the same result