0
votes

I have a table like the following:

ID col1 col2 col3 col4
A 100 400 30 800
B 600 50 500 75

and I want a query where I can return something like

ID col1 col2 col3 col4
A 0 1 0 1
B 1 0 1 0

Except, I want the logic to look at each row, and for each row find which two columns have the top 2 values. I imagine there may be a few CTEs or subqueries involved. Even getting to a CTE that would produce the following outcome would be good enough, but don't know how to get to this CTE:

ID top_2_col_name
A col2
A col4
B col1
B col3

Is there a way to do aggregate and window functions row-wise instead of column wise? I'm using Google's BigQuery SQL.

4

4 Answers

1
votes

If you want the top two values, then one method is to unnest the values and calculate the ranking and select them:

with t as (
      select 'A' as id, 100 as col1, 400 as col2,  30 as col3, 800 as col4 union all
      select 'B' as id, 600 as col1, 50  as col2, 500 as col3, 75 as col4 
     )
select * except (seqnum)
from (select t.id, col.*, row_number() over (partition by t.id order by col.val desc) as seqnum
      from t cross join
           unnest(array[struct('col1' as col, t.col1 as val),
                              struct('col2', t.col2),
                              struct('col3', t.col3),
                              struct('col4', t.col4)
                             ]
                        ) col
     ) tc
where seqnum <= 2;

This is the second form of your result set.

You can generalize this to any number of columns using a JSON trick. This produces a string and then parses the string for the columns you care about, unnests them and does similar operations:

with t as (
      select 'A' as id, 100 as col1, 400 as col2,  30 as col3, 800 as col4 union all
      select 'B' as id, 600 as col1, 50  as col2, 500 as col3, 75 as col4 
     )
select t.id, concat('col', n), val
from (select t.id, val, n, row_number() over (partition by t.id order by val desc) as seqnum
      from t cross join
           unnest(regexp_extract_all(to_json_string(t), '"col[0-9]+":([0-9]+)')) val with offset n
     ) t
where seqnum <= 2;

This can work on any number of columns. Of course, if you have a data structure like that, the values should really be stored in an array.

1
votes

Even getting to a CTE that would produce the following outcome would be good enough

Below is for BigQuery Standard SQL and generic enough to work for any number of columns and for any column's names - not just

select id, top.*
from `project.dataset.table` t,
unnest(array(
    select as struct u.*
    from unnest(split(translate(to_json_string((select as struct t.* except(id))), '{}"', ''))) kv,
    unnest([struct(
      split(kv, ':')[offset(0)] as top_2_col_name, 
      cast(split(kv, ':')[offset(1)] as int64) as top_2_col_value
    )]) u
    order by top_2_col_value desc
    limit 2
)) top

If to apply to sample data from your question - output is

enter image description here

If above version looks too overengineered to you - below is simpler / less verbose version

select id, top.*
from `project.dataset.table` t,
unnest(array(
    select as struct 
      split(kv, ':')[offset(0)] as top_2_col_name, 
      cast(split(kv, ':')[offset(1)] as int64) as top_2_col_value
    from unnest(split(translate(to_json_string((select as struct t.* except(id))), '{}"', ''))) kv
    order by top_2_col_value desc
    limit 2
)) top

Choose whatever fit your coding preferences

1
votes

Below is for BigQuery Standard SQL and answers your main question (as opposed to workaround with flattened cte - which I also already answered)

create temp table all_columns as 
select split(kv, ':')[offset(0)] col_name, offset
from (select * from `project.dataset.table` limit 1) t,
unnest(split(translate(to_json_string((select as struct t.* except(id))), '{}"', ''))) kv with offset; 

create temp table top_columns as 
select id, 
  array(
    select top_2_col_name
    from unnest(split(translate(to_json_string((select as struct t.* except(id))), '{}"', ''))) kv,
    unnest([struct(
      split(kv, ':')[offset(0)] as top_2_col_name, 
      cast(split(kv, ':')[offset(1)] as int64) as top_2_col_value
    )]) u
    order by top_2_col_value desc
    limit 2
  ) as top_2_col_names
from `project.dataset.table` t; 

execute immediate (
select '''select id, ''' ||
  string_agg('''if("''' || col_name || '''" in unnest(top_2_col_names), 1, 0) as ''' || col_name, ", " order by offset) 
|| ''' from `project.dataset.table` join top_columns using(id)'''
from all_columns
);

If to apply above script to sample data in your question - output is

enter image description here

0
votes

You can use the analytical function max as follows:

Select id,
       Case when Max(col1) over () as = col1 then 1 else 0 end as col1,
       Case when Max(col2) over () as = col2 then 1 else 0 end as col2,
       Case when Max(col3) over () as = col3 then 1 else 0 end as col3,
       Case when Max(col4) over () as = col4 then 1 else 0 end as col4
  From your_table t
 Where id in ('A', 'B')

Please note that this query will give 1 in the column of to both records if they are same. If you do not want such behaviour then do comment.