3
votes

Using a table which looks likes this

WITH
table_one AS(
  SELECT 0.123 feat_1, 0.645 feat_2 , 1 label_bc, 'a' dontcare UNION ALL
  SELECT 0.567, 0.456, 0, 'v'   UNION ALL
  SELECT 0.243, 0.734, 1, 'x'   UNION ALL
  SELECT 0.456, 0.888, 0, 'c'   UNION ALL
  SELECT 0.645, 0.222, 1, 'x'   UNION ALL
  SELECT 0.321, 0.123, 0, 'z'  
)
SELECT * from table_one

I want to calculate the correlation between features and the label_bc. I can manually specify the features to get the desired result like so:

select corr(label_bc, feat_1) as feat1_corr, corr(label_bc, feat_2) as feat2_corr from table_one

While this is precisely what I am looking for it is not desired since I have to type in the column names. Using my real world data, I have a huge number of features and they could vary from time to time. So this approach is not practical.

I can do things like this to get the desired columns. Note this only works with real tables and not a virtual table like in this sample table_one.

all_columns AS (
  SELECT column_name 
  FROM `my-proj`.my_dataset.INFORMATION_SCHEMA.COLUMNS where column_name not in ('dontcare', 'label_bc') AND
  TABLE_NAME = 'table_one' -- will work if real table
)

This returns a table with one column called "column_name" and each row has a string value which is the column names. In this case it would be 'feat_1' and 'feat_2' in two rows.

I can also do something like this

column_name_array as (
  select ARRAY_AGG(all_columns) from all_columns
)

This will return a table with one column and one row. The column will be "all_columns" and the value will be an array of column names. ['feat_1', 'feat_2']

With that said, I can not combine these primatives to give a working solution for any given table with a varying columns to calculate the corr() between all feature columns and the label_bc column. Any guidance is appreciated.


This is not an answer but some additional effort at a solution.

-- Assume a table like this
----------------------------------------------------
-- feat_1   feat_2  feat_3  label_bc   dontcare
----------------------------------------------------
-- 0.123    0.645   0.656   1.0    a
-- 0.567    0.456   -0.056  0.0    b
-- 0.243    0.734   0.754   1.0    c
-- 0.456    0.888   -0.858  0.0    i
-- 0.645    0.222   0.252   1.0    j
-- 0.321    0.123   -0.153  0.0    c
----------------------------------------------------

WITH
-- my_colnames is currently not used
my_colnames (colname) as (
  SELECT 'feat_1' colname UNION ALL
  SELECT 'feat_2' UNION ALL
  SELECT 'feat_3' UNION ALL
  SELECT 'dontcare' UNION ALL
  SELECT 'label_bc'  
),
-- the static sample table
table_one AS (
  SELECT 0.123 feat_1, 
         0.645 feat_2, 
         0.656 feat_3, 
         1.0 label_bc, 
         'a' dontcare 
         UNION ALL
  SELECT 0.567, 
         0.456, 
         -0.056, 
         0.0, 
         'b'   
         UNION ALL
  SELECT 0.243, 
         0.734, 
         0.754, 
         1.0, 
         'c'   
         UNION ALL
  SELECT 0.456, 
         0.888, 
         -0.858, 
         0.0, 
         'i'   
         UNION ALL
  SELECT 0.645, 
         0.222, 
         0.252, 
         1.0, 
         'j'   
         UNION ALL
  SELECT 0.321, 
         0.123, 
         -0.153, 
         0.0, 
         'c'  
),

-- transform of table_one where feat values are moved
-- from column based to row based such as the table_one is
-- is represented like this:
----------------------------------------------------
-- feat_id  feat_val    label_bc    dontcare
----------------------------------------------------
-- 1        0.123   1.0     a
-- 1        0.567   0.0     b
-- 1        0.243   1.0     c
-- 1        0.456   0.0     i
-- 1        0.645   1.0     j
-- 1        0.321   0.0     c
-- 2        0.645   1.0     a
-- 2        0.456   0.0     b
-- 2        0.734   1.0     c
-- 2        0.888   0.0     i
-- 2        0.222   1.0     j
-- 2        0.123   0.0     c
-- 3        0.656   1.0     a
-- 3       -0.056   0.0     b
-- 3        0.754   1.0     c
-- 3       -0.858   0.0     i
-- 3        0.252   1.0     j
-- 3       -0.153   0.0     c
----------------------------------------------------
--
-- Is it possible to create this table given the my_colnames table
-- and not have to manually specify the field names below?
table_two AS (
  select 1 as feat_id, 
       feat_1 as feat_val, 
       label_bc, 
       dontcare 
  from table_one union all
  select 2, 
       feat_2, 
       label_bc, 
       dontcare 
  from table_one union all
  select 3, 
       feat_3, 
       label_bc, 
       dontcare 
  from table_one 
)

-- works
--select * from my_colnames
--select * from table_one
select * from table_two
--select corr(label_bc, feat_1) as feat1_corr, corr(label_bc, feat_2) as feat2_corr, corr(label_bc, feat_3) as feat3_corr from table_one
--select feat_id, corr(label_bc, feat_val) as feat_corr from table_two GROUP BY feat_id;

This is Mikhail's answer with some build up for the SQL newbies such as me. I had to use the intermediate results to understand his excellent method. I thought we would need the features as a number like an id. Here I modify his regexp to simply use the name.

#standardSQL

-- Assume a table like this
----------------------------------------------------
-- feat_1   feat_2  feat_3  label_bc   dontcare
----------------------------------------------------
-- 0.123    0.645   0.656   1.0        a
-- 0.567    0.456   -0.056  0.0        b
-- 0.243    0.734   0.754   1.0        c
-- 0.456    0.888   -0.858  0.0        i
-- 0.645    0.222   0.252   1.0        j
-- 0.321    0.123   -0.153  0.0        c
----------------------------------------------------



WITH table_one AS (
  SELECT 0.123 feat_1, 0.645 feat_2, 0.656 feat_3, 1.0 label_bc, 'a' dontcare UNION ALL
  SELECT 0.567, 0.456, -0.056, 0.0, 'b' UNION ALL
  SELECT 0.243, 0.734, 0.754, 1.0, 'c' UNION ALL
  SELECT 0.456, 0.888, -0.858, 0.0, 'i' UNION ALL
  SELECT 0.645, 0.222, 0.252, 1.0, 'j' UNION ALL
  SELECT 0.321, 0.123, -0.153, 0.0, 'c' 
), 

-- transform of table_one where feat values are moved
-- from column based to row based such as the table_one is
-- represented like this:
----------------------------------------------------
-- feat_id  feat_val    label_bc    dontcare
----------------------------------------------------
-- 1        0.123   1.0     a
-- 1        0.567   0.0     b
-- 1        0.243   1.0     c
-- 1        0.456   0.0     i
-- 1        0.645   1.0     j
-- 1        0.321   0.0     c
-- 2        0.645   1.0     a
-- 2        0.456   0.0     b
-- 2        0.734   1.0     c
-- 2        0.888   0.0     i
-- 2        0.222   1.0     j
-- 2        0.123   0.0     c
-- 3        0.656   1.0     a
-- 3       -0.056   0.0     b
-- 3        0.754   1.0     c
-- 3       -0.858   0.0     i
-- 3        0.252   1.0     j
-- 3       -0.153   0.0     c
----------------------------------------------------


-- regular expression looks for a quotation mark, feat_,  a number one or more times, a quotation mark,
-- a colon, a negative sign zero or more times,  a number one or more times, a decimal point, a number one or more times 
-- SELECT 
--   REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"feat_\d+":-?\d+.\d+') 
-- from table_one as t;


-- Shows how to create a table of coefficients without specifying
-- the table column names for features using regular expressions
-- by examine each row as a json key:value pair string format.
table_two AS (
  SELECT  
    CAST(REGEXP_EXTRACT(SPLIT(kv, ':')[OFFSET(0)], r'feat_(\d+)') AS INT64) AS feat_id, 
    CAST(SPLIT(kv, ':')[SAFE_OFFSET(1)] AS FLOAT64) AS feat_val,
    label_bc, 
    dontcare
  FROM table_one t, 
  UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"feat_\d+":-?\d+.\d+')) kv
),
-- table_2b is first step in building table_2
-- this shows how the table_one table is cross joined with the unnested
-- array of the json for each row of table_one.
-- REGEXP_EXTRACT_ALL does not return a table.  It returns an array. One
-- element for each row in table since it uses to_JSON_STRING
-- kv column looks like
-- "feat_1:0.123
-- The table looks like this
-- Row feat_1   feat_2  feat_3  label_bc    dontcare    kv  
-- 1    0.123   0.645   0.656   1.0       a         "feat_1":0.123
-- 2    0.123   0.645   0.656   1.0       a         "feat_2":0.645
-- 3    0.123   0.645   0.656   1.0       a         "feat_3":0.656
-- 4  0.567   0.456  -0.056   0.0       b         "feat_1":0.567

table_two_b AS (
  SELECT  
    *
  FROM table_one t, 
  UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"feat_\d+":-?\d+.\d+')) kv
),
table_two_c AS (
  SELECT  
    -- split returns an array of strings. offset(0) is the first one 
    -- corresponding to the key name which is the feature_name
    REGEXP_EXTRACT(SPLIT(kv, ':')[OFFSET(0)], r'(.+)') AS feat_name,
    CAST(SPLIT(kv, ':')[SAFE_OFFSET(1)] AS FLOAT64) AS feat_val,
    label_bc    
  FROM table_one t, 
  UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"feat_\d+":-?\d+.\d+')) kv
)




-- select * from table_two_c


-- Shows what I want from table_two
SELECT feat_name, CORR(label_bc, feat_val) AS feat_corr 
FROM table_two_c 
GROUP BY feat_name   

-- Shows how the table is now t.feat_1, and an extra column json_row which is
-- a second copy of the table in one row using {"key":value, "key":value ...} syntax.
--SELECT
--  t,
--  TO_JSON_STRING(t) AS json_row
-- FROM table_one AS t;
1
Are you 100% set on BigQuery? This problem screams python. Either use python to calculate the correlations directly, or a python script to loop through your fields to dynamically build the query you want.rtenha
I did this in R with 4 lines of code. I had to download the csv file separately though. I was hoping for purely a BigQuery framework solution. I'm still looking at the BigQuery Scripting reference so I am optimistic it can be done.netskink

1 Answers

2
votes

Below is for BigQuery Standard SQL

#standardSQL
WITH table_two AS (
  SELECT  
    CAST(REGEXP_EXTRACT(SPLIT(kv, ':')[OFFSET(0)], r'feat_(\d+)') AS INT64) AS feat_id, 
    CAST(SPLIT(kv, ':')[SAFE_OFFSET(1)] AS FLOAT64) AS feat_val,
    label_bc, dontcare
  FROM `project.dataset.table_one` t, 
  UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"feat_\d+":-?\d+.\d+')) kv
)
SELECT feat_id, CORR(label_bc, feat_val) AS feat_corr 
FROM table_two 
GROUP BY feat_id   

You can test, play with above using sample/dummy data as in below example

#standardSQL
WITH table_one AS (
  SELECT 0.123 feat_1, 0.645 feat_2, 0.656 feat_3, 1.0 label_bc, 'a' dontcare UNION ALL
  SELECT 0.567, 0.456, -0.056, 0.0, 'b' UNION ALL
  SELECT 0.243, 0.734, 0.754, 1.0, 'c' UNION ALL
  SELECT 0.456, 0.888, -0.858, 0.0, 'i' UNION ALL
  SELECT 0.645, 0.222, 0.252, 1.0, 'j' UNION ALL
  SELECT 0.321, 0.123, -0.153, 0.0, 'c' 
), table_two AS (
  SELECT  
    CAST(REGEXP_EXTRACT(SPLIT(kv, ':')[OFFSET(0)], r'feat_(\d+)') AS INT64) AS feat_id, 
    CAST(SPLIT(kv, ':')[SAFE_OFFSET(1)] AS FLOAT64) AS feat_val,
    label_bc, dontcare
  FROM table_one t, 
  UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"feat_\d+":-?\d+.\d+')) kv
)
SELECT feat_id, CORR(label_bc, feat_val) AS feat_corr 
FROM table_two 
GROUP BY feat_id   

with result

Row feat_id feat_corr    
1   1       -0.30526201038849277     
2   2       0.0818318512559385   
3   3       0.838349444539397