0
votes

I am trying to build a data structure in BigQuery using SQL which exactly reflects the data structure which I obtain when uploading JSON. This will enable me to query the view using SQL with dot notation instead of having to UNNEST, which I do understand but many of my clients find extremely confusing and unintuitive.

If I build a really simple dummy dataset with a couple of rows and then nest using the ARRAY_AGG(STRUCT([field list])) pattern:


    WITH
      flat_table AS (
      SELECT "BigQuery" AS name, 23 AS user_count, "Data Warehouse" AS data_thing, 5 AS ease_of_use, "Awesome" AS description UNION ALL
      SELECT "MySQL" AS name, 12 AS user_count, "Database" AS data_thing, 3 AS ease_of_use, "Solid" AS description
      )

    SELECT 
    name, user_count, 
    ARRAY_AGG(STRUCT(data_thing, ease_of_use, description)) AS attributes
    FROM flat_table 
    GROUP BY name, user_count

Then saving and viewing the schema shows that the attributes field is Type = RECORD and Mode = REPEATED. Schema field names are:


name                    
    user_count              
    attributes          
    attributes.data_thing   
    attributes.ease_of_use  
    attributes.description 

If I look at the COLUMN information in the INFORMATION_SCHEMA.COLUMNS query I can see that the attributes field is_nullable = NO and data_type = ARRAY<STRUCT<data_thing STRING, ease_of_use INT64, description STRING>>

If I want to query this structure I need to use the UNNEST pattern as below:


    SELECT
      name,
      user_count
    FROM
      nested_table,
      UNNEST(attributes)
    WHERE
      ease_of_use > 3

However when I upload the following JSON representation of the same data to BigQuery with automatic schema detection:


    {"attributes":{"description":"Awesome","ease_of_use":5,"data_thing":"Data Warehouse"},"user_count":23,"name":"BigQuery"}
    {"attributes":{"description":"Solid","ease_of_use":3,"data_thing":"Database"},"user_count":12,"name":"MySQL"}

The schema looks nearly identical once loaded, except for the attributes field is Mode = NULLABLE (it is still Type = RECORD). The INFORMATION_SCHEMA.COLUMNS shows me that the attributes field is now is_nullable = YES and data_type = STRUCT<data_thing STRING, ease_of_use INT64, description STRING>, i.e. now nullable and not in an array.

However the most interesting thing for me is that I can now query this table using dot notation instead of the UNNEST pattern, so the query above becomes:


    SELECT
      name,
      user_count
    FROM
      nested_table_json
    WHERE
      attributes.ease_of_use > 3

Which is arguably easier to read, even in this trivial case. However once we get to more complex data structures with multiple nested fields and multi-level nesting, the UNNEST pattern becomes extremely difficult to write, QA and debug. The dot notation pattern appears to be much more intuitive and scalable.

So my question is: is it possible to build a data structure equivalent to the loaded JSON by writing queries in SQL, enabling us to build Standard SQL queries using dot notation and not requiring complex UNNEST patterns?

1

1 Answers

1
votes

If you know that your array_agg will produce one row, you can drop the ARRAY notation like this:

SELECT 
    name, user_count, 
    ARRAY_AGG(STRUCT(data_thing, ease_of_use, description))[offset(0)] AS attributes

notice the use of OFFSET(0) this way the returned output will be:

[
  {
    "name": "BigQuery",
    "user_count": "23",
    "attributes": {
      "data_thing": "Data Warehouse",
      "ease_of_use": "5",
      "description": "Awesome"
    }
  }
]

which can be queried using dot notation.

In case you want just to group result in STRUCT, you don't need array_agg.

WITH
      flat_table AS (
      SELECT "BigQuery" AS name, 23 AS user_count, struct("Data Warehouse" AS data_thing, 5 AS ease_of_use, "Awesome" AS description) as attributes UNION ALL
      SELECT "MySQL" AS name, 12 AS user_count, struct("Database" AS data_thing, 3 AS ease_of_use, "Solid" AS description)
      )

    SELECT 
    *
    FROM flat_table