This question is a more complex variant of this one. I am trying to load event data into BigQuery, where the JSON has the following structure:

{
"simpleKV1String": "Foo",
"simpleKV2String": "Bar",
"simpleKV3String": "qux",
"complexKV": {
    "subKey1WithArrayInt": [1, 2, 3],
    "subkey2": {
        "subkey2Subkey1String": "corge",
        "subkey2Subkey2String": "grault",
        "subkey2Subkey3Int": 666       
        }
    }
}

The idea is that the simpleKV keys simply map one-to-one to BigQuery String columns. For the complexKV 'column', however, we have 3 options:

  1. Keep the nested value of the complexKV key as a JSON blob in a BigQuery String field.
  2. Completely normalize it.
  3. Map it to a complex BigQuery data type, as is done (again) in the other Stack Overflow example.
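
To make the three options concrete, here is a minimal Python sketch of what a single row could look like under each one. The helper names (`as_blob`, `as_normalized`, `as_nested`) and the `complexKV_` column prefix are illustrative assumptions, and option 2 is read here as flattening into prefixed columns; a fully normalized design could also mean separate tables.

```python
import json

# The sample event from the question.
event = {
    "simpleKV1String": "Foo",
    "simpleKV2String": "Bar",
    "simpleKV3String": "qux",
    "complexKV": {
        "subKey1WithArrayInt": [1, 2, 3],
        "subkey2": {
            "subkey2Subkey1String": "corge",
            "subkey2Subkey2String": "grault",
            "subkey2Subkey3Int": 666,
        },
    },
}


def simple_columns(ev):
    """Copy every top-level key except complexKV."""
    return {k: v for k, v in ev.items() if k != "complexKV"}


def as_blob(ev):
    """Option 1: complexKV serialized into one string column."""
    row = simple_columns(ev)
    row["complexKV"] = json.dumps(ev["complexKV"], sort_keys=True)
    return row


def flatten(value, prefix):
    """Turn nested dicts into prefixed flat keys; lists are kept as-is."""
    out = {}
    for key, child in value.items():
        name = f"{prefix}_{key}"
        if isinstance(child, dict):
            out.update(flatten(child, name))
        else:
            out[name] = child
    return out


def as_normalized(ev):
    """Option 2 (one possible reading): one flat column per leaf value."""
    row = simple_columns(ev)
    row.update(flatten(ev["complexKV"], "complexKV"))
    return row


def as_nested(ev):
    """Option 3: keep the nesting, to be loaded into a nested column type."""
    return dict(ev)
```

With option 1 the table never changes when complexKV grows, while options 2 and 3 need a new column or nested field for every new subkey.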

Our requirements:

  • We already know the nested schema of complexKV is going to evolve: a subkey3 (and maybe 4) that has nested data itself will be added in the future.
  • We want to minimize the changes on the BigQuery table (rooting for option 1).
  • We want using the nested data to be as simple as possible (rooting for option 2 or 3).

As BigQuery does not support schema evolution AFAIK, I think we are left with option 1, which unfortunately makes using the data more complex...

Am I right on this or is there a smarter way to do this?

You have a low accept rate. On SO it is important to mark accepted answers by using the tick to the left of the posted answer, below the voting. This will increase your rate. See how this works by visiting this link: meta.stackoverflow.com/questions/5234/… - Pentium10

1 Answer

If your subkeys evolve and new subkeys 3 and 4 appear, I would add them as new columns with their respective structure.

I would combine the approaches and use both, or at least two, of the methods.

  1. I would keep the data as a JSON blob. This way every row holds all the data in one place, exactly as it was collected, and it can later be used to re-materialize columns. We used this kind of column and wrote views to simplify extracting JSON attributes, and then queried the views instead of writing longer queries.
  2. I would also normalize as much as possible into its own columns. Storage is cheap, so storing the same data both as a JSON blob and as columns is affordable nowadays, and it gives a complementary solution. You write your own queries and reference whichever column suits the query better.
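
As a rough sketch of this dual-write idea, the Python function below builds one row that carries both the normalized columns and the raw blob. The function name `to_row` is hypothetical, and storing the whole event (rather than only complexKV) in the blob is an assumption based on "all the data as it was collected".

```python
import json

# Sample event from the question.
event = {
    "simpleKV1String": "Foo",
    "simpleKV2String": "Bar",
    "simpleKV3String": "qux",
    "complexKV": {
        "subKey1WithArrayInt": [1, 2, 3],
        "subkey2": {
            "subkey2Subkey1String": "corge",
            "subkey2Subkey2String": "grault",
            "subkey2Subkey3Int": 666,
        },
    },
}


def to_row(ev):
    """Build a row holding both the normalized columns and the JSON blob."""
    complex_kv = ev.get("complexKV", {})
    return {
        "simpleKV1String": ev.get("simpleKV1String"),
        "simpleKV2String": ev.get("simpleKV2String"),
        "simpleKV3String": ev.get("simpleKV3String"),
        # Known nested parts get their own columns.
        "subKey1WithArrayInt": complex_kv.get("subKey1WithArrayInt", []),
        "subkey2": complex_kv.get("subkey2"),
        # Assumption: the blob keeps the entire event, so subkeys that do
        # not have a column yet are still stored on the row.
        "meta": json.dumps(ev, sort_keys=True),
    }


row = to_row(event)
```

Because the blob is written for every row, a subkey that appears before the table has a column for it is not lost.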

So based on this I would go with 6 columns:

  • simpleKV1String
  • simpleKV2String
  • simpleKV3String
  • subKey1WithArrayInt: [1, 2, 3]
  • subkey2: { "subkey2Subkey1String": "corge", "subkey2Subkey2String": "grault", "subkey2Subkey3Int": 666 }
  • meta (json blob)
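
The six columns above could be described in BigQuery's JSON schema format roughly as follows. This is a sketch: the `field` and `add_column` helpers are hypothetical, the column types are inferred from the sample event, and the inner field of `subkey3` is invented purely for illustration since its real structure is not known yet.

```python
def field(name, type_, mode="NULLABLE", fields=None):
    """Build one field entry in BigQuery's JSON schema format."""
    spec = {"name": name, "type": type_, "mode": mode}
    if fields:
        spec["fields"] = fields
    return spec


SCHEMA = [
    field("simpleKV1String", "STRING"),
    field("simpleKV2String", "STRING"),
    field("simpleKV3String", "STRING"),
    # An array of integers maps to a REPEATED INTEGER column.
    field("subKey1WithArrayInt", "INTEGER", mode="REPEATED"),
    # A nested object maps to a RECORD with its own sub-fields.
    field("subkey2", "RECORD", fields=[
        field("subkey2Subkey1String", "STRING"),
        field("subkey2Subkey2String", "STRING"),
        field("subkey2Subkey3Int", "INTEGER"),
    ]),
    # The raw JSON blob is stored as a plain string.
    field("meta", "STRING"),
]


def add_column(schema, new_field):
    """Return a copy of the schema with one more column appended."""
    if any(f["name"] == new_field["name"] for f in schema):
        raise ValueError(f"column {new_field['name']} already exists")
    if new_field["mode"] == "REQUIRED":
        # Columns added to an existing table should not be REQUIRED,
        # because the rows already stored have no value for them.
        raise ValueError("new columns must be NULLABLE or REPEATED")
    return schema + [new_field]


# Hypothetical future subkey3; its inner field name is made up.
evolved = add_column(
    SCHEMA,
    field("subkey3", "RECORD", fields=[field("subkey3Subkey1String", "STRING")]),
)
```

Dumping `SCHEMA` with `json.dumps` gives a list of field definitions in the shape BigQuery schema files use. Appending NULLABLE or REPEATED columns is, as far as I know, a supported change on an existing BigQuery table, which is what makes adding subkey3 later workable.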

Also, as already mentioned, you can use the JSON blob to re-materialize some columns.
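
To illustrate the re-materialization step outside of BigQuery, here is a small Python sketch that backfills a new column from the blob. The helper names `extract` and `rematerialize`, the key path, and the sample rows are assumptions; inside BigQuery the same idea would normally be expressed in SQL with its JSON functions, such as JSON_EXTRACT_SCALAR.

```python
import json


def extract(blob, path, default=None):
    """Walk a list of keys into a JSON string; return default if missing."""
    node = json.loads(blob)
    for key in path:
        if isinstance(node, dict) and key in node:
            node = node[key]
        else:
            return default
    return node


def rematerialize(rows, column, path, blob_column="meta"):
    """Return copies of the rows with `column` filled from the blob."""
    return [{**row, column: extract(row[blob_column], path)} for row in rows]


# Two stored rows: only the first was collected after subkey3 appeared.
# The content of subkey3 is made up for this example.
rows = [
    {"meta": json.dumps({"complexKV": {"subkey3": {"example": 1}}})},
    {"meta": json.dumps({"complexKV": {}})},
]
backfilled = rematerialize(rows, "subkey3", ["complexKV", "subkey3"])
```

Rows collected before the subkey existed simply get an empty value for the new column.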