0
votes

If needing to denormalize a nested JSONB structure into a flat table, can it be realistically done in a single SELECT if I'm mixing key/values and arrays like the following? I have table containing two columns - repository and data and an example data column value is like the following:

{
    "branches": {
        "master": {
            "type1": {
                "files": [{
                    "filename": "a.txt",
                    "contents": [{
                        "identifier": "foo",
                        "value": "bar"
                    }]
                }]
            }
        }
    }
}

My goal is to have a query that flattens this structure e.g. into the following CSV pseudocode:

repository,branch,type,filename,identifier,value
a/b,master,type1,a.txt,foo,bar

I have kept the JSON example as simple as possible, but would ultimately end up with around a million lines.

Although the need for a single data JSON column is fixed, the format of it can be refactored if it would make the denormalization easier or more efficient. e.g. instead of branches and types being keyed off their unique identifiers, these could be converted into arrays with identifiers instead.

1
Where does the column repository come from in the output? Btw: you are actually normalizing the de-normalized data you have.a_horse_with_no_name
Can you please fix the JSON? It's invalid because the number of { braces doesn't match the number of } braces. And the contents array isn't closed properly. Is contents really an array? If yes what if there are multiple identifiers and values in there? Should they be shown as rows or columns in the output?a_horse_with_no_name

1 Answers

1
votes

Assuming the elements of contents array should be returned as rows and a valid JSON structure like this:

{
  "branches": {
    "master": {
      "type1": {
        "files": [{"filename": "a.txt","contents": [{"identifier": "foo","value": "bar"}, {"identifier": "a2", "value": "a value"}]},
                  {"filename": "b.txt","contents": [{"identifier": "id2","value": "something else"}]}]
      }
    }
  }
}  

Then you can use this:

select b.branch, 
       t.type, 
       f.file ->> 'filename' as filename, 
       c.content ->> 'identifier' as identifier, 
       c.content ->> 'value' as value
from the_table tt
  cross join jsonb_each(tt.the_column -> 'branches') as b(branch, branch_value)
  cross join jsonb_each(b.branch_value) as t(type, type_value)
  cross join jsonb_array_elements(t.type_value -> 'files') as f(file)
  cross join jsonb_array_elements(f.file -> 'contents') as c(content)

With the corrected JSON above, this returns:

branch | type  | filename | identifier | value         
-------+-------+----------+------------+---------------
master | type1 | a.txt    | foo        | bar           
master | type1 | a.txt    | a2         | a value       
master | type1 | b.txt    | id2        | something else

If contents isn't actually an array, so something like this:

{
  "branches": {
    "master": {
      "type1": {
        "files": [{"filename": "a.txt","contents": {"identifier": "foo","value": "bar"}},
                  {"filename": "b.txt","contents": {"identifier": "id2","value": "something else"}}]
      }
    }
  }
}

Then you can use this query:

select b.branch, t.type, f.file ->> 'filename' as filename, f.file #>> '{contents, identifier}' as identifier, f.file #>> '{contents,value}' as value
from the_table tt
  cross join jsonb_each(tt.the_column -> 'branches') as b(branch, branch_value)
  cross join jsonb_each(b.branch_value) as t(type, type_value)
  cross join jsonb_array_elements(t.type_value -> 'files') as f(file)

which returns:

branch | type  | filename | identifier | value         
-------+-------+----------+------------+---------------
master | type1 | a.txt    | foo        | bar           
master | type1 | b.txt    | id2        | something else