I have the following JSON structure stored as a string in a Hive table column:
[
{"outer_id": 123000, "outer_field_1": "blah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]},
{"outer_id": 123001, "outer_field_1": "blahblah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]},
{"outer_id": 123002, "outer_field_1": "blahblahblah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}
]
Now I want to parse this into a table where each element of the outer array becomes a row, and each field of the JSON objects becomes a column, while keeping inner_list as a string:
| outer_id | outer_field_1 | inner_list |
|----------|---------------|------------|
| 123000   | blah          | struct     |
| 123001   | blahblah      | struct     |
| 123002   | blahblahblah  | struct     |
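To pin down the semantics I'm after (this is just plain Python for illustration, not a Spark solution; the `raw` value is a made-up sample mirroring the column):

```python
import json

# A sample value as it would appear in the Hive column: one JSON string.
raw = '''
[
  {"outer_id": 123000, "outer_field_1": "blah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]},
  {"outer_id": 123001, "outer_field_1": "blahblah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}
]
'''

rows = []
for obj in json.loads(raw):            # each outer array element -> one row
    rows.append({
        "outer_id": obj["outer_id"],
        "outer_field_1": obj["outer_field_1"],
        # keep the nested array as a JSON string, not a parsed structure
        "inner_list": json.dumps(obj["inner_list"]),
    })

for r in rows:
    print(r)
```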
Now, I am aware of the trick of using a regex to insert a custom separator, splitting on it, and then using lateral view explode, but in this case the nested arrays will also match the regex: Parse json arrays using HIVE
Any ideas on how to do this? I want to do it in raw Spark SQL if possible, with no UDFs or SerDes.
What I have tried:
```sql
select explode(get_json_object(outer_list, '$[*]')) from wt_test;
```
This doesn't work: it says the input to function explode should be array or map type, not string.
```sql
select explode(split(substr(outer_list, 2, length(outer_list) - 2), ',')) from wt_test;
```
This splits on every single comma, including the ones inside the nested arrays, which is not what we want:
{"outer_id": 123000
"outer_field_1": blah
"inner_list": [{"inner_id": 456}
{"inner_id": 789}]}
... more rows ...
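The underlying problem is that a flat split can't tell top-level commas from commas inside the nested arrays. A depth-aware split fixes that; here is a sketch in plain Python (not Spark SQL) just to illustrate the idea. It tracks bracket depth and splits only at depth 0; it does not handle commas inside quoted string values, which is fine for data like the above:

```python
import json

def split_top_level(s: str) -> list[str]:
    """Split a JSON array string on top-level commas only,
    ignoring commas nested inside inner [...] or {...}."""
    body = s.strip()[1:-1]              # drop the outer [ and ]
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(body):
        if ch in "[{":
            depth += 1
        elif ch in "]}":
            depth -= 1
        elif ch == "," and depth == 0:  # comma between outer elements
            parts.append(body[start:i].strip())
            start = i + 1
    tail = body[start:].strip()
    if tail:
        parts.append(tail)
    return parts

raw = ('[{"outer_id": 123000, "inner_list": [{"inner_id": 456}, {"inner_id": 789}]},'
       ' {"outer_id": 123001, "inner_list": [{"inner_id": 456}]}]')

parts = split_top_level(raw)
print(parts)
```

Each element of `parts` is one complete outer object, so exploding on those pieces would give one row per object instead of one row per comma.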