
I have the following JSON structure stored as a string in a Hive table column:

[  
{"outer_id": 123000, "outer_field_1": "blah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]},  
{"outer_id": 123001, "outer_field_1": "blahblah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]},  
{"outer_id": 123002, "outer_field_1": "blahblahblah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}  
]

Now I want to parse this into a table, with each element of the outer array becoming a row and each field of the JSON objects becoming a column, while still keeping inner_list as a string:


| outer_id | outer_field_1 | inner_list |
|----------|---------------|------------|
| 123000   | blah          | struct     |
| 123001   | blahblah      | struct     |
| 123002   | blahblahblah  | struct     |

Now I am aware of the trick of using a regex to insert a custom separator, splitting on it, and then using lateral view explode; but in this case there are also nested arrays that will match the regex: Parse json arrays using HIVE

Any ideas on how to do this? I want to do it in raw Spark SQL if possible, with no UDFs or SerDes.

What I have tried:

  1. select explode(get_json_object(outer_list, "$[*]")) from wt_test;

This doesn't work; it says the input to function explode should be array or map type, not string.

  2. select explode(split(substr(outer_list, 2, length(outer_list)-2), ",")) from wt_test;

This splits on every single comma, producing one row per fragment, which is not what we want:

{"outer_id": 123000
"outer_field_1": blah
"inner_list": [{"inner_id": 456}
{"inner_id": 789}]}
... more rows ...
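To see why the split trick breaks down here, a quick pure-Python sketch (outside Spark, just to illustrate; a real JSON parser handles the nested commas, which is essentially what Spark's from_json does):

```python
import json

# Two elements of the outer array, as a string (values quoted so it is valid JSON)
outer_list = '''[
  {"outer_id": 123000, "outer_field_1": "blah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]},
  {"outer_id": 123001, "outer_field_1": "blahblah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}
]'''

# Naive comma split: many more pieces than array elements,
# because commas inside inner_list also match.
naive_rows = outer_list.strip()[1:-1].split(",")
print(len(naive_rows))  # 8 fragments, not 2 rows

# A real JSON parse keeps the nesting intact; re-serializing inner_list
# keeps it as a string, mirroring the desired output table.
rows = [
    (o["outer_id"], o["outer_field_1"], json.dumps(o["inner_list"]))
    for o in json.loads(outer_list)
]
print(rows[0])  # (123000, 'blah', '[{"inner_id": 456}, {"inner_id": 789}]')
```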
Could you maybe edit your desired output a little bit more? I am not sure what you expect. – Matt

@Matt edited for clarity – Lumo Woong

So inner_list is not important? – Matt

It is, but at this point I just want to explode the outer list; the whole set of requirements is kinda complicated... Let me edit the question to still keep the inner list as a string. – Lumo Woong

Ahh sorry, I didn't fully read it. So basically you have a table like this, right: {"a": 1, "outer_list":[{"outer_id": 123000, "outer_field_1": blah, "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}, {"outer_id": 123001, "outer_field_1": blahblah, "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}, {"outer_id": 123002, "outer_field_1": blahblahblah, "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}, ]} – Matt

1 Answer


Assuming I understood correctly, you have the following:

Input

{
   "some_id":1,
   "outer_list":'[{"outer_id": 123000, "outer_field_1": "blah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}, {"outer_id": 123001, "outer_field_1": "blahblah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}, {"outer_id": 123002, "outer_field_1": "blahblahblah", "inner_list": [{"inner_id": 456}, {"inner_id": 789}]}]'
}

Desired Output:

| outer_id | outer_field_1 | inner_list |
|----------|---------------|------------|
| 123000   | blah          | struct     |
| 123001   | blahblah      | struct     |
| 123002   | blahblahblah  | struct     |

First of all, you need to parse the string, and for that you define a schema:

schema = ArrayType(
   StructType([StructField('outer_id', IntegerType()), 
               StructField('outer_field_1', StringType()), 
               StructField('inner_list', StringType())])
)

Notice this is the simple version where inner_list is just kept as a string.

Apply that schema on your dataframe:

df = df.select(from_json('outer_list', schema).alias('test'))

Now you have a column with an array:

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|test                                                                                                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[123000, blah, [{"inner_id":456},{"inner_id":789}]], [123001, blahblah, [{"inner_id":456},{"inner_id":789}]], [123002, blahblahblah, [{"inner_id":456},{"inner_id":789}]]]|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Now you can explode this:

df.select(explode('test').alias('exploded')).select('exploded.*')

Which gives you:

+--------+-------------+-----------------------------------+
|outer_id|outer_field_1|inner_list                         |
+--------+-------------+-----------------------------------+
|123000  |blah         |[{"inner_id":456},{"inner_id":789}]|
|123001  |blahblah     |[{"inner_id":456},{"inner_id":789}]|
|123002  |blahblahblah |[{"inner_id":456},{"inner_id":789}]|
+--------+-------------+-----------------------------------+

Since you are parsing outer_list anyway, you can do the same with inner_list from the start. But maybe try that yourself first; you have everything you need for it here.

Don't forget to import:

from pyspark.sql.functions import *
from pyspark.sql.types import *
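If you later do want inner_list parsed as well, a sketch of the fuller schema (an assumption that the inner objects only carry inner_id; no test needed, it is just the type declaration that replaces the one above):

```python
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType, StringType

# Same outer schema, but inner_list is now parsed into an array of structs
# instead of being kept as a plain string.
nested_schema = ArrayType(
    StructType([
        StructField('outer_id', IntegerType()),
        StructField('outer_field_1', StringType()),
        StructField('inner_list', ArrayType(
            StructType([StructField('inner_id', IntegerType())])
        )),
    ])
)
```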

SQL version, if the input is given as a table json_test:

select exploded.* from (
  select explode(
           from_json(
             outer_list,
             "array<struct<outer_id:int,outer_field_1:string,inner_list:string>>"
           )
         ) as exploded
  from json_test
) t
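The same idea carries over to the SQL version: extend the schema string so inner_list is parsed too (a sketch, under the same assumption that the inner objects only contain inner_id):

```sql
select exploded.* from (
  select explode(
           from_json(
             outer_list,
             "array<struct<outer_id:int,outer_field_1:string,inner_list:array<struct<inner_id:int>>>>"
           )
         ) as exploded
  from json_test
) t
```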