0
votes

Expanding rows based on the json sub branch.

ex:

{"attr1" : "attrValue1",
"attr2" : "attrValue2",
"properties": {
    "prop1" : "propValue1",
    "prop2" : "propValue2"
    }
}

result dataframe:

attr1      | attr2      | propertyKey | propertyValue

attrValue1 | attrValue2 | prop1       | propValue1
attrValue1 | attrValue2 | prop2       | propValue2
2

2 Answers

1
votes

Suppose you have a data frame as:

df.show()
+----------+----------+--------------------+
|     attr1|     attr2|          properties|
+----------+----------+--------------------+
|attrValue1|attrValue2|Map(prop2 -> prop...|
+----------+----------+--------------------+

You can use explode function with alias to create the two columns, one corresponding to the key, and another corresponding to the value:

In pyspark:

import pyspark.sql.functions as F
df.select('*', F.explode(df.properties).alias('propertyKey', 'propertyValue')).drop('properties').show()
+----------+----------+-----------+-------------+
|     attr1|     attr2|propertyKey|propertyValue|
+----------+----------+-----------+-------------+
|attrValue1|attrValue2|      prop2|   propValue2|
|attrValue1|attrValue2|      prop1|   propValue1|
+----------+----------+-----------+-------------+
0
votes

Hope this helps!

import json

#sample data - convert JSON to dataframe
js = {"attr1" : "attrValue1",
      "attr2" : "attrValue2",
      "properties": {
              "prop1" : "propValue1",
              "prop2" : "propValue2"
              }
      }
df = sqlContext.read.json(sc.parallelize([json.dumps(js)]))
df.show()

#convert above dataframe to desired format
#wide format
df = df.select("*", "properties.*").drop("properties")
df.show()

#long format
df = df.selectExpr("attr1", "attr2", "stack(2, 'prop1', prop1, 'prop2', prop2) as (propertyKey, propertyValue)")
df.show()

Sample data:

+----------+----------+--------------------+
|     attr1|     attr2|          properties|
+----------+----------+--------------------+
|attrValue1|attrValue2|[propValue1,propV...|
+----------+----------+--------------------+

Data in wide format:

+----------+----------+----------+----------+
|     attr1|     attr2|     prop1|     prop2|
+----------+----------+----------+----------+
|attrValue1|attrValue2|propValue1|propValue2|
+----------+----------+----------+----------+

Output data (in long format):

+----------+----------+-----------+-------------+
|     attr1|     attr2|propertyKey|propertyValue|
+----------+----------+-----------+-------------+
|attrValue1|attrValue2|      prop1|   propValue1|
|attrValue1|attrValue2|      prop2|   propValue2|
+----------+----------+-----------+-------------+