I have some data that is stored in CSV. Sample data is available here - https://github.com/PranayMehta/apache-spark/blob/master/data.csv
I read the data using PySpark:
df = spark.read.csv("data.csv",header=True)
df.printSchema()
root
|-- freeform_text: string (nullable = true)
|-- entity_object: string (nullable = true)
>>> df.show(truncate=False)
+---------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|freeform_text |entity_object |
+---------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Grapes are good. Bananas are bad.|[{'name': 'Grapes', 'type': 'OTHER', 'salience': '0.8335162997245789', 'sentiment_score': '0.8999999761581421', 'sentiment_magnitude': '0.8999999761581421', 'metadata': {}, 'mentions': {'mention_text': 'Grapes', 'mention_type': 'COMMON'}}, {'name': 'Bananas', 'type': 'OTHER', 'salience': '0.16648370027542114', 'sentiment_score': '-0.8999999761581421', 'sentiment_magnitude': '0.8999999761581421', 'metadata': {}, 'mentions': {'mention_text': 'Bananas', 'mention_type': 'COMMON'}}]|
|the weather is not good today |[{'name': 'weather', 'type': 'OTHER', 'salience': '1.0', 'sentiment_score': '-0.800000011920929', 'sentiment_magnitude': '0.800000011920929', 'metadata': {}, 'mentions': {'mention_text': 'weather', 'mention_type': 'COMMON'}}] |
+---------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Now, I want to explode and parse the fields in the entity_object column in this dataframe. Here is some more detail on what this column contains -

For every freeform_text stored in the Spark dataframe, I have written some logic to parse out the entities using Google's Natural Language API. These entities are stored as a LIST of DICTIONARIES when I do the computation using pandas. I then convert them to a string before storing them in the database. This CSV is what I read into the Spark dataframe as 2 columns - freeform_text and entity_object.

The entity_object column, as a string, is actually a LIST of dictionaries. It can be imagined as LIST[ DICT1, DICT2 ] and so on. So some entity_object rows may have 1 element while others may have more than 1, depending on the number of entities in the output. For instance, in the first row there are 2 entities - grapes and bananas - whereas in the 2nd row there is only the entity weather.
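One detail worth noting: the stored strings use single quotes, so they are Python dict reprs rather than strict JSON (json.loads would reject them). In plain Python, e.g. during the pandas step, such a string can be recovered with ast.literal_eval. A minimal sketch, with one entity_object string hardcoded from the sample data above:

```python
import ast

# One entity_object string as stored in the CSV (single-quoted Python literal)
raw = ("[{'name': 'weather', 'type': 'OTHER', 'salience': '1.0', "
       "'sentiment_score': '-0.800000011920929', 'sentiment_magnitude': '0.800000011920929', "
       "'metadata': {}, 'mentions': {'mention_text': 'weather', 'mention_type': 'COMMON'}}]")

entities = ast.literal_eval(raw)  # safely evaluates the literal into a list of dicts
print(entities[0]["name"], entities[0]["salience"])  # -> weather 1.0
```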
I want to explode this entity_object column so that 1 record of freeform_text can be exploded into multiple records.
Here is a screenshot of how I would like my output to be -
from_json? – Steven