
I have a dataframe in PySpark with a string column containing the value [{"AppId":"APACON","ExtId":"141730"}] (the string is exactly like that in my column; it is a string, not an array).

I want to convert this to an array of struct.

Can I do that simply with a native Spark function, or do I have to parse the string or use a UDF?

sqlContext.createDataFrame(
    [ (1,'[{"AppId":"APACON","ExtId":"141730"}]'),
      (2,'[{"AppId":"APACON","ExtId":"141793"}]'),
    ],
    ['idx','txt']
).show()

+---+--------------------+
|idx|                 txt|
+---+--------------------+
|  1|[{"AppId":"APACON...|
|  2|[{"AppId":"APACON...|
+---+--------------------+
Comments:

You need from_json. – philantrovert
Not available in 2.0... any alternative? – Steven
You could try pyspark.sql.functions.get_json_object, which will parse the txt column and create one column per field with the associated values. – plalanne
@plalanne It parses the JSON, but I don't know how I can access the elements of the list. – Steven
I edited my answer and proposed a regexp to ignore the square brackets in df.txt as a first step. Let me know if this helps. – plalanne

1 Answer


With Spark 2.1 or above

You have the following data:

import pyspark.sql.functions as F
from pyspark.sql.types import *

df = sqlContext.createDataFrame(
    [ (1,'[{"AppId":"APACON","ExtId":"141730"}]'),
      (2,'[{"AppId":"APACON","ExtId":"141793"}]'),
    ],
    ['idx','txt']
)

You can indeed use pyspark.sql.functions.from_json as follows:

schema = StructType([StructField("AppId", StringType()),
                     StructField("ExtId", StringType())])
# from_json parses the JSON string into a struct matching the schema
df = df.withColumn('array', F.from_json(F.col('txt'), schema))
df.show()

+---+--------------------+---------------+
|idx|                 txt|          array|
+---+--------------------+---------------+
|  1|[{"AppId":"APACON...|[APACON,141730]|
|  2|[{"AppId":"APACON...|[APACON,141793]|
+---+--------------------+---------------+
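Note that with a StructType schema, from_json returns a single struct per row rather than the array of structs the question asks for (the single-element arrays here make the two look the same). On more recent versions, from_json also accepts an ArrayType schema, which yields a genuine array of structs. A minimal sketch; this behaviour is verified on Spark 2.4+, and availability on 2.2/2.3 is an assumption I have not checked:

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

# Wrap the element schema in ArrayType so the whole JSON array is parsed
array_schema = ArrayType(StructType([StructField("AppId", StringType()),
                                     StructField("ExtId", StringType())]))

df2 = df.withColumn('parsed', F.from_json(F.col('txt'), array_schema))
# Elements are then addressable by index and field name
df2.select('idx', df2.parsed[0]['ExtId'].alias('first_ext_id')).show()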


With Spark < 2.1

One way to work around the missing function is to first strip the square brackets from your input string:

# Use regexp_extract to strip the square brackets
df = df.withColumn('txt_parsed', F.regexp_extract(F.col('txt'), '[^\\[\\]]+', 0))
df.show(truncate=False)

+---+-------------------------------------+-----------------------------------+
|idx|txt                                  |txt_parsed                         |
+---+-------------------------------------+-----------------------------------+
|1  |[{"AppId":"APACON","ExtId":"141730"}]|{"AppId":"APACON","ExtId":"141730"}|
|2  |[{"AppId":"APACON","ExtId":"141793"}]|{"AppId":"APACON","ExtId":"141793"}|
+---+-------------------------------------+-----------------------------------+

Then you can use pyspark.sql.functions.get_json_object to parse the txt_parsed column:

df = df.withColumn('AppId', F.get_json_object(df.txt_parsed, '$.AppId'))
df = df.withColumn('ExtId', F.get_json_object(df.txt_parsed, '$.ExtId'))
df.select('idx', 'txt_parsed', 'AppId', 'ExtId').show()


+---+--------------------+------+------+
|idx|          txt_parsed| AppId| ExtId|
+---+--------------------+------+------+
|  1|{"AppId":"APACON"...|APACON|141730|
|  2|{"AppId":"APACON"...|APACON|141793|
+---+--------------------+------+------+
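Finally, if neither option is available to you, a small UDF with json.loads remains a valid fallback and returns the exact array-of-struct column the question asks for. This is only a sketch; the UDF name and the parsed column name are mine, not from the original answer:

import json
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

element_schema = StructType([StructField("AppId", StringType()),
                             StructField("ExtId", StringType())])

# json.loads yields a list of dicts, which PySpark maps onto the
# declared ArrayType(StructType) return type
parse_json_array = F.udf(lambda s: json.loads(s) if s is not None else None,
                         ArrayType(element_schema))

df = df.withColumn('parsed', parse_json_array(F.col('txt')))

Keep in mind that a Python UDF is slower than the native functions above, since every value is shipped to the Python worker for parsing.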