I am having an issue in parsing inconsistent datatypes in pyspark. As shown in the example file below, SA key contains always a dictionary but sometimes it can appear as string value. When I try to fetch the column SA.SM.Name, I get the exception as shown below.
How do I put null for SA.SM.Name column in pyspark/hive for the values having other than JSONs. Can someone help me please?
I tried to cast to different datatypes but nothing worked or may be I would be doing something wrong.
Input file Contents: mypath
{"id":1,"SA":{"SM": {"Name": "John","Email": "[email protected]"}}}
{"id":2,"SA":{"SM": {"Name": "Jerry","Email": "[email protected]"}}}
{"id":3,"SA":"STRINGVALUE"}
df=spark.read.json(my_path)
df.registerTempTable("T")
spark.sql("""select id,SA.SM.Name from T """).show()
Traceback (most recent call last): File "", line 1, in File "/usr/lib/spark/python/pyspark/sql/session.py", line 767, in sql return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped) File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in call File "/usr/lib/spark/python/pyspark/sql/utils.py", line 69, in deco raise AnalysisException(s.split(': ', 1)[1], stackTrace) pyspark.sql.utils.AnalysisException: "Can't extract value from SA#6.SM: need struct type but got string; line 1 pos 10"