3
votes

I have been trying to work through an example of converting a JSON string to a DataFrame in Spark, following the official documentation here.

The following case works fine:

jsonStrings = ['{"name":"Yin","address":{"city":"Columbus","state":true}}']
otherPeopleRDD = sc.parallelize(jsonStrings)
otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.show()

====== OUTPUT =======
+----------------+----+
|         address|name|
+----------------+----+
|[Columbus, true]| Yin|
+----------------+----+

root
 |-- address: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- state: boolean (nullable = true)
 |-- name: string (nullable = true)

But I get an error when I pass the boolean value as True (capitalized, as it would appear in Python):

jsonStrings = ['{"name":"Yin","address":{"city":"Columbus","state":True}}']
otherPeopleRDD = sc.parallelize(jsonStrings)
otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.show()

====== OUTPUT =======
+--------------------+
|     _corrupt_record|
+--------------------+
|{"name":"Yin","ad...|
+--------------------+

root
 |-- _corrupt_record: string (nullable = true)

To give some context: I am calling a REST API to get JSON data using the requests library in Python, then obtaining the JSON by calling .json() on the response. This gives me JSON in which boolean values are capitalized as in Python (true becomes True, false becomes False). I think this is the intended behavior, but when I pass this JSON to Spark, it complains about the format of the JSON string, as shown above.

resp = requests.get(url, params=query_str, cookies=cookie_str)
jsonString = resp.json()

I have read through the documentation and searched the web, but didn't find anything that might help. Can someone please help me out here?

UPDATE: I found one possible explanation. This may be because of the JSON encoding and decoding offered by the json library in Python. Link But that still doesn't explain why PySpark is not able to handle Python's JSON encoding.
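A quick demonstration of the encoding difference I mean, in plain Python with no Spark involved (the capitalized booleans come from Python's repr of the parsed object, while json.dumps writes lowercase literals again):

```python
import json

data = {"state": True}

# Python's repr of the object: capitalized True, NOT valid JSON
print(str(data))         # {'state': True}

# json.dumps serializes it back to valid JSON: lowercase true
print(json.dumps(data))  # {"state": true}
```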


2 Answers

1
votes

Use the json module to re-serialize the response; json.dumps writes valid JSON, so True becomes true again. Note that sc.parallelize expects a collection, so wrap the string in a list (a bare string would be split into individual characters):

import json

spark_friendly_json = json.dumps(resp.json())
otherPeopleRDD = sc.parallelize([spark_friendly_json])
otherPeople = spark.read.json(otherPeopleRDD)
0
votes

It’s giving an error because Spark is not able to convert your data to the desired data type, since you are providing the T in caps: True is not a valid JSON literal, as JSON requires lowercase true and false. If you convert the value to lowercase true when producing the JSON, it will work as mentioned. There is no issue with your code otherwise, and you can recreate the error with any other invalid literal. In your case Spark is inferring the schema, the inference fails, and the row ends up in _corrupt_record. You can either provide a schema explicitly or keep "True" within quotes so that it is treated as a string.
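To make the literal rules above concrete, here is a small sketch using Python's json module as a stand-in for Spark's (equally strict) JSON parser; the three variants behave the same way in both:

```python
import json

# Lowercase literal: valid JSON, parsed as a boolean
valid = '{"name":"Yin","address":{"city":"Columbus","state":true}}'
assert json.loads(valid)["address"]["state"] is True

# Quoted value: valid JSON, but parsed as the string "True"
quoted = '{"name":"Yin","address":{"city":"Columbus","state":"True"}}'
assert json.loads(quoted)["address"]["state"] == "True"

# Capitalized literal: rejected by any strict JSON parser
invalid = '{"name":"Yin","address":{"city":"Columbus","state":True}}'
try:
    json.loads(invalid)
except json.JSONDecodeError:
    print("not valid JSON")  # this branch is taken
```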