3
votes

I have been trying to work through an example of converting a JSON string to a DataFrame in Spark, following the official documentation here.

The following case works fine:

jsonStrings = ['{"name":"Yin","address":{"city":"Columbus","state":true}}']
otherPeopleRDD = sc.parallelize(jsonStrings)
otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.show()

====== OUTPUT =======
+----------------+----+
|         address|name|
+----------------+----+
|[Columbus, true]| Yin|
+----------------+----+

root
 |-- address: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- state: boolean (nullable = true)
 |-- name: string (nullable = true)

But I get an error when I pass the boolean value as True (capitalized, as it would appear in Python):

jsonStrings = ['{"name":"Yin","address":{"city":"Columbus","state":True}}']
otherPeopleRDD = sc.parallelize(jsonStrings)
otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.show()

====== OUTPUT =======
+--------------------+
|     _corrupt_record|
+--------------------+
|{"name":"Yin","ad...|
+--------------------+

root
 |-- _corrupt_record: string (nullable = true)

To give some context: I am calling a REST API to get JSON data using the requests library in Python, then obtaining the JSON by calling .json() on the response. This gives me JSON in which boolean values are capitalized as in Python (true becomes True, false becomes False). I think this is the intended behavior, but when I pass this JSON to Spark, it complains about the format of the JSON string, as shown above.

resp = requests.get(url, params=query_str, cookies=cookie_str)
jsonString = resp.json()

I have read through the documentation and searched the web, but didn't find anything that might help. Can someone please help me out here?

UPDATE: I found one possible explanation. This may be because of the JSON encoding and decoding offered by the json library in Python. Link But that still doesn't explain why PySpark is not able to handle Python's JSON encoding.
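A quick demonstration of the encoding difference I mean, in plain Python with no Spark involved (the capitalized booleans come from Python's repr of the parsed object, while json.dumps writes lowercase literals again):

```python
import json

data = {"state": True}

# Python's repr of the object: capitalized True, NOT valid JSON
print(str(data))         # {'state': True}

# json.dumps serializes it back to valid JSON: lowercase true
print(json.dumps(data))  # {"state": true}
```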


2 Answers

1
votes

Use the json module to re-serialize the response; json.dumps writes valid JSON, so True becomes true again. Note that sc.parallelize expects a collection, so wrap the string in a list (a bare string would be split into individual characters):

import json

spark_friendly_json = json.dumps(resp.json())
otherPeopleRDD = sc.parallelize([spark_friendly_json])
otherPeople = spark.read.json(otherPeopleRDD)
0
votes

It’s giving an error because Spark is not able to convert your data to the desired data type, since you are providing the T in caps: True is not a valid JSON literal, as JSON requires lowercase true and false. If you convert the value to lowercase true when producing the JSON, it will work as mentioned. There is no issue with your code otherwise, and you can recreate the error with any other invalid literal. In your case Spark is inferring the schema, the inference fails, and the row ends up in _corrupt_record. You can either provide a schema explicitly or keep "True" within quotes so that it is treated as a string.
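To make the literal rules above concrete, here is a small sketch using Python's json module as a stand-in for Spark's (equally strict) JSON parser; the three variants behave the same way in both:

```python
import json

# Lowercase literal: valid JSON, parsed as a boolean
valid = '{"name":"Yin","address":{"city":"Columbus","state":true}}'
assert json.loads(valid)["address"]["state"] is True

# Quoted value: valid JSON, but parsed as the string "True"
quoted = '{"name":"Yin","address":{"city":"Columbus","state":"True"}}'
assert json.loads(quoted)["address"]["state"] == "True"

# Capitalized literal: rejected by any strict JSON parser
invalid = '{"name":"Yin","address":{"city":"Columbus","state":True}}'
try:
    json.loads(invalid)
except json.JSONDecodeError:
    print("not valid JSON")  # this branch is taken
```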