
I'm using pyspark to create a dataframe from a JSON file.

The structure of the JSON file is as follows:

[
  {
    "Volcano Name": "Abu",
    "Country": "Japan",
    "Region": "Honshu-Japan",
    "Location": {
      "type": "Point",
      "coordinates": [
        131.6,
        34.5
      ]
    },
    "Elevation": 571,
    "Type": "Shield volcano",
    "Status": "Holocene",
    "Last Known Eruption": "Unknown",
    "id": "4cb67ab0-ba1a-0e8a-8dfc-d48472fd5766"
  },
  {
    "Volcano Name": "Acamarachi",
    "Country": "Chile",
    "Region": "Chile-N",
    "Location": {
      "type": "Point",
      "coordinates": [
        -67.62,
        -23.3
      ]
    }
  }
]

I will read in the file using the following line of code:

myjson = spark.read.json("/FileStore/tables/sample.json")

However, instead of the expected columns, the resulting DataFrame contains only a single corrupt-record column:

myjson:pyspark.sql.dataframe.DataFrame
_corrupt_record:string

Can someone let me know what I might be doing wrong?

Is the problem with the structure of the json file?


1 Answer


It seems your JSON spans multiple lines (a pretty-printed array), while spark.read.json by default expects one JSON object per line, so the whole file is treated as a single corrupt record. To fix this, enable the multiline option:

myjson = (spark.read
          .option("multiline", "true")
          .option("mode", "PERMISSIVE")
          .json("/FileStore/tables/sample.json"))

Hope this solves the issue.
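An alternative, if you control how the file is produced, is to rewrite it as JSON Lines (one compact object per line), which Spark's JSON reader parses by default without the multiline option. A minimal sketch using only the standard library; the records here are trimmed-down illustrations based on the fields shown in the question:

```python
import json

# A pretty-printed JSON array like the one in the question. By default,
# spark.read.json expects JSON Lines (one object per line), so a file
# formatted like this ends up in a single _corrupt_record column.
pretty = """[
  {"Volcano Name": "Abu", "Country": "Japan", "Region": "Honshu-Japan"},
  {"Volcano Name": "Acamarachi", "Country": "Chile", "Region": "Chile-N"}
]"""

# Parse the array, then write each record back out as one compact line.
records = json.loads(pretty)
json_lines = "\n".join(json.dumps(r) for r in records)
print(json_lines)
```

Each line of json_lines is now a self-contained JSON document, so the plain spark.read.json call from the question would parse a file with this layout directly.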