0
votes

Spark SQL nested JSON error :

{
  "xxxDetails":{  
      "yyyData":{  
         "0":{  
            "additionalData":{  

            },
            "quantity":80000,
            "www":12.6,
            "ddd":5.0,
            "eee":72000,
            "rrr":false
         },
         "130":{  
            "additionalData":{  
               "quantity":1
            },
            "quantity":0,
            "www":1.0,
            "ddd":0.0,
            "eee":0,
            "rrr":false
         },
         "yyy":{  
            "additionalData":{  
               "quantity":1
            },
            "quantity":0,
            "www":1.0,
            "ddd":0.0,
            "eee":0,
            "rrr":false
         }       
      }
   },
   "mmmDto":{  
      "id":0,
      "name":"",
      "data":null
   }
 }

when reading spark.sql("select cast (xxxDetails.yyyData.yyy.additionalData.quantity as Long) as quantity from table") it will work but: spark.sql("select cast (xxxDetails.yyyData.130.additionalData.quantity as Long) as quantity from table") will throw Exception :

org.apache.spark.sql.catalyst.parser.ParseException: no viable alternative at input 'cast (xxxDetails.yyyData.130.

When I"m usning datafame API for myDF.select("xxxDetails.yyyData.130.additionalData.quantity") its work . Anyone with decent explanation :)

1

1 Answers

3
votes

It's because SQL column names are expected to start with a letter or some other characters like _, @ or # but not a digit. Let's consider this simple example:

Seq((1, 2)).toDF("x", "666").createOrReplaceTempView("test")

Calling spark.sql("SELECT x FROM test").show() would output

+---+
|  x|
+---+
|  1|
+---+

but calling spark.sql("SELECT 666 FROM test").show() instead outputs

+---+
|666|
+---+
|666|
+---+

because 666 is interpreted as literal, not a column name. To fix this, the column name needs to be quoted using backticks:

spark.sql("SELECT `666` FROM test").show()
+---+
|666|
+---+
|  2|
+---+