In a MySQL JDBC data source that is loaded into Spark there is a column which contains JSON as a string.
// JDBC connection: load the table into a DataFrame
val verDf = spark.read.format("jdbc")
  .option("driver", driver).option("url", url).option("dbtable", verticesTable)
  .option("user", user).option("password", pass)
  .load()
verDf.printSchema
root
|-- id: integer (nullable = true)
|-- url: string (nullable = true)
|-- al: string (nullable = true) <-- this column holds the JSON string
|-- batch_id: integer (nullable = true)
|-- x: double (nullable = true)
|-- y: double (nullable = true)
|-- z: double (nullable = true)
|-- size: double (nullable = true)
The JSON is in the al column and only a single value is needed from it. How can I extract it? I've seen the from_json/schema_of_json approach and it looks expensive and bulky: a schema has to be created first, then the whole JSON is unwrapped into a struct/Map, etc. (see the sketch below of how I understand that approach).
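Just to show what I mean by bulky, this is roughly the flow I have in mind (the field name someField is made up, and I'm assuming a newer Spark version where from_json accepts the schema as a Column):

import org.apache.spark.sql.functions.{from_json, lit, schema_of_json}
import spark.implicits._

// 1. pull one sample JSON value to the driver and infer a schema for the whole object
val sample = verDf.select($"al").as[String].first
val alSchema = schema_of_json(lit(sample))
// 2. parse the entire JSON column into a struct using that schema
val parsed = verDf.withColumn("al_struct", from_json($"al", alSchema))
// 3. only now can the single value be selected (hypothetical field name)
parsed.select($"al_struct.someField").show()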
val schema = schema_of_json(lit(verDf.select($"al").as[String].first))
So when I run this schema_of_json line on its own it either times out or runs for a matter of minutes (4-8 minutes).
- I don't understand why: it should only take the first row and parse the JSON to produce a schema for it. Why does it take so long (the JSON value is about 1-2 kilobytes, a really small object)?
- Is there any chance to use some function similar to MySQL's
json_extract()
to extract a single value from the JSON string relatively fast? Something along the lines of the sketch below is what I'm hoping for.
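For illustration, this is the kind of one-liner I'd like to end up with. I spotted get_json_object in org.apache.spark.sql.functions, but I'm not sure whether it is the right or fast way here; the $.someField path is just a placeholder:

import org.apache.spark.sql.functions.get_json_object

// extract a single value straight from the JSON string, no schema inference
verDf.select($"id", get_json_object($"al", "$.someField").alias("someField")).show()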