
I am having an issue with defining the structure for the JSON document.

Now I am trying to apply the same schema on readStream.

val jsonSchema = StructType([ StructField("associatedEntities", struct<driver:StringType,truck:StringType>, True), 
                          StructField("heading", StringType, True), 
                          StructField("location", struct<accuracyType:StringType,captureDateTime:StringType,cityStateCode:StringType,description:StringType,latitude:DoubleType,longitude:DoubleType,quality:StringType,transmitDateTime:StringType>, True), 
                          StructField("measurements", array<struct<type:StringType,uom:StringType,value:StringType>>, True), 
                          StructField("source", struct<entityType:StringType,key:StringType,vendor:StringType>, True), 
                          StructField("speed", DoubleType, True)])

val df = spark
 .readStream
 .format("eventhubs")
 //.schema(jsonSchema) 
 .options(ehConf.toMap)
 .load()

When I run this cell in the notebook I get: ":15: error: illegal start of simple expression val jsonSchema = StructType([ StructField("associatedEntities", struct, True),"

Edit: The goal is to get the data into a dataframe. I can get the JSON string from the body of the Event Hub message, but I am not sure what to do from there if I can't get the schema to work.

Comments:

  • Check this SO question: stackoverflow.com/questions/46568435/… (Abhi)
  • How would I handle array<struct<type:StringType,uom:StringType,value:StringType>> in that .add style? (user1552172)
  • It doesn't seem that a schema is needed for the Event Hub; I am trying to take the binary Body column that holds the JSON object and then structure it. (user1552172)
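In reply to the comment above: a sketch of the array-of-struct field in the `.add` builder style (assuming Spark's `org.apache.spark.sql.types` API; field names are taken from the question, the surrounding schema is trimmed for illustration):

```scala
import org.apache.spark.sql.types._

// The element type of the array is itself a StructType, built with .add
val measurementElement = new StructType()
  .add("type", StringType)
  .add("uom", StringType)
  .add("value", StringType)

// Top-level schema in the same .add style; the array field wraps the
// element struct in an ArrayType
val schema = new StructType()
  .add("heading", StringType)
  .add("measurements", ArrayType(measurementElement))
```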

1 Answer


You get the error message because of your schema definition: square brackets and `True` are PySpark syntax, but this is a Scala cell. In Scala the schema definition should look something like this:

import org.apache.spark.sql.types._

val jsonSchema = StructType(Seq(
  StructField("associatedEntities", StructType(Seq(
    StructField("driver", StringType),
    StructField("truck", StringType)
  ))),
  StructField("heading", StringType),
  StructField("measurements", ArrayType(StructType(Seq(
    StructField("type", StringType),
    StructField("uom", StringType),
    StructField("value", StringType)
  ))))
))

You can doublecheck the schema with:

jsonSchema.printTreeString

Giving you the schema back:

root
 |-- associatedEntities: struct (nullable = true)
 |    |-- driver: string (nullable = true)
 |    |-- truck: string (nullable = true)
 |-- heading: string (nullable = true)
 |-- measurements: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- uom: string (nullable = true)
 |    |    |-- value: string (nullable = true)
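As an alternative (a sketch, assuming a Spark version where `StructType.fromDDL` is available, 2.2+), the same schema can be written as a single DDL string, which avoids the nested `StructField` boilerplate:

```scala
import org.apache.spark.sql.types.StructType

// Same shape as the printTreeString output above, expressed as DDL;
// no SparkSession is needed to parse it
val ddlSchema = StructType.fromDDL(
  "associatedEntities struct<driver:string,truck:string>, " +
  "heading string, " +
  "measurements array<struct<type:string,uom:string,value:string>>"
)
```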

As mentioned in the comments, you get binary data, so first load the raw dataframe:

val rawData = spark.readStream
  .format("eventhubs")
  .option(...)
  .load()

You have to:

  • convert the data to a string
  • parse the nested json
  • and flatten it

Define the dataframe with the parsed data:

import org.apache.spark.sql.functions.from_json
import spark.implicits._

val parsedData = rawData
   .selectExpr("cast (Body as string) as json")
   .select(from_json($"json", jsonSchema).as("data"))
   .select("data.*")
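To sanity-check the schema without a running stream, the same `from_json` call can be applied to a literal JSON string in a batch DataFrame (a sketch; the sample payload and the trimmed schema below are invented for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").appName("schema-check").getOrCreate()
import spark.implicits._

// Trimmed version of the schema from the answer above
val jsonSchema = StructType(Seq(
  StructField("heading", StringType),
  StructField("measurements", ArrayType(StructType(Seq(
    StructField("type", StringType),
    StructField("uom", StringType),
    StructField("value", StringType)
  ))))
))

// Hypothetical payload shaped like the schema, for testing only
val sample = Seq("""{"heading":"N","measurements":[{"type":"temp","uom":"C","value":"20"}]}""")
  .toDF("json")

// Same transformation as the streaming query, minus the cast from binary
val parsed = sample
  .select(from_json($"json", jsonSchema).as("data"))
  .select("data.*")
```

If `from_json` returns nulls here, the schema does not match the payload, which is much easier to diagnose in batch than inside a running stream.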