I found this blog on the Databricks website. It shows how Spark SQL's APIs can be used to consume and transform complex data streams from Apache Kafka:
https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html
There is a section explaining how a UDF can be used to deserialize rows:
object MyDeserializerWrapper {
  val deser = new MyDeserializer
}
spark.udf.register("deserialize", (topic: String, bytes: Array[Byte]) =>
  MyDeserializerWrapper.deser.deserialize(topic, bytes)
)

df.selectExpr("""deserialize("topic1", value) AS message""")
I am using Java and therefore had to write the following sample UDF, to check how it can be called from Java:
UDF1<byte[], String> mode = new UDF1<byte[], String>() {
    @Override
    public String call(byte[] bytes) throws Exception {
        // decode the Kafka value bytes and prefix them, just to prove the UDF was applied
        String s = new String(bytes);
        return "_" + s;
    }
};
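Note that callUDF("mode", …) in the next snippet can only resolve the name if the UDF is registered with the session first. A one-line sketch of the registration I am assuming (string return type to match the sample above, DataTypes from org.apache.spark.sql.types):

spark.udf().register("mode", mode, DataTypes.StringType);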
Now I can use this UDF in the Structured Streaming word count example, as follows:
Dataset<String> words = df
        // previously: .selectExpr("CAST(value AS STRING)")
        // now the value bytes go through the UDF, and the resulting single-column
        // DataFrame is converted to a Dataset of String using .as(Encoders.STRING())
        .select(callUDF("mode", col("value")))
        .as(Encoders.STRING())
        .flatMap(
            new FlatMapFunction<String, String>() {
                @Override
                public Iterator<String> call(String x) {
                    return Arrays.asList(x.split(" ")).iterator();
                }
            }, Encoders.STRING());
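For completeness, the rest of the pipeline is just the standard word count from the Structured Streaming guide; a sketch with a console sink only to eyeball the prefixed words:

// running count of each (prefixed) word
Dataset<Row> wordCounts = words.groupBy("value").count();

StreamingQuery query = wordCounts.writeStream()
        .outputMode("complete")
        .format("console")
        .start();

query.awaitTermination();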
The next step for me is to write a UDF for the Thrift deserialization. I will post it as soon as I am done.
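In the meantime, here is roughly the shape I expect it to take, following the wrapper pattern from the blog so the non-serializable Thrift TDeserializer is created on the executor rather than shipped with the UDF closure. This is only a sketch: MyEvent stands in for the actual Thrift-generated class, and returning a plain String is just for illustration.

// needed imports: org.apache.spark.sql.api.java.UDF1, org.apache.spark.sql.types.DataTypes,
// org.apache.thrift.TDeserializer, org.apache.thrift.protocol.TBinaryProtocol

// Mirrors the blog's MyDeserializerWrapper: one deserializer per executor JVM,
// so the TDeserializer itself is never captured in the UDF closure.
class ThriftDeserializerWrapper {
    static final TDeserializer DESER = create();

    private static TDeserializer create() {
        try {
            return new TDeserializer(new TBinaryProtocol.Factory());
        } catch (Exception e) { // newer Thrift releases declare a checked exception here
            throw new RuntimeException(e);
        }
    }
}

// in the job setup; "MyEvent" is a placeholder for the real Thrift-generated class
UDF1<byte[], String> deserializeThrift = new UDF1<byte[], String>() {
    @Override
    public String call(byte[] bytes) throws Exception {
        MyEvent event = new MyEvent();
        ThriftDeserializerWrapper.DESER.deserialize(event, bytes);
        return event.toString(); // or pull out the individual fields instead
    }
};

spark.udf().register("deserializeThrift", deserializeThrift, DataTypes.StringType);

df.select(callUDF("deserializeThrift", col("value")));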