0
votes

I'm struggling with Kafka Spark streaming and a dynamic schema. I'm consuming from Kafka (KafkaUtils.createDirectStream); each message is JSON, its fields can be nested, and any field can appear in some messages and be missing from others.

The only approach I found is described in: Spark 2.0 implicit encoder, deal with missing column when type is Option[Seq[String]] (scala)

case class MyTyp(column1: Option[Any], column2: Option[Any], ...) I'm not sure whether this will cover fields that may or may not appear, as well as nested fields.
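A minimal sketch of that idea, assuming concrete field types (Spark cannot derive an encoder for Option[Any], so every optional column needs a real type) and the Spark 2.2+ spark.read.json(Dataset[String]) variant; MyTyp and the column names are placeholders:

import org.apache.spark.sql.SparkSession

// Hypothetical record type: every column is Option[...] with a concrete type,
// because Spark has no encoder for Option[Any].
case class MyTyp(column1: Option[String], column2: Option[Seq[String]])

object CaseClassSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("case-class-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Sample JSON messages; column2 is missing from the second record.
    val json = Seq(
      """{"column1":"a","column2":["x","y"]}""",
      """{"column1":"b"}"""
    ).toDS()

    // A field that is absent in a given record comes back as null
    // and maps to None in the case class.
    val parsed = spark.read.json(json).as[MyTyp]
    parsed.show()
  }
}

This only works when the inferred batch schema contains every column of the case class; if a column is missing from the whole batch, the .as[MyTyp] cast fails, which is exactly the problem discussed in the linked question.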

Any confirmation of this approach, other ideas, or general help will be appreciated...

1
Pretty sure you can use a Map type for the dynamic JSON fields/columns. I haven't tried this myself though. - tstites
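A sketch of that suggestion, assuming the dynamic part of each message can be modelled as string-to-string pairs and a Spark version (roughly 2.4+) whose from_json accepts a MapType schema; the field names and data are illustrative:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{MapType, StringType}

object MapSchemaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("map-schema-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Raw JSON strings as they might arrive from Kafka (illustrative data).
    val raw = Seq(
      """{"user":"a","country":"US"}""",
      """{"user":"b","device":"ios","country":"IL"}"""
    ).toDF("value")

    // Parse every message into a single map column; any field name is accepted,
    // at the cost of losing per-field types (every value becomes a string).
    val asMap = raw.select(from_json($"value", MapType(StringType, StringType)).as("fields"))
    asMap.select($"fields".getItem("country").as("country")).show()
  }
}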

1 Answer

1
votes

After long integration work and trials, I found two ways to consume from Kafka without a fixed schema:

1) Run each message through an "editing/validation" lambda function; not my favorite.

2) In Spark, on each micro-batch obtain the flattened schema and intersect it with the needed columns, then use Spark SQL to query the frame for the needed data. That worked for me.
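A minimal sketch of option 2, assuming a DStream[String] of JSON payloads already extracted from the Kafka records; neededColumns, the view name, and the query are illustrative, and only one level of nesting is flattened:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType
import org.apache.spark.streaming.dstream.DStream

object PerBatchSchemaSketch {

  // Columns we actually care about; anything else in the JSON is ignored.
  val neededColumns = Seq("userId", "eventType", "payload.country")

  def process(messages: DStream[String], spark: SparkSession): Unit = {
    messages.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // Infer the schema of this micro-batch only; fields absent from the
        // batch simply won't appear in the schema.
        val batchDf = spark.read.json(rdd)

        // Flatten one level of nesting into "parent.child" column names.
        val flatColumns = batchDf.schema.fields.flatMap { f =>
          f.dataType match {
            case s: StructType => s.fieldNames.map(child => s"${f.name}.$child")
            case _             => Seq(f.name)
          }
        }

        // Intersect: keep only the needed columns that exist in this batch.
        val present = neededColumns.filter(flatColumns.contains)
        if (present.nonEmpty) {
          batchDf.selectExpr(present.map(c => s"$c AS `${c.replace(".", "_")}`"): _*)
            .createOrReplaceTempView("events")
          spark.sql("SELECT * FROM events").show()
        }
      }
    }
  }
}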