3
votes

I am trying to process data from Kafka using Spark Structured Streaming. The code for ingesting the data is as follows:

val enriched = df.select($"value" cast "string" as "json")
  .select(from_json($"json", schema) as "data")
  .select("data.*")

Here, df is a DataFrame with the data consumed from Kafka.

The problem comes when I try to read it as JSON in order to run faster queries. The from_json() function from org.apache.spark.sql.functions requires a schema. What if the messages can have different fields?
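For reference, the schema that from_json expects is normally declared up front as a StructType. A minimal sketch, with placeholder field names that stand in for whatever the real messages contain, might look like this:

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

// Placeholder fields only; the real schema depends on the messages.
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("eventTime", TimestampType),
  StructField("payload", StringType)
))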

1
Each batch requires the same schema, which has to be known up-front. Inference wouldn't make sense at all. You can check stackoverflow.com/q/41814634/1560062 for some ideas on how to handle this. - zero323
@zero323 Thanks for the answer! And on a more "theoretical" note, why wouldn't it make sense? I understood that the point of being structured is just to get better performance and optimization; otherwise we are more or less back to relational models, right? - ppanero
I think that "just" is a serious understatement. Having structured data with a well-known schema is what makes things like this work: reusing execution plans, internal storage, devising an optimal compression strategy, interacting with external storage - that is one part. Another is the cost of inference: it requires a full data scan and, in the worst case, O(N) local memory, which is simply unacceptable when you need a soft real-time response. Finally, it is not that common for a schema to change without affecting application logic. And yeah, we are converging towards (more or less) relational models. - zero323

1 Answer

3
votes

As @zero323 and the answer he or she referenced suggest, you are asking a contradictory question: essentially, how does one impose a schema when one doesn't know the schema? One can't, of course. I think the idea of using open-ended collection types is your best option.
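For example (a sketch only, assuming the same df as above, spark.implicits._ in scope, and a Spark version where from_json accepts a MapType; "someField" is a made-up field name), every message can be parsed into an open-ended string-to-string map:

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{MapType, StringType}

// Parse each message into a string-to-string map rather than a fixed struct,
// so extra or missing fields do not break parsing; non-string values then
// have to be cast afterwards.
val flexible = df.select($"value" cast "string" as "json")
  .select(from_json($"json", MapType(StringType, StringType)) as "data")
  .select($"data".getItem("someField") as "someField") // hypothetical field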

Ultimately though, it is almost certainly true that you can represent your data with a case class, even if it means using a lot of Options, strings you need to parse, and maps you need to interrogate. Invest the effort to define that case class. Otherwise, your Spark jobs will essentially be a lot of ad hoc, time-consuming busywork.
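A sketch of that approach, assuming a SparkSession named spark and purely hypothetical field names, could derive the schema from the case class itself so that one definition drives both parsing and the typed Dataset:

import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.from_json
import spark.implicits._ // assumes a SparkSession named `spark`

// Hypothetical case class mirroring the expected payload: fields that may be
// absent become Options, and the truly open-ended part becomes a Map that
// downstream code interrogates.
case class Event(
  id: Option[String],
  eventTime: Option[java.sql.Timestamp],
  attributes: Option[Map[String, String]]
)

val eventSchema = Encoders.product[Event].schema

val events = df.select($"value" cast "string" as "json")
  .select(from_json($"json", eventSchema) as "data")
  .select("data.*")
  .as[Event]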