Current State:
Today I have a Spark Structured Streaming application that consumes a single Kafka topic containing JSON messages. Each message's value embeds information about the source and the schema of the message field. A very simplified version of a message looks something like this:
```json
{
  "source": "Application A",
  "schema": [{"col_name": "countryId", "col_type": "Integer"}, {"col_name": "name", "col_type": "String"}],
  "message": {"countryId": "21", "name": "Poland"}
}
```
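For context, here is a simplified sketch of how I map that embedded schema field onto a Spark StructType (only the two col_type values from the example are handled; anything else would need its own case):

```scala
import org.apache.spark.sql.types._

// Build a Spark StructType from the (col_name, col_type) pairs in the message's "schema" field.
def toStructType(cols: Seq[(String, String)]): StructType =
  StructType(cols.map { case (colName, colType) =>
    val dataType = colType match {
      case "Integer" => IntegerType
      case "String"  => StringType
      case other     => throw new IllegalArgumentException(s"Unhandled col_type: $other")
    }
    StructField(colName, dataType, nullable = true)
  })

// toStructType(Seq("countryId" -> "Integer", "name" -> "String"))
//   => struct<countryId:int,name:string>
```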
There are a handful of Kafka topics in the system today, and I've deployed one instance of this Spark Structured Streaming application per topic, using the subscribe option. Each application applies its topic's unique schema (hacked together by batch-reading the first message in the topic and mapping its embedded schema) and writes the result to HDFS in Parquet format.
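For concreteness, each per-topic application looks roughly like this. The broker address, topic name, and paths are placeholders, and the message schema is hard-coded here to stand in for the batch-read hack:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("topic-a-ingest").getOrCreate()

// Stand-in for the hack: in reality this is derived by batch-reading the first
// message of the topic and mapping its "schema" field (see the sketch above).
val messageSchema: StructType = new StructType()
  .add("countryId", IntegerType)
  .add("name", StringType)

// Schema of the full Kafka value: the envelope plus the per-topic message struct.
val envelopeSchema: StructType = new StructType()
  .add("source", StringType)
  .add("schema", ArrayType(new StructType()
    .add("col_name", StringType)
    .add("col_type", StringType)))
  .add("message", messageSchema)

val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
  .option("subscribe", "topic-a")                      // one application per topic today
  .load()
  .select(from_json(col("value").cast("string"), envelopeSchema).as("parsed"))
  .select("parsed.message.*")

// Write the flattened message columns to a topic-specific HDFS location as Parquet.
parsed.writeStream
  .format("parquet")
  .option("path", "hdfs:///data/topic-a")              // placeholder path
  .option("checkpointLocation", "hdfs:///checkpoints/topic-a")
  .start()
  .awaitTermination()
```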
Desired State:
My organization will soon start producing more and more topics, and I don't think this pattern of one Spark application per topic will scale well. At first glance the subscribePattern option looks like a good fit, since the topics follow a loose naming hierarchy, but I'm stuck on how to apply each topic's schema and write to distinct locations in HDFS.
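Reusing the session from the sketch above, this is roughly where I get stuck (the topic pattern is a placeholder):

```scala
import org.apache.spark.sql.functions.col

// One stream over many topics; the "topic" column says where each record came from.
val multiTopic = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
  .option("subscribePattern", "app\\..*")             // placeholder pattern for the topic hierarchy
  .load()
  .select(col("topic"), col("value").cast("string").as("json"))

// This is the part I can't see a clean way through:
//  - from_json takes a single schema, but each topic has its own.
//  - the parquet sink takes a single path, but each topic needs its own HDFS location.
```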
In the future we will most likely have thousands of topics and hopefully only 25 or so Spark Applications.
Does anyone have advice on how to accomplish this?