Kafka is very common; many companies use it. I understand how both Kafka and Spark work and I have experience with both of them. What I don't understand is the use cases: why would you use Kafka together with Spark, rather than just Spark?
As I see it, Kafka's main use is as a staging area in an ETL pipeline for real-time (streaming) data.
I imagine that there is a data source cluster where the data is originally stored. It could be, for example, Vertica, Cassandra, Hadoop, etc.
Then there is a processing cluster that reads the data from the data source cluster and writes it to a distributed Kafka log, which is basically a staging cluster for the data.
Then there is another processing cluster - a Spark cluster that reads the data from Kafka, performs some transformations and aggregations on it, and writes it to the final destination.
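To make that concrete, here is roughly the kind of Spark job I have in mind for the Kafka-based version, using Spark Structured Streaming's Kafka source (the topic name, broker addresses, and sink paths are just placeholders):

```scala
import org.apache.spark.sql.SparkSession

object KafkaToWarehouse {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-to-warehouse")
      .getOrCreate()

    // Read a stream of records from a Kafka topic (brokers/topic are placeholders).
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
      .option("subscribe", "page_views")
      .load()
      .selectExpr("CAST(value AS STRING) AS json")

    // ... transformations and aggregations would go here ...

    // Write the results to the final destination (a Parquet sink, as one example).
    val query = events.writeStream
      .format("parquet")
      .option("path", "/warehouse/page_views")
      .option("checkpointLocation", "/checkpoints/page_views")
      .start()

    query.awaitTermination()
  }
}
```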
If what I imagine is correct, I could just cut Kafka out of the middle, and in a Spark program running on a Spark cluster, the driver would read the data from the original source and parallelize it for processing. What is the advantage of placing Kafka in the middle?
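In other words, instead of the streaming job above, I would just run a batch job that reads straight from the source, something like the sketch below (a JDBC read from Vertica as an example; connection details, table, and column names are made up, and it assumes the Vertica JDBC driver is on the classpath):

```scala
import org.apache.spark.sql.SparkSession

object DirectBatchJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("direct-batch-etl")
      .getOrCreate()

    // Read straight from the source system via JDBC; the reads are split
    // across executors using the partitioning options below.
    val raw = spark.read
      .format("jdbc")
      .option("url", "jdbc:vertica://source-db:5433/warehouse") // placeholder host/db
      .option("dbtable", "events.page_views")                   // placeholder table
      .option("user", "etl_reader")                              // placeholder credentials
      .option("password", "etl_password")
      .option("numPartitions", "8")
      .option("partitionColumn", "event_id")
      .option("lowerBound", "0")
      .option("upperBound", "100000000")
      .load()

    // The same transformations/aggregations, just in batch.
    val counts = raw.groupBy("page_id").count()

    counts.write.mode("overwrite").parquet("/warehouse/page_view_counts")
  }
}
```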
Can you give me concrete use cases where Kafka is helpful, as opposed to just reading the data into Spark directly in the first place, without going through Kafka?