0
votes

I have a lot of Kafka topics with 1 partition each, being produced to and consumed from (REST API - Kafka - SQL Server). Now I want to take periodic dumps of this data and keep them in HDFS to perform analytics on later.

Since this is basically just a dump I need, I'm not sure that I need Spark Streaming. However, all the documentation and examples use Spark Streaming for this.

Is there a way to populate a DataFrame/RDD from a Kafka topic without having a streaming job running? Or is the paradigm here to kill the "streaming" job once the set window of min-to-max offsets has been processed, thus treating the streaming job as a batch job?


3 Answers

1
votes

As you've correctly identified, you do not have to use Spark Streaming for this. One approach would be to use the HDFS connector for Kafka Connect. Kafka Connect is part of Apache Kafka; the connector takes a Kafka topic and writes its messages to HDFS. You can see the documentation for it here.
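For reference, a minimal standalone configuration for that connector might look roughly like the following. This is only a sketch: the property names follow Confluent's HDFS sink connector, and the topic name, HDFS URL, and flush size are placeholders you would adjust.

    # hdfs-sink.properties -- sketch of an HDFS sink connector config
    name=hdfs-sink
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    tasks.max=1
    # Topic(s) to copy into HDFS (placeholder)
    topics=events
    # HDFS namenode to write to (placeholder)
    hdfs.url=hdfs://namenode:8020
    # Number of records to buffer before committing a file to HDFS
    flush.size=1000

You would load a file like this with the connect-standalone (or connect-distributed) script that ships with Kafka, alongside a worker properties file.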

1
votes

You can use the createRDD method of KafkaUtils to run a Spark batch job.

A similar question has been answered here: Read Kafka topic in a Spark batch job
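A minimal sketch of that approach, assuming the spark-streaming-kafka-0-10 integration, a single-partition topic named events on a local broker, and placeholder offsets and HDFS paths:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.kafka010.{KafkaUtils, LocationStrategies, OffsetRange}

    import scala.collection.JavaConverters._

    object KafkaBatchDump {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("kafka-batch-dump"))

        // Plain Kafka consumer settings; point bootstrap.servers at your cluster.
        val kafkaParams: java.util.Map[String, Object] = Map[String, Object](
          "bootstrap.servers"  -> "localhost:9092",
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id"           -> "batch-dump"
        ).asJava

        // One OffsetRange per topic-partition: partition 0 of "events",
        // from offset 0 (inclusive) up to offset 1000 (exclusive).
        val offsetRanges = Array(OffsetRange("events", 0, 0L, 1000L))

        // Builds a plain RDD of ConsumerRecords -- no StreamingContext involved.
        val rdd = KafkaUtils.createRDD[String, String](
          sc, kafkaParams, offsetRanges, LocationStrategies.PreferConsistent)

        // Keep only the message values and dump them to HDFS as text.
        rdd.map(_.value).saveAsTextFile("hdfs:///dumps/events/" + System.currentTimeMillis)

        sc.stop()
      }
    }

Because it is an ordinary RDD, the job simply finishes once the given offset range has been read, which matches the "batch dump" behaviour you described.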

0
votes

Kafka is a stream processing platform, so using it with Spark Streaming is straightforward.

You could use Spark Streaming and then checkpoint the data at specified intervals, which fulfills your requirement.

For more on checkpointing: https://spark.apache.org/docs/2.0.2/streaming-programming-guide.html#checkpointing
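A rough sketch of what that could look like, assuming the spark-streaming-kafka-0-10 integration, a 5-minute batch interval, a topic named events, and placeholder HDFS paths. In this sketch the checkpoint directory is used for recovery of the streaming job, while saveAsTextFiles writes each batch of messages out to HDFS.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

    object KafkaStreamingDump {
      def main(args: Array[String]): Unit = {
        val checkpointDir = "hdfs:///checkpoints/kafka-dump"

        // Recreate the context from the checkpoint if one exists, otherwise build it fresh.
        def createContext(): StreamingContext = {
          val conf = new SparkConf().setAppName("kafka-streaming-dump")
          val ssc = new StreamingContext(conf, Seconds(300)) // one batch every 5 minutes
          ssc.checkpoint(checkpointDir)

          val kafkaParams = Map[String, Object](
            "bootstrap.servers"  -> "localhost:9092",
            "key.deserializer"   -> classOf[StringDeserializer],
            "value.deserializer" -> classOf[StringDeserializer],
            "group.id"           -> "streaming-dump",
            "auto.offset.reset"  -> "earliest"
          )

          val stream = KafkaUtils.createDirectStream[String, String](
            ssc,
            LocationStrategies.PreferConsistent,
            ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

          // At each batch interval, write the new messages out to HDFS as text files.
          stream.map(_.value).saveAsTextFiles("hdfs:///dumps/events/batch")

          ssc
        }

        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }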