0
votes

I have a lot of Kafka topics with 1 partition each, being produced to and consumed from (REST API - Kafka - SQL Server). Now I want to take periodic dumps of this data and keep them in HDFS to perform analytics on later.

Since this is basically just a dump I need, I'm not sure that I need Spark Streaming. However, all the documentation and examples use Spark Streaming for this.

Is there a way to populate a DataFrame/RDD from a Kafka topic without having a streaming job running? Or is the paradigm here to kill the "streaming" job once the set window of min-to-max offsets has been processed, thus treating the streaming job as a batch job?


3 Answers

1
votes

As you've correctly identified, you do not have to use Spark Streaming for this. One approach would be to use the HDFS connector for Kafka Connect. Kafka Connect is part of Apache Kafka; the connector takes a Kafka topic and writes its messages to HDFS. You can see the documentation for it here.
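For reference, a minimal standalone configuration for that connector might look roughly like the following. This is only a sketch: the property names follow Confluent's HDFS sink connector, and the topic name, HDFS URL, and flush size are placeholders you would adjust.

    # hdfs-sink.properties -- sketch of an HDFS sink connector config
    name=hdfs-sink
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    tasks.max=1
    # Topic(s) to copy into HDFS (placeholder)
    topics=events
    # HDFS namenode to write to (placeholder)
    hdfs.url=hdfs://namenode:8020
    # Number of records to buffer before committing a file to HDFS
    flush.size=1000

You would load a file like this with the connect-standalone (or connect-distributed) script that ships with Kafka, alongside a worker properties file.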

1
votes

You can use the createRDD method of KafkaUtils to run a Spark batch job.

A similar question has been answered here: Read Kafka topic in a Spark batch job
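A minimal sketch of that approach, assuming the spark-streaming-kafka-0-10 integration, a single-partition topic named events on a local broker, and placeholder offsets and HDFS paths:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.kafka010.{KafkaUtils, LocationStrategies, OffsetRange}

    import scala.collection.JavaConverters._

    object KafkaBatchDump {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("kafka-batch-dump"))

        // Plain Kafka consumer settings; point bootstrap.servers at your cluster.
        val kafkaParams: java.util.Map[String, Object] = Map[String, Object](
          "bootstrap.servers"  -> "localhost:9092",
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id"           -> "batch-dump"
        ).asJava

        // One OffsetRange per topic-partition: partition 0 of "events",
        // from offset 0 (inclusive) up to offset 1000 (exclusive).
        val offsetRanges = Array(OffsetRange("events", 0, 0L, 1000L))

        // Builds a plain RDD of ConsumerRecords -- no StreamingContext involved.
        val rdd = KafkaUtils.createRDD[String, String](
          sc, kafkaParams, offsetRanges, LocationStrategies.PreferConsistent)

        // Keep only the message values and dump them to HDFS as text.
        rdd.map(_.value).saveAsTextFile("hdfs:///dumps/events/" + System.currentTimeMillis)

        sc.stop()
      }
    }

Because it is an ordinary RDD, the job simply finishes once the given offset range has been read, which matches the "batch dump" behaviour you described.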

0
votes

Kafka is a stream processing platform, so using it with Spark Streaming is straightforward.

You could use Spark Streaming and then checkpoint the data at specified intervals, which fulfills your requirement.

For more on checkpointing: https://spark.apache.org/docs/2.0.2/streaming-programming-guide.html#checkpointing
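A rough sketch of what that could look like, assuming the spark-streaming-kafka-0-10 integration, a 5-minute batch interval, a topic named events, and placeholder HDFS paths. In this sketch the checkpoint directory is used for recovery of the streaming job, while saveAsTextFiles writes each batch of messages out to HDFS.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

    object KafkaStreamingDump {
      def main(args: Array[String]): Unit = {
        val checkpointDir = "hdfs:///checkpoints/kafka-dump"

        // Recreate the context from the checkpoint if one exists, otherwise build it fresh.
        def createContext(): StreamingContext = {
          val conf = new SparkConf().setAppName("kafka-streaming-dump")
          val ssc = new StreamingContext(conf, Seconds(300)) // one batch every 5 minutes
          ssc.checkpoint(checkpointDir)

          val kafkaParams = Map[String, Object](
            "bootstrap.servers"  -> "localhost:9092",
            "key.deserializer"   -> classOf[StringDeserializer],
            "value.deserializer" -> classOf[StringDeserializer],
            "group.id"           -> "streaming-dump",
            "auto.offset.reset"  -> "earliest"
          )

          val stream = KafkaUtils.createDirectStream[String, String](
            ssc,
            LocationStrategies.PreferConsistent,
            ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

          // At each batch interval, write the new messages out to HDFS as text files.
          stream.map(_.value).saveAsTextFiles("hdfs:///dumps/events/batch")

          ssc
        }

        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }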