0 votes

I am using Spark Streaming to do analysis. After the analysis I have to save the Kafka messages in HDFS. Each Kafka message is an XML file. I can't use rdd.saveAsTextFile because it saves the whole RDD, and each element of the RDD is a separate Kafka message (an XML file). How can I save each RDD element (file) to HDFS using Spark?
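For reference, the direct way to do a per-element write from Spark is the Hadoop FileSystem API called from inside foreachPartition. The sketch below is only a rough illustration under assumptions: messages is the DStream[String] of XML documents coming from Kafka, and the output directory and UUID-based file naming are placeholders.

    import java.util.UUID
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    messages.foreachRDD { rdd =>
      rdd.foreachPartition { xmlDocs =>
        // Open one FileSystem handle per partition, on the executor
        val fs = FileSystem.get(new Configuration())
        xmlDocs.foreach { xml =>
          // Write each XML document as its own HDFS file (placeholder path and naming)
          val out = fs.create(new Path(s"/data/xml/${UUID.randomUUID()}.xml"))
          try out.write(xml.getBytes("UTF-8"))
          finally out.close()
        }
      }
    }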


1 Answer

2 votes

I would go about this a different way. Stream your transformed data back into Kafka, and then use the HDFS connector for Kafka Connect to stream the data to HDFS. Kafka Connect is part of Apache Kafka. The HDFS connector is open source and available standalone or as part of Confluent Platform.
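As a rough sketch of the first step (publishing the transformed records back to a Kafka topic from Spark Streaming): here transformed is assumed to be a DStream[String] of XML documents, the broker address and topic name (analysed-xml) are placeholders, and the kafka-clients dependency is assumed to be on the classpath.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.kafka.common.serialization.StringSerializer

    transformed.foreachRDD { rdd =>
      rdd.foreachPartition { xmlDocs =>
        // Create one producer per partition, on the executor
        val props = new Properties()
        props.put("bootstrap.servers", "broker1:9092")
        props.put("key.serializer", classOf[StringSerializer].getName)
        props.put("value.serializer", classOf[StringSerializer].getName)
        val producer = new KafkaProducer[String, String](props)
        try xmlDocs.foreach { xml =>
          producer.send(new ProducerRecord[String, String]("analysed-xml", xml))
        }
        finally producer.close()
      }
    }

Creating the producer inside foreachPartition keeps the non-serializable KafkaProducer on the executors; in practice you would cache and reuse a producer rather than creating a new one for every batch.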

Doing it this way, you decouple your processing from the writing of your data to HDFS, which makes it easier to manage, troubleshoot, and scale.
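On the Kafka Connect side, a standalone HDFS sink configuration would look roughly like the sketch below. The connector name, topic, HDFS URL, and flush.size are placeholders, and the format.class shown (a plain string format so each XML message is written as text) is an assumption to verify against the connector version you run.

    name=hdfs-sink-xml
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    tasks.max=1
    topics=analysed-xml
    hdfs.url=hdfs://namenode:8020
    # flush.size is the number of records written before an HDFS file is committed
    flush.size=1
    # Assumption: check which format.class values your connector version provides
    format.class=io.confluent.connect.hdfs.string.StringFormat

Setting flush.size=1 mirrors the one-file-per-message requirement in the question, but it produces many small HDFS files; a larger value batches several messages into each file.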