1
votes

I have configured Spark Streaming to receive data from Kafka, following the Kafka Integration Guide.

I set Spark Streaming's batch interval to 20 seconds, and I try to save the messages received in each 20-second batch to HDFS, using the DStream method saveAsTextFiles.
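For reference, a minimal sketch of the setup described above, assuming the receiver-based KafkaUtils.createStream API from the Kafka Integration Guide; the ZooKeeper quorum, group id, topic name, and output path are hypothetical placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("KafkaToHdfs")
// 20-second batch interval, as described in the question
val ssc = new StreamingContext(conf, Seconds(20))

// Hypothetical ZooKeeper quorum, consumer group, and topic
val stream = KafkaUtils.createStream(
  ssc, "zkhost:2181", "my-group", Map("my-topic" -> 1))

// Each batch is written under a new directory named <prefix>-<timestamp>
stream.map(_._2).saveAsTextFiles("hdfs:///user/me/output/messages")

ssc.start()
ssc.awaitTermination()
```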

I run the application successfully: it receives data from Kafka and saves the messages to HDFS every 20 seconds. But I am confused about the output layout. Every 20 seconds a directory is created, with the prefix specified in the saveAsTextFiles call, containing output files named "part-00000", "part-00001", and so on.

However, each output file contains only one message. It seems the Kafka DStream saves each received message to its own output file in HDFS. I would like to save multiple messages to one output file instead.

BTW, I am using a Spark Standalone deployment with only one worker.

2
could you add code that reproduces the issue you're facing? - maasg

2 Answers

1
votes

No, that's certainly not how it works; that would be crazy. One directory is created per batch interval, and its part-* files together contain all the messages received in that interval. One file is created per streaming task, which corresponds to the number of partitions of that batch's RDD. If you see one message per file, it's because each partition happened to hold one message.
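To see that the number of part-* files tracks the partition count, one could log it per batch. A hedged sketch (stream variable and output path are hypothetical, continuing the setup from the question):

```scala
stream.map(_._2).foreachRDD { (rdd, time) =>
  // The number of part-* files in each output directory equals
  // the number of partitions of the batch's RDD
  println(s"Batch at $time has ${rdd.partitions.length} partitions")
  rdd.saveAsTextFile(s"hdfs:///user/me/output/messages-${time.milliseconds}")
}
```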

0
votes

Repartition the RDD to 1 before calling the saveAsTextFile method; you will then get a single output file per batch. Be aware that this adds computation overhead (a shuffle) and removes write parallelism.
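A sketch of this suggestion applied to the DStream from the question (stream variable and path hypothetical); coalesce(1) is shown as an alternative that avoids a full shuffle, at the cost of funneling the whole batch through one task:

```scala
// Collapse each batch's RDD to one partition so each output
// directory contains a single part-00000 file
stream.map(_._2)
  .repartition(1)          // full shuffle into one partition
  // .coalesce(1)          // alternative: no shuffle, single writer task
  .saveAsTextFiles("hdfs:///user/me/output/messages")
```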