1
votes

I have configured Spark Streaming to receive data from Kafka, following the Kafka Integration Guide.

I set Spark Streaming's batch interval to 20 seconds, and I try to save the messages received in each 20-second batch to HDFS, using the DStream method saveAsTextFiles.
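For reference, a minimal sketch of the setup described above, assuming the receiver-based KafkaUtils.createStream API from the Kafka Integration Guide; the ZooKeeper quorum, group id, topic name, and output path are hypothetical placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("KafkaToHdfs")
// 20-second batch interval, as described in the question
val ssc = new StreamingContext(conf, Seconds(20))

// Hypothetical ZooKeeper quorum, consumer group, and topic
val stream = KafkaUtils.createStream(
  ssc, "zkhost:2181", "my-group", Map("my-topic" -> 1))

// Each batch is written under a new directory named <prefix>-<timestamp>
stream.map(_._2).saveAsTextFiles("hdfs:///user/me/output/messages")

ssc.start()
ssc.awaitTermination()
```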

I run the application successfully: it receives data from Kafka and saves the messages to HDFS every 20 seconds. But I am confused about the output layout. Every 20 seconds a directory is created, with the prefix specified in the saveAsTextFiles call, containing output files named "part-00000", "part-00001", and so on.

However, each output file contains only one message. It seems the Kafka DStream saves each received message to its own output file in HDFS. I would like to save multiple messages to one output file instead.

BTW, I am using a Spark Standalone deployment with only one worker.

2
could you add code that reproduces the issue you're facing? - maasg

2 Answers

1
votes

No, that's certainly not how it works; that would be crazy. One directory is created per batch interval, and its part-* files together contain all the messages received in that interval. One file is created per streaming task, which corresponds to the number of partitions of that batch's RDD. If you see one message per file, it's because each partition happened to hold one message.
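To see that the number of part-* files tracks the partition count, one could log it per batch. A hedged sketch (stream variable and output path are hypothetical, continuing the setup from the question):

```scala
stream.map(_._2).foreachRDD { (rdd, time) =>
  // The number of part-* files in each output directory equals
  // the number of partitions of the batch's RDD
  println(s"Batch at $time has ${rdd.partitions.length} partitions")
  rdd.saveAsTextFile(s"hdfs:///user/me/output/messages-${time.milliseconds}")
}
```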

0
votes

Repartition the RDD to 1 before calling the saveAsTextFile method; you will then get a single output file per batch. Be aware that this adds computation overhead (a shuffle) and removes write parallelism.
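A sketch of this suggestion applied to the DStream from the question (stream variable and path hypothetical); coalesce(1) is shown as an alternative that avoids a full shuffle, at the cost of funneling the whole batch through one task:

```scala
// Collapse each batch's RDD to one partition so each output
// directory contains a single part-00000 file
stream.map(_._2)
  .repartition(1)          // full shuffle into one partition
  // .coalesce(1)          // alternative: no shuffle, single writer task
  .saveAsTextFiles("hdfs:///user/me/output/messages")
```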