
We have a Java application built with Spark Streaming 1.4 that polls a directory for new files every 20 seconds, and a separate script that moves (not copies) new files into that directory every 5 seconds.

The issue: the Spark log shows that the new files are being picked up:

    2015-07-07 05:13:00,390 - [INFO ] org.apache.spark.streaming.dstream.FileInputDStream:59 New files at time 1436226180000 ms:
    file:/home/mata/Downloads/in/365379649921050.txt
    file:/home/mata/Downloads/in/365364610737285.txt
    file:/home/mata/Downloads/in/365374642289893.txt
    file:/home/mata/Downloads/in/365369640106263.txt
    2015-07-07 05:13:00,918 - [INFO ] org.apache.spark.storage.MemoryStore:59 ensureFreeSpace(231040) called with curMem=0, maxMem=280248975

But the RDD processing (joins, aggregations) is not happening. I added log statements to the processing code, but they print only once, at start-up. Has anyone experienced this issue?

    // Create the context; the 20-second batch interval is passed in as args[3]
    SparkConf sparkConf = new SparkConf()
            .setAppName("Data - Streaming App");
    JavaStreamingContext ssc = new JavaStreamingContext(sparkConf,
            Durations.seconds(Long.parseLong(args[3])));
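For reference, here is a minimal sketch of how such a pipeline is typically wired up in Spark Streaming 1.4 (the directory path matches the log above; everything else is illustrative, not the actual application code). The key point: statements outside an output operation such as foreachRDD run only once, when the streaming graph is built, while the function passed to foreachRDD runs once per batch. Log statements placed at graph-construction time will therefore print only at start-up, exactly as described above.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class StreamingSketch {
        public static void main(String[] args) throws Exception {
            SparkConf sparkConf = new SparkConf().setAppName("Data - Streaming App");
            JavaStreamingContext ssc = new JavaStreamingContext(sparkConf,
                    Durations.seconds(Long.parseLong(args[3])));

            // Watch the directory; files must appear atomically, which is why
            // the feeding script moves (rather than copies) files into it.
            JavaDStream<String> lines = ssc.textFileStream("file:///home/mata/Downloads/in");

            // Runs ONCE, when the streaming graph is constructed --
            // a log statement here only ever appears at start-up.
            System.out.println("Pipeline wired up");

            lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
                @Override
                public Void call(JavaRDD<String> rdd) {
                    // Runs once PER BATCH: per-batch logging belongs here.
                    System.out.println("Batch record count: " + rdd.count());
                    return null;
                }
            });

            ssc.start();
            ssc.awaitTermination();
        }
    }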
Can you post your code, please? – Justin Pihony

1 Answer


This is resolved.

This was not a Spark issue. We were reading the flat files and parsing them in a separate Java class, and that class was throwing an exception. For some reason Spark did not log the exception and silently stopped creating/processing RDDs. It was hard to debug, but the application now processes new files.
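For anyone hitting something similar: one way to make such failures visible is to catch and log exceptions inside the transformation itself, so a bad record is reported instead of silently breaking the pipeline. A minimal sketch, where Record, RecordParser, and LOG are hypothetical stand-ins for the application's own parser class and logger:

    import java.util.Collections;
    import org.apache.spark.api.java.function.FlatMapFunction;
    import org.apache.spark.streaming.api.java.JavaDStream;

    JavaDStream<Record> parsed = lines.flatMap(new FlatMapFunction<String, Record>() {
        @Override
        public Iterable<Record> call(String line) {
            try {
                return Collections.singletonList(RecordParser.parse(line));
            } catch (Exception e) {
                // Surface the parse failure instead of losing the batch silently.
                LOG.error("Failed to parse line: " + line, e);
                return Collections.<Record>emptyList();
            }
        }
    });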