
I'm setting up an Apache Spark cluster to perform real-time streaming computations and would like to monitor the performance of the deployment by tracking metrics such as batch sizes, batch processing times, etc. My Spark Streaming program is written in Scala.

Questions

  1. The Spark monitoring REST API description lists the various endpoints available, but I couldn't find any that expose batch-level information. Is there a way to get a list of all the batches that have run for an application, along with per-batch details such as the following:
    • Number of events per batch
    • Processing time
    • Scheduling delay
    • Exit status: i.e., whether the batch was processed successfully or not
  2. If such a batch-level API is unavailable, can batch-level stats (e.g. batch size, processing time, scheduling delay) be obtained by adding custom instrumentation to the Spark Streaming program?

Thanks in advance,

Regarding 2., this answer might help: stackoverflow.com/questions/41980447/… – ImDarrenG

1 Answer


If you have no luck with 1, this will help with 2:

import org.apache.spark.streaming.scheduler.StreamingListener;
import org.apache.spark.streaming.scheduler.StreamingListenerBatchCompleted;

// Register the listener on your StreamingContext:
ssc.addStreamingListener(new JobListener());

// ...

class JobListener implements StreamingListener {

    @Override
    public void onBatchCompleted(StreamingListenerBatchCompleted batchCompleted) {
        System.out.println("Batch completed, total delay: "
                + batchCompleted.batchInfo().totalDelay().get().toString() + " ms");
    }

    /*
    snipped other methods
    */
}
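
Since the question's program is in Scala, the same approach looks like this as a minimal Scala sketch (the class name BatchStatsListener is illustrative). In Scala, StreamingListener is a trait whose callbacks have no-op defaults, so only the methods you care about need overriding:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class BatchStatsListener extends StreamingListener {
  // Called once per completed batch, with its BatchInfo attached.
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    println(s"Batch completed, total delay: ${info.totalDelay.getOrElse(-1L)} ms")
  }
}

// Register it on the StreamingContext (ssc is assumed), typically before ssc.start():
ssc.addStreamingListener(new BatchStatsListener)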

Taken from: In Spark Streaming, is there a way to detect when a batch has finished?

batchCompleted.batchInfo() contains:

  • numRecords
  • batchTime, processingStartTime, processingEndTime
  • schedulingDelay
  • outputOperationInfos

Hopefully you can get what you need from those properties.
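
For example, a small helper along these lines (the name logBatchStats is hypothetical, and deriving an "exit status" from outputOperationInfos failure reasons is an assumption) could be called from onBatchCompleted with batchCompleted.batchInfo to cover the metrics asked about in the question:

import org.apache.spark.streaming.scheduler.BatchInfo

// Hypothetical helper; call it from onBatchCompleted with batchCompleted.batchInfo.
def logBatchStats(info: BatchInfo): Unit = {
  // Processing time derived from the start/end timestamps; both are Options
  // and absent if the batch never started or finished.
  val processingTimeMs = for {
    start <- info.processingStartTime
    end   <- info.processingEndTime
  } yield end - start

  // Treat the batch as failed if any output operation recorded a failure reason
  // (an assumption; adapt to your own definition of "exit status").
  val failed = info.outputOperationInfos.values.exists(_.failureReason.isDefined)

  println(s"batch=${info.batchTime} " +
    s"numRecords=${info.numRecords} " +
    s"schedulingDelayMs=${info.schedulingDelay.getOrElse(-1L)} " +
    s"processingTimeMs=${processingTimeMs.getOrElse(-1L)} " +
    s"failed=$failed")
}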