Spark: How to speed up foreachRDD?

Question

We have a Spark Streaming application that ingests data at 10,000 records/sec. We use the foreachRDD operation on our DStream, since Spark doesn't execute anything unless it finds an output operation on the DStream.

So we have to use the foreachRDD output operation like this. It takes up to 3 hours to write a single batch of data (10,000 records), which is slow.

CodeSnippet 1:

requestsWithState.foreachRDD { rdd =>
  rdd.foreach {
    case (topicsTableName, hashKeyTemp, attributeValueUpdate) =>
      // Note: a new DynamoDB client is created for every record.
      val client = new AmazonDynamoDBClient()
      val request = new UpdateItemRequest(topicsTableName, hashKeyTemp, attributeValueUpdate)
      try client.updateItem(request)
      catch {
        case se: Exception => println("Error executing updateItem!\nTable ", se)
      }
    case null =>
  }
}

So I thought the code inside foreachRDD might be the problem and commented it out to see how much time it takes. To my surprise, even with no code inside the foreachRDD, it still runs for 3 hours.

CodeSnippet 2:

requestsWithState.foreachRDD { rdd =>
  rdd.foreach { _ =>
    // No code here, yet it still takes a lot of time
    // (there used to be code, but I removed it to see if it's any faster without it)
    ()
  }
}

Please let us know if we are missing anything, or if there is an alternative way to do this. As I understand it, a Spark Streaming application will not run without an output operation on the DStream, and at this time I can't use other output operations.

Note: To isolate the problem and make sure the Dynamo code is not the cause, I ran with an empty loop. It looks like foreachRDD is slow on its own when iterating over a huge record set coming in at 10,000/sec, and not the Dynamo code, since the empty foreachRDD and the one with the Dynamo code took the same time.

Screenshot showing all the stages that are executed and the time taken by foreachRDD, even though it's just looping with no code inside

Time taken by the foreachRDD empty loop

Task distribution for the long-running task among 9 worker nodes for the foreachRDD empty loop

Not sure about why the empty one would be slow, but make sure your write throughput to dynamo is insanely high, if you have a lot of clusters running this at one time. Might be helpful to post your spark streaming configuration as well. — Derek_M
@Derek_M thanks for your comment. We have insanely huge numbers for read and write throughput, so writing 10,000 should not be an issue. Like I mentioned in the question, the Dynamo code should not be an issue, and to prove that I ran with an empty loop. It looks like foreachRDD is slow on its own. — user2359997
What does your streaming config look like? Are you doing any additional operations before you run foreachRDD? — Derek_M
@Derek_M I do run map transformations, but as you can see from the updated screenshot, they only take a few seconds. The major time, in minutes, is taken by the foreachRDD empty loop. — user2359997
@zero323 yes, you are absolutely right in your assumption that it has a mapWithState upstream. I have updated the question with the task distribution. Please let me know if you need more info. — user2359997

youngjack · Accepted Answer · 2018-12-14T09:11:19

I know it is late, but if you'd like to hear it, I have a guess that may give you some insight.

It is not the code inside rdd.foreach that takes a long time, but the code before rdd.foreach: the code that generates the RDD. Transformations are lazy; Spark does not compute them until you use the result. When the code in rdd.foreach runs, Spark does the computation and generates the data rows. The code in the rdd.foreach loop only manipulates the result. You can check this by commenting out the rdd.foreach:

requestsWithState.foreachRDD { rdd =>
  // rdd.foreach {
  //   No code here still takes a lot of time (there used to be code but removed it
  //   to see if it's any faster without code)
  // }
}

I guess it will be extremely fast, because no computation happens. Or you can change the transformations to very simple ones; it will be fast too. This does not solve your problem, but if I'm right, it will help you locate it.
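To make the laziness point concrete without a cluster, here is a minimal sketch in plain Scala. It uses a lazy Iterator as a stand-in for an RDD, so it is only an analogy and not Spark code; the names LazyDemo, run, and expensiveTransform are made up for illustration, with expensiveTransform playing the role of the upstream map / mapWithState stages.

```scala
object LazyDemo {
  // Returns (upstream calls right after declaring the pipeline,
  //          upstream calls after running an empty foreach over it).
  def run(n: Int): (Int, Int) = {
    var upstreamCalls = 0

    // Hypothetical stand-in for the expensive upstream transformations.
    def expensiveTransform(x: Int): Int = {
      upstreamCalls += 1
      x * 2
    }

    // Declaring the pipeline does no work yet: Iterator.map is lazy,
    // much like a transformation on an RDD.
    val pipeline = (1 to n).iterator.map(expensiveTransform)
    val callsAfterDeclaring = upstreamCalls

    // An "empty" foreach still forces every upstream element to be computed.
    pipeline.foreach(_ => ())

    (callsAfterDeclaring, upstreamCalls)
  }

  def main(args: Array[String]): Unit = {
    val (before, after) = run(5)
    println(s"upstream calls after declaring map: $before")
    println(s"upstream calls after empty foreach: $after")
  }
}
```

The empty foreach is what triggers all the upstream work, so the time gets attributed to it even though its body does nothing. In the actual Spark job, one way to separate the two costs (an assumption about what would help here, not something verified on this workload) is to call rdd.persist() and then rdd.count() before the foreach, and compare the time of the count against the time of the foreach that follows.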
