We have a Spark Streaming application that ingests data at about 10,000 records/sec. We call the foreachRDD operation on our DStream, since Spark Streaming does not execute anything unless it finds an output operation on the DStream.
So we have to use the foreachRDD output operation like this, and it takes up to 3 hours to write a single batch of data (10,000 records), which is far too slow.
CodeSnippet 1:
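    import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient
    import com.amazonaws.services.dynamodbv2.model.UpdateItemRequest

    requestsWithState.foreachRDD { rdd =>
      rdd.foreach {
        case (topicsTableName, hashKeyTemp, attributeValueUpdate) =>
          // Note: a new DynamoDB client is created for every record.
          val client = new AmazonDynamoDBClient()
          val request = new UpdateItemRequest(topicsTableName, hashKeyTemp, attributeValueUpdate)
          try client.updateItem(request)
          catch {
            case se: Exception => println(s"Error executing updateItem!\nTable $topicsTableName: $se")
          }
        case null => // ignore null records
      }
    }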
So I thought the code inside foreachRDD might be the problem and commented it out to see how long it takes. To my surprise, even with no code inside the foreachRDD, it still runs for 3 hours.
CodeSnippet 2:
    requestsWithState.foreachRDD { rdd =>
      rdd.foreach { _ =>
        // No code here, yet it still takes a long time
        // (the code was removed to see whether it runs any faster without it).
      }
    }
Please let us know if we are missing anything, or if there is an alternative way to do this. As I understand it, a Spark Streaming application will not run without an output operation on the DStream, and at this time I can't use other output operations.
Note: To isolate the problem and make sure the DynamoDB code is not the cause, I ran with an empty loop. It looks like foreachRDD is slow on its own when iterating over a huge record set coming in at 10,000 records/sec, and not the DynamoDB code, because the empty foreachRDD and the one with the DynamoDB code took the same time.
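One possible reading of the empty-loop result, offered as a hypothesis and not confirmed here: DStream and RDD transformations are lazy, so as far as I understand the time shown against foreachRDD also covers computing everything upstream that produces requestsWithState. The sketch below is a plain-Scala analogy with a lazy Iterator, not Spark code, and the names LazyPipelineDemo, expensiveTransform and buildPipeline are made up for illustration. It shows that an empty consumer still triggers all of the upstream work.

```scala
// Plain-Scala analogy (no Spark): a lazy pipeline does no work until a terminal
// operation consumes it, so even an empty consumer pays the full upstream cost.
object LazyPipelineDemo {
  // Counts how many times the "expensive" upstream step actually ran.
  var upstreamCalls = 0

  // Hypothetical stand-in for the transformations that build requestsWithState.
  def expensiveTransform(x: Int): Int = {
    upstreamCalls += 1
    x * 2
  }

  // Iterator.map is lazy, much like an RDD transformation: nothing runs here.
  def buildPipeline(n: Int): Iterator[Int] =
    Iterator.range(0, n).map(expensiveTransform)

  def main(args: Array[String]): Unit = {
    val pipeline = buildPipeline(10000)
    println(s"after building the pipeline: upstreamCalls = $upstreamCalls")

    // Empty body, like CodeSnippet 2: the upstream step still runs for every element.
    pipeline.foreach(_ => ())
    println(s"after the empty foreach: upstreamCalls = $upstreamCalls")
  }
}
```

If the same thing is happening in the streaming job, the transformations that build requestsWithState would be the place to look, not the body of foreachRDD.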
Screenshot showing all the stages that are executed and the time taken by foreachRDD, even though it is just looping with no code inside
Time taken by the foreachRDD empty loop
Task distribution for the long-running task among 9 worker nodes for the foreachRDD empty loop

