I'm running an R script on SageMaker using a Batch Transform job with the bring-your-own-Docker feature. The process works fine for smaller datasets, but when I try to run bigger ones, the job fails after 40 minutes with the following error: "model server did not respond to /invocations request within 600 seconds". CloudWatch logs show the CPU at 100% utilization and memory usage below 10%. It seems the container cannot respond to the ping. Is there any way to override these 600 seconds with a higher value? Or is there a way to limit the CPU utilization of the running container?
1 Answer
0 votes
There seems to be a fixed timeout for inference requests as stated here: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-batch-code.html#your-algorithms-batch-code-how-containers-should-respond-to-inferences
SageMaker is designed around the assumption that training the model takes most of the time and that inference is fast (as is the case for neural networks, etc.).
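If your input can be split into individual records, one way to stay within that per-request limit is to let Batch Transform send smaller payloads per /invocations call, so each request finishes quickly. Below is a rough boto3 (Python) sketch of such a job configuration; the job name, model name, S3 paths, instance type, and payload size are placeholders and depend on your data format:

```python
import boto3

sm = boto3.client("sagemaker")

# Split the input line by line and keep each /invocations payload small,
# so every single request can complete well within the timeout.
sm.create_transform_job(
    TransformJobName="r-batch-transform-small-payloads",  # placeholder name
    ModelName="my-r-model",                               # placeholder model
    BatchStrategy="MultiRecord",
    MaxPayloadInMB=6,              # keep each request small
    MaxConcurrentTransforms=1,
    TransformInput={
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://my-bucket/input/",          # placeholder path
            }
        },
        "ContentType": "text/csv",
        "SplitType": "Line",       # one record per line
    },
    TransformOutput={
        "S3OutputPath": "s3://my-bucket/output/",          # placeholder path
        "AssembleWith": "Line",
    },
    TransformResources={"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
)
```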
If you are also estimating (fitting) your model inside the R script, you would need to move that work into the training part of SageMaker.
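A sketch of what that could look like as a bring-your-own-Docker training job (the job name, image URI, role ARN, and S3 paths below are placeholders you would replace with your own):

```python
import boto3

sm = boto3.client("sagemaker")

# Run the expensive estimation/fitting step as a training job, where there is
# no per-request timeout, and keep only fast predictions for transform time.
sm.create_training_job(
    TrainingJobName="r-model-estimation",  # placeholder
    AlgorithmSpecification={
        "TrainingImage": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-r-image:latest",  # placeholder
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    InputDataConfig=[{
        "ChannelName": "training",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://my-bucket/training-data/",  # placeholder
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }],
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/model-artifacts/"},
    ResourceConfig={
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 30,
    },
    StoppingCondition={"MaxRuntimeInSeconds": 4 * 3600},  # long-running fits are fine here
)
```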
For your application, it might be a better option to run the job on AWS Batch (https://aws.amazon.com/batch/).
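For example, assuming you have already registered a job definition that points at your R image and set up a job queue (both names below are placeholders), submitting the job could look roughly like this:

```python
import boto3

batch = boto3.client("batch")

# Submit the R job to AWS Batch, which has no 600-second response limit;
# the job simply runs until the container exits.
batch.submit_job(
    jobName="r-batch-scoring",            # placeholder
    jobQueue="my-batch-job-queue",        # placeholder: existing job queue
    jobDefinition="my-r-job-definition",  # placeholder: points at the R image
    containerOverrides={
        "command": ["Rscript", "/opt/app/score.R"],  # placeholder entrypoint
        "resourceRequirements": [
            {"type": "VCPU", "value": "4"},
            {"type": "MEMORY", "value": "8192"},     # MiB
        ],
    },
)
```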