
I want to run my Glue job in parallel. I start the Glue job from a Step Functions state machine, and it depends on the previous state finishing, which is a Lambda function that puts messages on an SQS queue. The Glue job then takes messages from SQS one by one. I want to speed up this processing on the Glue side by running the job in parallel.

In Step Functions I can see two ways to achieve parallelism:

  • "Map" state
  • "Parallel" state

According to AWS doc: "While the Parallel state executes multiple branches of steps using the same input, a Map state will execute the same steps for multiple entries of an array in the state input."

But in my case the state input inside Step Functions is useless, because I am using SQS. If I go with the "Parallel" state, I would need to duplicate the same step in the state machine (code duplication). If I go with the "Map" state, I would need to create some kind of artificial array just to force parallelism. I am not sure whether I understand this correctly, or whether there is another way. Please suggest and help!


1 Answer


There's no need to create an "artificial array" of work items when using the Map state in the state machine, because SQS itself does not hand the same message to multiple clients simultaneously. While a message is in flight (within its visibility timeout), it is hidden from other consumers, so no matter how many Glue jobs are polling the same SQS queue, a message will be processed by only one Glue job at a time.

One thing you must take care of is the SQS visibility timeout, which is the time a message stays invisible to other clients after it has been delivered to one client. Always keep the visibility timeout longer than the processing time for a single message, so that the Glue job can delete the message before it becomes visible to the other jobs. If processing fails, the message is not deleted; once the timeout expires it becomes visible again and can be retried by the same job or another one.
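
The receive, process, delete cycle described above can be sketched roughly as follows. This is a minimal sketch, not the asker's actual job: `drain_queue`, `handle`, and the default timeout values are hypothetical names and numbers chosen for illustration. The `sqs` argument is assumed to be a boto3 SQS client (for example `boto3.client("sqs")`), and only its `receive_message` and `delete_message` calls are used.

```python
from typing import Callable


def drain_queue(
    sqs,
    queue_url: str,
    handle: Callable[[str], None],
    visibility_timeout: int = 300,  # assumption: longer than one message takes
    wait_seconds: int = 10,         # long polling, to avoid tight empty loops
) -> int:
    """Process messages one by one until a poll returns nothing.

    Returns the number of messages that were processed and deleted.
    """
    processed = 0
    while True:
        response = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=wait_seconds,
            VisibilityTimeout=visibility_timeout,
        )
        messages = response.get("Messages", [])
        if not messages:
            # Nothing was delivered within the wait time; stop this worker.
            return processed

        for message in messages:
            try:
                handle(message["Body"])
            except Exception:
                # Do not delete on failure: the message becomes visible
                # again after the visibility timeout and can be retried
                # by this job or by another one.
                continue
            # Delete only after successful processing, so that no other
            # consumer receives the message once the timeout expires.
            sqs.delete_message(
                QueueUrl=queue_url,
                ReceiptHandle=message["ReceiptHandle"],
            )
            processed += 1
```

Each parallel Glue job run could execute the same loop against the same queue URL. Because a failed message is redelivered, the handler should tolerate seeing the same message more than once.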