
I want to process different batch files using an S3-SQS-Lambda architecture and am looking at three possible design approaches:

  1. Option 1 - Process the batch file as a whole

    • File delivered to S3
    • First Lambda triggers and creates a single message in SQS
    • Second Lambda triggers and processes the entire batch file at once
  2. Option 2 - Process the batch file with each message processed separately

    • File delivered to S3
    • First Lambda triggers and creates one SQS message per line in the batch file
    • Second Lambda triggers and processes one message at a time
  3. Option 3 - Process the batch file with multiple messages processed concurrently

    • File delivered to S3
    • First Lambda triggers and creates one SQS message per line in the batch file
    • Second Lambda triggers and processes multiple messages at a time
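For Options 2 and 3, the fan-out step in the first Lambda might look like the sketch below. It only shows the splitting logic: turning the file body into entries for SQS `send_message_batch` calls, which accept at most 10 entries each. The surrounding boto3 calls (`s3.get_object` to read the file, `sqs.send_message_batch` to enqueue each batch) are left as comments, and `file_to_sqs_batches` is a hypothetical helper name, not an established API.

```python
# Sketch of the fan-out step for Options 2/3: turn a batch file into
# lists of SQS entries, one list per send_message_batch call.
# In the real Lambda you would first read the file with
# s3.get_object(Bucket=..., Key=...) from the S3 event notification,
# then call sqs.send_message_batch(QueueUrl=..., Entries=batch) per list.

def file_to_sqs_batches(file_body, batch_limit=10):
    """Split a file body into SQS batch entries (10 per call max)."""
    lines = [ln for ln in file_body.splitlines() if ln.strip()]
    entries = [{"Id": str(i), "MessageBody": ln} for i, ln in enumerate(lines)]
    return [entries[i:i + batch_limit] for i in range(0, len(entries), batch_limit)]

# A 25-line file fans out into three send_message_batch calls: 10 + 10 + 5.
batches = file_to_sqs_batches("\n".join(f"row-{n}" for n in range(25)))
print([len(b) for b in batches])  # [10, 10, 5]
```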

I am inclined to use Option 3 as it seems to be the middle option from an architecture, scalability, and processing/cost standpoint, but I would like pointers from experts on how they would compare these options.


1 Answer


Prefer simplicity until you have a proven need for complexity.

All three of those options look architecturally valid, but each suits different conditions:

  1. This requires no additional infrastructure for you to manage. As long as a single Lambda can always complete a batch within an acceptable time frame, I would always prefer this option: it is simple and easy to reason about.
  2. Use this if you can demonstrate that each message in the batch takes a few seconds to process and you want to work through the batch as quickly as possible, because you'll be going massively parallel to do the work. That parallelism incurs additional complexity and overhead, so if it only takes a few ms to process a message, you won't realise the time saving and will be better with option...
  3. Use this option if the batch from a file is too big for a single Lambda to process in a timely fashion (i.e. Option 1 is not suitable), and through experimentation you have discovered that there is an ideal batch size (e.g. the overhead of splitting and invoking a Lambda dominates at low message counts, but at, say, 100 messages, it becomes faster to process in parallel).

Start with Option 1, which is quick and easy to set up. If it takes too long to process, then you've demonstrated that there is a need for complexity, and you can move to Option 2 or 3. I would consider Option 2 a subset of Option 3, so write the batching logic and experiment to see what batch size offers the performance you need.
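One reason Option 2 is a subset of Option 3: the second Lambda's code is the same either way, and the SQS event source mapping's BatchSize setting (1 for Option 2, larger for Option 3) decides how many messages arrive per invocation. A sketch of that consumer, assuming ReportBatchItemFailures is enabled so a single bad line doesn't force the whole batch to be retried; `process_line` here is a stand-in for your real per-line work, not an actual API.

```python
# Sketch of the second Lambda. The event source mapping's BatchSize
# controls how many SQS records land in event["Records"] per invocation:
# BatchSize=1 gives Option 2, a tuned BatchSize gives Option 3.

def process_line(line):
    # Stand-in for the real per-line work; rejects blank lines.
    if not line.strip():
        raise ValueError("empty line")
    return line.upper()

def handler(event, context=None):
    failures = []
    for record in event["Records"]:
        try:
            process_line(record["body"])
        except Exception:
            # With ReportBatchItemFailures enabled, only these message
            # IDs are returned to the queue for redelivery.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}

event = {"Records": [
    {"messageId": "m1", "body": "row-1"},
    {"messageId": "m2", "body": "   "},
]}
print(handler(event))  # only m2 is reported for retry
```

Experimenting with batch size then becomes a configuration change on the event source mapping rather than a code change.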