2
votes

I've got an application for which I only need the bandwidth of 1 Kinesis shard, but I need many lambda function invocations in parallel to keep up with the record processing. My record size is on the high end (some of them encroach on the 1000 KB limit), but the incoming rate is only 1 MB/s, as I'm using a single EC2 instance to populate the stream. Since each record contains an internal timestamp, I don't care about processing them in order. Basically I have several months' worth of data that I need to migrate, and I want to do it in parallel.

The processed records feed a database cluster that can handle 1000 concurrent clients, so my previous solution was to split my Kinesis stream into 50 shards. However, this has proved expensive, since all I need the shards for is to parallelize the processing: I'm using less than 1% of the bandwidth, and I had to increase the retention period.

Long term, I imagine the answer involves splitting my records up, so that the consumption time isn't such a huge multiple of the production time. That's not an option right now, though I realize I'm abusing the system slightly.

Is there a way I can have one order-preserving lambda function associated with a single-shard Kinesis stream, and let it invoke another lambda function asynchronously on a batch of records? Then I could use a single Kinesis shard (or other data source) and still enjoy massively parallel processing.

Really all I need is an option in the Lambda Event Source configuration for Kinesis to say "I don't care about preserving the order of these records." But then I suppose keeping track of the iterator position on failed executions becomes more of a challenge.
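
To make it concrete, here is roughly the dispatcher I have in mind, as a sketch in Python with boto3 (the worker function name "record-worker" is hypothetical; the dispatcher itself would be the function attached to the Kinesis event source):

    import json

    import boto3

    lambda_client = boto3.client("lambda")

    # Hypothetical name of the worker function that does the real per-record work.
    WORKER_FUNCTION = "record-worker"


    def dispatcher_handler(event, context):
        """Kinesis-triggered dispatcher: fans each record out to a worker Lambda.

        InvocationType='Event' makes each invocation asynchronous, so the
        dispatcher returns quickly and the workers run in parallel, with no
        ordering guarantees (which is fine here).
        """
        for record in event["Records"]:
            payload = json.dumps({
                # Kinesis delivers the data base64-encoded; pass it through and
                # let the worker decode it. Large records would have to be passed
                # by reference instead (see the comments below).
                "data": record["kinesis"]["data"],
                "sequenceNumber": record["kinesis"]["sequenceNumber"],
            })
            lambda_client.invoke(
                FunctionName=WORKER_FUNCTION,
                InvocationType="Event",  # asynchronous, fire-and-forget
                Payload=payload,
            )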

Can you chain your Lambda functions? The first function would get the meta-event and mainly split it into smaller events that you can trigger another Lambda function with. That second Lambda function can then be triggered in parallel. – Guy
I believe so, but then I need to cache my records somewhere handy (like DynamoDB) to handle failures and retries correctly, and since Lambda functions can't live longer than 300 seconds, I can't have a long-running orchestrator function; it has to survive expiring (and getting re-invoked). – Jay Carlton
It depends on the type of errors you might have. For example, if you have "poison pills" in your data, you simply want to throw them away. You could also consider a chain of Kinesis streams as your intermediate buffering mechanism, or a "dead letter queue" for these exceptions, in either Kinesis or SQS, depending on the frequency of such errors. – Guy
Did you consider using SQS instead? For example, Elastic Beanstalk with SQS as a worker environment (docs.aws.amazon.com/elasticbeanstalk/latest/dg/…) is very similar to Lambda with Kinesis, but might be a better fit for your use case. – Guy
Thanks, I'll have a look at SQS. There's another issue you run into when invoking Lambda via the Event method: there's a 128 KB limit on the payload size in that scenario, so all we can really afford to pass directly to Lambda is information on where to obtain the actual payload (e.g. an S3 bucket and key). – Jay Carlton
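
To illustrate that last point, a minimal sketch of the pointer approach, assuming each record already lives in S3 and the dispatcher hands the worker hypothetical "bucket" and "key" fields rather than the payload itself:

    import boto3

    s3 = boto3.client("s3")


    def worker_handler(event, context):
        """Worker Lambda: the async payload only carries a pointer to the real
        record, which is fetched from S3 to stay under the 128 KB limit on
        asynchronous invocation payloads."""
        obj = s3.get_object(Bucket=event["bucket"], Key=event["key"])
        record_bytes = obj["Body"].read()
        process(record_bytes)


    def process(record_bytes):
        # Placeholder for the real per-record processing and database write.
        pass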

1 Answer

3
votes

According to somebody who works at AWS, it is possible to attach several Lambda functions to the same Kinesis stream. That said, I've been testing it without success so far.

EDIT:

It's working properly.
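
For reference, a rough sketch of how the multiple event source mappings can be created with boto3 (the stream ARN and function names below are placeholders):

    import boto3

    lambda_client = boto3.client("lambda")

    # Placeholder ARN for the single-shard stream.
    STREAM_ARN = "arn:aws:kinesis:us-east-1:123456789012:stream/my-stream"

    # Attach several consumer functions to the same stream. Each mapping keeps
    # its own iterator, so each function independently receives every record.
    for function_name in ("consumer-a", "consumer-b", "consumer-c"):
        lambda_client.create_event_source_mapping(
            EventSourceArn=STREAM_ARN,
            FunctionName=function_name,
            StartingPosition="TRIM_HORIZON",
            BatchSize=100,
        )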