6 votes

I'm building a serverless web-tracking system that serves its tracking pixel through AWS API Gateway, which calls a Lambda function whenever a tracking request arrives; the Lambda function writes the tracking event into a Kinesis stream.

The Lambda function itself does not do anything fancy. It just takes the incoming event (its own argument) and writes it to the stream. Essentially, it's just:

import boto3
kinesis_client = boto3.client("kinesis")

kinesis_stream = "my_stream_name"

def return_tracking_pixel(event, context):
    ...
    new_record = ...(event)  # build the record from the incoming event (details omitted)
    kinesis_client.put_record(
        StreamName=kinesis_stream,
        Data=new_record,
        PartitionKey=...
    )
    return ...  # return the tracking-pixel response (details omitted)

Sometimes I experience a weird spike in the Lambda execution duration that causes some of my Lambda invocations to time out, so those tracking requests are lost.

This is the graph of 1-minute invocation counts of the Lambda function in the affected time period:

[Graph: 1-minute Lambda invocation counts in the affected period]

Between 20:50 and 23:10 I suddenly see many invocation errors (1-minute error counts):

[Graph: 1-minute Lambda invocation error counts]

which are obviously caused by the Lambda executions timing out (maximum duration in 1-minute intervals):

[Graph: maximum Lambda execution duration per 1-minute interval]

There is nothing weird going on with my Kinesis stream (data in, number of put records, put_record success count, etc. all look normal) or with my API Gateway (the number of invocations corresponds to the number of API GW calls and is well within the API GW limits).

What could be causing the sudden (and seemingly randomly occurring) spike in the Lambda function execution duration?

EDIT: The Lambda functions are not being throttled either, which was my first idea.

Anything interesting being logged by the function? Peppering your code with some otherwise unnecessary log chatter might help you at least prove which point the function is reaching (a sketch of that kind of logging follows these comments). If the function runs in a VPC and is assigned multiple subnets, one of the subnets may be misconfigured and unable to reach Kinesis. Temporarily increasing the timeout to the maximum might get you a log of the real error, an exception, a stack trace, etc. that you currently never see because the platform kills the execution before the error can surface. – Michael - sqlbot
@Michael-sqlbot I'll try to log more, but the whole function is really just about 5 lines of code... the only two non-trivial operations happening in there are kinesis_client = boto3.client("kinesis") and kinesis_client.put_record(...)... and I can't really avoid those, can I? – grepe
You're right, it does seem very simple and straightforward, but it's critical to figure out precisely how far your code gets through those few steps. If there are timeouts or retries on the put_record() call, you need to know that it actually started trying, and you need more runtime to let the request fail hard rather than being cut off by the Lambda timeout. – Michael - sqlbot
I agree that you should add some logging to your function to isolate any potential issues within it. Temporary spikes can also come from Lambda cold-start behavior, but that should be negligible for an API receiving consistent traffic. – RyanG
Is anybody else writing to the Kinesis stream? Is it possible that you are reaching some sort of limit and need to increase the number of shards on the stream? – Mircea
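
For reference, a minimal sketch of the kind of timing/logging suggested in the comments, built around the handler from the question. The log messages, the JSON record building, the "example-key" partition key, and the response are placeholders of mine, standing in for the parts the question elides:

import json
import logging
import time
import boto3

logger = logging.getLogger()
logger.setLevel(logging.INFO)

kinesis_client = boto3.client("kinesis")
kinesis_stream = "my_stream_name"

def return_tracking_pixel(event, context):
    # Log how much of the configured timeout is left on entry.
    logger.info("handler entered, %d ms remaining",
                context.get_remaining_time_in_millis())

    new_record = json.dumps(event)  # placeholder for the real record-building step
    start = time.time()
    try:
        kinesis_client.put_record(
            StreamName=kinesis_stream,
            Data=new_record,
            PartitionKey="example-key",  # placeholder partition key
        )
        logger.info("put_record succeeded in %.3f s", time.time() - start)
    except Exception:
        logger.exception("put_record failed after %.3f s", time.time() - start)
        raise

    return {"statusCode": 200}  # placeholder response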

1 Answer

2 votes

Just to add my 2 cents, because there's not much investigating to be done without extra logging or some X-Ray analysis.

AWS Lambda sometimes force-recycles containers, which feels like a cold start even though your function is being exercised regularly and kept warm. This brings back all the cold-start-related issues, such as the extra delay for attaching ENIs if your Lambda runs inside a VPC, and so on... and even for a function as simple as yours, a 1-second timeout is sometimes too optimistic for a cold start.

I don't know of any documentation on those forced recycles, only anecdotal evidence from other users:

"We see a forced recycle about 7 times a day." source

"It also appears that even once warmed, high concurrency functions get recycled much faster than those with just a few in memory." source

I wonder how you could confirm this is the case. Perhaps you could check whether the errors appear in CloudWatch log streams from containers that never appeared before (each log stream corresponds to a single container instance).
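
One way to check that (a sketch of my own, not something from the original answer): log an identifier that is generated once per container, at module load time, and see whether the timed-out invocations carry identifiers that never appeared in the logs before.

import logging
import uuid

logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Module-level code runs only when a new container is initialized,
# i.e. on a cold start or a forced recycle.
CONTAINER_ID = str(uuid.uuid4())
logger.info("cold start, container_id=%s", CONTAINER_ID)

def return_tracking_pixel(event, context):
    # Warm invocations reuse the same CONTAINER_ID; a value that has never
    # been logged before means the request landed on a fresh container.
    logger.info("invocation on container_id=%s, request_id=%s",
                CONTAINER_ID, context.aws_request_id)
    ...  # rest of the handler unchanged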