0
votes

I have a Kinesis stream (20 shards) with about one day of data lag that is being consumed by a KCL-based Kinesis consumer. The consumer is deployed on 20 ECS instances, so each instance has a thread pulling data from each shard.

Based on the documentation, it looks like a single getRecords call can fetch up to 10,000 records or a maximum payload size of 10 MB. However, when I monitor the consumer logs, not all shards seem to reach this limit. The number of records fetched by a single getRecords call is very inconsistent across the consumer instances. Some calls fetch around 100-400 records, while others fetch around 4000-5000 records. On rare occasions, a call fetches 9999 records. As a result, the data lag is not shrinking.

The consumer takes around 5 minutes to process 10,000 records, so the read throughput limit is not being reached either.
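
To put that processing rate in numbers, here is a rough back-of-the-envelope sketch. It uses only the figures quoted above (10,000 records per roughly 5 minutes, 20 shards) and assumes one worker per shard processing at the same pace; the constant and function names are purely illustrative.

```python
# Back-of-the-envelope processing throughput from the figures in the question.
RECORDS_PER_BATCH = 10_000   # records handled in one processing cycle
SECONDS_PER_BATCH = 5 * 60   # roughly 5 minutes per cycle
SHARD_COUNT = 20             # assumption: one worker per shard, same pace


def records_per_second(records: int, seconds: float) -> float:
    """Average number of records processed per second."""
    return records / seconds


per_shard = records_per_second(RECORDS_PER_BATCH, SECONDS_PER_BATCH)
total = per_shard * SHARD_COUNT

print(f"per shard:  {per_shard:.1f} records/s")
print(f"all shards: {total:.1f} records/s")
```

If producers write faster than this on average (about 33 records per second per shard), the lag cannot shrink, no matter how many records each getRecords call returns.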

Is there an explanation for this or metrics that I could look into, to debug this issue further?

2
How frequently is getRecords being called (for a shard) in the situation you describe? - Ofek Hod

2 Answers

1
votes

Depending on your record sizes, this might be because of the following Kinesis service limit:

Each shard can support up to a maximum total data read rate of 2 MB per second via GetRecords. If a call to GetRecords returns 10 MB, subsequent calls made within the next 5 seconds throw an exception.

You might want to consider adding more shards if this is indeed the limit you are running into.
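
As a rough illustration of how that limit interacts with record size, the sketch below estimates how many records per second one shard can deliver at 2 MB per second, and how much read budget a response of a given size uses up. The record sizes are hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption; plug in your own measured average.

```python
# Per-shard read limit quoted above: 2 MB per second via GetRecords.
# Assumption: 1 MB is treated as 1024 * 1024 bytes.
READ_LIMIT_BYTES_PER_SEC = 2 * 1024 * 1024


def max_records_per_second(avg_record_bytes: int) -> float:
    """Upper bound on records/s for one shard at the 2 MB/s read limit."""
    return READ_LIMIT_BYTES_PER_SEC / avg_record_bytes


def read_budget_seconds(response_bytes: int) -> float:
    """Seconds of the 2 MB/s budget consumed by one response of this size."""
    return response_bytes / READ_LIMIT_BYTES_PER_SEC


# Hypothetical average record sizes, for illustration only.
for size in (512, 5 * 1024, 50 * 1024):
    print(f"{size:>6} B/record -> at most {max_records_per_second(size):.0f} records/s")

# A full 10 MB response uses up 5 seconds of budget, matching the quoted limit.
print(read_budget_seconds(10 * 1024 * 1024))
```

With large records, the byte limit is reached long before the 10,000-record cap, which would be one possible reason for small batches. Comparing the actual payload size per call against these numbers should show whether this limit is the one being hit.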

0
votes

I am not familiar with your specific application, so it might not be relevant, but perhaps Kinesis Data Streams (KDS) enhanced fan-out can help:

https://docs.aws.amazon.com/streams/latest/dev/enhanced-consumers.html

You might also want to consider increasing the retention period.

Do note that extra charges apply to both suggestions.