Kinesis consumer getRecords does not return 10,000 records consistently

Question

I have a kinesis stream (20 shards) with about a 1 day data lag that is being consumed by a KCL based kinesis consumer. The consumer is deployed with 20 ECS instances, so each instance has a thread pulling data from each shard.

Based on documentation, it looks like a single getRecords call can fetch upto 10,000 records or a maximum payload size of 10 MB. However, when I monitor the consumer logs, not all shards seem to reach this limit. The records fetched with a single getRecords call is very inconsistent across the consumer instances. Some calls fetch around 100-400 records, while some calls fetch around 4000-5000 records. On rare occasions, some calls fetch 9999 records. As a result, the data lag is not getting reduced.

The consumer takes around 5 minutes to process 10,000 records so the read throughput is not being reached as well.

Is there an explanation for this or metrics that I could look into, to debug this issue further?

How frequent getRecords is being called (for a shard) in the situation you describe? — Ofek Hod

stevenv stevenv · Accepted Answer · 2020-09-07T08:55:01

Depending on your record sizes, this might be because of the following Kinesis service limit:

Each shard can support up to a maximum total data read rate of 2 MB per second via GetRecords. If a call to GetRecords returns 10 MB, subsequent calls made within the next 5 seconds throw an exception.

You might want to consider adding more shards if this is indeed the limit you are running into.

Kinesis consumer getRecords does not return 10,000 records consistently

2 Answers