I have a Kinesis stream (20 shards) with about a one-day data lag that is being consumed by a KCL-based consumer. The consumer is deployed across 20 ECS instances, so each instance holds the lease for a single shard and runs one thread pulling data from it.
Based on the documentation, a single getRecords call can fetch up to 10,000 records or a maximum payload of 10 MB. However, when I monitor the consumer logs, not all shards reach this limit, and the number of records returned by a single getRecords call is very inconsistent across instances: some calls fetch around 100-400 records, some around 4,000-5,000, and only on rare occasions does a call fetch 9,999 records. As a result, the data lag is not shrinking.
The consumer also takes around 5 minutes to process 10,000 records, so the per-shard read-throughput limit is not being hit either.
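For context, here is a rough back-of-envelope comparison of my observed processing rate against the per-shard read ceiling (the 5-minutes-per-10,000-records figure is from my logs; the 5 calls/sec and 10,000 records/call limits are from the Kinesis documentation):

```python
# Back-of-envelope throughput check (observed figures from my logs;
# API limits per the Kinesis documentation).

records_per_batch = 10_000      # max records per GetRecords call
processing_seconds = 5 * 60     # observed time to process one full batch
shards = 20

# Effective processing rate per shard/worker, and fleet-wide.
per_shard_rate = records_per_batch / processing_seconds   # records/sec
fleet_rate = per_shard_rate * shards

# Per-shard API read ceiling: 5 GetRecords calls/sec x 10,000 records.
api_ceiling_per_shard = 5 * 10_000

print(f"per-shard processing rate:  {per_shard_rate:.1f} records/s")
print(f"fleet-wide processing rate: {fleet_rate:.0f} records/s")
print(f"per-shard API read ceiling: {api_ceiling_per_shard} records/s")
```

So processing runs at roughly 33 records/s per shard (about 667 records/s fleet-wide), orders of magnitude below the API ceiling, which is why I suspect the bottleneck is elsewhere.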
Is there an explanation for this, or are there metrics I could look into to debug this further?