2
votes

We have a Kinesis stream with three shards, and our Kinesis application has three instances. We can see that records are published to all three shards, but the application processes records from only one of them. The workers listening to the other two shards constantly go to sleep.

CloudWatch metrics for one shard where outgoing bytes is zero: (image)

Before 28th April, our Kinesis application was able to process records from all three shards: (image)

Any idea what could be causing it?

1
Hi, have you found the solution? The same thing is happening with me. - Awadesh
Yes @Awadesh, we found what was causing the issue. I will try to add an answer with the explanation. - vinit
@vinit could you please post the answer? - Niketh Sudhakaran
@NikethSudhakaran better late than never :) - vinit

1 Answer

1
votes

Kinesis internally applies a timeout when it reads records from its internal storage during a GetRecords API call. When your write rate is high, there may be cases where Kinesis cannot fetch all the new records within that timeout. This generally happens when you are writing a large number of records but reading at a rate lower than the default of one GetRecords call per second. Kinesis guarantees that it can return all the records when your read rate is at least one GetRecords call per second (the maximum is five per second per shard); otherwise, your shard iterator age will keep increasing.

Note: you can still get all the records if you read slowly, but if your shard iterator age starts to climb, i.e. you are lagging behind, you have to increase the frequency of reads. Only then will you be able to bring the shard iterator age back under control.
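To make the read-rate point concrete, here is a minimal sketch of a polling loop that paces GetRecords calls at roughly one per second and never faster than five per second. It is only an illustration, not what the KCL does internally. The function name `poll_shard` and its parameters are made up for this example; `client` is assumed to be any object with a boto3-style `get_records(ShardIterator=..., Limit=...)` method.

```python
import time

# Assumption for this sketch: never poll faster than 5 calls/second per shard.
MIN_INTERVAL_S = 0.2


def poll_shard(client, shard_iterator, handle, max_calls,
               interval_s=1.0, sleep=time.sleep):
    """Poll one shard at a steady pace (hypothetical helper).

    Calls client.get_records about once every interval_s seconds, passes
    each record to handle, and returns the last MillisBehindLatest value.
    Stops when the iterator runs out or after max_calls calls.
    """
    interval_s = max(interval_s, MIN_INTERVAL_S)
    millis_behind = None
    calls = 0
    while shard_iterator is not None and calls < max_calls:
        started = time.monotonic()
        response = client.get_records(ShardIterator=shard_iterator, Limit=1000)
        calls += 1
        for record in response["Records"]:
            handle(record)
        millis_behind = response.get("MillisBehindLatest")
        shard_iterator = response.get("NextShardIterator")
        # Sleep only for the remainder of the interval so that slow
        # processing does not push the read rate below the target.
        elapsed = time.monotonic() - started
        if elapsed < interval_s:
            sleep(interval_s - elapsed)
    return millis_behind
```

With boto3, `client` would be `boto3.client("kinesis")` and the first iterator would come from `get_shard_iterator`. If the returned lag keeps growing, lowering `interval_s` towards 0.2 is the knob this answer is talking about.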

Shard iterator age is the metric that tells you how far your reads are lagging behind the latest record in a shard or stream. If your iterator age is 10 hours, you are currently reading a record that was written to the shard 10 hours earlier.
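As a rough illustration of reading that number: a GetRecords response carries a `MillisBehindLatest` field, and CloudWatch reports the same idea as `GetRecords.IteratorAgeMilliseconds`. The helpers below are hypothetical names for this sketch, and the five-minute threshold is an arbitrary assumption, not a Kinesis default.

```python
from datetime import timedelta


def iterator_age(millis_behind_latest):
    """Convert a lag in milliseconds into a timedelta."""
    return timedelta(milliseconds=millis_behind_latest)


def is_lagging(millis_behind_latest, threshold=timedelta(minutes=5)):
    """Return True when the consumer is further behind than threshold.

    The 5-minute default is an assumption chosen for this example.
    """
    return iterator_age(millis_behind_latest) > threshold
```

For example, `iterator_age(36_000_000)` is a `timedelta` of 10 hours, which matches the 10-hour case described above.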

Also, a stream is not a queue. You can't wait for your processing to complete before checkpointing, as you do in SQS (visibilityTimeout). You either checkpoint immediately or you don't checkpoint at all.