
From the docs, it is clear that when a KCL application is started with TRIM_HORIZON as the iterator type, the records are read from the start of the stream. The docs also mention that the state of the application is maintained in the DynamoDB table by using checkpointing.

However, I cannot find any reference explaining how the KCL application uses the information in this DynamoDB table.

Specifically, my problem is as follows: I have a stream with a retention period of 168 hours, which is a lot of data. Say my KCL application (started with the iterator at TRIM_HORIZON) was in sync with the incoming data, processing records at the end of the stream and checkpointing at regular intervals. If I now restart the KCL application, will it start reading from the beginning of the stream (168 hours back) but use the DynamoDB table to find the checkpoint and skip ahead to the latest records, or is the checkpoint information not used at all, so that the stream is read from the start regardless?

In the latter case, a huge amount of data would be reprocessed unnecessarily.

Should I be manually using the sequence number from the DynamoDB table to get the shard iterator?


1 Answer


When a KCL application is restarted, it automatically restores its state from the DynamoDB table, so you don't need to do anything manually. Processing continues from the last checkpoint made before the restart, so be prepared to process a few duplicate records if the restart was unexpected and the application did not get a chance to checkpoint before exiting (though there may be other reasons for duplicates).
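
To picture the resume behavior, here is a small self-contained model. It is not KCL code: `resolve_start`, the in-memory `lease_table` dict, and the sample sequence number are hypothetical names used only for illustration. It assumes the configured initial position applies only to shards that have no checkpoint yet, and it models the resume point with the `AFTER_SEQUENCE_NUMBER` shard iterator type (a real Kinesis iterator type; whether KCL uses exactly this call internally is an assumption).

```python
from typing import Dict, Optional, Tuple

TRIM_HORIZON = "TRIM_HORIZON"


def resolve_start(
    lease_table: Dict[str, str],
    shard_id: str,
    initial_position: str = TRIM_HORIZON,
) -> Tuple[str, Optional[str]]:
    """Return (iterator_type, sequence_number) for a shard.

    lease_table is a hypothetical stand-in for the DynamoDB table,
    mapping shard id -> last checkpointed sequence number.
    """
    checkpoint = lease_table.get(shard_id)
    if checkpoint is None:
        # No checkpoint yet: fall back to the configured initial position.
        return (initial_position, None)
    # A checkpoint exists: resume right after it, ignoring TRIM_HORIZON.
    return ("AFTER_SEQUENCE_NUMBER", checkpoint)


# First ever start: the table is empty, so reading begins at TRIM_HORIZON.
first_start = resolve_start({}, "shardId-000000000000")

# Restart: a checkpoint was stored, so reading resumes after it.
lease_table = {"shardId-000000000000": "49590338271490256608559692538361571095921575989136588898"}
restart = resolve_start(lease_table, "shardId-000000000000")
```

In this model the records between the last checkpoint and the point where the worker stopped are the ones that get delivered again after a restart, which is where the duplicates come from.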

When restarting, be sure to provide the same application name as on the previous start. Otherwise, KCL treats the restart as a new, separate application: it creates a new DynamoDB table and starts entirely independent processing.
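
The effect of the application name can be sketched the same way. This is a toy model, not KCL code: the `tables` dict stands in for DynamoDB, and `lease_table_for`, `start_position`, and the application names `orders-consumer` and `orders-consumer-v2` are hypothetical.

```python
from typing import Dict

# Hypothetical stand-in for DynamoDB: table name -> {shard_id: checkpoint}.
tables: Dict[str, Dict[str, str]] = {}


def lease_table_for(application_name: str) -> Dict[str, str]:
    """Return the table for this application, creating it on first use."""
    return tables.setdefault(application_name, {})


def start_position(application_name: str, shard_id: str) -> str:
    """Return the stored checkpoint, or TRIM_HORIZON if there is none."""
    checkpoint = lease_table_for(application_name).get(shard_id)
    return checkpoint if checkpoint is not None else "TRIM_HORIZON"


# A previous run under "orders-consumer" stored a checkpoint.
lease_table_for("orders-consumer")["shardId-000000000000"] = "seq-100"

# Same name: the stored checkpoint is found and processing resumes from it.
same_name = start_position("orders-consumer", "shardId-000000000000")

# Different name: a fresh, empty table is used, so reading starts over.
new_name = start_position("orders-consumer-v2", "shardId-000000000000")
```

Under the new name nothing from the earlier run is visible, so the whole retained stream would be read again from the start.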