I am using Apache Kafka to produce and consume a 5 GB file. Is there a way to have a message removed from the topic automatically once it has been consumed? Is there any way to keep track of consumed messages? I don't want to delete them manually.
5 Answers
In Kafka, keeping track of what has been consumed is the responsibility of the consumer, and this is also one of the main reasons why Kafka scales so well horizontally.
The high-level consumer API does this for you automatically by committing consumed offsets to ZooKeeper (or, as a more recent configuration option, to a special Kafka topic that keeps track of consumed messages).
The simple consumer API makes you deal with how and where to keep track of consumed messages yourself.
Purging of messages in Kafka is done automatically, either by specifying a retention time for a topic or by defining a disk quota (a size limit) for it. So in your case of one 5 GB file, the file will be deleted after the retention period you define has passed, regardless of whether it has been consumed or not.
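To make the "regardless of whether it has been consumed" point concrete, here is a toy model of a retention pass in plain Python. It is only a sketch: `Segment` and `apply_retention` are names I made up, and the real broker logic differs in detail. The thing to notice is that consumer progress is not an input at all.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    base_offset: int     # first offset stored in this segment
    size_bytes: int
    last_write_ms: int   # timestamp of the newest record in the segment


def apply_retention(segments, now_ms, retention_ms=None, retention_bytes=None):
    """Return the segments (oldest first) that survive one retention pass.

    Simplified model: the newest (active) segment is always kept.
    """
    kept = list(segments)
    # Time-based: drop old segments whose newest record exceeds the retention time.
    if retention_ms is not None:
        while len(kept) > 1 and now_ms - kept[0].last_write_ms > retention_ms:
            kept.pop(0)
    # Size-based: drop oldest segments while the log would still be at or above the limit.
    if retention_bytes is not None:
        excess = sum(s.size_bytes for s in kept) - retention_bytes
        while len(kept) > 1 and excess - kept[0].size_bytes >= 0:
            excess -= kept[0].size_bytes
            kept.pop(0)
    return kept


segments = [
    Segment(0, 2_000, last_write_ms=0),
    Segment(100, 2_000, last_write_ms=5_000),
    Segment(200, 1_000, last_write_ms=9_000),
]
by_time = apply_retention(segments, now_ms=10_000, retention_ms=6_000)
by_size = apply_retention(segments, now_ms=10_000, retention_bytes=3_000)
print([s.base_offset for s in by_time])
print([s.base_offset for s in by_size])
```

In both calls only the oldest segment is dropped, whether or not anyone has read it.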
As far as I know, you can delete consumed data from the logs by reducing the retention time. By default the log retention is set to 168 hours, after which the data is automatically removed from the Kafka topic you created. So my suggestion is to go to the server.properties file, which is located in the config folder, and change the 168 to a smaller value, so that no data is left after the amount of time you have set for log.retention.hours. That should solve your issue.
log.retention.hours=168
Keep coding
You cannot delete a Kafka message on consumption
Kafka does not have a mechanism to directly delete a message when it is consumed.
The closest thing I found to an attempt at this is the following trick, but it is untested and, by design, it will not work on the most recent messages:
A potential trick to do this is to use a combination of (a) a compacted topic, (b) a custom partitioner, and (c) a pair of interceptors.
The process would be as follows:
- Use a producer interceptor to add a GUID to the end of the key before it is written.
- Use a custom partitioner to ignore the GUID for the purposes of partitioning.
- Use a compacted topic so you can then delete any individual message you need via producer.send(key+GUID, null)
- Use a consumer interceptor to remove the GUID on read.
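The steps above can be sketched as a toy in-memory model. This is not Kafka client code: the `#` separator, the function names, and the CRC32 hash (a stand-in for Kafka's own key hashing) are all assumptions for illustration. It also compacts the whole log at once, whereas real compaction leaves the most recent messages alone, as noted above.

```python
import uuid
import zlib


def produce(log, key, value):
    """Producer interceptor step: append a GUID to the key before writing."""
    full_key = f"{key}#{uuid.uuid4()}"
    log.append((full_key, value))
    return full_key


def partition_for(full_key, num_partitions):
    """Custom partitioner step: ignore the GUID when choosing a partition."""
    base_key = full_key.rsplit("#", 1)[0]
    return zlib.crc32(base_key.encode()) % num_partitions


def delete(log, full_key):
    """Write a tombstone (null value) for exactly one record."""
    log.append((full_key, None))


def compact(log):
    """Keep the latest record per key and drop keys whose latest value is a tombstone."""
    last_index = {k: i for i, (k, _) in enumerate(log)}
    return [(k, v) for i, (k, v) in enumerate(log)
            if last_index[k] == i and v is not None]


def consume(log):
    """Consumer interceptor step: strip the GUID again on read."""
    return [(k.rsplit("#", 1)[0], v) for k, v in log]


log = []
first = produce(log, "order-1", "created")
second = produce(log, "order-1", "paid")
delete(log, first)        # remove only the first message
log = compact(log)
print(consume(log))
```

Because every record gets a unique key, compaction never merges two real messages, and a tombstone removes exactly the one record it names.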
But you should not need this capability.
Have 1 or more consumers, and want a message to be consumed only once in total by them?
Put them in the same consumer group.
Want to avoid too many messages filling up the disk?
Set up retention in terms of disk space and/or time.
You can use a consumer group: each partition is assigned to only one consumer in the group at a time, so a message is normally read by a single consumer in that group. https://www.tutorialspoint.com/apache_kafka/apache_kafka_consumer_group_example.htm
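As a rough illustration of why that works, here is a toy round-robin assignment in plain Python. `assign` is a made-up helper, and Kafka's real assignment strategies are pluggable and more involved. The point is only that every partition ends up with exactly one owner inside the group.

```python
def assign(partitions, consumers):
    """Give each partition to exactly one consumer of the group, round-robin."""
    members = sorted(consumers)
    assignment = {c: [] for c in members}
    for i, p in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(p)
    return assignment


assignment = assign(range(6), ["c1", "c2", "c3", "c4"])
for consumer, owned in assignment.items():
    print(consumer, owned)
```

With 6 partitions and 4 consumers, two consumers own two partitions each and the other two own one each. No partition is shared.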
I just ran into this issue and built a script that can be run periodically to 'mark' consumed records as deleted. Kafka will not free the space immediately, but it will delete log segments whose offsets fall outside the 'active' ones.
https://gist.github.com/ThePsyjo/b717d2eaca2deb09b8130b3e917758f6
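I can't say whether the linked script works this way, but one way to do this kind of marking is Kafka's delete-records tool, which takes a JSON file of per-partition offsets and makes everything below those offsets eligible for deletion. A minimal sketch that builds such a file from a group's committed offsets follows. The topic name, the offsets, and the helper name are made up.

```python
import json


def build_delete_records_payload(committed):
    """Build the offsets JSON from a {(topic, partition): committed_offset} mapping.

    Records with offsets below the given value become eligible for deletion.
    """
    return {
        "partitions": [
            {"topic": topic, "partition": partition, "offset": offset}
            for (topic, partition), offset in sorted(committed.items())
        ],
        "version": 1,
    }


# Hypothetical committed offsets of the consuming group.
committed = {("big-file", 1): 950, ("big-file", 0): 1200}
payload = build_delete_records_payload(committed)
print(json.dumps(payload, indent=2))
```

Written to a file such as delete.json, this could then be passed to `kafka-delete-records.sh --bootstrap-server localhost:9092 --offset-json-file delete.json` (broker address assumed). If several consumer groups read the topic, use the lowest committed offset per partition across all of them.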