
I am writing a Kafka Sink Connector to Elasticsearch.

I implemented start, put, flush, close methods in SinkTask class.

However, I don't understand exactly how a Kafka Sink Connector behaves.

If the Connect worker repeatedly performs the whole cycle of fetching SinkRecords from the Kafka broker, handing them to the task via the put() method, processing them internally, and then sending the data to Elasticsearch, I would like to know when the offset commit happens internally and how it is related to the flush() method.

Also, I wonder whether the order of this repetitive work is fixed. For example, can flush() or an offset commit happen before put() has finished?
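For reference, this is roughly the shape of my task (a trimmed sketch; the class name is a placeholder and the Elasticsearch client calls are omitted):

```java
import java.util.Collection;
import java.util.Map;

import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class MyElasticsearchSinkTask extends SinkTask {

    @Override
    public String version() {
        return "0.0.1";
    }

    @Override
    public void start(Map<String, String> props) {
        // Build the Elasticsearch client from the connector config (omitted).
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        // Receives the next batch of SinkRecords polled from the broker;
        // the records are converted and sent to Elasticsearch here.
    }

    @Override
    public void flush(Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
        // currentOffsets holds the offsets the worker is about to commit;
        // this is where I want to be sure that everything handed to put()
        // has actually reached Elasticsearch.
    }

    @Override
    public void close(Collection<TopicPartition> partitions) {
        // Called when partitions are taken away from this task (rebalance).
    }

    @Override
    public void stop() {
        // Close the Elasticsearch client (omitted).
    }
}
```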

I am developing a connector that consumes data from a remote broker and writes it into Elasticsearch on another remote server. In this situation, we are testing what happens if the network of the server running Kafka Connect is momentarily disconnected. I do not understand how Kafka Connect behaves in that case.

If messages can be lost or duplicated in this situation, I would appreciate an explanation of how.

Thank you.

Out of interest, how is what you're writing different from the existing Elasticsearch Connector? - Robin Moffatt

1 Answer


"what happens if the network of the server running Connect is momentarily disconnected"

The offsets consumed by your Connect worker are committed to the internal __consumer_offsets Kafka topic.
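If you want to watch that happening, a quick sketch with the AdminClient can list the committed offsets for the connector's consumer group (by default a sink connector's group is named connect-<connector-name>; the bootstrap server and group id below are placeholders for your setup):

```java
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class CheckCommittedOffsets {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap server; point this at your remote broker.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Sink connectors commit to the group "connect-<connector-name>" by default.
            Map<TopicPartition, OffsetAndMetadata> offsets = admin
                    .listConsumerGroupOffsets("connect-my-es-sink")
                    .partitionsToOffsetAndMetadata()
                    .get();

            offsets.forEach((tp, om) ->
                    System.out.println(tp + " -> committed offset " + om.offset()));
        }
    }
}
```

Running this before and after the network drop shows whether the committed offsets kept advancing or simply paused until the task reconnected.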

As long as you are monitoring the /status endpoint of your Connector for a non-failure status, and you can restart it within the retention period of the Kafka topic you're consuming, there should be little to no data loss.
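As a rough sketch of that monitoring (the worker host, the default REST port 8083, and the connector name are placeholders for your deployment), you can poll the Connect REST API and alert on a FAILED state:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ConnectorStatusCheck {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Placeholder worker host and connector name; the Connect REST API
        // listens on port 8083 by default.
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://connect-worker:8083/connectors/my-es-sink/status"))
                .GET()
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // The JSON body reports a "state" for the connector and for each task
        // (RUNNING, PAUSED, or FAILED). A crude substring check is enough to
        // trigger an alert or an automatic restart.
        System.out.println(response.statusCode() + " " + response.body());
        if (response.body().contains("\"state\":\"FAILED\"")) {
            System.err.println("Connector or task has FAILED - consider restarting it");
        }
    }
}
```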

Also, as pointed out in the comments, kafka-connect-elasticsearch already exists, and you can check that code for its commit and flush semantics. If you do fork it, please put in a PR to help the community ;)


Also, Logstash has Kafka inputs and Elasticsearch outputs, so if you are running the ELK stack, that might be a better option than writing your own connector.