1
votes

I have a Kafka topic with 2 million messages. My flush size is 100000, with the default number of partitions, running in distributed mode with 4 workers, and I can see the data being written to HDFS almost immediately, within 10 to 15 seconds.

I see that a +tmp directory is created, and that a folder for the topic is created every time a new connector is triggered.

Is it normal for Kafka Connect to write this fast every time, or is the data already stored in HDFS and then moved to the topic directory based on the connector properties?

If I want to calculate the latency for this, how can I calculate it?

And if I stop the connector, delete the topic directory in both /topics and /temp, and retrigger the same topic, will it pull the data from Kafka again, or will it get the data from some place in HDFS as a backup?

I need clarity on how this works. Please let me know if my understanding is not right.

1 Answer

2
votes

Is it normal for Kafka Connect to write this fast every time, or is the data already stored in HDFS and then moved to the topic directory based on the connector properties?

Both. The connector is a Kafka consumer that buffers records in memory and writes to a write-ahead log on HDFS. The +tmp folder holds temporary files, which get "compressed" together into larger HDFS files and moved into the final location along with a Kafka consumer offset commit.
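To make the "write to a temporary location, then move into place" idea concrete, here is a minimal sketch of that general pattern on a local filesystem. It is not the connector's code: `commit_batch`, the directory names, and the file name are all hypothetical, and the real connector works against HDFS with its own write-ahead log.

```python
import os
import tempfile


def commit_batch(records, tmp_dir, final_dir, filename):
    """Write records to a temp file, then move it into place in one step.

    Hypothetical helper illustrating the pattern only; names are assumptions.
    """
    os.makedirs(tmp_dir, exist_ok=True)
    os.makedirs(final_dir, exist_ok=True)

    # Readers of final_dir never see this partially written file.
    tmp_path = os.path.join(tmp_dir, filename + ".tmp")
    with open(tmp_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(record + "\n")

    # The finished file appears in the final location in a single rename.
    final_path = os.path.join(final_dir, filename)
    os.replace(tmp_path, final_path)
    return final_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        path = commit_batch(
            ["record-1", "record-2"],
            os.path.join(root, "+tmp"),
            os.path.join(root, "topics"),
            "part-0",
        )
        print(os.path.basename(path))
```

The point of the pattern is that a crash mid-write leaves debris only in the temporary directory, never a half-written file in the final one.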

calculate the latency for this

You can use Control Center with the Monitoring Interceptors enabled on your brokers and Connect instances.

Otherwise, you can describe the connector's consumer group with the CLI to see the consumer lag.
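For reference, a sink connector's consumer group is normally named `connect-<connector name>`, and `kafka-consumer-groups --describe --group <group>` reports a current offset and a log end offset per partition. Lag is just the difference between the two. The sketch below shows that arithmetic; the function name and all the offset numbers are made up for illustration, and treating a missing commit as offset 0 is an assumption.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Return (per-partition lag, total lag) from two {partition: offset} maps."""
    per_partition = {}
    for partition, end in end_offsets.items():
        # Assumption: a partition with no committed offset counts from 0.
        committed = committed_offsets.get(partition, 0)
        per_partition[partition] = max(end - committed, 0)
    return per_partition, sum(per_partition.values())


# Hypothetical numbers: 2 million messages spread over 3 partitions.
end_offsets = {0: 700_000, 1: 650_000, 2: 650_000}
committed = {0: 600_000, 1: 650_000, 2: 400_000}

per_partition, total = consumer_lag(end_offsets, committed)
print(per_partition)  # {0: 100000, 1: 0, 2: 250000}
print(total)          # 350000
```

Sampling this at intervals while the connector runs gives a rough picture of how quickly it is catching up.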

if I stop the connector, delete the topic directory in both /topics and /temp, and retrigger the same topic, will it pull the data from Kafka again

Up to a certain version of the HDFS connector, I believe it resumed from offsets stored in HDFS. To restart from the very beginning, as with any other Kafka consumer, you must delete or reset the consumer group.
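As an illustration of what "offsets stored in HDFS" can mean: the connector's committed files are typically named like `<topic>+<partition>+<startOffset>+<endOffset>.<ext>`, so the next offset to read can be derived from the file names alone. The sketch below assumes that naming scheme; the regex, the function name, and the sample file names are hypothetical, and it is not the connector's actual recovery code.

```python
import re

# Assumed file name layout: <topic>+<partition>+<startOffset>+<endOffset>.<ext>
FILENAME_RE = re.compile(
    r"^(?P<topic>.+)\+(?P<partition>\d+)\+(?P<start>\d+)\+(?P<end>\d+)\.\w+$"
)


def next_offsets(filenames):
    """Map (topic, partition) to the offset following the highest end offset."""
    result = {}
    for name in filenames:
        match = FILENAME_RE.match(name)
        if match is None:
            continue  # skip anything that does not follow the naming scheme
        key = (match["topic"], int(match["partition"]))
        result[key] = max(result.get(key, 0), int(match["end"]) + 1)
    return result


# Hypothetical directory listing with a flush size of 100000.
files = [
    "mytopic+0+0000000000+0000099999.avro",
    "mytopic+0+0000100000+0000199999.avro",
    "mytopic+1+0000000000+0000099999.avro",
    "_SUCCESS",
]
print(next_offsets(files))  # {('mytopic', 0): 200000, ('mytopic', 1): 100000}
```

If those files are deleted, nothing in the listing remains to derive an offset from, which is why the consumer group's committed offsets then decide where reading resumes.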

If you delete the topic directory, there is no other place in HDFS that Connect references, so there are no backups.