
For our pipeline, we have about 40 topics (10-25 partitions each) that we want to write into the same HDFS directory using HDFS 3 Sink Connectors in standalone mode (distributed doesn't work for our current setup). We have tried running all the topics on one connector, but we run into problems recovering offsets whenever it needs to be restarted.

If we divide the topics among several standalone connectors, can they all write into the same HDFS directory? Since the connectors organize the files in HDFS by topic, I don't think this should be an issue, but I'm wondering if anyone has experience with this setup.

Basic example: Connector-1 config

name=connect-1
connector.class=io.confluent.connect.hdfs3.Hdfs3SinkConnector
topics=topic1
hdfs.url=hdfs://kafkaOutput

Connector-2 config

name=connect-2
connector.class=io.confluent.connect.hdfs3.Hdfs3SinkConnector
topics=topic2
hdfs.url=hdfs://kafkaOutput
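
For context, we would run each connector as its own standalone worker process, roughly like this (worker-1.properties and worker-2.properties are just placeholders; each worker would need its own offset.storage.file.filename and REST port so the processes don't clash):

bin/connect-standalone.sh worker-1.properties connector-1.properties
bin/connect-standalone.sh worker-2.properties connector-2.properties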

1 Answer


distributed doesn't work for our current setup

You should be able to run connect-distributed on the exact same nodes where connect-standalone is run.
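
For example, you can start a distributed worker with the stock script and submit the same connector config over the Connect REST API. A rough sketch, assuming default ports and the sample worker properties file that ships with Kafka:

bin/connect-distributed.sh config/connect-distributed.properties

curl -X POST -H "Content-Type: application/json" http://localhost:8083/connectors -d '{
  "name": "connect-1",
  "config": {
    "connector.class": "io.confluent.connect.hdfs3.Hdfs3SinkConnector",
    "topics": "topic1",
    "hdfs.url": "hdfs://kafkaOutput"
  }
}'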

We have tried running all the topics on one connector but encounter problems recovering offsets if it needs to be restarted

Yeah, I would suggest not bundling all topics into one connector.

If we divide the topics among different standalone connectors, can they all write into the same HDFS directory?

That is my personal recommendation, and yes, they can, because the HDFS path includes the topic name and is further split by the partitioning scheme.
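
For illustration, with the default partitioner the two connectors above would end up writing to separate paths under the same root, roughly like this (topics.dir defaults to topics; the exact file names depend on the format and flush.size, so these are only indicative):

hdfs://kafkaOutput/topics/topic1/partition=0/topic1+0+0000000000+0000000099.avro
hdfs://kafkaOutput/topics/topic1/partition=1/topic1+1+0000000000+0000000099.avro
hdfs://kafkaOutput/topics/topic2/partition=0/topic2+0+0000000000+0000000099.avro

Since each connector only ever writes under its own topic directories, they never step on each other's files.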


Note: The above also applies to the other cloud storage connectors (S3 & GCS).