0
votes

I am currently using Google Cloud Dataflow with Apache Beam to consume messages from a Kafka topic that exists in two different Kafka clusters. Both clusters contain the same topic names but different data, since each cluster holds data from a separate region.

Is it possible to consume data from both clusters by listing the bootstrap servers of both clusters in a single KafkaIO.read Dataflow pipeline step?

.withBootstrapServers("CLUSTER1_SERVER:PORT,CLUSTER2_SERVER:PORT");

I was reading the Kafka documentation on bootstrap servers, and it wasn't clear to me whether, after connecting to a bootstrap server, messages would be consumed only from the cluster of the first successful connection, or whether all listed bootstrap servers would be tried and messages consumed from every cluster found. If the former is the case, I will need to create a second Dataflow pipeline to process messages from the second cluster, but it would be much easier if I could process messages from both clusters in a single pipeline.

Any information would be greatly appreciated.

2
Could you please share the documentation you have followed and the Dataflow version? Thanks! – aga
@aga I followed the documentation on this page: kafka.apache.org/documentation and the Dataflow/Apache Beam version I am currently using is 2.18 – eagerbeaver

2 Answers

0
votes

Beam's KafkaIO simply passes this value through to Kafka's ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG setting. That parameter is meant for listing multiple brokers of the same Kafka cluster, so the consumer can still make its initial connection if some brokers are down; it is not for passing in servers from different Kafka clusters. See the Kafka architecture documentation for details. I suspect that when you specify servers from multiple clusters, the consumer just uses the first live one it reaches.
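To make the pass-through concrete, here is a hedged sketch of how the same setting can be expressed either via withBootstrapServers or directly via consumer config updates (the topic name and server addresses are placeholders; method names follow the Beam KafkaIO API):

```java
import java.util.Map;

import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;

public class BootstrapConfigSketch {
  // withBootstrapServers(...) is shorthand for setting
  // ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG ("bootstrap.servers") on the
  // KafkaConsumer that KafkaIO creates under the hood.
  static KafkaIO.Read<String, String> buildRead() {
    return KafkaIO.<String, String>read()
        .withBootstrapServers("CLUSTER1_SERVER:PORT")
        // Equivalent, going through the raw consumer config:
        // .withConsumerConfigUpdates(
        //     Map.of(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,
        //            "CLUSTER1_SERVER:PORT"))
        .withTopic("my-topic") // placeholder topic name
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class);
  }
}
```

Either way, the list ends up in the single KafkaConsumer's `bootstrap.servers` property, which is why it cannot fan out across clusters.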

0
votes

I don't think it's a good idea to read from different clusters with the same KafkaIO instance. Under the hood it uses a KafkaConsumer, which by design reads from only one cluster; the bootstrap server list is not intended as a cross-cluster failover mechanism. Also, KafkaIO actually uses two Kafka consumers (one for messages, another for offsets), so the behavior could be even worse and the result unpredictable.

In the meantime, you can have two KafkaIO sources, one per cluster, and then join the messages by key or any other property downstream.
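A minimal sketch of the two-source approach, merging both regions with a Flatten (topic name, server addresses, and transform labels are placeholders):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TwoClusterPipeline {
  // Helper to build one KafkaIO source per cluster; each source gets its
  // own KafkaConsumer, so each cluster is read independently.
  static KafkaIO.Read<String, String> readFrom(String bootstrapServers) {
    return KafkaIO.<String, String>read()
        .withBootstrapServers(bootstrapServers)
        .withTopic("my-topic") // same topic name exists in both clusters
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class);
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    PCollection<KafkaRecord<String, String>> region1 =
        p.apply("ReadCluster1", readFrom("CLUSTER1_SERVER:PORT"));
    PCollection<KafkaRecord<String, String>> region2 =
        p.apply("ReadCluster2", readFrom("CLUSTER2_SERVER:PORT"));

    // Merge both streams into one PCollection; downstream transforms
    // then see messages from both regions.
    PCollection<KafkaRecord<String, String>> allRegions =
        PCollectionList.of(region1).and(region2)
            .apply("MergeClusters", Flatten.pCollections());

    p.run();
  }
}
```

If you need to correlate records across regions rather than just merge them, you could key both collections (e.g. with WithKeys) and use CoGroupByKey instead of Flatten.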