4
votes

I'm running a kafka-connect distributed setup.

I was testing with a single machine/process setup (still in distributed mode), which worked fine. Now I'm working with 3 nodes (and 3 connect processes); the logs contain no errors, but when I submit an S3 connector request through the REST API, it returns: {"error_code":409,"message":"Cannot complete request because of a conflicting operation (e.g. worker rebalance)"}.

When I stop the kafka-connect process on one of the nodes, I can actually submit the job and everything is running fine.

I have 3 brokers in my cluster, and the topic has 32 partitions.

This is the connector I'm trying to launch:

{
    "name": "s3-sink-new-2",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "tasks.max": "32",
        "topics": "rawEventsWithoutAttribution5",
        "s3.region": "us-east-1",
        "s3.bucket.name": "dy-raw-collection",
        "s3.part.size": "64000000",
        "flush.size": "10000",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
        "schema.generator.class": "io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator",
        "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
        "partition.duration.ms": "60000",
        "path.format": "\'year\'=YYYY/\'month\'=MM/\'day\'=dd/\'hour\'=HH",
        "locale": "US",
        "timezone": "GMT",
        "timestamp.extractor": "RecordField",
        "timestamp.field": "procTimestamp",
        "name": "s3-sink-new-2"
    }
}
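
For context, the request is submitted to the Connect REST API roughly like this (the worker hostname below is a placeholder for one of my nodes; 8083 is the default rest.port):

# POST the connector definition above to any worker's REST endpoint
curl -X POST -H "Content-Type: application/json" \
     --data @s3-sink-new-2.json \
     http://connect-worker-1:8083/connectors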

Nothing in the logs indicates a problem, and I'm really lost here.

Will provide more info from logs if necessary :) – OmriManor
I think I figured it out: I have 3 workers and 32 partitions/tasks. I think kafka-connect is trying to distribute the work evenly between the 3 workers and is unable to (32 / 3 = 10.66667). I will test tomorrow with 4 workers. – OmriManor
I've seen this error before when the rest.advertised.host.name cannot be resolved by each worker – OneCricketeer
Thanks for the comment; the documentation on this configuration parameter is lacking, to say the least. What exactly should this resolve to? The hostname of one of the workers? I thought they communicated strictly through Kafka. – OmriManor
It needs to be set to the external host or IP of the machine, and the port is set with rest.port. But rebalancing and REST requests pass between the workers directly, not just through Kafka, from what I've noticed. If this isn't the issue, the consumer group is literally rebalancing and there may be other instabilities in the cluster, not just Connect – OneCricketeer

3 Answers

5
votes

I had the same problem with my setup on Kubernetes. The issue was that I had CONNECT_REST_ADVERTISED_HOST_NAME set to the same value on each of the 16 nodes, which caused a constant rebalancing loop. Give each worker a unique value and you should be fine.

The solution for K8s, which works for me:

- env:
  # Kubernetes Downward API: inject each pod's own IP so every
  # Connect worker advertises a unique, resolvable address.
  - name: CONNECT_REST_ADVERTISED_HOST_NAME
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
1
votes

As Wojciech Sznapka has said, CONNECT_REST_ADVERTISED_HOST_NAME (rest.advertised.host.name if you're not using Docker) is the issue here. It needs to be set not just to a unique value, but to the correct hostname of the worker, one that can be resolved from the other workers.

rest.advertised.host.name is used by Kafka Connect to determine how to contact the other workers - for example, when a worker that receives a REST request is not the leader and needs to forward the request on. If this config is not set correctly, problems ensue.

If you have a cluster of workers and you shut all but one down and suddenly things work, that's because by shutting the others down you've guaranteed that the remaining worker is the leader and thus won't have to forward the request on.
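
As a rough illustration (hostnames and port are placeholders), each worker's distributed properties should advertise its own resolvable address:

# connect-distributed.properties on worker 1 (hostname is a placeholder)
rest.port=8083
rest.advertised.host.name=connect-worker-1.example.internal

# connect-distributed.properties on worker 2
rest.port=8083
rest.advertised.host.name=connect-worker-2.example.internal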

For more details see https://rmoff.net/2019/11/22/common-mistakes-made-when-configuring-multiple-kafka-connect-workers/

0
votes

As for @OmriManor, in my case it was an issue with one of the nodes causing a rebalance loop. What I did was pause the connector, then stop all nodes except for one; I was then able to delete the connector, since the single remaining node did not cause the rebalance loop.
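
For reference, a rough sketch of those REST calls, issued against the single remaining worker (host, port and connector name are placeholders):

# Pause the connector, then delete it while only one worker is left running
curl -X PUT http://connect-worker-1:8083/connectors/s3-sink-new-2/pause
curl -X DELETE http://connect-worker-1:8083/connectors/s3-sink-new-2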