
I am using the Confluent HDFS sink connector 5.0.0 with Kafka 2.0.0, and I need to use the ExtractTopic transformation (https://docs.confluent.io/current/connect/transforms/extracttopic.html). My connector works fine, but when I add this transformation I get a NullPointerException, even on a simple data sample with only two attributes.

ERROR Task hive-table-test-0 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerSinkTask:482)
java.lang.NullPointerException
    at io.confluent.connect.hdfs.DataWriter.write(DataWriter.java:352)
    at io.confluent.connect.hdfs.HdfsSinkTask.put(HdfsSinkTask.java:109)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:464)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:265)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:182)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:150)
    at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:146)
    at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:190)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748) 

Here is the connector configuration:

name=hive-table-test
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=hive_table_test

key.converter=io.confluent.connect.avro.AvroConverter
value.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=${env.SCHEMA_REGISTRY_URL}
value.converter.schema.registry.url=${env.SCHEMA_REGISTRY_URL}
schema.compatibility=BACKWARD

# HDFS configuration
# In later versions, store.url should be used instead of the deprecated hdfs.url, but store.url does not work yet
hdfs.url=${env.HDFS_URL}
hadoop.conf.dir=/etc/hadoop/conf
hadoop.home=/opt/cloudera/parcels/CDH/lib/hadoop
topics.dir=${env.HDFS_TOPICS_DIR}

# Connector configuration
format.class=io.confluent.connect.hdfs.avro.AvroFormat
flush.size=100
rotate.interval.ms=60000

# Hive integration
hive.integration=true
hive.metastore.uris=${env.HIVE_METASTORE_URIS}
hive.conf.dir=/etc/hive/conf
hive.home=/opt/cloudera/parcels/CDH/lib/hive
hive.database=kafka_connect

# Transformations
transforms=InsertMetadata, ExtractTopic
transforms.InsertMetadata.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.InsertMetadata.partition.field=partition
transforms.InsertMetadata.offset.field=offset

transforms.ExtractTopic.type=io.confluent.connect.transforms.ExtractTopic$Value
transforms.ExtractTopic.field=name
transforms.ExtractTopic.skip.missing.or.null=true

I am using Schema Registry, the data is in Avro format, and I am sure the given attribute name is not null. Any suggestions? What I need is basically to extract the content of a given field and use it as the topic name.

EDIT:

It happens even with a simple record like this (shown as JSON; the data is in Avro format):

{
   "attr": "tmp",
   "name": "topic1"
}
Would be useful to see the actual data you're sending and transforming. – OneCricketeer
Okay, I added a sample; it happens even on simple JSON with 2 fields (see above). – Patrik
What is the name of the input topic for the connector configuration? Could you include the whole connector configuration? – Bartosz Wardziński
Yes, I added the whole configuration. The name of the input topic is hive_table_test. – Patrik

1 Answer


The short answer is: because you change the name of the topic in your transformation.

The HDFS connector has a separate TopicPartitionWriter for each topic partition. When the SinkTask that is responsible for processing messages is created, a TopicPartitionWriter is created for each assigned partition in the open(...) method.

When it processes SinkRecords, it looks up the TopicPartitionWriter based on the topic name and partition number and tries to append the record to its buffer. In your case it couldn't find any writer for the message: the topic name had been changed by the transformation, and no TopicPartitionWriter had been created for that (topic, partition) pair.
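
Below is a minimal sketch of that lookup (simplified stand-in classes, not the actual io.confluent.connect.hdfs source), just to show where the null dereference comes from once the topic name no longer matches:

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkRecord;

// Simplified sketch of the behaviour described above; NOT the real DataWriter.
class SketchDataWriter {

    // Stand-in for the connector's per-partition writer.
    static class TopicPartitionWriter {
        void buffer(SinkRecord record) { /* append the record to an in-memory buffer */ }
    }

    private final Map<TopicPartition, TopicPartitionWriter> writers = new HashMap<>();

    // Called when the task is opened: one writer per assigned partition,
    // keyed by the ORIGINAL topic name, e.g. ("hive_table_test", 0).
    void open(Collection<TopicPartition> partitions) {
        for (TopicPartition tp : partitions) {
            writers.put(tp, new TopicPartitionWriter());
        }
    }

    void write(SinkRecord record) {
        // After ExtractTopic, record.topic() is "topic1", not "hive_table_test",
        // so this lookup finds no writer ...
        TopicPartition tp = new TopicPartition(record.topic(), record.kafkaPartition());
        // ... and calling buffer(...) on the missing writer is the kind of null
        // dereference that surfaces as the NullPointerException in DataWriter.write.
        writers.get(tp).buffer(record);
    }
}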

The SinkRecords that are passed to HdfsSinkTask::put(Collection<SinkRecord> records) already have their topic and partition set, so you don't have to apply any transformation there.
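
For illustration, here is a self-contained example using the standard Connect API. The class name ExtractTopicEffect and the offset value are made up, the field values mirror the sample above, and the hand-written newRecord(...) call only approximates what ExtractTopic$Value does:

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.sink.SinkRecord;

public class ExtractTopicEffect {
    public static void main(String[] args) {
        // Connect-side representation of the two-field sample record.
        Schema valueSchema = SchemaBuilder.struct()
                .field("attr", Schema.STRING_SCHEMA)
                .field("name", Schema.STRING_SCHEMA)
                .build();
        Struct value = new Struct(valueSchema)
                .put("attr", "tmp")
                .put("name", "topic1");

        // What the sink task receives: topic and partition are already set
        // by the framework before any SMT runs.
        SinkRecord original = new SinkRecord(
                "hive_table_test", 0,   // topic and partition assigned to the task
                null, null,             // key schema, key
                valueSchema, value,
                42L);                   // offset (arbitrary here)

        // Approximation of the SMT's effect: only the topic name changes.
        SinkRecord renamed = original.newRecord(
                value.getString("name"),        // "topic1"
                original.kafkaPartition(),
                original.keySchema(), original.key(),
                original.valueSchema(), original.value(),
                original.timestamp());

        // The HDFS writer was opened for ("hive_table_test", 0), but the record
        // now claims to belong to ("topic1", 0).
        System.out.println(original.topic() + " -> " + renamed.topic());
    }
}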

I think io.confluent.connect.transforms.ExtractTopic should rather be used with a SourceConnector.