We have a system (say System A) that receives time series data via HTTP and persists it in OpenTSDB through OpenTSDB's REST interface. I would now like to introduce Apache Kafka into the system. My idea is to run a Kafka server and have System A publish each time series message to it as soon as the message is received.
I can then have a consumer that reads from the topic and writes the data to OpenTSDB. I have a couple of questions about this approach:
With respect to architecting the producer and consumer:
Can I have a standalone client containing consumers that just consume from the Kafka topic and write the messages into OpenTSDB?
The producers will be part of System A and will publish messages to the respective topic.
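To make the intended split concrete, here is a rough sketch of what I have in mind. It uses an in-memory queue as a stand-in for the Kafka topic and a stub in place of the OpenTSDB REST call, so every name in it is made up for illustration and none of it is a real Kafka or OpenTSDB API:

```python
import queue

# Stand-in for a Kafka topic; a real setup would use a Kafka client library.
topic = queue.Queue()


def system_a_receive(message):
    """Producer side (inside System A): publish each incoming message."""
    topic.put(message)


def write_to_opentsdb(message, store):
    """Stub for the OpenTSDB REST write; here it only appends to a list."""
    store.append(message)


def run_consumer(store):
    """Consumer side (standalone client): drain the topic into the store."""
    while not topic.empty():
        write_to_opentsdb(topic.get(), store)


stored = []
system_a_receive({"key": "metric.metricType.tagName", "value": 42.0})
run_consumer(stored)
print(stored)
```

The point is only that System A and the consumer client share nothing except the topic.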
With respect to Kafka topics, the time series data consists of metrics that each have a key and a value. An example key is shown below:
"metric.metricType.tagName"
I will have hundreds or perhaps even thousands of these different tagNames. How do I structure this information and represent it as a topic in Apache Kafka? I'm not sure whether there is a limit on the number of topics that I can create.
Should I have one topic per tagName? And how does partitioning the topic fit into this?
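For reference, this is roughly how I picture turning one keyed message into an OpenTSDB data point (the JSON fields `metric`, `timestamp`, `value` and `tags` are what the OpenTSDB put endpoint expects). The way I split the key is my own assumption: the first two segments become the metric name and the last one becomes the value of a tag I have called `tagName`. The sample key and numbers are invented:

```python
import json


def to_datapoint(key, value, timestamp):
    """Split a 'metric.metricType.tagName' key into an OpenTSDB-style data point.

    Assumption: the first two segments form the metric name and the last
    segment is stored under a hypothetical tag key called 'tagName'.
    """
    metric, metric_type, tag_name = key.split(".", 2)
    return {
        "metric": f"{metric}.{metric_type}",
        "timestamp": timestamp,
        "value": value,
        "tags": {"tagName": tag_name},
    }


# Invented sample values, only to show the shape of one message.
point = to_datapoint("cpu.gauge.host01", 0.75, 1400000000)
print(json.dumps(point))
```

If the tagName ends up as a tag like this, I am wondering whether it should also drive the choice of topic, or whether it would be better used as the message key within a single topic.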
With respect to Apache Kafka partitioning, I have the following questions:
If I have a topic "Topic A" with 4 partitions and my producer writes a message to this topic, in which partition will the message be available? Is the same message available in every partition of the topic?
If I write a single consumer for this partitioned topic, how will it behave? Will this consumer receive messages from all of the partitions?
If I have multiple consumers for this partitioned topic, will all of those consumers get the same messages? For example, suppose there are 4 partitions in the topic (TP1, TP2, TP3, TP4) and I have 4 consumer groups (CG1, CG2, CG3, CG4), each containing one consumer that reads the messages from its respective topic partition (C1 reads from TP1, C2 reads from TP2 and so on). Will I end up with duplicate messages if all my consumer groups write the messages they receive to the same database?