3
votes

I am new to developing Kafka Streams applications. My stream processor is meant to sort JSON messages based on the value of a user key in the input JSON message.

Message 1: {"UserID": "1", "Score":"123", "meta":"qwert"}
Message 2: {"UserID": "5", "Score":"780", "meta":"mnbvs"}
Message 3: {"UserID": "2", "Score":"0", "meta":"fghjk"}

I have read here (Dynamically connecting a Kafka input stream to multiple output streams) that there is no dynamic solution for this.

In my use case I know in advance the user keys and the output topics I need to sort the input stream into. So I am writing a separate processor application for each user, where each processor application matches a different UserID.

All the different stream processor applications read from the same JSON input topic in Kafka, but each one writes a message to an output topic for a specific user only if the preset user condition is met.

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.processor.TopologyBuilder;

import java.io.IOException;
import java.util.HashMap;
import java.util.Properties;

public class SwitchStream extends AbstractProcessor<String, String> {
    @Override
    public void process(String key, String value) {
        HashMap<String, String> message;
        ObjectMapper mapper = new ObjectMapper();
        try {
            message = mapper.readValue(value, HashMap.class);
        } catch (IOException e) {
            return; // skip records that cannot be parsed as JSON
        }

        // User condition UserID = 1; null-safe so records without
        // a UserID field are simply dropped
        if ("1".equals(message.get("UserID"))) {
            context().forward(key, value);
            context().commit();
        }
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sort-stream-processor");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        TopologyBuilder builder = new TopologyBuilder();
        builder.addSource("Source", "INPUT_TOPIC");
        builder.addProcessor("Process", SwitchStream::new, "Source");
        builder.addSink("Sink", "OUTPUT_TOPIC", "Process");

        KafkaStreams streams = new KafkaStreams(builder, props);
        streams.start();
    }
}

Question 1: Is it possible to achieve the same functionality easily using the High-Level Streams DSL instead of the Low-Level Processor API? (I admit I found it harder to understand and follow the other online examples of the High-Level Streams DSL.)

Question 2: The input JSON topic is receiving input at a high rate, 20K-25K EPS. My processor applications don't seem to be able to keep pace with this input stream. I have tried deploying multiple instances of each processor, but the results are nowhere close to where I want them to be. Ideally each processor instance should be able to process 3-5K EPS.

Is there a way to improve my processor logic, or to write the same processor logic using the High-Level Streams DSL? Would that make a difference?


1 Answer

3
votes

You can do this with the high-level DSL via filter() (you effectively implemented a filter, as you only forward a message if UserID == 1). You could generalize this filter pattern by using KStream#branch() (see the docs for further details: http://docs.confluent.io/current/streams/developer-guide.html#stateless-transformations). Also read the JavaDocs: http://kafka.apache.org/0102/javadoc/index.html?org/apache/kafka/streams

final ObjectMapper mapper = new ObjectMapper();

KStreamBuilder builder = new KStreamBuilder();
builder.<String, String>stream("INPUT_TOPIC")
       .filter(new Predicate<String, String>() {
           @Override
           public boolean test(String key, String value) {
               // put your processor logic here
               try {
                   Map<String, String> message = mapper.readValue(value, HashMap.class);
                   return "1".equals(message.get("UserID"));
               } catch (IOException e) {
                   return false; // drop records that cannot be parsed
               }
           }
        })
       .to("OUTPUT_TOPIC");

About performance: a single instance should be able to process 10K+ records per second. It's hard to tell what the problem might be without any further information. I would recommend asking on the Kafka user list (see http://kafka.apache.org/contact).
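
One knob worth checking (an assumption about your setup, not a diagnosis): Kafka Streams runs a single processing thread per instance by default, and you can raise that via num.stream.threads. A minimal sketch:

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sort-stream-processor");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// Run several processing threads inside one instance; note that
// effective parallelism is capped by the input topic's partition count.
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);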