Make apache storm topology to use latest offset from kafka

Question

I have a kafkaspout, 2 bolts to process data, 2 bolts to store processed data in mongodb

I am using apache flux to create topology where I am reading data into spout from kafka. Everything is running fine but every time I run the topology, it processes all the msgs in kafka from the start. and once it processes all the msgs, it does not wait for more msgs and crashes.

How can I make storm topology to process latest msgs only.

here is my topology file .yaml

name: "kafka-topology"

components:
# MongoDB mapper
  - id: "block-mapper"
    className: "org.apache.storm.mongodb.common.mapper.SimpleMongoMapper"
    configMethods:
      - name: "withFields"
        args: # The following are the tuple fields to map to a MongoDB document
          - ["block"]
# MongoDB mapper
  - id: "transaction-mapper"
    className: "org.apache.storm.mongodb.common.mapper.SimpleMongoMapper"
    configMethods:
      - name: "withFields"
        args: # The following are the tuple fields to map to a MongoDB document
          - ["transaction"]

  - id: "stringScheme"
    className: "org.apache.storm.kafka.StringScheme"

  - id: "stringMultiScheme"
    className: "org.apache.storm.spout.SchemeAsMultiScheme"
    constructorArgs:
      - ref: "stringScheme"

  - id: "zkHosts"
    className: "org.apache.storm.kafka.ZkHosts"
    constructorArgs:
      - "172.25.33.191:2181"

  - id: "spoutConfig"
    className: "org.apache.storm.kafka.SpoutConfig"
    constructorArgs:
      # brokerHosts
      - ref: "zkHosts"
      # topic
      - "blockdata"
      # zkRoot
      - ""
      # id
      - "myId"
    properties:
      - name: "scheme"
        ref: "stringMultiScheme"
      - name: "ignoreZkOffsets"
        value: flase



config:
  topology.workers: 1
  # ...

# spout definitions
spouts:
  - id: "kafka-spout"
    className: "org.apache.storm.kafka.KafkaSpout"
    constructorArgs:
      - ref: "spoutConfig"
    parallelism: 1

# bolt definitions
bolts:
  - id: "blockprocessing-bolt"
    className: "org.apache.storm.flux.wrappers.bolts.FluxShellBolt"
    constructorArgs:
      # command line
      - ["python", "process-bolt.py"]
      # output fields
      - ["block"]
    parallelism: 1
    # ...
  - id: "transprocessing-bolt"
    className: "org.apache.storm.flux.wrappers.bolts.FluxShellBolt"
    constructorArgs:
      # command line
      - ["python", "trans-bolt.py"]
      # output fields
      - ["transaction"]
    parallelism: 1
    # ...

  - id: "mongoBlock-bolt"
    className: "org.apache.storm.mongodb.bolt.MongoInsertBolt"
    constructorArgs:
      - "mongodb://172.25.33.205:27017/testdb"
      - "block"
      - ref: "block-mapper"
    parallelism: 1
    # ...

  - id: "mongoTrans-bolt"
    className: "org.apache.storm.mongodb.bolt.MongoInsertBolt"
    constructorArgs:
      - "mongodb://172.25.33.205:27017/testdb"
      - "transaction"
      - ref: "transaction-mapper"
    parallelism: 1
    # ...



  - id: "log"
    className: "org.apache.storm.flux.wrappers.bolts.LogInfoBolt"
    parallelism: 1
    # ...

#stream definitions
# stream definitions define connections between spouts and bolts.
# note that such connections can be cyclical
# custom stream groupings are also supported

streams:


  - name: "kafka --> block-Processing" # name isn't used (placeholder for logging, UI, etc.)
    from: "kafka-spout"
    to: "blockprocessing-bolt"
    grouping:
      type: SHUFFLE

  - name: "kafka --> transaction-processing" # name isn't used (placeholder for logging, UI, etc.)
    from: "kafka-spout"
    to: "transprocessing-bolt"
    grouping:
      type: SHUFFLE

  - name: "block --> mongo"
    from: "blockprocessing-bolt"
    to: "mongoBlock-bolt"
    grouping:
      type: SHUFFLE

  - name: "transaction --> mongo"
    from: "transprocessing-bolt"
    to: "mongoTrans-bolt"
    grouping:
      type: SHUFFLE

I have tried adding property to spoutconfig for fetching latest msgs only like this

 - id: "spoutConfig"
className: "org.apache.storm.kafka.SpoutConfig"
constructorArgs:
  - ref: "zkHosts"
  - "blockdata"
  - ""
  - "myId"
properties:
  - name: "scheme"
    ref: "stringMultiScheme"
  - name: "startOffsetTime"
    ref: "EarliestTime"

  - name: "forceFromStart"
    value: false

But It gives error no matter what I place in ref of startOffsetTime

Exception in thread "main" java.lang.IllegalArgumentException: Can not set long field org.apache.storm.kafka.KafkaConfig.startOffsetTime to null value

Stig Rohde Døssing Stig Rohde Døssing · Accepted Answer · 2018-05-26T16:23:47

You need to set the startOffsetTime to kafka.api.OffsetRequest.LatestTime. As you can see at https://github.com/apache/storm/tree/64af629a19a82591dbf3428f7fd6b02f39e0723f/external/storm-kafka#kafkaconfig, the default setting will go to the earliest offset available.

The exception you're hitting seems unrelated. It looks like a Curator/Zookeeper incompatibility.

Edit: I think you're hitting this issue https://issues.apache.org/jira/browse/STORM-2978. 1.2.2 should be out soon, please try upgrading once it releases.

Edit edit: If you want to work around it without upgrading, edit the pom for your topology so it includes a dependency on Zookeeper 3.4 and not 3.5.

Make apache storm topology to use latest offset from kafka

1 Answers