
I am using an Axon Tracking Event Processor. Sometimes events take longer than 10 seconds to process.

This seems to cause the event to be processed again, and the following appears in the log: "Releasing claim of token X/0 failed. It was owned by another node."

If I increase the number of segments it does not log this, BUT the event is still processed twice, so I think this might be misleading. (I think I was mistaken about this.)

I have tried adjusting the fetchDelay, cleanupDelay and tokenClaimInterval, none of which has fixed this. Is there a property or something that I am missing?

Edit

The scenario taking longer than 10 seconds is making an HTTP request to an external service.

I'm using Axon 4.1.2 with all default configuration provided by Spring auto-configuration. I cannot see the "Releasing claim on token and preparing for retry in [timeout]s" log message.

I was having this issue with a single segment and 2 instances of the application. I realised I hadn't increased the number of segments like I thought I had.

After further investigation I have discovered that adding an additional segment seems to have stopped this. Even if I have, for example, 2 segments and 6 application instances it still doesn't reappear; however, I'm not sure how this is different from my original scenario of 1 segment and 2 instances?

I didn't realise it would be possible for multiple threads to grab the same tracking token and process the same event. It sounds like the best action would be to put an idempotency check before the HTTP call?
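Something like the following is what I have in mind. ProcessedEventRepository, ExternalServiceClient and SomethingHappenedEvent are placeholder names, and the idea is simply to record handled event identifiers in the database the instances already share:

```java
import org.axonframework.eventhandling.EventHandler;
import org.springframework.stereotype.Component;

@Component
public class ExternalNotificationHandler {

    private final ProcessedEventRepository processedEvents; // placeholder: table keyed by event id
    private final ExternalServiceClient externalService;    // placeholder: wrapper around the HTTP call

    public ExternalNotificationHandler(ProcessedEventRepository processedEvents,
                                       ExternalServiceClient externalService) {
        this.processedEvents = processedEvents;
        this.externalService = externalService;
    }

    @EventHandler
    public void on(SomethingHappenedEvent event) {
        // Skip the slow HTTP call if this event id was already handled by any instance.
        if (processedEvents.existsById(event.getEventId())) {
            return;
        }
        externalService.notify(event);
        // A unique constraint on the event id column makes the duplicate check race-safe.
        processedEvents.save(new ProcessedEvent(event.getEventId()));
    }
}
```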

I'd like to help you Dan, but I require a bit more info here. The following would be helpful to know: What version of Axon are you running? How many instances of this application are you running concurrently? How many threads have been configured for this TrackingEventProcessor (TEP)? How many segments have been configured for this TEP? What kind of operation takes 10 seconds or longer when it comes to event handling? – Steven
Any other specifics around the configuration of this TEP would also be helpful to know, so please share those as well. Updating your original question with this info would be clearest, I think. – Steven

1 Answer


The "Releasing claim of token [event-processor-name]/[segment-id] failed. It was owned by another node." message can only occur in three scenarios:

  1. You are performing a merge operation of two segments which fails because the given thread doesn't own both segments.
  2. The main event processing loop of the TrackingEventProcessor is stopped, but releasing the token claim fails because the token is already claimed by another thread.
  3. The main event processing loop has caught an Exception, making it retry with an exponential back-off, and it tries to release the claim (which might fail with the given message).

I am guessing it's not option 1 or 2, so that would leave us with option 3. This should also mean you are seeing other WARN-level messages, like:

Releasing claim on token and preparing for retry in [timeout]s

Would you be able to share whether that's the case? That way we can pinpoint a little better what the exact problem is you are encountering.

By the way, it is very likely you have several processes (event handling threads of the TrackingEventProcessor) stealing the TrackingToken from one another. As they're stealing a not-yet-updated token, both (or more) will handle the same event. Hence you see the event handler being invoked twice.

This is obviously undesirable behavior and something we should resolve for you. I would like to ask you to provide answers to my comments under the question, as right now I have too little to go on. Let us figure this out @Dan!

Update

Thanks for updating your question @dan, that's very helpful. From what you've shared, I am fairly confident that both instances are stealing the token from one another. This does depend though on whether both are using the same database for the token_entry table (although I am assuming they are).

If they are using the same table, then they should "nicely" share their work, unless one of them takes too long. If it takes too long, the token will be claimed by another process. This other process, in this case, is the TEP thread of your other application instance. The claim timeout defaults to 10 seconds, which also corresponds with your long-running event handling process.

This claimTimeout is adjustable though, by invoking the Builder of the JpaTokenStore/JdbcTokenStore (depending on which one you are using / auto-wiring) and calling the JpaTokenStore.Builder#claimTimeout(TemporalAmount) method. I think this would be required on your end, given the fact you have a long-running operation.
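As a rough sketch, assuming the JPA token store and Spring configuration (the configuration class name and the 30-second value are just illustrative), overriding the TokenStore bean could look like this:

```java
import java.time.Duration;

import org.axonframework.common.jpa.EntityManagerProvider;
import org.axonframework.eventhandling.tokenstore.TokenStore;
import org.axonframework.eventhandling.tokenstore.jpa.JpaTokenStore;
import org.axonframework.serialization.Serializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class TokenStoreConfiguration {

    @Bean
    public TokenStore tokenStore(EntityManagerProvider entityManagerProvider, Serializer serializer) {
        return JpaTokenStore.builder()
                            .entityManagerProvider(entityManagerProvider)
                            .serializer(serializer)
                            // Default is 10 seconds; pick a value above your slowest event handler.
                            .claimTimeout(Duration.ofSeconds(30))
                            .build();
    }
}
```

If you are on the JdbcTokenStore instead, its builder exposes a similar claimTimeout setting.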

There are of course different ways of tackling this, like making sure the TEP is only run on a single instance (not really fault tolerant though), or offloading the long-running operation to a scheduled task which is triggered by the event.
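For the offloading option, a minimal sketch could look like the following (ExternalServiceClient and SomethingHappenedEvent are placeholder names; any async executor would do). The event handler only hands the work over, so the token is updated well within the claim timeout:

```java
import org.axonframework.eventhandling.EventHandler;
import org.springframework.core.task.TaskExecutor;
import org.springframework.stereotype.Component;

@Component
public class OffloadingEventHandler {

    private final TaskExecutor taskExecutor;                // any async executor will do
    private final ExternalServiceClient externalService;    // placeholder for the slow HTTP call

    public OffloadingEventHandler(TaskExecutor taskExecutor, ExternalServiceClient externalService) {
        this.taskExecutor = taskExecutor;
        this.externalService = externalService;
    }

    @EventHandler
    public void on(SomethingHappenedEvent event) {
        // The handler returns almost immediately, so the token claim is updated and
        // released in time; the slow HTTP call runs on a separate thread.
        // Note: if the instance crashes before the task completes, the call is lost.
        taskExecutor.execute(() -> externalService.notify(event));
    }
}
```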

But I think we've found the issue at least, so I'd suggest tweaking the claimTimeout and seeing whether the problem persists. Let us know if this resolves the problem on your end @dan!