I'm doing performance characterization of my organization's ActiveMQ 5.8.0 broker configuration, focusing on the impact of WAN latency between brokers in a simple network-of-brokers configuration that contains only two brokers. I have access to a WAN emulator device that lets me change the simulated WAN latency independently in each direction across the network link between the two brokers, while leaving all other network connections (particularly those between clients and their local broker) with 0ms nominal latency.
When WAN latencies are below ~150ms, I'm seeing end-to-end latencies that are exactly what we would expect based on the WAN latency, but at higher latencies, end-to-end latency is triple what we would expect. I've scoured Google, SO, and the AMQ mailing list archives but haven't found anyone describing this behavior, nor anything that could explain what's going on.
ActiveMQ Setup:
- Non-persistent messages. No transactions, no delay from persistence technologies, etc.
- Two brokers, with the WAN emulator between them. This WAN emulator does not affect traffic between clients and brokers, only between the two brokers. It is creating simulated latency, but no other network effects (packet loss, out-of-order packets, etc.).
- One queue for messages from producer to consumer, and one exclusive reply queue for responses from consumer to producer.
- One producer, connected to Broker 1, using Camel in an InOut pattern, producing 20KB messages as quickly as it can and then consuming the response from the exclusive reply queue before sending the next message.
- One consumer, connected to Broker 2, also using Camel to consume the message from the producer and reply to the exclusive reply queue.
- Messages from the producer contain a header value on which the consumer has a selector. All messages produced by the producer are consumed by the consumer.
- Because the producer uses the (synchronous) Camel InOut pattern, the next message is not produced until the previous message's response has been consumed. This means that prefetch buffers and producer flow control are not a factor here.
Because our messages are non-persistent, I would expect the one-way end-to-end latency to be the WAN latency in that direction plus a small (2-5ms) constant for processing, Camel's overhead, etc., and the round-trip end-to-end latency to be the sum of the two one-way WAN latencies plus about double that small constant (about 5-10ms).
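The expectation above amounts to a trivial model; as a sketch (the ~5ms constant is the estimate from the text, not a measured value):

```python
# Expected latency model for non-persistent messages: one WAN traversal
# per direction, plus a small assumed constant for broker/Camel processing.

def expected_one_way(wan_ms, overhead_ms=5):
    """One-way end-to-end latency: one WAN traversal plus processing overhead."""
    return wan_ms + overhead_ms

def expected_round_trip(fwd_wan_ms, ret_wan_ms, overhead_ms=5):
    """Round trip: one traversal in each direction plus overhead on each leg."""
    return fwd_wan_ms + ret_wan_ms + 2 * overhead_ms

print(expected_one_way(100))          # 105 -- matches the low-latency observations
print(expected_round_trip(100, 100))  # 210
```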
Relevant Configuration Snippets:
ActiveMQ config file, irrelevant boilerplate content removed:
<beans boilerplateSchemaStuff>
<bean class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">boilerPlateStuff</bean>
<broker xmlns="http://activemq.apache.org/schema/core"
brokerName="${broker.name}"
useJmx="true"
persistent="false"
schedulePeriodForDestinationPurge="60000"
networkConnectorStartAsync="true"
cacheTempDestinations="true"
timeBeforePurgeTempDestinations="300000"
allowTempAutoCreationOnSend="true">
<destinationPolicy>
<policyMap>
<policyEntries>
<policyEntry queue=">" producerFlowControl="true" memoryLimit="10mb">
<pendingQueuePolicy>
<vmQueueCursor/>
</pendingQueuePolicy>
<slowConsumerStrategy>
<abortSlowConsumerStrategy abortConnection = "true"/>
</slowConsumerStrategy>
</policyEntry>
</policyEntries>
</policyMap>
</destinationPolicy>
<managementContext>
<managementContext createConnector="false" />
</managementContext>
<networkConnectors>
<networkConnector name="REMOTE_QUEUES" uri="${broker.remote.uri}"
networkTTL="3" decreaseNetworkConsumerPriority="true"
conduitSubscriptions="false" dynamicOnly="true" />
</networkConnectors>
<plugins>
<statisticsBrokerPlugin/>
<discardingDLQBrokerPlugin dropAll = "true" dropTemporaryTopics = "true"
dropTemporaryQueues = "true"/>
</plugins>
<systemUsage>reasonable limits, not believed to cause this issue since producer flow control isn't occurring</systemUsage>
<transportConnectors>
<transportConnector name="openwire" uri="tcp://0.0.0.0:${tcp.port}"/>
<transportConnector name="stomp" uri="stomp://0.0.0.0:${stomp.port}"/>
</transportConnectors>
</broker>
<import resource="file:${ACTIVEMQ_CONFIG_SC}/conf/jetty.xml"/>
</beans>
Producer Spring config:
<beans boilerplateSchemaStuff>
<bean id="messageGen" class="teststubs.MessageGenerator">
<property name="name" value="MessageGenerator"/>
</bean>
<bean id="amqMessageThroughputProcessor" class="teststubs.AmqMessageThroughputProcessor">
<property name="name" value="QueueResponseConsumer${token}"/>
</bean>
<bean id="token" class="java.lang.String">
<constructor-arg value="${token}" />
</bean>
<camel:camelContext allowUseOriginalMessage="false">
<camel:jmxAgent id="agent" disabled="true" />
<camel:endpoint id="sendDestination"
uri="jms:${destination.type}:PerfTest#{T(org.apache.commons.lang3.text.WordUtils).capitalize('${destination.type}')}?requestTimeout=10000&amp;replyTo=PerfTestResponseQueue${token}&amp;replyToType=Exclusive&amp;concurrentConsumers=1&amp;mapJmsMessage=false"
xmlns="http://camel.apache.org/schema/spring"/>
<camel:route>
<camel:from uri="timer://sendTask?period=1&amp;fixedRate=true&amp;repeatCount=1" />
<camel:loop copy="true">
<camel:constant>10000</camel:constant>
<camel:bean ref="messageGen" method="create20KbTimestampedMessage"/>
<camel:setHeader headerName="Token">
<camel:simple>ref:token</camel:simple>
</camel:setHeader>
<camel:to ref="sendDestination" pattern="InOut" />
<camel:process ref="amqMessageThroughputProcessor" />
</camel:loop>
</camel:route>
</camel:camelContext>
</beans>
Consumer Spring config:
<beans boilerplateSchemaStuff>
<bean id="amqMessageThroughputProcessor" class="teststubs.AmqMessageThroughputProcessor">
<property name="name" value="QueueConsumer"/>
</bean>
<camel:camelContext>
<camel:jmxAgent id="agent" disabled="true" />
<camel:endpoint id="receiveDestination"
uri="jms:${destination.type}:PerfTest#{T(org.apache.commons.lang3.text.WordUtils).capitalize('${destination.type}')}?concurrentConsumers=1&amp;mapJmsMessage=false&amp;selector=Token='${token}'"
xmlns="http://camel.apache.org/schema/spring"/>
<camel:route streamCache="true">
<camel:from ref="receiveDestination" />
<camel:process ref="amqMessageThroughputProcessor" />
</camel:route>
</camel:camelContext>
</beans>
Observed Results:
With low latencies (<= 100ms) in both directions, this is exactly what I see; for 100ms in each direction, one-way end-to-end latency is about 105ms, and round-trip latency is about 210ms. When I increase the latency to 150ms in each direction, it starts out as expected (155ms one-way, 310ms round-trip) but after some amount of time (a few seconds to a few minutes, from what I've seen) the latency triples to about 455ms one-way and 910ms round-trip. This is a sudden jump in latency; one message reports 155ms one-way and 310ms round-trip, and the next message and all subsequent messages report 455ms and 910ms, rather than a slow build to those higher latencies. There are no errors in the logs of either broker nor the two clients at the time the latency jumps (nor at any other time).
Varying the WAN latencies further (staying at or above 150ms one-way) results in immediate latency changes that maintain a one-way end-to-end latency of (3 x one-way WAN latency) + 5ms and a round-trip end-to-end latency of 3 x (forward one-way WAN latency + return one-way WAN latency) + 10ms. Since the increase in latency is directly proportional to the change in WAN latency, it seems safe to say that it's being caused by something that crosses the network rather than by something introducing a fixed delay on the producer, consumer, or either broker.
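As a sanity check, the post-jump observations fit a three-traversal model exactly (a sketch using the same assumed ~5ms processing constant as before):

```python
def observed_one_way(wan_ms, overhead_ms=5):
    # After the latency jump: three WAN traversals instead of one.
    return 3 * wan_ms + overhead_ms

def observed_round_trip(fwd_wan_ms, ret_wan_ms, overhead_ms=5):
    return 3 * (fwd_wan_ms + ret_wan_ms) + 2 * overhead_ms

print(observed_one_way(150))          # 455 -- as observed
print(observed_round_trip(150, 150))  # 910 -- as observed
```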
Because I can vary the simulated latency in the two directions independently, I've been able to determine that the 3x one-way latency I saw when both directions had the same latency is actually (2 x forward one-way WAN latency) + (1 x return one-way WAN latency) for the forward-direction one-way end-to-end latency, as compared to the expected value of 1 x forward one-way WAN latency. Subtracting the two shows that the unexpected additional latency is (1 x forward one-way WAN latency) + (1 x return one-way WAN latency), indicating that there's one unexpected round trip per message happening between the brokers (maybe also between each broker and its client?). The same result is found for the response message as well.
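The subtraction above, sketched with hypothetical asymmetric latencies for illustration:

```python
def unexpected_extra(fwd_ms, ret_ms):
    # Observed forward one-way latency (constants ignored): 2*forward + 1*return.
    observed = 2 * fwd_ms + 1 * ret_ms
    # Expected: a single forward WAN traversal.
    expected = 1 * fwd_ms
    return observed - expected

# With e.g. 200ms forward / 50ms return, the unexplained extra delay is
# exactly one broker-to-broker round trip (forward + return).
print(unexpected_extra(200, 50))  # 250 == 200 + 50
```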
Possible Causes:
The round-trip time between the two broker hosts is reported by ping as 300ms (as expected), so I don't have any reason to believe this is being caused by the WAN emulator introducing the wrong simulated latency, though if there's some reason why ping's reported RTT can't be trusted, I'm open to exploring that further. But in the absence of a concrete reason not to trust the widely-used ping utility, that doesn't seem like the issue. Update: I've used a previously written and tested network traffic tool to confirm that TCP packets are delivered in one direction in 1 x forward one-way WAN latency, even though ActiveMQ TCP packets are taking an additional round trip. I believe this further indicates that the WAN emulator is not introducing this behavior.
I also don't have any reason to believe that any meaningful part of the latency stems from Camel, because with the WAN emulator set to 0ms latency in both directions, I can push ~200 messages per second. I believe that Camel also doesn't do anything across the network interface between the brokers (Camel would only access that network link by sending ActiveMQ messages, and the ActiveMQ web console shows no evidence of additional messages being sent), but here too, if someone knows something concrete to indicate that Camel could be causing this behavior, I'm open to investigating that.
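The ~200 messages/second figure at 0ms WAN latency itself bounds the non-network overhead per synchronous InOut round trip, consistent with the small constant estimated earlier:

```python
msgs_per_sec = 200  # measured with the WAN emulator at 0ms in both directions
# Each InOut exchange is fully serialized, so the per-exchange overhead is
# simply the reciprocal of the rate.
overhead_ms = 1000 / msgs_per_sec
print(overhead_ms)  # 5.0 -- Camel + broker overhead per round trip, no WAN component
```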
My initial guess was that messages were being retransmitted, but I can't see any evidence of that: when latency triples, throughput across the WAN emulator drops to just over 1/3 of its original value, whereas if messages were being retransmitted between the brokers I'd expect it to be at least 2/3 of the original value (1/3 for the first transmission and 1/3 for the second, plus a small amount for the retransmission request message). I'd also expect the Dequeue Count on the web console to exceed the Enqueue Count, but that's not happening either. So I don't believe that's what's going on (and even if it were, it wouldn't explain why it happens at certain latencies but not others, nor why at a given latency it starts out not happening and then suddenly begins).
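That throughput argument can be made concrete; the message rate and size are from the setup above, and the bytes-per-message accounting is a simplification (it ignores OpenWire framing and the retransmission request itself):

```python
def wan_bytes_per_sec(msg_rate, msg_bytes, transmissions_per_msg):
    # Bytes crossing the WAN emulator per second.
    return msg_rate * msg_bytes * transmissions_per_msg

base = wan_bytes_per_sec(200, 20_000, 1)  # ~200 msg/s of 20KB messages at 0ms latency

# When end-to-end latency triples, the synchronous InOut rate drops to ~1/3.
no_retransmit = wan_bytes_per_sec(200 / 3, 20_000, 1)    # what the emulator counters show
with_retransmit = wan_bytes_per_sec(200 / 3, 20_000, 2)  # what retransmission would predict

print(no_retransmit / base)    # ~0.33 -- just over 1/3, matching observation
print(with_retransmit / base)  # ~0.67 -- at least 2/3, which is not observed
```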
Does anyone have any ideas of what could be causing this behavior, or even what ActiveMQ code would be my best starting point if I have to go reading the source looking for clues?