
TL;DR

My Spring Boot REST endpoint returns the HTTP status immediately but sometimes waits 60 seconds before returning the result. This is caused by the headers of a Feign-proxied request being altered in transit (Transfer-Encoding being replaced with Content-Length). The context and results of the investigation are detailed below.

Facts

  • A (Spring Boot + Feign + Tomcat) -> Traefik -> B (Rest resource, Spring Boot + Tomcat)
  • Traefik and B always respond immediately; A always returns the 200 status code immediately.
  • Based on unknown criteria, A uses a KeepAlive stream and returns only after precisely 60 seconds (max idle thread?).
  • B uses the Transfer-Encoding: chunked header, but Traefik replaces it with Content-Length depending on unknown criteria.
  • The pause comes from a KeepAliveStream not being closed. I found several workarounds/solutions, but I'd also love an explanation:
    • Removing the Content-Length header solves the issue. Debugging sun.net.www.http.HttpClient confirms that having a Content-Length header triggers the use of a KeepAliveStream.
    • Calling A with the Connection: close header seems to solve the issue (same reason: this prevents the use of the KeepAliveStream).
    • Replacing Tomcat with Jetty in A seems to solve the issue, as it seems to rely on other HTTP libs.
    • Replacing the Apache HttpClient used by A's Feign with OkHttp solves the problem (see the configuration sketch after this list).
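
For that last workaround, a minimal configuration sketch, assuming the io.github.openfeign:feign-okhttp dependency is on A's classpath (the class and bean names are mine; with Spring Cloud OpenFeign, setting feign.okhttp.enabled=true is an alternative):

import feign.Client;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class FeignOkHttpConfig {

    // Registers OkHttp as the Feign Client bean, replacing whichever
    // client Feign was using before (and thus avoiding the JDK's
    // sun.net.www.http.HttpClient and its KeepAliveStream).
    @Bean
    public Client feignClient() {
        return new feign.okhttp.OkHttpClient();
    }
}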

Remaining questions

  • Why doesn't Feign/Tomcat/HttpClient close the stream once the whole body is available (which is immediately)?
  • Bonus question: Why/When/Based on what does Traefik alter the headers, and what are the rules?

The mystery of the lost bytes

One of our latest tests was to use -v with curl and we saw this while A was pausing:

$ curl -i -v http://localhost:8100/a/proxied-endpoint
#...
< Content-Length: 1843
<
{ [1793 bytes data]

So, the service hangs, waiting for 50 missing bytes. When interrupted, though, it returns the whole response. I'm thinking of an encoding issue there, but I don't understand where it could happen.

Replacing the content length with 1793 (or lower) makes the endpoint return immediately.

There is a discrepancy between the way the Content-Length header is computed and the way our client calculates it upon receiving the body.

Detailed context

The situation

I'm having an issue with a Spring Boot REST controller (A) that acts as a proxy to another service (B). The endpoint basically looks like this:

@GetMapping("/{resource}")
public ResponseEntity<List<Something>> getAll(@PathVariable String resource) {
    return myFeignProxy.getAll(resource);
}
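
For context, myFeignProxy is not shown in the question; with Spring Cloud OpenFeign it would presumably be declared along these lines (the client name and URL property are hypothetical):

import java.util.List;
import org.springframework.cloud.openfeign.FeignClient;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;

@FeignClient(name = "service-b", url = "${service-b.url}")
public interface MyFeignProxy {

    // Mirrors B's endpoint; Feign deserializes the body into List<Something>
    @GetMapping("/{resource}")
    ResponseEntity<List<Something>> getAll(@PathVariable("resource") String resource);
}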

There's a Traefik reverse proxy between A and B. In summary: A -> Traefik -> B.

In all cases, the Feign proxy answers in less than 100 ms and the endpoint returns the HTTP status (200) immediately. However, in most cases, the body is not returned immediately: A's Spring Boot waits for precisely 60 seconds (this is really not random).

Whether the body is returned immediately or after 60 s seems to depend on the resource: some resources are always available immediately, while the others always have the wait. Once again, this does not seem random.

[Edit]: Investigation has shown that, in the cases where A pauses, Traefik replaced B's original Transfer-Encoding header with Content-Length. Based on this header, sun.net.www.http.HttpClient decides to use a KeepAliveStream. The problem is that this stream then doesn't close.

Versions

Spring Boot: 2.2.6
Tomcat: 9.0.33
Feign: (determined by Spring Cloud 2.1.2)
Traefik: 2.2.11

What it's not

It is not an issue with the proxied service (B) being slow. In all cases, myFeignProxy responds in a few ms and the endpoint returns the 200 HTTP status immediately.

I've tried changing Feign client timeouts, without any impact.

I also see no correlation between the pause, the size of the body, and the response time of the Feign proxy.

Resource | Size (KB) | Feign time (ms) | 60s pause
---------|-----------|-----------------|----------
1        | 1.87      | 34              | yes
2        | 3.29      | 35              | no
3        | 1.55      | 34              | yes
4        | 10.05     | 81              | yes

The problem is not related to Spring Security either, as entirely removing it (configuration and dependencies) does not change the symptoms.

Updated investigations

Technical layer causing the pause

The pause seems to come from Tomcat. Replacing the Tomcat starter with the Jetty starter (in A) eliminates the issue (all requests answer immediately). That being said, it doesn't explain the problem.

Trace log analysis

It appears that, for an endpoint where the pause occurs, there is one additional line in the logs during the call. See below for examples. The headers logged for the HttpURLConnection also differ, though I do not understand why.

Case without pause

DEBUG [nio-8100-exec-9] s.n.www.protocol.http.HttpURLConnection  : sun.net.www.MessageHeader@784b4a945 pairs: {GET /xxx HTTP/1.1: null}{Accept: application/json}{User-Agent: Java/11.0.7}{Host: xxx}{Connection: keep-alive}
DEBUG [nio-8100-exec-9] s.n.www.protocol.http.HttpURLConnection  : sun.net.www.MessageHeader@2a3818a612 pairs: {null: HTTP/1.1 200 OK}{Cache-Control: no-cache, no-store, max-age=0, must-revalidate}{Content-Type: application/json}{Date: Tue, 20 Apr 2021 07:47:47 GMT}{Expires: 0}{Pragma: no-cache}{Strict-Transport-Security: max-age=31536000 ; includeSubDomains}{Vary: accept-encoding}{X-Content-Type-Options: nosniff}{X-Frame-Options: DENY}{X-Xss-Protection: 1; mode=block}{Transfer-Encoding: chunked}

Case with pause

DEBUG [nio-8100-exec-6] s.n.www.protocol.http.HttpURLConnection  : sun.net.www.MessageHeader@7bff99e75 pairs: {GET /xxx HTTP/1.1: null}{Accept: application/json}{User-Agent: Java/11.0.7}{Host: xxx}{Connection: keep-alive}
TRACE [nio-8100-exec-6] s.n.www.protocol.http.HttpURLConnection  : KeepAlive stream used: https://xxx/xxx
DEBUG [nio-8100-exec-6] s.n.www.protocol.http.HttpURLConnection  : sun.net.www.MessageHeader@5aed6c9312 pairs: {null: HTTP/1.1 200 OK}{Cache-Control: no-cache, no-store, max-age=0, must-revalidate}{Content-Type: application/json}{Date: Tue, 20 Apr 2021 07:57:42 GMT}{Expires: 0}{Pragma: no-cache}{Strict-Transport-Security: max-age=31536000 ; includeSubDomains}{Vary: accept-encoding}{X-Content-Type-Options: nosniff}{X-Frame-Options: DENY}{X-Xss-Protection: 1; mode=block}{Content-Length: 803}

When finally responding after the pause (not present when responding immediately)

DEBUG [nio-8100-exec-7] o.apache.tomcat.util.threads.LimitLatch  : Counting down[http-nio-8100-exec-7] latch=1
DEBUG [nio-8100-exec-7] org.apache.tomcat.util.net.NioEndpoint   : Calling [org.apache.tomcat.util.net.NioEndpoint@63668501].closeSocket([org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper@cfdc708:org.apache.tomcat.util.net.NioChannel@6e7c15b6:java.nio.channels.SocketChannel[connected local=/0:0:0:0:0:0:0:1:8100 remote=/0:0:0:0:0:0:0:1:52501]])

The additional log line ("KeepAlive stream used") comes from sun.net.www.http.HttpClient. The decision to use this KeepAliveStream seems to depend on the proxied response's headers.

Headers analysis and Traefik meddling

Traefik changes the headers between A and B.

B always returns its response with Transfer-Encoding: chunked. Traefik sometimes replaces it with Content-Length and the correct size for the payload.

Nothing is configured in our Traefik instance concerning headers. The rules used to decide between Transfer-Encoding and Content-Length seem hard to grasp:

  • It seems to depend on the endpoint being called or its payload.
  • It also seems to depend on something from the caller, as I don't always get the same header depending on whether I'm calling from A or from curl.

This explains why the problem is not reproducible when both applications are on the local machine, since there is no Traefik between them.

As for the rules Traefik applies, the HTTP version appears to play a role.

$ curl -s -o /dev/null -D - --http1.1 https://traefik/b/endpoint
HTTP/1.1 200 OK
# ...
Transfer-Encoding: chunked

$ curl -s -o /dev/null -D - --http2 https://traefik/b/endpoint
HTTP/2 200
# ...
content-length: 2875

Traefik always returns the same headers for a given endpoint, so we can assume the headers also depend on the address or, more likely, on the payload (a given endpoint always returns the same payload for this service).

First version of B that doesn't work

Performing a git bisect, I discovered the 60-second pause appeared when the proxied service (B) started using ZonedDateTime instead of LocalDateTime in its DTO. The only change is that date-time fields now have an offset in the response body; there is no impact on the headers. Yet the Feign client works fine for LocalDateTime and pauses for ZonedDateTime.
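
For illustration, the change in B's DTO amounts to the following (the class name is hypothetical; the field name mirrors the example in the answer below):

import java.time.ZonedDateTime;

public class SomethingDto {

    // Was: private LocalDateTime validFrom;  // serialized without an offset
    private ZonedDateTime validFrom;           // serialized with an offset, e.g. "+01:00"
}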

Forcing the connection to close

Passing the Connection: close header when calling A makes the pause disappear. The response body is returned immediately.

HttpClient does not use the KeepAliveStream in this case.
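
Concretely, this is the same curl call as in the lost-bytes test above, with the extra request header:

$ curl -i -H "Connection: close" http://localhost:8100/a/proxied-endpoint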

Trying to reproduce with a mock B

I wrote a quick mock of service B. It returns the Content-Type header and the content. What's interesting is that:

  • If mock-B returns the Content-Length header, then A has a 60-second pause.
  • If mock-B does not return the Content-Length header, then A returns immediately.

This is consistent with previous tests indicating that the Content-Length header plays a role, but I am still unable to understand which one, since some of Traefik's responses carry that header and A still returns immediately.

mock-B

const port = 8080;

const http = require('http');
const path = require('path');
const fs = require('fs');

// Count the UTF-8 bytes of a string (multi-byte characters become %XX escapes);
// equivalent to Buffer.byteLength(str, 'utf8')
const countBytesInString = str => encodeURI(str).split(/%..|./).length - 1

const requestListener = (req, res) => {
    console.log(`\n>> Mock call to endpoint ${req.url}`);

    // Serve the matching JSON fixture: endpoints/<url>.json, relative to this script
    fs.readFile(path.join(__dirname, `endpoints${req.url}.json`), 'utf8', (err, data) => {
        if (err) {
            console.error(err);
            res.writeHead(404);
            res.end(); // fail fast instead of leaving the request hanging
            return;
        }

        const contentLength = countBytesInString(data);
        console.log(`Content-Length: ${contentLength}`);

        res.setHeader('Content-Type', 'application/json');
        res.setHeader('content-length', contentLength);
        res.writeHead(200);
        res.end(data);
    })
};

const server = http.createServer(requestListener);
server.listen(port);
console.log(`Mock server listening on port ${port}`);
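
To reproduce: save the script (e.g. as mock-b.js; the name is arbitrary), put a JSON payload in endpoints/<resource>.json next to it, run node mock-b.js, and point A's Feign client at http://localhost:8080/<resource>.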

Comments

wjans: Since you are not encountering this issue when both are running on your local machine, maybe have a closer look at the underlying Feign client being used? Maybe enable some more logging for the actual client (Apache HTTP client, OkHttp, ...)? It may be network-related.

Chop: Do you think it can be network-related, @wjans? That's a natural assumption, but I can't explain how. I tried adding a log between the call to Feign and the return. The log is always written immediately, with all requested data present, and the body is sent one minute after the message was logged. I'd think that eliminates Feign, since it's already done answering, shouldn't it? Right now I'm studying a thread dump taken during the pause. I'll look further into your suggestion if it doesn't yield any results. Thanks!

wjans: Oh, so you mean the call myFeignProxy.getAll(resource) returns immediately? Then I would indeed try to take a thread dump, or maybe debug. I was thinking about the network since that's the only part you eliminate with and without the issue. Or does something get retrieved via the network while serializing the response?

Chop: Yes, myFeignProxy.getAll() returns immediately and all data is there, deserialized (tested through debugging and through logging). I didn't see anything blocking in the thread dump. I went on with the debugging, going through several filters in the chain before abandoning, and the 60-second pause was still added later in the process. I'm trying to remove Spring Security entirely to see if it changes anything.

Arne Burmeister: Completely unsure, but just an idea: what happens when the content length from the header is more than the actual content sent? I would assume Traefik would wait for more data in such a case.

1 Answer


Explaining the causes

We've finally understood the mechanism that leads to the issue.

A -> Traefik -> B

  1. B returns a list of objects with a ZonedDateTime field ("validFrom":"2021-12-24 23:59:57+01:00") and the header Transfer-Encoding: chunked.
  2. Traefik replaces the Transfer-Encoding: chunked header with a Content-Length, computed from the body of the response.
  3. A receives the response, deserializes the objects, then reserializes them in the UTC timezone ("validFrom":"2021-12-24 22:59:57Z"), but it reuses the Content-Length from Traefik without recalculating it.

As a consequence, the body sent by A is shorter than the announced Content-Length (each ZonedDateTime takes five bytes less when A sends it than when Traefik computed the length).

The client, however, was promised a Content-Length and keeps waiting for the missing bytes.
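
A minimal demonstration of the mismatch, using JSON fragments that mirror the example above:

import java.nio.charset.StandardCharsets;

public class LostBytesDemo {

    public static void main(String[] args) {
        // Fragment as B serializes it (and as Traefik measures it):
        String measuredByTraefik = "\"validFrom\":\"2021-12-24 23:59:57+01:00\"";
        // Same instant after A deserializes and reserializes it in UTC:
        String sentByA = "\"validFrom\":\"2021-12-24 22:59:57Z\"";

        int lost = measuredByTraefik.getBytes(StandardCharsets.UTF_8).length
                - sentByA.getBytes(StandardCharsets.UTF_8).length;
        System.out.println(lost + " byte(s) lost per date field"); // prints 5
    }
}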

Possible solution

The solution we have in mind right now is to declare that Feign and its calling controller return a ResponseEntity<String> instead of a ResponseEntity<List<MyObject>>.
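
A sketch of that change, reusing the hypothetical declarations from earlier in this post:

import org.springframework.cloud.openfeign.FeignClient;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;

@FeignClient(name = "service-b", url = "${service-b.url}")
interface MyFeignProxy {

    // The body stays an opaque String: A neither deserializes nor reserializes it,
    // so the Content-Length computed upstream remains correct.
    @GetMapping("/{resource}")
    ResponseEntity<String> getAll(@PathVariable("resource") String resource);
}

// ...and the controller relays it verbatim:
@GetMapping("/{resource}")
public ResponseEntity<String> getAll(@PathVariable String resource) {
    return myFeignProxy.getAll(resource);
}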

Pros:

  • B's response is returned as-is, so there are no more problems due to a varying content length.
  • A does not spend CPU-time deserializing then immediately reserializing the response.

Cons:

  • The OpenAPI doc of A won't show the return type (unless the OpenAPI annotations allow specifying the return model). That's what I'll test later today.