TL;DR
My Spring Boot REST endpoint returns the HTTP status immediately but sometimes waits 60 seconds before returning the result.
This is caused by the headers of the Feign-proxied response being altered (Transfer-Encoding being replaced with Content-Length).
The context and results of the investigation are detailed below.
Facts
- A (Spring Boot + Feign + Tomcat) -> Traefik -> B (Rest resource, Spring Boot + Tomcat)
- Traefik and B always respond immediately, and A always returns the 200 status code immediately.
- Based on unknown criteria, A uses a KeepAlive stream and returns only after precisely 60 seconds (max idle thread?).
- B uses the Transfer-Encoding: chunked header, but Traefik replaces it with Content-Length depending on unknown criteria.
- The pause comes from a KeepAliveStream not being closed. I found several workarounds/solutions, but I'd also love to have an explanation.
- Removing the Content-Length header solves the issue. Debugging sun.net.www.http.HttpClient confirms that having a Content-Length header triggers the use of a KeepAlive stream.
- Calling A with the Connection: close header seems to solve the issue (same reason: it prevents the use of the KeepAliveStream).
- Replacing Tomcat with Jetty in A seems to solve the issue, as it seems to rely on other HTTP client libraries.
- Replacing A's Feign Apache HttpClient with OkHttp solves the problem (see the sketch after this list).
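For illustration, a minimal sketch of that OkHttp swap, assuming the feign-okhttp module is on the classpath (the interface name and target URL are illustrative, and our real setup goes through Spring Cloud rather than the raw builder):

import feign.Feign;
import feign.okhttp.OkHttpClient;

public class FeignOkHttpWiring {
    // OkHttp manages its own connection pool and never goes through
    // sun.net.www's KeepAliveStream.
    static MyFeignProxy buildProxy() {
        return Feign.builder()
                .client(new OkHttpClient())
                .target(MyFeignProxy.class, "http://traefik/b");
    }
}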
Remaining questions
- Why doesn't Feign/Tomcat/HttpClient close the stream once the whole body is available (which happens immediately)?
- Bonus question: Why/When/Based on what does Traefik alter the headers, and what are the rules?
The mystery of the lost bytes
One of our latest tests was to use -v with curl, and we saw this while A was pausing:
$ curl -i -v http://localhost:8100/a/proxied-endpoint
#...
< Content-Length: 1843
<
{ [1793 bytes data]
So, the service hangs, waiting for 50 missing bytes. When interrupted, though, it returns the whole response. I'm thinking of an encoding issue there, but I don't understand where it could happen.
Replacing the content length with 1793 (or lower) makes the endpoint return immediately.
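This behavior is consistent with how a client consumes a fixed-length body on a keep-alive connection; below is a simplified sketch of that mechanism as I understand it (illustrative code, not the actual JDK source):

import java.io.IOException;
import java.io.InputStream;

public class FixedLengthBodyReader {
    // The client trusts Content-Length and keeps reading until that many
    // bytes have arrived. If only 1793 of the advertised 1843 bytes are
    // ever sent, the last read() blocks until something closes the
    // connection (here, apparently, a 60-second idle timeout).
    static void drainBody(InputStream in, long contentLength) throws IOException {
        byte[] buf = new byte[8192];
        long remaining = contentLength;
        while (remaining > 0) {
            int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
            if (n == -1) {
                break; // server closed the connection early
            }
            remaining -= n; // with 50 bytes missing, this never reaches 0
        }
    }
}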
There is a discrepancy between the way the Content-Length header is computed and the way our client calculates it upon receiving the body.
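One classic source of such a discrepancy is counting characters instead of UTF-8 bytes. A quick way to compare the two in Java, with a deliberately non-ASCII illustrative payload:

import java.nio.charset.StandardCharsets;

public class LengthCheck {
    public static void main(String[] args) {
        String body = "{\"name\":\"café\"}"; // 'é' is 2 bytes in UTF-8
        System.out.println(body.length()); // 15 characters
        System.out.println(body.getBytes(StandardCharsets.UTF_8).length); // 16 bytes
    }
}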
Detailed context
The situation
I'm having an issue with a Spring Boot REST controller (A) that acts as a proxy to another service (B). The endpoint basically looks like this:
@GetMapping("/{resource}")
public ResponseEntity<List<Something>> getAll(@PathVariable String resource) {
    return myFeignProxy.getAll(resource);
}
There's a Traefik reverse proxy between A and B. In summary: A -> Traefik -> B.
In all cases, the Feign proxy answers in less than 100 ms and the endpoint returns the HTTP status (200) immediately. However, in most cases, the body is not returned immediately: A's Spring Boot application waits for precisely 60 seconds (this is really not random).
Whether the body will be returned immediately or after 60 s seems to depend upon resource: some resources are always available immediately, while others always have the wait. Once again, this does not seem random.
[Edit]: Investigation has shown that, in the cases where A pauses, Traefik replaced B's original Transfer-Encoding header with Content-Length. Based on this header, sun.net.www.http.HttpClient decides to use a KeepAliveStream. The problem is that this stream then doesn't close.
Versions
Spring Boot: 2.2.6
Tomcat: 9.0.33
Feign: (determined by Spring Cloud 2.1.2)
Traefik: 2.2.11
What it's not
It is not an issue with the proxied service (B) being slow. In all cases, myFeignProxy responds in a few ms and the endpoint returns the 200 HTTP status immediately.
I've tried changing Feign client timeouts, without any impact.
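For completeness, this is the kind of tweak that had no effect (plain Feign API, illustrative values):

import feign.Request;

public class FeignTimeouts {
    // Connect and read timeouts in milliseconds (values illustrative);
    // passed to the Feign builder via .options(...). Neither value
    // changed the 60-second pause.
    static final Request.Options OPTIONS = new Request.Options(2_000, 30_000);
}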
I also see no correlation between the pause, the size of the body, and the response time of the Feign proxy.
Resource | Size (KB) | Feign time (ms) | 60 s pause
---|---|---|---
1 | 1.87 | 34 | yes
2 | 3.29 | 35 | no
3 | 1.55 | 34 | yes
4 | 10.05 | 81 | yes
The problem is not related to Spring Security either, as entirely removing it (configuration and dependencies) does not change the symptoms.
Updated investigations
Technical layer causing the pause
The pause seems to come from Tomcat. Replacing the Tomcat starter with the Jetty starter (in A) eliminates the issue (all requests answer immediately). That being said, it doesn't explain the problem.
Trace log analysis
It appears that, for an endpoint where the pause occurs, there is one additional line in the logs during the calls. See below for examples. The parameters of the HttpURLConnection also seem to be different, though I do not understand why.
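To surface these JDK-internal logs yourself, one option is to raise the level of the JDK's own logger, as in this sketch assuming plain java.util.logging (in our case they showed up through Spring Boot's logging configuration):

import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class JdkHttpLogging {
    // Logger name used by the JDK's internal HTTP client. Keep a strong
    // reference so the configuration is not garbage-collected.
    private static final Logger JDK_HTTP =
            Logger.getLogger("sun.net.www.protocol.http.HttpURLConnection");

    static void enable() {
        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.ALL);
        JDK_HTTP.setLevel(Level.ALL);
        JDK_HTTP.addHandler(handler);
    }
}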
Case without pause
DEBUG [nio-8100-exec-9] s.n.www.protocol.http.HttpURLConnection : sun.net.www.MessageHeader@784b4a945 pairs: {GET /xxx HTTP/1.1: null}{Accept: application/json}{User-Agent: Java/11.0.7}{Host: xxx}{Connection: keep-alive}
DEBUG [nio-8100-exec-9] s.n.www.protocol.http.HttpURLConnection : sun.net.www.MessageHeader@2a3818a612 pairs: {null: HTTP/1.1 200 OK}{Cache-Control: no-cache, no-store, max-age=0, must-revalidate}{Content-Type: application/json}{Date: Tue, 20 Apr 2021 07:47:47 GMT}{Expires: 0}{Pragma: no-cache}{Strict-Transport-Security: max-age=31536000 ; includeSubDomains}{Vary: accept-encoding}{X-Content-Type-Options: nosniff}{X-Frame-Options: DENY}{X-Xss-Protection: 1; mode=block}{Transfer-Encoding: chunked}
Case with pause
DEBUG [nio-8100-exec-6] s.n.www.protocol.http.HttpURLConnection : sun.net.www.MessageHeader@7bff99e75 pairs: {GET /xxx HTTP/1.1: null}{Accept: application/json}{User-Agent: Java/11.0.7}{Host: xxx}{Connection: keep-alive}
TRACE [nio-8100-exec-6] s.n.www.protocol.http.HttpURLConnection : KeepAlive stream used: https://xxx/xxx
DEBUG [nio-8100-exec-6] s.n.www.protocol.http.HttpURLConnection : sun.net.www.MessageHeader@5aed6c9312 pairs: {null: HTTP/1.1 200 OK}{Cache-Control: no-cache, no-store, max-age=0, must-revalidate}{Content-Type: application/json}{Date: Tue, 20 Apr 2021 07:57:42 GMT}{Expires: 0}{Pragma: no-cache}{Strict-Transport-Security: max-age=31536000 ; includeSubDomains}{Vary: accept-encoding}{X-Content-Type-Options: nosniff}{X-Frame-Options: DENY}{X-Xss-Protection: 1; mode=block}{Content-Length: 803}
When finally responding after the pause (not present when responding immediately)
DEBUG [nio-8100-exec-7] o.apache.tomcat.util.threads.LimitLatch : Counting down[http-nio-8100-exec-7] latch=1
DEBUG [nio-8100-exec-7] org.apache.tomcat.util.net.NioEndpoint : Calling [org.apache.tomcat.util.net.NioEndpoint@63668501].closeSocket([org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper@cfdc708:org.apache.tomcat.util.net.NioChannel@6e7c15b6:java.nio.channels.SocketChannel[connected local=/0:0:0:0:0:0:0:1:8100 remote=/0:0:0:0:0:0:0:1:52501]])
The additional log ("KeepAlive stream used") comes from sun.net.www.http.HttpClient. As far as I understand, the decision to use this KeepAliveStream depends on the proxied response's headers.
Headers analysis and Traefik meddling
Traefik changes the headers between A and B.
B always returns its response with Transfer-Encoding: chunked. Traefik sometimes replaces it with Content-Length and the correct size for the payload.
Nothing is configured in our Traefik instance concerning headers.
The rules used to decide between Transfer-Encoding and Content-Length seem hard to grasp:
- It seems to depend on the endpoint being called or its payload.
- It also seems to depend on something from the caller, as I don't always get the same header depending on whether I'm calling from A or from curl.
This explains why the problem is not reproducible when both applications are on the local machine, since there is no Traefik between them.
About the rules that Traefik applies, it appears the HTTP version plays a role.
$ curl -s -o /dev/null -D - --http1.1 https://traefik/b/endpoint
HTTP/1.1 200 OK
# ...
Transfer-Encoding: chunked
$ curl -s -o /dev/null -D - --http2 https://traefik/b/endpoint
HTTP/2 200
# ...
content-length: 2875
Traefik always returns the same headers for a given endpoint, so we can assume that the headers also depend on the address or, more likely, on the payload (a given endpoint always returns the same payload for this service). Note that HTTP/2 does not use Transfer-Encoding: chunked at all (it has its own framing), which would at least explain the difference between the two curl calls above.
First version of B that doesn't work
Performing a git bisect, I discovered that the 60-second pause appeared when the proxied service (B) started using ZonedDateTime instead of LocalDateTime in its DTOs. The only change is that date-time fields now have an offset in the response body; there is no impact on the headers. Yet the Feign client works fine for LocalDateTimes and pauses for ZonedDateTimes (illustrated below).
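A hypothetical reconstruction of that change, showing how the rendered JSON grows (the DTO and field names are illustrative):

import java.time.ZonedDateTime;

public class SomethingDto {
    // With typical Jackson java.time settings, a LocalDateTime renders as
    // "2021-04-20T07:47:47" while a ZonedDateTime carries an offset, e.g.
    // "2021-04-20T07:47:47+02:00" (the exact format depends on the
    // configuration), so every date-time field adds a few bytes.
    private ZonedDateTime createdAt; // previously LocalDateTime

    public ZonedDateTime getCreatedAt() {
        return createdAt;
    }

    public void setCreatedAt(ZonedDateTime createdAt) {
        this.createdAt = createdAt;
    }
}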
Forcing the connection to close
Passing the Connection: close header makes the pause disappear in A. The response body is returned immediately, and HttpClient does not use the KeepAliveStream in this case.
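One way to send this header on every Feign call, sketched here as a Spring Cloud OpenFeign interceptor bean (the class and bean names are illustrative):

import feign.RequestInterceptor;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class FeignConnectionCloseConfig {

    // Adds Connection: close to every outgoing Feign request, preventing
    // sun.net.www's HttpClient from using a KeepAliveStream.
    @Bean
    public RequestInterceptor connectionCloseInterceptor() {
        return template -> template.header("Connection", "close");
    }
}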
Trying to reproduce with a mock B
I wrote a quick mock service B. It returns the Content-Type header and the content.
What's interesting is that:
- If mock-B returns the Content-Length header, then A has a 60-second pause.
- If mock-B does not return the Content-Length header, then A returns immediately.
This is consistent with previous tests indicating that the Content-Length header plays a role, but I am still unable to understand exactly which one, since the header is also present in some responses coming through Traefik for which A still returns immediately.
mock-B
const port = 8080;
const http = require('http');
const path = require('path');
const fs = require('fs');

// Counts UTF-8 bytes rather than characters: encodeURI escapes every
// non-ASCII byte as %XX, so each byte matches either one %XX escape or
// one plain character.
const countBytesInString = str => encodeURI(str).split(/%..|./).length - 1;

const requestListener = (req, res) => {
    console.log(`\n>> Mock call to endpoint ${req.url}`);
    // Serve the canned JSON payload matching the requested endpoint.
    fs.readFile(path.join(__dirname, `endpoints${req.url}.json`), 'utf8', (err, data) => {
        if (err) {
            console.error(err);
            return;
        }
        const contentLength = countBytesInString(data);
        console.log(`Content-Length: ${contentLength}`);
        res.setHeader('Content-Type', 'application/json');
        // Removing this header makes A return immediately.
        res.setHeader('Content-Length', contentLength);
        res.writeHead(200);
        res.end(data);
    });
};

const server = http.createServer(requestListener);
server.listen(port);
console.log(`Mock server listening on port ${port}`);
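To reproduce both cases, start the mock (for example with node mock-b.js; the file name is illustrative), point A's Feign client at http://localhost:8080, and toggle the Content-Length line above.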
Comments
- Which feignClient is being used? Maybe enable some more logging for the actual client? (Apache HTTP client, OkHttp, ...). May be network related... – wjans
- myFeignProxy.getAll(resource) returns immediately? Then I would indeed try to take a thread dump, or debugging maybe? I was thinking about network since that's the only part you eliminate with and without the issue. Or does something get retrieved via the network while serializing the response? – wjans
- myFeignProxy.getAll() returns immediately and all data is there, deserialized (tested through debugging and through logging). I didn't see anything blocking in the thread dump. I went on with the debugging, going through several filters in the chain before abandoning it, and the 60-second pause was still added later in the process. I'm trying to remove Spring Security entirely to see if it changes anything. – Chop