
Hoping someone can point me in the right direction. We have inherited a VCL for a Shopware site.

The basic setup is:

nginx SSL termination -> Varnish -> Varnish round-robin to 4 Apache backends

The setup works fine, but when we put it under any form of load (around 250 concurrent connections), nginx starts reporting upstream timeouts against Varnish:

2021/07/23 12:08:15 [error] 25348#25348: *7223 upstream timed out (110: Connection timed out) while reading upstream, client: xx, server: _, request: "GET / HTTP/1.1", upstream: "http://xx/", host: "xx"

And Varnish reports broken pipes:

*   << Request  >> 5996617   
-   Begin          req 5996616 rxreq
-   Timestamp      Start: 1627038436.200604 0.000000 0.000000
-   Timestamp      Req: 1627038436.200604 0.000000 0.000000
-   ReqStart       xx 42403
-   ReqMethod      GET
-   ReqURL         /
-   ReqProtocol    HTTP/1.1
-   ReqHeader      X-Real-IP: xx
-   ReqHeader      X-Forwarded-For: xx
-   ReqHeader      X-Forwarded-Proto: https
-   ReqHeader      X-Forwarded-Port: 443
-   ReqHeader      Host: xx
-   ReqHeader      Connection: close
-   ReqHeader      Accept: */*
-   ReqHeader      Accept-Language: en
-   ReqHeader      User-Agent: Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0; .NET CLR 1.1.4322; .NET CLR 1.0.3705; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)
-   ReqHeader      Accept-Encoding: gzip, deflate
-   ReqHeader      Dotcom-ProxyIDTag: 13
-   ReqUnset       X-Forwarded-For: xx
-   ReqHeader      X-Forwarded-For: xx, xx
-   VCL_call       RECV
-   ReqURL         /
-   ReqURL         /
-   ReqURL         /
-   ReqURL         /
-   ReqURL         /
-   ReqURL         /
-   ReqURL         /
-   ReqURL         /
-   ReqURL         /
-   ReqURL         /
-   ReqURL         /
-   ReqURL         /
-   ReqURL         /
-   ReqHeader      Surrogate-Capability: shopware=ESI/1.0
-   ReqUnset       X-Forwarded-Proto: https
-   ReqHeader      X-Forwarded-Proto: https
-   ReqHeader      X-UA-Device: desktop
-   ReqHeader      X-UA-Vendor: generic
-   ReqHeader      X-UA-Forced: no
-   ReqUnset       Accept-Encoding: gzip, deflate
-   ReqHeader      Accept-Encoding: gzip
-   VCL_return     hash
-   VCL_call       HASH
-   VCL_return     lookup
-   Hit            4653975
-   VCL_call       HIT
-   VCL_acl        NO_MATCH editors
-   VCL_return     deliver
-   RespProtocol   HTTP/1.1
-   RespStatus     200
-   RespReason     OK
-   RespHeader     Date: Fri, 23 Jul 2021 10:54:07 GMT
-   RespHeader     Server: Apache
-   RespHeader     Expires: Thu, 19 Nov 1981 08:52:00 GMT
-   RespHeader     Pragma: no-cache
-   RespHeader     Cache-Control: max-age=14400, public, s-maxage=14400
-   RespHeader     X-Shopware-Allow-Nocache: admin-1, price-1
-   RespHeader     X-Shopware-Cache-Id: ;c3;
-   RespHeader     Set-Cookie: session=; path=/
-   RespHeader     Set-Cookie: lastAffiliate=na; path=/
-   RespHeader     Server-Id: web3
-   RespHeader     Content-Type: text/html; charset=UTF-8
-   RespHeader     X-Content-Length: 17190
-   RespHeader     Content-Encoding: gzip
-   RespHeader     Vary: Accept-Encoding
-   RespHeader     X-Varnish: 5996617 4653975
-   RespHeader     Age: 788
-   RespHeader     Via: 1.1 varnish-v4
-   VCL_call       DELIVER
-   VCL_acl        NO_MATCH editors
-   RespUnset      X-Shopware-Allow-Nocache: admin-1, price-1
-   RespUnset      X-Shopware-Cache-Id: ;c3;
-   RespUnset      Cache-Control: max-age=14400, public, s-maxage=14400
-   RespHeader     Cache-Control: max-age=0, private
-   RespUnset      Set-Cookie: session=; path=/
-   RespUnset      Set-Cookie: lastAffiliate=na; path=/
-   RespHeader     X-Cache: HIT
-   RespHeader     X-Cache-Hits: 3611
-   VCL_return     deliver
-   Timestamp      Process: 1627038436.200763 0.000159 0.000159
-   RespHeader     Accept-Ranges: bytes
-   RespHeader     Transfer-Encoding: chunked
-   Debug          "RES_MODE 18"
-   RespHeader     Connection: close
-   Link           req 5996618 esi
-   Debug          "Write error, retval = -1, len = 14, errno = Broken pipe"
-   Timestamp      Resp: 1627038513.810060 77.609456 77.609297
-   ReqAcct        522 0 522 456 5368 5824
-   End            

varnishstat shows:

Uptime mgt:     1+04:38:26                                                               Hitrate n:       10       21       21
Uptime child:   1+04:38:26                                                                  avg(n):   0.4843   0.4762   0.4762

    NAME                                        CURRENT        CHANGE       AVERAGE        AVG_10       AVG_100      AVG_1000
MAIN.uptime                                  1+04:38:26
MAIN.sess_conn                                    10107          5.99           .            4.97          4.89          4.89
MAIN.client_req_400                                   9          0.00           .            0.00          0.00          0.00
MAIN.client_req                                   10815          5.99           .            4.97          4.89          4.89
MAIN.cache_hit                                    14952          8.99           .            7.83          7.69          7.69
MAIN.cache_miss                                   16483          8.99           .            7.83          7.69          7.69
MAIN.backend_conn                                   890          0.00           .            0.18          0.15          0.15
MAIN.backend_reuse                                24251          7.99           .           10.20         10.14         10.14
MAIN.backend_recycle                              25012          9.99           .           10.55         10.38         10.38
MAIN.fetch_length                                 24022          9.99           .           10.55         10.38         10.38
MAIN.fetch_chunked                                  755          0.00           .            0.00          0.00          0.00
MAIN.fetch_eof                                        2          0.00           .            0.00          0.00          0.00
MAIN.fetch_none                                      33          0.00           .            0.00          0.00          0.00
MAIN.fetch_304                                      319          0.00           .            0.00          0.00          0.00
MAIN.fetch_failed                                    13          0.00           .            0.00          0.00          0.00
MAIN.pools                                            2          0.00           .            2.00          2.00          2.00
MAIN.threads                                        600          0.00           .          600.00        600.00        600.00
MAIN.threads_created                                738          0.00           .            0.00          0.00          0.00
MAIN.threads_destroyed                              138          0.00           .            0.00          0.00          0.00
MAIN.busy_sleep                                35286304      62859.00        342.00      34852.46      32989.42      32989.42
MAIN.busy_wakeup                                     51          0.00           .            0.00          0.00          0.00
MAIN.sess_queued                                    138          0.00           .            0.00          0.00          0.00
MAIN.n_object                                       344         -2.00           .          344.38        344.33        344.33
MAIN.n_objectcore                                   598          0.00           .          600.20        601.14        601.14
MAIN.n_objecthead                                   865          0.00           .          865.00        865.00        865.00

Not all requests fail; probably about 70% do.

The backends are hardly being hit, and the Varnish error happens even when I remove nginx and serve the site over plain HTTP, with Varnish accepting the connections directly.

Running Varnish 4.1.11 on CentOS 7 (have also tried v6, and CentOS 8 and Ubuntu 20, just in case). Open file limits have been increased, and no resources appear to be getting maxed out.

Any pointers appreciated. I feel like busy_sleep is where I should be looking, but I can't see any reason for it to keep growing.

Thanks


Looks like the hit is generating an ESI subrequest:

-   Timestamp      Start: 1627115035.167388 0.000000 0.000000
-   BereqMethod    GET
-   BereqURL       /?action=shopMenu&controller=index&module=widgets&sCategory=3
-   BereqProtocol  HTTP/1.1
-   BereqHeader    X-Real-IP: 
-   BereqHeader    X-Forwarded-Port: 443
-   BereqHeader    Accept: */*
-   BereqHeader    Accept-Language: en
-   BereqHeader    User-Agent: Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0; .NET CLR 1.1.4322; .NET CLR 1$
-   BereqHeader    Dotcom-ProxyIDTag: 5
-   BereqHeader    Host: 
-   BereqHeader    X-Forwarded-For: 
-   BereqHeader    Surrogate-Capability: shopware=ESI/1.0
-   BereqHeader    X-Forwarded-Proto: https
-   BereqHeader    X-UA-Device: desktop
-   BereqHeader    X-UA-Vendor: generic
-   BereqHeader    X-UA-Forced: no
-   BereqHeader    Accept-Encoding: gzip
-   BereqHeader    X-Varnish: 4489358
-   VCL_call       BACKEND_FETCH
-   VCL_return     fetch
-   BackendOpen    414 boot.web4  50444
-   BackendStart   
-   Timestamp      Bereq: 1627115035.167506 0.000118 0.000118
-   Timestamp      Beresp: 1627115035.368178 0.200790 0.200672
-   BerespProtocol HTTP/1.1
-   BerespStatus   200
-   BerespReason   OK
-   BerespHeader   Date: Sat, 24 Jul 2021 08:23:55 GMT

The backends are responding quickly to those requests, and I am not seeing any failed requests in Apache. Do I need to look at caching those requests in the VCL?
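Something like this, perhaps? (Just guessing from the log output — forcing a TTL on the shopMenu widget in vcl_backend_response so the ESI subrequest can be served from cache; the URL pattern is taken from the BereqURL above.)

```vcl
sub vcl_backend_response {
    # Guess: force-cache the shopMenu ESI widget for 5 minutes,
    # since the logs show the backend answers it with no-cache headers.
    if (bereq.url ~ "action=shopMenu") {
        unset beresp.http.Set-Cookie;
        set beresp.ttl = 300s;
        return (deliver);
    }
}
```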


1 Answer


What is interesting about these logs is that we're dealing with a cache hit, which should mean the timeout is not caused by a slow backend.

However, the following log lines could indicate that an ESI subrequest is responsible for the timeout:

-   Link           req 5996618 esi
-   Debug          "Write error, retval = -1, len = 14, errno = Broken pipe"
-   Timestamp      Resp: 1627038513.810060 77.609456 77.609297

ESI subrequest

The transaction ID of this ESI subrequest is 5996618, and the last column of the Timestamp log line shows that it took Varnish 77 seconds to generate the response, whereas the previous Timestamp entry clocked in at only 159 microseconds, as illustrated below:

-   Timestamp      Process: 1627038436.200763 0.000159 0.000159

It's a fair assumption that the ESI placeholder represents stateful data that has to come from the backend, which can take some time, especially under heavy load.

I'm also assuming that the broken pipe is not caused by Varnish itself, but by an Nginx timeout. This would make sense because Nginx doesn't wait around forever: its default proxy_read_timeout of 60 seconds is well below the 77 seconds this response took.

Next steps

The next step would be to examine the full log, including the log items for the ESI call.

Depending on the purpose of the ESI call, we can have a look at how we can alleviate some stress from the origin and get the response times back to normal.

If this ESI subrequest endpoint is hit too much, it would make sense to synthesize the response and perform some of the decision-making in Varnish.
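If the widget's markup is simple enough, a synthetic response is a cheap way to do that. A sketch, assuming the hypothetical shopMenu endpoint from your logs (the status code 700 is just an internal marker):

```vcl
sub vcl_recv {
    # Hypothetical: answer the menu ESI endpoint directly from Varnish
    # instead of fetching it from the origin.
    if (req.url ~ "action=shopMenu") {
        return (synth(700, "OK"));
    }
}

sub vcl_synth {
    if (resp.status == 700) {
        set resp.status = 200;
        set resp.http.Content-Type = "text/html; charset=utf-8";
        # Static markup served without touching the backend.
        synthetic({"<nav><!-- static menu markup --></nav>"});
        return (deliver);
    }
}
```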

Another approach is to rely on faster data access in your application for this endpoint, for example by storing projected data in Redis.

If you need more help, please describe the ESI situation in greater detail, add complete logs and maybe also add the full VCL file for reference.

What I'm seeing in https://pastebin.com/nMwd72B4 are ESI subrequests that respond very quickly. Some of those ESI calls are cached, other aren't.

Hit-For-Miss

But I am seeing a big problem in your VCL configuration:

set beresp.ttl = 0s;
set beresp.uncacheable = true;
return (deliver);

Uncacheable data should never end up on the waiting list, because Varnish's built-in request coalescing feature can never satisfy these requests with a single backend fetch. So they end up being processed serially, which is very slow, especially under heavy load.

Normally, uncacheable responses are added to the Hit-For-Miss cache, where the decision not to cache is cached. Sounds ironic, but it does serve a purpose.

Objects on the Hit-For-Miss list will automatically bypass the waiting list. This is done for responses that have set beresp.uncacheable = true.

The problem in your case is that the TTL is set to zero, which is a big no-no: the Hit-For-Miss object expires immediately, so the next request for the same object still ends up on the waiting list.

The built-in VCL behavior sets the TTL of Hit-For-Miss objects to 2 minutes as seen on https://github.com/varnishcache/varnish-cache/blob/6.0/bin/varnishd/builtin.vcl#L164.

This is what you would need:

set beresp.ttl = 120s;
set beresp.uncacheable = true;
return (deliver);

I'm not saying this will solve all your problems, but it might contribute to better performance, because serialization will no longer be in effect.
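For reference, that snippet lives inside vcl_backend_response; a sketch of where it sits (the surrounding condition in your VCL will differ):

```vcl
sub vcl_backend_response {
    # Whatever condition currently marks the response as uncacheable
    # stays the same; only the TTL changes from 0s to 120s.
    if (beresp.http.Set-Cookie) {
        set beresp.ttl = 120s;   # keeps the Hit-For-Miss marker around
        set beresp.uncacheable = true;
        return (deliver);
    }
}
```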

I'm also seeing quite a few cache hits in the logs that have Set-Cookie headers on them.

Varnish's built-in VCL behavior doesn't cache objects that contain Set-Cookie headers as seen in https://github.com/varnishcache/varnish-cache/blob/6.0/bin/varnishd/builtin.vcl#L158, because a Set-Cookie header implies a level of personalization that doesn't apply to everyone receiving the response.

If this is done on purpose, please ignore this remark. If not, these are some of your options:

  • Strip off the Set-Cookie header where possible
  • Bypass the cache for certain responses that contain a Set-Cookie header
  • Override the value of the Set-Cookie header in your VCL code
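A sketch of the first option, assuming the empty session and default lastAffiliate cookies from your logs carry no personalization and are safe to drop:

```vcl
sub vcl_backend_response {
    # Assumption: an empty session cookie and the default affiliate
    # cookie are generic, so stripping them makes the response cacheable.
    # Note: unset removes ALL Set-Cookie headers on the response,
    # even if only the first one matches the pattern.
    if (beresp.http.Set-Cookie ~ "^(session=;|lastAffiliate=na)") {
        unset beresp.http.Set-Cookie;
    }
}
```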

I'm not seeing any slow responses

The Pastebin link you provided doesn't include any Timestamp tags that are particularly slow. Maybe these log items were recorded when there wasn't a lot of load on the system.

I am curious to see a full log transaction, including the ESI subrequests, that matches the slow - Timestamp Resp: 1627038513.810060 77.609456 77.609297 entry I saw in your original question.

However, do focus on the part about the waiting list; maybe that'll solve your problem.