Hoping someone can point me in the right direction. We have inherited a vcl for a shopware site.
The basic setup is:
nginx ssl termination -> varnish -> varnish rr to 4 apache backends
The setup works fine, but when we put it under any form of load (around 250 concurrent), nginx starts reporting that it can't connect to varnish:
2021/07/23 12:08:15 [error] 25348#25348: *7223 upstream timed out (110: Connection timed out) while reading upstream, client: xx, server: _, request: "GET / HTTP/1.1", upstream: "http://xx/", host: "xx"
And varnish reports it has broken pipes:
* << Request >> 5996617
- Begin req 5996616 rxreq
- Timestamp Start: 1627038436.200604 0.000000 0.000000
- Timestamp Req: 1627038436.200604 0.000000 0.000000
- ReqStart xx 42403
- ReqMethod GET
- ReqURL /
- ReqProtocol HTTP/1.1
- ReqHeader X-Real-IP: xx
- ReqHeader X-Forwarded-For: xx
- ReqHeader X-Forwarded-Proto: https
- ReqHeader X-Forwarded-Port: 443
- ReqHeader Host: xx
- ReqHeader Connection: close
- ReqHeader Accept: */*
- ReqHeader Accept-Language: en
- ReqHeader User-Agent: Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0; .NET CLR 1.1.4322; .NET CLR 1.0.3705; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)
- ReqHeader Accept-Encoding: gzip, deflate
- ReqHeader Dotcom-ProxyIDTag: 13
- ReqUnset X-Forwarded-For: xx
- ReqHeader X-Forwarded-For: xx, xx
- VCL_call RECV
- ReqURL /
- ReqURL /
- ReqURL /
- ReqURL /
- ReqURL /
- ReqURL /
- ReqURL /
- ReqURL /
- ReqURL /
- ReqURL /
- ReqURL /
- ReqURL /
- ReqURL /
- ReqHeader Surrogate-Capability: shopware=ESI/1.0
- ReqUnset X-Forwarded-Proto: https
- ReqHeader X-Forwarded-Proto: https
- ReqHeader X-UA-Device: desktop
- ReqHeader X-UA-Vendor: generic
- ReqHeader X-UA-Forced: no
- ReqUnset Accept-Encoding: gzip, deflate
- ReqHeader Accept-Encoding: gzip
- VCL_return hash
- VCL_call HASH
- VCL_return lookup
- Hit 4653975
- VCL_call HIT
- VCL_acl NO_MATCH editors
- VCL_return deliver
- RespProtocol HTTP/1.1
- RespStatus 200
- RespReason OK
- RespHeader Date: Fri, 23 Jul 2021 10:54:07 GMT
- RespHeader Server: Apache
- RespHeader Expires: Thu, 19 Nov 1981 08:52:00 GMT
- RespHeader Pragma: no-cache
- RespHeader Cache-Control: max-age=14400, public, s-maxage=14400
- RespHeader X-Shopware-Allow-Nocache: admin-1, price-1
- RespHeader X-Shopware-Cache-Id: ;c3;
- RespHeader Set-Cookie: session=; path=/
- RespHeader Set-Cookie: lastAffiliate=na; path=/
- RespHeader Server-Id: web3
- RespHeader Content-Type: text/html; charset=UTF-8
- RespHeader X-Content-Length: 17190
- RespHeader Content-Encoding: gzip
- RespHeader Vary: Accept-Encoding
- RespHeader X-Varnish: 5996617 4653975
- RespHeader Age: 788
- RespHeader Via: 1.1 varnish-v4
- VCL_call DELIVER
- VCL_acl NO_MATCH editors
- RespUnset X-Shopware-Allow-Nocache: admin-1, price-1
- RespUnset X-Shopware-Cache-Id: ;c3;
- RespUnset Cache-Control: max-age=14400, public, s-maxage=14400
- RespHeader Cache-Control: max-age=0, private
- RespUnset Set-Cookie: session=; path=/
- RespUnset Set-Cookie: lastAffiliate=na; path=/
- RespHeader X-Cache: HIT
- RespHeader X-Cache-Hits: 3611
- VCL_return deliver
- Timestamp Process: 1627038436.200763 0.000159 0.000159
- RespHeader Accept-Ranges: bytes
- RespHeader Transfer-Encoding: chunked
- Debug "RES_MODE 18"
- RespHeader Connection: close
- Link req 5996618 esi
- Debug "Write error, retval = -1, len = 14, errno = Broken pipe"
- Timestamp Resp: 1627038513.810060 77.609456 77.609297
- ReqAcct 522 0 522 456 5368 5824
- End
Varnish stat shows:
Uptime mgt: 1+04:38:26 Hitrate n: 10 21 21
Uptime child: 1+04:38:26 avg(n): 0.4843 0.4762 0.4762
NAME CURRENT CHANGE AVERAGE AVG_10 AVG_100 AVG_1000
MAIN.uptime 1+04:38:26
MAIN.sess_conn 10107 5.99 . 4.97 4.89 4.89
MAIN.client_req_400 9 0.00 . 0.00 0.00 0.00
MAIN.client_req 10815 5.99 . 4.97 4.89 4.89
MAIN.cache_hit 14952 8.99 . 7.83 7.69 7.69
MAIN.cache_miss 16483 8.99 . 7.83 7.69 7.69
MAIN.backend_conn 890 0.00 . 0.18 0.15 0.15
MAIN.backend_reuse 24251 7.99 . 10.20 10.14 10.14
MAIN.backend_recycle 25012 9.99 . 10.55 10.38 10.38
MAIN.fetch_length 24022 9.99 . 10.55 10.38 10.38
MAIN.fetch_chunked 755 0.00 . 0.00 0.00 0.00
MAIN.fetch_eof 2 0.00 . 0.00 0.00 0.00
MAIN.fetch_none 33 0.00 . 0.00 0.00 0.00
MAIN.fetch_304 319 0.00 . 0.00 0.00 0.00
MAIN.fetch_failed 13 0.00 . 0.00 0.00 0.00
MAIN.pools 2 0.00 . 2.00 2.00 2.00
MAIN.threads 600 0.00 . 600.00 600.00 600.00
MAIN.threads_created 738 0.00 . 0.00 0.00 0.00
MAIN.threads_destroyed 138 0.00 . 0.00 0.00 0.00
MAIN.busy_sleep 35286304 62859.00 342.00 34852.46 32989.42 32989.42
MAIN.busy_wakeup 51 0.00 . 0.00 0.00 0.00
MAIN.sess_queued 138 0.00 . 0.00 0.00 0.00
MAIN.n_object 344 -2.00 . 344.38 344.33 344.33
MAIN.n_objectcore 598 0.00 . 600.20 601.14 601.14
MAIN.n_objecthead 865 0.00 . 865.00 865.00 865.00
vvv MAIN.uptime INFO 1-25/96
Not all requests fail, prob about 70%.
The backends are hardly being hit, and the varnish error happens even when I remove nginx and just have the site on http with varnish accepting the http connections.
Running 4.1.11 on Centos7 (have also tried v6, and on Centos8 and Ubuntu 20 just incase). Open files increased, no resources seem to be getting maxed.
Any pointer appreciated. I feel like the busysleep is where I should be looking, but cant see any reason for it to keep growing.
Thanks
Looks like the hit is generating an esi subrequest:
- Timestamp Start: 1627115035.167388 0.000000 0.000000
- BereqMethod GET
- BereqURL /?action=shopMenu&controller=index&module=widgets&sCategory=3
- BereqProtocol HTTP/1.1
- BereqHeader X-Real-IP:
- BereqHeader X-Forwarded-Port: 443
- BereqHeader Accept: */*
- BereqHeader Accept-Language: en
- BereqHeader User-Agent: Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0; .NET CLR 1.1.4322; .NET CLR 1$
- BereqHeader Dotcom-ProxyIDTag: 5
- BereqHeader Host:
- BereqHeader X-Forwarded-For:
- BereqHeader Surrogate-Capability: shopware=ESI/1.0
- BereqHeader X-Forwarded-Proto: https
- BereqHeader X-UA-Device: desktop
- BereqHeader X-UA-Vendor: generic
- BereqHeader X-UA-Forced: no
- BereqHeader Accept-Encoding: gzip
- BereqHeader X-Varnish: 4489358
- VCL_call BACKEND_FETCH
- VCL_return fetch
- BackendOpen 414 boot.web4 50444
- BackendStart
- Timestamp Bereq: 1627115035.167506 0.000118 0.000118
- Timestamp Beresp: 1627115035.368178 0.200790 0.200672
- BerespProtocol HTTP/1.1
- BerespStatus 200
- BerespReason OK
- BerespHeader Date: Sat, 24 Jul 2021 08:23:55 GMT
The backends are responding quickly to those requests, and I am not seeing any failed requests in Apache. Do I need to look at caching those requests in the vcl?