
I have ArangoDB 3.1.16 installed on an AWS C4 instance, running a Foxx service in production. The service receives an average of 10 packets of 200 bytes per second and returns 20 packets of 200 bytes per second.

Each time I start my process, the Foxx service runs with consistent performance for about an hour and then suddenly stops. I can no longer reach my Foxx API: all requests fail with connection timeout errors and nothing appears in the Foxx logs. I can no longer reach the web interface either: the page just doesn't load.

After a minute or so, the Foxx logs show an error message: 'ArangoError 18: lock timeout'

After another minute, the logs show that queries which are usually fast took a very long time (WARNING {queries} slow query: took: 1770.862498)

Using "journalctl -xe", I found that shortly after a foreign IP tried to connect, the following was logged: "Job dev-xvdb.device/start timed out"

I managed to restart ArangoDB using:

ps -eaf |grep arangod
sudo kill #
sudo apt-get --reinstall install arangodb3=3.1.16

How can I solve this recurring issue?

"journalctl -xe" gives me:

Apr 04 15:03:10 my-ip systemd[1]: arangodb3.service: Failed with result 'exit-code'.
-- Subject: Unit arangodb3.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit arangodb3.service has begun starting up.
Apr 04 15:03:10 my-ip arangodb3[11481]:  * Starting arango database server arangod
Apr 04 15:03:10 my-ip arangodb3[11481]:  * database version check failed, maybe you need to run 'upgrade'?
Apr 04 15:03:10 my-ip systemd[1]: arangodb3.service: Control process exited, code=exited status=1
Apr 04 15:03:10 my-ip systemd[1]: Failed to start LSB: arangodb.
-- Subject: Unit arangodb3.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit arangodb3.service has failed.
-- 
-- The result is failed.
Apr 04 15:03:10 my-ip systemd[1]: arangodb3.service: Unit entered failed state.
Apr 04 15:03:10 my-ip systemd[1]: arangodb3.service: Failed with result 'exit-code'.
Apr 04 15:03:10 my-ip sudo[11346]: pam_unix(sudo:session): session closed for user root
Apr 04 15:03:17 my-ip sshd[11502]: Did not receive identification string from UNKNOWN IP 1
Apr 04 15:03:21 my-ip sshd[11503]: Connection closed by UNKNOWN IP 2 port 54736 [preauth]
Apr 04 15:03:21 my-ip sshd[11507]: Did not receive identification string from UNKNOWN IP 2
Apr 04 15:03:21 my-ip sshd[11506]: fatal: Unable to negotiate with UNKNOWN IP 2 port 54730: no matching host key type found. Their offer: ssh-dss [preauth]
Apr 04 15:03:21 my-ip sshd[11504]: Connection closed by UNKNOWN IP 2 port 54732 [preauth]
Apr 04 15:03:22 my-ip sshd[11505]: Connection closed by UNKNOWN IP 2 port 54734 [preauth]
Apr 04 15:03:40 my-ip systemd[1]: dev-xvdb.device: Job dev-xvdb.device/start timed out.
Apr 04 15:03:40 my-ip systemd[1]: Timed out waiting for device dev-xvdb.device.
-- Subject: Unit dev-xvdb.device has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit dev-xvdb.device has failed.
-- 
-- The result is timeout.
Apr 04 15:03:40 my-ip systemd[1]: Dependency failed for File System Check on /dev/xvdb.
-- Subject: Unit [email protected] has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit [email protected] has failed.
-- 
-- The result is dependency.
Apr 04 15:03:40 my-ip systemd[1]: Dependency failed for /mnt.
-- Subject: Unit mnt.mount has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit mnt.mount has failed.
-- 
-- The result is dependency.
Apr 04 15:03:40 my-ip systemd[1]: mnt.mount: Job mnt.mount/start failed with result 'dependency'.
Apr 04 15:03:40 my-ip systemd[1]: [email protected]: Job [email protected]/start failed with result 'dependency'.
Apr 04 15:03:40 my-ip systemd[1]: dev-xvdb.device: Job dev-xvdb.device/start failed with result 'timeout'.

I tried:

sudo curl --dump - -X GET http://127.0.0.1:8529/_api/version && echo

It gives me:

HTTP/1.1 401 Unauthorized
Www-Authenticate: Bearer token_type="JWT", realm="ArangoDB"
Server: ArangoDB
Connection: Keep-Alive
Content-Type: text/plain; charset=utf-8
Content-Length: 0

I tried:

ps auxw | fgrep arangod

It gives me:

root     10439  0.0  0.1  82772  8664 ?        Ss   10:09   0:00 /usr/sbin/arangod --uid arangodb --gid arangodb --pid-file /var/run/arangodb/arangod.pid --temp.path /var/tmp/arangod --log.foreground-tty false --supervisor
arangodb 10440  5.7 94.5 12901776 7242340 ?    Sl   10:09  16:36 /usr/sbin/arangod --uid arangodb --gid arangodb --pid-file /var/run/arangodb/arangod.pid --temp.path /var/tmp/arangod --log.foreground-tty false --supervisor
ubuntu   11339  0.0  0.0  12916  1000 pts/0    R+   14:59   0:00 grep -F --color=auto arangod

'arangod restart' gives me:

2017-04-04T15:01:16Z [11344] INFO ArangoDB 3.1.16 [linux] 64bit, using VPack 0.1.30, ICU 54.1, V8 5.0.71.39, OpenSSL 1.0.2g  1 Mar 2016
2017-04-04T15:01:16Z [11344] INFO using SSL options: SSL_OP_CIPHER_SERVER_PREFERENCE, SSL_OP_TLS_ROLLBACK_BUG
2017-04-04T15:01:16Z [11344] FATAL could not open shutdown file '/var/log/arangodb3/restart/SHUTDOWN': internal error

'service arangodb3 restart' gives me (after a short wait):

Job for arangodb3.service failed because the control process exited with error code. See "systemctl status arangodb3.service" and "journalctl -xe" for details.

'systemctl status arangodb3.service' gives me:

 arangodb3.service - LSB: arangodb
Loaded: loaded (/etc/init.d/arangodb3; bad; vendor preset: enabled)
Active: failed (Result: exit-code) since Tue 2017-04-04 15:03:10 UTC; 34s ago
Docs: man:systemd-sysv-generator(8)
Process: 11352 ExecStop=/etc/init.d/arangodb3 stop (code=exited, status=0/SUCCESS)
Process: 11481 ExecStart=/etc/init.d/arangodb3 start (code=exited, status=1/FAILURE)
Tasks: 83
Memory: 6.5G
CPU: 73ms
CGroup: /system.slice/arangodb3.service
├─10439 /usr/sbin/arangod --uid arangodb --gid arangodb --pid-file /var/run/arangodb/arangod.pid --temp.path /var/tmp/arangod --log.foreground-tty false --supervisor
└─10440 /usr/sbin/arangod --uid arangodb --gid arangodb --pid-file /var/run/arangodb/arangod.pid --temp.path /var/tmp/arangod --log.foreground-tty false --supervisor
Apr 04 15:03:10 my-ip systemd[1]: Starting LSB: arangodb...
Apr 04 15:03:10 my-ip arangodb3[11481]:  * Starting arango database server arangod
Apr 04 15:03:10 my-ip arangodb3[11481]:  * database version check failed, maybe you need to run 'upgrade'?
Apr 04 15:03:10 my-ip systemd[1]: arangodb3.service: Control process exited, code=exited status=1
Apr 04 15:03:10 my-ip systemd[1]: Failed to start LSB: arangodb.
Apr 04 15:03:10 my-ip systemd[1]: arangodb3.service: Unit entered failed state.

1 Answer


From your log output, it looks like the mounted disk volume goes away.

If the storage disappears from underneath any kind of database, there is no reasonable way for it to continue working.

So what you are seeing is that ArangoDB can no longer work with its data; from its perspective, the data is simply not there anymore.
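To confirm whether the volume really has disappeared at the moment the service hangs, a quick check along these lines may help. This is only a sketch: the helper names `device_status` and `mount_status` are made up for illustration, the device `/dev/xvdb` and mount point `/mnt` are taken from the journalctl output in the question, and the mount check assumes a Linux system with `/proc/mounts`.

```sh
#!/bin/sh
# Hypothetical helper: report whether a block device node exists.
device_status() {
    dev="$1"
    if [ -b "$dev" ]; then
        echo "present: $dev"
    else
        echo "missing: $dev"
    fi
}

# Hypothetical helper: report whether a path is currently a mount point,
# by looking for it in the second field of /proc/mounts (Linux only).
mount_status() {
    mnt="$1"
    if awk -v m="$mnt" '$2 == m { found = 1 } END { exit !found }' /proc/mounts 2>/dev/null; then
        echo "mounted: $mnt"
    else
        echo "not mounted: $mnt"
    fi
}

# Device and mount point as they appear in the journalctl output above.
device_status /dev/xvdb
mount_status /mnt
```

If the device shows as missing or `/mnt` is not mounted while ArangoDB is unresponsive, that would support the theory that the storage is the problem rather than ArangoDB itself.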

Others have observed that I/O credits on AWS can run out, which could also explain what you see above.

https://aws.amazon.com/blogs/aws/new-burst-balance-metric-for-ec2s-general-purpose-ssd-gp2-volumes/

If I understand it correctly, you can get more credits by choosing a bigger volume size. If that doesn't help, you either need to reduce the load in your test scenario or choose a different hosting approach that doesn't limit I/O operations.
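To get a feel for how volume size relates to burst credits, here is a rough back-of-the-envelope sketch. It assumes the volume is a gp2 volume (the question does not say) and uses the gp2 figures as I understand them from the AWS documentation: a baseline of 3 IOPS per GiB with a minimum of 100, bursting up to 3000 IOPS, and a bucket of 5,400,000 I/O credits. The function names are made up for illustration, and the model ignores credits refilled beyond the baseline rate and the upper baseline limit for very large volumes.

```sh
#!/bin/sh
# Assumed gp2 constants (see the lead-in above).
GP2_BUCKET_CREDITS=5400000
GP2_BURST_IOPS=3000

# Baseline IOPS for a gp2 volume of the given size in GiB.
gp2_baseline_iops() {
    size_gib="$1"
    iops=$((size_gib * 3))
    if [ "$iops" -lt 100 ]; then
        iops=100
    fi
    echo "$iops"
}

# Seconds until a full credit bucket is drained at a steady demand (IOPS).
# Prints "sustainable" if the demand does not exceed the baseline.
gp2_burst_seconds() {
    size_gib="$1"
    demand="$2"
    base=$(gp2_baseline_iops "$size_gib")
    # A volume cannot burst above the burst ceiling.
    if [ "$demand" -gt "$GP2_BURST_IOPS" ]; then
        demand=$GP2_BURST_IOPS
    fi
    if [ "$demand" -le "$base" ]; then
        echo "sustainable"
        return
    fi
    echo $((GP2_BUCKET_CREDITS / (demand - base)))
}

gp2_burst_seconds 100 3000   # prints 2000
```

Under these assumptions, a larger volume raises the baseline, so the same workload drains the bucket more slowly or not at all. The BurstBalance metric described in the linked blog post shows the actual remaining balance for a volume.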