ArangoDB cluster stops after one node failure

Question

I have an ArangoDB cluster with 3 nodes. The first one has the following service config:

> ExecStart=/usr/bin/arangodb \
        --starter.data-dir=/var/lib/arangodb3/cluster \
        --server.storage-engine=rocksdb \
        --auth.jwt-secret=/etc/arangodb3/arangodb.secret \
        --agents.agency.supervision-grace-period=30 \
        --log.file=true \
        --log.dir=/var/log/arangodb3/cluster \
        --log.verbose
TimeoutStopSec=60

and the two other nodes have:

> ExecStart=/usr/bin/arangodb \
        --starter.data-dir=/var/lib/arangodb3/cluster \
        --server.storage-engine=rocksdb \
        --auth.jwt-secret=/etc/arangodb3/arangodb.secret \
        --agents.agency.supervision-grace-period=30 \
        --starter.join arangodb01.domain.com \
        --log.file=true \
        --log.dir=/var/log/arangodb3/cluster \
        --log.verbose

It works fine until any node stops. After one node stops, no requests are processed. In the output of "[root@arangodb01 ~]# journalctl -u arangodb" I see only:

>We're master, try to remain it component=arangodb\
>Master changed callback from [arangodb01 IP]:57722 component=arangodb\
>Received GET /hello request from [arangodb02 IP]:38436 component=arangodb

Is it possible for the cluster to keep working when only 2 nodes are running?

UPD: I ran into a problem with shard migration. This is what is wrong with my cluster.

I found the reason for my cluster failure. Some shards do not migrate properly to a healthy node with a name like "DBServer0001-02" and stay stuck on a leader with a name like "PRMR-8dd447ee-84ac-4c8f-85b6-2117377b8c7e". If I run a "for p in" query against a "healthy" shard I get a normal answer; if I ask for data from a "bad" shard I get "Query: Query execution aborted." Does anyone know what to do with shards like "s50140107 PRMR-8dd447ee-84ac-4c8f-85b6-2117377b8c7e no followers"? — Дмитрий Горчаков
In my case this is the node that was shut down, but its shards were not moved to any other server: "PRMR-8dd447ee-84ac-4c8f-85b6-2117377b8c7e": { "Timestamp": "2020-08-03T12:58:20Z", "SyncStatus": "SHUTDOWN", "Status": "FAILED", "Host": "", "ShortName": "DBServer0003", "Engine": "", "Version": "", "SyncTime": "2020-08-03T12:57:50Z", "LastAckedTime": "2020-08-03T12:57:50Z", "Endpoint": "", "Role": "DBServer", "CanBeDeleted": false — Дмитрий Горчаков
I also noticed that in a healthy state some collections do not have follower nodes. Is that OK? — Дмитрий Горчаков

Дмитрий Горчаков · Accepted Answer · 2020-08-04T07:25:42

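I finally found the reason for the "bad" shards. For every collection you create in a cluster, check "replicationFactor" in the info section of the collection. By default it is 1, which means each shard exists only on its leader and has no followers, so no other server can take over when that node fails. If you create collections through the API, set the "replicationFactor" option (documented there as "cluster only") to 2 or more.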
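
For illustration, here is a minimal Python sketch of how this could be checked and fixed over the HTTP API. It is not from my actual setup: the coordinator port 8529, the "_system" database and the collection name "mycollection" are assumptions, and the request is only built, not sent. With JWT authentication enabled, as in the configs above, an "Authorization" header would also be needed. The helper names are made up for this example.

```python
import json
import urllib.request


def replication_update_request(base_url, database, collection, replication_factor=2):
    """Build (but do not send) a PUT request that changes replicationFactor.

    Targets the collection properties endpoint on a coordinator.
    """
    url = f"{base_url}/_db/{database}/_api/collection/{collection}/properties"
    body = json.dumps({"replicationFactor": replication_factor}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def shards_without_followers(shards):
    """Return the sorted IDs of shards that have only a leader.

    `shards` maps shard ID -> [leader, follower, ...], the shape of a
    detailed shard listing for a collection.
    """
    return sorted(s for s, servers in shards.items() if len(servers) < 2)


if __name__ == "__main__":
    # Hypothetical host, database and collection names.
    req = replication_update_request(
        "http://arangodb01.domain.com:8529", "_system", "mycollection"
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))

    # Shard layout modelled on the "no followers" case from the question.
    layout = {
        "s50140107": ["PRMR-8dd447ee-84ac-4c8f-85b6-2117377b8c7e"],
        "s50140108": ["DBServer0001", "DBServer0002"],
    }
    print(shards_without_followers(layout))
```

Any shard reported by shards_without_followers is lost to queries as soon as its only server goes down, which matches the "Query execution aborted" behaviour described in the question.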
