0
votes

I'm currently experimenting with Consul. For that I have a Vagrant test setup with 4 VMs:

VM 1 is the Consul server; VMs 2 through 4 are nodes.

Each node is running a Consul agent, Registrator and some microservices (all with Docker).

After starting the cluster, all services and nodes are marked as "passing" in Consul.

So far so good.

Now when I shut down one of the nodes, Consul marks the "Serf Health Status" check as failed, but the HTTP health check is still marked as "passing" even though the whole VM is shut down.

According to the Consul documentation, the health check timeout should be 10 seconds, so I expected the health checks to be marked as failed 10 seconds after the VM shut down. Any idea why they aren't?

2 Answers

1
votes

Consul only removes a node after it has gone three days (72 hours) without receiving an acknowledgement from it.

You can use curl against a Consul server's HTTP API to deregister a check or a service.

First, get the service name and the checks for that service:

http://consulserver:8500/v1/health/checks/<service-name>

It will return something like this:

[{"Node":"b7ea2063deb5","CheckID":"service:myapp","Name":"Service 'myapp' check","Status":"passing","Notes":"runs SELECT 1","Output":" online \n--------\n 1\n(1 row)\n\n","ServiceID":"myapp","ServiceName":"myapp","CreateIndex":11488,"ModifyIndex":11491}]
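To pull the "CheckID" values out of such a response programmatically, a minimal Python sketch could look like the following. It only parses the sample response shown above; the function name `check_ids` is made up for illustration, and fetching the JSON from the API is left out.

```python
import json

# Sample response copied from the answer above. The raw string keeps the
# \n sequences intact so that json.loads sees them as JSON escapes.
SAMPLE = r"""[{"Node":"b7ea2063deb5","CheckID":"service:myapp","Name":"Service 'myapp' check","Status":"passing","Notes":"runs SELECT 1","Output":" online \n--------\n 1\n(1 row)\n\n","ServiceID":"myapp","ServiceName":"myapp","CreateIndex":11488,"ModifyIndex":11491}]"""


def check_ids(payload):
    """Return the CheckID of every check in a /v1/health/checks/<service-name> response."""
    return [check["CheckID"] for check in json.loads(payload)]


print(check_ids(SAMPLE))  # ['service:myapp']
```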

Then mark that health check as failed using its "CheckID":

/v1/agent/check/fail/

This endpoint is used with a check that is of the TTL type. When this endpoint is accessed via a GET, the status of the check is set to critical, and the TTL clock is reset.

http://consulserver:8500/v1/agent/check/fail/service:myapp

If the response is "CheckID does not have associated TTL", then your check is not of the TTL type.
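As a rough sketch, the call to the /v1/agent/check/fail/ endpoint quoted above could be built in Python like this. The helper name `fail_check_request` is hypothetical, and the host is the placeholder used in this answer. The HTTP method is a parameter because the documentation quoted here mentions GET, while, as far as I know, newer Consul releases expect PUT for this endpoint. The request is only constructed, not sent.

```python
from urllib.parse import quote
from urllib.request import Request


def fail_check_request(agent_url, check_id, method="PUT"):
    """Build (but do not send) a request that marks a TTL check as critical."""
    # Keep ":" unescaped so IDs such as "service:myapp" stay readable in the URL.
    path = "/v1/agent/check/fail/" + quote(check_id, safe=":")
    return Request(agent_url.rstrip("/") + path, method=method)


req = fail_check_request("http://consulserver:8500", "service:myapp")
print(req.get_method(), req.full_url)
# PUT http://consulserver:8500/v1/agent/check/fail/service:myapp
```

Passing `req` to `urllib.request.urlopen` would actually send it.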

More information about the different check types can be found here:

https://www.consul.io/docs/agent/checks.html

It is very difficult to give you exact commands to run without the actual responses you receive when querying the HTTP API.

You could also try to deregister the entire service, if it is still there, by calling:

/v1/agent/service/deregister/

The deregister endpoint is used to remove a service from the local agent. The ServiceID must be passed after the slash. The agent will take care of deregistering the service with the Catalog. If there is an associated check, that is also deregistered.

The return code is 200 on success.

https://www.consul.io/docs/agent/http/agent.html#agent_service_deregister
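A hedged Python sketch of the deregister call follows. The function names are made up for illustration, and PUT is an assumption about what current Consul versions expect. Since the endpoint acts on the local agent, the request has to go to the agent that registered the service.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen


def deregister_request(agent_url, service_id, method="PUT"):
    """Build the request for /v1/agent/service/deregister/<ServiceID>."""
    path = "/v1/agent/service/deregister/" + quote(service_id, safe=":")
    return Request(agent_url.rstrip("/") + path, method=method)


def deregister(agent_url, service_id):
    """Send the request; True means the agent answered with 200.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    with urlopen(deregister_request(agent_url, service_id), timeout=5) as resp:
        return resp.status == 200


# Only the request is built here; call deregister(...) against a live agent to send it.
dereg = deregister_request("http://localhost:8500", "myapp")
print(dereg.get_method(), dereg.full_url)
# PUT http://localhost:8500/v1/agent/service/deregister/myapp
```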

1
votes

Okay, got this. It seems to be Consul's logic: as soon as the Serf check fails, the last known state of the service check is retained. Once I use the correct health URL (http://localhost:8500/v1/health/service/my-cool-service-name?passing), Consul returns only the two remaining services as expected, regardless of the "passing" state shown when looking directly at the service.
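To illustrate the effect, here is a small Python sketch that mimics the `?passing` filter on the client side: an entry is kept only if all of its checks, including the node-level Serf check, are passing. The entries are hypothetical, trimmed-down stand-ins for a /v1/health/service response; the node names and check IDs are assumptions, not output from my cluster.

```python
def passing_only(entries):
    """Keep only entries whose checks are all in the "passing" state."""
    return [
        entry for entry in entries
        if all(check["Status"] == "passing" for check in entry["Checks"])
    ]


# Hypothetical sample data: node4 is shut down, so its Serf check is critical
# while its service check still shows the last known state.
entries = [
    {"Node": {"Node": "node2"}, "Checks": [
        {"CheckID": "serfHealth", "Status": "passing"},
        {"CheckID": "service:my-cool-service-name", "Status": "passing"},
    ]},
    {"Node": {"Node": "node4"}, "Checks": [
        {"CheckID": "serfHealth", "Status": "critical"},
        {"CheckID": "service:my-cool-service-name", "Status": "passing"},
    ]},
]

print([entry["Node"]["Node"] for entry in passing_only(entries)])  # ['node2']
```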