Healthcheck interval not being followed

I have noticed that the healthcheck interval is not being followed as soon as a service is deployed on more than one host. For example, I have a service that is deployed on two hosts and has essentially zero HTTP traffic, but the HTTP rpm is showing 4.34K. Looking at the logs, the health check requests are coming in so fast the UI can’t even keep up. The rancher-compose has the interval set to 10000, so I am not sure why it is so frequent.

While this service has the most noticeable issue, many of my other services that use the same healthcheck config are also checked far more frequently than specified. It appears that most of them are checked about every 2 seconds even though the interval is set to 10 seconds. I am using Rancher 1.1.0 and the rancher-compose looks like:

passenger:
  health_check:
    port: 80
    interval: 10000
    recreate_on_quorum_strategy_config:
      quorum: 1
    initializing_timeout: 15000
    unhealthy_threshold: 3
    strategy: recreateOnQuorum
    response_timeout: 4000
    request_line: 'GET "/health" "HTTP/1.1\r\nHost: myservice.mydomain.com"'
    healthy_threshold: 2

Thanks!

Health checks are run on 3 separate hosts (assuming you have at least 3 hosts…) at the specified interval. So 10 sec / 3 checks would result in a request to each container every ~3 seconds. Rancher and/or external load balancers (like ELB) will do their own checks and add additional requests.

Though 4,340/min is obviously high… Is it -> Rancher LB -> Containers? What scale and check interval for each? Are there multiple balancers pointed at the same target containers?
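
If you want to confirm the actual rate a single container is seeing, something like this run on the host should do it (just a sketch; it assumes your app logs each request to the /health path, so adjust the grep pattern to your log format):

    # Count /health hits against one container over the last minute
    docker logs --since 1m <container_id> 2>&1 | grep -c '/health'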

Hey @vincent

There are more than three hosts in the environment, but only two of them run that service. My current setup is a single service per host, with a Rancher LB (I am not using ELB) on each host. So the scale for the service with the runaway healthcheck is two RancherOS hosts, each running one instance of the service plus the Rancher LB.

Thanks!

If you have 4 or more hosts then the healthcheck will be run on 3 hosts which are different from the one the target container is running on. So if there are 2 target containers there should be 3 checks to each every 10 seconds.

@alena any ideas on what to look for that would explain ~72 health check requests per second?
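
For reference, the rough arithmetic behind that number (a sketch, assuming the 4.34K rpm reported by the UI is accurate):

    # Expected: 2 target containers x 3 checker hosts, one check each per 10 s
    echo "expected requests per minute: $(( 2 * 3 * 60 / 10 ))"   # 36
    # Observed: 4340 requests per minute reported by the UI
    echo "observed requests per second: $(( 4340 / 60 ))"         # ~72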

@vincent a small offset is expected, given that even with the same interval the 3 health-checking hosts don’t perform their checks in sync. But that definitely doesn’t explain 72 requests per second.

@JD_Trout

A couple of questions:

  1. what version of Rancher are you using? We had a bug a while back where the healthcheck config had duplicated records for the same IP address, and as a result the health check requests were duplicated too.
  2. is your environment relatively stable, or do you have containers constantly stopping and starting?

Could you also share the following information:

from the network agent container on each of your host(s), copy the file /etc/healthcheck/healthcheck.cfg; I’d like to look at them to rule out (or confirm) that this is caused by 1)
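
Something along these lines should pull it out (a sketch; it assumes the network agent runs from an image whose name contains "agent-instance", so adjust the filter to whatever docker ps shows on your hosts):

    # Find the network agent container and dump its healthcheck config
    AGENT_ID=$(docker ps --format '{{.ID}} {{.Image}}' | grep agent-instance | awk '{print $1}')
    docker exec "$AGENT_ID" cat /etc/healthcheck/healthcheck.cfg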

I experienced an issue this week that may be related. On Saturday one of our hosts locked up, showing ridiculous load in Prometheus. I didn’t have time to debug, so I just killed the host, but it started happening again today, and this time I noticed it was correlated with unusually high log volume. Looking at the logs, it seems the Rancher health checks for one service had started to spiral out of control. I captured a histogram of the healthcheck count by container: https://cloudup.com/cdxnar27kcn.

This is a service running at scale 3, and the spike in healthchecks only seemed to affect one container at a time, but it didn’t seem to matter which host the container was on. In fact, it looks like once an overloaded container died, the container that replaced it was immediately bombarded again.

We’re using rancher 1.2.0-pre2. I don’t have access to the healthcheck configuration for the original hosts at the time this happened, but if it would still be helpful I can collect and share the current cfg files.

Hello @alena

  1. I am using rancher 1.1.0. I have been holding off on upgrading until 1.2.0. However, I would be happy to upgrade to 1.1.4 if you think it would help.

  2. The environment is stable, containers rarely restart.

Here is the output of the healthcheck.cfg:

global
    log 127.0.0.1 local0
    log 127.0.0.1 local1 notice
    maxconn 4096
    maxpipes 1024
    chroot /var/lib/haproxy
    user haproxy
    group haproxy
    daemon
    stats socket /var/run/haproxy.sock mode 600 level admin
    stats timeout 2m

defaults
    log global
    mode        tcp
    option      tcplog
    option  dontlognull
    option  redispatch
    option forwardfor
    retries 3
    timeout connect 5000
    timeout client 50000
    timeout server 50000


backend 76dece47-b937-4d8e-8020-c2eba42a1dc5_backend
        mode http
        balance roundrobin
        timeout check 3000
        option httpchk GET "/health" "HTTP/1.1\r\nHost: service2.mydomain.com"
        server cattle-76dece47-b937-4d8e-8020-c2eba42a1dc5_1 10.42.92.16:80 check port 80 inter 10000 rise 2 fall 3

backend a2a5a796-256d-4762-bc20-4c55cee8c776_backend
        mode tcp
        balance roundrobin
        timeout check 2000

        server cattle-a2a5a796-256d-4762-bc20-4c55cee8c776_1 10.42.174.106:42 check port 42 inter 2000 rise 2 fall 3

backend 72a95657-7d4e-490f-b03b-36ed2ecb9c70_backend
        mode tcp
        balance roundrobin
        timeout check 2000

        server cattle-72a95657-7d4e-490f-b03b-36ed2ecb9c70_1 10.42.117.21:42 check port 42 inter 2000 rise 2 fall 3

backend d6d5c1b2-0121-4ace-915f-041d0d80453d_backend
        mode tcp
        balance roundrobin
        timeout check 2000

        server cattle-d6d5c1b2-0121-4ace-915f-041d0d80453d_1 10.42.142.184:42 check port 42 inter 2000 rise 2 fall 3

backend ee6ee084-7454-498b-80ab-9b57c418c275_backend
        mode http
        balance roundrobin
        timeout check 4000
        option httpchk GET "/health" "HTTP/1.1\r\nHost: service3.mydomain.com"
        server cattle-ee6ee084-7454-498b-80ab-9b57c418c275_1 10.42.223.249:80 check port 80 inter 10000 rise 2 fall 3

backend 906fc25a-9e5f-4de3-a2ee-abd26691ec56_backend
        mode http
        balance roundrobin
        timeout check 3000
        option httpchk GET "/health" "HTTP/1.1\r\nHost: service2.mydomain.com"
        server cattle-906fc25a-9e5f-4de3-a2ee-abd26691ec56_1 10.42.245.153:80 check port 80 inter 10000 rise 2 fall 3

backend 00ec4bab-2de5-4fed-8ccf-6c8bbdb09aad_backend
        mode tcp
        balance roundrobin
        timeout check 2000

        server cattle-00ec4bab-2de5-4fed-8ccf-6c8bbdb09aad_1 10.42.212.115:42 check port 42 inter 2000 rise 2 fall 3

backend d158f9dd-8975-4c06-8a8b-1f4baf563355_backend
        mode http
        balance roundrobin
        timeout check 4000
        option httpchk GET "/health" "HTTP/1.1\r\nHost: service1.mydomain.com"
        server cattle-d158f9dd-8975-4c06-8a8b-1f4baf563355_1 10.42.222.232:80 check port 80 inter 10000 rise 2 fall 3

# Need to listen on something since we have no front ends
listen web
        bind  127.0.0.1:42

If you want more of them perhaps I can email them so the thread does not become cluttered.

Another point of note: I have a service in staging (read: no external traffic) that has 4 hosts and is reporting 268 rpm. The health check config is just like the rest, set at an interval of 10 seconds, but NR is telling me that health check requests are coming in less than a second apart.
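
To put rough numbers on that (a sketch; it assumes the staging service runs at scale 4 with 3 checker hosts per container, and that bc is available):

    # Expected with a 10 s interval: 4 containers x 3 checker hosts x 6 checks/min
    echo "expected requests per minute: $(( 4 * 3 * 60 / 10 ))"   # 72
    # Observed: 268 requests per minute, i.e. roughly one check per second per container
    echo "observed per container per second: $(echo 'scale=2; 268 / 60 / 4' | bc)"   # ~1.1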

Hello @alena

Were you able to rule out or confirm the issue you spoke about from looking at my config?

Cheers,

-JD-

@JD_Trout the config looks fine; there are no duplicated records. The health check frequency might be related to something else. I suspect it can happen either because:

  1. there are duplicated haproxy processes -> can you check by executing "ps aux | grep haproxy" inside the network agent container? (see the sketch after this list)
  2. the health check config is being reloaded too frequently due to changes in the system (we reload the config on every instance stop/start).
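
A quick way to check both from inside the network agent container (a sketch; the config path is the one quoted earlier in the thread, and it assumes stat is available in that container):

    # 1) More than one haproxy process here would mean duplicated checks
    ps aux | grep '[h]aproxy' | wc -l

    # 2) Watch whether the config is being rewritten constantly; its mtime
    #    should only change when containers start or stop
    while true; do stat -c '%y' /etc/healthcheck/healthcheck.cfg; sleep 5; done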

It does sound like a bug we should look at; can you please file it on https://github.com/rancher/rancher/issues? I’ll pick it up in my queue as soon as it is filed.

@alena - I had a host that was acting up (not deploying the containers it should have), so it was removed from Rancher. Once I did that, the health check requests returned to a sane level. It looks like the issue could have been the network agent container on that host.

I will report the bug.

Thanks!

@JD_Trout thank you! If it happens again - host misbehaving - I’d like to debug this a bit more on the network agent.

I have this happen fairly frequently as well: suddenly the CPU on a host will spike and I can see it being DoS’d with healthchecks. I have not discovered the trigger yet. Rancher 1.1.3 on RancherOS 0.6.1 on AWS.

I end up “upgrading” the affected service and removing the healthcheck to stop it.

@alena I am not sure if this is related, but the logs on the rancher server are spewing lots of messages like:

Failed to Starting for instance... Server Error: Internal Server Error ("cannot join network of a non running container

AND

404 Client Error: Not Found ("No such container: 01db6728-093b-4445-8b25-101584effe6d"

AND

ERROR [:] [] [] [] [utorService-904] [i.c.p.e.e.i.ProcessEventListenerImpl] Unknown exception running process [instance.purge:44742] on [251] org.jooq.exception.DataChangedException: Database record has been changed

I don’t think it is related. It might be happening because you have a sidekick instance using the primary instance’s network, and the primary instance fails to start for some reason.

The “Database record has been changed” error is a different issue, but it should not disrupt the health checks.