Health checks not initializing properly?

Howdy! I took a dive into container orchestration today, and Rancher seemed like a nice product. However, I'm running into odd behavior where only the first host in my environment has a working health check container. Any ideas why that might be?

The health check containers show as “initializing” for a while, until they eventually restart and start all over again. I’ve tried deleting the environment and re-creating it, but that didn’t seem to do anything.

The hosts are all running CentOS 7, with Docker installed via the script I grabbed from Rancher: curl https://releases.rancher.com/install-docker/1.12.sh | sh

And here's some log output from one of the containers in question. It looks like it's failing to connect to something? I've set up nginx to terminate HTTPS in front of Rancher, in case that's related.

time="2017-02-13T03:42:55Z" level=info msg="healthCheck -- no changes in haproxy config\n"
time="2017-02-13T03:43:09Z" level=info msg="Scheduling apply config"
time="2017-02-13T03:43:10Z" level=info msg="healthCheck -- no changes in haproxy config\n"
time="2017-02-13T03:43:12Z" level=info msg="Scheduling apply config"
time="2017-02-13T03:43:12Z" level=info msg="healthCheck -- reloading haproxy config with the new config changes\n[WARNING] 043/034312 (1834) : config : 'option forwardfor' ignored for proxy 'web' as it requires HTTP mode.\n[WARNING] 043/034312 (1834) : config : 'option forwardfor' ignored for backend 'cattle-742c425e-e93b-459c-92e9-61e3c293460d_405118c3-7efb-4e69-b38e-ffaa7a756fbb_1' as it requires HTTP mode.\n"
time="2017-02-13T03:43:13Z" level=info msg="Monitoring 1 backends"
time="2017-02-13T03:43:13Z" level=info msg="Scheduling apply config"
time="2017-02-13T03:43:13Z" level=info msg="healthCheck -- no changes in haproxy config\n"
time="2017-02-13T03:43:14Z" level=info msg="Scheduling apply config"
time="2017-02-13T03:43:14Z" level=info msg="healthCheck -- no changes in haproxy config\n"
time="2017-02-13T03:43:15Z" level=info msg="742c425e-e93b-459c-92e9-61e3c293460d_405118c3-7efb-4e69-b38e-ffaa7a756fbb_1=DOWN"
time="2017-02-13T03:43:43Z" level=info msg="Scheduling apply config"
time="2017-02-13T03:43:43Z" level=info msg="healthCheck -- no changes in haproxy config\n"
time="2017-02-13T03:43:44Z" level=info msg="Scheduling apply config"
time="2017-02-13T03:43:44Z" level=info msg="healthCheck -- reloading haproxy config with the new config changes\n[WARNING] 043/034344 (1850) : config : 'option forwardfor' ignored for proxy 'web' as it requires HTTP mode.\n[WARNING] 043/034344 (1850) : config : 'option forwardfor' ignored for backend 'cattle-742c425e-e93b-459c-92e9-61e3c293460d_93ac5189-3dc6-4291-a493-d634b2ad1803_1' as it requires HTTP mode.\n[WARNING] 043/034344 (1850) : config : 'option forwardfor' ignored for backend 'cattle-742c425e-e93b-459c-92e9-61e3c293460d_405118c3-7efb-4e69-b38e-ffaa7a756fbb_1' as it requires HTTP mode.\n"
time="2017-02-13T03:43:45Z" level=info msg="Scheduling apply config"
time="2017-02-13T03:43:45Z" level=info msg="healthCheck -- no changes in haproxy config\n"
time="2017-02-13T03:43:45Z" level=info msg="Monitoring 2 backends"
time="2017-02-13T03:43:47Z" level=info msg="742c425e-e93b-459c-92e9-61e3c293460d_93ac5189-3dc6-4291-a493-d634b2ad1803_1=DOWN"

Running into exactly the same thing. The health check from the first node I added works - all the rest don't.


This basically means cross-host networking isn't working. The first host can only check itself, so it works. Once there's more than one, they try to check each other and fail. In @Crackerjam's case, the problem is that all the hosts are registered with the same IP.


Hmm, I was wondering why they were all showing their gateway there. How can I fix that, and prevent it from happening in the future?

The automatic IP detection is essentially “what IP did the registration request from the agent to the server come from”. Presumably you have them all behind a NAT so they’re all the same.

Each host needs a unique, mutually reachable IP. Hosts connect directly to one another to provide the overlay mesh network.

You can change the registered IP by reregistering with CATTLE_AGENT_IP set.
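
For example, something like this (a sketch - copy the actual registration command from the Add Host screen in the UI and just add the -e flag; the server URL, token, agent version, and IP below are placeholders):

# re-register the host with an explicitly chosen IP
sudo docker run -d --privileged \
  -e CATTLE_AGENT_IP=192.168.1.101 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  rancher/agent:v1.2.0 \
  https://rancher.example.com/v1/scripts/<registration-token>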

I'm not sure why it would be detecting the IP that way. These boxes are all in VMware Workstation, bridged to my local network, and on the same subnet as the Rancher installation. Additionally, after re-deploying the agents with the correct IPs, I'm still seeing the same behavior.

I can confirm that my hosts don't share the same IP (they're hosted in different DCs/providers). I also just noticed that none of my apps' health checks are working properly: I have checkInterval set to 10s, but health checks appear to run every 1-2s on all of my services in custom stacks. Strangely enough, the Packet nodes I added last night all have initialized health checks - it's only the gcloud instances where the health check is stuck initializing, as the OP described.

Is there any other info I could give to help debug this?

I tested this a little more and confirmed that my containers can't talk to each other across hosts, which doesn't make much sense, since they're on the same VMware host and the same subnet. Any ideas on how to troubleshoot this further? Or is there something obvious I'm overlooking?
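
For reference, here's roughly how I tested it (the container name and the 10.42.x.x address from Rancher's managed network are just examples from my setup):

# on host B, find a container's managed-network (10.42.x.x) IP in the UI,
# then try to reach it from a container running on host A:
docker exec -it web_web_1 ping -c 3 10.42.182.17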

Having the same problem. Where are the log files? What needs to be running in the infrastructure stack?

More: turning on the certified health check stack fails as well.

Fixed: it was the Google Cloud networking firewall. I had to open UDP ports 500 and 4500 for the IPsec overlay to work between GCE VMs.
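
For anyone else on GCE, a rule along these lines did it (the rule name, network, and source range are examples - scope them to your own project):

# allow IPsec (IKE and NAT-T) traffic between the VMs
gcloud compute firewall-rules create rancher-ipsec \
  --network default \
  --source-ranges 10.128.0.0/9 \
  --allow udp:500,udp:4500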


I still haven't gotten mine to work. Firewalld is disabled on all of my Docker hosts and on the Rancher host, and they're otherwise on a flat network, so nothing should be blocked anywhere.

I have one more update. I tried deploying Rancher to a new set of VMs running Ubuntu instead of CentOS, and they worked. So it seems running CentOS is my problem, not my virtualization platform or networking. However, this isn't really workable, as my enterprise uses CentOS/RHEL exclusively. Could I be missing something here? Aside from disabling SELinux and firewalld, I'm not sure what else I can do to open up cross-host networking.
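
For reference, this is roughly what I've already done on each CentOS host (a sketch; run with sudo):

# stop and disable the host firewall
sudo systemctl stop firewalld
sudo systemctl disable firewalld
# set SELinux to permissive for this boot...
sudo setenforce 0
# ...and persist the change across reboots
sudo sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config
# watch for IPsec traffic between hosts while a health check initializes
sudo tcpdump -ni any udp port 500 or udp port 4500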

@vincent, do you have any ideas?

+1 for also having this issue.

I had misconfigured VNet peering in Azure between the different cattle hosts for the environment. My new host could talk to the Rancher server's VNet, but not to the cattle hosts in the same environment, because I was missing a peering between the VNet for my new host and the VNet of my old hosts.

@Crackerjam, off-topic: how did you manage to create multiple hosts?

Funny you say that - I'm still running into these issues, and I'm also exclusively using CentOS 7 on all of my hosts. I haven't tried Ubuntu, as my org uses C7.

Originally I would just clone an existing template and fix its hostname and IP, but I also tried multiple fresh installs and saw the same behavior.


I actually have another update. I was able to resolve the problem by using the stable branch instead of the latest. So something in the latest branch doesn't work properly with EL7.
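
For anyone else on EL7, this is the standard single-container server launch I used, just pinned to the stable tag:

# run the Rancher server from the stable branch instead of latest
sudo docker run -d --restart=unless-stopped -p 8080:8080 rancher/server:stable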


I have the same problem: ever since the IPsec update that fixed the metadata memory leak, the IP address detected when bringing up a new host is the wrong one (the external address instead of the internal one, in my case).

@Crackerjam @thinkdevcode if you have the setup in a broken state, please find me on https://slack.rancher.io - I'd like to collect some logs/info from the setup.

Also, if you're using CentOS/RHEL, please share the steps you used to set up the host (Docker installation, storage, etc.).