Bare Metal Cluster crashes every morning

I am running several Rancher clusters on the latest stable CoreOS (currently 1353.7.0) on bare metal, with Rancher v1.5.6 on 2 to 4 nodes. All systems run stably during the whole day, but in the morning all services from all stacks (user and infrastructure) are no longer healthy and restart continuously. Only rebooting all nodes in parallel resolves the situation.

What I observed is that on the master node (running the Rancher server) the IP address changes overnight from the physical one (172.30.14.52) to a virtual IP address (172.17.0.1), and from that point on the healthcheck and other services seem to lose their connection and restart continuously. All other agent nodes keep their physical address, as shown on the Infrastructure → Hosts page.

Our first Rancher system, still running v1.4.1 with 3 nodes, has been stable for weeks!

We run Rancher on Bare Metal on CentOS 7.

In the past, our Rancher Agents have occasionally inherited an IP address from the 172.17.x.x range. To fix it, I used the CATTLE_AGENT_IP environment variable, per this FAQ:

https://docs.rancher.com/rancher/v1.5/en/faqs/agents/#how-does-the-host-determine-ip-address-and-how-can-i-change-it-what-do-i-do-if-the-ip-of-my-host-has-changed-due-to-reboot

However, the last time I used CATTLE_AGENT_IP was about 4-5 months ago. It’s been working fine since.

The IP address in use may be related to the default route and the order in which the interfaces are starting.
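If it helps with debugging, here is a rough Python sketch for checking which source address the kernel currently picks for the default route, and whether that address sits in the Docker bridge range. This is only a diagnostic aid, not how Rancher itself detects the host IP. The helper names (`is_docker_bridge_ip`, `default_route_ip`, `agent_ip_flag`) are made up for illustration, and 172.17.0.0/16 is assumed to be the docker0 subnet, which is Docker's default but can be changed.

```python
import ipaddress
import socket

# Assumption: docker0 uses Docker's default bridge subnet.
DOCKER_BRIDGE_NET = ipaddress.ip_network("172.17.0.0/16")


def is_docker_bridge_ip(ip: str) -> bool:
    """Return True if ip falls inside the assumed docker0 subnet."""
    return ipaddress.ip_address(ip) in DOCKER_BRIDGE_NET


def default_route_ip(probe_host: str = "192.0.2.1") -> str:
    """Return the local source address the kernel picks for outbound traffic.

    Connecting a UDP socket sends no packets; it only makes the kernel
    select a route and therefore a source address. The probe address is
    from a documentation range and does not need to be reachable.
    Raises OSError if the host has no usable route.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect((probe_host, 80))
        return s.getsockname()[0]


def agent_ip_flag(ip: str) -> str:
    """Format the CATTLE_AGENT_IP flag for the agent's docker run command."""
    if is_docker_bridge_ip(ip):
        raise ValueError(f"{ip} is a Docker bridge address, not a host address")
    return f"-e CATTLE_AGENT_IP={ip}"
```

Running `default_route_ip()` on the affected node once in the evening and again in the morning should show whether the source address has moved. With the addresses from the first post, `is_docker_bridge_ip("172.17.0.1")` is true and `is_docker_bridge_ip("172.30.14.52")` is false, so `agent_ip_flag("172.30.14.52")` gives the flag to add to the agent's `docker run` command as described in the FAQ.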

Great! Thanks a lot for that hint! For the first time all our systems were up and running in the morning!