I have a small cluster (1 node). It worked perfectly for a while, but one day I started getting errors that Rancher couldn’t connect to the API server.
Rebooting the node seems to have kept it alive for a few more hours, then it failed again. Killing the agent container and letting it restart also reconnects the node. If I immediately go to see the workload in the Rancher 2 UI, it times out and shows an error. If I wait for a while before pulling up the UI, and if I use kubectl, I can manage the node. But after a little while, it goes back to the error state.
The Rancher server is in the US, the node in South Africa. Ping time is about 220 ms. From the US, I can pull up the web applications that are running on the node. It’s not very fast, but it works.
Also, from the Rancher container, I can curl the API calls it claims are timing out, e.g. https://x.x.x.x:6443/version
It only takes a second to pull up that URL.
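A single curl only shows that the endpoint answers once. To check whether it stays reachable over time, the same request could be repeated on a timer from the Rancher side (any machine there with Python 3). This is only a rough sketch: the `probe` and `summarize` names, the 30-second interval, and the 10-second timeout are my own choices, and certificate verification is switched off (like `curl -k`) on the assumption that the API server uses a self-signed certificate.

```python
import http.client
import ssl
import time
import urllib.error
import urllib.request


def probe(url, timeout=10.0):
    """Send one GET request and return (reachable, elapsed_seconds)."""
    # Assumption: self-signed certificate, so verification is disabled.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
            resp.read()
        reachable = True
    except urllib.error.HTTPError:
        # An HTTP error status (e.g. 401) still means the server answered.
        reachable = True
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # Timeouts, resets, and TLS failures all count as unreachable.
        reachable = False
    return reachable, time.monotonic() - start


def summarize(samples):
    """Condense a list of (reachable, elapsed_seconds) tuples."""
    failures = sum(1 for reachable, _ in samples if not reachable)
    slowest = max((elapsed for _, elapsed in samples), default=0.0)
    return {"total": len(samples), "failures": failures, "slowest": slowest}


if __name__ == "__main__":
    url = "https://x.x.x.x:6443/version"  # replace with the node's address
    samples = []
    for _ in range(120):  # roughly one hour at one probe every 30 seconds
        result = probe(url)
        samples.append(result)
        print(time.strftime("%H:%M:%S"), result)
        time.sleep(30)
    print(summarize(samples))
```

If the failures in the output line up with the times Rancher reports the cluster as unavailable, that would point at the network path; if the probe stays healthy while Rancher still disconnects, the agent's own connection would be the next thing to look at.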
So I really don’t see why it disconnects and has such a hard time keeping the connection working.
That same Rancher server also manages a different cluster with about 20 nodes, all on the same LAN, and that one works fine.
Any pointers on what I should look into?