Hi,
We are running a mid-sized Rancher setup on top of an OpenNebula VM cluster, with Ubuntu 18.04.4 LTS on the worker VMs. Rancher is 2.3.5, with K8S 1.17.2-rancher1-2. Rancher itself runs in separate containers outside the K8S cluster. Networking is via flannel (canal) without "Project Network Isolation", which assigns pod addresses in the 10.42.X.Y range.
Intermittently (every few weeks) we see the following issue: in "kubectl get pods -o wide" (or in the Pod overview in Rancher) some pods have IP addresses from the Docker bridge, 172.17.X.Y, rather than the 10.42.X.Y addresses from canal. These 172.17.X.Y pods are then not reachable from pods running on other worker nodes. This causes all sorts of havoc, from unreachable apps, to the K8S metrics-server being unavailable (which makes Rancher unhappy), to even the K8S DNS breaking.
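In case it helps others reproduce or monitor this, a one-liner we could use to spot affected pods cluster-wide (a sketch; it assumes the IP is in column 7 of the default "-o wide" output, which holds for our kubectl version):

```shell
# List pods whose IP landed on the Docker bridge (172.17.0.0/16)
# instead of the canal pod range (10.42.0.0/16).
kubectl get pods --all-namespaces -o wide --no-headers \
  | awk '$7 ~ /^172\.17\./ { print $1 "/" $2 " -> " $7 }'
```

Running this periodically would at least let us catch the bad pods before DNS or the metrics-server start failing.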
We can recover simply by redeploying the affected pods, which then usually come back with a correct 10.42.X.Y address.
Any ideas on how to debug (or fix!) this are welcome. Could this be a race condition? If so, where? Is it a failure of the canal pod running on each node? Which logs should we look at?
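For reference, these are the checks we would know to run on an affected node (a sketch; the pod label k8s-app=canal, the container name kube-flannel, and the "kubelet" container name are assumptions based on a standard RKE canal install, so please correct us if the right places to look are elsewhere):

```shell
# CNI config that kubelet reads at pod-creation time. If this directory
# is empty or incomplete when kubelet comes up, new pods could plausibly
# fall back to the docker0 bridge (172.17.X.Y).
ls -l /etc/cni/net.d/

# Logs of the canal pod on the affected node, around the time the
# mis-addressed pod was started.
kubectl -n kube-system logs -l k8s-app=canal -c kube-flannel --tail=200

# On RKE, kubelet runs as a Docker container named "kubelet"; grep its
# logs for CNI / network-plugin errors.
docker logs --since 1h kubelet 2>&1 | grep -iE 'cni|networkplugin'
```

Is this the right set of places to look, or is there a better signal for "pod was wired to docker0 instead of CNI"?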
Yours,
Steffen