Rancher connection to cluster flaps regularly

Hi All,

I’m running Rancher v2.2.2 on a standalone VM, managing two Rancher-built on-prem clusters running K8s v1.13.5. Docker on the VM (Ubuntu 18.04.3 LTS) is v18.09.7 build 2d0083d.

The webUI regularly squawks that it’s lost connection to the clusters, showing a red “Unavailable” button, and if I click on the cluster I get a header reading “This cluster is currently Unavailable; areas that interact directly with it will not be available until the API is ready.”

The VM is consistently running at about 50% CPU utilization (4 cores). I’ve had to add RAM to it a few times; it looks like there might be a leak, as memory usage kept creeping up to full until I finally gave it 32GB, and it now seems to top out at around 8GB used.
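(If anyone wants to watch the growth for themselves, something like the following should show the container’s live memory use; it assumes the container is literally named “rancher”, as mine is:)

docker stats --no-stream rancher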

Running "docker logs rancher | grep -i error" on the VM shows a lot of this:
E0827 20:57:49.614914 6 streamwatcher.go:109] Unable to decode an event from the watch stream: net/http: request canceled (Client.Timeout exceeded while reading body)
E0827 20:57:49.615253 6 streamwatcher.go:109] Unable to decode an event from the watch stream: net/http: request canceled (Client.Timeout exceeded while reading body)
2019-08-27 20:57:55.622514 I | http: TLS handshake error from 127.0.0.1:46340: EOF
E0827 20:58:07.972251 6 request.go:853] Unexpected error when reading response body: &http.httpError{err:"net/http: request canceled (Client.Timeout exceeded while reading body)", timeout:true}
E0827 20:58:07.972411 6 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.ConfigMap: Unexpected error &http.httpError{err:"net/http: request canceled (Client.Timeout exceeded while reading body)", timeout:true} when reading response body. Please retry.
2019/08/27 20:58:07 [ERROR] ClusterController c-9b748 [user-controllers-controller] failed with : failed to start user controllers for cluster c-9b748: timeout syncing controllers
2019/08/27 20:58:19 [INFO] error in remotedialer server [400]: read tcp 172.17.0.2:443->140.107.117.50:44508: i/o timeout
2019/08/27 20:58:19 [INFO] error in remotedialer server [400]: read tcp 172.17.0.2:443->140.107.117.51:50416: i/o timeout
2019/08/27 20:58:19 [INFO] error in remotedialer server [400]: read tcp 172.17.0.2:443->140.107.117.53:49654: i/o timeout
2019-08-27 20:58:20.605208 I | http: TLS handshake error from 127.0.0.1:46518: EOF
2019/08/27 20:58:19 [INFO] error in remotedialer server [400]: read tcp 172.17.0.2:443->140.107.117.52:35168: i/o timeout
2019/08/27 20:58:19 [INFO] error in remotedialer server [400]: read tcp 172.17.0.2:443->140.107.117.54:38440: i/o timeout
2019-08-27 20:58:28.998570 I | http: TLS handshake error from 127.0.0.1:46566: EOF
2019-08-27 20:58:28.998729 I | http: TLS handshake error from 127.0.0.1:46568: EOF
2019-08-27 20:58:29.001544 I | http: TLS handshake error from 127.0.0.1:46572: EOF
E0827 20:58:38.066194 6 request.go:853] Unexpected error when reading response body: &http.httpError{err:"net/http: request canceled (Client.Timeout exceeded while reading body)", timeout:true}
E0827 20:58:38.066366 6 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.ConfigMap: Unexpected error &http.httpError{err:"net/http: request canceled (Client.Timeout exceeded while reading body)", timeout:true} when reading response body. Please retry.
2019/08/27 20:58:53 [INFO] error in remotedialer server [400]: read tcp 172.17.0.2:443->140.107.117.52:35268: i/o timeout
2019-08-27 20:59:02.449195 I | http: TLS handshake error from 127.0.0.1:46740: EOF
2019/08/27 20:58:53 [INFO] error in remotedialer server [400]: read tcp 172.17.0.2:443->140.107.117.51:50490: i/o timeout
2019-08-27 20:59:02.471820 I | http: TLS handshake error from 127.0.0.1:46756: EOF
2019-08-27 20:59:02.472121 I | http: TLS handshake error from 127.0.0.1:46758: EOF
2019/08/27 20:58:53 [INFO] error in remotedialer server [400]: read tcp 172.17.0.2:443->140.107.117.53:49764: i/o timeout
E0827 20:59:08.147494 6 request.go:853] Unexpected error when reading response body: &http.httpError{err:"context deadline exceeded (Client.Timeout exceeded while reading body)", timeout:true}
2019/08/27 20:58:53 [INFO] error in remotedialer server [400]: read tcp 172.17.0.2:443->140.107.117.54:38514: i/o timeout
E0827 20:59:08.147662 6 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.ConfigMap: Unexpected error &http.httpError{err:"context deadline exceeded (Client.Timeout exceeded while reading body)", timeout:true} when reading response body. Please retry.
2019/08/27 20:58:53 [INFO] error in remotedialer server [400]: read tcp 172.17.0.2:443->140.107.117.50:44578: i/o timeout
2019/08/27 20:59:08 [ERROR] ClusterController c-9b748 [user-controllers-controller] failed with : failed to start user controllers for cluster c-9b748: timeout syncing controllers
2019-08-27 20:59:13.932063 I | mvcc: store.index: compact 25150803
2019-08-27 20:59:13.938430 I | mvcc: finished scheduled compaction at 25150803 (took 4.060551ms)
2019/08/27 20:59:26 [INFO] error in remotedialer server [400]: read tcp 172.17.0.2:443->140.107.117.50:44656: i/o timeout
2019/08/27 20:59:26 [INFO] error in remotedialer server [400]: read tcp 172.17.0.2:443->140.107.117.54:38584: i/o timeout
2019/08/27 20:59:26 [INFO] error in remotedialer server [400]: read tcp 172.17.0.2:443->140.107.117.51:50560: i/o timeout
2019/08/27 20:59:26 [INFO] error in remotedialer server [400]: read tcp 172.17.0.2:443->140.107.117.52:35358: i/o timeout
2019/08/27 20:59:26 [INFO] error in remotedialer server [400]: read tcp 172.17.0.2:443->140.107.117.53:49866: i/o timeout
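If I read these right, the remotedialer i/o timeouts look like the websocket tunnels from the downstream cluster agents dropping. The other side of that conversation should be in the cluster agent logs on the downstream cluster; a rough sketch (deployment name and namespace are the Rancher 2.x defaults, adjust if yours differ):

kubectl -n cattle-system logs deploy/cattle-cluster-agent --tail=100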

Any guidance would be appreciated!

Some additional info:

Realized the red “Unavailable” button in the webUI was toggling on and off pretty much exactly every 15 seconds, so something was timing out somewhere.
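(If you want to watch the flapping from outside the webUI, the cluster state is also visible in the Rancher API; rough sketch, assuming an exported API token in $RANCHER_TOKEN, jq installed, and using my cluster ID from the logs above:)

watch -n 5 'curl -sk -H "Authorization: Bearer $RANCHER_TOKEN" https://<your-rancher-host>/v3/clusters/c-9b748 | jq -r .state'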

Saw that pods/resources related to the built-in ingress controller and to the prometheus/grafana stack were having trouble, including a pod in the cattle-prometheus namespace that had been stuck in CrashLoopBackOff for over 10K restarts. Disabled both of those stacks from the Rancher webUI; while that did remove the nginx stack, I had to manually remove all the prometheus resources via kubectl and delete both namespaces myself (roughly the commands sketched below).
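Rough sketch of that cleanup, not an exact history (the cattle-prometheus namespace is the one from above; the resource types listed may not cover everything in your install):

kubectl -n cattle-prometheus get all
kubectl -n cattle-prometheus delete deployment,statefulset,daemonset,service,configmap,secret --all
kubectl delete namespace cattle-prometheus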

Deleting the namespaces left them stuck in “Terminating” for over an hour; I had to reboot all the controller nodes to shake them loose. Now my cluster is all green and the webUI API flapping has stopped. Wish I knew the real cause, beyond “have you tried rebooting.”
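(For the record, a less drastic option than rebooting when a namespace is stuck in Terminating is usually to clear its finalizers via the finalize subresource; rough sketch, assuming kubectl proxy on its default port 8001 and the cattle-prometheus namespace:)

kubectl get namespace cattle-prometheus -o json > ns.json
# edit ns.json so that spec.finalizers is an empty list
kubectl proxy &
curl -H "Content-Type: application/json" -X PUT --data-binary @ns.json http://127.0.0.1:8001/api/v1/namespaces/cattle-prometheus/finalize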

Will re-enable the prometheus stack and see how long it runs before it flakes again, and whether that restarts the webUI flapping.

Will update here as events unfold.

Hi @randyrue… we are having the same behavior here…

The only difference from your setup is that we don’t use the monitoring stack provided by Rancher…

Did the steps you described above finally resolve the issue for you?

Thank you for your attention.