After Upgrade from 2.3.5 to 2.4.2: Failed to communicate with API Server

Hi everyone.

I have a setup with one master and one worker node (EC2 instances running RancherOS); Rancher itself runs on its own node, started with docker run.

I upgraded to 2.4.2 recently. Since then, my master node randomly goes AWOL with this error message:
rancher Cluster health check failed: Failed to communicate with API server: https://xxx.xxx.xxx.xxx:6443/api/v1/namespaces/kube-system?timeout=30s: dial tcp i/o timeout

RancherOS is at version 1.5.3.
Kubernetes version is 1.17.4-rancher1-3.

After rebooting the EC2 instance, everything works fine again for a while, but after some random amount of time it becomes unreachable again.
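In case it helps with diagnosis, a minimal reachability probe like the sketch below could show whether port 6443 itself stops accepting connections when the node goes AWOL. The IP is a placeholder for the master node, and it uses the same 30s timeout as the failing request; this only tests the TCP handshake, not TLS or auth.

```python
# Minimal TCP reachability probe for the kube-apiserver (sketch, not Rancher's own check).
import socket
import time

API_HOST = "xxx.xxx.xxx.xxx"  # placeholder: master node IP
API_PORT = 6443
TIMEOUT_S = 30                # same timeout as the failing ?timeout=30s request

while True:
    start = time.time()
    try:
        # create_connection only tests the TCP handshake, nothing above it
        with socket.create_connection((API_HOST, API_PORT), timeout=TIMEOUT_S):
            print(f"ok    {time.time() - start:.2f}s")
    except OSError as exc:
        print(f"FAIL  {time.time() - start:.2f}s  {exc}")
    time.sleep(10)
```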

I hope this is enough information. Thanks in advance.

We also see this issue intermittently on various clusters. Did you ever manage to find out what caused it?

Small update on what we did:
Rancher: 2.5.9
Kubernetes: v1.20.9

What we checked/adjusted:
MTU sizes: we adjusted to AWS jumbo frames (a quick MTU check sketch follows after this list)
Checked ARP tables
Upgraded Traefik on the Rancher control plane from v1 to v2
Disabled the Docker daemons on the Rancher control plane nodes
Upgraded the Rancher control plane k3s nodes to Kubernetes v1.21.4+k3s1
Disabled high availability on the Rancher control plane and now run a single Rancher instance
Scaled the Rancher control plane cluster down to 2 nodes
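For the MTU point above, a quick way to see what each interface is actually set to is to read sysfs on the node. A rough sketch (Linux only; interface names and values will vary per node):

```python
# Rough sketch: print the MTU of every network interface on a Linux node.
# AWS jumbo frames allow up to 9001; overlay/VXLAN interfaces need headroom below that.
from pathlib import Path

for iface in sorted(Path("/sys/class/net").iterdir()):
    mtu = (iface / "mtu").read_text().strip()
    print(f"{iface.name:15s} mtu={mtu}")
```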

We see the errors in the Traefik logs on the Rancher CP nodes.

The issues are intermittent. We have 4 clusters, and it affects them one by one at random, never at the same time. Usually restarting the Rancher pods resolves the issue immediately.
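For anyone hitting the same thing: assuming Rancher is installed as a Deployment named rancher in the cattle-system namespace (the default for a Helm install), the restart can be scripted as a rolling restart. A sketch with the kubernetes Python client, equivalent to kubectl rollout restart:

```python
# Sketch: rolling restart of the Rancher deployment by bumping the pod-template
# annotation that `kubectl rollout restart` also sets.
# Assumes the default Helm install: Deployment "rancher" in namespace "cattle-system".
from datetime import datetime, timezone
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside the cluster
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    "kubectl.kubernetes.io/restartedAt": datetime.now(timezone.utc).isoformat()
                }
            }
        }
    }
}
apps.patch_namespaced_deployment(name="rancher", namespace="cattle-system", body=patch)
```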

Quick update:
We removed Traefik from the Rancher control plane and replaced it with Nginx. We still see sockets being closed and I/O errors, but recovery is much faster and barely noticeable. We went from 10-20+ complaints per day from our developers to 1 per week.